What transcription models are available for Voice Notes?

On Device AI offers several local transcription engines including Whisper, Apple STT, Parakeet, Nemotron, and Qwen3-ASR. Batch models like Parakeet TDT v3 (~700 MB) and Qwen3-ASR handle full files with high accuracy, while streaming models like Parakeet EOU 120M (free, ~150 MB) transcribe in real time.

Can I use a custom vocabulary for transcription?

Yes, you can use the custom vocabulary feature with the Parakeet TDT-CTC 110M model. This allows the app to recognize domain-specific terms like product names, acronyms, or jargon with high precision.

Voice Notes

Never miss a detail again. Instantly capture, transcribe, label speakers, and summarize meetings, lectures, or brilliant ideas with word-level precision—all securely processed on your device.

On this page

Recording Audio & Import
Transcription
Transcription models
Speaker Diarization
Re-Transcription
AI Processing
Word-Level Navigation
Organization & Renaming
Language Support

Recording Audio & Import

Navigate to the AI Voice Note tab and tap the record button to start capturing audio. The app uses either WhisperKit or Apple STT for on-device speech recognition.

Real-time transcription: Text appears as you speak
Background recording: On iOS, recording continues when the app is in the background
Import audio (PRO): Turn your past recordings into searchable, summarizable text with clear, step-by-step progress tracking so you're never left guessing.

ℹ️ Privacy

All audio processing happens entirely on your device using WhisperKit and Apple Speech frameworks. No audio data is sent to any server.

Transcription

Transcription runs locally using one of several engines: Whisper models, Apple STT, Parakeet, Nemotron, or Qwen3-ASR. Pick a model in Settings → Voice based on your language needs and device. Features include:

Word-level timestamps: Each word is timestamped for precise navigation
Multiple languages: Support for many languages including English, Chinese, Japanese, Spanish, French, and more
Automatic punctuation: The model adds punctuation and sentence structure

Transcription models

In addition to Whisper and Apple STT, several newer models are available. They run entirely on-device and are downloaded once before first use.

Batch transcription

These models process a recorded or imported audio file after recording finishes. They tend to be more accurate than streaming models because they can look at the full audio context.

Model	Languages	Runs on	Notes
Parakeet TDT v2	English	Mac (8 GB+)	Highest English accuracy in this group. ~700 MB download.
Parakeet TDT v3	25 European languages	iPad / Mac	Same architecture as v2 with multilingual support. ~700 MB download.
Parakeet TDT-CTC 110M	English	iPhone / iPad / Mac	Smaller model (~407 MB). The only Parakeet model that supports custom vocabulary.
Parakeet CTC Japanese	Japanese	iPad / Mac	~700 MB download. CER around 6.85%.
Parakeet CTC Chinese	Mandarin Chinese	iPad / Mac	Available in Int8 (~550 MB) and full-precision (~1.1 GB) variants.
Qwen3-ASR	30 languages	iPad / Mac	Covers CJK, Southeast Asian, and European languages. 30-second clip limit per segment. Available in Int8 (~800 MB) and F32 (~1.5 GB).
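
The 30-second limit noted for Qwen3-ASR means longer recordings have to be handled in pieces. The Python sketch below shows one simple way to cut a recording into windows of at most 30 seconds. It is an illustration only, not the app's actual segmentation logic, and the function name `split_into_segments` is hypothetical. A real implementation would likely cut at pauses rather than at fixed offsets.

```python
def split_into_segments(duration_s: float, max_len_s: float = 30.0) -> list[tuple[float, float]]:
    """Return (start, end) windows covering the recording, each at most max_len_s long."""
    if duration_s < 0 or max_len_s <= 0:
        raise ValueError("duration must be >= 0 and max length must be > 0")
    segments = []
    start = 0.0
    while start < duration_s:
        # The last window is shortened so it never runs past the end of the audio.
        end = min(start + max_len_s, duration_s)
        segments.append((start, end))
        start = end
    return segments
```

For example, a 75-second recording yields three windows: 0 to 30, 30 to 60, and 60 to 75 seconds.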

ℹ️ Custom vocabulary

Parakeet TDT-CTC 110M is the only model in this group that works with the custom vocabulary feature. If you need the app to recognize domain-specific terms (product names, acronyms, jargon), use this model.

Streaming recognition

These models transcribe in real time while you record, so text appears on screen as you speak.

Model	Languages	Runs on	Notes
Parakeet EOU 120M	35+ languages	iPhone / iPad / Mac	Free for all users. ~150 MB download. End-of-utterance detection for natural segmentation.
Nemotron Streaming	English	iPad / Mac	Lower error rate than Parakeet EOU for English (~2% WER). Available in 560 ms and 1120 ms chunk variants. ~600 MB download.
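
For context on the error rates quoted in these tables: word error rate (WER) is the number of word substitutions, deletions, and insertions needed to turn a transcript into the reference text, divided by the number of reference words. Character error rate (CER) is the same measure taken over characters. The Python sketch below is a generic illustration of that definition, not the evaluation code behind the figures above, and the function names are our own.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference item missing from transcript
                curr[j - 1] + 1,     # insertion: extra item in transcript
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

Under this definition, a WER of about 2% corresponds to roughly two word errors per 100 reference words.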

💡 Tip

Parakeet EOU 120M is free and covers 35+ languages. It's a good default for real-time transcription on any device, including iPhone.

All Parakeet, Nemotron, and Qwen3 models require a Pro subscription except Parakeet EOU 120M, which is free.
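
The two tables and the subscription rule can be read as a simple filter over device, mode, and plan. The Python sketch below encodes only what the tables above state. The `Model` class and `candidates` function are hypothetical names for illustration and do not correspond to any app API. Whisper and Apple STT are left out because the tables do not list them.

```python
from dataclasses import dataclass

ALL_DEVICES = frozenset({"iPhone", "iPad", "Mac"})
IPAD_MAC = frozenset({"iPad", "Mac"})


@dataclass(frozen=True)
class Model:
    name: str
    mode: str                 # "batch" or "streaming"
    languages: str            # coverage as described in the tables
    devices: frozenset        # devices listed under "Runs on"
    pro: bool = True          # every model except Parakeet EOU 120M needs Pro
    custom_vocabulary: bool = False


MODELS = [
    Model("Parakeet TDT v2", "batch", "English", frozenset({"Mac"})),  # Mac with 8 GB+
    Model("Parakeet TDT v3", "batch", "25 European languages", IPAD_MAC),
    Model("Parakeet TDT-CTC 110M", "batch", "English", ALL_DEVICES, custom_vocabulary=True),
    Model("Parakeet CTC Japanese", "batch", "Japanese", IPAD_MAC),
    Model("Parakeet CTC Chinese", "batch", "Mandarin Chinese", IPAD_MAC),
    Model("Qwen3-ASR", "batch", "30 languages", IPAD_MAC),
    Model("Parakeet EOU 120M", "streaming", "35+ languages", ALL_DEVICES, pro=False),
    Model("Nemotron Streaming", "streaming", "English", IPAD_MAC),
]


def candidates(device, mode, has_pro, need_custom_vocabulary=False):
    """Names of models that match the device, mode, plan, and vocabulary requirement."""
    return [
        m.name
        for m in MODELS
        if device in m.devices
        and m.mode == mode
        and (has_pro or not m.pro)
        and (m.custom_vocabulary or not need_custom_vocabulary)
    ]
```

For instance, `candidates("iPhone", "streaming", has_pro=False)` returns only Parakeet EOU 120M, which matches the tip above.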

Speaker Diarization PRO

Automatically label who said what in your voice notes and imported recordings.

Model download: Download speaker models once directly from the Voice Note UI.
Label Speakers toggle: Turn on diarization to instantly split transcripts by speaker.
Precision matching: Advanced tuning controls help improve speaker-label accuracy, including when people talk over each other.
Display Speakers toggle: Switch cleanly between speaker labels and timestamp-only views.
Persisted labels: Speaker labels are securely saved and instantly restored when you reopen a recording.
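
One common way to combine diarization output with a word-timestamped transcript is to give each word the speaker whose segment overlaps it most. The Python sketch below illustrates that idea under assumed data shapes (word and speaker tuples with start and end times in seconds). It is not a description of how the app implements speaker labeling, and the function names are hypothetical.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the overlap between two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_words(words, speaker_segments):
    """Assign each word the speaker whose segment overlaps it most.

    words: list of (text, start, end)
    speaker_segments: list of (speaker, start, end)
    Returns a list of (speaker, text); speaker is None when no segment overlaps the word.
    """
    labeled = []
    for text, w_start, w_end in words:
        best_speaker, best_overlap = None, 0.0
        for speaker, s_start, s_end in speaker_segments:
            ov = overlap(w_start, w_end, s_start, s_end)
            if ov > best_overlap:
                best_speaker, best_overlap = speaker, ov
        labeled.append((best_speaker, text))
    return labeled
```

Grouping consecutive words that share a label then gives a transcript split by speaker.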

Re-Transcription

Use Re-Transcript to regenerate transcript text from existing audio with Whisper or Apple STT.

Model switching: Re-run with a different transcription model for better results
Optional speaker labeling: Apply diarization automatically after re-transcription
Safe overwrite: New transcript and speaker labels replace older results for the same audio

AI Processing

After transcription, you can process the text with AI for:

Summarization: Get concise summaries of meetings or lectures
Translation: Translate the transcription to another language
Key points: Extract action items and key takeaways
Speaker-aware analysis: Your AI assistant knows exactly who said what. Ask it to "Summarize Sarah's points" or "List action items for John" for deeply personalized insights.
Custom processing: Use any prompt to analyze your transcript however you need.

Word-Level Navigation

Tap any word in the transcription to jump to that exact moment in the recording. This makes it easy to:

Verify specific quotes or statements
Re-listen to important sections
Navigate long recordings efficiently
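
Word-level navigation relies on the per-word timestamps described under Transcription. As a rough illustration in Python, with an assumed `(text, start, end)` word format and hypothetical function names, tapping a word maps to its start time, and the reverse lookup (which word is playing at a given time) can use binary search over the start times. This is a sketch of the general technique, not the app's code.

```python
from bisect import bisect_right


def seek_time_for_word(words, index):
    """Playback position (seconds) to jump to when the word at `index` is tapped."""
    return words[index][1]


def word_index_at(words, t):
    """Index of the word to highlight at playback time t.

    Returns the last word that started at or before t, so during a pause between
    words the previous word stays selected. Returns None before the first word.
    """
    starts = [start for _, start, _ in words]
    i = bisect_right(starts, t) - 1
    return i if i >= 0 else None
```

Because the words are already sorted by start time, each lookup takes logarithmic time even for long recordings.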

Organization & Renaming

Keep your brilliant ideas and crucial meetings perfectly organized.

Custom naming: Easily rename any recording so you can identify important lectures, interviews, and brainstorms at a single glance.

Language Support

Whisper and Apple STT support transcription in many languages. The newer Parakeet and Qwen3 models expand coverage to include Japanese, Mandarin Chinese, Korean, Thai, Vietnamese, and dozens of other languages not previously available. You can let the model detect spoken language automatically or set it manually for better accuracy.

💡 Tip

For best transcription quality, use a quiet environment and speak clearly. External microphones also improve accuracy significantly.
