Releases

ElevenLabs Launches Scribe v2: The Most Accurate AI Transcription Model Yet

ElevenLabs launched Scribe v2 on January 9, 2026 — the most accurate transcription model on industry benchmarks, with 90+ language support, speaker diarization for up to 48 speakers, keyterm prompting, and a real-time variant at 150ms latency.

January 9, 2026 · 5 min read

ElevenLabs launched Scribe v2 in two versions, each solving a different problem. Scribe v2 Realtime (released January 6) is built for live applications: voice agents, real-time captioning, and meeting transcription at under 150ms latency. Scribe v2 Batch (released January 9) is built for long-form audio: podcast episodes, interview recordings, video subtitles, and any content where you're processing a complete file rather than a live stream.

Both rank first on accuracy benchmarks. On the FLEURS multilingual benchmark across 30 languages, Scribe v2 Realtime achieved 93.5% accuracy with the lowest Word Error Rate of any low-latency ASR model, outperforming Google Gemini Flash, OpenAI GPT-4o Mini, and Deepgram Nova 3. Scribe v2 Batch achieved the lowest WER recorded on industry-standard benchmarks for long-form transcription.


Why This Release Matters

ElevenLabs started as a text-to-speech company. Scribe v2 completes the other half of the audio loop. You can now generate speech with ElevenLabs TTS, publish or record it, and transcribe it back with Scribe — all within the same platform, under the same subscription.

For creators using ElevenLabs TTS on Cliprise to narrate AI videos, Scribe v2 closes the workflow: generate narration, record it into the video, then use Scribe to extract a timestamped transcript for subtitles, repurposed written content, or editorial review. Previously, this step required a separate transcription service.

The competitive framing is direct. ElevenLabs described Scribe v2 as entering the market against OpenAI's Whisper, Google's speech recognition, and enterprise services like Rev and Otter.ai, and winning each of those comparisons on accuracy benchmarks.


Scribe v2 Batch: For Long-Form Content

Scribe v2 Batch is optimized for complete audio files — long podcast episodes, full interview recordings, conference sessions, meeting recordings, video files up to feature length.

Key capabilities:

Speaker diarization for up to 48 distinct speakers. The transcript labels which speaker said what, with timestamps for each segment. For multi-participant recordings, this is the difference between a usable transcript and a wall of unattributed text.

Keyterm prompting for up to 100 specific terms. Supply brand names, product names, technical vocabulary, or proper nouns before transcription, and the model biases toward transcribing those terms correctly when they appear in context. This addresses the consistent failure mode where AI transcription renders "Cliprise" as "Clip Rise" or a technical product name as a phonetic approximation.

Entity detection across 56 categories — PII, health data, payment details, and others — with exact timestamps. For compliance workflows, legal transcription, or any content requiring sensitive information identification before distribution.

Multi-language handling without manual segmentation. A recording that switches between English and Spanish mid-conversation is transcribed correctly without needing to separate the audio or specify language switches manually.
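To make the diarization output above concrete, here is a minimal sketch of collapsing word-level, speaker-labeled records into speaker turns. The record shape (`speaker`, `start`, `end`, `text` fields) is a hypothetical simplification, not the actual Scribe v2 response schema:

```python
def group_speaker_turns(words):
    """Collapse word-level records into consecutive speaker turns.

    `words` is a list of dicts with hypothetical fields, e.g.
    {"speaker": "speaker_1", "start": 0.0, "end": 0.4, "text": "Hello"}.
    The real Scribe v2 response schema may differ.
    """
    turns = []
    for w in words:
        if turns and turns[-1]["speaker"] == w["speaker"]:
            # Same speaker as the previous word: extend the current turn.
            turns[-1]["text"] += " " + w["text"]
            turns[-1]["end"] = w["end"]
        else:
            # Speaker changed: start a new turn.
            turns.append({"speaker": w["speaker"], "start": w["start"],
                          "end": w["end"], "text": w["text"]})
    return turns

words = [
    {"speaker": "speaker_1", "start": 0.0, "end": 0.4, "text": "Hello"},
    {"speaker": "speaker_1", "start": 0.5, "end": 0.9, "text": "there."},
    {"speaker": "speaker_2", "start": 1.2, "end": 1.6, "text": "Hi!"},
]
print(group_speaker_turns(words))
```

Each turn keeps the first word's start time and the last word's end time, which is what you need to render "Speaker 1 [00:00–00:01]: …" style transcripts.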

Output format: word-level timestamps for every word in the transcript, exportable as SRT or WebVTT for subtitle workflows.
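As a sketch of that subtitle path: given word-level timestamps (here simplified to `(start, end, text)` tuples rather than the actual API response shape), chunking words into SRT cues is a short transformation:

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def words_to_srt(words, max_words=7):
    """Chunk (start, end, text) word tuples into numbered SRT cues."""
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        start, end = chunk[0][0], chunk[-1][1]
        text = " ".join(w[2] for w in chunk)
        cues.append(f"{len(cues) + 1}\n"
                    f"{srt_timestamp(start)} --> {srt_timestamp(end)}\n"
                    f"{text}\n")
    return "\n".join(cues)

words = [(0.0, 0.3, "Generate"), (0.3, 0.6, "narration,"), (0.6, 1.0, "then"),
         (1.0, 1.4, "extract"), (1.4, 1.8, "subtitles.")]
print(words_to_srt(words, max_words=3))
```

A production version would break cues on pauses and punctuation rather than a fixed word count, but the timestamp math is the same.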


Scribe v2 Realtime: For Live Applications

Scribe v2 Realtime operates at 30-80ms latency — fast enough for real-time conversational AI agents where human-speed response matters.

The 150ms headline figure is the maximum. In practice, the model uses predictive transcription — anticipating the next likely word and punctuation based on context — which produces partial results before the speaker finishes the phrase. This "negative latency" design is what makes real-time conversational agents feel natural rather than slightly delayed.
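Consuming a partial-then-final stream like this usually means tracking two pieces of state: committed text that will never be revised, and a tentative tail that each new partial overwrites. A minimal sketch, assuming a hypothetical `{"type": "partial"|"final", "text": ...}` event shape rather than the actual Scribe v2 Realtime protocol:

```python
class LiveTranscript:
    """Builds a display transcript from partial and final events.

    The event shape here is an assumption for illustration; the real
    Scribe v2 Realtime message format may differ.
    """
    def __init__(self):
        self.committed = []   # finalized segments, never revised
        self.tentative = ""   # latest partial, replaced on each update

    def handle(self, event):
        if event["type"] == "partial":
            self.tentative = event["text"]   # overwrite, don't append
        elif event["type"] == "final":
            self.committed.append(event["text"])
            self.tentative = ""

    @property
    def text(self):
        tail = [self.tentative] if self.tentative else []
        return " ".join(self.committed + tail)

t = LiveTranscript()
for ev in [{"type": "partial", "text": "Hel"},
           {"type": "partial", "text": "Hello th"},
           {"type": "final", "text": "Hello there."},
           {"type": "partial", "text": "How"}]:
    t.handle(ev)
print(t.text)  # -> Hello there. How
```

The key design point is that partials replace rather than append, so a live caption can show the model's current best guess without duplicating words once the final segment arrives.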

The model automatically detects language, handles code-switching between languages mid-conversation, and adapts to accents and background noise without configuration changes. On ElevenLabs' internal benchmark of 500 hard samples with background noise and complex information, Scribe v2 Realtime significantly outperformed competing real-time ASR models.

The use case ElevenLabs focuses on most: AI agents. When building a voice assistant that listens and responds, the transcription quality and latency of the STT layer directly determine how natural the agent feels. Scribe v2 Realtime integrates directly into ElevenLabs Agents as an optional upgrade from the default model.


Scribe v2 on Cliprise

ElevenLabs Speech to Text — powered by Scribe v2 — is available on Cliprise alongside ElevenLabs TTS, Sound Effects v2, and v3 Text to Dialogue.

For video production workflows: generate your AI video, add ElevenLabs TTS narration, then use Scribe v2 to extract the transcript for subtitle generation in CapCut. The workflow eliminates the manual transcription step that previously sat between video generation and subtitle publishing.

Full capabilities and workflow guide: ElevenLabs Speech to Text: Complete Guide →


Ready to Create?

Put your new knowledge into practice with Cliprise.

Start Creating