There is a moment in the development of most technology categories when the defining capability gets commoditized. It happened to image generation when Stable Diffusion went open-source. It happened to text generation when GPT-3 became widely available. It is happening to speech-to-text now, and the company setting the new baseline is not Google, Amazon, or OpenAI - it is ElevenLabs, a company that many people still think of primarily as a voice synthesis vendor.
On January 9, 2026, ElevenLabs released Scribe v2. Not one model, but two - each optimized for a fundamentally different use case, each setting a new benchmark on its respective evaluation metric, and together making the case that ElevenLabs has finished the transition from a text-to-speech specialist into something that deserves to be called a full-stack audio AI platform.
Understanding why this matters requires stepping back from the specific release to see what it represents. ElevenLabs built its reputation generating voice. Scribe v2 is about understanding it. The combination is not incidental.
Two Models, Two Different Problems
The release comprises Scribe v2 Batch, launched January 9, and Scribe v2 Realtime, launched three days earlier on January 6. They share a name and a generation, but they are architecturally distinct and designed for different workflows that have historically required entirely different tools.
Scribe v2 Batch is for complete audio files. Upload a podcast episode, a recorded interview, a meeting recording, a video file, a court deposition, a medical consultation, a sales call - any recorded audio content where accuracy matters more than speed, and where you will wait for a complete transcript rather than receiving it word by word as the audio plays. The model is optimized for the full range of real-world audio conditions: diverse speakers, different accents, overlapping speech, background noise, extended silences, tonal variation across a long recording, and the kind of audio degradation that happens when you are not recording in a studio.
ElevenLabs claims the lowest Word Error Rate recorded on industry-standard benchmarks for long-form transcription. Independent evaluations published around the launch date confirmed this against OpenAI Whisper, Google's transcription APIs, and the major enterprise transcription services. The margin varied by language and audio condition, but the direction was consistent.
Scribe v2 Realtime is a different product solving a different problem. It typically transcribes speech in 30 to 80 milliseconds; the headline latency figure of under 150 milliseconds covers the tail of the distribution. This is not a batch system with faster hardware. It is architecturally different from Batch: it uses predictive transcription, generating partial results from context before the speaker finishes an utterance and anticipating the most probable next words and punctuation before they are spoken. When it works correctly - which is often enough to give it the lowest Word Error Rate among low-latency ASR models on the FLEURS multilingual benchmark across the 30 languages tested - it feels less like transcription and more like the system reading along with you in real time.
The latency specification for Realtime is not an academic result. It is the threshold that makes conversational AI agents feel natural. Human response time in conversation begins at approximately 200 milliseconds and becomes uncomfortable above 500 milliseconds. A transcription system with a maximum latency of 150 milliseconds leaves roughly 350 milliseconds of headroom for the inference call that follows it - the language model generating the response - to fit inside a response time the human interlocutor reads as natural. Systems with transcription latencies above 300 milliseconds cannot produce natural-feeling conversation regardless of how fast the language model is.
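The budget arithmetic behind that claim can be sketched in a few lines of Python. The 500 millisecond ceiling and the 150 and 300 millisecond transcription figures come from the paragraph above; the function name is hypothetical and the code is only an illustration of the subtraction, not a measurement.

```python
def remaining_budget_ms(ceiling_ms: float, *stage_latencies_ms: float) -> float:
    """Return the latency left over once the listed pipeline stages are paid for."""
    return ceiling_ms - sum(stage_latencies_ms)

COMFORT_CEILING_MS = 500  # pauses longer than this feel uncomfortable (figure from the text)
STT_TAIL_MS = 150         # Scribe v2 Realtime headline latency (figure from the text)

# Headroom left for the language model and everything after it.
headroom = remaining_budget_ms(COMFORT_CEILING_MS, STT_TAIL_MS)
print(headroom)  # 350

# A transcriber that takes 300 ms leaves far less room for the rest of the pipeline.
slow_headroom = remaining_budget_ms(COMFORT_CEILING_MS, 300)
print(slow_headroom)  # 200
```

Any further stage, such as speech synthesis for the reply, can be passed as an extra argument to see whether the total still fits under the ceiling.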
What Scribe v2 Batch Can Do That Competitors Cannot
The differentiating capabilities in Scribe v2 Batch go well beyond raw accuracy. They reflect a serious attempt to make transcription useful for professional production workflows, not just accurate at the raw speech-to-text task.
Speaker diarization for up to 48 speakers. This is the process of identifying which speaker said which words - labeling the transcript with speaker attribution so you know who is talking throughout. Most transcription services support diarization for small groups. 48-speaker diarization is a different category of capability, relevant for conference recordings, panel discussions, town hall meetings, and any content where many people speak.
Keyterm prompting for up to 100 specific terms. You supply a list of words - brand names, product names, technical vocabulary, proper nouns, industry jargon - and the model weights its transcription toward correctly recognizing those terms when they appear. This addresses the persistent failure mode where AI transcription renders a product name as a common word, a person's name as a phonetic approximation, or a technical term as something vaguely similar that means something completely different. For any professional workflow involving specialized vocabulary, this feature alone justifies the upgrade from general-purpose transcription tools.
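Before a term list is sent anywhere, it usually needs cleaning. The sketch below is not the ElevenLabs API; it only shows one way to deduplicate a vocabulary list and enforce the 100-term ceiling described above. The function name and the case-insensitive deduplication rule are assumptions made for illustration.

```python
MAX_KEYTERMS = 100  # limit stated in the article

def prepare_keyterms(terms: list[str], limit: int = MAX_KEYTERMS) -> list[str]:
    """Strip whitespace, drop blanks and case-insensitive duplicates, enforce the limit."""
    seen: set[str] = set()
    cleaned: list[str] = []
    for term in terms:
        term = term.strip()
        key = term.casefold()
        if not term or key in seen:
            continue
        seen.add(key)
        cleaned.append(term)
    if len(cleaned) > limit:
        raise ValueError(f"{len(cleaned)} keyterms supplied; the limit is {limit}")
    return cleaned

print(prepare_keyterms(["Cliprise", " ElevenLabs ", "cliprise", "", "Scribe v2"]))
# ['Cliprise', 'ElevenLabs', 'Scribe v2']
```

Raising an error instead of silently truncating is a deliberate choice here: dropping the 101st term without notice would reintroduce exactly the misrecognition the list was meant to prevent.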
Entity detection across 56 categories with timestamps. The model identifies instances of specific information types - personally identifiable information, health data, payment details, and 53 other categories - and returns their exact timestamps alongside the transcript. For compliance workflows, legal transcription, medical documentation, or any content that needs to be reviewed for sensitive information before distribution, this transforms transcription from a raw text output into a structured analysis.
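To make the compliance use concrete, here is a minimal sketch of redaction driven by entity timestamps. The `Word` and `Entity` shapes and the category name are assumptions for illustration, not the actual Scribe v2 response format; the only idea taken from the text is that entities arrive with start and end times that can be matched against word timings.

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds

@dataclass
class Entity:
    category: str
    start: float  # seconds
    end: float    # seconds

def redact(words: list[Word], entities: list[Entity]) -> str:
    """Replace every word whose time span overlaps an entity with a [CATEGORY] tag."""
    out: list[str] = []
    for w in words:
        # Two intervals overlap when each one starts before the other ends.
        hit = next((e for e in entities if w.start < e.end and e.start < w.end), None)
        token = f"[{hit.category.upper()}]" if hit else w.text
        # Collapse a run of redacted words with the same tag into a single tag.
        if hit and out and out[-1] == token:
            continue
        out.append(token)
    return " ".join(out)

words = [
    Word("My", 0.0, 0.2), Word("card", 0.2, 0.5), Word("is", 0.5, 0.6),
    Word("4111", 0.6, 1.0), Word("1111", 1.0, 1.4), Word("thanks", 1.5, 1.9),
]
entities = [Entity("payment_details", 0.6, 1.4)]  # hypothetical category name
print(redact(words, entities))  # My card is [PAYMENT_DETAILS] thanks
```

The same timestamps could equally drive audio muting or a reviewer queue; text redaction is simply the smallest example.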
Multi-language handling without segmentation. A recording that switches between English and Spanish mid-conversation transcribes correctly without any manual configuration. The model detects the language transition automatically and applies the appropriate transcription model to each segment. For international teams, multilingual content, or any context where speakers might naturally switch languages - which is common in code-switching communities, international business contexts, and multilingual households - this removes a manual step that previously required either separate recording sessions or post-hoc segmentation.
The Realtime Use Case Is Bigger Than It Looks
Scribe v2 Realtime is being positioned primarily for voice agent development. ElevenLabs has integrated it directly into its Agents platform as an optional upgrade. This makes sense as a starting point because voice agents are the most obvious application for ultra-low-latency transcription.
But the actual addressable use case is larger. Any live application where a human speaks and software needs to understand them in real time is a candidate. Real-time captioning for live events. Meeting transcription that populates notes as the meeting happens rather than after it ends. Live translation where the source transcript is the first step in a pipeline. Customer service tools that surface information to human agents as the customer is still speaking. Accessibility applications that provide real-time captions for people with hearing impairments. Broadcast production tools that flag content in live feeds.
The 93.5% accuracy figure on the FLEURS multilingual benchmark across 30 languages is meaningful specifically because it is measured on a diverse language set rather than on English alone. AI transcription that is excellent in English and degraded in other languages is not a global product. A model that holds 93.5% accuracy across 30 languages, including both European and Asian languages, is. In ElevenLabs' own testing against Google Gemini Flash, OpenAI GPT-4o Mini, and Deepgram Nova 3, Scribe v2 Realtime led on that benchmark.
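For readers who want to check such figures against their own audio, Word Error Rate is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below is a generic textbook implementation, not ElevenLabs' evaluation code. It splits on whitespace only, whereas real benchmarks normally normalize casing and punctuation first. If accuracy is reported as one minus WER, which the article does not state and is assumed here, 93.5% accuracy corresponds to a WER of 6.5%.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion: reference word missing from output
                curr[j - 1] + 1,         # insertion: extra word in output
                prev[j - 1] + (r != h),  # substitution, or a free match
            ))
        prev = curr
    return prev[-1] / len(ref)

print(word_error_rate("the quick brown fox jumps", "the quick brown box jumps"))  # 0.2
```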
What This Means for the ElevenLabs Stack
ElevenLabs started as a company that could make AI voice sound human. The version of ElevenLabs that shipped Scribe v2 is a company that can generate voice, clone voice, create sound effects and music, run multi-speaker dialogue from text, and now transcribe speech back to text at the highest accuracy levels in the industry.
The loop is complete. Generate audio with ElevenLabs TTS. Build multi-speaker conversations with v3 Text to Dialogue. Add environmental audio with Sound Effects v2. Then transcribe it all back to text with Scribe v2 for subtitles, captions, repurposed written content, or editorial review.
For video production workflows - particularly the combination of AI-generated video with AI-generated voice narration that has become a primary use case on Cliprise - Scribe v2 closes the last gap in the production pipeline. Generate the video. Add TTS narration. Use Scribe to extract a timestamped transcript for subtitle generation. Publish with captions. Previously, each of these steps required a different tool.
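The subtitle step is mostly a formatting exercise once timestamps exist. The sketch below turns timed segments into SubRip (.srt) text. The `(start, end, text)` tuple shape is an assumption about how a timestamped transcript might be held in memory, not the Scribe v2 response format, and both function names are hypothetical.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp used in SubRip files."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

def to_srt(segments: list[tuple[float, float, str]]) -> str:
    """Render (start_seconds, end_seconds, text) segments as numbered SubRip cues."""
    cues = []
    for index, (start, end, text) in enumerate(segments, start=1):
        cues.append(f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n")
    # A blank line separates consecutive cues.
    return "\n".join(cues)

print(to_srt([(0.0, 2.5, "Welcome back."), (2.5, 4.0, "Let's get started.")]))
```

In practice the segments would be built by grouping word-level timestamps into caption-sized lines before this formatting step.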
ElevenLabs Speech to Text on Cliprise
ElevenLabs Speech to Text - powered by Scribe v2 - is available on Cliprise alongside ElevenLabs TTS, Sound Effects v2, and v3 Text to Dialogue. The model is accessible through the AI Voice Generator feature as part of the standard Cliprise credit system.
For creators who work with long-form audio content, the batch transcription workflow with speaker diarization and keyterm prompting is the primary use case. For developers building any product that involves voice agents or real-time audio processing, the Realtime API documentation at elevenlabs.io provides the WebSocket implementation details.
The full workflow breakdown - including how to integrate Scribe v2 into a video production pipeline, how to configure keyterm prompting for brand terminology, and how Scribe v2 compares to the other transcription options in the Cliprise stack - is in the ElevenLabs Speech to Text complete guide.
For context on the broader ElevenLabs suite and how the different models fit together, the ElevenLabs voice-over complete guide covers the full stack, and the AI voice generator guide for 2026 places it in the context of all current AI audio options.
The transcription problem is not glamorous. It does not generate viral demo videos or social media hype. But for anyone building a real product that involves audio - and in 2026, that is most content workflows - it is the part of the stack that determines whether the rest of the pipeline is actually useful.
