Every piece of audio content you produce — podcast episodes, interview recordings, AI-generated video narration, voiceover recordings — has a text version locked inside it. ElevenLabs Speech to Text extracts that text in accurate, timestamped, speaker-labeled form, ready for subtitles, captions, repurposing, or editing.
The model powering it — Scribe v2, released January 2026 — handles 90+ languages, up to 48 distinct speakers in a single recording, and audio conditions that trip up older transcription tools: background noise, accented speech, fast delivery, overlapping voices.

What ElevenLabs Speech to Text Produces
Upload an audio or video file. The model returns:
- Full transcript — everything spoken in the recording, in order
- Word-level timestamps — start and end time for every word, usable for subtitle generation or content editing
- Speaker diarization — which speaker said each segment, labeled and timestamped
- Non-speech event detection — laughter, applause, footsteps, background noise — tagged in the transcript where they occur
- Language detection — automatic identification of what language is being spoken, including mid-audio language switches
The output is clean, structured text that can go directly into a subtitle editor, a CMS, a teleprompter, or a document editor without heavy post-processing.
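The structured output above can be sketched as word-level records. A minimal illustration, assuming hypothetical field names (`text`, `start`, `end`, `speaker`) rather than the exact API schema:

```python
# Illustrative shape of a word-level transcription result. Field names
# ("text", "start", "end", "speaker") are assumptions for this sketch,
# not the exact API schema.
words = [
    {"text": "The",   "start": 14.20, "end": 14.35, "speaker": "speaker_1"},
    {"text": "main",  "start": 14.36, "end": 14.60, "speaker": "speaker_1"},
    {"text": "thing", "start": 14.61, "end": 14.90, "speaker": "speaker_1"},
]

def full_transcript(words):
    """Join word-level entries back into plain running text."""
    return " ".join(w["text"] for w in words)

print(full_transcript(words))  # prints: The main thing
```

Because every word carries its own timestamps, the same record list feeds subtitle generation, speaker labeling, and editing workflows without re-running transcription.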
Speaker Diarization: Why It Matters
A single-speaker recording produces a straightforward continuous transcript. Multi-speaker recordings — podcast interviews, panel discussions, meeting recordings, dialogue scenes — need speaker attribution to be usable.
Without diarization, a two-person interview transcript looks like a wall of alternating speech with no indication of who said what. With diarization, each segment is labeled by speaker with its timestamp:
[00:00:14] Speaker 1: The main thing we noticed was that...
[00:01:02] Speaker 2: Right, and that's exactly why we changed the approach.
[00:01:18] Speaker 1: Exactly. What we ended up doing was...
Scribe v2 supports up to 48 distinct speakers in a single recording. For typical content — interviews, podcasts, roundtables — this is well above what is needed. The speaker labels are automatic; you can rename them to actual names in post-processing.
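Producing the speaker-labeled format shown above from word-level output can be sketched like this (field names are illustrative, not the exact API schema):

```python
def diarized_segments(words):
    """Group consecutive same-speaker words into labeled segments.

    Each word is a dict with "text", "start", and "speaker" keys
    (illustrative field names for this sketch).
    """
    segments = []
    for w in words:
        if segments and segments[-1]["speaker"] == w["speaker"]:
            segments[-1]["text"] += " " + w["text"]
        else:
            segments.append({"speaker": w["speaker"],
                             "start": w["start"],
                             "text": w["text"]})
    return segments

def fmt(seconds):
    """Render seconds as [HH:MM:SS], matching the transcript examples above."""
    h, rem = divmod(int(seconds), 3600)
    m, s = divmod(rem, 60)
    return f"[{h:02d}:{m:02d}:{s:02d}]"

words = [
    {"text": "The",   "start": 14.2, "speaker": "Speaker 1"},
    {"text": "main",  "start": 14.4, "speaker": "Speaker 1"},
    {"text": "Right", "start": 62.0, "speaker": "Speaker 2"},
]
for seg in diarized_segments(words):
    print(f"{fmt(seg['start'])} {seg['speaker']}: {seg['text']}")
```

Each segment keeps the timestamp of its first word, which is what a reader needs to jump back to that moment in the recording.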
Keyterm Prompting: Handling Brand Names and Technical Terms
Standard transcription models struggle with proper nouns, brand names, and industry-specific terminology. They produce phonetically plausible spellings that are wrong — "Cliprise" becomes "Clip Rise" or "Clip Rays," a technical product name becomes an approximation.
Scribe v2 supports keyterm prompting — you supply up to 100 specific words or phrases before transcription, and the model biases toward transcribing those terms correctly when it hears them in context.
Practical use:
- Supply your brand name, product names, and key terminology
- Supply interviewee names and company names before transcribing an interview
- Supply technical terms for industry-specific recordings — medical, legal, scientific, or technical content
This single feature meaningfully improves transcript accuracy for any recording that contains vocabulary a general model is unlikely to have seen.
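Keyterm prompting happens inside the model at transcription time. As a complementary illustration only, a simple post-processing pass can also snap near-miss spellings to a supplied keyterm list; this sketch uses Python's difflib and is not part of the ElevenLabs API:

```python
import difflib

def snap_to_keyterms(transcript, keyterms, cutoff=0.8):
    """Post-processing sketch: replace words that closely match a supplied
    keyterm with its canonical spelling. This complements, and does not
    replace, model-level keyterm prompting."""
    out = []
    for word in transcript.split():
        match = difflib.get_close_matches(word, keyterms, n=1, cutoff=cutoff)
        out.append(match[0] if match else word)
    return " ".join(out)

print(snap_to_keyterms("Cliprize is fast", ["Cliprise"]))  # prints: Cliprise is fast
```

A high cutoff (0.8 here) keeps short common words from being falsely snapped to a keyterm; tune it to your vocabulary.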
Audio Quality and Accuracy
The most significant variable affecting transcript quality is audio quality, not model capability. Scribe v2 handles real-world audio conditions better than older transcription models, but the relationship between input quality and output accuracy still holds.
What produces accurate transcripts:
- Clean recording with minimal background noise
- Close-mic or good-quality recording hardware
- Speakers talking one at a time rather than over each other
- Consistent volume level throughout
What degrades accuracy:
- Heavy background noise — music, traffic, crowd sounds
- Multiple speakers talking simultaneously
- Very low bitrate or compressed audio files
- Heavily accented speech, especially in accents that are less common for that language
For recordings that were captured in difficult conditions — outdoor events, phone calls, older recordings — run the audio through ElevenLabs Audio Isolation first to clean up background noise before transcription. The improvement in transcript accuracy from clean audio is more significant than any model-level difference. See ElevenLabs Audio Isolation →
Where Speech to Text Fits in Production Workflows
Podcast Transcription and Repurposing
The complete podcast-to-content workflow:
- Record and edit your podcast episode
- Upload to ElevenLabs Speech to Text on Cliprise
- Get full transcript with speaker labels and timestamps
- Use the transcript as the source for:
  - Show notes (summarize the transcript)
  - Blog post (edit and expand the transcript)
  - Social media quotes (pull the best lines)
  - Email newsletter (extract the key takeaways)
- For YouTube episodes: use the timestamped transcript to generate SRT subtitle files
One recording — multiple content formats from one transcription pass.
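The SRT-generation step in the workflow above can be sketched as follows; the input shape of (start, end, text) tuples is an assumption for this sketch:

```python
def to_srt(segments):
    """Render (start_sec, end_sec, text) tuples as a SubRip (SRT) string:
    numbered cues with HH:MM:SS,mmm --> HH:MM:SS,mmm timing lines."""
    def ts(sec):
        ms = int(round(sec * 1000))
        h, ms = divmod(ms, 3_600_000)
        m, ms = divmod(ms, 60_000)
        s, ms = divmod(ms, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"
    blocks = []
    for i, (start, end, text) in enumerate(segments, 1):
        blocks.append(f"{i}\n{ts(start)} --> {ts(end)}\n{text}")
    return "\n\n".join(blocks) + "\n"

srt = to_srt([(14.2, 17.8, "The main thing we noticed was that..."),
              (62.0, 65.5, "Right, and that's exactly why we changed the approach.")])
print(srt)
```

Save the returned string as a `.srt` file and YouTube will accept it as a caption upload.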
AI Video Caption and Subtitle Generation
Every AI-generated video on Cliprise that includes narration benefits from captions. For platforms where video often plays muted (LinkedIn, Instagram feed), captions are the difference between content that communicates and content that gets scrolled past.
Workflow:
- Generate video narration with ElevenLabs TTS
- Upload the TTS audio to ElevenLabs Speech to Text
- Get a timestamped transcript of the exact narration
- Import the transcript into CapCut as captions, or convert to SRT for other editors
- Captions are word-accurate and precisely timed to the audio
This is more reliable than auto-caption tools in video editors, which sometimes misfire on AI-generated voices with unusual cadence or pacing.
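Turning word timestamps into caption cues means grouping words into short, timed lines. A sketch of that grouping, assuming illustrative word-dict fields; the 42-character cap is a common subtitle convention, not an API setting:

```python
def chunk_captions(words, max_chars=42):
    """Group word-level timestamps into caption cues no longer than
    max_chars, keeping each cue's start and end times. Word dicts use
    illustrative "text"/"start"/"end" field names."""
    cues, current = [], None
    for w in words:
        if current and len(current["text"]) + 1 + len(w["text"]) <= max_chars:
            current["text"] += " " + w["text"]
            current["end"] = w["end"]
        else:
            current = {"text": w["text"], "start": w["start"], "end": w["end"]}
            cues.append(current)
    return cues
```

Because every cue inherits exact word timings, the resulting captions stay synchronized with the TTS narration regardless of its cadence.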
Interview and Meeting Transcription
For content producers who conduct interviews for blogs, newsletters, or research:
- Record the interview (Zoom, in-person with a recorder, phone call)
- Upload to ElevenLabs Speech to Text with speaker diarization enabled and keyterms set for any proper nouns
- Get a speaker-labeled, timestamped transcript
- Edit the transcript directly — tighten the quotes, pull the key insights, structure into an article
Editing a transcript is faster than transcribing manually and faster than working from memory or notes. The timestamped output means you can go back to the exact moment in the recording if you need to verify a quote or pull more context.
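Renaming the automatic speaker labels to real names, as mentioned earlier, is a simple post-processing step. A sketch, assuming segment dicts with a "speaker" key:

```python
def rename_speakers(segments, names):
    """Map automatic speaker labels to real names in post-processing.
    'names' is e.g. {"Speaker 1": "Dana", "Speaker 2": "Sam"}; labels
    without a mapping are left unchanged."""
    return [{**seg, "speaker": names.get(seg["speaker"], seg["speaker"])}
            for seg in segments]
```

The function returns new dicts rather than mutating the originals, so the raw transcript stays intact for verification against the recording.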
Multilingual Content Workflows
For content produced in multiple languages:
Generate the content in Language A, transcribe it with ElevenLabs Speech to Text, use that transcript as the base for translation into other languages, then generate new audio from the translated text with ElevenLabs TTS in a native voice for each language.
The auto-language detection means you do not need to specify which language a recording is in — the model identifies it and transcribes appropriately. For recordings that mix languages — an English speaker with occasional phrases in another language, an interview where participants switch languages — the model handles this without manual segmentation.
Non-Speech Event Detection
Scribe v2 tags non-speech events in the transcript — not just what was said, but what else happened in the audio. Categories include laughter, applause, music, footsteps, and ambient sounds.
For podcast and interview content, this is useful context in the transcript:
[00:04:23] Speaker 1: And that's when everything fell apart — [laughter]
[00:04:29] Speaker 2: Right, at the worst possible moment.
For content analysis and moderation workflows, event detection provides structured metadata about what a recording contains beyond just the words.
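For analysis workflows, separating spoken segments from event tags can be sketched like this (the "type" field and its values are assumptions for this sketch, not the exact API schema):

```python
def split_events(entries):
    """Separate spoken segments from non-speech event tags.
    Each entry is a dict whose "type" is either "speech" or
    "audio_event" (illustrative field names and values)."""
    speech = [e for e in entries if e["type"] == "speech"]
    events = [e for e in entries if e["type"] == "audio_event"]
    return speech, events

entries = [
    {"type": "speech",      "text": "And that's when everything fell apart"},
    {"type": "audio_event", "text": "laughter"},
    {"type": "speech",      "text": "Right, at the worst possible moment."},
]
speech, events = split_events(entries)
```

The event list becomes structured metadata (how much laughter, where the applause lands) while the speech list feeds the normal transcript pipeline.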
Note
ElevenLabs Speech to Text is available on Cliprise alongside ElevenLabs TTS, Audio Isolation, and 45+ other models. Try Cliprise Free →
Related Articles
ElevenLabs tools on Cliprise:
- ElevenLabs TTS: Complete Guide →
- ElevenLabs v3 Text to Dialogue Guide →
- ElevenLabs Sound Effects Complete Guide →
- ElevenLabs on Cliprise: Complete Voice-Over Guide →
Voice and audio guides:
- AI Voice Generator Guide 2026 →
- AI Music Video Production →
- AI Lyric Video: Seedance 2.0 + Audio Sync →
Models on Cliprise:
- ElevenLabs Speech to Text →
- ElevenLabs TTS →
- ElevenLabs Audio Isolation →
- ElevenLabs v3 Text to Dialogue →
