Transcribing millions of podcast episodes is not a simple task. Audio quality varies wildly — from studio-recorded interviews to phone calls recorded over noisy connections. Speakers have different accents, vocabularies, and speaking speeds. Episodes range from 2-minute news briefs to 4-hour deep dives. Building a pipeline that handles all of this reliably, at scale, and with high accuracy required solving some interesting engineering challenges.
The Pipeline
Our transcription pipeline is built as a series of discrete stages, connected by a job queue system powered by Redis and BullMQ. This architecture lets us scale each stage independently, retry failed jobs, and monitor throughput in real time.
Audio Discovery
We monitor podcast RSS feeds for new episodes. When a new episode appears, we extract the audio URL from the enclosure tag and create a transcription job.
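Extracting the audio URL is a small piece of XML plumbing. A minimal sketch using the standard library (the helper name is ours; real feeds need more defensive handling of namespaces and malformed XML):

```python
import xml.etree.ElementTree as ET

def extract_audio_urls(rss_xml: str) -> list[str]:
    """Pull the audio URL from each item's <enclosure> tag in an RSS feed."""
    root = ET.fromstring(rss_xml)
    urls = []
    for item in root.iter("item"):
        enclosure = item.find("enclosure")
        if enclosure is not None:
            url = enclosure.get("url")
            if url:
                urls.append(url)
    return urls
```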
Audio Download & Preprocessing
The audio file is downloaded and converted to 16 kHz mono WAV — the sample rate and channel layout Whisper expects. We also detect silence to find natural segment boundaries, splitting long episodes into manageable chunks.
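The conversion step shells out to ffmpeg. A sketch of the command we build (the specific flags are our choices; only 16 kHz mono is dictated by Whisper's expected input):

```python
def ffmpeg_args(src: str, dst: str) -> list[str]:
    """Build the ffmpeg command that converts any input to 16 kHz mono WAV."""
    return [
        "ffmpeg", "-y",       # overwrite the output file if it exists
        "-i", src,            # input file, any container or codec
        "-ar", "16000",       # resample to 16 kHz
        "-ac", "1",           # downmix to mono
        "-c:a", "pcm_s16le",  # 16-bit PCM, i.e. plain WAV
        dst,
    ]
```

The worker then runs it with something like `subprocess.run(ffmpeg_args("ep.mp3", "ep.wav"), check=True)`, so a non-zero exit code surfaces as a failed job and goes through the normal retry path.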
Speech Recognition
Each audio chunk is processed through Whisper, producing time-stamped text segments. We use the 'base' model for initial processing — it provides a good balance of speed and accuracy for English content.
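One wrinkle of chunking: Whisper's timestamps are relative to the chunk it was given, not the episode, so segments have to be re-based before they are merged. A sketch, assuming segments shaped like the dicts in Whisper's `transcribe()` output (the helper name is ours):

```python
# The per-chunk call looks like:
#   import whisper
#   model = whisper.load_model("base")
#   result = model.transcribe("chunk_000.wav")  # result["segments"]: start/end/text

def merge_chunk_segments(chunks):
    """Re-base per-chunk segments onto the episode timeline.

    `chunks` is a list of (chunk_start_seconds, segments) pairs, where each
    segment is a dict with 'start', 'end', and 'text' keys.
    """
    merged = []
    for chunk_start, segments in chunks:
        for seg in segments:
            merged.append({
                "start": seg["start"] + chunk_start,
                "end": seg["end"] + chunk_start,
                "text": seg["text"],
            })
    return merged
```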
Post-Processing
Raw transcription output is cleaned up: punctuation is normalized, speaker labels are inferred where possible, and segments are merged into coherent paragraphs. The full text is assembled and stored alongside the individual segments.
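The segment-to-paragraph merge can be driven by pause length: a long silence between segments usually marks a topic or speaker change. A minimal sketch, where the 2-second threshold is illustrative rather than our tuned value:

```python
def segments_to_paragraphs(segments, max_gap=2.0):
    """Merge consecutive segments into paragraphs, breaking on long pauses."""
    paragraphs = []
    current = []
    prev_end = None
    for seg in segments:
        # Start a new paragraph when the silence since the last segment is long.
        if prev_end is not None and seg["start"] - prev_end > max_gap:
            paragraphs.append(" ".join(current))
            current = []
        current.append(seg["text"].strip())
        prev_end = seg["end"]
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs
```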
Search Indexing
The transcript is written to PostgreSQL, where database triggers automatically build a tsvector search index. The episode becomes instantly searchable — no manual sync needed.
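A minimal sketch of what such a trigger can look like, using PostgreSQL's built-in full-text machinery (the table and column names here are hypothetical, not our actual schema):

```python
# Hypothetical DDL; the real table and column names may differ.
SEARCH_INDEX_DDL = """
ALTER TABLE transcripts ADD COLUMN IF NOT EXISTS search_vector tsvector;

CREATE OR REPLACE FUNCTION transcripts_tsvector_update() RETURNS trigger AS $$
BEGIN
  NEW.search_vector := to_tsvector('english', coalesce(NEW.full_text, ''));
  RETURN NEW;
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER transcripts_search_update
  BEFORE INSERT OR UPDATE OF full_text ON transcripts
  FOR EACH ROW EXECUTE FUNCTION transcripts_tsvector_update();

CREATE INDEX IF NOT EXISTS transcripts_search_idx
  ON transcripts USING gin (search_vector);
"""
```

The GIN index keeps query latency low even as the corpus grows, and because the trigger fires on every insert or update of the text, the index can never drift out of sync with the transcripts.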
Whisper: The Model Behind It All
We use OpenAI's Whisper, an open-source automatic speech recognition (ASR) model trained on 680,000 hours of multilingual audio. Whisper excels at handling diverse audio conditions — background music, cross-talk, varying recording quality — making it ideal for the messy reality of podcast audio.
Whisper comes in several sizes, each trading speed for accuracy:
| Model | Parameters | Speed | Word error rate (WER) |
|---|---|---|---|
| tiny | 39M | ~32x realtime | ~14% |
| base | 74M | ~16x realtime | ~10% |
| small | 244M | ~6x realtime | ~7% |
| medium | 769M | ~2x realtime | ~5% |
| large-v3 | 1.55B | ~1x realtime | ~3% |
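Those speed factors translate directly into capacity planning. A quick back-of-envelope helper (the arithmetic, not our actual scheduler):

```python
def transcription_minutes(audio_minutes: float, realtime_factor: float) -> float:
    """Estimate wall-clock minutes to transcribe audio at a given realtime factor."""
    return audio_minutes / realtime_factor

# A 60-minute episode at base's ~16x realtime takes under 4 minutes of
# compute; the same episode through large-v3 at ~1x takes the full hour.
```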
We currently use the base model as our default for its strong balance of throughput and accuracy. For priority content or non-English episodes, we can re-process with larger models to improve accuracy.
Handling Scale
The key to processing millions of episodes is horizontal scaling. Our BullMQ-based job queue lets us spin up multiple worker instances, each pulling jobs from the queue independently. If a worker crashes or a job fails, the job is automatically retried with exponential backoff.
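The retry schedule follows the usual exponential-backoff shape, which BullMQ provides as a built-in strategy. A sketch of the delay curve (the base delay and cap here are illustrative, not our production values):

```python
def backoff_delay_ms(attempt: int, base_ms: int = 1000, cap_ms: int = 60_000) -> int:
    """Delay before retry `attempt` (0-based): doubles each time, capped.

    Produces 1s, 2s, 4s, 8s, ... up to a 60s ceiling with these defaults.
    """
    return min(base_ms * 2 ** attempt, cap_ms)
```

Capping the delay matters for long outages: without it, a job that failed a dozen times would wait hours for its next attempt even after the underlying problem is fixed.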
We track transcription status per-episode (PENDING, PROCESSING, COMPLETED, FAILED) so users always know the state of any episode. Episodes that haven't been transcribed yet show a “Transcribe” button, letting users request on-demand processing.
What's Next
We're actively working on speaker diarization (identifying who said what), improved language detection, and fine-tuning Whisper on podcast-specific vocabulary. We're also exploring ways to leverage the transcripts for AI-powered summarization and topic extraction. Stay tuned.