Transcribing millions of podcast episodes is not a simple task. Audio quality varies wildly — from studio-recorded interviews to phone calls recorded over noisy connections. Speakers have different accents, vocabularies, and speaking speeds. Episodes range from 2-minute news briefs to 4-hour deep dives. Building a pipeline that handles all of this reliably, at scale, and with high accuracy required solving some interesting engineering challenges.
The Pipeline
Our transcription pipeline is built as a series of discrete stages, connected by a job queue system powered by Redis and BullMQ. This architecture lets us scale each stage independently, retry failed jobs, and monitor throughput in real time.
Audio Discovery
We monitor podcast RSS feeds for new episodes. When a new episode appears, we extract the audio URL from the enclosure tag and create a transcription job.
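Extracting the audio URL is a small piece of XML plumbing. A minimal sketch using the standard library (the helper name is ours; real feeds need more defensive handling of namespaces and malformed XML):

```python
import xml.etree.ElementTree as ET

def extract_audio_urls(rss_xml: str) -> list[str]:
    """Pull the audio URL from each item's <enclosure> tag in an RSS feed."""
    root = ET.fromstring(rss_xml)
    urls = []
    for item in root.iter("item"):
        enclosure = item.find("enclosure")
        if enclosure is not None:
            url = enclosure.get("url")
            if url:
                urls.append(url)
    return urls
```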
Audio Download & Preprocessing
The audio file is downloaded and converted to 16 kHz mono WAV — the sample rate and channel layout Whisper expects. We also detect silence to find natural segment boundaries, splitting long episodes into manageable chunks.
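The conversion step shells out to ffmpeg. A sketch of the command we build (the specific flags are our choices; only 16 kHz mono is dictated by Whisper's expected input):

```python
def ffmpeg_args(src: str, dst: str) -> list[str]:
    """Build the ffmpeg command that converts any input to 16 kHz mono WAV."""
    return [
        "ffmpeg", "-y",       # overwrite the output file if it exists
        "-i", src,            # input file, any container or codec
        "-ar", "16000",       # resample to 16 kHz
        "-ac", "1",           # downmix to mono
        "-c:a", "pcm_s16le",  # 16-bit PCM, i.e. plain WAV
        dst,
    ]
```

The worker then runs it with something like `subprocess.run(ffmpeg_args("ep.mp3", "ep.wav"), check=True)`, so a non-zero exit code surfaces as a failed job and goes through the normal retry path.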
Speech Recognition
Each audio chunk is processed through Whisper, producing time-stamped text segments. We use the 'base' model for initial processing — it provides a good balance of speed and accuracy for English content.
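One wrinkle of chunking: Whisper's timestamps are relative to the chunk it was given, not the episode, so segments have to be re-based before they are merged. A sketch, assuming segments shaped like the dicts in Whisper's `transcribe()` output (the helper name is ours):

```python
# The per-chunk call looks like:
#   import whisper
#   model = whisper.load_model("base")
#   result = model.transcribe("chunk_000.wav")  # result["segments"]: start/end/text

def merge_chunk_segments(chunks):
    """Re-base per-chunk segments onto the episode timeline.

    `chunks` is a list of (chunk_start_seconds, segments) pairs, where each
    segment is a dict with 'start', 'end', and 'text' keys.
    """
    merged = []
    for chunk_start, segments in chunks:
        for seg in segments:
            merged.append({
                "start": seg["start"] + chunk_start,
                "end": seg["end"] + chunk_start,
                "text": seg["text"],
            })
    return merged
```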
Post-Processing
Raw transcription output is cleaned up: punctuation is normalized, speaker labels are inferred where possible, and segments are merged into coherent paragraphs. The full text is assembled and stored alongside the individual segments.
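The segment-to-paragraph merge can be driven by pause length: a long silence between segments usually marks a topic or speaker change. A minimal sketch, where the 2-second threshold is illustrative rather than our tuned value:

```python
def segments_to_paragraphs(segments, max_gap=2.0):
    """Merge consecutive segments into paragraphs, breaking on long pauses."""
    paragraphs = []
    current = []
    prev_end = None
    for seg in segments:
        # Start a new paragraph when the silence since the last segment is long.
        if prev_end is not None and seg["start"] - prev_end > max_gap:
            paragraphs.append(" ".join(current))
            current = []
        current.append(seg["text"].strip())
        prev_end = seg["end"]
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs
```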
Search Indexing
The transcript is written to PostgreSQL, where database triggers automatically build a tsvector search index. The episode becomes instantly searchable — no manual sync needed.
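A minimal sketch of what such a trigger can look like, using PostgreSQL's built-in full-text machinery (the table and column names here are hypothetical, not our actual schema):

```python
# Hypothetical DDL; the real table and column names may differ.
SEARCH_INDEX_DDL = """
ALTER TABLE transcripts ADD COLUMN IF NOT EXISTS search_vector tsvector;

CREATE OR REPLACE FUNCTION transcripts_tsvector_update() RETURNS trigger AS $$
BEGIN
  NEW.search_vector := to_tsvector('english', coalesce(NEW.full_text, ''));
  RETURN NEW;
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER transcripts_search_update
  BEFORE INSERT OR UPDATE OF full_text ON transcripts
  FOR EACH ROW EXECUTE FUNCTION transcripts_tsvector_update();

CREATE INDEX IF NOT EXISTS transcripts_search_idx
  ON transcripts USING gin (search_vector);
"""
```

The GIN index keeps query latency low even as the corpus grows, and because the trigger fires on every insert or update of the text, the index can never drift out of sync with the transcripts.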
Whisper: The Model Behind It All
We use OpenAI's Whisper, an open-source automatic speech recognition (ASR) model trained on 680,000 hours of multilingual audio. Whisper excels at handling diverse audio conditions — background music, cross-talk, varying recording quality — making it ideal for the messy reality of podcast audio.
Whisper comes in several sizes, each trading speed for accuracy:
| Model | Parameters | Speed | Word error rate (WER) |
|---|---|---|---|
| tiny | 39M | ~32x realtime | ~14% |
| base | 74M | ~16x realtime | ~10% |
| small | 244M | ~6x realtime | ~7% |
| medium | 769M | ~2x realtime | ~5% |
| large-v3 | 1.55B | ~1x realtime | ~3% |
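Those speed factors translate directly into capacity planning. A quick back-of-envelope helper (the arithmetic, not our actual scheduler):

```python
def transcription_minutes(audio_minutes: float, realtime_factor: float) -> float:
    """Estimate wall-clock minutes to transcribe audio at a given realtime factor."""
    return audio_minutes / realtime_factor

# A 60-minute episode at base's ~16x realtime takes under 4 minutes of
# compute; the same episode through large-v3 at ~1x takes the full hour.
```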
We currently use the base model as our default for its strong balance of throughput and accuracy. For priority content or non-English episodes, we can re-process with larger models to improve accuracy.
Handling Scale
The key to processing millions of episodes is horizontal scaling. Our BullMQ-based job queue lets us spin up multiple worker instances, each pulling jobs from the queue independently. If a worker crashes or a job fails, the job is automatically retried with exponential backoff.
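The retry schedule follows the usual exponential-backoff shape, which BullMQ provides as a built-in strategy. A sketch of the delay curve (the base delay and cap here are illustrative, not our production values):

```python
def backoff_delay_ms(attempt: int, base_ms: int = 1000, cap_ms: int = 60_000) -> int:
    """Delay before retry `attempt` (0-based): doubles each time, capped.

    Produces 1s, 2s, 4s, 8s, ... up to a 60s ceiling with these defaults.
    """
    return min(base_ms * 2 ** attempt, cap_ms)
```

Capping the delay matters for long outages: without it, a job that failed a dozen times would wait hours for its next attempt even after the underlying problem is fixed.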
We track transcription status per-episode (PENDING, PROCESSING, COMPLETED, FAILED) so users always know the state of any episode. Episodes that haven't been transcribed yet show a “Transcribe” button, letting users request on-demand processing.
What's Next
We're actively working on speaker diarization (identifying who said what), improved language detection, and fine-tuning Whisper on podcast-specific vocabulary. We're also exploring ways to leverage the transcripts for AI-powered summarization and topic extraction. Stay tuned.