When we set out to build search for PodSearch.io, we had a clear set of requirements: sub-second query times across millions of transcript segments, relevance-ranked results with highlighted snippets, and the ability to sort by date, duration, or relevance. We evaluated several options — Elasticsearch, Meilisearch, Typesense — and ultimately chose PostgreSQL's built-in full-text search. Here's why.
Why PostgreSQL Over Dedicated Search Engines?
| Criteria | Elasticsearch | PostgreSQL FTS |
|---|---|---|
| Infrastructure | Separate cluster to manage | Same database, zero extra ops |
| Data consistency | Eventually consistent (sync lag) | Immediately consistent (triggers) |
| Query flexibility | Own query DSL | SQL — join with any table |
| Snippet highlighting | Built-in | ts_headline with custom tags |
| Relevance ranking | BM25 + boosting | ts_rank_cd with weight arrays |
| Memory usage | JVM heap (2-8 GB+) | Shared buffers (normal PG tuning) |
The deciding factor was operational simplicity. With PostgreSQL FTS, our search index lives in the same database as our data. There's no sync pipeline to break, no separate cluster to monitor, and no consistency gaps. When a transcript is inserted, database triggers immediately update the search index in the same transaction.
How tsvector Search Works
PostgreSQL full-text search has two core concepts: tsvector (the indexed document) and tsquery (the search query).
A tsvector is a sorted list of lexemes (normalized words) with position information. PostgreSQL strips stop words, applies stemming, and records where each word appears. This compact representation is stored in a column on the table and indexed with a GIN (Generalized Inverted Index) for fast lookup.
-- Example: building a tsvector with weighted fields
SELECT
setweight(to_tsvector('english', title), 'A') ||
setweight(to_tsvector('english', coalesce(description, '')), 'B') ||
setweight(to_tsvector('english', coalesce(author, '')), 'C')
FROM "Podcast";The weight labels (A, B, C, D) let us boost matches in more important fields. A match in the title (weight A) scores higher than a match in the description (weight B).
Our Search Architecture
Podcast Index
title (A) + description (B) + author (C)
Episode Index
title + transcript (A) + description (B) + podcast title (C) + author (D)
GIN Indexes
O(1) lookup for any lexeme across millions of rows
Trigger-Based Index Maintenance
Instead of manually syncing search indexes, we use PostgreSQL triggers that fire automatically when data changes:
- Podcast trigger — Fires on INSERT/UPDATE of title, description, or author. Rebuilds the podcast's search vector.
- Episode trigger — Fires on INSERT/UPDATE of title or description. Rebuilds the episode's search vector by joining with its podcast and transcription.
- Transcription trigger — Fires when a transcript is created or updated. Refreshes the parent episode's search vector.
- Cascade trigger — Fires when a podcast's title or author changes. Refreshes all child episode search vectors.
This means search indexes are always up to date. There's no background job, no eventual consistency, no sync lag. The moment data changes, the index reflects it.
Ranking with ts_rank_cd
We use ts_rank_cd (cover density ranking) to score results. Unlike simple term frequency, cover density considers how close matching terms are to each other — documents where search terms appear near each other score higher. We pass weight arrays to boost matches in important fields:
-- Ranking query with weighted fields
SELECT title,
ts_rank_cd(
'{0.2, 0.3, 0.5, 1.0}'::float4[],
"searchVector",
websearch_to_tsquery('english', $1)
) AS rank
FROM "Episode"
WHERE "searchVector" @@ websearch_to_tsquery('english', $1)
ORDER BY rank DESC
LIMIT 20;Snippet Highlighting
PostgreSQL's ts_headline function generates highlighted snippets, wrapping matched terms in custom HTML tags. We use <mark> tags that are styled in our CSS with a purple highlight, giving users immediate visual feedback about why a result matched their query.
Performance
With GIN indexes, full-text queries across millions of rows typically complete in 50-200ms. Combined with connection pooling via Prisma and PostgreSQL's query planner, we consistently deliver sub-second search results even for complex queries with sorting and pagination.