Building Fast Full-Text Search with PostgreSQL

When we set out to build search for PodSearch.io, we had a clear set of requirements: sub-second query times across millions of transcript segments, relevance-ranked results with highlighted snippets, and the ability to sort by date, duration, or relevance. We evaluated several options — Elasticsearch, Meilisearch, Typesense — and ultimately chose PostgreSQL's built-in full-text search. Here's why.

Why PostgreSQL Over Dedicated Search Engines?

Criteria	Elasticsearch	PostgreSQL FTS
Infrastructure	Separate cluster to manage	Same database, zero extra ops
Data consistency	Eventually consistent (sync lag)	Immediately consistent (triggers)
Query flexibility	Own query DSL	SQL — join with any table
Snippet highlighting	Built-in	ts_headline with custom tags
Relevance ranking	BM25 + boosting	ts_rank_cd with weight arrays
Memory usage	JVM heap (2-8 GB+)	Shared buffers (normal PG tuning)

The deciding factor was operational simplicity. With PostgreSQL FTS, our search index lives in the same database as our data. There's no sync pipeline to break, no separate cluster to monitor, and no consistency gaps. When a transcript is inserted, database triggers immediately update the search index in the same transaction.

How tsvector Search Works

PostgreSQL full-text search has two core concepts: tsvector (the indexed document) and tsquery (the search query).

A tsvector is a sorted list of lexemes (normalized words) with position information. PostgreSQL strips stop words, applies stemming, and records where each word appears. This compact representation is stored in a column on the table and indexed with a GIN (Generalized Inverted Index) for fast lookup.

-- Example: building a tsvector with weighted fields
SELECT
  setweight(to_tsvector('english', title), 'A') ||
  setweight(to_tsvector('english', coalesce(description, '')), 'B') ||
  setweight(to_tsvector('english', coalesce(author, '')), 'C')
FROM "Podcast";

The weight labels (A, B, C, D) let us boost matches in more important fields. A match in the title (weight A) scores higher than a match in the description (weight B).

Our Search Architecture

Podcast Index

title (A) + description (B) + author (C)

Episode Index

title + transcript (A) + description (B) + podcast title (C) + author (D)

GIN Indexes

O(1) lookup for any lexeme across millions of rows

Trigger-Based Index Maintenance

Instead of manually syncing search indexes, we use PostgreSQL triggers that fire automatically when data changes:

Podcast trigger — Fires on INSERT/UPDATE of title, description, or author. Rebuilds the podcast's search vector.
Episode trigger — Fires on INSERT/UPDATE of title or description. Rebuilds the episode's search vector by joining with its podcast and transcription.
Transcription trigger — Fires when a transcript is created or updated. Refreshes the parent episode's search vector.
Cascade trigger — Fires when a podcast's title or author changes. Refreshes all child episode search vectors.

This means search indexes are always up to date. There's no background job, no eventual consistency, no sync lag. The moment data changes, the index reflects it.

Ranking with ts_rank_cd

We use ts_rank_cd (cover density ranking) to score results. Unlike simple term frequency, cover density considers how close matching terms are to each other — documents where search terms appear near each other score higher. We pass weight arrays to boost matches in important fields:

-- Ranking query with weighted fields
SELECT title,
  ts_rank_cd(
    '{0.2, 0.3, 0.5, 1.0}'::float4[],
    "searchVector",
    websearch_to_tsquery('english', $1)
  ) AS rank
FROM "Episode"
WHERE "searchVector" @@ websearch_to_tsquery('english', $1)
ORDER BY rank DESC
LIMIT 20;

Snippet Highlighting

PostgreSQL's ts_headline function generates highlighted snippets, wrapping matched terms in custom HTML tags. We use <mark> tags that are styled in our CSS with a purple highlight, giving users immediate visual feedback about why a result matched their query.

Performance

With GIN indexes, full-text queries across millions of rows typically complete in 50-200ms. Combined with connection pooling via Prisma and PostgreSQL's query planner, we consistently deliver sub-second search results even for complex queries with sorting and pagination.

Why PostgreSQL Over Dedicated Search Engines?

Criteria	Elasticsearch	PostgreSQL FTS
Infrastructure	Separate cluster to manage	Same database, zero extra ops
Data consistency	Eventually consistent (sync lag)	Immediately consistent (triggers)
Query flexibility	Own query DSL	SQL — join with any table
Snippet highlighting	Built-in	ts_headline with custom tags
Relevance ranking	BM25 + boosting	ts_rank_cd with weight arrays
Memory usage	JVM heap (2-8 GB+)	Shared buffers (normal PG tuning)

How tsvector Search Works

PostgreSQL full-text search has two core concepts: tsvector (the indexed document) and tsquery (the search query).

-- Example: building a tsvector with weighted fields
SELECT
  setweight(to_tsvector('english', title), 'A') ||
  setweight(to_tsvector('english', coalesce(description, '')), 'B') ||
  setweight(to_tsvector('english', coalesce(author, '')), 'C')
FROM "Podcast";

The weight labels (A, B, C, D) let us boost matches in more important fields. A match in the title (weight A) scores higher than a match in the description (weight B).

Our Search Architecture

Podcast Index

title (A) + description (B) + author (C)

Episode Index

title + transcript (A) + description (B) + podcast title (C) + author (D)

GIN Indexes

O(1) lookup for any lexeme across millions of rows

Trigger-Based Index Maintenance

Instead of manually syncing search indexes, we use PostgreSQL triggers that fire automatically when data changes:

Podcast trigger — Fires on INSERT/UPDATE of title, description, or author. Rebuilds the podcast's search vector.
Episode trigger — Fires on INSERT/UPDATE of title or description. Rebuilds the episode's search vector by joining with its podcast and transcription.
Transcription trigger — Fires when a transcript is created or updated. Refreshes the parent episode's search vector.
Cascade trigger — Fires when a podcast's title or author changes. Refreshes all child episode search vectors.

This means search indexes are always up to date. There's no background job, no eventual consistency, no sync lag. The moment data changes, the index reflects it.

Ranking with ts_rank_cd

-- Ranking query with weighted fields
SELECT title,
  ts_rank_cd(
    '{0.2, 0.3, 0.5, 1.0}'::float4[],
    "searchVector",
    websearch_to_tsquery('english', $1)
  ) AS rank
FROM "Episode"
WHERE "searchVector" @@ websearch_to_tsquery('english', $1)
ORDER BY rank DESC
LIMIT 20;

Building Fast Full-Text Search with PostgreSQL

Why PostgreSQL Over Dedicated Search Engines?

How tsvector Search Works

Our Search Architecture

Trigger-Based Index Maintenance

Ranking with ts_rank_cd

Snippet Highlighting

Performance

Related Articles

How We Transcribe Millions of Podcast Episodes

Introducing PodSearch.io: AI-Powered Podcast Search

The Podcast Discovery Problem (And How We're Solving It)

Try PodSearch.io

Building Fast Full-Text Search with PostgreSQL

Why PostgreSQL Over Dedicated Search Engines?

How tsvector Search Works

Our Search Architecture

Trigger-Based Index Maintenance

Ranking with ts_rank_cd

Snippet Highlighting

Performance

Related Articles

How We Transcribe Millions of Podcast Episodes

Introducing PodSearch.io: AI-Powered Podcast Search

The Podcast Discovery Problem (And How We're Solving It)

Try PodSearch.io