Advanced Similarity Search: Algorithms and Optimization in PostgreSQL
When you have a database with millions of vectors and need to find the 10 most similar to a query in under 50 milliseconds, similarity search becomes a serious engineering challenge. Exact (brute-force) search is precise but does not scale: comparing every vector against every query has O(n*d) complexity where n is the number of vectors and d the dimensionality. With 10 million vectors at 1536 dimensions, each query would require 15 billion floating-point operations.
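To make that cost concrete, here is a minimal brute-force search in NumPy, with sizes scaled down from the example above (the random data and dimensions are illustrative only):

```python
import numpy as np

# Scaled-down illustration: 100K vectors at 128 dims (vs 10M at 1536 in the text)
rng = np.random.default_rng(42)
db = rng.standard_normal((100_000, 128)).astype(np.float32)
query = rng.standard_normal(128).astype(np.float32)

# Normalize so a plain dot product equals cosine similarity
db /= np.linalg.norm(db, axis=1, keepdims=True)
query /= np.linalg.norm(query)

# Brute force: one dot product per stored vector -> O(n * d) work per query
sims = db @ query                  # 100K x 128 = 12.8M multiply-adds
top10 = np.argsort(-sims)[:10]     # exact top-10, 100% recall by construction
```

Every query touches every row: doubling the dataset doubles the latency, which is exactly the scaling an ANN index avoids.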
The solution is ANN (Approximate Nearest Neighbor) algorithms: they trade the guarantee of finding the exact result for sub-linear query time, typically O(log n), with practical accuracy of 95-99%. PostgreSQL with pgvector implements the two most widely used ANN algorithms: HNSW and IVFFlat.
In this article we analyze how these algorithms work internally, when to use each one, how to optimize similarity search queries in PostgreSQL, how to combine vector search with metadata filters, and advanced techniques like MMR for diversified results.
Series Overview
| # | Article | Focus |
|---|---|---|
| 1 | pgvector | Installation, operators, indexing |
| 2 | Embeddings Deep Dive | Models, distances, generation |
| 3 | RAG with PostgreSQL | End-to-end RAG pipeline |
| 4 | You are here - Similarity Search | Algorithms and optimization |
| 5 | HNSW and IVFFlat Indexing | Advanced indexing strategies |
| 6 | RAG in Production | Scalability and performance |
What You Will Learn
- The difference between exact search and approximate nearest neighbor (ANN)
- How HNSW works internally: hierarchical navigable small world graphs
- How IVFFlat works: K-means clustering and probe search
- When to use HNSW vs IVFFlat vs brute force
- Query optimization: ef_search and probes parameters
- Hybrid filters: combining metadata filtering with vector search
- Distance operators: cosine, L2, and inner product compared
- Benchmarking and recall measurement with Python code
- Maximal Marginal Relevance (MMR) for diversified results
- Semantic search vs keyword search: when to use which
Exact Search vs Approximate Search
Brute Force (Exact Search)
Exact search compares the query against every single vector in the database. It guarantees 100% recall but has linear complexity. It is the right choice only for small datasets or when absolute precision is a non-negotiable requirement:
-- Exact search: no index used, full table scan
-- Useful for small datasets (< 100K vectors) or when precision is critical
SELECT id, content, embedding <=> query_vec AS distance
FROM documents
ORDER BY embedding <=> query_vec -- sequential scan over the entire dataset
LIMIT 10;
-- Force brute force even when an ANN index exists:
SET enable_indexscan = off;
SELECT id, content, embedding <=> query_vec AS distance
FROM documents
ORDER BY embedding <=> query_vec
LIMIT 10;
SET enable_indexscan = on; -- restore
-- Comparative benchmark (real dataset on NVMe SSD):
-- Dataset: 1M vectors, 1536 dim, PostgreSQL 16, 32GB RAM, 16 CPU
-- Brute force (seq scan): ~2000ms per query -- 2 seconds!
-- HNSW (ef_search=40): ~10ms per query -- 200x faster
-- HNSW (ef_search=100): ~25ms per query -- 80x faster
-- IVFFlat (probes=50): ~28ms per query -- 71x faster
-- 100K vector dataset (brute force acceptable):
-- Brute force: ~50ms per query -- acceptable
-- HNSW: ~3ms per query -- still better when available
ANN: The Recall/Performance Trade-off
-- Verify whether the query is using the ANN index
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT id, content, embedding <=> '[0.1, 0.2, ...]'::vector AS distance
FROM documents
ORDER BY embedding <=> '[0.1, 0.2, ...]'::vector
LIMIT 10;
-- Desired output (uses HNSW index):
-- Limit (actual rows=10)
--   -> Index Scan using documents_embedding_hnsw on documents
--        Order By: (embedding <=> '...'::vector)
-- Buffers: shared hit=347 (everything from cache = excellent!)
-- Execution Time: 8.3 ms
-- Undesired output (brute force):
-- Limit
--   -> Sort (Sort Key: embedding <=> '...'::vector) -- sorts after full scan
--        -> Seq Scan on documents
-- Buffers: shared hit=234 read=8921 (many disk reads = slow!)
-- Execution Time: 2341 ms
-- WHY PostgreSQL may not use the index:
-- 1. LIMIT too large (>10% of rows for small tables)
-- 2. Statistics are stale: run ANALYZE documents;
-- 3. Index does not exist: check with \d documents
-- 4. enable_indexscan accidentally set to off
Distance Operators in pgvector
pgvector supports three distance operators, each appropriate for different types of embeddings. Choosing the wrong operator can significantly reduce search quality and cause PostgreSQL to ignore your index entirely.
| Operator | Type | Range | When to Use | Index Support |
|---|---|---|---|---|
| `<=>` | Cosine distance | [0, 2] | Text embeddings (OpenAI, Sentence Transformers) - DEFAULT choice | HNSW, IVFFlat (vector_cosine_ops) |
| `<->` | Euclidean (L2) distance | [0, inf) | Vectors where magnitude matters (images, audio features) | HNSW, IVFFlat (vector_l2_ops) |
| `<#>` | Negative inner product | (-inf, 0] | Pre-normalized embeddings, maximum inner product search | HNSW, IVFFlat (vector_ip_ops) |
-- Cosine distance (most common for text embeddings):
SELECT id, content,
embedding <=> query_vec AS cosine_distance,
1 - (embedding <=> query_vec) AS cosine_similarity
FROM documents
ORDER BY embedding <=> query_vec
LIMIT 5;
-- L2 distance (Euclidean, for images/audio):
SELECT id, content,
embedding <-> query_vec AS l2_distance
FROM documents
ORDER BY embedding <-> query_vec
LIMIT 5;
-- Inner product (for normalized embeddings):
SELECT id, content,
(embedding <#> query_vec) * -1 AS inner_product -- negated so higher = more similar
FROM documents
ORDER BY embedding <#> query_vec -- ORDER BY works because <#> is negative
LIMIT 5;
-- Create index with the CORRECT operator (MUST match the query operator!):
-- For cosine distance (default for text):
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);
-- For L2 distance:
CREATE INDEX ON documents USING hnsw (embedding vector_l2_ops);
-- For inner product:
CREATE INDEX ON documents USING hnsw (embedding vector_ip_ops);
-- COMMON MISTAKE: index with vector_cosine_ops but query uses <->
-- Result: PostgreSQL ignores the index and falls back to brute force!
-- Always verify with EXPLAIN ANALYZE that the index is being used.
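On unit-length vectors the three operators produce the same ranking, which is why pre-normalized text embeddings work with any of them. A quick NumPy check of the identities behind this (cosine distance = 1 - dot product, and squared L2 distance = 2 × cosine distance):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.standard_normal(1536), rng.standard_normal(1536)
a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)  # unit length, like OpenAI embeddings

dot = float(a @ b)
cosine_distance = 1.0 - dot                  # what <=> computes
l2_distance = float(np.linalg.norm(a - b))   # what <-> computes
neg_inner = -dot                             # what <#> computes

# For unit vectors: ||a - b||^2 = 2 - 2*(a . b) = 2 * cosine_distance,
# so sorting by <=>, <-> or <#> gives the same order on normalized data
assert abs(l2_distance ** 2 - 2 * cosine_distance) < 1e-9
```

The practical consequence: for normalized embeddings the operator choice affects index compatibility and score interpretation, not the ranking itself.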
HNSW: Hierarchical Navigable Small World
HNSW is the most popular ANN algorithm in 2025-2026. It is based on the theory of small world graphs: structures where every node is reachable from any other in a small number of hops (the "six degrees of separation" phenomenon). The hierarchical structure allows rapid navigation toward the most relevant region of vector space, using upper layers as "highways" and the base layer for precise search.
How HNSW Works Internally
HNSW builds a hierarchical multi-layer structure where each layer is a subgraph of the layer below:
- Layer 0 (base): All vectors, connected to their nearest neighbors. Dense graph with the highest connectivity.
- Upper layers (1, 2, ...): Progressively smaller subsets of vectors. Sparse graphs for fast long-range navigation.
- Entry point: Search always starts at the highest layer (fewest vectors) and greedily descends toward the target.
-- HNSW conceptual structure:
--
-- Layer 2: o-----------o-----------o (few nodes, long jumps - fast navigation)
-- \ | /
-- Layer 1: o---o---o---o---o---o---o (intermediate navigation)
-- \ | | /
-- Layer 0: o-o-o-o-o-o-o-o-o-o-o-o (all vectors, precise search)
--
-- Search algorithm for query Q:
-- 1. Start from the highest layer with a fixed entry point
-- 2. Greedy descent: move toward the node closest to Q at each layer
-- 3. Use that node as the entry point for the layer below
-- 4. Repeat until layer 0
-- 5. At layer 0: beam search with ef_search candidates
-- 6. Return top-k from the explored candidates
-- Theoretical complexity:
-- Build: O(n * log(n)) -- sub-linear in number of vectors
-- Query: O(log(n)) -- logarithmic! vs O(n) for brute force
-- Creating an HNSW index (optimal values for most cases):
CREATE INDEX documents_hnsw_idx
ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (
m = 16, -- connections per node (default 16, range 4-64)
ef_construction = 64 -- candidates during build (default 64, range 16-200)
);
-- Set ef_search at query time (no index rebuild required):
SET hnsw.ef_search = 60; -- recommended value for production RAG
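Steps 1-6 of the search can be sketched in Python for a single layer: a minimal best-first beam search over a precomputed neighbor graph. This is a simplification of what pgvector does at layer 0; the real index maintains multiple layers, picks the entry point by greedy descent, and stores the graph on disk pages.

```python
import heapq
import numpy as np

def beam_search_layer(query, vectors, neighbors, entry, ef):
    """Best-first beam search over one graph layer, keeping the ef best candidates.

    vectors:   (n, d) array; neighbors: dict node -> list of adjacent nodes;
    entry:     starting node (in real HNSW, found by descending the upper layers).
    """
    dist = lambda i: float(np.linalg.norm(vectors[i] - query))
    visited = {entry}
    candidates = [(dist(entry), entry)]   # min-heap: closest unexplored first
    best = [(-dist(entry), entry)]        # max-heap (negated) of the ef best so far
    while candidates:
        d, node = heapq.heappop(candidates)
        if d > -best[0][0]:               # nothing left closer than the worst kept
            break
        for nb in neighbors[node]:
            if nb in visited:
                continue
            visited.add(nb)
            d_nb = dist(nb)
            if len(best) < ef or d_nb < -best[0][0]:
                heapq.heappush(candidates, (d_nb, nb))
                heapq.heappush(best, (-d_nb, nb))
                if len(best) > ef:
                    heapq.heappop(best)   # drop the worst of the beam
    return sorted((-d, i) for d, i in best)  # (distance, node), closest first

# Toy graph: connect each node to its 8 nearest neighbors
# (roughly what ef_construction decides at build time)
rng = np.random.default_rng(1)
vecs = rng.standard_normal((200, 8))
d2 = ((vecs[:, None, :] - vecs[None, :, :]) ** 2).sum(-1)
nbrs = {i: list(np.argsort(d2[i])[1:9]) for i in range(200)}
q = rng.standard_normal(8)
top = beam_search_layer(q, vecs, nbrs, entry=0, ef=10)
```

A larger `ef` widens the beam: more candidates are kept alive, so the search is less likely to get stuck in a local neighborhood, at the cost of more distance computations. That is exactly the ef_search trade-off in the table below.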
HNSW Parameters Reference Table
| Parameter | Description | Default | Range | Effect of Increasing |
|---|---|---|---|---|
| `m` | Max connections per node per layer | 16 | 4-64 | Better recall, more memory, slower build |
| `ef_construction` | Candidates considered during index build | 64 | 16-200 | Better potential recall, much slower build |
| `ef_search` | Candidates considered during query (runtime) | 40 | 1-1000 | Better recall, slower query |
-- Configuring ef_search for queries (runtime parameter, no rebuild)
SET hnsw.ef_search = 40; -- default, good balance
-- High precision (enterprise RAG, medical, legal applications):
SET hnsw.ef_search = 100;
-- High speed (autocomplete, recommendations, low-stakes search):
SET hnsw.ef_search = 20;
-- Benchmark reference (1M vectors, 1536 dim, m=16, ef_construction=64):
-- ef_search=20: ~5ms/query, recall@10 ~85%
-- ef_search=40: ~10ms/query, recall@10 ~92%
-- ef_search=100: ~25ms/query, recall@10 ~97%
-- ef_search=200: ~50ms/query, recall@10 ~99%
-- Set for current session (does not persist across sessions):
SELECT set_config('hnsw.ef_search', '100', false);
-- Key insight: ef_search is the most powerful tuning lever
-- because it can be changed per-query without any index modification.
-- Use a higher value for critical searches, lower for bulk operations.
IVFFlat: Inverted File with Flat Compression
IVFFlat uses a different approach: it divides the vector space into clusters (cells) using K-means, and during a query searches only the most promising clusters. It is simpler than HNSW but requires existing data for K-means training and degrades more quickly with incremental inserts.
How IVFFlat Works
- Training phase: K-means clustering divides the vectors into `lists` clusters, one centroid per cluster
- Build phase: Each vector is assigned to the nearest cluster (centroid)
- Query time: Finds the `probes` nearest centroids to the query, then searches exactly within those clusters
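The three phases can be sketched in NumPy (a toy K-means over random data; pgvector runs its own sampled K-means during CREATE INDEX, and the distances here are L2 for simplicity):

```python
import numpy as np

rng = np.random.default_rng(7)
vectors = rng.standard_normal((5_000, 64)).astype(np.float32)

# Training: a few K-means iterations place `lists` centroids
lists = 50
centroids = vectors[rng.choice(len(vectors), lists, replace=False)]
for _ in range(5):
    assign = np.argmin(((vectors[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
    for c in range(lists):
        members = vectors[assign == c]
        if len(members):
            centroids[c] = members.mean(axis=0)

# Build: one inverted list of vector ids per centroid
inverted = {c: np.where(assign == c)[0] for c in range(lists)}

def ivf_search(query, probes=5, k=10):
    # Query: rank centroids, then scan only the `probes` nearest cells exactly
    order = np.argsort(((centroids - query) ** 2).sum(-1))[:probes]
    ids = np.concatenate([inverted[c] for c in order])
    dists = np.linalg.norm(vectors[ids] - query, axis=1)
    top = np.argsort(dists)[:k]
    return ids[top], dists[top]

q = rng.standard_normal(64).astype(np.float32)
ids, dists = ivf_search(q, probes=5)  # scans ~10% of the data, not all of it
```

With `probes` equal to `lists` the search degenerates to brute force and recall is 100%; with few probes you only pay for the cells nearest the query, which is where the recall loss comes from when the true neighbor sits just across a cluster boundary.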
-- IVFFlat requires data before creating the index
-- (it needs to perform K-means clustering first)
-- PREREQUISITE: the table must have at least a few hundred rows
-- Rule of thumb: lists = sqrt(n_rows), with a minimum of 100
-- Creating an IVFFlat index
CREATE INDEX documents_ivfflat_idx
ON documents
USING ivfflat (embedding vector_cosine_ops)
WITH (
lists = 100 -- number of clusters (default 100, recommended: sqrt(n_rows))
);
-- For 1M rows: lists = sqrt(1000000) = 1000
CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops) WITH (lists = 1000);
-- Configure probes (how many clusters to search at query time)
SET ivfflat.probes = 10; -- searches 10 clusters out of 1000 (pgvector's default is 1)
-- High precision mode:
SET ivfflat.probes = 50; -- searches 50/1000 clusters = 5% of the space
-- Benchmark reference (1M vectors, lists=1000):
-- probes=5: ~3ms/query, recall@10 ~72%
-- probes=10: ~6ms/query, recall@10 ~82%
-- probes=30: ~18ms/query, recall@10 ~94%
-- probes=100: ~60ms/query, recall@10 ~99%
-- IVFFlat vs HNSW at equivalent recall ~92%:
-- IVFFlat (probes=50): ~28ms -- slower!
-- HNSW (ef_search=40): ~10ms -- faster
-- Conclusion: for high recall, HNSW is typically more efficient
HNSW vs IVFFlat: Choosing the Right Algorithm
| Characteristic | HNSW | IVFFlat |
|---|---|---|
| Query latency | Faster (often 2-5x at same recall) | Slower at high recall targets |
| Build time | Slower (complex graph construction) | Much faster (~4x) |
| Index memory | Larger (~2-4x) | Smaller |
| Recall at same latency | Better (generally) | Slightly lower |
| Incremental inserts | Excellent (no retraining needed) | Degrades: clusters do not readjust |
| Data required for build | None (can start empty) | Requires existing data (K-means) |
| Typical use case | Growing datasets, high precision, RAG | Static datasets, fast build is priority |
Practical Decision Guide
- Use HNSW in most situations: better recall, faster queries, handles inserts gracefully. This is the correct choice for 90% of RAG systems in production.
- Use IVFFlat when: you have billions of vectors and memory for HNSW is insufficient, or when you need to frequently rebuild the index on completely replaced data.
- Use brute force (no index) when: you have fewer than 50K vectors, or you require 100% guaranteed recall (e.g., legal search, compliance, deduplication).
Optimizing Queries with Hybrid Filters
One of the most common production challenges is combining metadata filters with vector search. pgvector 0.7+ handles this with iterative scan, but it is important to understand how to structure queries correctly to avoid the post-filtering problem where you lose results.
The Post-Filtering Problem and Its Solution
-- WRONG: post-filtering - ANN finds the global top-k, then you filter
-- the returned rows afterwards (e.g. in application code):
SELECT id, content, embedding <=> query_vec AS dist
FROM documents
ORDER BY embedding <=> query_vec
LIMIT 10;
-- The problem: if none of these 10 results satisfy source_type = 'pdf',
-- filtering afterwards leaves you with 0 PDF results instead of 10!
-- CORRECT: pre-filtering in pgvector with iterative scan
-- pgvector 0.7+ supports indexed scan with WHERE filters
SELECT id, content, embedding <=> query_vec AS dist
FROM documents
WHERE source_type = 'pdf' -- pre-filter: applied BEFORE ANN search
AND language = 'en'
ORDER BY embedding <=> query_vec -- ANN on the filtered subset
LIMIT 10;
-- EXPLAIN verifies that the index is being used with the filter:
EXPLAIN (ANALYZE, FORMAT TEXT)
SELECT id, content, embedding <=> '[...]'::vector AS dist
FROM documents
WHERE source_type = 'pdf'
ORDER BY embedding <=> '[...]'::vector
LIMIT 10;
-- Should show: "Index Scan using ... with Filter: (source_type = 'pdf')"
-- If you see Seq Scan, the filter may reduce the dataset too aggressively
-- In that case, a partial index is the optimal solution
Partial Indexes for Frequent Filters
-- If you always filter by source_type, create a dedicated partial index
-- Benefits: much smaller and faster than the global index
CREATE INDEX documents_hnsw_pdf
ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64)
WHERE source_type = 'pdf'; -- only PDF documents
CREATE INDEX documents_hnsw_recent
ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64)
WHERE created_at > NOW() - INTERVAL '30 days'; -- only recent documents
-- Queries automatically leverage partial indexes:
SELECT id, content, embedding <=> query_vec AS dist
FROM documents
WHERE source_type = 'pdf' -- automatically activates documents_hnsw_pdf
ORDER BY embedding <=> query_vec
LIMIT 10;
-- Size comparison: partial index vs full index
SELECT
    indexrelname,
    pg_size_pretty(pg_relation_size(indexrelid)) AS size,
    ROUND(pg_relation_size(indexrelid)::numeric /
          (SELECT pg_relation_size(indexrelid) FROM pg_stat_user_indexes
           WHERE indexrelname = 'documents_hnsw_total') * 100, 1) AS pct_of_full
FROM pg_stat_user_indexes
WHERE relname = 'documents';
-- documents_hnsw_total: 2.1 GB (100%)
-- documents_hnsw_pdf: 487 MB (23% - only PDFs)
Similarity Threshold Queries
-- Filter by minimum similarity: prevents irrelevant results
-- cosine distance < 0.4 means cosine similarity > 0.6
SELECT
id,
source_path,
content,
1 - (embedding <=> query_vec::vector) AS similarity,
embedding <=> query_vec::vector AS distance
FROM documents
WHERE
embedding <=> query_vec::vector < 0.4 -- max distance (= min 60% similarity)
ORDER BY embedding <=> query_vec::vector
LIMIT 10;
-- Distance-to-similarity conversion for cosine:
-- cosine_distance = 1 - cosine_similarity
-- distance 0.0 = similarity 1.0 (identical vectors)
-- distance 0.2 = similarity 0.8 (very similar)
-- distance 0.4 = similarity 0.6 (sufficiently similar for RAG)
-- distance 0.7 = similarity 0.3 (probably not relevant)
-- distance 1.0 = similarity 0.0 (unrelated)
-- distance 2.0 = similarity -1.0 (opposite)
-- Recommended thresholds by use case:
-- Technical document RAG: < 0.35 (similarity > 0.65)
-- FAQ answering: < 0.30 (similarity > 0.70)
-- Product recommendation: < 0.50 (similarity > 0.50)
-- Duplicate detection: < 0.10 (similarity > 0.90)
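Since cosine distance is simply 1 minus cosine similarity, the thresholds above reduce to a tiny helper worth centralizing in application code (the use-case names and the THRESHOLDS map are my own, taken from the list above):

```python
# Recommended minimum cosine similarities per use case (from the list above)
THRESHOLDS = {
    "technical_rag": 0.65,
    "faq": 0.70,
    "recommendation": 0.50,
    "dedup": 0.90,
}

def max_distance_for(use_case: str) -> float:
    """Convert a minimum cosine similarity into the maximum <=> distance
    to use in the WHERE clause: distance = 1 - similarity."""
    return 1.0 - THRESHOLDS[use_case]
```

Keeping the conversion in one place avoids the classic bug of comparing a similarity threshold against a distance column (or vice versa) in the SQL.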
Semantic Search vs Similarity Search
These terms are often used interchangeably, but there is an important distinction that influences system architecture decisions:
| Type | Goal | Example | Metric |
|---|---|---|---|
| Similarity Search | Find vectors close to a given vector | Images similar to this image | L2 distance, cosine distance |
| Semantic Search | Find documents with similar meaning to a text query | "How do I install PostgreSQL?" finds guides without that exact phrase | Cosine similarity on text embeddings |
| Keyword Search | Exact keyword matching | "PostgreSQL 16" finds only documents containing that string | TF-IDF, BM25 |
| Hybrid Search | Combines semantic and keyword signals | Balances semantic relevance with exact match for technical terms | RRF, weighted sum |
Complete Semantic Search Implementation
import psycopg2
from openai import OpenAI
client = OpenAI()
def semantic_search(
conn,
query: str,
top_k: int = 10,
source_type: str = None,
min_similarity: float = 0.6,
include_metadata: bool = True
) -> list[dict]:
"""
Semantic search with optional filters and quality threshold.
Args:
conn: psycopg2 connection to PostgreSQL
query: The question or text to search for
top_k: Number of results to return
source_type: Filter by document type ('pdf', 'md', 'html')
min_similarity: Minimum cosine similarity threshold (0.0-1.0)
include_metadata: Whether to include the JSONB metadata field
Returns:
List of dicts with id, source_path, content, similarity, metadata
"""
# 1. Generate query embedding
resp = client.embeddings.create(
input=[query.replace("\n", " ")],
model="text-embedding-3-small"
)
query_vec = resp.data[0].embedding
max_distance = 1 - min_similarity # convert similarity -> distance
# 2. Build dynamic SQL query with optional filters
params = [query_vec, max_distance]
filter_clauses = ["embedding <=> %s::vector < %s"]
if source_type:
filter_clauses.append("source_type = %s")
params.append(source_type)
where = " AND ".join(filter_clauses)
metadata_col = "metadata" if include_metadata else "NULL::jsonb"
sql = f"""
SELECT
id,
source_path,
source_type,
chunk_index,
title,
content,
1 - (embedding <=> %s::vector) AS similarity,
{metadata_col} AS metadata
FROM documents
WHERE {where}
ORDER BY embedding <=> %s::vector
LIMIT %s
"""
params_final = [query_vec] + params + [query_vec, top_k]
with conn.cursor() as cur:
cur.execute(sql, params_final)
rows = cur.fetchall()
return [
{
"id": r[0],
"source_path": r[1],
"source_type": r[2],
"chunk_index": r[3],
"title": r[4],
"content": r[5],
"similarity": round(r[6], 4),
"metadata": r[7]
}
for r in rows
]
# Usage example
conn = psycopg2.connect("postgresql://postgres:pass@localhost/vectordb")
results = semantic_search(
conn,
query="How to optimize PostgreSQL query performance for large datasets",
top_k=5,
source_type="pdf",
min_similarity=0.65
)
for r in results:
print(f"[{r['similarity']:.3f}] {r['title']} - {r['content'][:100]}...")
Hybrid Search: Combining Vector and Full-Text
One of PostgreSQL's great strengths for RAG is the ability to combine semantic search (vector) with classical full-text search (BM25-like) in a single query. This is particularly useful for queries that contain precise technical terms like proper names, codes, version numbers, or product identifiers that might not have strong semantic neighbors.
-- Pure SQL Hybrid Search: vector + full-text with Reciprocal Rank Fusion (RRF)
-- RRF Score = sum(1/(k + rank)) for each result list
-- k=60 is the standard RRF constant from the original paper
WITH vector_results AS (
-- Semantic search: top 20 by vector similarity
SELECT
id, content, source_path,
ROW_NUMBER() OVER (ORDER BY embedding <=> $1::vector) AS v_rank,
1 - (embedding <=> $1::vector) AS vector_score
FROM documents
ORDER BY embedding <=> $1::vector
LIMIT 20
),
fts_results AS (
-- Full-text search: top 20 by ts_rank relevance
SELECT
id, content, source_path,
ROW_NUMBER() OVER (
ORDER BY ts_rank(to_tsvector('english', content),
plainto_tsquery('english', $2)) DESC
) AS f_rank,
ts_rank(to_tsvector('english', content),
plainto_tsquery('english', $2)) AS fts_score
FROM documents
WHERE to_tsvector('english', content) @@ plainto_tsquery('english', $2)
LIMIT 20
),
rrf_combined AS (
-- RRF fusion: combines both rankings
SELECT
COALESCE(v.id, f.id) AS id,
COALESCE(v.content, f.content) AS content,
COALESCE(v.source_path, f.source_path) AS source_path,
-- Weight: 70% vector + 30% full-text (tune for your use case)
COALESCE(0.7 / (60.0 + v.v_rank), 0) +
COALESCE(0.3 / (60.0 + f.f_rank), 0) AS rrf_score,
v.vector_score,
f.fts_score
FROM vector_results v
FULL OUTER JOIN fts_results f ON v.id = f.id
)
SELECT id, content, source_path, ROUND(rrf_score::numeric, 6) AS score
FROM rrf_combined
ORDER BY rrf_score DESC
LIMIT 5;
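When the vector and full-text result lists come from separate queries or services, the same fusion is often done in application code instead of SQL. A minimal RRF sketch using the weights above (0.7 vector, 0.3 full-text, k = 60):

```python
def rrf_fuse(vector_ids, fts_ids, k=60, w_vec=0.7, w_fts=0.3, top_n=5):
    """Reciprocal Rank Fusion: score = sum(weight / (k + rank)) per list.

    vector_ids / fts_ids are document ids ordered best-first (ranks start at 1);
    documents appearing in both lists accumulate both contributions.
    """
    scores = {}
    for rank, doc in enumerate(vector_ids, start=1):
        scores[doc] = scores.get(doc, 0.0) + w_vec / (k + rank)
    for rank, doc in enumerate(fts_ids, start=1):
        scores[doc] = scores.get(doc, 0.0) + w_fts / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

# A doc ranked well in both lists outscores a doc that tops only one list:
fused = rrf_fuse(vector_ids=[1, 2, 3, 4], fts_ids=[9, 2, 3, 1], top_n=3)
```

Because RRF only uses ranks, not raw scores, it sidesteps the problem that cosine similarity and ts_rank live on incomparable scales.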
Benchmarking: Measuring Search Quality
import time
import psycopg2
from statistics import mean, quantiles
def benchmark_vector_search(conn, query_vectors: list, config: dict) -> dict:
"""
Run a latency benchmark over a set of vector queries.
Args:
conn: PostgreSQL connection
query_vectors: List of query vectors (1536 dimensions)
config: Dict with ef_search or probes to set
Returns:
Dict with latency metrics (p50, p95, p99, mean, min, max)
"""
# Set ANN parameters for this benchmark run
with conn.cursor() as cur:
if "ef_search" in config:
cur.execute(f"SET hnsw.ef_search = {config['ef_search']}")
if "probes" in config:
cur.execute(f"SET ivfflat.probes = {config['probes']}")
latencies = []
for query_vec in query_vectors:
start = time.perf_counter()
with conn.cursor() as cur:
cur.execute("""
SELECT id, embedding <=> %s::vector AS dist
FROM documents
ORDER BY embedding <=> %s::vector
LIMIT 10
""", (query_vec, query_vec))
cur.fetchall()
elapsed_ms = (time.perf_counter() - start) * 1000
latencies.append(elapsed_ms)
p_values = quantiles(latencies, n=100)
return {
"config": config,
"n_queries": len(query_vectors),
"p50_ms": round(p_values[49], 2),
"p95_ms": round(p_values[94], 2),
"p99_ms": round(p_values[98], 2),
"mean_ms": round(mean(latencies), 2),
"min_ms": round(min(latencies), 2),
"max_ms": round(max(latencies), 2)
}
# Compare different HNSW configurations
configs = [
{"name": "HNSW ef=20 (fast)", "ef_search": 20},
{"name": "HNSW ef=40 (default)", "ef_search": 40},
{"name": "HNSW ef=100 (precise)", "ef_search": 100},
]
# Use real test queries (embeddings generated with the same model as documents)
test_queries = [...] # list of 1536-dim vectors
print(f"Benchmark with {len(test_queries)} test queries:")
for cfg in configs:
result = benchmark_vector_search(conn, test_queries[:100], cfg)
print(f"{cfg['name']}:")
print(f" p50={result['p50_ms']}ms, p95={result['p95_ms']}ms, p99={result['p99_ms']}ms")
Recall Measurement: Verifying ANN Quality
def measure_recall_at_k(conn, query_vectors: list, k: int = 10) -> float:
"""
Measure Recall@K by comparing ANN results with brute force ground truth.
Recall@K = (correct ANN results) / k
A Recall@10 of 0.95 means the ANN finds
9.5 of the 10 exact results on average.
Note: use a sample of 50-100 queries for statistical stability.
Run this periodically in production to detect index degradation.
"""
total_recall = 0.0
for query_vec in query_vectors:
# Exact results (brute force) - ground truth
with conn.cursor() as cur:
cur.execute("SET enable_indexscan = off")
cur.execute("""
SELECT id FROM documents
ORDER BY embedding <=> %s::vector
LIMIT %s
""", (query_vec, k))
exact_ids = set(r[0] for r in cur.fetchall())
cur.execute("SET enable_indexscan = on")
# ANN results (with HNSW/IVFFlat index)
with conn.cursor() as cur:
cur.execute("""
SELECT id FROM documents
ORDER BY embedding <=> %s::vector
LIMIT %s
""", (query_vec, k))
ann_ids = set(r[0] for r in cur.fetchall())
# Recall = overlap / k
overlap = len(exact_ids & ann_ids)
total_recall += overlap / k
avg_recall = total_recall / len(query_vectors)
print(f"Recall@{k}: {avg_recall:.4f} ({avg_recall*100:.1f}%)")
return avg_recall
# Recall targets by use case:
# Recall@10 > 0.90 for general RAG applications
# Recall@10 > 0.95 for enterprise RAG (medical, legal)
# Recall@10 > 0.85 acceptable for autocomplete/recommendation
# With HNSW m=16, ef_construction=64, ef_search=40: typically 0.92-0.95
# With HNSW m=16, ef_construction=64, ef_search=100: typically 0.97-0.99
Advanced Query Optimization: Planning and Diagnostics
-- 1. Updated statistics: essential for good query plans
ANALYZE documents;
-- 2. Check index size and usage statistics
SELECT
    indexrelname,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
    idx_scan AS times_used,
    idx_tup_read AS tuples_read,
    idx_tup_fetch AS tuples_fetched,
    ROUND(idx_tup_fetch::numeric / NULLIF(idx_scan, 0), 1) AS avg_tuples_per_scan
FROM pg_stat_user_indexes
WHERE relname = 'documents'
ORDER BY idx_scan DESC;
-- 3. Memory parameters for optimal performance
SET maintenance_work_mem = '512MB'; -- for HNSW index BUILD (temporary)
SET work_mem = '64MB'; -- for query sort operations
-- 4. Parallel query for large datasets
SET max_parallel_workers_per_gather = 4; -- use 4 workers per query
-- 5. Verify shared_buffers is adequate
-- The HNSW index should fit in memory for optimal performance
SHOW shared_buffers;
-- Rule of thumb: shared_buffers = 25% of total RAM
-- pg_relation_size(HNSW index) should be much less than shared_buffers
-- 6. Monitor cache hit ratio for the table and index
SELECT
relname,
heap_blks_read AS disk_reads,
heap_blks_hit AS cache_hits,
ROUND(heap_blks_hit::numeric / NULLIF(heap_blks_read + heap_blks_hit, 0) * 100, 2)
AS cache_hit_pct
FROM pg_statio_user_tables
WHERE relname = 'documents';
-- Target: cache_hit_pct > 95% for optimal performance
-- 7. Identify unused indexes (candidates for DROP):
SELECT indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE relname = 'documents'
  AND idx_scan = 0
  AND indexrelname NOT LIKE '%pkey%'; -- exclude primary keys
-- 8. Index cache hit ratio (how much of the index is served from cache)
SELECT
indexrelname,
idx_blks_read AS index_disk_reads,
idx_blks_hit AS index_cache_hits,
ROUND(idx_blks_hit::numeric / NULLIF(idx_blks_read + idx_blks_hit, 0) * 100, 2)
AS index_cache_pct
FROM pg_statio_user_indexes
WHERE indexrelname LIKE '%hnsw%';
Maximal Marginal Relevance (MMR)
A common problem in similarity search is result redundancy: the top-k chunks may all be very similar to each other, covering the same information. This degrades RAG quality because the model receives repeated information in its context window. MMR balances relevance and diversity by progressively selecting the next chunk that is relevant to the query but different from the already selected ones.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
def maximal_marginal_relevance(
query_vec: list[float],
candidate_vecs: list[list[float]],
candidate_data: list[dict],
k: int = 5,
lambda_param: float = 0.5
) -> list[dict]:
"""
MMR: select k results balancing relevance and diversity.
Algorithm:
1. Always select the most relevant candidate to the query first
2. For each subsequent candidate, use the MMR score which penalizes
redundancy with respect to already selected results
lambda_param:
1.0 = relevance only (equivalent to standard top-k)
0.0 = diversity only (maximum variety)
0.5 = optimal balance for RAG (default)
0.6 = slightly more relevance than diversity
Returns:
List of k diversified yet relevant candidates
"""
query_array = np.array(query_vec).reshape(1, -1)
candidate_array = np.array(candidate_vecs)
# Query-to-candidate similarity (relevance scores)
query_sims = cosine_similarity(query_array, candidate_array)[0]
selected = []
selected_indices = []
remaining_indices = list(range(len(candidate_vecs)))
while len(selected) < k and remaining_indices:
if not selected:
# First iteration: choose the most relevant
best_idx = int(np.argmax(query_sims))
else:
# Subsequent iterations: MMR score
selected_array = candidate_array[selected_indices]
mmr_scores = []
for i in remaining_indices:
relevance = query_sims[i]
# Maximum similarity with already selected (redundancy)
max_redundancy = float(np.max(
cosine_similarity(candidate_array[i:i+1], selected_array)[0]
))
# MMR balances relevance and redundancy
mmr = lambda_param * relevance - (1 - lambda_param) * max_redundancy
mmr_scores.append((i, mmr))
best_idx = max(mmr_scores, key=lambda x: x[1])[0]
selected.append(candidate_data[best_idx])
selected_indices.append(best_idx)
remaining_indices.remove(best_idx)
return selected
# Usage pattern with PostgreSQL results
# (searcher and query_embedding below are placeholders for your retrieval layer):
# First retrieve more candidates (top-20), then apply MMR for top-5 diversified
chunks_with_vecs = searcher.vector_search_with_embeddings(query, top_k=20)
embeddings = [c['embedding'] for c in chunks_with_vecs]
data = [{'id': c['id'], 'content': c['content']} for c in chunks_with_vecs]
diverse_chunks = maximal_marginal_relevance(
query_vec=query_embedding,
candidate_vecs=embeddings,
candidate_data=data,
k=5,
lambda_param=0.6 # slightly oriented toward relevance
)
# Result: 5 chunks that are relevant AND have diverse content
Common Mistakes in Similarity Search
- ANN not being used: If EXPLAIN shows a Seq Scan, check that the LIMIT is reasonable relative to the table size, that the index exists (\d documents), and that statistics are current (ANALYZE).
- ef_search too low: With ef_search=10, you might get Recall@10 of only 70-75%. For production, use at least 40; prefer 60-100 for RAG applications.
- Post-filtering that empties results: If you apply WHERE filters after ORDER BY (without pre-filtering), the ANN returns global top-k and then filters, potentially returning 0 results. Always use pre-filtering.
- Wrong distance operator: Using <-> (L2) instead of <=> (cosine) for text embeddings dramatically reduces search quality. Verify that the operator type in the index matches the operator in the query.
- No similarity threshold: Returning all top-k results even when distance is very high leads to irrelevant answers in RAG responses.
- Ignoring redundancy: The top-5 chunks might all cover the same paragraph of a document. Use MMR for diverse context in RAG responses.
Conclusions and Next Steps
Similarity search in PostgreSQL has many tuning parameters, but the fundamental trade-off is straightforward: the more candidates you examine, the higher the recall and the higher the latency. For most production RAG systems, HNSW with ef_search=60-100 offers an excellent balance: 10-50ms latency with Recall@10 of 92-97%.
The key takeaway for 2026 is that PostgreSQL with pgvector can handle similarity search at production scale without a separate vector database. The combination of HNSW indexing, hybrid search with RRF, and MMR for diversity gives you a search system competitive with specialized vector databases like Pinecone or Qdrant, while keeping your entire data infrastructure in a single, battle-tested system.
The next article deepens advanced indexing strategies: how to choose optimal HNSW and IVFFlat parameters for your specific use case, how to monitor index degradation over time, and techniques for incremental updates without performance loss. Includes: complete PostgreSQL configuration, parallel index build, and how to schedule REINDEX without downtime.
Advanced Query Patterns for Production RAG
Beyond the fundamentals, production RAG systems often need more sophisticated query patterns. Here are three patterns that address real production challenges:
Pattern 1: Two-Phase Retrieval (Coarse-to-Fine)
-- Two-phase retrieval: fast coarse search followed by re-ranking
-- Phase 1: fast ANN search with a modest ef_search to gather candidates quickly
-- Phase 2: re-rank the small candidate set with additional relevance signals
-- Phase 1: coarse retrieval (fast, lower recall)
SET hnsw.ef_search = 50; -- must be >= the candidate LIMIT: HNSW returns at most ef_search rows
WITH candidates AS (
SELECT id, content, source_path, chunk_index, metadata,
embedding <=> $1::vector AS coarse_distance
FROM documents
WHERE source_type = $2 -- pre-filter for relevant document type
ORDER BY embedding <=> $1::vector
LIMIT 50 -- retrieve 50 candidates (10x the final top-k)
),
-- Phase 2: exact re-ranking on candidates only
reranked AS (
SELECT
c.*,
-- Combine vector similarity with metadata relevance signals
c.coarse_distance * 0.8 +
(CASE
WHEN c.metadata->>'quality_score' IS NOT NULL
THEN (1.0 - (c.metadata->>'quality_score')::float) * 0.2
ELSE 0
END) AS final_score,
-- Full-text re-rank signal on small candidate set
ts_rank(to_tsvector('english', c.content),
plainto_tsquery('english', $3)) AS fts_boost
FROM candidates c
)
SELECT id, content, source_path, chunk_index, metadata,
final_score - (fts_boost * 0.1) AS combined_score
FROM reranked
ORDER BY combined_score ASC
LIMIT 5;
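To make the Phase-2 arithmetic concrete, here is the same scoring formula in plain Python with mock candidates (the 0.8/0.2/0.1 weights mirror the SQL above; the candidate values are illustrative):

```python
# Re-ranking score from Pattern 1, in plain Python. Lower is better.
# quality_score is assumed to be in [0, 1], higher = better quality.
def rerank_score(coarse_distance, quality_score=None, fts_boost=0.0):
    penalty = (1.0 - quality_score) * 0.2 if quality_score is not None else 0.0
    final_score = coarse_distance * 0.8 + penalty
    return final_score - fts_boost * 0.1

candidates = [
    {"id": 1, "coarse_distance": 0.20, "quality_score": 0.9, "fts_boost": 0.30},
    {"id": 2, "coarse_distance": 0.18, "quality_score": None, "fts_boost": 0.00},
    {"id": 3, "coarse_distance": 0.25, "quality_score": 0.5, "fts_boost": 0.60},
]
ranked = sorted(candidates, key=lambda c: rerank_score(
    c["coarse_distance"], c["quality_score"], c["fts_boost"]))
print([c["id"] for c in ranked])  # → [2, 1, 3]
```

Note how id 2 wins despite having no quality or full-text signal: its coarse distance alone is low enough. Weighting is where most of the tuning effort goes in practice.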
Pattern 2: Time-Decayed Similarity Search
-- Time-decayed search: recent documents receive a boost
-- Useful for news, documentation, or frequently updated knowledge bases
SELECT
id,
content,
source_path,
created_at,
embedding <=> $1::vector AS vector_distance,
-- Exponential decay: documents older than 30 days receive diminishing boost
-- decay_factor: 1.0 for today, ~0.5 for 30 days ago, ~0.13 for 90 days ago
EXP(-0.023 * EXTRACT(EPOCH FROM (NOW() - created_at)) / 86400) AS time_decay,
-- Combined score: lower is better (distance minimized, decay maximized)
(embedding <=> $1::vector) * (2.0 - EXP(-0.023 * EXTRACT(EPOCH FROM (NOW() - created_at)) / 86400))
AS time_adjusted_score
FROM documents
WHERE created_at > NOW() - INTERVAL '180 days' -- limit to last 6 months
ORDER BY time_adjusted_score ASC
LIMIT 10;
-- Parameter explanation:
-- 0.023 = -ln(0.5)/30 = decay constant for half-life of 30 days
-- 86400 = seconds in a day (convert EPOCH to days)
-- The 2.0 makes the multiplier range from 1.0 (recent) to ~2.0 (very old),
-- so fresh documents get lower (better) scores
-- Note: ordering by this expression cannot use the HNSW index, so the query
-- scans every row matching the WHERE filter; keep the date filter selective
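The decay constants can be sanity-checked without a database; this pure-Python sketch reproduces the SQL's math:

```python
# Numeric check of the exponential decay used in the time-decayed query.
import math

HALF_LIFE_DAYS = 30
DECAY = math.log(2) / HALF_LIFE_DAYS  # ≈ 0.0231, rounded to 0.023 in the SQL

def time_decay(age_days: float) -> float:
    return math.exp(-DECAY * age_days)

def time_adjusted_score(distance: float, age_days: float) -> float:
    # Multiplier grows from 1.0 (today) toward 2.0 (very old)
    return distance * (2.0 - time_decay(age_days))

print(round(time_decay(0), 2))    # → 1.0
print(round(time_decay(30), 2))   # → 0.5   (half-life)
print(round(time_decay(90), 2))   # ≈ 0.12
# Same vector distance, different ages: the recent document scores better (lower)
print(time_adjusted_score(0.3, 0) < time_adjusted_score(0.3, 60))  # → True
```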
Pattern 3: Semantic Deduplication Before Indexing
-- Detect near-duplicate chunks before inserting into the knowledge base
-- Prevents redundant context in RAG retrieval
-- Find near-duplicate chunks in the existing knowledge base
-- (cosine distance < 0.05 = extremely similar content)
WITH new_chunk AS (
SELECT $1::vector AS embedding -- the new chunk's embedding
)
SELECT
d.id,
d.source_path,
d.content,
1 - (d.embedding <=> nc.embedding) AS similarity
FROM documents d, new_chunk nc
WHERE 1 - (d.embedding <=> nc.embedding) > 0.95 -- 95%+ similar = near-duplicate
ORDER BY similarity DESC
LIMIT 5;
-- If this returns results, skip insertion (duplicate content)
-- If empty, proceed with insertion
-- Python implementation:
def insert_if_not_duplicate(conn, content: str, embedding: list[float],
                            source_path: str, threshold: float = 0.95) -> bool:
    """
    Insert chunk only if no near-duplicate exists in the database.
    Returns True if inserted, False if skipped as duplicate.

    Passing a Python list as %s::vector relies on pgvector's array-to-vector
    cast; alternatively, call pgvector.psycopg2.register_vector(conn) once
    and pass embeddings as numpy arrays.
    """
    with conn.cursor() as cur:
        cur.execute("""
            SELECT COUNT(*) FROM documents
            WHERE 1 - (embedding <=> %s::vector) > %s
        """, (embedding, threshold))
        duplicate_count = cur.fetchone()[0]

    if duplicate_count > 0:
        print(f"Skipping duplicate chunk from {source_path}")
        return False

    # Insert the chunk
    with conn.cursor() as cur:
        cur.execute("""
            INSERT INTO documents (content, embedding, source_path)
            VALUES (%s, %s::vector, %s)
        """, (content, embedding, source_path))
    conn.commit()
    return True
Performance Summary: Choosing the Right Configuration
| Use Case | Algorithm | ef_search / probes | Expected Latency | Expected Recall@10 |
|---|---|---|---|---|
| Real-time autocomplete | HNSW | ef_search=20 | 2-5ms | ~85% |
| General RAG (production) | HNSW | ef_search=60 | 10-20ms | ~94% |
| Medical/Legal RAG | HNSW | ef_search=100 | 20-40ms | ~97% |
| Batch re-ranking | Brute force | N/A | 1-5s/batch | 100% |
| Large static dataset | IVFFlat | probes=30 | 15-30ms | ~92% |
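Recall@10 figures like those in the table are obtained by comparing ANN output against exact brute-force results for the same queries. A minimal sketch with mock ID lists (the IDs are illustrative; in practice you would run both query variants against your own data):

```python
# Recall@k: what fraction of the exact top-k did the ANN search also return?
def recall_at_k(ann_ids: list[int], exact_ids: list[int], k: int = 10) -> float:
    return len(set(ann_ids[:k]) & set(exact_ids[:k])) / k

exact = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]   # ground truth from brute-force search
ann   = [1, 2, 3, 5, 4, 7, 6, 9, 11, 12]  # ANN result: missed ids 8 and 10
print(recall_at_k(ann, exact))  # → 0.8
```

Averaging this over a few hundred representative queries gives a reliable recall estimate for a given ef_search or probes setting.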