Advanced Similarity Search: Algorithms and Optimization in PostgreSQL
When you have a database with millions of vectors and need to find the 10 most similar to a query in under 50 milliseconds, similarity search becomes a serious engineering challenge. Exact (brute-force) search is precise but does not scale: comparing every vector against every query has O(n*d) complexity where n is the number of vectors and d the dimensionality. With 10 million vectors at 1536 dimensions, each query would require 15 billion floating-point operations.
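To make that cost concrete, here is a minimal brute-force search in NumPy, with sizes scaled down from the example above (the random data and dimensions are illustrative only):

```python
import numpy as np

# Scaled-down illustration: 100K vectors at 128 dims (vs 10M at 1536 in the text)
rng = np.random.default_rng(42)
db = rng.standard_normal((100_000, 128)).astype(np.float32)
query = rng.standard_normal(128).astype(np.float32)

# Normalize so a plain dot product equals cosine similarity
db /= np.linalg.norm(db, axis=1, keepdims=True)
query /= np.linalg.norm(query)

# Brute force: one dot product per stored vector -> O(n * d) work per query
sims = db @ query                  # 100K x 128 = 12.8M multiply-adds
top10 = np.argsort(-sims)[:10]     # exact top-10, 100% recall by construction
```

Every query touches every row: doubling the dataset doubles the latency, which is exactly the scaling an ANN index avoids.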
The solution is ANN (Approximate Nearest Neighbor) algorithms: they trade the guarantee of finding the exact result for sub-linear query time, typically O(log n), with practical accuracy of 95-99%. PostgreSQL with pgvector implements the two most widely used ANN algorithms: HNSW and IVFFlat.
In this article we analyze how these algorithms work internally, when to use each one, how to optimize similarity search queries in PostgreSQL, how to combine vector search with metadata filters, and advanced techniques like MMR for diversified results.
Series Overview
| # | Article | Focus |
|---|---|---|
| 1 | pgvector | Installation, operators, indexing |
| 2 | Embeddings Deep Dive | Models, distances, generation |
| 3 | RAG with PostgreSQL | End-to-end RAG pipeline |
| 4 | You are here - Similarity Search | Algorithms and optimization |
| 5 | HNSW and IVFFlat Indexing | Advanced indexing strategies |
| 6 | RAG in Production | Scalability and performance |
What You Will Learn
- The difference between exact search and approximate nearest neighbor (ANN)
- How HNSW works internally: hierarchical navigable small world graphs
- How IVFFlat works: K-means clustering and probe search
- When to use HNSW vs IVFFlat vs brute force
- Query optimization: ef_search and probes parameters
- Hybrid filters: combining metadata filtering with vector search
- Distance operators: cosine, L2, and inner product compared
- Benchmarking and recall measurement with Python code
- Maximal Marginal Relevance (MMR) for diversified results
- Semantic search vs keyword search: when to use which
Exact Search vs Approximate Search
Brute Force (Exact Search)
Exact search compares the query against every single vector in the database. It guarantees 100% recall but has linear complexity. It is the right choice only for small datasets or when absolute precision is a non-negotiable requirement:
-- Exact search: no index used, full table scan
-- Useful for small datasets (< 100K vectors) or when precision is critical
SELECT id, content, embedding <=> query_vec AS distance
FROM documents
ORDER BY embedding <=> query_vec -- sequential scan over the entire dataset
LIMIT 10;
-- Force brute force even when an ANN index exists:
SET enable_indexscan = off;
SELECT id, content, embedding <=> query_vec AS distance
FROM documents
ORDER BY embedding <=> query_vec
LIMIT 10;
SET enable_indexscan = on; -- restore
-- Comparative benchmark (real dataset on NVMe SSD):
-- Dataset: 1M vectors, 1536 dim, PostgreSQL 16, 32GB RAM, 16 CPU
-- Brute force (seq scan): ~2000ms per query -- 2 seconds!
-- HNSW (ef_search=40): ~10ms per query -- 200x faster
-- HNSW (ef_search=100): ~25ms per query -- 80x faster
-- IVFFlat (probes=50): ~28ms per query -- 71x faster
-- 100K vector dataset (brute force acceptable):
-- Brute force: ~50ms per query -- acceptable
-- HNSW: ~3ms per query -- still better when available
ANN: The Recall/Performance Trade-off
-- Verify whether the query is using the ANN index
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT id, content, embedding <=> '[0.1, 0.2, ...]'::vector AS distance
FROM documents
ORDER BY embedding <=> '[0.1, 0.2, ...]'::vector
LIMIT 10;
-- Desired output (uses HNSW index):
-- Limit (actual rows=10)
--   -> Index Scan using documents_embedding_hnsw on documents
--        Order By: (embedding <=> '...'::vector)
-- Buffers: shared hit=347 (everything from cache = excellent!)
-- Execution Time: 8.3 ms
-- Undesired output (brute force):
-- Limit
--   -> Sort (Sort Key: embedding <=> '...'::vector) -- sorts after full scan
--        -> Seq Scan on documents
-- Buffers: shared hit=234 read=8921 (many disk reads = slow!)
-- Execution Time: 2341 ms
-- WHY PostgreSQL may not use the index:
-- 1. LIMIT too large (>10% of rows for small tables)
-- 2. Statistics are stale: run ANALYZE documents;
-- 3. Index does not exist: check with \d documents
-- 4. enable_indexscan accidentally set to off
Distance Operators in pgvector
pgvector supports three distance operators, each appropriate for different types of embeddings. Choosing the wrong operator can significantly reduce search quality and cause PostgreSQL to ignore your index entirely.
| Operator | Type | Range | When to Use | Index Support |
|---|---|---|---|---|
| `<=>` | Cosine distance | [0, 2] | Text embeddings (OpenAI, Sentence Transformers) - DEFAULT choice | HNSW, IVFFlat (vector_cosine_ops) |
| `<->` | Euclidean (L2) distance | [0, inf) | Vectors where magnitude matters (images, audio features) | HNSW, IVFFlat (vector_l2_ops) |
| `<#>` | Negative inner product | (-inf, 0] | Pre-normalized embeddings, maximum inner product search | HNSW, IVFFlat (vector_ip_ops) |
-- Cosine distance (most common for text embeddings):
SELECT id, content,
embedding <=> query_vec AS cosine_distance,
1 - (embedding <=> query_vec) AS cosine_similarity
FROM documents
ORDER BY embedding <=> query_vec
LIMIT 5;
-- L2 distance (Euclidean, for images/audio):
SELECT id, content,
embedding <-> query_vec AS l2_distance
FROM documents
ORDER BY embedding <-> query_vec
LIMIT 5;
-- Inner product (for normalized embeddings):
SELECT id, content,
(embedding <#> query_vec) * -1 AS inner_product -- negated so higher = more similar
FROM documents
ORDER BY embedding <#> query_vec -- ORDER BY works because <#> is negative
LIMIT 5;
-- Create index with the CORRECT operator (MUST match the query operator!):
-- For cosine distance (default for text):
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);
-- For L2 distance:
CREATE INDEX ON documents USING hnsw (embedding vector_l2_ops);
-- For inner product:
CREATE INDEX ON documents USING hnsw (embedding vector_ip_ops);
-- COMMON MISTAKE: index with vector_cosine_ops but query uses <->
-- Result: PostgreSQL ignores the index and falls back to brute force!
-- Always verify with EXPLAIN ANALYZE that the index is being used.
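On unit-length vectors the three operators produce the same ranking, which is why pre-normalized text embeddings work with any of them. A quick NumPy check of the identities behind this (cosine distance = 1 - dot product, and squared L2 distance = 2 × cosine distance):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.standard_normal(1536), rng.standard_normal(1536)
a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)  # unit length, like OpenAI embeddings

dot = float(a @ b)
cosine_distance = 1.0 - dot                  # what <=> computes
l2_distance = float(np.linalg.norm(a - b))   # what <-> computes
neg_inner = -dot                             # what <#> computes

# For unit vectors: ||a - b||^2 = 2 - 2*(a . b) = 2 * cosine_distance,
# so sorting by <=>, <-> or <#> gives the same order on normalized data
assert abs(l2_distance ** 2 - 2 * cosine_distance) < 1e-9
```

The practical consequence: for normalized embeddings the operator choice affects index compatibility and score interpretation, not the ranking itself.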
HNSW: Hierarchical Navigable Small World
HNSW is the most popular ANN algorithm in 2025-2026. It is based on the theory of small world graphs: structures where every node is reachable from any other in a small number of hops (the "six degrees of separation" phenomenon). The hierarchical structure allows rapid navigation toward the most relevant region of vector space, using upper layers as "highways" and the base layer for precise search.
How HNSW Works Internally
HNSW builds a hierarchical multi-layer structure where each layer is a subgraph of the layer below:
- Layer 0 (base): All vectors, connected to their nearest neighbors. Dense graph with the highest connectivity.
- Upper layers (1, 2, ...): Progressively smaller subsets of vectors. Sparse graphs for fast long-range navigation.
- Entry point: Search always starts at the highest layer (fewest vectors) and greedily descends toward the target.
-- HNSW conceptual structure:
--
-- Layer 2: o-----------o-----------o (few nodes, long jumps - fast navigation)
-- \ | /
-- Layer 1: o---o---o---o---o---o---o (intermediate navigation)
-- \ | | /
-- Layer 0: o-o-o-o-o-o-o-o-o-o-o-o (all vectors, precise search)
--
-- Search algorithm for query Q:
-- 1. Start from the highest layer with a fixed entry point
-- 2. Greedy descent: move toward the node closest to Q at each layer
-- 3. Use that node as the entry point for the layer below
-- 4. Repeat until layer 0
-- 5. At layer 0: beam search with ef_search candidates
-- 6. Return top-k from the explored candidates
-- Theoretical complexity:
-- Build: O(n * log(n)) -- sub-linear in number of vectors
-- Query: O(log(n)) -- logarithmic! vs O(n) for brute force
-- Creating an HNSW index (optimal values for most cases):
CREATE INDEX documents_hnsw_idx
ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (
m = 16, -- connections per node (default 16, range 4-64)
ef_construction = 64 -- candidates during build (default 64, range 16-200)
);
-- Set ef_search at query time (no index rebuild required):
SET hnsw.ef_search = 60; -- recommended value for production RAG
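Steps 1-6 of the search can be sketched in Python for a single layer: a minimal best-first beam search over a precomputed neighbor graph. This is a simplification of what pgvector does at layer 0; the real index maintains multiple layers, picks the entry point by greedy descent, and stores the graph on disk pages.

```python
import heapq
import numpy as np

def beam_search_layer(query, vectors, neighbors, entry, ef):
    """Best-first beam search over one graph layer, keeping the ef best candidates.

    vectors:   (n, d) array; neighbors: dict node -> list of adjacent nodes;
    entry:     starting node (in real HNSW, found by descending the upper layers).
    """
    dist = lambda i: float(np.linalg.norm(vectors[i] - query))
    visited = {entry}
    candidates = [(dist(entry), entry)]   # min-heap: closest unexplored first
    best = [(-dist(entry), entry)]        # max-heap (negated) of the ef best so far
    while candidates:
        d, node = heapq.heappop(candidates)
        if d > -best[0][0]:               # nothing left closer than the worst kept
            break
        for nb in neighbors[node]:
            if nb in visited:
                continue
            visited.add(nb)
            d_nb = dist(nb)
            if len(best) < ef or d_nb < -best[0][0]:
                heapq.heappush(candidates, (d_nb, nb))
                heapq.heappush(best, (-d_nb, nb))
                if len(best) > ef:
                    heapq.heappop(best)   # drop the worst of the beam
    return sorted((-d, i) for d, i in best)  # (distance, node), closest first

# Toy graph: connect each node to its 8 nearest neighbors
# (roughly what ef_construction decides at build time)
rng = np.random.default_rng(1)
vecs = rng.standard_normal((200, 8))
d2 = ((vecs[:, None, :] - vecs[None, :, :]) ** 2).sum(-1)
nbrs = {i: list(np.argsort(d2[i])[1:9]) for i in range(200)}
q = rng.standard_normal(8)
top = beam_search_layer(q, vecs, nbrs, entry=0, ef=10)
```

A larger `ef` widens the beam: more candidates are kept alive, so the search is less likely to get stuck in a local neighborhood, at the cost of more distance computations. That is exactly the ef_search trade-off in the table below.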
HNSW Parameters Reference Table
| Parameter | Description | Default | Range | Effect of Increasing |
|---|---|---|---|---|
| `m` | Max connections per node per layer | 16 | 4-64 | Better recall, more memory, slower build |
| `ef_construction` | Candidates considered during index build | 64 | 16-200 | Better potential recall, much slower build |
| `ef_search` | Candidates considered during query (runtime) | 40 | 1-1000 | Better recall, slower query |
-- Configuring ef_search for queries (runtime parameter, no rebuild)
SET hnsw.ef_search = 40; -- default, good balance
-- High precision (enterprise RAG, medical, legal applications):
SET hnsw.ef_search = 100;
-- High speed (autocomplete, recommendations, low-stakes search):
SET hnsw.ef_search = 20;
-- Benchmark reference (1M vectors, 1536 dim, m=16, ef_construction=64):
-- ef_search=20: ~5ms/query, recall@10 ~85%
-- ef_search=40: ~10ms/query, recall@10 ~92%
-- ef_search=100: ~25ms/query, recall@10 ~97%
-- ef_search=200: ~50ms/query, recall@10 ~99%
-- Set for current session (does not persist across sessions):
SELECT set_config('hnsw.ef_search', '100', false);
-- Key insight: ef_search is the most powerful tuning lever
-- because it can be changed per-query without any index modification.
-- Use a higher value for critical searches, lower for bulk operations.
IVFFlat: Inverted File with Flat Compression
IVFFlat uses a different approach: it divides the vector space into clusters (cells) using K-means, and during a query searches only the most promising clusters. It is simpler than HNSW but requires existing data for K-means training and degrades more quickly with incremental inserts.
How IVFFlat Works
- Training phase: K-means clustering divides the vectors into `lists` clusters, one centroid per cluster
- Build phase: Each vector is assigned to the nearest cluster (centroid)
- Query time: Finds the `probes` nearest centroids to the query, then searches exactly within those clusters
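The three phases can be sketched in NumPy (a toy K-means over random data; pgvector runs its own sampled K-means during CREATE INDEX, and the distances here are L2 for simplicity):

```python
import numpy as np

rng = np.random.default_rng(7)
vectors = rng.standard_normal((5_000, 64)).astype(np.float32)

# Training: a few K-means iterations place `lists` centroids
lists = 50
centroids = vectors[rng.choice(len(vectors), lists, replace=False)]
for _ in range(5):
    assign = np.argmin(((vectors[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
    for c in range(lists):
        members = vectors[assign == c]
        if len(members):
            centroids[c] = members.mean(axis=0)

# Build: one inverted list of vector ids per centroid
inverted = {c: np.where(assign == c)[0] for c in range(lists)}

def ivf_search(query, probes=5, k=10):
    # Query: rank centroids, then scan only the `probes` nearest cells exactly
    order = np.argsort(((centroids - query) ** 2).sum(-1))[:probes]
    ids = np.concatenate([inverted[c] for c in order])
    dists = np.linalg.norm(vectors[ids] - query, axis=1)
    top = np.argsort(dists)[:k]
    return ids[top], dists[top]

q = rng.standard_normal(64).astype(np.float32)
ids, dists = ivf_search(q, probes=5)  # scans ~10% of the data, not all of it
```

With `probes` equal to `lists` the search degenerates to brute force and recall is 100%; with few probes you only pay for the cells nearest the query, which is where the recall loss comes from when the true neighbor sits just across a cluster boundary.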
-- IVFFlat requires data before creating the index
-- (it needs to perform K-means clustering first)
-- PREREQUISITE: the table must have at least a few hundred rows
-- Rule of thumb: lists = sqrt(n_rows), with a minimum of 100
-- Creating an IVFFlat index
CREATE INDEX documents_ivfflat_idx
ON documents
USING ivfflat (embedding vector_cosine_ops)
WITH (
lists = 100 -- number of clusters (default 100, recommended: sqrt(n_rows))
);
-- For 1M rows: lists = sqrt(1000000) = 1000
CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops) WITH (lists = 1000);
-- Configure probes (how many clusters to search at query time)
SET ivfflat.probes = 10; -- searches 10 clusters out of 1000 (pgvector's default is 1)
-- High precision mode:
SET ivfflat.probes = 50; -- searches 50/1000 clusters = 5% of the space
-- Benchmark reference (1M vectors, lists=1000):
-- probes=5: ~3ms/query, recall@10 ~72%
-- probes=10: ~6ms/query, recall@10 ~82%
-- probes=30: ~18ms/query, recall@10 ~94%
-- probes=100: ~60ms/query, recall@10 ~99%
-- IVFFlat vs HNSW at equivalent recall ~92%:
-- IVFFlat (probes=50): ~28ms -- slower!
-- HNSW (ef_search=40): ~10ms -- faster
-- Conclusion: for high recall, HNSW is typically more efficient
HNSW vs IVFFlat: Choosing the Right Algorithm
| Characteristic | HNSW | IVFFlat |
|---|---|---|
| Query latency | Faster (often 2-5x at same recall) | Slower at high recall targets |
| Build time | Slower (complex graph construction) | Much faster (~4x) |
| Index memory | Larger (~2-4x) | Smaller |
| Recall at same latency | Better (generally) | Slightly lower |
| Incremental inserts | Excellent (no retraining needed) | Degrades: clusters do not readjust |
| Data required for build | None (can start empty) | Requires existing data (K-means) |
| Typical use case | Growing datasets, high precision, RAG | Static datasets, fast build is priority |
Practical Decision Guide
- Use HNSW in most situations: better recall, faster queries, handles inserts gracefully. This is the correct choice for 90% of RAG systems in production.
- Use IVFFlat when: you have billions of vectors and memory for HNSW is insufficient, or when you need to frequently rebuild the index on completely replaced data.
- Use brute force (no index) when: you have fewer than 50K vectors, or you require 100% guaranteed recall (e.g., legal search, compliance, deduplication).
Optimizing Queries with Hybrid Filters
One of the most common production challenges is combining metadata filters with vector search. pgvector 0.7+ handles this with iterative scan, but it is important to understand how to structure queries correctly to avoid the post-filtering problem where you lose results.
The Post-Filtering Problem and Its Solution
-- WRONG: post-filtering - ANN finds the global top-k, then you filter
-- the returned rows afterwards (e.g. in application code):
SELECT id, content, embedding <=> query_vec AS dist
FROM documents
ORDER BY embedding <=> query_vec
LIMIT 10;
-- The problem: if none of these 10 results satisfy source_type = 'pdf',
-- filtering afterwards leaves you with 0 PDF results instead of 10!
-- CORRECT: pre-filtering in pgvector with iterative scan
-- pgvector 0.7+ supports indexed scan with WHERE filters
SELECT id, content, embedding <=> query_vec AS dist
FROM documents
WHERE source_type = 'pdf' -- pre-filter: applied BEFORE ANN search
AND language = 'en'
ORDER BY embedding <=> query_vec -- ANN on the filtered subset
LIMIT 10;
-- EXPLAIN verifies that the index is being used with the filter:
EXPLAIN (ANALYZE, FORMAT TEXT)
SELECT id, content, embedding <=> '[...]'::vector AS dist
FROM documents
WHERE source_type = 'pdf'
ORDER BY embedding <=> '[...]'::vector
LIMIT 10;
-- Should show: "Index Scan using ... with Filter: (source_type = 'pdf')"
-- If you see Seq Scan, the filter may reduce the dataset too aggressively
-- In that case, a partial index is the optimal solution
Partial Indexes for Frequent Filters
-- If you always filter by source_type, create a dedicated partial index
-- Benefits: much smaller and faster than the global index
CREATE INDEX documents_hnsw_pdf
ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64)
WHERE source_type = 'pdf'; -- only PDF documents
CREATE INDEX documents_hnsw_recent
ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64)
WHERE created_at > NOW() - INTERVAL '30 days'; -- only recent documents
-- Queries automatically leverage partial indexes:
SELECT id, content, embedding <=> query_vec AS dist
FROM documents
WHERE source_type = 'pdf' -- automatically activates documents_hnsw_pdf
ORDER BY embedding <=> query_vec
LIMIT 10;
-- Size comparison: partial index vs full index
SELECT
    indexrelname,
    pg_size_pretty(pg_relation_size(indexrelid)) AS size,
    ROUND(pg_relation_size(indexrelid)::numeric /
          (SELECT pg_relation_size(indexrelid) FROM pg_stat_user_indexes
           WHERE indexrelname = 'documents_hnsw_total') * 100, 1) AS pct_of_full
FROM pg_stat_user_indexes
WHERE relname = 'documents';
-- documents_hnsw_total: 2.1 GB (100%)
-- documents_hnsw_pdf: 487 MB (23% - only PDFs)
Similarity Threshold Queries
-- Filter by minimum similarity: prevents irrelevant results
-- cosine distance < 0.4 means cosine similarity > 0.6
SELECT
id,
source_path,
content,
1 - (embedding <=> query_vec::vector) AS similarity,
embedding <=> query_vec::vector AS distance
FROM documents
WHERE
embedding <=> query_vec::vector < 0.4 -- max distance (= min 60% similarity)
ORDER BY embedding <=> query_vec::vector
LIMIT 10;
-- Distance-to-similarity conversion for cosine:
-- cosine_distance = 1 - cosine_similarity
-- distance 0.0 = similarity 1.0 (identical vectors)
-- distance 0.2 = similarity 0.8 (very similar)
-- distance 0.4 = similarity 0.6 (sufficiently similar for RAG)
-- distance 0.7 = similarity 0.3 (probably not relevant)
-- distance 1.0 = similarity 0.0 (unrelated)
-- distance 2.0 = similarity -1.0 (opposite)
-- Recommended thresholds by use case:
-- Technical document RAG: < 0.35 (similarity > 0.65)
-- FAQ answering: < 0.30 (similarity > 0.70)
-- Product recommendation: < 0.50 (similarity > 0.50)
-- Duplicate detection: < 0.10 (similarity > 0.90)
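Since cosine distance is simply 1 minus cosine similarity, the thresholds above reduce to a tiny helper worth centralizing in application code (the use-case names and the THRESHOLDS map are my own, taken from the list above):

```python
# Recommended minimum cosine similarities per use case (from the list above)
THRESHOLDS = {
    "technical_rag": 0.65,
    "faq": 0.70,
    "recommendation": 0.50,
    "dedup": 0.90,
}

def max_distance_for(use_case: str) -> float:
    """Convert a minimum cosine similarity into the maximum <=> distance
    to use in the WHERE clause: distance = 1 - similarity."""
    return 1.0 - THRESHOLDS[use_case]
```

Keeping the conversion in one place avoids the classic bug of comparing a similarity threshold against a distance column (or vice versa) in the SQL.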
Semantic Search vs Similarity Search
These terms are often used interchangeably, but there is an important distinction that influences system architecture decisions:
| Type | Goal | Example | Metric |
|---|---|---|---|
| Similarity Search | Find vectors close to a given vector | Images similar to this image | L2 distance, cosine distance |
| Semantic Search | Find documents with similar meaning to a text query | "How do I install PostgreSQL?" finds guides without that exact phrase | Cosine similarity on text embeddings |
| Keyword Search | Exact keyword matching | "PostgreSQL 16" finds only documents containing that string | TF-IDF, BM25 |
| Hybrid Search | Combines semantic and keyword signals | Balances semantic relevance with exact match for technical terms | RRF, weighted sum |
Complete Semantic Search Implementation
import psycopg2
from openai import OpenAI
client = OpenAI()
def semantic_search(
conn,
query: str,
top_k: int = 10,
source_type: str = None,
min_similarity: float = 0.6,
include_metadata: bool = True
) -> list[dict]:
"""
Semantic search with optional filters and quality threshold.
Args:
conn: psycopg2 connection to PostgreSQL
query: The question or text to search for
top_k: Number of results to return
source_type: Filter by document type ('pdf', 'md', 'html')
min_similarity: Minimum cosine similarity threshold (0.0-1.0)
include_metadata: Whether to include the JSONB metadata field
Returns:
List of dicts with id, source_path, content, similarity, metadata
"""
# 1. Generate query embedding
resp = client.embeddings.create(
input=[query.replace("\n", " ")],
model="text-embedding-3-small"
)
query_vec = resp.data[0].embedding
max_distance = 1 - min_similarity # convert similarity -> distance
# 2. Build dynamic SQL query with optional filters
params = [query_vec, max_distance]
filter_clauses = ["embedding <=> %s::vector < %s"]
if source_type:
filter_clauses.append("source_type = %s")
params.append(source_type)
where = " AND ".join(filter_clauses)
metadata_col = "metadata" if include_metadata else "NULL::jsonb"
sql = f"""
SELECT
id,
source_path,
source_type,
chunk_index,
title,
content,
1 - (embedding <=> %s::vector) AS similarity,
{metadata_col} AS metadata
FROM documents
WHERE {where}
ORDER BY embedding <=> %s::vector
LIMIT %s
"""
params_final = [query_vec] + params + [query_vec, top_k]
with conn.cursor() as cur:
cur.execute(sql, params_final)
rows = cur.fetchall()
return [
{
"id": r[0],
"source_path": r[1],
"source_type": r[2],
"chunk_index": r[3],
"title": r[4],
"content": r[5],
"similarity": round(r[6], 4),
"metadata": r[7]
}
for r in rows
]
# Usage example
conn = psycopg2.connect("postgresql://postgres:pass@localhost/vectordb")
results = semantic_search(
conn,
query="How to optimize PostgreSQL query performance for large datasets",
top_k=5,
source_type="pdf",
min_similarity=0.65
)
for r in results:
print(f"[{r['similarity']:.3f}] {r['title']} - {r['content'][:100]}...")
Hybrid Search: Combining Vector and Full-Text
One of PostgreSQL's great strengths for RAG is the ability to combine semantic search (vector) with classical full-text search (BM25-like) in a single query. This is particularly useful for queries that contain precise technical terms like proper names, codes, version numbers, or product identifiers that might not have strong semantic neighbors.
-- Pure SQL Hybrid Search: vector + full-text with Reciprocal Rank Fusion (RRF)
-- RRF Score = sum(1/(k + rank)) for each result list
-- k=60 is the standard RRF constant from the original paper
WITH vector_results AS (
-- Semantic search: top 20 by vector similarity
SELECT
id, content, source_path,
ROW_NUMBER() OVER (ORDER BY embedding <=> $1::vector) AS v_rank,
1 - (embedding <=> $1::vector) AS vector_score
FROM documents
ORDER BY embedding <=> $1::vector
LIMIT 20
),
fts_results AS (
-- Full-text search: top 20 by ts_rank relevance
SELECT
id, content, source_path,
ROW_NUMBER() OVER (
ORDER BY ts_rank(to_tsvector('english', content),
plainto_tsquery('english', $2)) DESC
) AS f_rank,
ts_rank(to_tsvector('english', content),
plainto_tsquery('english', $2)) AS fts_score
FROM documents
WHERE to_tsvector('english', content) @@ plainto_tsquery('english', $2)
LIMIT 20
),
rrf_combined AS (
-- RRF fusion: combines both rankings
SELECT
COALESCE(v.id, f.id) AS id,
COALESCE(v.content, f.content) AS content,
COALESCE(v.source_path, f.source_path) AS source_path,
-- Weight: 70% vector + 30% full-text (tune for your use case)
COALESCE(0.7 / (60.0 + v.v_rank), 0) +
COALESCE(0.3 / (60.0 + f.f_rank), 0) AS rrf_score,
v.vector_score,
f.fts_score
FROM vector_results v
FULL OUTER JOIN fts_results f ON v.id = f.id
)
SELECT id, content, source_path, ROUND(rrf_score::numeric, 6) AS score
FROM rrf_combined
ORDER BY rrf_score DESC
LIMIT 5;
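When the vector and full-text result lists come from separate queries or services, the same fusion is often done in application code instead of SQL. A minimal RRF sketch using the weights above (0.7 vector, 0.3 full-text, k = 60):

```python
def rrf_fuse(vector_ids, fts_ids, k=60, w_vec=0.7, w_fts=0.3, top_n=5):
    """Reciprocal Rank Fusion: score = sum(weight / (k + rank)) per list.

    vector_ids / fts_ids are document ids ordered best-first (ranks start at 1);
    documents appearing in both lists accumulate both contributions.
    """
    scores = {}
    for rank, doc in enumerate(vector_ids, start=1):
        scores[doc] = scores.get(doc, 0.0) + w_vec / (k + rank)
    for rank, doc in enumerate(fts_ids, start=1):
        scores[doc] = scores.get(doc, 0.0) + w_fts / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

# A doc ranked well in both lists outscores a doc that tops only one list:
fused = rrf_fuse(vector_ids=[1, 2, 3, 4], fts_ids=[9, 2, 3, 1], top_n=3)
```

Because RRF only uses ranks, not raw scores, it sidesteps the problem that cosine similarity and ts_rank live on incomparable scales.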
Benchmarking: Measuring Search Quality
import time
import psycopg2
from statistics import mean, quantiles
def benchmark_vector_search(conn, query_vectors: list, config: dict) -> dict:
"""
Run a latency benchmark over a set of vector queries.
Args:
conn: PostgreSQL connection
query_vectors: List of query vectors (1536 dimensions)
config: Dict with ef_search or probes to set
Returns:
Dict with latency metrics (p50, p95, p99, mean, min, max)
"""
# Set ANN parameters for this benchmark run
with conn.cursor() as cur:
if "ef_search" in config:
cur.execute(f"SET hnsw.ef_search = {config['ef_search']}")
if "probes" in config:
cur.execute(f"SET ivfflat.probes = {config['probes']}")
latencies = []
for query_vec in query_vectors:
start = time.perf_counter()
with conn.cursor() as cur:
cur.execute("""
SELECT id, embedding <=> %s::vector AS dist
FROM documents
ORDER BY embedding <=> %s::vector
LIMIT 10
""", (query_vec, query_vec))
cur.fetchall()
elapsed_ms = (time.perf_counter() - start) * 1000
latencies.append(elapsed_ms)
p_values = quantiles(latencies, n=100)
return {
"config": config,
"n_queries": len(query_vectors),
"p50_ms": round(p_values[49], 2),
"p95_ms": round(p_values[94], 2),
"p99_ms": round(p_values[98], 2),
"mean_ms": round(mean(latencies), 2),
"min_ms": round(min(latencies), 2),
"max_ms": round(max(latencies), 2)
}
# Compare different HNSW configurations
configs = [
{"name": "HNSW ef=20 (fast)", "ef_search": 20},
{"name": "HNSW ef=40 (default)", "ef_search": 40},
{"name": "HNSW ef=100 (precise)", "ef_search": 100},
]
# Use real test queries (embeddings generated with the same model as documents)
test_queries = [...] # list of 1536-dim vectors
print(f"Benchmark with {len(test_queries)} test queries:")
for cfg in configs:
result = benchmark_vector_search(conn, test_queries[:100], cfg)
print(f"{cfg['name']}:")
print(f" p50={result['p50_ms']}ms, p95={result['p95_ms']}ms, p99={result['p99_ms']}ms")
Recall Measurement: Verifying ANN Quality
def measure_recall_at_k(conn, query_vectors: list, k: int = 10) -> float:
"""
Measure Recall@K by comparing ANN results with brute force ground truth.
Recall@K = (correct ANN results) / k
A Recall@10 of 0.95 means the ANN finds
9.5 of the 10 exact results on average.
Note: use a sample of 50-100 queries for statistical stability.
Run this periodically in production to detect index degradation.
"""
total_recall = 0.0
for query_vec in query_vectors:
# Exact results (brute force) - ground truth
with conn.cursor() as cur:
cur.execute("SET enable_indexscan = off")
cur.execute("""
SELECT id FROM documents
ORDER BY embedding <=> %s::vector
LIMIT %s
""", (query_vec, k))
exact_ids = set(r[0] for r in cur.fetchall())
cur.execute("SET enable_indexscan = on")
# ANN results (with HNSW/IVFFlat index)
with conn.cursor() as cur:
cur.execute("""
SELECT id FROM documents
ORDER BY embedding <=> %s::vector
LIMIT %s
""", (query_vec, k))
ann_ids = set(r[0] for r in cur.fetchall())
# Recall = overlap / k
overlap = len(exact_ids & ann_ids)
total_recall += overlap / k
avg_recall = total_recall / len(query_vectors)
print(f"Recall@{k}: {avg_recall:.4f} ({avg_recall*100:.1f}%)")
return avg_recall
# Recall targets by use case:
# Recall@10 > 0.90 for general RAG applications
# Recall@10 > 0.95 for enterprise RAG (medical, legal)
# Recall@10 > 0.85 acceptable for autocomplete/recommendation
# With HNSW m=16, ef_construction=64, ef_search=40: typically 0.92-0.95
# With HNSW m=16, ef_construction=64, ef_search=100: typically 0.97-0.99
Advanced Query Optimization: Planning and Diagnostics
-- 1. Updated statistics: essential for good query plans
ANALYZE documents;
-- 2. Check index size and usage statistics
SELECT
    indexrelname,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
    idx_scan AS times_used,
    idx_tup_read AS tuples_read,
    idx_tup_fetch AS tuples_fetched,
    ROUND(idx_tup_fetch::numeric / NULLIF(idx_scan, 0), 1) AS avg_tuples_per_scan
FROM pg_stat_user_indexes
WHERE relname = 'documents'
ORDER BY idx_scan DESC;
-- 3. Memory parameters for optimal performance
SET maintenance_work_mem = '512MB'; -- for HNSW index BUILD (temporary)
SET work_mem = '64MB'; -- for query sort operations
-- 4. Parallel query for large datasets
SET max_parallel_workers_per_gather = 4; -- use 4 workers per query
-- 5. Verify shared_buffers is adequate
-- The HNSW index should fit in memory for optimal performance
SHOW shared_buffers;
-- Rule of thumb: shared_buffers = 25% of total RAM
-- pg_relation_size(HNSW index) should be much less than shared_buffers
-- 6. Monitor cache hit ratio for the table and index
SELECT
relname,
heap_blks_read AS disk_reads,
heap_blks_hit AS cache_hits,
ROUND(heap_blks_hit::numeric / NULLIF(heap_blks_read + heap_blks_hit, 0) * 100, 2)
AS cache_hit_pct
FROM pg_statio_user_tables
WHERE relname = 'documents';
-- Target: cache_hit_pct > 95% for optimal performance
-- 7. Identify unused indexes (candidates for DROP):
SELECT indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE relname = 'documents'
  AND idx_scan = 0
  AND indexrelname NOT LIKE '%pkey%'; -- exclude primary keys
-- 8. Index cache hit ratio (how much of the index is served from cache)
SELECT
indexrelname,
idx_blks_read AS index_disk_reads,
idx_blks_hit AS index_cache_hits,
ROUND(idx_blks_hit::numeric / NULLIF(idx_blks_read + idx_blks_hit, 0) * 100, 2)
AS index_cache_pct
FROM pg_statio_user_indexes
WHERE indexrelname LIKE '%hnsw%';
Maximal Marginal Relevance (MMR)
A common problem in similarity search is result redundancy: the top-k chunks may all be very similar to each other, covering the same information. This degrades RAG quality because the model receives repeated information in its context window. MMR balances relevance and diversity by progressively selecting the next chunk that is relevant to the query but different from the already selected ones.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
def maximal_marginal_relevance(
query_vec: list[float],
candidate_vecs: list[list[float]],
candidate_data: list[dict],
k: int = 5,
lambda_param: float = 0.5
) -> list[dict]:
"""
MMR: select k results balancing relevance and diversity.
Algorithm:
1. Always select the most relevant candidate to the query first
2. For each subsequent candidate, use the MMR score which penalizes
redundancy with respect to already selected results
lambda_param:
1.0 = relevance only (equivalent to standard top-k)
0.0 = diversity only (maximum variety)
0.5 = optimal balance for RAG (default)
0.6 = slightly more relevance than diversity
Returns:
List of k diversified yet relevant candidates
"""
query_array = np.array(query_vec).reshape(1, -1)
candidate_array = np.array(candidate_vecs)
# Query-to-candidate similarity (relevance scores)
query_sims = cosine_similarity(query_array, candidate_array)[0]
selected = []
selected_indices = []
remaining_indices = list(range(len(candidate_vecs)))
while len(selected) < k and remaining_indices:
if not selected:
# First iteration: choose the most relevant
best_idx = int(np.argmax(query_sims))
else:
# Subsequent iterations: MMR score
selected_array = candidate_array[selected_indices]
mmr_scores = []
for i in remaining_indices:
relevance = query_sims[i]
# Maximum similarity with already selected (redundancy)
max_redundancy = float(np.max(
cosine_similarity(candidate_array[i:i+1], selected_array)[0]
))
# MMR balances relevance and redundancy
mmr = lambda_param * relevance - (1 - lambda_param) * max_redundancy
mmr_scores.append((i, mmr))
best_idx = max(mmr_scores, key=lambda x: x[1])[0]
selected.append(candidate_data[best_idx])
selected_indices.append(best_idx)
remaining_indices.remove(best_idx)
return selected
# Usage pattern with PostgreSQL results
# (searcher and query_embedding below are placeholders for your retrieval layer):
# First retrieve more candidates (top-20), then apply MMR for top-5 diversified
chunks_with_vecs = searcher.vector_search_with_embeddings(query, top_k=20)
embeddings = [c['embedding'] for c in chunks_with_vecs]
data = [{'id': c['id'], 'content': c['content']} for c in chunks_with_vecs]
diverse_chunks = maximal_marginal_relevance(
query_vec=query_embedding,
candidate_vecs=embeddings,
candidate_data=data,
k=5,
lambda_param=0.6 # slightly oriented toward relevance
)
# Result: 5 chunks that are relevant AND have diverse content
Common Mistakes in Similarity Search
- ANN not being used: If EXPLAIN shows a Seq Scan, check that the LIMIT is reasonable relative to the table size, that the index exists (\d documents), and that statistics are current (ANALYZE).
- ef_search too low: With ef_search=10, you might get Recall@10 of only 70-75%. For production, use at least 40; prefer 60-100 for RAG applications.
- Post-filtering that empties results: If you apply WHERE filters after ORDER BY (without pre-filtering), the ANN returns global top-k and then filters, potentially returning 0 results. Always use pre-filtering.
- Wrong distance operator: Using <-> (L2) instead of <=> (cosine) for text embeddings dramatically reduces search quality. Verify that the operator type in the index matches the operator in the query.
- No similarity threshold: Returning all top-k results even when distance is very high leads to irrelevant answers in RAG responses.
- Ignoring redundancy: The top-5 chunks might all cover the same paragraph of a document. Use MMR for diverse context in RAG responses.
Conclusions and Next Steps
Similarity search in PostgreSQL has many tuning parameters, but the fundamental trade-off is straightforward: the more candidates you examine, the higher the recall and the higher the latency. For most production RAG systems, HNSW with ef_search=60-100 offers an excellent balance: 10-50ms latency with Recall@10 of 92-97%.
The key takeaway for 2026 is that PostgreSQL with pgvector can handle similarity search at production scale without a separate vector database. The combination of HNSW indexing, hybrid search with RRF, and MMR for diversity gives you a search system competitive with specialized vector databases like Pinecone or Qdrant, while keeping your entire data infrastructure in a single, battle-tested system.
The next article deepens advanced indexing strategies: how to choose optimal HNSW and IVFFlat parameters for your specific use case, how to monitor index degradation over time, and techniques for incremental updates without performance loss. Includes: complete PostgreSQL configuration, parallel index build, and how to schedule REINDEX without downtime.
Advanced Query Patterns for Production RAG
Beyond the fundamentals, production RAG systems often need more sophisticated query patterns. Here are three patterns that address real production challenges:
Pattern 1: Two-Phase Retrieval (Coarse-to-Fine)
-- Two-phase retrieval: fast coarse search followed by re-ranking
-- Phase 1: fast ANN search with a modest ef_search to gather candidates quickly
-- Phase 2: re-rank the small candidate set with additional relevance signals
-- Phase 1: coarse retrieval (fast, lower recall)
SET hnsw.ef_search = 50; -- must be >= the candidate LIMIT: HNSW returns at most ef_search rows
WITH candidates AS (
SELECT id, content, source_path, chunk_index, metadata,
embedding <=> $1::vector AS coarse_distance
FROM documents
WHERE source_type = $2 -- pre-filter for relevant document type
ORDER BY embedding <=> $1::vector
LIMIT 50 -- retrieve 50 candidates (10x the final top-k)
),
-- Phase 2: exact re-ranking on candidates only
reranked AS (
SELECT
c.*,
-- Combine vector similarity with metadata relevance signals
c.coarse_distance * 0.8 +
(CASE
WHEN c.metadata->>'quality_score' IS NOT NULL
THEN (1.0 - (c.metadata->>'quality_score')::float) * 0.2
ELSE 0
END) AS final_score,
-- Full-text re-rank signal on small candidate set
ts_rank(to_tsvector('english', c.content),
plainto_tsquery('english', $3)) AS fts_boost
FROM candidates c
)
SELECT id, content, source_path, chunk_index, metadata,
final_score - (fts_boost * 0.1) AS combined_score
FROM reranked
ORDER BY combined_score ASC
LIMIT 5;
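To make the Phase-2 arithmetic concrete, here is the same scoring formula in plain Python with mock candidates (the 0.8/0.2/0.1 weights mirror the SQL above; the candidate values are illustrative):

```python
# Re-ranking score from Pattern 1, in plain Python. Lower is better.
# quality_score is assumed to be in [0, 1], higher = better quality.
def rerank_score(coarse_distance, quality_score=None, fts_boost=0.0):
    penalty = (1.0 - quality_score) * 0.2 if quality_score is not None else 0.0
    final_score = coarse_distance * 0.8 + penalty
    return final_score - fts_boost * 0.1

candidates = [
    {"id": 1, "coarse_distance": 0.20, "quality_score": 0.9, "fts_boost": 0.30},
    {"id": 2, "coarse_distance": 0.18, "quality_score": None, "fts_boost": 0.00},
    {"id": 3, "coarse_distance": 0.25, "quality_score": 0.5, "fts_boost": 0.60},
]
ranked = sorted(candidates, key=lambda c: rerank_score(
    c["coarse_distance"], c["quality_score"], c["fts_boost"]))
print([c["id"] for c in ranked])  # → [2, 1, 3]
```

Note how id 2 wins despite having no quality or full-text signal: its coarse distance alone is low enough. Weighting is where most of the tuning effort goes in practice.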
Pattern 2: Time-Decayed Similarity Search
-- Time-decayed search: recent documents receive a boost
-- Useful for news, documentation, or frequently updated knowledge bases
SELECT
id,
content,
source_path,
created_at,
embedding <=> $1::vector AS vector_distance,
-- Exponential decay: documents older than 30 days receive diminishing boost
-- decay_factor: 1.0 for today, ~0.5 for 30 days ago, ~0.13 for 90 days ago
EXP(-0.023 * EXTRACT(EPOCH FROM (NOW() - created_at)) / 86400) AS time_decay,
-- Combined score: lower is better (distance minimized, decay maximized)
(embedding <=> $1::vector) * (2.0 - EXP(-0.023 * EXTRACT(EPOCH FROM (NOW() - created_at)) / 86400))
AS time_adjusted_score
FROM documents
WHERE created_at > NOW() - INTERVAL '180 days' -- limit to last 6 months
ORDER BY time_adjusted_score ASC
LIMIT 10;
-- Parameter explanation:
-- 0.023 = -ln(0.5)/30 = decay constant for half-life of 30 days
-- 86400 = seconds in a day (convert EPOCH to days)
-- The 2.0 makes the multiplier range from 1.0 (recent) to ~2.0 (very old),
-- so fresh documents get lower (better) scores
-- Note: ordering by this expression cannot use the HNSW index, so the query
-- scans every row matching the WHERE filter; keep the date filter selective
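The decay constants can be sanity-checked without a database; this pure-Python sketch reproduces the SQL's math:

```python
# Numeric check of the exponential decay used in the time-decayed query.
import math

HALF_LIFE_DAYS = 30
DECAY = math.log(2) / HALF_LIFE_DAYS  # ≈ 0.0231, rounded to 0.023 in the SQL

def time_decay(age_days: float) -> float:
    return math.exp(-DECAY * age_days)

def time_adjusted_score(distance: float, age_days: float) -> float:
    # Multiplier grows from 1.0 (today) toward 2.0 (very old)
    return distance * (2.0 - time_decay(age_days))

print(round(time_decay(0), 2))    # → 1.0
print(round(time_decay(30), 2))   # → 0.5   (half-life)
print(round(time_decay(90), 2))   # ≈ 0.12
# Same vector distance, different ages: the recent document scores better (lower)
print(time_adjusted_score(0.3, 0) < time_adjusted_score(0.3, 60))  # → True
```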
Pattern 3: Semantic Deduplication Before Indexing
-- Detect near-duplicate chunks before inserting into the knowledge base
-- Prevents redundant context in RAG retrieval
-- Find near-duplicate chunks in the existing knowledge base
-- (cosine distance < 0.05 = extremely similar content)
WITH new_chunk AS (
SELECT $1::vector AS embedding -- the new chunk's embedding
)
SELECT
d.id,
d.source_path,
d.content,
1 - (d.embedding <=> nc.embedding) AS similarity
FROM documents d, new_chunk nc
WHERE 1 - (d.embedding <=> nc.embedding) > 0.95 -- 95%+ similar = near-duplicate
ORDER BY similarity DESC
LIMIT 5;
-- If this returns results, skip insertion (duplicate content)
-- If empty, proceed with insertion
-- Python implementation:
def insert_if_not_duplicate(conn, content: str, embedding: list[float],
                            source_path: str, threshold: float = 0.95) -> bool:
    """
    Insert chunk only if no near-duplicate exists in the database.
    Returns True if inserted, False if skipped as duplicate.

    Passing a Python list as %s::vector relies on pgvector's array-to-vector
    cast; alternatively, call pgvector.psycopg2.register_vector(conn) once
    and pass embeddings as numpy arrays.
    """
    with conn.cursor() as cur:
        cur.execute("""
            SELECT COUNT(*) FROM documents
            WHERE 1 - (embedding <=> %s::vector) > %s
        """, (embedding, threshold))
        duplicate_count = cur.fetchone()[0]

    if duplicate_count > 0:
        print(f"Skipping duplicate chunk from {source_path}")
        return False

    # Insert the chunk
    with conn.cursor() as cur:
        cur.execute("""
            INSERT INTO documents (content, embedding, source_path)
            VALUES (%s, %s::vector, %s)
        """, (content, embedding, source_path))
    conn.commit()
    return True
Performance Summary: Choosing the Right Configuration
| Use Case | Algorithm | ef_search / probes | Expected Latency | Expected Recall@10 |
|---|---|---|---|---|
| Real-time autocomplete | HNSW | ef_search=20 | 2-5ms | ~85% |
| General RAG (production) | HNSW | ef_search=60 | 10-20ms | ~94% |
| Medical/Legal RAG | HNSW | ef_search=100 | 20-40ms | ~97% |
| Batch re-ranking | Brute force | N/A | 1-5s/batch | 100% |
| Large static dataset | IVFFlat | probes=30 | 15-30ms | ~92% |
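Recall@10 figures like those in the table are obtained by comparing ANN output against exact brute-force results for the same queries. A minimal sketch with mock ID lists (the IDs are illustrative; in practice you would run both query variants against your own data):

```python
# Recall@k: what fraction of the exact top-k did the ANN search also return?
def recall_at_k(ann_ids: list[int], exact_ids: list[int], k: int = 10) -> float:
    return len(set(ann_ids[:k]) & set(exact_ids[:k])) / k

exact = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]   # ground truth from brute-force search
ann   = [1, 2, 3, 5, 4, 7, 6, 9, 11, 12]  # ANN result: missed ids 8 and 10
print(recall_at_k(ann, exact))  # → 0.8
```

Averaging this over a few hundred representative queries gives a reliable recall estimate for a given ef_search or probes setting.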