RAG in Production: Scalable Architecture with PostgreSQL
Building a working RAG prototype on your laptop is relatively straightforward. Taking it to production, where it must serve hundreds of concurrent users with sub-second latency, high availability, and controlled costs, is an entirely different engineering challenge.
In this final article of the series, we build a production RAG architecture on PostgreSQL: connection pooling with PgBouncer, read replicas dedicated to vector search, partitioning for datasets with billions of vectors, application-level caching with Redis, monitoring with Prometheus and Grafana, and Kubernetes deployment strategies for zero-downtime updates.
All while maintaining the "Just Use Postgres" philosophy of 2026: no separate vector database, no Pinecone, no Qdrant. PostgreSQL with pgvector, properly configured, beats specialized competitors on cost/performance in 90% of real-world cases.
Series Overview
| # | Article | Focus |
|---|---|---|
| 1 | pgvector | Installation, operators, indexing |
| 2 | Embeddings Deep Dive | Models, distances, generation |
| 3 | RAG with PostgreSQL | End-to-end RAG pipeline |
| 4 | Similarity Search | Algorithms and optimization |
| 5 | HNSW and IVFFlat | Advanced indexing strategies |
| 6 | You are here - RAG in Production | Scalability and performance |
What You Will Learn
- Reference architecture for enterprise RAG on PostgreSQL
- Connection pooling with PgBouncer: why and how to configure it
- Read replicas dedicated to vector search workloads
- Partitioning for million/billion-vector datasets
- Application-level caching with Redis for repeated queries
- Monitoring: essential metrics, Grafana dashboards, alerting
- Kubernetes deployment: StatefulSets, PVCs, rolling updates
- Multi-tenancy: isolating data across different clients
- PostgreSQL configuration tuning for production RAG
- Autovacuum and maintenance of vector indexes
Production RAG Reference Architecture
┌─────────────────────────────────────┐
│ Load Balancer │
│ (nginx / AWS ALB) │
└──────────────┬──────────────────────┘
│
┌──────────────▼──────────────────────┐
│ RAG API Service │
│ (FastAPI, K8s Deployment) │
│ Replicas: 3-10 pods │
└──┬─────────────────────┬────────────┘
│ │
┌────────▼────────┐ ┌────────▼────────┐
│ Redis Cache │ │ Embedding API │
│ (query cache, │ │ (OpenAI / local │
│ emb cache) │ │ inference) │
└─────────────────┘ └─────────────────┘
│
┌─────────────▼───────────────────────────┐
│ PgBouncer │
│ Connection Pool │
│ (transaction mode, pool_size=100) │
└──────────────┬──────────────────────────┘
│
┌──────────────▼──────────────────────────┐
│ PostgreSQL Primary │
│ (write: ingestion, updates) │
│ shared_buffers=32GB, m=16 │
└──────────────┬──────────────────────────┘
│ Streaming Replication
┌──────────────▼──────────────────────────┐
│ PostgreSQL Read Replica │
│ (read: vector search, RAG queries) │
│ Dedicated for ANN workloads │
│ ef_search tuned per use case │
└─────────────────────────────────────────┘
Connection Pooling with PgBouncer
PostgreSQL creates a process (fork) for each connection. With 200 concurrent users running vector queries, each connection uses ~5-10MB of RAM just for the process. PgBouncer maintains a pool of real database connections and queues application requests, reducing PostgreSQL connections from hundreds down to a few dozen.
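To put rough numbers on this (back-of-envelope arithmetic; the ~8MB-per-backend figure is an assumption, measure your own server):

```python
# Back-of-envelope RAM math for direct connections vs. pooled connections.
# MB_PER_BACKEND is an assumed average per PostgreSQL backend process.
MB_PER_BACKEND = 8

def backend_memory_mb(connections: int, mb_per_backend: int = MB_PER_BACKEND) -> int:
    """Approximate RAM consumed by PostgreSQL backend processes alone."""
    return connections * mb_per_backend

direct = backend_memory_mb(200)  # 200 clients connecting directly
pooled = backend_memory_mb(25)   # the same clients behind a 25-connection pool
print(f"direct: {direct} MB, pooled: {pooled} MB, saved: {direct - pooled} MB")
```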
Installation and Configuration
# pgbouncer.ini - Configuration for RAG workload
[databases]
# Primary database for writes (ingestion)
ragdb_write = host=postgres-primary port=5432 dbname=ragdb user=raguser
# Read replica for vector search
ragdb_read = host=postgres-replica port=5432 dbname=ragdb user=raguser
[pgbouncer]
# Pool mode: transaction is most efficient for short queries
# - session: connection reserved for entire session
# - transaction: connection released after each transaction (OPTIMAL for RAG)
# - statement: release after each statement (incompatible with multi-statement txns)
pool_mode = transaction
# Maximum real connections to PostgreSQL per database
max_db_connections = 50
# Connections per pool (per user+database combination)
default_pool_size = 25
# Total client connections toward PgBouncer
max_client_conn = 1000
# Timeout and resilience
query_wait_timeout = 30 # seconds before "query too long" error
server_idle_timeout = 600 # close idle connections after 10 minutes
client_idle_timeout = 0 # no client timeout (managed by the app)
# Authentication (scram-sha-256 is preferred on PostgreSQL 14+; md5 is legacy)
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt
# Monitoring
stats_period = 60
logfile = /var/log/pgbouncer/pgbouncer.log
pidfile = /var/run/pgbouncer/pgbouncer.pid
listen_addr = 0.0.0.0
listen_port = 5432
Usage in Python with Connection Pool
import psycopg2.pool
from contextlib import contextmanager
import threading
class DatabasePool:
"""
Thread-safe connection pool toward PgBouncer.
Separate connections for write (primary) and read (replica).
"""
def __init__(self, write_dsn: str, read_dsn: str,
min_conn: int = 2, max_conn: int = 20):
self._write_pool = psycopg2.pool.ThreadedConnectionPool(
minconn=min_conn, maxconn=max_conn, dsn=write_dsn
)
self._read_pool = psycopg2.pool.ThreadedConnectionPool(
minconn=min_conn, maxconn=max_conn, dsn=read_dsn
)
self._lock = threading.Lock()
@contextmanager
def get_write_conn(self):
"""Connection for write operations (primary)."""
conn = self._write_pool.getconn()
try:
yield conn
conn.commit()
except Exception:
conn.rollback()
raise
finally:
self._write_pool.putconn(conn)
@contextmanager
def get_read_conn(self):
"""Connection for read operations (read replica)."""
conn = self._read_pool.getconn()
try:
            # SET LOCAL scopes ef_search to the current transaction; with
            # PgBouncer in transaction mode a session-level SET could leak
            # to whichever client next reuses this server connection
            with conn.cursor() as cur:
                cur.execute("SET LOCAL hnsw.ef_search = 60")
yield conn
finally:
conn.rollback() # rollback to release locks
self._read_pool.putconn(conn)
# Initialize the pool once at application startup
db_pool = DatabasePool(
write_dsn="host=pgbouncer port=5432 dbname=ragdb_write user=raguser password=xxx",
read_dsn="host=pgbouncer port=5432 dbname=ragdb_read user=raguser password=xxx"
)
# Usage in API endpoints:
async def search_endpoint(query: str):
with db_pool.get_read_conn() as conn:
results = semantic_search(conn, query)
return results
async def ingest_endpoint(document_path: str):
with db_pool.get_write_conn() as conn:
ingest_document(conn, document_path)
Monitoring PgBouncer
-- Connect to the PgBouncer admin console on its listen_port
-- (6432 by default; 5432 in the configuration above):
-- psql -h localhost -p 5432 -U pgbouncer pgbouncer
-- Real-time pool statistics
SHOW POOLS;
-- Output:
-- database | user | cl_active | cl_waiting | sv_active | sv_idle | maxwait
-- ragdb_read | raguser | 23 | 0 | 23 | 2 | 0
-- ragdb_write| raguser | 5 | 0 | 5 | 0 | 0
-- Aggregate statistics per database
SHOW STATS;
-- total_query_count | total_query_time | avg_query | avg_wait
-- 1248532 | 1832743.2 | 1.46 | 0.02
-- List of current client connections
SHOW CLIENTS;
-- List of current server connections to PostgreSQL
SHOW SERVERS;
-- Alert: if cl_waiting > 0 for more than 5 seconds, the pool is saturated
-- Solution: increase default_pool_size or max_db_connections
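The saturation rule in the comment above can be expressed as a tiny helper; the parameter names mirror SHOW POOLS columns, but the function itself is illustrative, not part of PgBouncer:

```python
def pool_saturated(cl_waiting: int, maxwait_seconds: float,
                   wait_threshold: float = 5.0) -> bool:
    """True when clients are queueing (cl_waiting > 0) and the oldest
    has waited longer than wait_threshold seconds."""
    return cl_waiting > 0 and maxwait_seconds > wait_threshold

assert not pool_saturated(cl_waiting=0, maxwait_seconds=0.0)   # healthy pool
assert not pool_saturated(cl_waiting=3, maxwait_seconds=1.2)   # brief spike, OK
assert pool_saturated(cl_waiting=12, maxwait_seconds=8.5)      # grow default_pool_size
```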
Read Replicas for Vector Search
Vector queries (ANN search) are CPU and memory intensive but do not require the freshest data. Isolating them on a dedicated read replica frees the primary for ingestion operations (INSERT) and keeps latency low for both workloads.
Setting Up Read Replicas with Streaming Replication
# On the PRIMARY server
# postgresql.conf
wal_level = replica # enable WAL for replication
max_wal_senders = 5 # max 5 replicas connected
wal_keep_size = '1GB' # keep at least 1GB of WAL for replicas
# pg_hba.conf - allow the replication connection (use the replica's IP in CIDR form)
host replication repuser <replica-ip>/32 scram-sha-256
# Create the replication user (run via psql on the primary):
# CREATE USER repuser REPLICATION LOGIN PASSWORD 'strongpassword';
# ===========================
# On the REPLICA server
# ===========================
# Clone the primary:
pg_basebackup -h primary-host -U repuser -D /var/lib/postgresql/16/main -P -Xs -R
# The -R flag automatically creates standby.signal and recovery parameters
# postgresql.conf on the replica - optimized for vector search:
hot_standby = on # allows reads on the replica
shared_buffers = '32GB' # plenty of memory for the HNSW index
effective_cache_size = '96GB' # hint to the planner about total memory
max_parallel_workers_per_gather = 4 # parallel vector queries
hnsw.ef_search = 60 # global parameter for the replica
random_page_cost = 1.1 # SSD: low cost for random I/O
effective_io_concurrency = 200 # for NVMe drives
Monitoring the Replica
-- On primary: status of connected replicas
SELECT
client_addr AS replica_address,
application_name AS replica_name,
state AS state,
sent_lsn AS wal_sent,
write_lsn AS wal_written,
flush_lsn AS wal_flushed,
replay_lsn AS wal_replayed,
  -- Lag in bytes (pg_lsn subtraction, PostgreSQL 14+):
  sent_lsn - replay_lsn AS lag_bytes,
  -- Lag as time intervals (columns already computed by PostgreSQL):
  write_lag,
  replay_lag
FROM pg_stat_replication;
-- On the replica: verify it is replaying WAL (result is an interval)
SELECT now() - pg_last_xact_replay_timestamp() AS replica_lag;
-- Alert: if replica lag > 30 seconds, possible replication issue
-- Check pg_stat_wal_receiver on the replica:
SELECT status, received_lsn, last_msg_receipt_time, latest_end_lsn
FROM pg_stat_wal_receiver;
Table Partitioning for Large-Scale Datasets
With tens of millions of vectors, a single table becomes difficult to manage. Partitioning divides the table into physically separate parts, enabling faster queries (partition pruning) and fine-grained maintenance at the partition level.
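The monthly-range arithmetic behind such partitioning can also be sketched application-side; this hypothetical helper emits one partition's DDL (names and bounds are illustrative):

```python
from datetime import date

def month_partition_ddl(table: str, year: int, month: int) -> str:
    """DDL for one monthly RANGE partition of `table` (illustrative helper)."""
    start = date(year, month, 1)
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return (f"CREATE TABLE {table}_{start:%Y_%m} PARTITION OF {table} "
            f"FOR VALUES FROM ('{start}') TO ('{end}')")

print(month_partition_ddl("documents_partitioned", 2026, 1))
```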
Time-Based Partitioning (Time-Series RAG)
-- Monthly partitioning: ideal for timestamped documents (news, emails, logs)
CREATE TABLE documents_partitioned (
id BIGSERIAL,
source_path TEXT NOT NULL,
source_type TEXT,
content TEXT NOT NULL,
embedding vector(1536),
metadata JSONB DEFAULT '{}',
created_at TIMESTAMPTZ DEFAULT NOW(),
PRIMARY KEY (id, created_at) -- partition column must be in PK
) PARTITION BY RANGE (created_at);
-- Create partitions for the last 12 months
DO $$
DECLARE
d DATE;
BEGIN
FOR i IN 0..11 LOOP
d := DATE_TRUNC('month', NOW()) - (i || ' months')::INTERVAL;
EXECUTE format(
'CREATE TABLE documents_%s PARTITION OF documents_partitioned
FOR VALUES FROM (%L) TO (%L)',
TO_CHAR(d, 'YYYY_MM'),
d,
d + INTERVAL '1 month'
);
-- HNSW index on each partition
EXECUTE format(
'CREATE INDEX ON documents_%s USING hnsw (embedding vector_cosine_ops) WITH (m=16, ef_construction=64)',
TO_CHAR(d, 'YYYY_MM')
);
RAISE NOTICE 'Partition created: documents_%', TO_CHAR(d, 'YYYY_MM');
END LOOP;
END $$;
-- Automatic partition pruning (PostgreSQL uses only relevant partitions):
EXPLAIN (ANALYZE)
SELECT id, content, embedding <=> query_vec AS dist
FROM documents_partitioned
WHERE created_at >= '2026-01-01' AND created_at < '2026-02-01' -- only Jan 2026 partition
ORDER BY embedding <=> query_vec
LIMIT 10;
-- Output: scans only documents_2026_01, ignores the other 11 partitions
Multi-Tenant Partitioning
-- Partition by tenant_id: isolate each client's data
CREATE TABLE documents_multitenant (
id BIGSERIAL,
tenant_id TEXT NOT NULL,
content TEXT NOT NULL,
embedding vector(1536),
metadata JSONB DEFAULT '{}',
PRIMARY KEY (id, tenant_id)
) PARTITION BY LIST (tenant_id);
-- Create partition for each tenant
CREATE TABLE docs_tenant_acme PARTITION OF documents_multitenant FOR VALUES IN ('acme');
CREATE TABLE docs_tenant_globex PARTITION OF documents_multitenant FOR VALUES IN ('globex');
CREATE TABLE docs_tenant_initech PARTITION OF documents_multitenant FOR VALUES IN ('initech');
-- Default partition for new tenants
CREATE TABLE docs_tenant_default PARTITION OF documents_multitenant DEFAULT;
-- HNSW index for each tenant
CREATE INDEX ON docs_tenant_acme USING hnsw (embedding vector_cosine_ops) WITH (m=16, ef_construction=64);
CREATE INDEX ON docs_tenant_globex USING hnsw (embedding vector_cosine_ops) WITH (m=16, ef_construction=64);
CREATE INDEX ON docs_tenant_initech USING hnsw (embedding vector_cosine_ops) WITH (m=16, ef_construction=64);
-- Query with automatic tenant isolation (partition pruning):
SELECT id, content, embedding <=> query_vec AS dist
FROM documents_multitenant
WHERE tenant_id = 'acme' -- searches ONLY in acme's partition
ORDER BY embedding <=> query_vec
LIMIT 5;
-- Row Level Security (RLS) as a second security layer:
ALTER TABLE documents_multitenant ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON documents_multitenant
USING (tenant_id = current_setting('app.current_tenant'));
-- In Python, set the tenant context. Note that SET cannot take bind
-- parameters, so use set_config() for a parameterized statement:
with conn.cursor() as cur:
    cur.execute("SELECT set_config('app.current_tenant', %s, false)", (tenant_id,))
# All subsequent queries in this session only see the current tenant's documents
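PostgreSQL's SET statement cannot take bind parameters, so the parameterizable way to set the tenant GUC is set_config(). A minimal helper sketch (the regex allowlist is an illustrative extra safeguard, not something PostgreSQL requires):

```python
import re

TENANT_RE = re.compile(r"^[a-z0-9_]{1,64}$")

def tenant_context_stmt(tenant_id: str) -> tuple:
    """Parameterized statement that sets app.current_tenant for RLS.
    set_config(..., false) applies for the session; pass true for
    transaction scope (useful behind a transaction-mode pooler)."""
    if not TENANT_RE.match(tenant_id):
        raise ValueError(f"invalid tenant id: {tenant_id!r}")
    return "SELECT set_config('app.current_tenant', %s, false)", (tenant_id,)

# Usage with any DB-API cursor:  cur.execute(*tenant_context_stmt("acme"))
sql, params = tenant_context_stmt("acme")
print(sql, params)
```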
Caching with Redis
Many RAG queries repeat: same question, same context. Implementing a two-level cache - embedding cache (avoids recomputing the embedding for the same query) and result cache (avoids re-running the vector search for identical queries) - can reduce latency by 90% and OpenAI costs by 70%.
import redis
import json
import hashlib
from typing import Optional
class RAGCache:
"""
Two-level cache:
1. Embedding cache: for query vectors
2. Result cache: for complete RAG results
"""
def __init__(self, redis_url: str = "redis://localhost:6379"):
self.r = redis.from_url(redis_url, decode_responses=True)
# TTL for different cache types
self.embedding_ttl = 3600 * 24 * 7 # 7 days (embeddings are stable)
self.result_ttl = 3600 # 1 hour (results may become stale)
def _cache_key(self, prefix: str, *args) -> str:
"""Generate a deterministic cache key."""
content = json.dumps(args, sort_keys=True)
hash_val = hashlib.sha256(content.encode()).hexdigest()[:16]
return f"rag:{prefix}:{hash_val}"
# === EMBEDDING CACHE ===
def get_embedding(self, text: str) -> Optional[list[float]]:
key = self._cache_key("emb", text)
cached = self.r.get(key)
if cached:
return json.loads(cached)
return None
def set_embedding(self, text: str, embedding: list[float]) -> None:
key = self._cache_key("emb", text)
self.r.setex(key, self.embedding_ttl, json.dumps(embedding))
# === RESULT CACHE ===
def get_rag_result(self, query: str, filters: dict = None) -> Optional[dict]:
key = self._cache_key("result", query, filters or {})
cached = self.r.get(key)
if cached:
result = json.loads(cached)
result["from_cache"] = True
return result
return None
def set_rag_result(self, query: str, result: dict, filters: dict = None) -> None:
key = self._cache_key("result", query, filters or {})
# Do not cache responses with empty sources or errors
if result.get("sources") and result.get("answer"):
self.r.setex(key, self.result_ttl, json.dumps(result))
# === CACHE STATISTICS ===
def get_stats(self) -> dict:
info = self.r.info("stats")
        hits = info.get("keyspace_hits", 0)
        total = hits + info.get("keyspace_misses", 0)
        hit_rate = hits / total if total > 0 else 0
return {
"hit_rate": round(hit_rate, 3),
"total_queries": total,
"memory_used": self.r.info("memory")["used_memory_human"],
"rag_keys": self.r.dbsize()
}
# Integration in the RAG System
cache = RAGCache("redis://redis:6379")
def cached_embed(text: str, client) -> list[float]:
"""Generate or retrieve from cache the embedding for a text."""
cached = cache.get_embedding(text)
if cached:
return cached
embedding = client.embeddings.create(
input=[text], model="text-embedding-3-small"
).data[0].embedding
cache.set_embedding(text, embedding)
return embedding
def cached_rag_query(rag_system, question: str, filters: dict = None) -> dict:
"""Run a RAG query with result caching."""
# Check result cache first
cached = cache.get_rag_result(question, filters)
if cached:
return cached
# Run the full query
result = rag_system.ask(question, filters=filters)
# Save to cache
cache.set_rag_result(question, result, filters)
return result
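The key scheme in RAGCache can be exercised without a Redis server; this standalone copy of the hashing logic demonstrates that identical (query, filters) pairs always map to the same key:

```python
import hashlib
import json

def cache_key(prefix: str, *args) -> str:
    """Deterministic key: JSON with sorted dict keys, SHA-256, truncated hash."""
    content = json.dumps(args, sort_keys=True)
    return f"rag:{prefix}:{hashlib.sha256(content.encode()).hexdigest()[:16]}"

k1 = cache_key("result", "what is pgvector?", {"source_type": "docs"})
k2 = cache_key("result", "what is pgvector?", {"source_type": "docs"})
k3 = cache_key("result", "what is pgvector?", {"source_type": "email"})
assert k1 == k2   # same question + filters -> cache hit
assert k1 != k3   # different filters -> separate cache entry
```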
FastAPI: Serving RAG Asynchronously
from fastapi import FastAPI, HTTPException, BackgroundTasks
from pydantic import BaseModel
import asyncio
from concurrent.futures import ThreadPoolExecutor
app = FastAPI(title="RAG API", version="1.0.0")
executor = ThreadPoolExecutor(max_workers=10)
class QueryRequest(BaseModel):
    question: str
    top_k: int = 5
    # Optional fields need an explicit `| None` annotation under Pydantic v2
    source_type: str | None = None
    use_hybrid: bool = True
    tenant_id: str | None = None
class QueryResponse(BaseModel):
answer: str
sources: list[dict]
from_cache: bool = False
retrieval_ms: float
generation_ms: float
@app.post("/api/query", response_model=QueryResponse)
async def query_rag(request: QueryRequest):
"""
Query endpoint with async processing and caching.
"""
import time
# 1. Check cache
cached = cache.get_rag_result(request.question, {"source_type": request.source_type})
if cached:
return QueryResponse(
answer=cached["answer"],
sources=cached["sources"],
from_cache=True,
retrieval_ms=0,
generation_ms=0
)
# 2. Query embedding (with cache)
t0 = time.time()
# Run in thread to avoid blocking the event loop
    query_vec = await asyncio.get_running_loop().run_in_executor(
executor,
lambda: cached_embed(request.question, openai_client)
)
# 3. Vector search on PostgreSQL read replica
with db_pool.get_read_conn() as conn:
if request.tenant_id:
            # Set tenant context for RLS (SET cannot take bind parameters,
            # so use set_config for a parameterized statement)
            with conn.cursor() as cur:
                cur.execute(
                    "SELECT set_config('app.current_tenant', %s, false)",
                    (request.tenant_id,)
                )
chunks = hybrid_search(conn, query_vec, request.top_k)
retrieval_ms = (time.time() - t0) * 1000
# 4. Response generation
t1 = time.time()
    response = await asyncio.get_running_loop().run_in_executor(
executor,
lambda: rag_generator.generate(request.question, chunks)
)
generation_ms = (time.time() - t1) * 1000
result = {
"answer": response.answer,
"sources": response.sources
}
cache.set_rag_result(request.question, result, {"source_type": request.source_type})
return QueryResponse(
answer=response.answer,
sources=response.sources,
from_cache=False,
retrieval_ms=round(retrieval_ms, 1),
generation_ms=round(generation_ms, 1)
)
@app.get("/api/health")
async def health():
return {"status": "ok", "cache": cache.get_stats()}
Monitoring: Essential Metrics
Prometheus Metrics in Python
from prometheus_client import Counter, Histogram, Gauge, start_http_server
# Application metrics
rag_queries_total = Counter(
"rag_queries_total",
"Total number of RAG queries",
["status", "from_cache"]
)
rag_query_duration_seconds = Histogram(
"rag_query_duration_seconds",
"RAG query duration in seconds",
["phase"], # "retrieval", "generation", "total"
buckets=[0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0]
)
rag_retrieval_chunks_count = Histogram(
"rag_retrieval_chunks_count",
"Number of chunks retrieved per query",
buckets=[1, 2, 3, 5, 10, 20]
)
rag_retrieval_top_similarity = Gauge(
"rag_retrieval_top_similarity",
"Top similarity score of last retrieval"
)
# PostgreSQL metrics (via pg_stat_activity)
db_active_connections = Gauge(
"db_active_connections",
"Active database connections",
["pool"] # "write", "read"
)
def track_rag_query(func):
"""Decorator to track RAG query metrics."""
def wrapper(*args, **kwargs):
import time
start = time.time()
try:
result = func(*args, **kwargs)
rag_queries_total.labels(
status="success",
from_cache=str(result.get("from_cache", False))
).inc()
return result
except Exception as e:
rag_queries_total.labels(status="error", from_cache="false").inc()
raise
finally:
rag_query_duration_seconds.labels("total").observe(time.time() - start)
return wrapper
# Expose a /metrics endpoint for Prometheus to scrape (any free port works)
start_http_server(9090)
Grafana Dashboard: SQL Queries for PostgreSQL Metrics
-- Grafana dashboard queries: PostgreSQL vector metrics
-- 1. Per-statement latency percentiles
--    (pg_stat_statements aggregates cumulatively per normalized query;
--     on PostgreSQL 13+ the column is mean_exec_time, not total_time)
SELECT
  percentile_cont(0.5) WITHIN GROUP (ORDER BY mean_exec_time) AS p50_ms,
  percentile_cont(0.95) WITHIN GROUP (ORDER BY mean_exec_time) AS p95_ms,
  percentile_cont(0.99) WITHIN GROUP (ORDER BY mean_exec_time) AS p99_ms,
  COUNT(*) AS n_queries
FROM pg_stat_statements
WHERE query LIKE '%<=>%' -- statements using the vector distance operator
  AND calls > 0;
-- 2. Cache hit ratio for vector indexes
SELECT
indexrelname AS index_name,
idx_blks_read AS disk_reads,
idx_blks_hit AS cache_hits,
ROUND(idx_blks_hit::numeric / NULLIF(idx_blks_read + idx_blks_hit, 0) * 100, 2)
AS cache_hit_pct
FROM pg_statio_user_indexes
WHERE indexrelname LIKE '%hnsw%' OR indexrelname LIKE '%ivfflat%';
-- 3. Active connections by state
SELECT
state,
COUNT(*) AS connections,
MAX(now() - state_change) AS max_wait_time
FROM pg_stat_activity
WHERE datname = 'ragdb'
GROUP BY state;
-- 4. Tuples inserted/read per second (load indicator)
SELECT
relname,
n_tup_ins + n_tup_upd + n_tup_del AS writes_per_snapshot,
seq_tup_read + idx_tup_fetch AS reads_per_snapshot
FROM pg_stat_user_tables
WHERE relname LIKE '%document%';
PostgreSQL Configuration for Production RAG
PostgreSQL's default configuration is not optimized for vector workloads. Here are the critical parameters to modify in postgresql.conf for a production RAG system with 32GB+ of RAM:
# postgresql.conf - Production configuration for RAG on PostgreSQL 16
# System: 64GB RAM, 16 CPU cores, NVMe SSD
# === MEMORY ===
shared_buffers = '16GB' # 25% of total RAM
effective_cache_size = '48GB' # 75% of RAM (planner hint, not allocated memory)
work_mem = '64MB' # for sort/hash in queries (per-connection!)
maintenance_work_mem = '2GB' # for CREATE INDEX HNSW, VACUUM, pg_dump
max_worker_processes = 16
max_parallel_workers = 8
max_parallel_workers_per_gather = 4
# === STORAGE / I/O ===
random_page_cost = 1.1 # SSD: reduce from 4.0 to favor index scans
effective_io_concurrency = 200 # NVMe: can handle many parallel requests
checkpoint_completion_target = 0.9 # spread checkpoints over time
wal_buffers = '64MB' # default -1 (auto) is usually fine; raised here for heavy ingestion
# === AUTOVACUUM (CRITICAL for vector tables with many INSERT/DELETE) ===
autovacuum_vacuum_cost_delay = '2ms' # more aggressive than default 20ms
autovacuum_vacuum_scale_factor = 0.05 # trigger at 5% dead tuples (default 20%)
autovacuum_analyze_scale_factor = 0.02 # trigger analyze at 2% new tuples
autovacuum_vacuum_cost_limit = 400 # more work per autovacuum worker
# === LOGGING ===
log_min_duration_statement = 1000 # log queries > 1 second
log_lock_waits = on
track_io_timing = on # needed for accurate EXPLAIN (BUFFERS)
# === EXTENSIONS ===
shared_preload_libraries = 'pg_stat_statements,vector'
# === pgvector SPECIFIC ===
# hnsw.ef_search: set at session level in application code
# Do not set globally: each use case has its own optimal value
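One number in the block above deserves a sanity check: work_mem is allocated per sort or hash node, per connection, so the worst case multiplies. A rough bound (assumed figures, adjust for your server):

```python
def worst_case_work_mem_gb(connections: int, work_mem_mb: int,
                           nodes_per_query: int = 2) -> float:
    """Pessimistic bound: every connection runs a query with
    nodes_per_query concurrent sort/hash operations at full work_mem."""
    return connections * work_mem_mb * nodes_per_query / 1024

# 100 pooled connections x 64MB x 2 nodes: ~12.5GB on top of shared_buffers
print(worst_case_work_mem_gb(100, 64))
```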
Autovacuum and Vector Index Maintenance
Tables with embeddings undergo frequent UPDATE and DELETE operations during the RAG pipeline (document updates, re-chunking, deletion of obsolete documents). Without properly configured autovacuum, table bloat degrades vector search performance significantly.
-- Check table bloat for the documents table
SELECT
schemaname,
tablename,
n_live_tup AS live_rows,
n_dead_tup AS dead_rows,
ROUND(n_dead_tup::numeric / NULLIF(n_live_tup + n_dead_tup, 0) * 100, 2) AS pct_dead,
last_vacuum,
last_autovacuum,
last_analyze
FROM pg_stat_user_tables
WHERE tablename LIKE '%document%';
-- Alert: pct_dead > 5% means a manual VACUUM is needed
-- Manual VACUUM to reclaim space (does not block reads):
VACUUM (VERBOSE, ANALYZE) documents;
-- VACUUM FULL rewrites the table under an ACCESS EXCLUSIVE lock
-- (blocks reads AND writes - run only during a maintenance window):
VACUUM FULL documents;
-- Set aggressive autovacuum for the documents table
ALTER TABLE documents SET (
autovacuum_vacuum_scale_factor = 0.02, -- 2% dead rows = trigger vacuum
autovacuum_analyze_scale_factor = 0.01, -- 1% new rows = trigger analyze
autovacuum_vacuum_cost_delay = '2ms'
);
-- Check HNSW index size after many INSERT/DELETE operations
-- The HNSW graph keeps growing with inserts, and space is not reclaimed after DELETE
SELECT
indexname,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
pg_size_pretty(pg_total_relation_size(indexrelid)) AS total_size
FROM pg_stat_user_indexes
WHERE tablename = 'documents'
ORDER BY pg_relation_size(indexrelid) DESC;
-- Rebuild index without downtime (to reclaim space after many DELETEs):
REINDEX INDEX CONCURRENTLY documents_hnsw_idx;
Kubernetes Deployment
StatefulSet for PostgreSQL with PVC
# kubernetes/postgres-statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: postgres-vector
namespace: rag-production
spec:
serviceName: postgres-vector
  replicas: 2 # primary + 1 read replica; NOTE: a bare StatefulSet does not configure
              # streaming replication by itself - use an operator (CloudNativePG, Patroni)
selector:
matchLabels:
app: postgres-vector
template:
metadata:
labels:
app: postgres-vector
spec:
containers:
- name: postgres
image: pgvector/pgvector:pg16
env:
- name: POSTGRES_DB
value: ragdb
- name: POSTGRES_USER
valueFrom:
secretKeyRef:
name: postgres-secret
key: username
- name: POSTGRES_PASSWORD
valueFrom:
secretKeyRef:
name: postgres-secret
key: password
          # NOTE: the official postgres image ignores tuning env vars such as
          # POSTGRES_SHARED_BUFFERS; pass server settings as flags instead
          args: ["-c", "shared_buffers=8GB", "-c", "maintenance_work_mem=2GB"]
resources:
requests:
memory: "32Gi"
cpu: "8"
limits:
memory: "64Gi"
cpu: "16"
volumeMounts:
- name: postgres-data
mountPath: /var/lib/postgresql/data
- name: postgres-config
mountPath: /etc/postgresql/custom.conf
subPath: custom.conf
volumeClaimTemplates:
- metadata:
name: postgres-data
spec:
accessModes: ["ReadWriteOnce"]
storageClassName: fast-ssd # NVMe SSD for optimal performance
resources:
requests:
storage: 500Gi # 500GB for large HNSW indexes
RAG API Deployment with Rolling Updates
# kubernetes/rag-api-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: rag-api
namespace: rag-production
spec:
replicas: 5 # 5 parallel pods
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 2 # create 2 new pods before removing old ones
maxUnavailable: 0 # zero downtime update
selector:
matchLabels:
app: rag-api
  template:
    metadata:
      labels:
        app: rag-api # must match spec.selector.matchLabels
    spec:
containers:
- name: rag-api
image: your-registry/rag-api:latest
ports:
- containerPort: 8000
env:
- name: DB_WRITE_URL
valueFrom:
secretKeyRef:
name: db-secret
key: write-url
- name: DB_READ_URL
valueFrom:
secretKeyRef:
name: db-secret
key: read-url
- name: REDIS_URL
value: "redis://redis-service:6379"
- name: OPENAI_API_KEY
valueFrom:
secretKeyRef:
name: openai-secret
key: api-key
resources:
requests:
memory: "2Gi"
cpu: "1"
limits:
memory: "4Gi"
cpu: "2"
livenessProbe:
httpGet:
path: /api/health
port: 8000
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /api/health
port: 8000
initialDelaySeconds: 10
periodSeconds: 5
Cost Comparison: PostgreSQL vs Specialized Vector DBs
| Solution | 1M vectors/month | 10M queries/month | P95 Latency | Notes |
|---|---|---|---|---|
| PostgreSQL + pgvector (self-hosted) | $200-400 (EC2/GKE) | Included | 20-80ms | Fixed cost, full control |
| Pinecone (managed) | $70-700 | $0.04-0.08/1000 queries | 10-40ms | Variable cost, less control |
| Supabase Vector | $25-200 | Included | 30-100ms | Managed PostgreSQL + pgvector |
| Qdrant (self-hosted) | $150-300 | Included | 5-20ms | Faster but additional infrastructure |

One widely cited 2025 benchmark (benchmark.vector.dev) reports pgvector up to 28x faster than Pinecone at 16x lower cost; treat vendor benchmarks as directional rather than universal.
When NOT to Use PostgreSQL for Vector Search
- Billions of vectors: Beyond 500M-1B vectors, Qdrant or Weaviate become more memory-efficient through native vector quantization.
- Sub-millisecond latency: If you need <5ms P99, an in-memory specialized vector database (Qdrant cloud, Pinecone serverless) is more appropriate.
- No DBA expertise: If your team lacks PostgreSQL skills, a managed service like Pinecone reduces operational complexity significantly.
- Multi-modal search: Images + text + audio with different embedding dimensions require architectures PostgreSQL does not natively handle well.
Backup and Disaster Recovery for RAG Knowledge Base
A RAG knowledge base contains both the original documents and their vector embeddings. Backup strategy must cover both, but the optimal approach depends on dataset size and update frequency:
# Backup strategy for RAG knowledge base on PostgreSQL
# 1. Full logical backup (pg_dump):
# Suitable for databases up to ~50GB. Includes all data, indexes, schema.
# --format=custom is a compressed binary format; --no-privileges skips user grants
pg_dump -h postgres-primary \
        -U raguser \
        -d ragdb \
        --format=custom \
        --compress=9 \
        --no-privileges \
        --file=ragdb_backup_$(date +%Y%m%d_%H%M%S).dump
# Restore (parallel across 8 jobs):
pg_restore -h postgres-new \
           -U raguser \
           -d ragdb \
           --jobs=8 \
           ragdb_backup.dump
# 2. Physical backup (pg_basebackup) for large databases:
# Faster but requires more space. Uses WAL for point-in-time recovery.
# --wal-method=stream includes the WAL generated while the backup runs
pg_basebackup -h postgres-primary \
              -U repuser \
              -D /backup/ragdb_base_$(date +%Y%m%d) \
              --format=tar \
              --compress=9 \
              --wal-method=stream \
              --checkpoint=fast
# 3. Embeddings-only export (if source documents are stored elsewhere):
psql -h postgres-primary -U raguser ragdb -c \
"COPY (SELECT id, source_path, chunk_index, content, embedding::text, metadata, ingested_at
FROM rag_documents) TO STDOUT CSV HEADER" \
| gzip > rag_embeddings_$(date +%Y%m%d).csv.gz
# Size estimates:
# 1M vectors at 1536 dims in float32 = ~6.1GB for embeddings alone
# + text content: ~0.8GB for 1M chunks at ~800 chars each
# Total expected: ~7GB for 1M chunked documents
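The size estimates above are straightforward arithmetic; a quick check in Python (decimal GB, ignoring pgvector's small per-vector header and TOAST overhead):

```python
def embedding_gb(n_vectors: int, dims: int, bytes_per_dim: int = 4) -> float:
    """Raw float32 embedding payload in decimal GB (headers ignored)."""
    return n_vectors * dims * bytes_per_dim / 1e9

def text_gb(n_chunks: int, avg_chars: int) -> float:
    """Approximate text payload in decimal GB, assuming ~1 byte/char."""
    return n_chunks * avg_chars / 1e9

print(embedding_gb(1_000_000, 1536))  # ~6.1GB of embeddings
print(text_gb(1_000_000, 800))        # ~0.8GB of chunk text
```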
Alerting: When Things Go Wrong
-- Alert queries for Prometheus/Grafana alerting
-- Run every 60 seconds as a Prometheus exporter
-- 1. Alert: vector query p99 latency > 500ms
--    (pg_stat_statements is cumulative; use mean_exec_time on PostgreSQL 13+)
SELECT
  CASE
    WHEN percentile_cont(0.99) WITHIN GROUP (ORDER BY mean_exec_time) > 500
    THEN 'ALERT'
    ELSE 'OK'
  END AS status,
  percentile_cont(0.99) WITHIN GROUP (ORDER BY mean_exec_time) AS p99_ms
FROM pg_stat_statements
WHERE query LIKE '%<=>%'
  AND calls > 0;
-- 2. Alert: replica lag > 30 seconds
SELECT
CASE
WHEN EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) > 30
THEN 'ALERT'
ELSE 'OK'
END AS replica_status,
EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag_seconds;
-- 3. Alert: too many dead tuples (bloat > 10%)
SELECT
tablename,
CASE
WHEN n_dead_tup::float / NULLIF(n_live_tup + n_dead_tup, 0) > 0.1
THEN 'ALERT'
ELSE 'OK'
END AS bloat_status,
ROUND(n_dead_tup::numeric / NULLIF(n_live_tup + n_dead_tup, 0) * 100, 2) AS dead_pct
FROM pg_stat_user_tables
WHERE tablename LIKE '%document%';
-- 4. Alert: vector index cache hit ratio < 90%
SELECT
indexrelname,
CASE
WHEN idx_blks_hit::float / NULLIF(idx_blks_read + idx_blks_hit, 0) < 0.9
THEN 'ALERT'
ELSE 'OK'
END AS cache_status,
ROUND(idx_blks_hit::numeric / NULLIF(idx_blks_read + idx_blks_hit, 0) * 100, 2) AS hit_pct
FROM pg_statio_user_indexes
WHERE indexrelname LIKE '%hnsw%';
-- 5. Alert: connections approaching the limit (>85% of max_connections)
SELECT
CASE
WHEN COUNT(*) > (SELECT setting::int FROM pg_settings WHERE name = 'max_connections') * 0.85
THEN 'ALERT'
ELSE 'OK'
END AS conn_status,
COUNT(*) AS active_connections
FROM pg_stat_activity;
Production Readiness Checklist
-- PRE-DEPLOY CHECKLIST for RAG on PostgreSQL
-- 1. Verify installed extensions
SELECT extname, extversion FROM pg_extension WHERE extname IN ('vector', 'pg_stat_statements');
-- 2. Verify vector indexes and their sizes
SELECT
tablename, indexname,
pg_size_pretty(pg_relation_size(indexrelid)) AS size
FROM pg_stat_user_indexes
WHERE indexname LIKE '%hnsw%' OR indexname LIKE '%ivfflat%';
-- 3. Verify memory configuration (for prod with 32GB RAM)
SHOW shared_buffers; -- should be >= 8GB
SHOW work_mem; -- should be >= 16MB
SHOW maintenance_work_mem; -- should be >= 512MB
-- 4. Verify autovacuum settings
SELECT name, setting
FROM pg_settings
WHERE name IN ('autovacuum', 'autovacuum_vacuum_scale_factor');
-- 5. Test PgBouncer connection
-- psql -h pgbouncer-host -p 5432 -U raguser ragdb_read -c "SELECT 1"
-- 6. Test Redis cache
-- redis-cli -h redis-host PING -- should respond PONG
-- 7. Performance benchmark before go-live
-- target: p95 < 100ms for vector search, p99 < 500ms
-- 8. Verify replica lag
SELECT now() - pg_last_xact_replay_timestamp() AS lag;
-- target: lag < 5 seconds under normal operations
Architecture Comparison: PostgreSQL vs Dedicated Vector DBs
| Aspect | PostgreSQL + pgvector | Pinecone / Qdrant / Weaviate |
|---|---|---|
| Infrastructure complexity | Low: single system | High: additional system to manage |
| Cost for 10M queries/month | $200-400/month (fixed server) | $400-800/month (variable per query) |
| P95 latency (1M vectors) | 20-80ms (optimized HNSW) | 5-40ms (often in-memory) |
| Maximum scalability | ~500M-1B vectors practical | Billions of vectors |
| Built-in full-text search | Yes (tsvector, GIN index) | No (separate Elasticsearch needed) |
| ACID transactions | Yes (native) | Limited or absent |
| Team skill requirement | PostgreSQL DBA knowledge | API client only |
Series Summary: PostgreSQL for AI
Throughout this series, we have covered the complete technology stack for using PostgreSQL as the foundation for AI applications:
- pgvector: Installation, vector operators, first HNSW index
- Embeddings: Models (OpenAI, Sentence Transformers), distances, chunking strategies
- RAG: Complete ingestion-retrieval-generation pipeline
- Similarity Search: HNSW vs IVFFlat, ANN algorithms, MMR for diversity
- Advanced Indexing: Optimal parameters, monitoring, zero-downtime rebuild
- Production: PgBouncer, read replicas, partitioning, K8s, monitoring
The central message of 2026 remains: "Just Use Postgres". If you are already using PostgreSQL (and the vast majority of applications are), you already have everything needed to build sophisticated, scalable, and cost-effective RAG systems. Do not add a separate vector database service until you have exceeded 500M vectors or have extreme sub-millisecond latency requirements.
Resources and Documentation
- pgvector GitHub: github.com/pgvector/pgvector
- MTEB Leaderboard: huggingface.co/spaces/mteb/leaderboard
- RAGAS Framework: github.com/explodinggradients/ragas
- ANN Benchmarks: ann-benchmarks.com
- PgBouncer docs: pgbouncer.org/config.html
- Related AI Engineering series: Enterprise RAG Architecture
- Related Data and AI Business series: Data Infrastructure for AI