RAG in Production: Scalable Architecture with PostgreSQL
Building a working RAG prototype on your laptop is relatively straightforward. Taking it to production, where it must serve hundreds of concurrent users with sub-second latency, high availability, and controlled costs, is an entirely different engineering challenge.
In this final article of the series, we build a production RAG architecture on PostgreSQL: connection pooling with PgBouncer, read replicas dedicated to vector search, partitioning for datasets with billions of vectors, application-level caching with Redis, monitoring with Prometheus and Grafana, and Kubernetes deployment strategies for zero-downtime updates.
All while maintaining the "Just Use Postgres" philosophy of 2026: no separate vector database, no Pinecone, no Qdrant. PostgreSQL with pgvector, properly configured, beats specialized competitors on cost/performance in 90% of real-world cases.
Series Overview
| # | Article | Focus |
|---|---|---|
| 1 | pgvector | Installation, operators, indexing |
| 2 | Embeddings Deep Dive | Models, distances, generation |
| 3 | RAG with PostgreSQL | End-to-end RAG pipeline |
| 4 | Similarity Search | Algorithms and optimization |
| 5 | HNSW and IVFFlat | Advanced indexing strategies |
| 6 | You are here - RAG in Production | Scalability and performance |
What You Will Learn
- Reference architecture for enterprise RAG on PostgreSQL
- Connection pooling with PgBouncer: why and how to configure it
- Read replicas dedicated to vector search workloads
- Partitioning for million/billion-vector datasets
- Application-level caching with Redis for repeated queries
- Monitoring: essential metrics, Grafana dashboards, alerting
- Kubernetes deployment: StatefulSets, PVCs, rolling updates
- Multi-tenancy: isolating data across different clients
- PostgreSQL configuration tuning for production RAG
- Autovacuum and maintenance of vector indexes
Production RAG Reference Architecture
┌─────────────────────────────────────┐
│ Load Balancer │
│ (nginx / AWS ALB) │
└──────────────┬──────────────────────┘
│
┌──────────────▼──────────────────────┐
│ RAG API Service │
│ (FastAPI, K8s Deployment) │
│ Replicas: 3-10 pods │
└──┬─────────────────────┬────────────┘
│ │
┌────────▼────────┐ ┌────────▼────────┐
│ Redis Cache │ │ Embedding API │
│ (query cache, │ │ (OpenAI / local │
│ emb cache) │ │ inference) │
└─────────────────┘ └─────────────────┘
│
┌─────────────▼───────────────────────────┐
│ PgBouncer │
│ Connection Pool │
│ (transaction mode, pool_size=100) │
└──────────────┬──────────────────────────┘
│
┌──────────────▼──────────────────────────┐
│ PostgreSQL Primary │
│ (write: ingestion, updates) │
│ shared_buffers=32GB, m=16 │
└──────────────┬──────────────────────────┘
│ Streaming Replication
┌──────────────▼──────────────────────────┐
│ PostgreSQL Read Replica │
│ (read: vector search, RAG queries) │
│ Dedicated for ANN workloads │
│ ef_search tuned per use case │
└─────────────────────────────────────────┘
Connection Pooling with PgBouncer
PostgreSQL creates a process (fork) for each connection. With 200 concurrent users running vector queries, each connection uses ~5-10MB of RAM just for the process. PgBouncer maintains a pool of real database connections and queues application requests, reducing PostgreSQL connections from hundreds down to a few dozen.
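To put rough numbers on this (back-of-envelope arithmetic; the ~8MB-per-backend figure is an assumption, measure your own server):

```python
# Back-of-envelope RAM math for direct connections vs. pooled connections.
# MB_PER_BACKEND is an assumed average per PostgreSQL backend process.
MB_PER_BACKEND = 8

def backend_memory_mb(connections: int, mb_per_backend: int = MB_PER_BACKEND) -> int:
    """Approximate RAM consumed by PostgreSQL backend processes alone."""
    return connections * mb_per_backend

direct = backend_memory_mb(200)  # 200 clients connecting directly
pooled = backend_memory_mb(25)   # the same clients behind a 25-connection pool
print(f"direct: {direct} MB, pooled: {pooled} MB, saved: {direct - pooled} MB")
```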
Installation and Configuration
# pgbouncer.ini - Configuration for RAG workload
[databases]
# Primary database for writes (ingestion)
ragdb_write = host=postgres-primary port=5432 dbname=ragdb user=raguser
# Read replica for vector search
ragdb_read = host=postgres-replica port=5432 dbname=ragdb user=raguser
[pgbouncer]
# Pool mode: transaction is most efficient for short queries
# - session: connection reserved for entire session
# - transaction: connection released after each transaction (OPTIMAL for RAG)
# - statement: release after each statement (incompatible with multi-statement txns)
pool_mode = transaction
# Maximum real connections to PostgreSQL per database
max_db_connections = 50
# Connections per pool (per user+database combination)
default_pool_size = 25
# Total client connections toward PgBouncer
max_client_conn = 1000
# Timeout and resilience
query_wait_timeout = 30 # seconds before "query too long" error
server_idle_timeout = 600 # close idle connections after 10 minutes
client_idle_timeout = 0 # no client timeout (managed by the app)
# Authentication (scram-sha-256 is preferred on PostgreSQL 14+; md5 is legacy)
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt
# Monitoring
stats_period = 60
logfile = /var/log/pgbouncer/pgbouncer.log
pidfile = /var/run/pgbouncer/pgbouncer.pid
listen_addr = 0.0.0.0
listen_port = 5432
Usage in Python with Connection Pool
import psycopg2.pool
from contextlib import contextmanager
import threading
class DatabasePool:
"""
Thread-safe connection pool toward PgBouncer.
Separate connections for write (primary) and read (replica).
"""
def __init__(self, write_dsn: str, read_dsn: str,
min_conn: int = 2, max_conn: int = 20):
self._write_pool = psycopg2.pool.ThreadedConnectionPool(
minconn=min_conn, maxconn=max_conn, dsn=write_dsn
)
self._read_pool = psycopg2.pool.ThreadedConnectionPool(
minconn=min_conn, maxconn=max_conn, dsn=read_dsn
)
self._lock = threading.Lock()
@contextmanager
def get_write_conn(self):
"""Connection for write operations (primary)."""
conn = self._write_pool.getconn()
try:
yield conn
conn.commit()
except Exception:
conn.rollback()
raise
finally:
self._write_pool.putconn(conn)
@contextmanager
def get_read_conn(self):
"""Connection for read operations (read replica)."""
conn = self._read_pool.getconn()
try:
            # SET LOCAL scopes ef_search to the current transaction; with
            # PgBouncer in transaction mode a session-level SET could leak
            # to whichever client next reuses this server connection
            with conn.cursor() as cur:
                cur.execute("SET LOCAL hnsw.ef_search = 60")
yield conn
finally:
conn.rollback() # rollback to release locks
self._read_pool.putconn(conn)
# Initialize the pool once at application startup
db_pool = DatabasePool(
write_dsn="host=pgbouncer port=5432 dbname=ragdb_write user=raguser password=xxx",
read_dsn="host=pgbouncer port=5432 dbname=ragdb_read user=raguser password=xxx"
)
# Usage in API endpoints:
async def search_endpoint(query: str):
with db_pool.get_read_conn() as conn:
results = semantic_search(conn, query)
return results
async def ingest_endpoint(document_path: str):
with db_pool.get_write_conn() as conn:
ingest_document(conn, document_path)
Monitoring PgBouncer
-- Connect to the PgBouncer admin console on its listen_port
-- (6432 by default; 5432 in the configuration above):
-- psql -h localhost -p 5432 -U pgbouncer pgbouncer
-- Real-time pool statistics
SHOW POOLS;
-- Output:
-- database | user | cl_active | cl_waiting | sv_active | sv_idle | maxwait
-- ragdb_read | raguser | 23 | 0 | 23 | 2 | 0
-- ragdb_write| raguser | 5 | 0 | 5 | 0 | 0
-- Aggregate statistics per database
SHOW STATS;
-- total_query_count | total_query_time | avg_query | avg_wait
-- 1248532 | 1832743.2 | 1.46 | 0.02
-- List of current client connections
SHOW CLIENTS;
-- List of current server connections to PostgreSQL
SHOW SERVERS;
-- Alert: if cl_waiting > 0 for more than 5 seconds, the pool is saturated
-- Solution: increase default_pool_size or max_db_connections
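The saturation rule in the comment above can be expressed as a tiny helper; the parameter names mirror SHOW POOLS columns, but the function itself is illustrative, not part of PgBouncer:

```python
def pool_saturated(cl_waiting: int, maxwait_seconds: float,
                   wait_threshold: float = 5.0) -> bool:
    """True when clients are queueing (cl_waiting > 0) and the oldest
    has waited longer than wait_threshold seconds."""
    return cl_waiting > 0 and maxwait_seconds > wait_threshold

assert not pool_saturated(cl_waiting=0, maxwait_seconds=0.0)   # healthy pool
assert not pool_saturated(cl_waiting=3, maxwait_seconds=1.2)   # brief spike, OK
assert pool_saturated(cl_waiting=12, maxwait_seconds=8.5)      # grow default_pool_size
```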
Read Replicas for Vector Search
Vector queries (ANN search) are CPU and memory intensive but do not require the freshest data. Isolating them on a dedicated read replica frees the primary for ingestion operations (INSERT) and keeps latency low for both workloads.
Setting Up Read Replicas with Streaming Replication
# On the PRIMARY server
# postgresql.conf
wal_level = replica # enable WAL for replication
max_wal_senders = 5 # max 5 replicas connected
wal_keep_size = '1GB' # keep at least 1GB of WAL for replicas
# pg_hba.conf - allow the replication connection (use the replica's IP in CIDR form)
host replication repuser <replica-ip>/32 scram-sha-256
# Create the replication user (run via psql on the primary):
# CREATE USER repuser REPLICATION LOGIN PASSWORD 'strongpassword';
# ===========================
# On the REPLICA server
# ===========================
# Clone the primary:
pg_basebackup -h primary-host -U repuser -D /var/lib/postgresql/16/main -P -Xs -R
# The -R flag automatically creates standby.signal and recovery parameters
# postgresql.conf on the replica - optimized for vector search:
hot_standby = on # allows reads on the replica
shared_buffers = '32GB' # plenty of memory for the HNSW index
effective_cache_size = '96GB' # hint to the planner about total memory
max_parallel_workers_per_gather = 4 # parallel vector queries
hnsw.ef_search = 60 # global parameter for the replica
random_page_cost = 1.1 # SSD: low cost for random I/O
effective_io_concurrency = 200 # for NVMe drives
Monitoring the Replica
-- On primary: status of connected replicas
SELECT
client_addr AS replica_address,
application_name AS replica_name,
state AS state,
sent_lsn AS wal_sent,
write_lsn AS wal_written,
flush_lsn AS wal_flushed,
replay_lsn AS wal_replayed,
  -- Lag in bytes (pg_lsn subtraction, PostgreSQL 14+):
  sent_lsn - replay_lsn AS lag_bytes,
  -- Lag as time intervals (columns already computed by PostgreSQL):
  write_lag,
  replay_lag
FROM pg_stat_replication;
-- On the replica: verify it is replaying WAL (result is an interval)
SELECT now() - pg_last_xact_replay_timestamp() AS replica_lag;
-- Alert: if replica lag > 30 seconds, possible replication issue
-- Check pg_stat_wal_receiver on the replica:
SELECT status, received_lsn, last_msg_receipt_time, latest_end_lsn
FROM pg_stat_wal_receiver;
Table Partitioning for Large-Scale Datasets
With tens of millions of vectors, a single table becomes difficult to manage. Partitioning divides the table into physically separate parts, enabling faster queries (partition pruning) and fine-grained maintenance at the partition level.
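The monthly-range arithmetic behind such partitioning can also be sketched application-side; this hypothetical helper emits one partition's DDL (names and bounds are illustrative):

```python
from datetime import date

def month_partition_ddl(table: str, year: int, month: int) -> str:
    """DDL for one monthly RANGE partition of `table` (illustrative helper)."""
    start = date(year, month, 1)
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return (f"CREATE TABLE {table}_{start:%Y_%m} PARTITION OF {table} "
            f"FOR VALUES FROM ('{start}') TO ('{end}')")

print(month_partition_ddl("documents_partitioned", 2026, 1))
```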
Time-Based Partitioning (Time-Series RAG)
-- Monthly partitioning: ideal for timestamped documents (news, emails, logs)
CREATE TABLE documents_partitioned (
id BIGSERIAL,
source_path TEXT NOT NULL,
source_type TEXT,
content TEXT NOT NULL,
embedding vector(1536),
metadata JSONB DEFAULT '{}',
created_at TIMESTAMPTZ DEFAULT NOW(),
PRIMARY KEY (id, created_at) -- partition column must be in PK
) PARTITION BY RANGE (created_at);
-- Create partitions for the last 12 months
DO $$
DECLARE
d DATE;
BEGIN
FOR i IN 0..11 LOOP
d := DATE_TRUNC('month', NOW()) - (i || ' months')::INTERVAL;
EXECUTE format(
'CREATE TABLE documents_%s PARTITION OF documents_partitioned
FOR VALUES FROM (%L) TO (%L)',
TO_CHAR(d, 'YYYY_MM'),
d,
d + INTERVAL '1 month'
);
-- HNSW index on each partition
EXECUTE format(
'CREATE INDEX ON documents_%s USING hnsw (embedding vector_cosine_ops) WITH (m=16, ef_construction=64)',
TO_CHAR(d, 'YYYY_MM')
);
RAISE NOTICE 'Partition created: documents_%', TO_CHAR(d, 'YYYY_MM');
END LOOP;
END $$;
-- Automatic partition pruning (PostgreSQL uses only relevant partitions):
EXPLAIN (ANALYZE)
SELECT id, content, embedding <=> query_vec AS dist
FROM documents_partitioned
WHERE created_at >= '2026-01-01' AND created_at < '2026-02-01' -- only Jan 2026 partition
ORDER BY embedding <=> query_vec
LIMIT 10;
-- Output: scans only documents_2026_01, ignores the other 11 partitions
Multi-Tenant Partitioning
-- Partition by tenant_id: isolate each client's data
CREATE TABLE documents_multitenant (
id BIGSERIAL,
tenant_id TEXT NOT NULL,
content TEXT NOT NULL,
embedding vector(1536),
metadata JSONB DEFAULT '{}',
PRIMARY KEY (id, tenant_id)
) PARTITION BY LIST (tenant_id);
-- Create partition for each tenant
CREATE TABLE docs_tenant_acme PARTITION OF documents_multitenant FOR VALUES IN ('acme');
CREATE TABLE docs_tenant_globex PARTITION OF documents_multitenant FOR VALUES IN ('globex');
CREATE TABLE docs_tenant_initech PARTITION OF documents_multitenant FOR VALUES IN ('initech');
-- Default partition for new tenants
CREATE TABLE docs_tenant_default PARTITION OF documents_multitenant DEFAULT;
-- HNSW index for each tenant
CREATE INDEX ON docs_tenant_acme USING hnsw (embedding vector_cosine_ops) WITH (m=16, ef_construction=64);
CREATE INDEX ON docs_tenant_globex USING hnsw (embedding vector_cosine_ops) WITH (m=16, ef_construction=64);
CREATE INDEX ON docs_tenant_initech USING hnsw (embedding vector_cosine_ops) WITH (m=16, ef_construction=64);
-- Query with automatic tenant isolation (partition pruning):
SELECT id, content, embedding <=> query_vec AS dist
FROM documents_multitenant
WHERE tenant_id = 'acme' -- searches ONLY in acme's partition
ORDER BY embedding <=> query_vec
LIMIT 5;
-- Row Level Security (RLS) as a second security layer:
ALTER TABLE documents_multitenant ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON documents_multitenant
USING (tenant_id = current_setting('app.current_tenant'));
-- In Python, set the tenant context. Note that SET cannot take bind
-- parameters, so use set_config() for a parameterized statement:
with conn.cursor() as cur:
    cur.execute("SELECT set_config('app.current_tenant', %s, false)", (tenant_id,))
# All subsequent queries in this session only see the current tenant's documents
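PostgreSQL's SET statement cannot take bind parameters, so the parameterizable way to set the tenant GUC is set_config(). A minimal helper sketch (the regex allowlist is an illustrative extra safeguard, not something PostgreSQL requires):

```python
import re

TENANT_RE = re.compile(r"^[a-z0-9_]{1,64}$")

def tenant_context_stmt(tenant_id: str) -> tuple:
    """Parameterized statement that sets app.current_tenant for RLS.
    set_config(..., false) applies for the session; pass true for
    transaction scope (useful behind a transaction-mode pooler)."""
    if not TENANT_RE.match(tenant_id):
        raise ValueError(f"invalid tenant id: {tenant_id!r}")
    return "SELECT set_config('app.current_tenant', %s, false)", (tenant_id,)

# Usage with any DB-API cursor:  cur.execute(*tenant_context_stmt("acme"))
sql, params = tenant_context_stmt("acme")
print(sql, params)
```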
Caching with Redis
Many RAG queries repeat: same question, same context. Implementing a two-level cache - embedding cache (avoids recomputing the embedding for the same query) and result cache (avoids re-running the vector search for identical queries) - can reduce latency by 90% and OpenAI costs by 70%.
import redis
import json
import hashlib
from typing import Optional
class RAGCache:
"""
Two-level cache:
1. Embedding cache: for query vectors
2. Result cache: for complete RAG results
"""
def __init__(self, redis_url: str = "redis://localhost:6379"):
self.r = redis.from_url(redis_url, decode_responses=True)
# TTL for different cache types
self.embedding_ttl = 3600 * 24 * 7 # 7 days (embeddings are stable)
self.result_ttl = 3600 # 1 hour (results may become stale)
def _cache_key(self, prefix: str, *args) -> str:
"""Generate a deterministic cache key."""
content = json.dumps(args, sort_keys=True)
hash_val = hashlib.sha256(content.encode()).hexdigest()[:16]
return f"rag:{prefix}:{hash_val}"
# === EMBEDDING CACHE ===
def get_embedding(self, text: str) -> Optional[list[float]]:
key = self._cache_key("emb", text)
cached = self.r.get(key)
if cached:
return json.loads(cached)
return None
def set_embedding(self, text: str, embedding: list[float]) -> None:
key = self._cache_key("emb", text)
self.r.setex(key, self.embedding_ttl, json.dumps(embedding))
# === RESULT CACHE ===
def get_rag_result(self, query: str, filters: dict = None) -> Optional[dict]:
key = self._cache_key("result", query, filters or {})
cached = self.r.get(key)
if cached:
result = json.loads(cached)
result["from_cache"] = True
return result
return None
def set_rag_result(self, query: str, result: dict, filters: dict = None) -> None:
key = self._cache_key("result", query, filters or {})
# Do not cache responses with empty sources or errors
if result.get("sources") and result.get("answer"):
self.r.setex(key, self.result_ttl, json.dumps(result))
# === CACHE STATISTICS ===
def get_stats(self) -> dict:
info = self.r.info("stats")
        hits = info.get("keyspace_hits", 0)
        total = hits + info.get("keyspace_misses", 0)
        hit_rate = hits / total if total > 0 else 0
return {
"hit_rate": round(hit_rate, 3),
"total_queries": total,
"memory_used": self.r.info("memory")["used_memory_human"],
"rag_keys": self.r.dbsize()
}
# Integration in the RAG System
cache = RAGCache("redis://redis:6379")
def cached_embed(text: str, client) -> list[float]:
"""Generate or retrieve from cache the embedding for a text."""
cached = cache.get_embedding(text)
if cached:
return cached
embedding = client.embeddings.create(
input=[text], model="text-embedding-3-small"
).data[0].embedding
cache.set_embedding(text, embedding)
return embedding
def cached_rag_query(rag_system, question: str, filters: dict = None) -> dict:
"""Run a RAG query with result caching."""
# Check result cache first
cached = cache.get_rag_result(question, filters)
if cached:
return cached
# Run the full query
result = rag_system.ask(question, filters=filters)
# Save to cache
cache.set_rag_result(question, result, filters)
return result
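The key scheme in RAGCache can be exercised without a Redis server; this standalone copy of the hashing logic demonstrates that identical (query, filters) pairs always map to the same key:

```python
import hashlib
import json

def cache_key(prefix: str, *args) -> str:
    """Deterministic key: JSON with sorted dict keys, SHA-256, truncated hash."""
    content = json.dumps(args, sort_keys=True)
    return f"rag:{prefix}:{hashlib.sha256(content.encode()).hexdigest()[:16]}"

k1 = cache_key("result", "what is pgvector?", {"source_type": "docs"})
k2 = cache_key("result", "what is pgvector?", {"source_type": "docs"})
k3 = cache_key("result", "what is pgvector?", {"source_type": "email"})
assert k1 == k2   # same question + filters -> cache hit
assert k1 != k3   # different filters -> separate cache entry
```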
FastAPI: Serving RAG Asynchronously
from fastapi import FastAPI, HTTPException, BackgroundTasks
from pydantic import BaseModel
import asyncio
from concurrent.futures import ThreadPoolExecutor
app = FastAPI(title="RAG API", version="1.0.0")
executor = ThreadPoolExecutor(max_workers=10)
class QueryRequest(BaseModel):
    question: str
    top_k: int = 5
    # Optional fields need an explicit `| None` annotation under Pydantic v2
    source_type: str | None = None
    use_hybrid: bool = True
    tenant_id: str | None = None
class QueryResponse(BaseModel):
answer: str
sources: list[dict]
from_cache: bool = False
retrieval_ms: float
generation_ms: float
@app.post("/api/query", response_model=QueryResponse)
async def query_rag(request: QueryRequest):
"""
Query endpoint with async processing and caching.
"""
import time
# 1. Check cache
cached = cache.get_rag_result(request.question, {"source_type": request.source_type})
if cached:
return QueryResponse(
answer=cached["answer"],
sources=cached["sources"],
from_cache=True,
retrieval_ms=0,
generation_ms=0
)
# 2. Query embedding (with cache)
t0 = time.time()
# Run in thread to avoid blocking the event loop
    query_vec = await asyncio.get_running_loop().run_in_executor(
executor,
lambda: cached_embed(request.question, openai_client)
)
# 3. Vector search on PostgreSQL read replica
with db_pool.get_read_conn() as conn:
if request.tenant_id:
            # Set tenant context for RLS (SET cannot take bind parameters,
            # so use set_config for a parameterized statement)
            with conn.cursor() as cur:
                cur.execute(
                    "SELECT set_config('app.current_tenant', %s, false)",
                    (request.tenant_id,)
                )
chunks = hybrid_search(conn, query_vec, request.top_k)
retrieval_ms = (time.time() - t0) * 1000
# 4. Response generation
t1 = time.time()
    response = await asyncio.get_running_loop().run_in_executor(
executor,
lambda: rag_generator.generate(request.question, chunks)
)
generation_ms = (time.time() - t1) * 1000
result = {
"answer": response.answer,
"sources": response.sources
}
cache.set_rag_result(request.question, result, {"source_type": request.source_type})
return QueryResponse(
answer=response.answer,
sources=response.sources,
from_cache=False,
retrieval_ms=round(retrieval_ms, 1),
generation_ms=round(generation_ms, 1)
)
@app.get("/api/health")
async def health():
return {"status": "ok", "cache": cache.get_stats()}
Monitoring: Essential Metrics
Prometheus Metrics in Python
from prometheus_client import Counter, Histogram, Gauge, start_http_server
# Application metrics
rag_queries_total = Counter(
"rag_queries_total",
"Total number of RAG queries",
["status", "from_cache"]
)
rag_query_duration_seconds = Histogram(
"rag_query_duration_seconds",
"RAG query duration in seconds",
["phase"], # "retrieval", "generation", "total"
buckets=[0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0]
)
rag_retrieval_chunks_count = Histogram(
"rag_retrieval_chunks_count",
"Number of chunks retrieved per query",
buckets=[1, 2, 3, 5, 10, 20]
)
rag_retrieval_top_similarity = Gauge(
"rag_retrieval_top_similarity",
"Top similarity score of last retrieval"
)
# PostgreSQL metrics (via pg_stat_activity)
db_active_connections = Gauge(
"db_active_connections",
"Active database connections",
["pool"] # "write", "read"
)
def track_rag_query(func):
"""Decorator to track RAG query metrics."""
def wrapper(*args, **kwargs):
import time
start = time.time()
try:
result = func(*args, **kwargs)
rag_queries_total.labels(
status="success",
from_cache=str(result.get("from_cache", False))
).inc()
return result
except Exception as e:
rag_queries_total.labels(status="error", from_cache="false").inc()
raise
finally:
rag_query_duration_seconds.labels("total").observe(time.time() - start)
return wrapper
# Expose a /metrics endpoint for Prometheus to scrape (any free port works)
start_http_server(9090)
Grafana Dashboard: SQL Queries for PostgreSQL Metrics
-- Grafana dashboard queries: PostgreSQL vector metrics
-- 1. Per-statement latency percentiles
--    (pg_stat_statements aggregates cumulatively per normalized query;
--     on PostgreSQL 13+ the column is mean_exec_time, not total_time)
SELECT
  percentile_cont(0.5) WITHIN GROUP (ORDER BY mean_exec_time) AS p50_ms,
  percentile_cont(0.95) WITHIN GROUP (ORDER BY mean_exec_time) AS p95_ms,
  percentile_cont(0.99) WITHIN GROUP (ORDER BY mean_exec_time) AS p99_ms,
  COUNT(*) AS n_queries
FROM pg_stat_statements
WHERE query LIKE '%<=>%' -- statements using the vector distance operator
  AND calls > 0;
-- 2. Cache hit ratio for vector indexes
SELECT
indexrelname AS index_name,
idx_blks_read AS disk_reads,
idx_blks_hit AS cache_hits,
ROUND(idx_blks_hit::numeric / NULLIF(idx_blks_read + idx_blks_hit, 0) * 100, 2)
AS cache_hit_pct
FROM pg_statio_user_indexes
WHERE indexrelname LIKE '%hnsw%' OR indexrelname LIKE '%ivfflat%';
-- 3. Active connections by state
SELECT
state,
COUNT(*) AS connections,
MAX(now() - state_change) AS max_wait_time
FROM pg_stat_activity
WHERE datname = 'ragdb'
GROUP BY state;
-- 4. Tuples inserted/read per second (load indicator)
SELECT
relname,
n_tup_ins + n_tup_upd + n_tup_del AS writes_per_snapshot,
seq_tup_read + idx_tup_fetch AS reads_per_snapshot
FROM pg_stat_user_tables
WHERE relname LIKE '%document%';
PostgreSQL Configuration for Production RAG
PostgreSQL's default configuration is not optimized for vector workloads. Here are the critical parameters to modify in postgresql.conf for a production RAG system with 32GB+ of RAM:
# postgresql.conf - Production configuration for RAG on PostgreSQL 16
# System: 64GB RAM, 16 CPU cores, NVMe SSD
# === MEMORY ===
shared_buffers = '16GB' # 25% of total RAM
effective_cache_size = '48GB' # 75% of RAM (planner hint, not allocated memory)
work_mem = '64MB' # for sort/hash in queries (per-connection!)
maintenance_work_mem = '2GB' # for CREATE INDEX HNSW, VACUUM, pg_dump
max_worker_processes = 16
max_parallel_workers = 8
max_parallel_workers_per_gather = 4
# === STORAGE / I/O ===
random_page_cost = 1.1 # SSD: reduce from 4.0 to favor index scans
effective_io_concurrency = 200 # NVMe: can handle many parallel requests
checkpoint_completion_target = 0.9 # spread checkpoints over time
wal_buffers = '64MB' # default -1 (auto) is usually fine; raised here for heavy ingestion
# === AUTOVACUUM (CRITICAL for vector tables with many INSERT/DELETE) ===
autovacuum_vacuum_cost_delay = '2ms' # more aggressive than default 20ms
autovacuum_vacuum_scale_factor = 0.05 # trigger at 5% dead tuples (default 20%)
autovacuum_analyze_scale_factor = 0.02 # trigger analyze at 2% new tuples
autovacuum_vacuum_cost_limit = 400 # more work per autovacuum worker
# === LOGGING ===
log_min_duration_statement = 1000 # log queries > 1 second
log_lock_waits = on
track_io_timing = on # needed for accurate EXPLAIN (BUFFERS)
# === EXTENSIONS ===
shared_preload_libraries = 'pg_stat_statements,vector'
# === pgvector SPECIFIC ===
# hnsw.ef_search: set at session level in application code
# Do not set globally: each use case has its own optimal value
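One number in the block above deserves a sanity check: work_mem is allocated per sort or hash node, per connection, so the worst case multiplies. A rough bound (assumed figures, adjust for your server):

```python
def worst_case_work_mem_gb(connections: int, work_mem_mb: int,
                           nodes_per_query: int = 2) -> float:
    """Pessimistic bound: every connection runs a query with
    nodes_per_query concurrent sort/hash operations at full work_mem."""
    return connections * work_mem_mb * nodes_per_query / 1024

# 100 pooled connections x 64MB x 2 nodes: ~12.5GB on top of shared_buffers
print(worst_case_work_mem_gb(100, 64))
```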
Autovacuum and Vector Index Maintenance
Tables with embeddings undergo frequent UPDATE and DELETE operations during the RAG pipeline (document updates, re-chunking, deletion of obsolete documents). Without properly configured autovacuum, table bloat degrades vector search performance significantly.
-- Check table bloat for the documents table
SELECT
schemaname,
tablename,
n_live_tup AS live_rows,
n_dead_tup AS dead_rows,
ROUND(n_dead_tup::numeric / NULLIF(n_live_tup + n_dead_tup, 0) * 100, 2) AS pct_dead,
last_vacuum,
last_autovacuum,
last_analyze
FROM pg_stat_user_tables
WHERE tablename LIKE '%document%';
-- Alert: pct_dead > 5% means a manual VACUUM is needed
-- Manual VACUUM to reclaim space (does not block reads):
VACUUM (VERBOSE, ANALYZE) documents;
-- VACUUM FULL rewrites the table under an ACCESS EXCLUSIVE lock
-- (blocks reads AND writes - run only during a maintenance window):
VACUUM FULL documents;
-- Set aggressive autovacuum for the documents table
ALTER TABLE documents SET (
autovacuum_vacuum_scale_factor = 0.02, -- 2% dead rows = trigger vacuum
autovacuum_analyze_scale_factor = 0.01, -- 1% new rows = trigger analyze
autovacuum_vacuum_cost_delay = '2ms'
);
-- Check HNSW index size after many INSERT/DELETE operations
-- The HNSW graph keeps growing with inserts, and space is not reclaimed after DELETE
SELECT
indexname,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
pg_size_pretty(pg_total_relation_size(indexrelid)) AS total_size
FROM pg_stat_user_indexes
WHERE tablename = 'documents'
ORDER BY pg_relation_size(indexrelid) DESC;
-- Rebuild index without downtime (to reclaim space after many DELETEs):
REINDEX INDEX CONCURRENTLY documents_hnsw_idx;
Kubernetes Deployment
StatefulSet for PostgreSQL with PVC
# kubernetes/postgres-statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: postgres-vector
namespace: rag-production
spec:
serviceName: postgres-vector
  replicas: 2 # primary + 1 read replica; NOTE: a bare StatefulSet does not configure
              # streaming replication by itself - use an operator (CloudNativePG, Patroni)
selector:
matchLabels:
app: postgres-vector
template:
metadata:
labels:
app: postgres-vector
spec:
containers:
- name: postgres
image: pgvector/pgvector:pg16
env:
- name: POSTGRES_DB
value: ragdb
- name: POSTGRES_USER
valueFrom:
secretKeyRef:
name: postgres-secret
key: username
- name: POSTGRES_PASSWORD
valueFrom:
secretKeyRef:
name: postgres-secret
key: password
          # NOTE: the official postgres image ignores tuning env vars such as
          # POSTGRES_SHARED_BUFFERS; pass server settings as flags instead
          args: ["-c", "shared_buffers=8GB", "-c", "maintenance_work_mem=2GB"]
resources:
requests:
memory: "32Gi"
cpu: "8"
limits:
memory: "64Gi"
cpu: "16"
volumeMounts:
- name: postgres-data
mountPath: /var/lib/postgresql/data
- name: postgres-config
mountPath: /etc/postgresql/custom.conf
subPath: custom.conf
volumeClaimTemplates:
- metadata:
name: postgres-data
spec:
accessModes: ["ReadWriteOnce"]
storageClassName: fast-ssd # NVMe SSD for optimal performance
resources:
requests:
storage: 500Gi # 500GB for large HNSW indexes
RAG API Deployment with Rolling Updates
# kubernetes/rag-api-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: rag-api
namespace: rag-production
spec:
replicas: 5 # 5 parallel pods
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 2 # create 2 new pods before removing old ones
maxUnavailable: 0 # zero downtime update
selector:
matchLabels:
app: rag-api
  template:
    metadata:
      labels:
        app: rag-api # must match spec.selector.matchLabels
    spec:
containers:
- name: rag-api
image: your-registry/rag-api:latest
ports:
- containerPort: 8000
env:
- name: DB_WRITE_URL
valueFrom:
secretKeyRef:
name: db-secret
key: write-url
- name: DB_READ_URL
valueFrom:
secretKeyRef:
name: db-secret
key: read-url
- name: REDIS_URL
value: "redis://redis-service:6379"
- name: OPENAI_API_KEY
valueFrom:
secretKeyRef:
name: openai-secret
key: api-key
resources:
requests:
memory: "2Gi"
cpu: "1"
limits:
memory: "4Gi"
cpu: "2"
livenessProbe:
httpGet:
path: /api/health
port: 8000
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /api/health
port: 8000
initialDelaySeconds: 10
periodSeconds: 5
Cost Comparison: PostgreSQL vs Specialized Vector DBs
| Solution | 1M vectors/month | 10M queries/month | P95 Latency | Notes |
|---|---|---|---|---|
| PostgreSQL + pgvector (self-hosted) | $200-400 (EC2/GKE) | Included | 20-80ms | Fixed cost, full control |
| Pinecone (managed) | $70-700 | $0.04-0.08/1000 queries | 10-40ms | Variable cost, less control |
| Supabase Vector | $25-200 | Included | 30-100ms | Managed PostgreSQL + pgvector |
| Qdrant (self-hosted) | $150-300 | Included | 5-20ms | Faster but additional infrastructure |

One widely cited 2025 benchmark (benchmark.vector.dev) reports pgvector up to 28x faster than Pinecone at 16x lower cost; treat vendor benchmarks as directional rather than universal.
When NOT to Use PostgreSQL for Vector Search
- Billions of vectors: Beyond 500M-1B vectors, Qdrant or Weaviate become more memory-efficient through native vector quantization.
- Sub-millisecond latency: If you need <5ms P99, an in-memory specialized vector database (Qdrant cloud, Pinecone serverless) is more appropriate.
- No DBA expertise: If your team lacks PostgreSQL skills, a managed service like Pinecone reduces operational complexity significantly.
- Multi-modal search: Images + text + audio with different embedding dimensions require architectures PostgreSQL does not natively handle well.
Backup and Disaster Recovery for RAG Knowledge Base
A RAG knowledge base contains both the original documents and their vector embeddings. Backup strategy must cover both, but the optimal approach depends on dataset size and update frequency:
# Backup strategy for RAG knowledge base on PostgreSQL
# 1. Full logical backup (pg_dump):
# Suitable for databases up to ~50GB. Includes all data, indexes, schema.
# --format=custom is a compressed binary format; --no-privileges skips user grants
pg_dump -h postgres-primary \
        -U raguser \
        -d ragdb \
        --format=custom \
        --compress=9 \
        --no-privileges \
        --file=ragdb_backup_$(date +%Y%m%d_%H%M%S).dump
# Restore (parallel across 8 jobs):
pg_restore -h postgres-new \
           -U raguser \
           -d ragdb \
           --jobs=8 \
           ragdb_backup.dump
# 2. Physical backup (pg_basebackup) for large databases:
# Faster but requires more space. Uses WAL for point-in-time recovery.
# --wal-method=stream includes the WAL generated while the backup runs
pg_basebackup -h postgres-primary \
              -U repuser \
              -D /backup/ragdb_base_$(date +%Y%m%d) \
              --format=tar \
              --compress=9 \
              --wal-method=stream \
              --checkpoint=fast
# 3. Embeddings-only export (if source documents are stored elsewhere):
psql -h postgres-primary -U raguser ragdb -c \
"COPY (SELECT id, source_path, chunk_index, content, embedding::text, metadata, ingested_at
FROM rag_documents) TO STDOUT CSV HEADER" \
| gzip > rag_embeddings_$(date +%Y%m%d).csv.gz
# Size estimates:
# 1M vectors at 1536 dims in float32 = ~6.1GB for embeddings alone
# + text content: ~0.8GB for 1M chunks at ~800 chars each
# Total expected: ~7GB for 1M chunked documents
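The size estimates above are straightforward arithmetic; a quick check in Python (decimal GB, ignoring pgvector's small per-vector header and TOAST overhead):

```python
def embedding_gb(n_vectors: int, dims: int, bytes_per_dim: int = 4) -> float:
    """Raw float32 embedding payload in decimal GB (headers ignored)."""
    return n_vectors * dims * bytes_per_dim / 1e9

def text_gb(n_chunks: int, avg_chars: int) -> float:
    """Approximate text payload in decimal GB, assuming ~1 byte/char."""
    return n_chunks * avg_chars / 1e9

print(embedding_gb(1_000_000, 1536))  # ~6.1GB of embeddings
print(text_gb(1_000_000, 800))        # ~0.8GB of chunk text
```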
Alerting: When Things Go Wrong
-- Alert queries for Prometheus/Grafana alerting
-- Run every 60 seconds as a Prometheus exporter
-- 1. Alert: vector query p99 latency > 500ms
--    (pg_stat_statements is cumulative; use mean_exec_time on PostgreSQL 13+)
SELECT
  CASE
    WHEN percentile_cont(0.99) WITHIN GROUP (ORDER BY mean_exec_time) > 500
    THEN 'ALERT'
    ELSE 'OK'
  END AS status,
  percentile_cont(0.99) WITHIN GROUP (ORDER BY mean_exec_time) AS p99_ms
FROM pg_stat_statements
WHERE query LIKE '%<=>%'
  AND calls > 0;
-- 2. Alert: replica lag > 30 seconds
SELECT
CASE
WHEN EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) > 30
THEN 'ALERT'
ELSE 'OK'
END AS replica_status,
EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag_seconds;
-- 3. Alert: too many dead tuples (bloat > 10%)
SELECT
tablename,
CASE
WHEN n_dead_tup::float / NULLIF(n_live_tup + n_dead_tup, 0) > 0.1
THEN 'ALERT'
ELSE 'OK'
END AS bloat_status,
ROUND(n_dead_tup::numeric / NULLIF(n_live_tup + n_dead_tup, 0) * 100, 2) AS dead_pct
FROM pg_stat_user_tables
WHERE tablename LIKE '%document%';
-- 4. Alert: vector index cache hit ratio < 90%
SELECT
indexrelname,
CASE
WHEN idx_blks_hit::float / NULLIF(idx_blks_read + idx_blks_hit, 0) < 0.9
THEN 'ALERT'
ELSE 'OK'
END AS cache_status,
ROUND(idx_blks_hit::numeric / NULLIF(idx_blks_read + idx_blks_hit, 0) * 100, 2) AS hit_pct
FROM pg_statio_user_indexes
WHERE indexrelname LIKE '%hnsw%';
-- 5. Alert: connections approaching the limit (>85% of max_connections)
SELECT
CASE
WHEN COUNT(*) > (SELECT setting::int FROM pg_settings WHERE name = 'max_connections') * 0.85
THEN 'ALERT'
ELSE 'OK'
END AS conn_status,
COUNT(*) AS active_connections
FROM pg_stat_activity;
Production Readiness Checklist
-- PRE-DEPLOY CHECKLIST for RAG on PostgreSQL
-- 1. Verify installed extensions
SELECT extname, extversion FROM pg_extension WHERE extname IN ('vector', 'pg_stat_statements');
-- 2. Verify vector indexes and their sizes
SELECT
tablename, indexname,
pg_size_pretty(pg_relation_size(indexrelid)) AS size
FROM pg_stat_user_indexes
WHERE indexname LIKE '%hnsw%' OR indexname LIKE '%ivfflat%';
-- 3. Verify memory configuration (for prod with 32GB RAM)
SHOW shared_buffers; -- should be >= 8GB
SHOW work_mem; -- should be >= 16MB
SHOW maintenance_work_mem; -- should be >= 512MB
-- 4. Verify autovacuum settings
SELECT name, setting
FROM pg_settings
WHERE name IN ('autovacuum', 'autovacuum_vacuum_scale_factor');
-- 5. Test PgBouncer connection
-- psql -h pgbouncer-host -p 5432 -U raguser ragdb_read -c "SELECT 1"
-- 6. Test Redis cache
-- redis-cli -h redis-host PING -- should respond PONG
-- 7. Performance benchmark before go-live
-- target: p95 < 100ms for vector search, p99 < 500ms
-- 8. Verify replica lag
SELECT now() - pg_last_xact_replay_timestamp() AS lag;
-- target: lag < 5 seconds under normal operations
Architecture Comparison: PostgreSQL vs Dedicated Vector DBs
| Aspect | PostgreSQL + pgvector | Pinecone / Qdrant / Weaviate |
|---|---|---|
| Infrastructure complexity | Low: single system | High: additional system to manage |
| Cost for 10M queries/month | $200-400/month (fixed server) | $400-800/month (variable per query) |
| P95 latency (1M vectors) | 20-80ms (optimized HNSW) | 5-40ms (often in-memory) |
| Maximum scalability | ~500M-1B vectors practical | Billions of vectors |
| Built-in full-text search | Yes (tsvector, GIN index) | No (separate Elasticsearch needed) |
| ACID transactions | Yes (native) | Limited or absent |
| Team skill requirement | PostgreSQL DBA knowledge | API client only |
Series Summary: PostgreSQL for AI
Throughout this series, we have covered the complete technology stack for using PostgreSQL as the foundation for AI applications:
- pgvector: Installation, vector operators, first HNSW index
- Embeddings: Models (OpenAI, Sentence Transformers), distances, chunking strategies
- RAG: Complete ingestion-retrieval-generation pipeline
- Similarity Search: HNSW vs IVFFlat, ANN algorithms, MMR for diversity
- Advanced Indexing: Optimal parameters, monitoring, zero-downtime rebuild
- Production: PgBouncer, read replicas, partitioning, K8s, monitoring
The central message of 2026 remains: "Just Use Postgres". If you are already using PostgreSQL (and the vast majority of applications are), you already have everything needed to build sophisticated, scalable, and cost-effective RAG systems. Do not add a separate vector database service until you have exceeded 500M vectors or have extreme sub-millisecond latency requirements.
Resources and Documentation
- pgvector GitHub: github.com/pgvector/pgvector
- MTEB Leaderboard: huggingface.co/spaces/mteb/leaderboard
- RAGAS Framework: github.com/explodinggradients/ragas
- ANN Benchmarks: ann-benchmarks.com
- PgBouncer docs: pgbouncer.org/config.html
- Related AI Engineering series: Enterprise RAG Architecture
- Related Data and AI Business series: Data Infrastructure for AI