Vector Database Selection: Pinecone, Qdrant, Weaviate, pgvector Compared
Choosing the wrong vector database for your RAG system can cost weeks of refactoring and unplanned budget. In 2026 the market has consolidated around four main options, each with a very different trade-off profile: Pinecone for those who want managed service without ops, Qdrant for those who prioritize performance and self-hosting, Weaviate for those who need native hybrid search, e pgvector for those who want to stay in their existing PostgreSQL database.
This guide provides you with benchmark data, real costs and selection criteria for taking an informed decision based on your specific workload.
What You Will Learn
- Latency and throughput benchmarks for major vector databases 2026
- Real costs at different scales (100K, 1M, 10M, 100M vectors)
- Decision matrix: which database for which use case
- Practical setup with Python for Qdrant and pgvector
- HNSW indices: critical parameters for optimal performance
- When Milvus is the right choice for 100M+ carriers
Why the Vector Database Matters More than the Model
In a RAG system, the quality of the final response depends 60-70% on the quality of the retrieval, not from the chosen LLM. A GPT-4o with poor retrieval will produce worse responses than a GPT-4o-mini with excellent retrieval. The vector database is the heart of retrieval: its latency, its precision and its scalability determine the user experience of your system.
Benchmark 2026: Real Data on Standardized Hardware
The following benchmarks were measured on a dataset of 1 million vectors from 1536 dimensions (embedding OpenAI text-embedding-3-small), with query k=10, on a server with 8 vCPU and 32GB RAM.
Database | P50 Latency | P99 Latency | Throughput | Recall@10
-----------------|-------------|-------------|-------------|----------
Qdrant (HNSW) | 12ms | 28ms | 850 q/s | 0.989
Pinecone (cloud) | 18ms | 35ms | 600 q/s | 0.987
Weaviate (HNSW) | 22ms | 50ms | 500 q/s | 0.985
pgvector (IVFFlat)| 45ms | 120ms | 200 q/s | 0.941
pgvector (HNSW) | 18ms | 45ms | 420 q/s | 0.983
Milvus (HNSW) | 8ms | 18ms | 1200 q/s | 0.991
Note: pgvector HNSW disponibile da PG 16 (2023), molto migliorato
Qdrant emerges as the leader in performance/self-hosting. Pinecone offers guaranteed SLAs as a managed service. pgvector with HNSW (introduced in PostgreSQL 16) has bridged almost completely the gap with dedicated solutions.
Cost Analysis at Different Scales
Costo mensile stimato (USD) — 1M vettori 1536 dim, 100K query/giorno
Pinecone Starter Pod: ~$70/mese (serverless, pay-per-use)
Pinecone Standard Pod: ~$280/mese (1x p1.x1 pod, SLA garantito)
Qdrant Cloud: ~$85/mese (0.5 vCPU, 1GB RAM managed)
Qdrant Self-hosted: ~$30/mese (infrastruttura cloud VPS)
Weaviate Cloud: ~$95/mese
pgvector (in Postgres): $0 aggiuntivo (gia paghi il DB)
A 10M vettori:
Pinecone: ~$700-1400/mese
Qdrant: ~$250/mese (self-hosted con server dedicato)
pgvector: ~$50/mese aggiuntivo (RAM extra per PostgreSQL)
Milvus: ~$180/mese (self-hosted, ottimo TCO a questa scala)
A 100M+ vettori:
Pinecone: >$5000/mese
Qdrant: ~$800/mese (cluster dedicato)
Milvus: ~$500/mese (cluster distribuito on-premise)
Practical Setup: Qdrant
Qdrant is the most common choice for technical teams who want maximum performance with the control over deployment. Here is the complete setup:
# Qdrant: setup, indexing e querying
from qdrant_client import QdrantClient
from qdrant_client.models import (
Distance, VectorParams, HnswConfigDiff,
PointStruct, Filter, FieldCondition, MatchValue
)
from openai import OpenAI
import uuid
# Connessione (locale o cloud)
client = QdrantClient(url="http://localhost:6333")
# Per Qdrant Cloud: QdrantClient(url="...", api_key="...")
# Crea la collection con HNSW ottimizzato
client.create_collection(
collection_name="knowledge_base",
vectors_config=VectorParams(
size=1536, # dimensione embedding OpenAI
distance=Distance.COSINE
),
hnsw_config=HnswConfigDiff(
m=16, # connessioni per nodo (default: 16)
ef_construct=200, # qualita indice durante build (default: 100)
# Aumentare ef_construct migliora recall ma rallenta l'indicizzazione
),
on_disk_payload=True # payload su disco per risparmiare RAM
)
# Indicizza documenti
openai_client = OpenAI()
def embed_text(text: str) -> list[float]:
response = openai_client.embeddings.create(
input=text,
model="text-embedding-3-small"
)
return response.data[0].embedding
def index_documents(documents: list[dict]) -> None:
points = []
for doc in documents:
embedding = embed_text(doc["content"])
points.append(PointStruct(
id=str(uuid.uuid4()),
vector=embedding,
payload={
"content": doc["content"],
"source": doc["source"],
"category": doc["category"],
"created_at": doc["created_at"]
}
))
# Batch upload per efficienza
client.upsert(
collection_name="knowledge_base",
points=points,
wait=True
)
# Query con filtri
def search(
query: str,
category_filter: str | None = None,
limit: int = 5
) -> list[dict]:
query_embedding = embed_text(query)
query_filter = None
if category_filter:
query_filter = Filter(
must=[
FieldCondition(
key="category",
match=MatchValue(value=category_filter)
)
]
)
results = client.query_points(
collection_name="knowledge_base",
query=query_embedding,
query_filter=query_filter,
limit=limit,
with_payload=True
)
return [
{"content": r.payload["content"], "score": r.score, "source": r.payload["source"]}
for r in results.points
]
# Esempio d'uso
docs = [
{"content": "Come configurare l'autenticazione 2FA...", "source": "docs/security.md",
"category": "security", "created_at": "2026-01-15"},
]
index_documents(docs)
results = search("configurazione sicurezza account", category_filter="security")
Practical Setup: pgvector
pgvector is the smart choice when you already use PostgreSQL: zero additional infrastructure, ACID transactions, JOIN with other tables. With HNSW the performance is comparable to the solutions dedicated for scales under 5M carriers.
# pgvector: schema e querying con psycopg2
import psycopg2
from pgvector.psycopg2 import register_vector
import numpy as np
conn = psycopg2.connect("postgresql://user:pass@localhost/dbname")
register_vector(conn)
with conn.cursor() as cur:
# Abilita l'estensione (una tantum)
cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
# Crea la tabella con embedding
cur.execute("""
CREATE TABLE IF NOT EXISTS documents (
id BIGSERIAL PRIMARY KEY,
content TEXT NOT NULL,
source VARCHAR(500),
category VARCHAR(100),
embedding vector(1536), -- dimensione OpenAI
created_at TIMESTAMP DEFAULT NOW()
)
""")
# Crea indice HNSW (migliore performance di IVFFlat)
cur.execute("""
CREATE INDEX IF NOT EXISTS documents_embedding_hnsw_idx
ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 200)
""")
conn.commit()
# Inserimento
def insert_document(content: str, source: str, category: str, embedding: list[float]):
with conn.cursor() as cur:
cur.execute(
"INSERT INTO documents (content, source, category, embedding) VALUES (%s, %s, %s, %s)",
(content, source, category, np.array(embedding))
)
conn.commit()
# Query similarity search con filtro
def search_documents(query_embedding: list[float], category: str = None, limit: int = 5):
with conn.cursor() as cur:
if category:
cur.execute("""
SELECT content, source, 1 - (embedding <=> %s) AS similarity
FROM documents
WHERE category = %s
ORDER BY embedding <=> %s
LIMIT %s
""", (np.array(query_embedding), category, np.array(query_embedding), limit))
else:
cur.execute("""
SELECT content, source, 1 - (embedding <=> %s) AS similarity
FROM documents
ORDER BY embedding <=> %s
LIMIT %s
""", (np.array(query_embedding), np.array(query_embedding), limit))
return cur.fetchall()
# Restituisce: [(content, source, similarity_score), ...]
Decision Matrix: Which one to choose
Caso d'uso | Raccomandazione | Motivo
------------------------------------|------------------------|----------------------------------
Gia uso PostgreSQL, <1M vettori | pgvector | Zero infra, JOIN nativi, ACID
Startup, MVP rapido | Pinecone Serverless | Zero ops, pay-per-use
Team tecnico, >1M vettori | Qdrant self-hosted | Best performance/cost
Hybrid search nativo necessario | Weaviate | BM25+vector integrato
>100M vettori, cluster distribuito | Milvus | Scalabilita orizzontale
Compliance (dati on-premise) | Qdrant/Milvus | Full control, nessun cloud
Common Mistakes in Selection
- Choose pgvector with IVFFlat on datasets > 500K vectors: latency degrades significantly. Use HNSW (available since PostgreSQL 16) or migrate to a solution dedicated
- Overestimating the size of embeddings: text-embedding-3-small a 1536 dimensions has almost identical performance to text-embedding-3-large at 3072 dimensions but it costs half and uses half RAM
- Ignore ef_search at query time: increase ef_search from 100 to 200 in Qdrant/Weaviate improves recall by 95% to 99% with only 1.3x latency overhead
- Size only for the initial volume: Plan the migration first to reach the limit of your current choice
Critical HNSW Parameters
The HNSW (Hierarchical Navigable Small World) index is the heart of vector performance databases. Understanding its parameters allows you to optimize the recall/latency trade-off:
Parametro | Effetto sull'indice | Effetto sulla query
----------------|----------------------|--------------------
m (16-64) | Connessioni per nodo | Recall (piu alto = meglio)
| piu alto = piu RAM | Latency (marginale impatto)
| piu alto = build piu |
| lento |
| |
ef_construction | Qualita build | Nessun effetto diretto
(100-500) | piu alto = recall | (determina qualita indice)
| migliore |
| piu alto = build piu |
| lento |
| |
ef_search | Nessun effetto | Recall (molto impatto)
(50-500) | sull'indice | Latency (tradeoff principale)
Configurazione raccomandata per produzione:
- m = 16 (bilanciato), m = 32 (alta accuratezza, +50% RAM)
- ef_construct = 200 (build lenta, alta qualita indice)
- ef_search = 100-200 (aumenta a query time senza ricostruire)
Conclusions
The 2026 rule of thumb: under 1M vectors, pgvector with HNSW is often the choice right if you are already on PostgreSQL — zero additional infrastructure and performance competitive. Between 1M and 50M carriers, Qdrant self-hosted offers the best performance/cost ratio. Over 100M carriers, Milvus with distributed deployment and the standard choice.
The next article delves into RAG architectures — from Naive RAG to Modular RAG — with code complete and pattern to handle the most complex use cases.







