Embeddings: Theory and Practice with PostgreSQL
Every semantic search system, every RAG pipeline, and every AI application that works with natural language shares one fundamental building block: embeddings. They are the translation of meaning into numbers, the bridge between the world of text and the world of mathematics. Without embeddings, a database cannot distinguish "dog" from "automobile" - with embeddings, it knows that "dog" is closer to "cat" than it is to "toaster".
In the first article of this series we configured pgvector and learned how to store and query vectors in PostgreSQL. But where do those vectors come from? How do you generate a high-quality embedding? And which model should you choose among the dozens available? In this article we answer all these questions, from mathematical theory to practical Python and PostgreSQL implementation.
Series Overview
| # | Article | Focus |
|---|---|---|
| 1 | pgvector | Installation, operators, indexing |
| 2 | You are here - Embeddings | Models, distances, generation |
| 3 | RAG with PostgreSQL | End-to-end RAG pipeline |
| 4 | Advanced Similarity Search | Hybrid search, filtering |
| 5 | Indexing and Performance | HNSW, IVFFlat, tuning |
| 6 | RAG in Production | Monitoring, scaling, CI/CD |
What You Will Learn
- What an embedding is and why it is fundamental to modern AI
- Historical evolution: from one-hot encoding to Word2Vec, GloVe, BERT and Sentence Transformers
- Mathematical properties of embeddings: vector analogies and semantic clustering
- The four distance metrics with formulas and use cases
- How to generate embeddings with Python: locally and via API
- How to store and query embeddings in PostgreSQL with pgvector
- Multimodal embeddings: text, images, audio and code
- How to evaluate embedding model quality (MTEB)
- Costs and scaling strategies for millions of documents
1. What Are Embeddings?
An embedding is a dense vector representation of an object (word, sentence, document, image) in a continuous low-dimensional space. In practical terms, it is an array of floating-point numbers that captures the "meaning" of that object.
# The embedding of the sentence "The cat sleeps on the couch"
# generated with OpenAI text-embedding-3-small (1536 dimensions)
embedding = [
0.0231, -0.0456, 0.0891, -0.0123, 0.0567, -0.0234,
0.0789, -0.0345, 0.0123, -0.0678, 0.0456, -0.0891,
# ... 1524 more values ...
]
print(f"Type: {type(embedding)}") # <class 'list'>
print(f"Dimensions: {len(embedding)}") # 1536
The key intuition is this: in a well-trained vector space, the geometric distance between two vectors reflects the semantic similarity between the concepts they represent. Sentences with similar meaning will have nearby vectors; sentences with different meanings will be far apart.
Embedding Properties
| Property | Description | Example |
|---|---|---|
| Dense | Most dimensions carry non-zero values (unlike sparse one-hot vectors) | [0.023, -0.045, 0.089, ...] |
| Continuous | Real values, not discrete | Each component is a float32/float16 |
| Fixed dimensionality | The same model always produces vectors of the same length | 384, 768, 1536 or 3072 dimensions |
| Semantically meaningful | Distances between vectors reflect meaning relationships | sim("cat", "feline") > sim("cat", "car") |
If we think of embedding space as a map, similar concepts form "neighborhoods": animals in one area, vehicles in another, emotions in yet another. The beauty is that these relationships emerge automatically from training - they are never programmed manually.
2. From Words to Vectors: Historical Evolution
The history of embeddings is a progression of increasingly sophisticated ideas, each solving the limitations of the previous one. Understanding this evolution helps explain why modern models work so well.
2.1 One-Hot Encoding (1990s)
The simplest approach: each word is represented by a vector with a single 1 and all other values set to 0. If the vocabulary has V words, each vector has V dimensions.
# Vocabulary: ["cat", "dog", "fish", "car", "bike"]
# Vector size = vocabulary size = 5
cat = [1, 0, 0, 0, 0]
dog = [0, 1, 0, 0, 0]
fish = [0, 0, 1, 0, 0]
car = [0, 0, 0, 1, 0]
bike = [0, 0, 0, 0, 1]
# Problem 1: the distance between "cat" and "dog" equals
# the distance between "cat" and "car"
import numpy as np
dist_cat_dog = np.linalg.norm(
np.array(cat) - np.array(dog)
) # sqrt(2) = 1.414
dist_cat_car = np.linalg.norm(
np.array(cat) - np.array(car)
) # sqrt(2) = 1.414 -- identical!
# Problem 2: with a vocabulary of 100,000 words,
# each vector has 100,000 dimensions (sparse, inefficient)
Limitations of One-Hot Encoding
- Exploding dimensionality: for a 100K-word vocabulary, each vector has 100K dimensions, almost all zero.
- No semantic information: all vectors are equidistant from each other. "Cat" is as far from "feline" as it is from "earthquake". This approach captures no meaning relationships between words.
2.2 TF-IDF (Term Frequency - Inverse Document Frequency)
A step forward: instead of 0/1, vector components indicate how important a word is in a document relative to the entire corpus. But each document becomes a sparse vector in vocabulary dimensionality.
from sklearn.feature_extraction.text import TfidfVectorizer
documents = [
"the cat sleeps on the couch",
"the dog plays in the garden",
"the automobile drives on the road",
"the feline rests on the armchair",
]
vectorizer = TfidfVectorizer(stop_words="english")  # drop "the", "on", "in", ...
tfidf_matrix = vectorizer.fit_transform(documents)
# Result: sparse matrix (4 documents x 12 content terms)
print(f"Shape: {tfidf_matrix.shape}")  # (4, 12)
print(f"Terms: {vectorizer.get_feature_names_out()}")
# Problem: "cat sleeps" and "feline rests" are maximally distant
# because they share no content words, even though the meaning is similar
from sklearn.metrics.pairwise import cosine_similarity
sim = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[3:4])
print(f"Similarity cat-feline: {sim[0][0]:.3f}")  # 0.000 -- no overlap!
TF-IDF improves on one-hot encoding by weighting words by importance, but suffers from the same fundamental problem: it does not understand that "cat" and "feline" are synonyms, because it reasons only on exact lexical matches.
2.3 Word2Vec: The Revolution (2013)
In 2013, Tomas Mikolov and his team at Google published Word2Vec, which changed everything. The brilliant idea: a word is defined by the context in which it appears. Words that appear in similar contexts will have similar representations.
Word2Vec uses shallow neural networks to learn dense vectors (typically 100-300 dimensions) from large text corpora. Two architectures:
Word2Vec Architectures
| Architecture | Input | Output | Description |
|---|---|---|---|
| CBOW | Context words | Target word | Predicts the central word given the surrounding context |
| Skip-gram | Target word | Context words | Predicts the surrounding words given the central word |
from gensim.models import Word2Vec
# Sample corpus (in production: millions of sentences)
sentences = [
["the", "cat", "sleeps", "on", "the", "couch"],
["the", "dog", "plays", "in", "the", "garden"],
["the", "feline", "rests", "on", "the", "armchair"],
["the", "dog", "runs", "in", "the", "park"],
]
# Train Word2Vec (Skip-gram)
model = Word2Vec(
sentences=sentences,
vector_size=100, # embedding dimensionality
window=5, # context: 5 words before and after
min_count=1, # include words with at least 1 occurrence
sg=1, # 1 = Skip-gram, 0 = CBOW
epochs=100
)
# With a corpus this small the neighbors are noisy; trained on a
# realistic corpus, "cat" and "feline" end up close:
print(model.wv.most_similar("cat", topn=3))
# e.g. [('feline', 0.92), ('dog', 0.85), ('sleeps', 0.71)]
# Access to the vector
vector_cat = model.wv["cat"]
print(f"Dimensions: {vector_cat.shape}") # (100,)
print(f"First 5: {vector_cat[:5]}")
2.4 GloVe: Global Vectors (2014)
Stanford developed GloVe (Global Vectors for Word Representation) with a different approach: instead of a neural network, GloVe factorizes the global co-occurrence matrix of the corpus. It combines the advantages of global statistical methods (like LSA) with those of Word2Vec's local context.
GloVe minimizes a cost function that ensures the dot product between two word vectors is proportional to the logarithm of their co-occurrence probability.
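That objective can be sketched in a few lines of numpy. The co-occurrence counts below are hypothetical toy values, and the weighting function f(x) uses the defaults from the GloVe paper (x_max = 100, alpha = 0.75); this is an illustration of the loss, not a full trainer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy co-occurrence counts X[i][j] between 4 words (hypothetical values)
X = np.array([
    [0.0, 12.0, 1.0, 0.0],
    [12.0, 0.0, 2.0, 1.0],
    [1.0, 2.0, 0.0, 8.0],
    [0.0, 1.0, 8.0, 0.0],
])
V, d = X.shape[0], 10  # vocabulary size, embedding dimensionality

# Parameters: word vectors w, context vectors w_tilde, and two bias terms
w = rng.normal(scale=0.1, size=(V, d))
w_tilde = rng.normal(scale=0.1, size=(V, d))
b = np.zeros(V)
b_tilde = np.zeros(V)

def weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(x): caps the influence of very frequent pairs."""
    return np.minimum((x / x_max) ** alpha, 1.0)

def glove_loss(w, w_tilde, b, b_tilde, X):
    """J = sum over pairs with X_ij > 0 of
    f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2"""
    i_idx, j_idx = np.nonzero(X)
    pred = (w[i_idx] * w_tilde[j_idx]).sum(axis=1) + b[i_idx] + b_tilde[j_idx]
    err = pred - np.log(X[i_idx, j_idx])
    return float(np.sum(weight(X[i_idx, j_idx]) * err ** 2))

print(f"Initial loss: {glove_loss(w, w_tilde, b, b_tilde, X):.4f}")
```

Training then minimizes this loss with stochastic gradient descent over the non-zero entries of X; at convergence, w_i . w~_j tracks log X_ij, which is exactly the property described above.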
2.5 FastText: Subword Embeddings (2016)
Facebook AI Research (FAIR) extended Word2Vec with FastText, which represents each word as a set of character n-grams. This solves two critical problems:
- Rare or out-of-vocabulary (OOV) words: FastText can generate embeddings for never-seen words by composing vectors of sub-segments
- Morphology: morphologically related words (e.g. "run", "running", "runner") share n-grams and therefore have similar vectors
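The n-gram decomposition is easy to sketch. Following FastText's convention, the word is wrapped in "<" and ">" boundary markers before extracting n-grams of length 3 to 6 (the library's defaults):

```python
def char_ngrams(word: str, n_min: int = 3, n_max: int = 6) -> set[str]:
    """Character n-grams as FastText builds them: the word is wrapped
    in boundary markers '<' and '>' before extraction."""
    wrapped = f"<{word}>"
    return {
        wrapped[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(wrapped) - n + 1)
    }

# Morphologically related words share many n-grams...
shared = char_ngrams("running") & char_ngrams("runner")
print(sorted(shared))
# ['<ru', '<run', '<runn', 'run', 'runn', 'unn']

# ...while unrelated words share few or none
print(char_ngrams("running") & char_ngrams("toaster"))  # set()
```

A word's embedding is the sum of the vectors of its n-grams (plus the whole word, if it was seen in training), which is why an OOV word like "runningly" still gets a sensible vector.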
Evolution: From Sparse to Dense Representations
| Method | Year | Type | Typical Dimensions | Semantics |
|---|---|---|---|---|
| One-hot | - | Sparse | V (vocabulary) | None |
| TF-IDF | 1972 | Sparse | V (vocabulary) | Statistical |
| Word2Vec | 2013 | Dense | 100-300 | Local contextual |
| GloVe | 2014 | Dense | 50-300 | Global + local |
| FastText | 2016 | Dense | 100-300 | Subword + context |
| BERT | 2018 | Dense | 768 | Dynamic contextual |
| Sentence Transformers | 2019 | Dense | 384-1024 | Full sentences |
3. Mathematical Properties of Embeddings
One of the most fascinating discoveries of Word2Vec is that the vector space learns algebraic relationships between concepts. Arithmetic operations on vectors produce semantically coherent results.
3.1 Vector Analogies
The famous analogy: king - man + woman = queen. In vector terms, the difference between "king" and "man" captures the concept of "royalty", and adding it to "woman" gives "queen". Formally:
import gensim.downloader as api
# Load pre-trained GloVe embeddings
model = api.load("glove-wiki-gigaword-100")
# king - man + woman = ?
result = model.most_similar(
positive=["king", "woman"],
negative=["man"],
topn=3
)
print(result)
# [('queen', 0.7698), ('princess', 0.6450), ('monarch', 0.6345)]
# Other analogies that work:
# Paris - France + Italy = Rome
result2 = model.most_similar(
positive=["paris", "italy"],
negative=["france"],
topn=1
)
print(result2) # [('rome', 0.8722)]
# good - bad + sad = ?
result3 = model.most_similar(
positive=["good", "sad"],
negative=["bad"],
topn=1
)
print(result3) # [('happy', 0.6891)]
3.2 Semantic Clustering
Embeddings naturally form clusters in vector space. If we project vectors into 2D (using t-SNE or UMAP), we observe that words of the same category group together: animals near animals, countries near countries, professions near professions.
This property is fundamental for practical applications: similarity search works precisely because documents on similar topics have nearby embeddings in vector space.
4. Modern Embeddings: Contextual Representations
Word2Vec and GloVe generate a single vector per word, independent of context. But "bank" has different meanings in "river bank" and "bank account". Contextual embeddings, introduced in 2018 with models like ELMo and BERT, solve this problem: the same word gets a different vector depending on its context.
4.1 BERT Embeddings
BERT (Bidirectional Encoder Representations from Transformers) processes the entire sentence and produces a vector for each token. To obtain a whole-sentence embedding, two approaches are commonly used:
- CLS token: the first special [CLS] token contains an aggregated representation of the sentence
- Mean pooling: average of all token vectors - generally produces better results for similarity search
BERT is not optimal for similarity search
Original BERT was not trained to produce high-quality sentence embeddings. The CLS token is optimized for classification, not semantic similarity. For similarity search, specialized models like Sentence Transformers are needed.
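Mean pooling itself is a small operation: average the token vectors while ignoring padding positions. A minimal numpy sketch, with random arrays standing in for real BERT token outputs (only the shapes are meaningful here):

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-ins for BERT outputs (hypothetical values): a batch of 2 sentences,
# padded to 6 tokens, hidden size 768
token_embeddings = rng.normal(size=(2, 6, 768)).astype(np.float32)
# attention_mask: 1 for real tokens, 0 for padding
attention_mask = np.array([
    [1, 1, 1, 1, 0, 0],   # sentence 1: 4 real tokens
    [1, 1, 1, 1, 1, 1],   # sentence 2: 6 real tokens
])

def mean_pooling(token_embeddings: np.ndarray,
                 attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors, excluding padding positions."""
    mask = attention_mask[:, :, None].astype(np.float32)  # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)        # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)        # avoid div by zero
    return summed / counts

sentence_embeddings = mean_pooling(token_embeddings, attention_mask)
print(sentence_embeddings.shape)  # (2, 768)
```

With a real model, token_embeddings would be the transformer's last hidden state and attention_mask would come from the tokenizer; Sentence Transformers applies exactly this kind of pooling layer on top of BERT.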
4.2 Sentence Transformers (SBERT)
In 2019, Reimers and Gurevych introduced Sentence-BERT, fine-tuning BERT with a siamese structure to produce meaningful sentence embeddings. This revolutionized similarity search: for the first time it was possible to compare sentences with a simple cosine distance, achieving high-quality results.
4.3 Embedding Models: Full Comparison
Embedding Models Compared (2026)
| Model | Provider | Dimensions | MTEB Score | Cost / 1M tokens | Notes |
|---|---|---|---|---|---|
| text-embedding-3-small | OpenAI | 1536 | 62.3 | $0.02 | Best cost/quality ratio |
| text-embedding-3-large | OpenAI | 3072 | 64.6 | $0.13 | Maximum OpenAI quality |
| embed-v3 | Cohere | 1024 | 64.5 | $0.10 | Supports 100+ languages |
| voyage-3 | Voyage AI | 1024 | 67.1 | $0.06 | Top for retrieval tasks |
| all-MiniLM-L6-v2 | HuggingFace | 384 | 56.3 | Free | Fast, local, compact |
| all-mpnet-base-v2 | HuggingFace | 768 | 57.8 | Free | Best open-source base model |
| gte-large-en-v1.5 | Alibaba (HF) | 1024 | 65.4 | Free | Competitive with commercial models |
| bge-large-en-v1.5 | BAAI (HF) | 1024 | 64.2 | Free | Excellent for RAG |
How to Choose a Model
- Prototype / limited budget: all-MiniLM-L6-v2 (free, fast, 384 dim)
- Production, cost-effective: text-embedding-3-small (OpenAI, $0.02/1M tokens)
- Maximum retrieval quality: voyage-3 or gte-large-en-v1.5
- Multilingual: Cohere embed-v3 (100+ languages)
- Self-hosted / privacy: bge-large-en-v1.5 or gte-large-en-v1.5
5. Distance Metrics Between Vectors
The choice of distance metric directly influences the quality of similarity search. Here are the four main metrics with their mathematical formulas, strengths, and when to use them.
5.1 Cosine Similarity
The most commonly used metric for text embeddings. It measures the angle between two vectors, ignoring their magnitude (length): cos(a, b) = (a · b) / (||a|| ||b||). Two vectors pointing in the same direction have cosine similarity 1, orthogonal vectors 0, opposite vectors -1.
In pgvector, the <=> operator computes the cosine distance (= 1 - cosine similarity), where 0 means identical and 2 means opposite.
5.2 Euclidean Distance (L2)
The straight-line distance between two points in space: d(a, b) = sqrt(Σ (a_i - b_i)^2). It takes into account both the direction and the magnitude of the vectors. The <-> operator in pgvector computes L2 distance.
5.3 Dot Product (Inner Product)
The dot product a · b = Σ a_i * b_i measures both direction and magnitude. For normalized vectors (norm = 1), the dot product is equivalent to cosine similarity. In pgvector, the <#> operator computes the negative inner product, so that ORDER BY ... ASC returns the most similar rows first.
5.4 Manhattan Distance (L1)
Sum of absolute differences component by component: d(a, b) = Σ |a_i - b_i|. Less sensitive to outliers than Euclidean distance. Supported natively in pgvector since version 0.7.0 via the <+> operator; on older versions it can be computed manually.
import numpy as np
from scipy.spatial.distance import cosine, euclidean, cityblock
# Two sample vectors (normalized)
a = np.array([0.5, 0.3, 0.8, 0.1, 0.6])
b = np.array([0.4, 0.35, 0.75, 0.15, 0.55])
# L2 normalization
a_norm = a / np.linalg.norm(a)
b_norm = b / np.linalg.norm(b)
# 1. Cosine Similarity (1 - cosine distance)
cos_sim = 1 - cosine(a_norm, b_norm)
print(f"Cosine similarity: {cos_sim:.6f}") # ~0.999
# 2. Euclidean Distance (L2)
l2_dist = euclidean(a_norm, b_norm)
print(f"L2 distance: {l2_dist:.6f}") # ~0.042
# 3. Dot Product (for normalized vectors = cosine similarity)
dot = np.dot(a_norm, b_norm)
print(f"Dot product: {dot:.6f}") # ~0.999
# 4. Manhattan Distance (L1)
l1_dist = cityblock(a_norm, b_norm)
print(f"L1 distance: {l1_dist:.6f}") # ~0.072
# L2-Cosine relation for normalized vectors:
# d_L2^2 = 2 * (1 - cos_sim)
print(f"\nVerification: L2^2 = {l2_dist**2:.6f}")
print(f"2*(1-cos) = {2*(1-cos_sim):.6f}") # identical!
When to Use Each Metric
| Metric | pgvector Operator | Use When | Avoid When |
|---|---|---|---|
| Cosine | <=> | Text embeddings, when magnitude does not matter | Spatial data where magnitude is significant |
| L2 (Euclidean) | <-> | Images, numeric data, when magnitude matters | Vectors with different scales across components |
| Dot Product | <#> | Pre-normalized vectors (slightly better performance) | Non-normalized vectors (results distorted by magnitude) |
| Manhattan (L1) | <+> (pgvector 0.7.0+) | Sparse data, outlier robustness | General use with dense embeddings |
Practical Rule
For 95% of cases with text embeddings, use cosine distance (<=> in pgvector). Modern embedding models produce already-normalized vectors, which makes cosine and dot product practically equivalent. Euclidean distance makes sense for spatial data or when vector magnitude carries information.
6. Generating Embeddings in Python
Let us now see how to generate embeddings with three different approaches: local models with Sentence Transformers, OpenAI API, and HuggingFace Inference API. Each approach has its own advantages and trade-offs.
6.1 Sentence Transformers (Local)
The most flexible and private approach: the model runs on your machine, no data leaves your network, no API call costs.
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer
import numpy as np
# Load the model (downloaded automatically on first use)
model = SentenceTransformer("all-MiniLM-L6-v2")
# Single sentence embedding
sentence = "PostgreSQL is an open-source relational database"
embedding = model.encode(sentence)
print(f"Type: {type(embedding)}") # numpy.ndarray
print(f"Dimensions: {embedding.shape}") # (384,)
# Batch embedding (much more efficient)
sentences = [
"PostgreSQL is an open-source relational database",
"pgvector adds vector support to PostgreSQL",
"Machine learning requires large amounts of data",
"Pizza margherita is a classic Italian dish",
]
embeddings = model.encode(
sentences,
batch_size=32, # process 32 sentences at a time
show_progress_bar=True, # show progress for large batches
normalize_embeddings=True # normalize to L2 norm = 1
)
print(f"Shape: {embeddings.shape}") # (4, 384)
# Compute pairwise similarity
from sentence_transformers.util import cos_sim
similarities = cos_sim(embeddings, embeddings)
print(f"\nSimilarity matrix:\n{similarities}")
# The first 2 sentences (about PostgreSQL) will have high similarity
# The pizza sentence will be distant from the others
6.2 OpenAI Embedding API
The OpenAI API offers high-quality models without infrastructure management. Ideal for production with moderate volumes.
# pip install openai
from openai import OpenAI
import os
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
def get_embeddings(
    texts: list[str],
    model: str = "text-embedding-3-small"
) -> list[list[float]]:
    """Generate embeddings for a list of texts."""
    response = client.embeddings.create(
        input=texts,
        model=model,
    )
    return [item.embedding for item in response.data]
# Single embedding
text = "PostgreSQL as vector database for AI"
embedding = get_embeddings([text])[0]
print(f"Dimensions: {len(embedding)}") # 1536
# Batch embeddings (up to 2048 texts per call)
texts = [
"How to install pgvector on Docker",
"Tutorial for similarity search in PostgreSQL",
"Guide to pasta al forno cooking",
]
embeddings = get_embeddings(texts)
print(f"Embeddings generated: {len(embeddings)}") # 3
# Reduced dimension with text-embedding-3-small
# You can specify smaller dimensions to save space
response = client.embeddings.create(
input=["Sample text"],
model="text-embedding-3-small",
dimensions=512 # reduced from 1536 to 512
)
emb_small = response.data[0].embedding
print(f"Reduced dimensions: {len(emb_small)}") # 512
6.3 HuggingFace Inference API
A compromise between local models and commercial APIs: access to thousands of open-source models via API, with a generous free tier.
# pip install huggingface_hub
from huggingface_hub import InferenceClient
import os
client = InferenceClient(
token=os.getenv("HF_TOKEN")
)
def get_hf_embeddings(
    texts: list[str],
    model: str = "BAAI/bge-large-en-v1.5"
) -> list[list[float]]:
    """Generate embeddings using the HuggingFace Inference API."""
    # feature_extraction takes a single string and returns a numpy array
    return [
        client.feature_extraction(text, model=model).tolist()
        for text in texts
    ]
# Generate embeddings
texts = [
"Vector search with PostgreSQL and pgvector",
"How to create HNSW indexes for fast search",
]
embeddings = get_hf_embeddings(texts)
print(f"Embeddings: {len(embeddings)}") # 2
print(f"Dimensions: {len(embeddings[0])}") # 1024 (bge-large)
6.4 Efficient Batch Processing
When you need to generate embeddings for thousands or millions of documents, the efficiency of batch processing becomes critical.
import time
from typing import Generator
from sentence_transformers import SentenceTransformer
import numpy as np
def chunk_list(
    lst: list, chunk_size: int
) -> Generator[list, None, None]:
    """Split a list into fixed-size chunks."""
    for i in range(0, len(lst), chunk_size):
        yield lst[i:i + chunk_size]

def generate_embeddings_batch(
    texts: list[str],
    model_name: str = "all-MiniLM-L6-v2",
    batch_size: int = 256,
    device: str = "cpu"  # "cuda" for GPU
) -> np.ndarray:
    """Generate embeddings in batches with progress tracking."""
    model = SentenceTransformer(model_name, device=device)
    all_embeddings = []
    total_batches = (len(texts) + batch_size - 1) // batch_size
    start = time.time()
    for i, batch in enumerate(chunk_list(texts, batch_size)):
        batch_emb = model.encode(
            batch,
            batch_size=batch_size,
            normalize_embeddings=True,
            show_progress_bar=False
        )
        all_embeddings.append(batch_emb)
        elapsed = time.time() - start
        processed = min((i + 1) * batch_size, len(texts))  # last batch may be partial
        rate = processed / elapsed
        print(
            f"Batch {i+1}/{total_batches} - "
            f"{rate:.0f} texts/sec"
        )
    return np.vstack(all_embeddings)
# Usage
texts = [f"Document number {i}" for i in range(10_000)]
embeddings = generate_embeddings_batch(
texts,
batch_size=256,
device="cuda" # use GPU if available
)
print(f"Final shape: {embeddings.shape}") # (10000, 384)
7. Storing Embeddings in PostgreSQL
Now that we know how to generate embeddings, let us see how to save them in PostgreSQL with pgvector and run similarity search queries. This is the practical connection to article 1 of the series.
7.1 Table Schema
-- Enable pgvector
CREATE EXTENSION IF NOT EXISTS vector;
-- Documents table with embedding
CREATE TABLE documents (
id BIGSERIAL PRIMARY KEY,
title VARCHAR(500) NOT NULL,
content TEXT NOT NULL,
source VARCHAR(255),
category VARCHAR(100),
embedding vector(384), -- dimension of chosen model
created_at TIMESTAMPTZ DEFAULT NOW(),
metadata JSONB DEFAULT '{}'::jsonb
);
-- HNSW index for fast search (cosine distance)
CREATE INDEX idx_documents_embedding
ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 200);
-- Index on category for combined filters
CREATE INDEX idx_documents_category
ON documents (category);
7.2 Insertion from Python
import psycopg2
from psycopg2.extras import execute_values
from sentence_transformers import SentenceTransformer
import numpy as np
# Configuration
DB_CONFIG = {
"host": "localhost",
"port": 5432,
"dbname": "vectordb",
"user": "admin",
"password": "secret_password",
}
# 1. Generate embeddings
model = SentenceTransformer("all-MiniLM-L6-v2")
documents = [
{
"title": "Introduction to pgvector",
"content": "pgvector is a PostgreSQL extension for vectors...",
"source": "blog",
"category": "database"
},
{
"title": "RAG with LangChain",
"content": "Retrieval Augmented Generation combines retrieval...",
"source": "tutorial",
"category": "ai"
},
{
"title": "Python for Data Science",
"content": "Python is the most used language for data science...",
"source": "guide",
"category": "programming"
},
]
# Generate embeddings for the content
texts = [d["content"] for d in documents]
embeddings = model.encode(texts, normalize_embeddings=True)
# 2. Save to PostgreSQL
conn = psycopg2.connect(**DB_CONFIG)
cur = conn.cursor()
# Prepare data for batch insert
values = []
for doc, emb in zip(documents, embeddings):
    values.append((
        doc["title"],
        doc["content"],
        doc["source"],
        doc["category"],
        emb.tolist()  # convert numpy array to Python list
    ))
# Efficient batch insert
execute_values(
cur,
"""INSERT INTO documents
(title, content, source, category, embedding)
VALUES %s""",
values,
template="(%s, %s, %s, %s, %s::vector)"
)
conn.commit()
print(f"Inserted {len(values)} documents with embeddings")
cur.close()
conn.close()
7.3 Similarity Search from Python
def similarity_search(
    query: str,
    top_k: int = 5,
    category: str | None = None,
    threshold: float = 0.3
) -> list[dict]:
    """Find documents similar to the query."""
    # Generate query embedding
    query_embedding = model.encode(
        query, normalize_embeddings=True
    ).tolist()
    conn = psycopg2.connect(**DB_CONFIG)
    cur = conn.cursor()
    # Query with optional category filter
    if category:
        cur.execute("""
            SELECT id, title, content, category,
                   1 - (embedding <=> %s::vector) AS similarity
            FROM documents
            WHERE category = %s
              AND 1 - (embedding <=> %s::vector) > %s
            ORDER BY embedding <=> %s::vector
            LIMIT %s
        """, (
            query_embedding, category,
            query_embedding, threshold,
            query_embedding, top_k
        ))
    else:
        cur.execute("""
            SELECT id, title, content, category,
                   1 - (embedding <=> %s::vector) AS similarity
            FROM documents
            WHERE 1 - (embedding <=> %s::vector) > %s
            ORDER BY embedding <=> %s::vector
            LIMIT %s
        """, (
            query_embedding,
            query_embedding, threshold,
            query_embedding, top_k
        ))
    results = []
    for row in cur.fetchall():
        results.append({
            "id": row[0],
            "title": row[1],
            "content": row[2][:200],  # truncated
            "category": row[3],
            "similarity": round(row[4], 4),
        })
    cur.close()
    conn.close()
    return results
# Example usage
results = similarity_search(
"how to use vectors in a database",
top_k=3,
category="database"
)
for r in results:
    print(f"[{r['similarity']}] {r['title']}")
7.4 Indexing: HNSW vs IVFFlat
For datasets with more than a few thousand documents, an index is essential for acceptable performance. pgvector offers two index types:
HNSW vs IVFFlat
| Feature | HNSW | IVFFlat |
|---|---|---|
| Query speed | Very fast | Fast |
| Recall | 95-99% | 85-95% |
| Build time | Slow (minutes) | Fast (seconds) |
| Memory | High (graph in RAM) | Low (centroids) |
| Insert/Update | Good (incremental update) | Requires periodic rebuild |
| Recommended for | Production, high quality | Prototyping, static datasets |
-- HNSW (recommended for production)
-- m: connections per node (16-64, default 16)
-- ef_construction: build quality (64-512, default 64)
CREATE INDEX idx_hnsw_cosine
ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 200);
-- For L2 distance
CREATE INDEX idx_hnsw_l2
ON documents
USING hnsw (embedding vector_l2_ops)
WITH (m = 16, ef_construction = 200);
-- IVFFlat (faster to build)
-- lists: number of clusters (rows/1000 up to ~1M rows, sqrt(rows) beyond)
CREATE INDEX idx_ivfflat_cosine
ON documents
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100); -- for ~100K documents
-- Query parameters to control recall vs speed
SET hnsw.ef_search = 100; -- default 40, increase for more recall
SET ivfflat.probes = 10; -- default 1, increase for more recall
8. Embeddings for Different Data Types
Embeddings are not limited to text. Modern models can generate vector representations for images, audio, source code, and even multimodal data.
Multimodal Embeddings: Models by Data Type
| Data Type | Model | Dimensions | Use Case |
|---|---|---|---|
| Text | all-MiniLM-L6-v2, text-embedding-3-small | 384-3072 | Semantic search, RAG, classification |
| Images | CLIP (OpenAI), SigLIP (Google) | 512-768 | Image search, visual classification |
| Audio | Whisper, CLAP | 512-1280 | Audio search, music classification |
| Code | CodeBERT, StarCoder embeddings | 768 | Code search, duplicate detection |
| Multimodal | CLIP, ImageBind (Meta) | 512-1024 | Cross-modal search (text for images) |
# pip install transformers pillow
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import torch
import numpy as np
# Load CLIP
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
# Image embedding
image = Image.open("cat_photo.jpg")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    image_embedding = model.get_image_features(**inputs)
image_emb = image_embedding[0].numpy()
print(f"Image embedding: {image_emb.shape}") # (512,)
# Text embedding (in the SAME space!)
text_inputs = processor(
text=["a sleeping cat", "a playing dog"],
return_tensors="pt",
padding=True
)
with torch.no_grad():
    text_embeddings = model.get_text_features(**text_inputs)
text_embs = text_embeddings.numpy()
# Compute cross-modal similarity
from numpy.linalg import norm
for i, text in enumerate(["a sleeping cat", "a playing dog"]):
    sim = np.dot(image_emb, text_embs[i]) / (
        norm(image_emb) * norm(text_embs[i])
    )
    print(f"Similarity '{text}': {sim:.4f}")
# "a sleeping cat" will have higher similarity with cat_photo.jpg
The power of CLIP is that text and images live in the same vector space. You can search for images with a text query or find text related to an image. This enables multimodal search in PostgreSQL: store CLIP embeddings in the same pgvector table and search with text queries.
9. Evaluating Embedding Quality
How do you know if an embedding model is "good"? The answer depends on the specific task, but standardized benchmarks and objective metrics exist.
9.1 MTEB: Massive Text Embedding Benchmark
MTEB is the reference benchmark for evaluating embedding models. It measures performance across 58+ tasks grouped into 8 categories:
- Retrieval: finding relevant documents given a query
- Semantic Textual Similarity (STS): how similar two sentences are
- Classification: classifying texts into categories
- Clustering: grouping similar texts
- Pair Classification: determining if two texts are related
- Reranking: re-ordering results by relevance
- Summarization: quality of summaries
- BitextMining: finding parallel translations
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import (
InformationRetrievalEvaluator
)
# Prepare evaluation dataset
queries = {
"q1": "how to install pgvector",
"q2": "what is similarity search",
"q3": "image embeddings with CLIP",
}
corpus = {
"d1": "Guide to installing pgvector on Ubuntu",
"d2": "pgvector for PostgreSQL: Docker setup",
"d3": "Vector similarity search",
"d4": "CLIP: multimodal model for images and text",
"d5": "Spaghetti carbonara recipe",
}
# Mapping query -> relevant documents
relevant = {
    "q1": {"d1", "d2"},  # d1 and d2 relevant for q1
    "q2": {"d3"},
    "q3": {"d4"},
}
# Evaluate the model
model = SentenceTransformer("all-MiniLM-L6-v2")
evaluator = InformationRetrievalEvaluator(
queries=queries,
corpus=corpus,
relevant_docs=relevant,
name="custom-eval"
)
results = evaluator(model)
# Exact key names vary across sentence-transformers versions
print(f"NDCG@10: {results['custom-eval_ndcg@10']:.4f}")
print(f"MAP@10: {results['custom-eval_map@10']:.4f}")
9.2 Intrinsic vs Extrinsic Evaluation
Two Approaches to Evaluation
| Type | What It Measures | Example | When to Use |
|---|---|---|---|
| Intrinsic | Properties of the vectors themselves | Analogies, clustering, STS | Quick comparison between models |
| Extrinsic | Performance on the final task | RAG quality, search precision | Final production decision |
Practical Advice
Do not rely only on the MTEB score. A model may have a high MTEB score but perform poorly in your specific domain. Always evaluate on your own dataset: create a small set of queries and relevant documents from your domain, and measure nDCG and MAP. This gives you a much more reliable estimate of real-world performance.
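nDCG is straightforward to compute by hand for a small evaluation set. A self-contained sketch with binary relevance (the document ids and the ranking below are hypothetical):

```python
import math

def dcg_at_k(relevances: list[int], k: int) -> float:
    """Discounted Cumulative Gain: sum of rel_i / log2(i + 1),
    with 1-based positions i."""
    return sum(
        rel / math.log2(i + 2)
        for i, rel in enumerate(relevances[:k])
    )

def ndcg_at_k(ranked_doc_ids: list[str],
              relevant: set[str], k: int = 10) -> float:
    """nDCG@k with binary relevance: DCG of the actual ranking
    divided by the DCG of an ideal ranking."""
    rels = [1 if doc_id in relevant else 0 for doc_id in ranked_doc_ids]
    ideal = sorted(rels, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(rels, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical ranking returned by your search for one query
ranking = ["d7", "d2", "d9", "d1", "d4"]
relevant_docs = {"d2", "d1"}

print(f"nDCG@5: {ndcg_at_k(ranking, relevant_docs, k=5):.4f}")  # 0.6509
```

Run this over a few dozen real queries from your domain and average the scores: a model that wins on this number is a safer bet than one that merely tops the public leaderboard.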
10. Dimensionality Reduction
High-dimensional vectors are difficult to visualize and can be expensive in terms of storage and computation. Dimensionality reduction techniques help both for visualization and optimization.
10.1 Visualization Techniques
Dimensionality Reduction Techniques
| Technique | Preserves | Speed | Typical Use |
|---|---|---|---|
| PCA | Global variance | Very fast | Dimension reduction for storage, pre-processing |
| t-SNE | Local structure | Slow | 2D visualization of clusters |
| UMAP | Local + global structure | Medium | 2D visualization, also for pre-indexing reduction |
# pip install umap-learn matplotlib
import umap
import matplotlib.pyplot as plt
import numpy as np
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
# Texts with different categories
texts = [
# Databases
"PostgreSQL is a relational database",
"MongoDB is a NoSQL database",
"Redis is an in-memory database",
# AI/ML
"Deep learning uses deep neural networks",
"GPT is a language model",
"Linear regression is a simple algorithm",
# Food
"Pizza is baked in a wood-fired oven",
"Tiramisu is an Italian dessert",
"Pasta carbonara uses eggs and guanciale",
]
categories = ["DB"]*3 + ["AI"]*3 + ["Food"]*3
colors = ["blue"]*3 + ["red"]*3 + ["green"]*3
# Generate embeddings (384 dimensions)
embeddings = model.encode(texts, normalize_embeddings=True)
# Reduce to 2D with UMAP
reducer = umap.UMAP(n_components=2, random_state=42)
emb_2d = reducer.fit_transform(embeddings)
# Visualize
plt.figure(figsize=(10, 8))
for i, (x, y) in enumerate(emb_2d):
plt.scatter(x, y, c=colors[i], s=100, zorder=5)
plt.annotate(texts[i][:30], (x, y), fontsize=8, ha='left')
plt.title("Embeddings in 2D (UMAP)")
plt.savefig("embeddings_umap.png", dpi=150)
plt.show()
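The table lists PCA as the fast option for storage-oriented reduction. Since it needs nothing beyond linear algebra, here is a NumPy-only sketch via SVD (the `pca_reduce` helper is our own; in practice scikit-learn's `PCA` does the same job):

```python
import numpy as np

def pca_reduce(embeddings: np.ndarray, n_components: int):
    """Fit a PCA projection with SVD and reduce vector dimensionality.
    Returns (reduced vectors, mean, components) so that new vectors can
    later be projected with (v - mean) @ components.T."""
    mean = embeddings.mean(axis=0)
    centered = embeddings - mean
    # Rows of vt are the principal axes, ordered by explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    return centered @ components.T, mean, components

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 384)).astype(np.float32)
reduced, mean, comps = pca_reduce(vectors, n_components=64)
print(reduced.shape)  # (1000, 64)
```

Remember to store `mean` and `components` alongside the index: every future query vector must go through the same projection before searching.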
10.2 Matryoshka Embeddings
A recent and innovative technique: with Matryoshka Representation Learning (MRL), embeddings are trained so that the first N components of the vector already form a valid embedding on their own. This lets you truncate a vector from 1536 to 512 or 256 dimensions while retaining most of its quality.
OpenAI text-embedding-3-small and text-embedding-3-large support this technique: pass the dimensions parameter to get more compact vectors without recomputing embeddings.
from openai import OpenAI
import numpy as np
import os
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
text = "PostgreSQL as a vector database for AI applications"
# Generate the same embedding at different dimensions
for dim in [256, 512, 1024, 1536]:
response = client.embeddings.create(
input=[text],
model="text-embedding-3-small",
dimensions=dim
)
emb = response.data[0].embedding
    print(f"Dimensions: {dim}, norm: {np.linalg.norm(emb):.4f}")  # always ~1.0: the API re-normalizes truncated vectors
# In PostgreSQL: use columns with appropriate dimensions
# CREATE TABLE docs_compact (
# id BIGSERIAL PRIMARY KEY,
# content TEXT,
# embedding vector(256) -- more compact, 6x less storage
# );
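With a Matryoshka-trained local model you can do the same truncation client-side, as long as you re-normalize afterwards. A minimal sketch (`truncate_matryoshka` is an illustrative helper; only apply this to models actually trained with MRL):

```python
import numpy as np

def truncate_matryoshka(vec: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` components and re-normalize to unit length.
    Valid only for Matryoshka-trained models; truncating an ordinary
    embedding degrades quality far more."""
    truncated = np.asarray(vec, dtype=np.float32)[:dims]
    norm = np.linalg.norm(truncated)
    return truncated / norm if norm > 0 else truncated

full = np.random.default_rng(42).normal(size=1536).astype(np.float32)
full /= np.linalg.norm(full)
compact = truncate_matryoshka(full, 256)
print(len(compact), round(float(np.linalg.norm(compact)), 4))  # 256 1.0
```

The re-normalization step matters: cosine similarity assumes unit vectors, and a truncated slice of a unit vector is no longer unit length.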
11. Costs and Scaling Strategies
When moving from prototype to production, the costs of generating and storing embeddings become a critical factor. Here is a detailed analysis.
11.1 Costs for 1 Million Documents
Cost Estimate: 1M Documents (average 500 tokens/doc)
| Model | Generation Cost | Vector Size | Storage (float32) | Initial Total |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | $0 (local) | 384 | ~1.4 GB | Only GPU/CPU time |
| text-embedding-3-small | ~$10 (500M tokens) | 1536 | ~5.7 GB | $10 + storage |
| text-embedding-3-small (512 dim) | ~$10 | 512 | ~1.9 GB | $10 + less storage |
| text-embedding-3-large | ~$65 (500M tokens) | 3072 | ~11.4 GB | $65 + storage |
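The storage column follows directly from n_vectors × dimensions × 4 bytes (float32); a quick sanity check you can adapt to your own corpus size (the helper is illustrative and ignores pgvector's small per-value header and PostgreSQL row overhead):

```python
def vector_storage_gb(n_vectors: int, dims: int, bytes_per_float: int = 4) -> float:
    """Raw vector storage in GiB: n_vectors * dims * bytes per component.
    Ignores pgvector's per-value header and PostgreSQL page/row overhead."""
    return n_vectors * dims * bytes_per_float / (1024 ** 3)

for dims in (384, 512, 1536, 3072):
    print(f"{dims} dims: {vector_storage_gb(1_000_000, dims):.1f} GiB")
```

Running this reproduces the table's figures (~1.4, ~1.9, ~5.7 and ~11.4 GiB for 1M vectors).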
11.2 Optimization Strategies
Cost Reduction Strategies
- Use local models where possible: all-MiniLM-L6-v2 costs nothing and achieves acceptable quality for most use cases
- Matryoshka embeddings: use 512 instead of 1536 dimensions with text-embedding-3-small - same API cost, 67% less storage
- Cache embeddings: do not recompute the same text. Store embedding model name in the table to manage migrations
- Batch processing: never call the API one document at a time - batch 100-500 texts per request
- Incremental ingestion: only re-embed documents that have changed (use MD5 hash to detect changes)
import hashlib
import psycopg2
def should_reindex(conn, source_path: str, content: str) -> bool:
"""Returns True if the document needs to be re-indexed."""
content_hash = hashlib.md5(content.encode()).hexdigest()
with conn.cursor() as cur:
cur.execute("""
SELECT content_hash FROM documents
WHERE source_path = %s
LIMIT 1
""", (source_path,))
row = cur.fetchone()
if row is None:
return True # new document
return row[0] != content_hash # True if changed
# Save the hash in the table schema:
# CREATE TABLE documents (
# ...
# content_hash TEXT, -- MD5 of the content
# embedding_model TEXT NOT NULL, -- which model was used
# ...
# )
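The batch-processing advice above can be sketched as a small helper that caps each request both by item count and by an estimated token budget (`batch_texts` is our own helper; the 4-characters-per-token estimate is a rough heuristic, not a real tokenizer):

```python
from collections.abc import Iterator

def batch_texts(texts: list[str], max_batch: int = 200,
                max_tokens: int = 8000) -> Iterator[list[str]]:
    """Yield batches capped by both item count and an estimated token budget,
    so one oversized request never exceeds the API's input limit."""
    batch: list[str] = []
    tokens = 0
    for text in texts:
        est = max(1, len(text) // 4)  # rough: 1 token ~= 4 characters
        if batch and (len(batch) >= max_batch or tokens + est > max_tokens):
            yield batch
            batch, tokens = [], 0
        batch.append(text)
        tokens += est
    if batch:
        yield batch

# 450 texts of ~125 estimated tokens each -> batches limited by the token budget
batches = list(batch_texts(["word " * 100] * 450, max_batch=200))
print([len(b) for b in batches])
```

Each yielded batch can then go to `client.embeddings.create(input=batch, ...)` in a single call instead of 450 separate requests.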
Common Mistakes with Embeddings
- Mismatched models: Always use the same model for ingestion and queries. Mixing models produces garbage results.
- Chunks too large: Chunks over 512 tokens often contain multiple topics, hurting retrieval precision.
- Chunks too small: Chunks under 100 characters lack enough context for meaningful embeddings.
- No overlap: Without overlap, context at chunk boundaries is lost, reducing recall.
- Forgetting to normalize: Some distance metrics require L2-normalized vectors. Always check model documentation.
- Ignoring token limits: Text beyond the model's maximum token limit is silently truncated.
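The normalization mistake in the list above is cheap to guard against. A small defensive helper you can run on every vector before inserting or querying (`ensure_normalized` is our own name, not a library function):

```python
import numpy as np

def ensure_normalized(vec, atol: float = 1e-3) -> np.ndarray:
    """Return an L2-normalized copy of the vector; required before
    cosine or inner-product search on models that do not normalize."""
    v = np.asarray(vec, dtype=np.float32)
    norm = float(np.linalg.norm(v))
    if norm == 0:
        raise ValueError("zero vector cannot be normalized")
    if abs(norm - 1.0) <= atol:
        return v  # already unit length, nothing to do
    return v / norm

v = ensure_normalized([3.0, 4.0])
print([round(float(x), 3) for x in v])  # [0.6, 0.8]
```

Many sentence-transformers models accept `normalize_embeddings=True` at encode time, which makes this check a no-op; the helper catches the cases where that flag was forgotten.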
Embeddings Evaluation: Benchmarking Your Model Choice
Choosing the right embedding model is not just about benchmarks from research papers. The model that performs best on MTEB may not be the best for your specific domain. Here is how to run your own evaluation to make an informed decision:
import psycopg2
import time
from openai import OpenAI
from sentence_transformers import SentenceTransformer
# ===================================
# Evaluation framework for embedding models
# ===================================
class EmbeddingEvaluator:
"""
Evaluate embedding models on your specific domain data.
Methodology:
1. Create a test set of (query, relevant_document) pairs
2. Embed all documents with each candidate model
3. For each query, measure Recall@K and MRR (Mean Reciprocal Rank)
4. Compare latency and cost
"""
def __init__(self, conn, test_pairs: list[tuple]):
"""
Args:
conn: PostgreSQL connection
test_pairs: List of (query, expected_source_path) tuples
"""
self.conn = conn
self.test_pairs = test_pairs # ground truth
def evaluate_model(self, model_name: str, embedder, k: int = 5) -> dict:
"""Run full evaluation for one embedding model."""
recall_scores = []
mrr_scores = []
latencies = []
for query, expected_source in self.test_pairs:
# Embed query
t_start = time.perf_counter()
query_vec = embedder.embed(query)
embed_time = (time.perf_counter() - t_start) * 1000
# Search in PostgreSQL
t_start = time.perf_counter()
with self.conn.cursor() as cur:
cur.execute("""
SELECT source_path
FROM documents
WHERE embedding_model = %s
ORDER BY embedding <=> %s::vector
LIMIT %s
""", (model_name, query_vec, k))
results = [r[0] for r in cur.fetchall()]
search_time = (time.perf_counter() - t_start) * 1000
# Calculate Recall@K
recall = 1.0 if expected_source in results else 0.0
recall_scores.append(recall)
# Calculate MRR (position of first correct result)
mrr = 0.0
if expected_source in results:
position = results.index(expected_source) + 1
mrr = 1.0 / position
mrr_scores.append(mrr)
latencies.append(embed_time + search_time)
return {
"model": model_name,
f"recall_at_{k}": round(sum(recall_scores) / len(recall_scores), 4),
"mrr": round(sum(mrr_scores) / len(mrr_scores), 4),
"avg_latency_ms": round(sum(latencies) / len(latencies), 2),
"n_queries": len(self.test_pairs)
}
# Example: compare OpenAI small vs local MiniLM
test_pairs = [
("How do I create an HNSW index?", "docs/pgvector_guide.md"),
("What is cosine similarity?", "docs/embeddings_intro.md"),
("PgBouncer connection pooling configuration", "docs/production_guide.md"),
# ... add 20-50 pairs for statistical significance
]
evaluator = EmbeddingEvaluator(conn, test_pairs)
# Results example (your domain may differ):
# model=text-embedding-3-small: recall@5=0.91, mrr=0.78, latency=45ms
# model=all-MiniLM-L6-v2: recall@5=0.84, mrr=0.71, latency=12ms
# Conclusion: small quality gap; the local model is ~3.7x faster
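One practical detail for queries like the one in `evaluate_model`: psycopg2 adapts a Python list to a PostgreSQL array, so it is often simplest to serialize the vector to pgvector's text format yourself and cast with `%s::vector`. A tiny illustrative helper (`to_pgvector` is our own name, not a psycopg2 or pgvector API):

```python
def to_pgvector(vec: list[float]) -> str:
    """Serialize a Python list to pgvector's text literal, e.g. '[0.25,-1,3.5]'.
    Pass the result as a query parameter and cast with %s::vector in SQL."""
    return "[" + ",".join(f"{x:g}" for x in vec) + "]"

print(to_pgvector([0.25, -1.0, 3.5]))  # [0.25,-1,3.5]
```

Alternatively, the official `pgvector` Python package registers adapters that handle NumPy arrays and lists transparently.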
Advanced Embedding Techniques: Fine-Tuning and Domain Adaptation
General-purpose embedding models are trained on broad text corpora. For specialized domains (medical, legal, financial, code), fine-tuning or domain adaptation can significantly improve retrieval quality. Here are the practical approaches:
Approach 1: Prompt Engineering for Better Embeddings
# The simplest domain adaptation: prefix prompting
# Instruction-tuned models (such as the E5 family) respond well to task-specific prefixes
def embed_with_task_prefix(text: str, task: str = "search_document") -> list[float]:
"""
Add task-specific prefix to improve embedding quality for specific tasks.
Tested with E5 and similar models that support instruction-following.
task values:
- "search_document": for documents to be retrieved
- "search_query": for user queries
- "classification": for classification tasks
- "clustering": for clustering tasks
"""
prefixed = f"Represent this {task}: {text}"
return embedder.embed(prefixed)
# Example: E5-large-instruct with task prefix
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("intfloat/e5-large-instruct")
# For ingestion (documents):
doc_embedding = model.encode(
"Represent this document for retrieval: " + document_text,
normalize_embeddings=True
)
# For queries:
query_embedding = model.encode(
"Represent this query for retrieving relevant documents: " + user_query,
normalize_embeddings=True
)
# The asymmetric approach (different prefixes for docs vs queries)
# often improves Recall@10 by 5-15% compared to symmetric embeddings
Approach 2: Domain-Specific Fine-Tuning with Sentence Transformers
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
def fine_tune_embedding_model(
base_model: str,
training_pairs: list[tuple], # (query, relevant_doc, irrelevant_doc) triplets
output_path: str,
epochs: int = 3,
batch_size: int = 32
):
"""
Fine-tune an embedding model on domain-specific data using triplet loss.
training_pairs: List of (anchor, positive, negative) tuples where:
- anchor: a user query from your domain
- positive: a relevant document for that query
- negative: an irrelevant document (hard negative = challenging)
Hard negatives are documents that are superficially similar but not relevant.
Mining hard negatives is the most important step for good fine-tuning.
"""
model = SentenceTransformer(base_model)
# Convert to InputExamples
examples = [
InputExample(texts=[anchor, positive, negative])
for anchor, positive, negative in training_pairs
]
# DataLoader
dataloader = DataLoader(examples, shuffle=True, batch_size=batch_size)
    # Triplet loss: minimize max(0, d(anchor, positive) - d(anchor, negative) + margin)
loss = losses.TripletLoss(model=model)
# Training
model.fit(
train_objectives=[(dataloader, loss)],
epochs=epochs,
warmup_steps=int(0.1 * len(dataloader) * epochs),
output_path=output_path,
show_progress_bar=True
)
return model
# Minimum data requirements for meaningful fine-tuning:
# - At least 1000 triplets for initial improvement
# - 5000+ triplets for significant gains
# - Use existing production query logs + relevant document pairs
# - Mine hard negatives with BM25 (retrieve by keyword, keep the non-relevant top hits)
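The last point - mining hard negatives via keyword retrieval - can be sketched with a simple lexical-overlap scorer standing in for BM25 (`mine_hard_negatives` is our own helper; in practice use a real BM25 implementation such as rank_bm25 or PostgreSQL full-text search):

```python
def mine_hard_negatives(query: str, relevant_doc: str,
                        corpus: list[str], n: int = 2) -> list[str]:
    """Rank non-relevant documents by keyword overlap with the query and
    keep the top matches: lexically similar but not relevant = hard negative.
    A crude stand-in for BM25, for illustration only."""
    q_terms = set(query.lower().split())
    scored = [
        (len(q_terms & set(doc.lower().split())), doc)
        for doc in corpus if doc != relevant_doc
    ]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [doc for score, doc in scored[:n] if score > 0]

corpus = [
    "how to create an hnsw index in pgvector",
    "hnsw index tuning parameters explained",
    "tiramisu is an italian dessert",
]
negs = mine_hard_negatives("create hnsw index", corpus[0], corpus, n=1)
print(negs)  # ['hnsw index tuning parameters explained']
```

Each (query, relevant_doc, hard_negative) result feeds directly into the triplet format expected by `fine_tune_embedding_model` above.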
Production Embedding Pipeline: Async and Batch Processing
In production, the embedding generation phase is often the bottleneck. Here is how to implement an efficient async pipeline that processes documents in parallel while respecting API rate limits:
import asyncio
import aiohttp
import time
class AsyncEmbeddingPipeline:
"""
Production-grade async embedding pipeline with:
- Rate limiting (tokens per minute)
- Automatic retry with exponential backoff
- Progress tracking
- Batch size optimization
"""
def __init__(self,
api_key: str,
model: str = "text-embedding-3-small",
max_tpm: int = 1_000_000, # tokens per minute limit
max_retries: int = 3):
self.api_key = api_key
self.model = model
self.max_tpm = max_tpm
self.max_retries = max_retries
self._token_count = 0
self._window_start = time.time()
async def _check_rate_limit(self, tokens: int):
"""Wait if we are approaching the rate limit."""
elapsed = time.time() - self._window_start
if elapsed >= 60:
# Reset window
self._token_count = 0
self._window_start = time.time()
elif self._token_count + tokens > self.max_tpm:
# Wait until the minute window resets
wait_time = 60 - elapsed
print(f"Rate limit reached. Waiting {wait_time:.1f}s...")
await asyncio.sleep(wait_time)
self._token_count = 0
self._window_start = time.time()
self._token_count += tokens
async def embed_batch_async(self,
session: aiohttp.ClientSession,
texts: list[str],
attempt: int = 0) -> list[list[float]]:
"""Embed a batch of texts with retry logic."""
# Estimate token count (rough: 1 token ~= 4 chars)
estimated_tokens = sum(len(t) // 4 for t in texts)
await self._check_rate_limit(estimated_tokens)
try:
async with session.post(
"https://api.openai.com/v1/embeddings",
headers={"Authorization": f"Bearer {self.api_key}"},
json={"input": texts, "model": self.model},
timeout=aiohttp.ClientTimeout(total=30)
) as response:
if response.status == 429:
# Rate limited: exponential backoff
if attempt < self.max_retries:
wait = (2 ** attempt) * 5
await asyncio.sleep(wait)
return await self.embed_batch_async(session, texts, attempt + 1)
response.raise_for_status()
data = await response.json()
return [item["embedding"] for item in sorted(data["data"], key=lambda x: x["index"])]
except Exception as e:
if attempt < self.max_retries:
await asyncio.sleep(2 ** attempt)
return await self.embed_batch_async(session, texts, attempt + 1)
raise
async def process_documents(self,
texts: list[str],
batch_size: int = 100) -> list[list[float]]:
"""Process all documents with concurrent batches."""
all_embeddings = [None] * len(texts)
async with aiohttp.ClientSession() as session:
# Process in batches, with max 3 concurrent requests
semaphore = asyncio.Semaphore(3)
async def process_batch(batch_texts, start_idx):
async with semaphore:
embeddings = await self.embed_batch_async(session, batch_texts)
for i, emb in enumerate(embeddings):
all_embeddings[start_idx + i] = emb
tasks = []
for i in range(0, len(texts), batch_size):
batch = texts[i:i + batch_size]
tasks.append(process_batch(batch, i))
await asyncio.gather(*tasks)
return all_embeddings
# Usage:
pipeline = AsyncEmbeddingPipeline(api_key=openai_api_key)
embeddings = asyncio.run(pipeline.process_documents(all_texts, batch_size=100))
print(f"Generated {len(embeddings)} embeddings")
Conclusions and Next Steps
Embeddings are the foundation of all AI-powered search. Understanding the theory behind them - historical evolution, distance metrics, and the properties of vector space - allows you to make better decisions about model selection, chunking strategy, and quality evaluation. PostgreSQL with pgvector provides a robust, cost-effective platform for storing and querying these representations without introducing new infrastructure dependencies.
The next article in this series builds on embeddings to construct a complete Retrieval-Augmented Generation (RAG) pipeline: from document ingestion through intelligent retrieval to LLM-powered answer generation - all on PostgreSQL.