Embeddings: Theory and Practice with PostgreSQL
Every semantic search system, every RAG pipeline, and every AI application that works with natural language shares one fundamental building block: embeddings. They are the translation of meaning into numbers, the bridge between the world of text and the world of mathematics. Without embeddings, a database cannot distinguish "dog" from "automobile" - with embeddings, it knows that "dog" is closer to "cat" than it is to "toaster".
In the first article of this series we configured pgvector and learned how to store and query vectors in PostgreSQL. But where do those vectors come from? How do you generate a high-quality embedding? And which model should you choose among the dozens available? In this article we answer all these questions, from mathematical theory to practical Python and PostgreSQL implementation.
Series Overview
| # | Article | Focus |
|---|---|---|
| 1 | pgvector | Installation, operators, indexing |
| 2 | You are here - Embeddings | Models, distances, generation |
| 3 | RAG with PostgreSQL | End-to-end RAG pipeline |
| 4 | Advanced Similarity Search | Hybrid search, filtering |
| 5 | Indexing and Performance | HNSW, IVFFlat, tuning |
| 6 | RAG in Production | Monitoring, scaling, CI/CD |
What You Will Learn
- What an embedding is and why it is fundamental to modern AI
- Historical evolution: from one-hot encoding to Word2Vec, GloVe, BERT and Sentence Transformers
- Mathematical properties of embeddings: vector analogies and semantic clustering
- The four distance metrics with formulas and use cases
- How to generate embeddings with Python: locally and via API
- How to store and query embeddings in PostgreSQL with pgvector
- Multimodal embeddings: text, images, audio and code
- How to evaluate embedding model quality (MTEB)
- Costs and scaling strategies for millions of documents
1. What Are Embeddings?
An embedding is a dense vector representation of an object (word, sentence, document, image) in a continuous low-dimensional space. In practical terms, it is an array of floating-point numbers that captures the "meaning" of that object.
# The embedding of the sentence "The cat sleeps on the couch"
# generated with OpenAI text-embedding-3-small (1536 dimensions)
embedding = [
0.0231, -0.0456, 0.0891, -0.0123, 0.0567, -0.0234,
0.0789, -0.0345, 0.0123, -0.0678, 0.0456, -0.0891,
# ... 1524 more values ...
]
print(f"Type: {type(embedding)}") # <class 'list'>
print(f"Dimensions: {len(embedding)}") # 1536
The key intuition is this: in a well-trained vector space, the geometric distance between two vectors reflects the semantic similarity between the concepts they represent. Sentences with similar meaning will have nearby vectors; sentences with different meanings will be far apart.
Embedding Properties
| Property | Description | Example |
|---|---|---|
| Dense | Most dimensions carry non-zero values (unlike sparse one-hot vectors) | [0.023, -0.045, 0.089, ...] |
| Continuous | Real values, not discrete | Each component is a float32/float16 |
| Fixed dimensionality | The same model always produces vectors of the same length | 384, 768, 1536 or 3072 dimensions |
| Semantically meaningful | Distances between vectors reflect meaning relationships | sim("cat", "feline") > sim("cat", "car") |
If we think of embedding space as a map, similar concepts form "neighborhoods": animals in one area, vehicles in another, emotions in yet another. The beauty is that these relationships emerge automatically from training - they are never programmed manually.
2. From Words to Vectors: Historical Evolution
The history of embeddings is a progression of increasingly sophisticated ideas, each solving the limitations of the previous one. Understanding this evolution helps explain why modern models work so well.
2.1 One-Hot Encoding (1990s)
The simplest approach: each word is represented by a vector with a single 1 and all other values set to 0. If the vocabulary has V words, each vector has V dimensions.
# Vocabulary: ["cat", "dog", "fish", "car", "bike"]
# Vector size = vocabulary size = 5
cat = [1, 0, 0, 0, 0]
dog = [0, 1, 0, 0, 0]
fish = [0, 0, 1, 0, 0]
car = [0, 0, 0, 1, 0]
bike = [0, 0, 0, 0, 1]
# Problem 1: the distance between "cat" and "dog" equals
# the distance between "cat" and "car"
import numpy as np
dist_cat_dog = np.linalg.norm(
np.array(cat) - np.array(dog)
) # sqrt(2) = 1.414
dist_cat_car = np.linalg.norm(
np.array(cat) - np.array(car)
) # sqrt(2) = 1.414 -- identical!
# Problem 2: with a vocabulary of 100,000 words,
# each vector has 100,000 dimensions (sparse, inefficient)
Limitations of One-Hot Encoding
- Exploding dimensionality: for a 100K-word vocabulary, each vector has 100K dimensions, almost all zero.
- No semantic information: all vectors are equidistant from each other. "Cat" is as far from "feline" as it is from "earthquake". This approach captures no meaning relationships between words.
2.2 TF-IDF (Term Frequency - Inverse Document Frequency)
A step forward: instead of 0/1, vector components indicate how important a word is in a document relative to the entire corpus. But each document becomes a sparse vector in vocabulary dimensionality.
from sklearn.feature_extraction.text import TfidfVectorizer
documents = [
"the cat sleeps on the couch",
"the dog plays in the garden",
"the automobile drives on the road",
"the feline rests on the armchair",
]
vectorizer = TfidfVectorizer(stop_words="english")  # drop "the", "on", "in", ...
tfidf_matrix = vectorizer.fit_transform(documents)
# Result: sparse matrix (4 documents x 12 content terms)
print(f"Shape: {tfidf_matrix.shape}")  # (4, 12)
print(f"Terms: {vectorizer.get_feature_names_out()}")
# Problem: "cat sleeps" and "feline rests" are maximally distant
# because they share no content words, even though the meaning is similar
from sklearn.metrics.pairwise import cosine_similarity
sim = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[3:4])
print(f"Similarity cat-feline: {sim[0][0]:.3f}")  # 0.000 -- no overlap!
TF-IDF improves on one-hot encoding by weighting words by importance, but suffers from the same fundamental problem: it does not understand that "cat" and "feline" are synonyms, because it reasons only on exact lexical matches.
2.3 Word2Vec: The Revolution (2013)
In 2013, Tomas Mikolov and his team at Google published Word2Vec, which changed everything. The brilliant idea: a word is defined by the context in which it appears. Words that appear in similar contexts will have similar representations.
Word2Vec uses shallow neural networks to learn dense vectors (typically 100-300 dimensions) from large text corpora. Two architectures:
Word2Vec Architectures
| Architecture | Input | Output | Description |
|---|---|---|---|
| CBOW | Context words | Target word | Predicts the central word given the surrounding context |
| Skip-gram | Target word | Context words | Predicts the surrounding words given the central word |
from gensim.models import Word2Vec
# Sample corpus (in production: millions of sentences)
sentences = [
["the", "cat", "sleeps", "on", "the", "couch"],
["the", "dog", "plays", "in", "the", "garden"],
["the", "feline", "rests", "on", "the", "armchair"],
["the", "dog", "runs", "in", "the", "park"],
]
# Train Word2Vec (Skip-gram)
model = Word2Vec(
sentences=sentences,
vector_size=100, # embedding dimensionality
window=5, # context: 5 words before and after
min_count=1, # include words with at least 1 occurrence
sg=1, # 1 = Skip-gram, 0 = CBOW
epochs=100
)
# With a corpus this small the neighbors are noisy; trained on a
# realistic corpus, "cat" and "feline" end up close:
print(model.wv.most_similar("cat", topn=3))
# e.g. [('feline', 0.92), ('dog', 0.85), ('sleeps', 0.71)]
# Access to the vector
vector_cat = model.wv["cat"]
print(f"Dimensions: {vector_cat.shape}") # (100,)
print(f"First 5: {vector_cat[:5]}")
2.4 GloVe: Global Vectors (2014)
Stanford developed GloVe (Global Vectors for Word Representation) with a different approach: instead of a neural network, GloVe factorizes the global co-occurrence matrix of the corpus. It combines the advantages of global statistical methods (like LSA) with those of Word2Vec's local context.
GloVe minimizes a cost function that ensures the dot product between two word vectors is proportional to the logarithm of their co-occurrence probability.
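That objective can be sketched in a few lines of numpy. The co-occurrence counts below are hypothetical toy values, and the weighting function f(x) uses the defaults from the GloVe paper (x_max = 100, alpha = 0.75); this is an illustration of the loss, not a full trainer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy co-occurrence counts X[i][j] between 4 words (hypothetical values)
X = np.array([
    [0.0, 12.0, 1.0, 0.0],
    [12.0, 0.0, 2.0, 1.0],
    [1.0, 2.0, 0.0, 8.0],
    [0.0, 1.0, 8.0, 0.0],
])
V, d = X.shape[0], 10  # vocabulary size, embedding dimensionality

# Parameters: word vectors w, context vectors w_tilde, and two bias terms
w = rng.normal(scale=0.1, size=(V, d))
w_tilde = rng.normal(scale=0.1, size=(V, d))
b = np.zeros(V)
b_tilde = np.zeros(V)

def weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(x): caps the influence of very frequent pairs."""
    return np.minimum((x / x_max) ** alpha, 1.0)

def glove_loss(w, w_tilde, b, b_tilde, X):
    """J = sum over pairs with X_ij > 0 of
    f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2"""
    i_idx, j_idx = np.nonzero(X)
    pred = (w[i_idx] * w_tilde[j_idx]).sum(axis=1) + b[i_idx] + b_tilde[j_idx]
    err = pred - np.log(X[i_idx, j_idx])
    return float(np.sum(weight(X[i_idx, j_idx]) * err ** 2))

print(f"Initial loss: {glove_loss(w, w_tilde, b, b_tilde, X):.4f}")
```

Training then minimizes this loss with stochastic gradient descent over the non-zero entries of X; at convergence, w_i . w~_j tracks log X_ij, which is exactly the property described above.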
2.5 FastText: Subword Embeddings (2016)
Facebook AI Research (FAIR) extended Word2Vec with FastText, which represents each word as a set of character n-grams. This solves two critical problems:
- Rare or out-of-vocabulary (OOV) words: FastText can generate embeddings for never-seen words by composing vectors of sub-segments
- Morphology: morphologically related words (e.g. "run", "running", "runner") share n-grams and therefore have similar vectors
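The n-gram decomposition is easy to sketch. Following FastText's convention, the word is wrapped in "<" and ">" boundary markers before extracting n-grams of length 3 to 6 (the library's defaults):

```python
def char_ngrams(word: str, n_min: int = 3, n_max: int = 6) -> set[str]:
    """Character n-grams as FastText builds them: the word is wrapped
    in boundary markers '<' and '>' before extraction."""
    wrapped = f"<{word}>"
    return {
        wrapped[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(wrapped) - n + 1)
    }

# Morphologically related words share many n-grams...
shared = char_ngrams("running") & char_ngrams("runner")
print(sorted(shared))
# ['<ru', '<run', '<runn', 'run', 'runn', 'unn']

# ...while unrelated words share few or none
print(char_ngrams("running") & char_ngrams("toaster"))  # set()
```

A word's embedding is the sum of the vectors of its n-grams (plus the whole word, if it was seen in training), which is why an OOV word like "runningly" still gets a sensible vector.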
Evolution: From Sparse to Dense Representations
| Method | Year | Type | Typical Dimensions | Semantics |
|---|---|---|---|---|
| One-hot | - | Sparse | V (vocabulary) | None |
| TF-IDF | 1972 | Sparse | V (vocabulary) | Statistical |
| Word2Vec | 2013 | Dense | 100-300 | Local contextual |
| GloVe | 2014 | Dense | 50-300 | Global + local |
| FastText | 2016 | Dense | 100-300 | Subword + context |
| BERT | 2018 | Dense | 768 | Dynamic contextual |
| Sentence Transformers | 2019 | Dense | 384-1024 | Full sentences |
3. Mathematical Properties of Embeddings
One of the most fascinating discoveries of Word2Vec is that the vector space learns algebraic relationships between concepts. Arithmetic operations on vectors produce semantically coherent results.
3.1 Vector Analogies
The famous analogy: king - man + woman = queen. In vector terms, the difference between "king" and "man" captures the concept of "royalty", and adding it to "woman" gives "queen". Formally:
import gensim.downloader as api
# Load pre-trained GloVe embeddings
model = api.load("glove-wiki-gigaword-100")
# king - man + woman = ?
result = model.most_similar(
positive=["king", "woman"],
negative=["man"],
topn=3
)
print(result)
# [('queen', 0.7698), ('princess', 0.6450), ('monarch', 0.6345)]
# Other analogies that work:
# Paris - France + Italy = Rome
result2 = model.most_similar(
positive=["paris", "italy"],
negative=["france"],
topn=1
)
print(result2) # [('rome', 0.8722)]
# good - bad + sad = ?
result3 = model.most_similar(
positive=["good", "sad"],
negative=["bad"],
topn=1
)
print(result3) # [('happy', 0.6891)]
3.2 Semantic Clustering
Embeddings naturally form clusters in vector space. If we project vectors into 2D (using t-SNE or UMAP), we observe that words of the same category group together: animals near animals, countries near countries, professions near professions.
This property is fundamental for practical applications: similarity search works precisely because documents on similar topics have nearby embeddings in vector space.
4. Modern Embeddings: Contextual Representations
Word2Vec and GloVe generate a single vector per word, independent of context. But "bank" has different meanings in "river bank" and "bank account". Contextual embeddings, introduced in 2018 with models like ELMo and BERT, solve this problem: the same word gets a different vector depending on its context.
4.1 BERT Embeddings
BERT (Bidirectional Encoder Representations from Transformers) processes the entire sentence and produces a vector for each token. To obtain a whole-sentence embedding, two approaches are commonly used:
- CLS token: the first special [CLS] token contains an aggregated representation of the sentence
- Mean pooling: average of all token vectors - generally produces better results for similarity search
BERT is not optimal for similarity search
Original BERT was not trained to produce high-quality sentence embeddings. The CLS token is optimized for classification, not semantic similarity. For similarity search, specialized models like Sentence Transformers are needed.
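Mean pooling itself is a small operation: average the token vectors while ignoring padding positions. A minimal numpy sketch, with random arrays standing in for real BERT token outputs (only the shapes are meaningful here):

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-ins for BERT outputs (hypothetical values): a batch of 2 sentences,
# padded to 6 tokens, hidden size 768
token_embeddings = rng.normal(size=(2, 6, 768)).astype(np.float32)
# attention_mask: 1 for real tokens, 0 for padding
attention_mask = np.array([
    [1, 1, 1, 1, 0, 0],   # sentence 1: 4 real tokens
    [1, 1, 1, 1, 1, 1],   # sentence 2: 6 real tokens
])

def mean_pooling(token_embeddings: np.ndarray,
                 attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors, excluding padding positions."""
    mask = attention_mask[:, :, None].astype(np.float32)  # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)        # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)        # avoid div by zero
    return summed / counts

sentence_embeddings = mean_pooling(token_embeddings, attention_mask)
print(sentence_embeddings.shape)  # (2, 768)
```

With a real model, token_embeddings would be the transformer's last hidden state and attention_mask would come from the tokenizer; Sentence Transformers applies exactly this kind of pooling layer on top of BERT.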
4.2 Sentence Transformers (SBERT)
In 2019, Reimers and Gurevych introduced Sentence-BERT, fine-tuning BERT with a siamese structure to produce meaningful sentence embeddings. This revolutionized similarity search: for the first time it was possible to compare sentences with a simple cosine distance, achieving high-quality results.
4.3 Embedding Models: Full Comparison
Embedding Models Compared (2026)
| Model | Provider | Dimensions | MTEB Score | Cost / 1M tokens | Notes |
|---|---|---|---|---|---|
| text-embedding-3-small | OpenAI | 1536 | 62.3 | $0.02 | Best cost/quality ratio |
| text-embedding-3-large | OpenAI | 3072 | 64.6 | $0.13 | Maximum OpenAI quality |
| embed-v3 | Cohere | 1024 | 64.5 | $0.10 | Supports 100+ languages |
| voyage-3 | Voyage AI | 1024 | 67.1 | $0.06 | Top for retrieval tasks |
| all-MiniLM-L6-v2 | HuggingFace | 384 | 56.3 | Free | Fast, local, compact |
| all-mpnet-base-v2 | HuggingFace | 768 | 57.8 | Free | Best open-source base model |
| gte-large-en-v1.5 | Alibaba (HF) | 1024 | 65.4 | Free | Competitive with commercial models |
| bge-large-en-v1.5 | BAAI (HF) | 1024 | 64.2 | Free | Excellent for RAG |
How to Choose a Model
- Prototype / limited budget: all-MiniLM-L6-v2 (free, fast, 384 dim)
- Production, cost-effective: text-embedding-3-small (OpenAI, $0.02/1M tokens)
- Maximum retrieval quality: voyage-3 or gte-large-en-v1.5
- Multilingual: Cohere embed-v3 (100+ languages)
- Self-hosted / privacy: bge-large-en-v1.5 or gte-large-en-v1.5
5. Distance Metrics Between Vectors
The choice of distance metric directly influences the quality of similarity search. Here are the four main metrics with their mathematical formulas, strengths, and when to use them.
5.1 Cosine Similarity
The most commonly used metric for text embeddings. It measures the angle between two vectors, ignoring their magnitude (length): cos(a, b) = (a · b) / (||a|| ||b||). Two vectors pointing in the same direction have cosine similarity 1, orthogonal vectors 0, opposite vectors -1.
In pgvector, the <=> operator computes the cosine distance (= 1 - cosine similarity), where 0 means identical and 2 means opposite.
5.2 Euclidean Distance (L2)
The straight-line distance between two points in space: d(a, b) = sqrt(Σ (a_i - b_i)^2). It takes into account both the direction and the magnitude of the vectors. The <-> operator in pgvector computes L2 distance.
5.3 Dot Product (Inner Product)
The dot product a · b = Σ a_i * b_i measures both direction and magnitude. For normalized vectors (norm = 1), the dot product is equivalent to cosine similarity. In pgvector, the <#> operator computes the negative inner product, so that ORDER BY ... ASC returns the most similar rows first.
5.4 Manhattan Distance (L1)
Sum of absolute differences component by component: d(a, b) = Σ |a_i - b_i|. Less sensitive to outliers than Euclidean distance. Supported natively in pgvector since version 0.7.0 via the <+> operator; on older versions it can be computed manually.
import numpy as np
from scipy.spatial.distance import cosine, euclidean, cityblock
# Two sample vectors (normalized)
a = np.array([0.5, 0.3, 0.8, 0.1, 0.6])
b = np.array([0.4, 0.35, 0.75, 0.15, 0.55])
# L2 normalization
a_norm = a / np.linalg.norm(a)
b_norm = b / np.linalg.norm(b)
# 1. Cosine Similarity (1 - cosine distance)
cos_sim = 1 - cosine(a_norm, b_norm)
print(f"Cosine similarity: {cos_sim:.6f}") # ~0.999
# 2. Euclidean Distance (L2)
l2_dist = euclidean(a_norm, b_norm)
print(f"L2 distance: {l2_dist:.6f}") # ~0.042
# 3. Dot Product (for normalized vectors = cosine similarity)
dot = np.dot(a_norm, b_norm)
print(f"Dot product: {dot:.6f}") # ~0.999
# 4. Manhattan Distance (L1)
l1_dist = cityblock(a_norm, b_norm)
print(f"L1 distance: {l1_dist:.6f}") # ~0.072
# L2-Cosine relation for normalized vectors:
# d_L2^2 = 2 * (1 - cos_sim)
print(f"\nVerification: L2^2 = {l2_dist**2:.6f}")
print(f"2*(1-cos) = {2*(1-cos_sim):.6f}") # identical!
When to Use Each Metric
| Metric | pgvector Operator | Use When | Avoid When |
|---|---|---|---|
| Cosine | <=> | Text embeddings, when magnitude does not matter | Spatial data where magnitude is significant |
| L2 (Euclidean) | <-> | Images, numeric data, when magnitude matters | Vectors with different scales across components |
| Dot Product | <#> | Pre-normalized vectors (slightly better performance) | Non-normalized vectors (results distorted by magnitude) |
| Manhattan (L1) | <+> (pgvector 0.7.0+) | Sparse data, outlier robustness | General use with dense embeddings |
Practical Rule
For 95% of cases with text embeddings, use cosine distance (<=> in pgvector). Modern embedding models produce already-normalized vectors, which makes cosine and dot product practically equivalent. Euclidean distance makes sense for spatial data or when vector magnitude carries information.
6. Generating Embeddings in Python
Let us now see how to generate embeddings with three different approaches: local models with Sentence Transformers, OpenAI API, and HuggingFace Inference API. Each approach has its own advantages and trade-offs.
6.1 Sentence Transformers (Local)
The most flexible and private approach: the model runs on your machine, no data leaves your network, no API call costs.
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer
import numpy as np
# Load the model (downloaded automatically on first use)
model = SentenceTransformer("all-MiniLM-L6-v2")
# Single sentence embedding
sentence = "PostgreSQL is an open-source relational database"
embedding = model.encode(sentence)
print(f"Type: {type(embedding)}") # numpy.ndarray
print(f"Dimensions: {embedding.shape}") # (384,)
# Batch embedding (much more efficient)
sentences = [
"PostgreSQL is an open-source relational database",
"pgvector adds vector support to PostgreSQL",
"Machine learning requires large amounts of data",
"Pizza margherita is a classic Italian dish",
]
embeddings = model.encode(
sentences,
batch_size=32, # process 32 sentences at a time
show_progress_bar=True, # show progress for large batches
normalize_embeddings=True # normalize to L2 norm = 1
)
print(f"Shape: {embeddings.shape}") # (4, 384)
# Compute pairwise similarity
from sentence_transformers.util import cos_sim
similarities = cos_sim(embeddings, embeddings)
print(f"\nSimilarity matrix:\n{similarities}")
# The first 2 sentences (about PostgreSQL) will have high similarity
# The pizza sentence will be distant from the others
6.2 OpenAI Embedding API
The OpenAI API offers high-quality models without infrastructure management. Ideal for production with moderate volumes.
# pip install openai
from openai import OpenAI
import os
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
def get_embeddings(
    texts: list[str],
    model: str = "text-embedding-3-small"
) -> list[list[float]]:
    """Generate embeddings for a list of texts."""
    response = client.embeddings.create(
        input=texts,
        model=model,
    )
    return [item.embedding for item in response.data]
# Single embedding
text = "PostgreSQL as vector database for AI"
embedding = get_embeddings([text])[0]
print(f"Dimensions: {len(embedding)}") # 1536
# Batch embeddings (up to 2048 texts per call)
texts = [
"How to install pgvector on Docker",
"Tutorial for similarity search in PostgreSQL",
"Guide to pasta al forno cooking",
]
embeddings = get_embeddings(texts)
print(f"Embeddings generated: {len(embeddings)}") # 3
# Reduced dimension with text-embedding-3-small
# You can specify smaller dimensions to save space
response = client.embeddings.create(
input=["Sample text"],
model="text-embedding-3-small",
dimensions=512 # reduced from 1536 to 512
)
emb_small = response.data[0].embedding
print(f"Reduced dimensions: {len(emb_small)}") # 512
6.3 HuggingFace Inference API
A compromise between local models and commercial APIs: access to thousands of open-source models via API, with a generous free tier.
# pip install huggingface_hub
from huggingface_hub import InferenceClient
import os
client = InferenceClient(
token=os.getenv("HF_TOKEN")
)
def get_hf_embeddings(
    texts: list[str],
    model: str = "BAAI/bge-large-en-v1.5"
) -> list[list[float]]:
    """Generate embeddings using the HuggingFace Inference API."""
    # feature_extraction takes a single string and returns a numpy array
    return [
        client.feature_extraction(text, model=model).tolist()
        for text in texts
    ]
# Generate embeddings
texts = [
"Vector search with PostgreSQL and pgvector",
"How to create HNSW indexes for fast search",
]
embeddings = get_hf_embeddings(texts)
print(f"Embeddings: {len(embeddings)}") # 2
print(f"Dimensions: {len(embeddings[0])}") # 1024 (bge-large)
6.4 Efficient Batch Processing
When you need to generate embeddings for thousands or millions of documents, the efficiency of batch processing becomes critical.
import time
from typing import Generator
from sentence_transformers import SentenceTransformer
import numpy as np
def chunk_list(
    lst: list, chunk_size: int
) -> Generator[list, None, None]:
    """Split a list into fixed-size chunks."""
    for i in range(0, len(lst), chunk_size):
        yield lst[i:i + chunk_size]

def generate_embeddings_batch(
    texts: list[str],
    model_name: str = "all-MiniLM-L6-v2",
    batch_size: int = 256,
    device: str = "cpu"  # "cuda" for GPU
) -> np.ndarray:
    """Generate embeddings in batches with progress tracking."""
    model = SentenceTransformer(model_name, device=device)
    all_embeddings = []
    total_batches = (len(texts) + batch_size - 1) // batch_size
    start = time.time()
    for i, batch in enumerate(chunk_list(texts, batch_size)):
        batch_emb = model.encode(
            batch,
            batch_size=batch_size,
            normalize_embeddings=True,
            show_progress_bar=False
        )
        all_embeddings.append(batch_emb)
        elapsed = time.time() - start
        processed = min((i + 1) * batch_size, len(texts))  # last batch may be partial
        rate = processed / elapsed
        print(
            f"Batch {i+1}/{total_batches} - "
            f"{rate:.0f} texts/sec"
        )
    return np.vstack(all_embeddings)
# Usage
texts = [f"Document number {i}" for i in range(10_000)]
embeddings = generate_embeddings_batch(
texts,
batch_size=256,
device="cuda" # use GPU if available
)
print(f"Final shape: {embeddings.shape}") # (10000, 384)
7. Storing Embeddings in PostgreSQL
Now that we know how to generate embeddings, let us see how to save them in PostgreSQL with pgvector and run similarity search queries. This is the practical connection to article 1 of the series.
7.1 Table Schema
-- Enable pgvector
CREATE EXTENSION IF NOT EXISTS vector;
-- Documents table with embedding
CREATE TABLE documents (
id BIGSERIAL PRIMARY KEY,
title VARCHAR(500) NOT NULL,
content TEXT NOT NULL,
source VARCHAR(255),
category VARCHAR(100),
embedding vector(384), -- dimension of chosen model
created_at TIMESTAMPTZ DEFAULT NOW(),
metadata JSONB DEFAULT '{}'::jsonb
);
-- HNSW index for fast search (cosine distance)
CREATE INDEX idx_documents_embedding
ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 200);
-- Index on category for combined filters
CREATE INDEX idx_documents_category
ON documents (category);
7.2 Insertion from Python
import psycopg2
from psycopg2.extras import execute_values
from sentence_transformers import SentenceTransformer
import numpy as np
# Configuration
DB_CONFIG = {
"host": "localhost",
"port": 5432,
"dbname": "vectordb",
"user": "admin",
"password": "secret_password",
}
# 1. Generate embeddings
model = SentenceTransformer("all-MiniLM-L6-v2")
documents = [
{
"title": "Introduction to pgvector",
"content": "pgvector is a PostgreSQL extension for vectors...",
"source": "blog",
"category": "database"
},
{
"title": "RAG with LangChain",
"content": "Retrieval Augmented Generation combines retrieval...",
"source": "tutorial",
"category": "ai"
},
{
"title": "Python for Data Science",
"content": "Python is the most used language for data science...",
"source": "guide",
"category": "programming"
},
]
# Generate embeddings for the content
texts = [d["content"] for d in documents]
embeddings = model.encode(texts, normalize_embeddings=True)
# 2. Save to PostgreSQL
conn = psycopg2.connect(**DB_CONFIG)
cur = conn.cursor()
# Prepare data for batch insert
values = []
for doc, emb in zip(documents, embeddings):
    values.append((
        doc["title"],
        doc["content"],
        doc["source"],
        doc["category"],
        emb.tolist()  # convert numpy array to Python list
    ))
# Efficient batch insert
execute_values(
cur,
"""INSERT INTO documents
(title, content, source, category, embedding)
VALUES %s""",
values,
template="(%s, %s, %s, %s, %s::vector)"
)
conn.commit()
print(f"Inserted {len(values)} documents with embeddings")
cur.close()
conn.close()
7.3 Similarity Search from Python
def similarity_search(
    query: str,
    top_k: int = 5,
    category: str | None = None,
    threshold: float = 0.3
) -> list[dict]:
    """Find documents similar to the query."""
    # Generate query embedding
    query_embedding = model.encode(
        query, normalize_embeddings=True
    ).tolist()
    conn = psycopg2.connect(**DB_CONFIG)
    cur = conn.cursor()
    # Query with optional category filter
    if category:
        cur.execute("""
            SELECT id, title, content, category,
                   1 - (embedding <=> %s::vector) AS similarity
            FROM documents
            WHERE category = %s
              AND 1 - (embedding <=> %s::vector) > %s
            ORDER BY embedding <=> %s::vector
            LIMIT %s
        """, (
            query_embedding, category,
            query_embedding, threshold,
            query_embedding, top_k
        ))
    else:
        cur.execute("""
            SELECT id, title, content, category,
                   1 - (embedding <=> %s::vector) AS similarity
            FROM documents
            WHERE 1 - (embedding <=> %s::vector) > %s
            ORDER BY embedding <=> %s::vector
            LIMIT %s
        """, (
            query_embedding,
            query_embedding, threshold,
            query_embedding, top_k
        ))
    results = []
    for row in cur.fetchall():
        results.append({
            "id": row[0],
            "title": row[1],
            "content": row[2][:200],  # truncated
            "category": row[3],
            "similarity": round(row[4], 4),
        })
    cur.close()
    conn.close()
    return results
# Example usage
results = similarity_search(
"how to use vectors in a database",
top_k=3,
category="database"
)
for r in results:
    print(f"[{r['similarity']}] {r['title']}")
7.4 Indexing: HNSW vs IVFFlat
For datasets with more than a few thousand documents, an index is essential for acceptable performance. pgvector offers two index types:
HNSW vs IVFFlat
| Feature | HNSW | IVFFlat |
|---|---|---|
| Query speed | Very fast | Fast |
| Recall | 95-99% | 85-95% |
| Build time | Slow (minutes) | Fast (seconds) |
| Memory | High (graph in RAM) | Low (centroids) |
| Insert/Update | Good (incremental update) | Requires periodic rebuild |
| Recommended for | Production, high quality | Prototyping, static datasets |
-- HNSW (recommended for production)
-- m: connections per node (16-64, default 16)
-- ef_construction: build quality (64-512, default 64)
CREATE INDEX idx_hnsw_cosine
ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 200);
-- For L2 distance
CREATE INDEX idx_hnsw_l2
ON documents
USING hnsw (embedding vector_l2_ops)
WITH (m = 16, ef_construction = 200);
-- IVFFlat (faster to build)
-- lists: number of clusters (rows/1000 up to ~1M rows, sqrt(rows) beyond)
CREATE INDEX idx_ivfflat_cosine
ON documents
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100); -- for ~100K documents
-- Query parameters to control recall vs speed
SET hnsw.ef_search = 100; -- default 40, increase for more recall
SET ivfflat.probes = 10; -- default 1, increase for more recall
8. Embeddings for Different Data Types
Embeddings are not limited to text. Modern models can generate vector representations for images, audio, source code, and even multimodal data.
Multimodal Embeddings: Models by Data Type
| Data Type | Model | Dimensions | Use Case |
|---|---|---|---|
| Text | all-MiniLM-L6-v2, text-embedding-3-small | 384-3072 | Semantic search, RAG, classification |
| Images | CLIP (OpenAI), SigLIP (Google) | 512-768 | Image search, visual classification |
| Audio | Whisper, CLAP | 512-1280 | Audio search, music classification |
| Code | CodeBERT, StarCoder embeddings | 768 | Code search, duplicate detection |
| Multimodal | CLIP, ImageBind (Meta) | 512-1024 | Cross-modal search (text for images) |
# pip install transformers pillow
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import torch
import numpy as np
# Load CLIP
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
# Image embedding
image = Image.open("cat_photo.jpg")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    image_embedding = model.get_image_features(**inputs)
image_emb = image_embedding[0].numpy()
print(f"Image embedding: {image_emb.shape}") # (512,)
# Text embedding (in the SAME space!)
text_inputs = processor(
text=["a sleeping cat", "a playing dog"],
return_tensors="pt",
padding=True
)
with torch.no_grad():
    text_embeddings = model.get_text_features(**text_inputs)
text_embs = text_embeddings.numpy()
# Compute cross-modal similarity
from numpy.linalg import norm
for i, text in enumerate(["a sleeping cat", "a playing dog"]):
    sim = np.dot(image_emb, text_embs[i]) / (
        norm(image_emb) * norm(text_embs[i])
    )
    print(f"Similarity '{text}': {sim:.4f}")
# "a sleeping cat" will have higher similarity with cat_photo.jpg
The power of CLIP is that text and images live in the same vector space. You can search for images with a text query or find text related to an image. This enables multimodal search in PostgreSQL: store CLIP embeddings in the same pgvector table and search with text queries.
9. Evaluating Embedding Quality
How do you know if an embedding model is "good"? The answer depends on the specific task, but standardized benchmarks and objective metrics exist.
9.1 MTEB: Massive Text Embedding Benchmark
MTEB is the reference benchmark for evaluating embedding models. It measures performance across 58+ tasks grouped into 8 categories:
- Retrieval: finding relevant documents given a query
- Semantic Textual Similarity (STS): how similar two sentences are
- Classification: classifying texts into categories
- Clustering: grouping similar texts
- Pair Classification: determining if two texts are related
- Reranking: re-ordering results by relevance
- Summarization: quality of summaries
- BitextMining: finding parallel translations
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import (
InformationRetrievalEvaluator
)
# Prepare evaluation dataset
queries = {
"q1": "how to install pgvector",
"q2": "what is similarity search",
"q3": "image embeddings with CLIP",
}
corpus = {
"d1": "Guide to installing pgvector on Ubuntu",
"d2": "pgvector for PostgreSQL: Docker setup",
"d3": "Vector similarity search",
"d4": "CLIP: multimodal model for images and text",
"d5": "Spaghetti carbonara recipe",
}
# Mapping query -> relevant documents
relevant = {
    "q1": {"d1", "d2"},  # d1 and d2 relevant for q1
    "q2": {"d3"},
    "q3": {"d4"},
}
# Evaluate the model
model = SentenceTransformer("all-MiniLM-L6-v2")
evaluator = InformationRetrievalEvaluator(
queries=queries,
corpus=corpus,
relevant_docs=relevant,
name="custom-eval"
)
results = evaluator(model)
# Exact key names vary across sentence-transformers versions
print(f"NDCG@10: {results['custom-eval_ndcg@10']:.4f}")
print(f"MAP@10: {results['custom-eval_map@10']:.4f}")
9.2 Intrinsic vs Extrinsic Evaluation
Two Approaches to Evaluation
| Type | What It Measures | Example | When to Use |
|---|---|---|---|
| Intrinsic | Properties of the vectors themselves | Analogies, clustering, STS | Quick comparison between models |
| Extrinsic | Performance on the final task | RAG quality, search precision | Final production decision |
Practical Advice
Do not rely only on the MTEB score. A model may have a high MTEB score but perform poorly in your specific domain. Always evaluate on your own dataset: create a small set of queries and relevant documents from your domain, and measure nDCG and MAP. This gives you a much more reliable estimate of real-world performance.
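nDCG is straightforward to compute by hand for a small evaluation set. A self-contained sketch with binary relevance (the document ids and the ranking below are hypothetical):

```python
import math

def dcg_at_k(relevances: list[int], k: int) -> float:
    """Discounted Cumulative Gain: sum of rel_i / log2(i + 1),
    with 1-based positions i."""
    return sum(
        rel / math.log2(i + 2)
        for i, rel in enumerate(relevances[:k])
    )

def ndcg_at_k(ranked_doc_ids: list[str],
              relevant: set[str], k: int = 10) -> float:
    """nDCG@k with binary relevance: DCG of the actual ranking
    divided by the DCG of an ideal ranking."""
    rels = [1 if doc_id in relevant else 0 for doc_id in ranked_doc_ids]
    ideal = sorted(rels, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(rels, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical ranking returned by your search for one query
ranking = ["d7", "d2", "d9", "d1", "d4"]
relevant_docs = {"d2", "d1"}

print(f"nDCG@5: {ndcg_at_k(ranking, relevant_docs, k=5):.4f}")  # 0.6509
```

Run this over a few dozen real queries from your domain and average the scores: a model that wins on this number is a safer bet than one that merely tops the public leaderboard.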
10. Dimensionality Reduction
High-dimensional vectors are difficult to visualize and can be expensive in terms of storage and computation. Dimensionality reduction techniques help both for visualization and optimization.
10.1 Visualization Techniques
Dimensionality Reduction Techniques
| Technique | Preserves | Speed | Typical Use |
|---|---|---|---|
| PCA | Global variance | Very fast | Dimension reduction for storage, pre-processing |
| t-SNE | Local structure | Slow | 2D visualization of clusters |
| UMAP | Local + global structure | Medium | 2D visualization, also for pre-indexing reduction |
# pip install umap-learn matplotlib
import umap
import matplotlib.pyplot as plt
import numpy as np
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
# Texts with different categories
texts = [
# Databases
"PostgreSQL is a relational database",
"MongoDB is a NoSQL database",
"Redis is an in-memory database",
# AI/ML
"Deep learning uses deep neural networks",
"GPT is a language model",
"Linear regression is a simple algorithm",
# Food
"Pizza is baked in a wood-fired oven",
"Tiramisu is an Italian dessert",
"Pasta carbonara uses eggs and guanciale",
]
categories = ["DB"]*3 + ["AI"]*3 + ["Food"]*3
colors = ["blue"]*3 + ["red"]*3 + ["green"]*3
# Generate embeddings (384 dimensions)
embeddings = model.encode(texts, normalize_embeddings=True)
# Reduce to 2D with UMAP
reducer = umap.UMAP(n_components=2, random_state=42)
emb_2d = reducer.fit_transform(embeddings)
# Visualize
plt.figure(figsize=(10, 8))
for i, (x, y) in enumerate(emb_2d):
plt.scatter(x, y, c=colors[i], s=100, zorder=5)
plt.annotate(texts[i][:30], (x, y), fontsize=8, ha='left')
plt.title("Embeddings in 2D (UMAP)")
plt.savefig("embeddings_umap.png", dpi=150)
plt.show()
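The table lists PCA as the fast option for storage-oriented reduction. Since it needs nothing beyond linear algebra, here is a NumPy-only sketch via SVD (the `pca_reduce` helper is our own; in practice scikit-learn's `PCA` does the same job):

```python
import numpy as np

def pca_reduce(embeddings: np.ndarray, n_components: int):
    """Fit a PCA projection with SVD and reduce vector dimensionality.
    Returns (reduced vectors, mean, components) so that new vectors can
    later be projected with (v - mean) @ components.T."""
    mean = embeddings.mean(axis=0)
    centered = embeddings - mean
    # Rows of vt are the principal axes, ordered by explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    return centered @ components.T, mean, components

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 384)).astype(np.float32)
reduced, mean, comps = pca_reduce(vectors, n_components=64)
print(reduced.shape)  # (1000, 64)
```

Remember to store `mean` and `components` alongside the index: every future query vector must go through the same projection before searching.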
10.2 Matryoshka Embeddings
A recent and innovative technique: with Matryoshka Representation Learning (MRL), embeddings are trained so that the first N components of the vector already form a valid embedding on their own. This lets you truncate a vector from 1536 to 512 or 256 dimensions while retaining most of its quality.
OpenAI text-embedding-3-small and text-embedding-3-large support this technique: pass the dimensions parameter to get more compact vectors without recomputing embeddings.
from openai import OpenAI
import numpy as np
import os
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
text = "PostgreSQL as a vector database for AI applications"
# Generate the same embedding at different dimensions
for dim in [256, 512, 1024, 1536]:
response = client.embeddings.create(
input=[text],
model="text-embedding-3-small",
dimensions=dim
)
emb = response.data[0].embedding
    print(f"Dimensions: {dim}, norm: {np.linalg.norm(emb):.4f}")  # always ~1.0: the API re-normalizes truncated vectors
# In PostgreSQL: use columns with appropriate dimensions
# CREATE TABLE docs_compact (
# id BIGSERIAL PRIMARY KEY,
# content TEXT,
# embedding vector(256) -- more compact, 6x less storage
# );
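With a Matryoshka-trained local model you can do the same truncation client-side, as long as you re-normalize afterwards. A minimal sketch (`truncate_matryoshka` is an illustrative helper; only apply this to models actually trained with MRL):

```python
import numpy as np

def truncate_matryoshka(vec: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` components and re-normalize to unit length.
    Valid only for Matryoshka-trained models; truncating an ordinary
    embedding degrades quality far more."""
    truncated = np.asarray(vec, dtype=np.float32)[:dims]
    norm = np.linalg.norm(truncated)
    return truncated / norm if norm > 0 else truncated

full = np.random.default_rng(42).normal(size=1536).astype(np.float32)
full /= np.linalg.norm(full)
compact = truncate_matryoshka(full, 256)
print(len(compact), round(float(np.linalg.norm(compact)), 4))  # 256 1.0
```

The re-normalization step matters: cosine similarity assumes unit vectors, and a truncated slice of a unit vector is no longer unit length.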
11. Costs and Scaling Strategies
When moving from prototype to production, the costs of generating and storing embeddings become a critical factor. Here is a detailed analysis.
11.1 Costs for 1 Million Documents
Cost Estimate: 1M Documents (average 500 tokens/doc)
| Model | Generation Cost | Vector Size | Storage (float32) | Initial Total |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | $0 (local) | 384 | ~1.4 GB | Only GPU/CPU time |
| text-embedding-3-small | ~$10 (500M tokens) | 1536 | ~5.7 GB | $10 + storage |
| text-embedding-3-small (512 dim) | ~$10 | 512 | ~1.9 GB | $10 + less storage |
| text-embedding-3-large | ~$65 (500M tokens) | 3072 | ~11.4 GB | $65 + storage |
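The storage column follows directly from n_vectors × dimensions × 4 bytes (float32); a quick sanity check you can adapt to your own corpus size (the helper is illustrative and ignores pgvector's small per-value header and PostgreSQL row overhead):

```python
def vector_storage_gb(n_vectors: int, dims: int, bytes_per_float: int = 4) -> float:
    """Raw vector storage in GiB: n_vectors * dims * bytes per component.
    Ignores pgvector's per-value header and PostgreSQL page/row overhead."""
    return n_vectors * dims * bytes_per_float / (1024 ** 3)

for dims in (384, 512, 1536, 3072):
    print(f"{dims} dims: {vector_storage_gb(1_000_000, dims):.1f} GiB")
```

Running this reproduces the table's figures (~1.4, ~1.9, ~5.7 and ~11.4 GiB for 1M vectors).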
11.2 Optimization Strategies
Cost Reduction Strategies
- Use local models where possible: all-MiniLM-L6-v2 costs nothing and achieves acceptable quality for most use cases
- Matryoshka embeddings: use 512 instead of 1536 dimensions with text-embedding-3-small - same API cost, 67% less storage
- Cache embeddings: do not recompute the same text. Store embedding model name in the table to manage migrations
- Batch processing: never call the API one document at a time - batch 100-500 texts per request
- Incremental ingestion: only re-embed documents that have changed (use MD5 hash to detect changes)
import hashlib
import psycopg2
def should_reindex(conn, source_path: str, content: str) -> bool:
"""Returns True if the document needs to be re-indexed."""
content_hash = hashlib.md5(content.encode()).hexdigest()
with conn.cursor() as cur:
cur.execute("""
SELECT content_hash FROM documents
WHERE source_path = %s
LIMIT 1
""", (source_path,))
row = cur.fetchone()
if row is None:
return True # new document
return row[0] != content_hash # True if changed
# Save the hash in the table schema:
# CREATE TABLE documents (
# ...
# content_hash TEXT, -- MD5 of the content
# embedding_model TEXT NOT NULL, -- which model was used
# ...
# )
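The batch-processing advice above can be sketched as a small helper that caps each request both by item count and by an estimated token budget (`batch_texts` is our own helper; the 4-characters-per-token estimate is a rough heuristic, not a real tokenizer):

```python
from collections.abc import Iterator

def batch_texts(texts: list[str], max_batch: int = 200,
                max_tokens: int = 8000) -> Iterator[list[str]]:
    """Yield batches capped by both item count and an estimated token budget,
    so one oversized request never exceeds the API's input limit."""
    batch: list[str] = []
    tokens = 0
    for text in texts:
        est = max(1, len(text) // 4)  # rough: 1 token ~= 4 characters
        if batch and (len(batch) >= max_batch or tokens + est > max_tokens):
            yield batch
            batch, tokens = [], 0
        batch.append(text)
        tokens += est
    if batch:
        yield batch

# 450 texts of ~125 estimated tokens each -> batches limited by the token budget
batches = list(batch_texts(["word " * 100] * 450, max_batch=200))
print([len(b) for b in batches])
```

Each yielded batch can then go to `client.embeddings.create(input=batch, ...)` in a single call instead of 450 separate requests.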
Common Mistakes with Embeddings
- Mismatched models: Always use the same model for ingestion and queries. Mixing models produces garbage results.
- Chunks too large: Chunks over 512 tokens often contain multiple topics, hurting retrieval precision.
- Chunks too small: Chunks under 100 characters lack enough context for meaningful embeddings.
- No overlap: Without overlap, context at chunk boundaries is lost, reducing recall.
- Forgetting to normalize: Some distance metrics require L2-normalized vectors. Always check model documentation.
- Ignoring token limits: Text beyond the model's maximum token limit is silently truncated.
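The normalization mistake in the list above is cheap to guard against. A small defensive helper you can run on every vector before inserting or querying (`ensure_normalized` is our own name, not a library function):

```python
import numpy as np

def ensure_normalized(vec, atol: float = 1e-3) -> np.ndarray:
    """Return an L2-normalized copy of the vector; required before
    cosine or inner-product search on models that do not normalize."""
    v = np.asarray(vec, dtype=np.float32)
    norm = float(np.linalg.norm(v))
    if norm == 0:
        raise ValueError("zero vector cannot be normalized")
    if abs(norm - 1.0) <= atol:
        return v  # already unit length, nothing to do
    return v / norm

v = ensure_normalized([3.0, 4.0])
print([round(float(x), 3) for x in v])  # [0.6, 0.8]
```

Many sentence-transformers models accept `normalize_embeddings=True` at encode time, which makes this check a no-op; the helper catches the cases where that flag was forgotten.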
Embeddings Evaluation: Benchmarking Your Model Choice
Choosing the right embedding model is not just about benchmarks from research papers. The model that performs best on MTEB may not be the best for your specific domain. Here is how to run your own evaluation to make an informed decision:
import psycopg2
import time
from openai import OpenAI
from sentence_transformers import SentenceTransformer
# ===================================
# Evaluation framework for embedding models
# ===================================
class EmbeddingEvaluator:
"""
Evaluate embedding models on your specific domain data.
Methodology:
1. Create a test set of (query, relevant_document) pairs
2. Embed all documents with each candidate model
3. For each query, measure Recall@K and MRR (Mean Reciprocal Rank)
4. Compare latency and cost
"""
def __init__(self, conn, test_pairs: list[tuple]):
"""
Args:
conn: PostgreSQL connection
test_pairs: List of (query, expected_source_path) tuples
"""
self.conn = conn
self.test_pairs = test_pairs # ground truth
def evaluate_model(self, model_name: str, embedder, k: int = 5) -> dict:
"""Run full evaluation for one embedding model."""
recall_scores = []
mrr_scores = []
latencies = []
for query, expected_source in self.test_pairs:
# Embed query
t_start = time.perf_counter()
query_vec = embedder.embed(query)
embed_time = (time.perf_counter() - t_start) * 1000
# Search in PostgreSQL
t_start = time.perf_counter()
with self.conn.cursor() as cur:
cur.execute("""
SELECT source_path
FROM documents
WHERE embedding_model = %s
ORDER BY embedding <=> %s::vector
LIMIT %s
""", (model_name, query_vec, k))
results = [r[0] for r in cur.fetchall()]
search_time = (time.perf_counter() - t_start) * 1000
# Calculate Recall@K
recall = 1.0 if expected_source in results else 0.0
recall_scores.append(recall)
# Calculate MRR (position of first correct result)
mrr = 0.0
if expected_source in results:
position = results.index(expected_source) + 1
mrr = 1.0 / position
mrr_scores.append(mrr)
latencies.append(embed_time + search_time)
return {
"model": model_name,
f"recall_at_{k}": round(sum(recall_scores) / len(recall_scores), 4),
"mrr": round(sum(mrr_scores) / len(mrr_scores), 4),
"avg_latency_ms": round(sum(latencies) / len(latencies), 2),
"n_queries": len(self.test_pairs)
}
# Example: compare OpenAI small vs local MiniLM
test_pairs = [
("How do I create an HNSW index?", "docs/pgvector_guide.md"),
("What is cosine similarity?", "docs/embeddings_intro.md"),
("PgBouncer connection pooling configuration", "docs/production_guide.md"),
# ... add 20-50 pairs for statistical significance
]
evaluator = EmbeddingEvaluator(conn, test_pairs)
# Results example (your domain may differ):
# model=text-embedding-3-small: recall@5=0.91, mrr=0.78, latency=45ms
# model=all-MiniLM-L6-v2: recall@5=0.84, mrr=0.71, latency=12ms
# Conclusion: small quality gap; the local model is ~3.7x faster
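One practical detail for queries like the one in `evaluate_model`: psycopg2 adapts a Python list to a PostgreSQL array, so it is often simplest to serialize the vector to pgvector's text format yourself and cast with `%s::vector`. A tiny illustrative helper (`to_pgvector` is our own name, not a psycopg2 or pgvector API):

```python
def to_pgvector(vec: list[float]) -> str:
    """Serialize a Python list to pgvector's text literal, e.g. '[0.25,-1,3.5]'.
    Pass the result as a query parameter and cast with %s::vector in SQL."""
    return "[" + ",".join(f"{x:g}" for x in vec) + "]"

print(to_pgvector([0.25, -1.0, 3.5]))  # [0.25,-1,3.5]
```

Alternatively, the official `pgvector` Python package registers adapters that handle NumPy arrays and lists transparently.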
Advanced Embedding Techniques: Fine-Tuning and Domain Adaptation
General-purpose embedding models are trained on broad text corpora. For specialized domains (medical, legal, financial, code), fine-tuning or domain adaptation can significantly improve retrieval quality. Here are the practical approaches:
Approach 1: Prompt Engineering for Better Embeddings
# The simplest domain adaptation: prefix prompting
# Instruction-tuned models (such as the E5 family) respond well to task-specific prefixes
def embed_with_task_prefix(text: str, task: str = "search_document") -> list[float]:
"""
Add task-specific prefix to improve embedding quality for specific tasks.
Tested with E5 and similar models that support instruction-following.
task values:
- "search_document": for documents to be retrieved
- "search_query": for user queries
- "classification": for classification tasks
- "clustering": for clustering tasks
"""
prefixed = f"Represent this {task}: {text}"
return embedder.embed(prefixed)
# Example: E5-large-instruct with task prefix
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("intfloat/e5-large-instruct")
# For ingestion (documents):
doc_embedding = model.encode(
"Represent this document for retrieval: " + document_text,
normalize_embeddings=True
)
# For queries:
query_embedding = model.encode(
"Represent this query for retrieving relevant documents: " + user_query,
normalize_embeddings=True
)
# The asymmetric approach (different prefixes for docs vs queries)
# often improves Recall@10 by 5-15% compared to symmetric embeddings
Approach 2: Domain-Specific Fine-Tuning with Sentence Transformers
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
def fine_tune_embedding_model(
base_model: str,
training_pairs: list[tuple], # (query, relevant_doc, irrelevant_doc) triplets
output_path: str,
epochs: int = 3,
batch_size: int = 32
):
"""
Fine-tune an embedding model on domain-specific data using triplet loss.
training_pairs: List of (anchor, positive, negative) tuples where:
- anchor: a user query from your domain
- positive: a relevant document for that query
- negative: an irrelevant document (hard negative = challenging)
Hard negatives are documents that are superficially similar but not relevant.
Mining hard negatives is the most important step for good fine-tuning.
"""
model = SentenceTransformer(base_model)
# Convert to InputExamples
examples = [
InputExample(texts=[anchor, positive, negative])
for anchor, positive, negative in training_pairs
]
# DataLoader
dataloader = DataLoader(examples, shuffle=True, batch_size=batch_size)
    # Triplet loss: minimize max(0, d(anchor, positive) - d(anchor, negative) + margin)
loss = losses.TripletLoss(model=model)
# Training
model.fit(
train_objectives=[(dataloader, loss)],
epochs=epochs,
warmup_steps=int(0.1 * len(dataloader) * epochs),
output_path=output_path,
show_progress_bar=True
)
return model
# Minimum data requirements for meaningful fine-tuning:
# - At least 1000 triplets for initial improvement
# - 5000+ triplets for significant gains
# - Use existing production query logs + relevant document pairs
# - Mine hard negatives with BM25 (retrieve by keyword, keep the non-relevant top hits)
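The last point - mining hard negatives via keyword retrieval - can be sketched with a simple lexical-overlap scorer standing in for BM25 (`mine_hard_negatives` is our own helper; in practice use a real BM25 implementation such as rank_bm25 or PostgreSQL full-text search):

```python
def mine_hard_negatives(query: str, relevant_doc: str,
                        corpus: list[str], n: int = 2) -> list[str]:
    """Rank non-relevant documents by keyword overlap with the query and
    keep the top matches: lexically similar but not relevant = hard negative.
    A crude stand-in for BM25, for illustration only."""
    q_terms = set(query.lower().split())
    scored = [
        (len(q_terms & set(doc.lower().split())), doc)
        for doc in corpus if doc != relevant_doc
    ]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [doc for score, doc in scored[:n] if score > 0]

corpus = [
    "how to create an hnsw index in pgvector",
    "hnsw index tuning parameters explained",
    "tiramisu is an italian dessert",
]
negs = mine_hard_negatives("create hnsw index", corpus[0], corpus, n=1)
print(negs)  # ['hnsw index tuning parameters explained']
```

Each (query, relevant_doc, hard_negative) result feeds directly into the triplet format expected by `fine_tune_embedding_model` above.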
Production Embedding Pipeline: Async and Batch Processing
In production, the embedding generation phase is often the bottleneck. Here is how to implement an efficient async pipeline that processes documents in parallel while respecting API rate limits:
import asyncio
import aiohttp
import time
class AsyncEmbeddingPipeline:
"""
Production-grade async embedding pipeline with:
- Rate limiting (tokens per minute)
- Automatic retry with exponential backoff
- Progress tracking
- Batch size optimization
"""
def __init__(self,
api_key: str,
model: str = "text-embedding-3-small",
max_tpm: int = 1_000_000, # tokens per minute limit
max_retries: int = 3):
self.api_key = api_key
self.model = model
self.max_tpm = max_tpm
self.max_retries = max_retries
self._token_count = 0
self._window_start = time.time()
async def _check_rate_limit(self, tokens: int):
"""Wait if we are approaching the rate limit."""
elapsed = time.time() - self._window_start
if elapsed >= 60:
# Reset window
self._token_count = 0
self._window_start = time.time()
elif self._token_count + tokens > self.max_tpm:
# Wait until the minute window resets
wait_time = 60 - elapsed
print(f"Rate limit reached. Waiting {wait_time:.1f}s...")
await asyncio.sleep(wait_time)
self._token_count = 0
self._window_start = time.time()
self._token_count += tokens
async def embed_batch_async(self,
session: aiohttp.ClientSession,
texts: list[str],
attempt: int = 0) -> list[list[float]]:
"""Embed a batch of texts with retry logic."""
# Estimate token count (rough: 1 token ~= 4 chars)
estimated_tokens = sum(len(t) // 4 for t in texts)
await self._check_rate_limit(estimated_tokens)
try:
async with session.post(
"https://api.openai.com/v1/embeddings",
headers={"Authorization": f"Bearer {self.api_key}"},
json={"input": texts, "model": self.model},
timeout=aiohttp.ClientTimeout(total=30)
) as response:
if response.status == 429:
# Rate limited: exponential backoff
if attempt < self.max_retries:
wait = (2 ** attempt) * 5
await asyncio.sleep(wait)
return await self.embed_batch_async(session, texts, attempt + 1)
response.raise_for_status()
data = await response.json()
return [item["embedding"] for item in sorted(data["data"], key=lambda x: x["index"])]
except Exception as e:
if attempt < self.max_retries:
await asyncio.sleep(2 ** attempt)
return await self.embed_batch_async(session, texts, attempt + 1)
raise
async def process_documents(self,
texts: list[str],
batch_size: int = 100) -> list[list[float]]:
"""Process all documents with concurrent batches."""
all_embeddings = [None] * len(texts)
async with aiohttp.ClientSession() as session:
# Process in batches, with max 3 concurrent requests
semaphore = asyncio.Semaphore(3)
async def process_batch(batch_texts, start_idx):
async with semaphore:
embeddings = await self.embed_batch_async(session, batch_texts)
for i, emb in enumerate(embeddings):
all_embeddings[start_idx + i] = emb
tasks = []
for i in range(0, len(texts), batch_size):
batch = texts[i:i + batch_size]
tasks.append(process_batch(batch, i))
await asyncio.gather(*tasks)
return all_embeddings
# Usage:
pipeline = AsyncEmbeddingPipeline(api_key=openai_api_key)
embeddings = asyncio.run(pipeline.process_documents(all_texts, batch_size=100))
print(f"Generated {len(embeddings)} embeddings")
Conclusions and Next Steps
Embeddings are the foundation of all AI-powered search. Understanding the theory behind them - historical evolution, distance metrics, and the properties of vector space - allows you to make better decisions about model selection, chunking strategy, and quality evaluation. PostgreSQL with pgvector provides a robust, cost-effective platform for storing and querying these representations without introducing new infrastructure dependencies.
The next article in this series builds on embeddings to construct a complete Retrieval-Augmented Generation (RAG) pipeline: from document ingestion through intelligent retrieval to LLM-powered answer generation - all on PostgreSQL.