Embedding Models and Semantic Search: Complete Guide
In the first article of this series we explored the RAG architecture and its role in solving LLM hallucinations. The beating heart of every RAG system is retrieval: the ability to find, in a potentially huge knowledge base, the documents most relevant to a question. This capability is entirely based on embeddings and vector search.
An embedding is a numerical representation of the meaning of a text: a sequence of numbers (a vector) that captures the semantic relationships between words, phrases and documents. The quality of embeddings directly determines the quality of retrieval, and therefore the quality of the entire RAG system. Choosing the wrong embedding model means building the house foundations on sand.
In this second article of the AI Engineering and Advanced RAG series, we will take a complete journey: from the origins of embeddings with Word2Vec, through the BERT revolution, to modern Sentence Transformers. We will see how to generate embeddings, how to compare texts in vector space, how to build a semantic search engine with FAISS, and how to choose the right model for your use case.
Series Overview
| # | Article | Focus |
|---|---|---|
| 1 | RAG Explained | Fundamentals and complete architecture |
| 2 | You are here - Embeddings and Semantic Search | How texts become vectors |
| 3 | Vector Databases in Depth | Storage, indexing, similarity search |
| 4 | Hybrid Retrieval: BM25 + Vector Search | Combining keyword and semantic search |
| 5 | RAG in Production with LangChain | Practical end-to-end implementation |
| 6 | Prompt Engineering for LLMs | Templates, versioning and testing |
| 7 | Context Window Management | Optimizing LLM input |
| 8 | Multi-Agent Systems | Orchestration and coordination |
| 9 | Knowledge Graphs for AI | Structured knowledge + retrieval |
| 10 | RAG Evaluation and Monitoring | Metrics, benchmarking, production |
What You Will Learn
- What an embedding is and how it represents meaning in numerical form
- The evolution from Word2Vec to BERT to Sentence Transformers
- Why vanilla BERT does not work for similarity and how SBERT solves the problem
- How to choose the right embedding model from dozens of options
- How to implement semantic search with sentence-transformers and FAISS in Python
- Vector similarity metrics and when to use each one
- Architectural comparison of the main vector search engines
- How to fine-tune embeddings for specific domains
1. What is an Embedding
An embedding is a mathematical function that transforms a discrete object (a word, a phrase, a document, an image) into a vector of real numbers in a continuous fixed-dimensionality space. In practice, it converts human-readable text into a list of numbers understandable by the machine, while preserving the semantic relationships between the original texts.
The fundamental idea is that texts with similar meaning must have vectors that are close in space, while texts with different meaning must have distant vectors. This property is called semantic isomorphism: the structure of semantic relationships between words is preserved in the geometry of the vector space.
1.1 From One-Hot Encoding to Dense Vectors
To understand why embeddings are necessary, let us consider the simplest alternative: one-hot encoding. With a vocabulary of 50,000 words, each word is represented by a vector of 50,000 dimensions with a single 1 and everything else zeros.
```
ONE-HOT ENCODING (vocabulary of 50,000 words):

  "cat" = [0, 0, ..., 1, ..., 0, 0]   (50,000 dimensions, only one 1)
  "dog" = [0, 0, ..., 0, ..., 1, 0]   (50,000 dimensions, only one 1)

  Distance between "cat" and "dog" = same as "cat" and "refrigerator"
  No semantic information!

DENSE EMBEDDING (e.g. 384 dimensions):

  "cat" = [0.23, -0.45, 0.89, ..., 0.12]   (384 dimensions, all real numbers)
  "dog" = [0.21, -0.42, 0.91, ..., 0.15]   (384 dimensions, all real numbers)

  Distance between "cat" and "dog" = SMALL (domestic animals)
  Distance between "cat" and "refrigerator" = LARGE (different concepts)
  Meaning is captured in the geometry!
```
The problems with one-hot encoding are clear: the vectors are huge (dimensionality equal to the vocabulary), sparse (almost all zeros) and, most importantly, all orthogonal to each other. Any two words have the same distance, regardless of meaning. There is no way to distinguish "cat" from "feline" compared to "cat" from "economics".
Dense embeddings solve all three problems: the vectors are compact (a few hundred dimensions), dense (all values are meaningful) and capture semantic relationships in their geometry. Similar words have nearby vectors, and directions in space correspond to linguistic concepts.
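The contrast can be made concrete in a few lines of pure Python (the dense vectors below are hand-made and purely illustrative, not from a real model):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# One-hot: every pair of distinct words is orthogonal (cosine = 0)
vocab = ["cat", "dog", "refrigerator"]
one_hot = {w: [1.0 if i == j else 0.0 for j in range(len(vocab))]
           for i, w in enumerate(vocab)}
print(cosine(one_hot["cat"], one_hot["dog"]))           # 0.0
print(cosine(one_hot["cat"], one_hot["refrigerator"]))  # 0.0

# Dense (toy vectors): the geometry reflects meaning
dense = {
    "cat":          [0.9, 0.8, 0.1],
    "dog":          [0.8, 0.9, 0.2],
    "refrigerator": [0.1, 0.0, 0.9],
}
# "cat" is closer to "dog" than to "refrigerator"
print(cosine(dense["cat"], dense["dog"]) >
      cosine(dense["cat"], dense["refrigerator"]))  # True
```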
1.2 The Semantic Space
A fascinating property of embeddings is that semantic relationships transform into geometric relationships. The classic example is vector arithmetic: the vector "king" minus "man" plus "woman" produces a vector very close to "queen". This is not magic: it means that the space has captured the concept of "gender" as one direction and the concept of "royalty" as another direction.
```
Semantic relationships as vector operations:

  vec("king")   - vec("man")    + vec("woman") ~ vec("queen")
  vec("Paris")  - vec("France") + vec("Italy") ~ vec("Rome")
  vec("better") - vec("good")   + vec("large") ~ vec("larger")

Clusters in space:

  [cat, dog, horse, fish]     --> nearby (animals)
  [Python, Java, C++, Rust]   --> nearby (programming languages)
  [happy, joyful, cheerful]   --> very close (synonyms)
```
Fundamental Intuition
An embedding is essentially a meaning compressor. It takes the meaning of a text, with all its nuances, and compresses it into a point in multidimensional space. The position of that point relative to all other points captures all the semantic relationships the model has learned. This is the foundation on which all semantic search is built.
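A toy sketch of the vector arithmetic, with a hand-made 2-D space where one axis plays the role of "royalty" and the other "gender" (real embedding spaces have hundreds of dimensions and learned, not hand-picked, axes):

```python
import math

# Toy 2-D space: axis 0 ~ "royalty", axis 1 ~ "gender" (illustrative values)
vecs = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
}

def nearest(target, candidates):
    """Return the candidate word whose vector is most cosine-similar to target."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.hypot(*a) * math.hypot(*b))
    return max(candidates, key=lambda w: cosine(vecs[w], target))

# king - man + woman = [1.0, -1.0] --> closest to "queen"
analogy = [k - m + w for k, m, w in zip(vecs["king"], vecs["man"], vecs["woman"])]
print(nearest(analogy, ["queen", "man", "woman"]))  # queen
```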
2. Classic Word Embeddings: Word2Vec, GloVe, FastText
The modern history of embeddings begins in 2013 with Word2Vec, published by Tomas Mikolov and colleagues at Google. The revolutionary idea was simple: you can learn the meaning of a word from the context in which it appears. As linguist John Firth said in 1957: "You shall know a word by the company it keeps".
2.1 Word2Vec: CBOW and Skip-gram
Word2Vec proposes two neural architectures for learning embeddings:
- CBOW (Continuous Bag of Words): Given a window of context words, predict the central word. Example: given "the ___ barks loudly", predict "dog"
- Skip-gram: Given a central word, predict the context words. Example: given "dog", predict "the", "barks", "loudly"
```
CBOW (Continuous Bag of Words):
  Input:  context words ["the", "___", "barks", "loudly"]
  Output: target word "dog"

  Context --> [Embedding Layer] --> Average vectors --> [Softmax] --> "dog"
  Fast, good for frequent words

SKIP-GRAM:
  Input:  target word "dog"
  Output: context words ["the", "barks", "loudly"]

  "dog" --> [Embedding Layer] --> [Softmax] --> context words
  Slower, better for rare words

Typical parameters:
  - Embedding dimensions: 100-300
  - Context window: 5-10 words
  - Vocabulary: 100k-1M words
  - Training: billions of words (Wikipedia, Common Crawl)
```
2.2 GloVe and FastText
GloVe (Global Vectors for Word Representation, Stanford 2014) takes a different approach: it builds a global co-occurrence matrix and factorizes it to obtain embeddings. It captures global relationships that Word2Vec, with its local window, might miss.
FastText (Facebook 2016) extends Word2Vec by working at the subword level (character n-grams). The word "embedding" is also represented by its components: "emb", "mbe", "bed", "edd", etc. This allows generating embeddings even for words never seen during training (out-of-vocabulary words), a crucial advantage for morphologically rich languages.
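The n-gram decomposition is easy to sketch. FastText wraps each word in boundary markers `<` and `>` and extracts character n-grams of length 3 to 6 by default (the helper below is an illustration, not FastText's actual implementation):

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams with boundary markers, FastText-style."""
    padded = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams.append(padded[i:i + n])
    return grams

grams = char_ngrams("embedding", n_min=3, n_max=3)
print(grams[:4])        # ['<em', 'emb', 'mbe', 'bed']
print("edd" in grams)   # True
```

An out-of-vocabulary word still shares many n-grams with known words, which is how FastText can embed words it has never seen.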
Classic Word Embeddings Comparison
| Model | Year | Approach | Strength | Limitation |
|---|---|---|---|---|
| Word2Vec | 2013 | Local context prediction | Fast, effective | No OOV, no context |
| GloVe | 2014 | Global co-occurrence | Global relationships | No OOV, no context |
| FastText | 2016 | Character n-grams | Handles OOV | One vector per word |
2.3 The Fundamental Limitation: One Vector Per Word
All classic word embeddings share a structural limitation: they produce a single vector for each word, regardless of context. The word "bank" has the same embedding in "river bank" as in "bank account". This is a serious problem because the meaning of a word almost always depends on the context in which it appears.
Furthermore, these models operate at the level of individual words: they cannot produce an embedding for a phrase or paragraph. To represent a sentence, one must resort to rudimentary strategies like averaging the word vectors, losing information about order and syntactic structure.
Why Not Use Word2Vec for RAG
Classic word embeddings are inadequate for modern semantic search because: (1) they do not capture context, (2) they do not produce sentence-level embeddings, (3) averaging vectors loses critical information. With averaged vectors, the sentences "the dog bites the man" and "the man bites the dog" would have the same embedding. RAG requires models that understand the meaning of the entire sentence in its context.
3. Contextual Embeddings: The BERT Revolution
In 2018, Google published BERT (Bidirectional Encoder Representations from Transformers) and radically changed the landscape. BERT produces contextual embeddings: the representation of each word depends on the entire context of the sentence in which it appears. The word "bank" in "river bank" will have a different embedding from "bank account".
3.1 The BERT Architecture
BERT is based on the Transformer encoder: a neural network architecture that uses the self-attention mechanism to capture relationships between all words in a sentence simultaneously. Unlike recurrent networks (LSTM, GRU) that process text sequentially, BERT processes the entire sentence in parallel, leveraging bidirectional attention.
```
Input:  "I went to the bank to deposit money"

BERT:   [CLS] [I] [went] [to] [the] [bank] [to] [deposit] [money] [SEP]
                              |
                              v
        Each token attends to ALL other tokens (bidirectional attention)

Output: contextual vector for each token
        "bank" in context "deposit money"  --> vector for FINANCIAL bank
        "bank" in context "river, fishing" --> vector for RIVER bank
        [CLS] token = aggregated representation of the ENTIRE sentence
```
3.2 BERT Pre-training: MLM and NSP
BERT is pre-trained on two unsupervised tasks using enormous text corpora (Wikipedia + BookCorpus, 3.3 billion words):
- Masked Language Modeling (MLM): 15% of tokens are masked at random and BERT must predict them. This forces bidirectional understanding: to predict "bank" in "I went to the __ to deposit", BERT must understand both the left and right context.
- Next Sentence Prediction (NSP): BERT receives two sentences and must predict whether the second follows the first in the original text. This trains BERT to understand discourse-level relationships.
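The masking step of MLM can be sketched in a few lines. This is a simplified illustration: real BERT selects 15% of positions, then replaces 80% of those with `[MASK]`, 10% with a random token, and leaves 10% unchanged; here we only do the `[MASK]` replacement:

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=42):
    """Replace ~15% of tokens with [MASK]; return masked sequence and targets."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            targets[i] = tok  # the model must predict this token
        else:
            masked.append(tok)
    return masked, targets

tokens = "i went to the bank to deposit money".split()
masked, targets = mask_tokens(tokens)
print(masked)   # some positions replaced by [MASK]
print(targets)  # predictions use context from BOTH sides of each mask
```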
3.3 BERT Variants
After the original publication, numerous BERT variants emerged, each optimized for different use cases:
Main BERT Variants
| Model | Parameters | Strength | Use Case |
|---|---|---|---|
| BERT-base | 110M | Balanced | General NLP tasks |
| BERT-large | 340M | High performance | High-stakes tasks |
| RoBERTa | 125M | Better training | Classification, NER |
| DistilBERT | 66M | 40% smaller, 60% faster, 97% quality | Production with constraints |
| DeBERTa | 86M-1.5B | State-of-the-art NLU | Complex understanding tasks |
3.4 The Problem with BERT for Semantic Search
Despite being revolutionary, BERT has a fundamental problem for semantic search and RAG: it was designed for sequence classification and token classification, not for generating sentence-level embeddings suitable for similarity comparison.
Using the [CLS] token as a sentence representation (as many did initially) or averaging all token representations produces embeddings of very poor quality for similarity tasks. The reason is that BERT was never trained to produce sentence representations in a metric space where cosine distance is meaningful for semantic similarity.
```python
from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

def get_bert_embedding(text):
    inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # [CLS] token as sentence representation
    return outputs.last_hidden_state[:, 0, :]

emb1 = get_bert_embedding("I love machine learning")
emb2 = get_bert_embedding("I enjoy deep learning")
emb3 = get_bert_embedding("I hate broccoli")

# PROBLEM: vanilla BERT gives poor similarity results
# "machine learning" vs "deep learning" often LESS similar than expected
# "machine learning" vs "broccoli" often MORE similar than expected!
# BERT is not trained to maximize cosine similarity for related sentences
```
4. Sentence Transformers: The Solution for Semantic Search
In 2019, Nils Reimers and Iryna Gurevych published SBERT (Sentence-BERT): a BERT variant specifically fine-tuned to produce sentence embeddings that work well for semantic similarity. This is the foundation of modern semantic search and RAG retrieval.
4.1 Siamese Architecture and Triplet Loss
SBERT uses a Siamese network architecture: two identical BERT models (sharing the same weights) that process two sentences simultaneously and are trained to produce similar embeddings for semantically related sentences and different embeddings for unrelated sentences.
```
SIAMESE ARCHITECTURE:

  Sentence A ──> BERT + Pooling ──> embedding_a ──\
                                                   ──> Cosine Similarity ──> Loss
  Sentence B ──> BERT + Pooling ──> embedding_b ──/

TRAINING DATA (NLI - Natural Language Inference):
  Anchor:   "A man is playing guitar"
  Positive: "Someone is making music"   (entailment)    --> high similarity
  Negative: "A woman is cooking dinner" (contradiction) --> low similarity

TRIPLET LOSS:
  Loss = max(0, ||A-P||^2 - ||A-N||^2 + margin)
  Forces: dist(anchor, positive) + margin < dist(anchor, negative)
  Effect: semantically related sentences cluster together in space
```
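The triplet loss above is simple enough to compute by hand; here is a direct pure-Python transcription (squared Euclidean distances, margin = 1.0, toy vectors):

```python
def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(0, ||A-P||^2 - ||A-N||^2 + margin): pull the positive closer
    than the negative by at least `margin`."""
    d = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))  # squared L2
    return max(0.0, d(anchor, positive) - d(anchor, negative) + margin)

a = [1.0, 0.0]
p = [0.9, 0.1]    # close to anchor
n = [-1.0, 0.0]   # far from anchor
print(triplet_loss(a, p, n))   # 0.0 -- constraint already satisfied, no gradient

n2 = [0.8, 0.0]   # negative almost as close as the positive
print(triplet_loss(a, p, n2))  # > 0 -- training would push this negative away
```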
4.2 Pooling Strategies
To go from per-token BERT representations to a single sentence vector, SBERT uses different pooling strategies:
- Mean pooling: Average of all token vectors (best default choice)
- Max pooling: Maximum value for each dimension
- CLS pooling: Using only the [CLS] token (worst for SBERT)
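Mean pooling must skip padding tokens, which is what the attention mask is for. A minimal pure-Python sketch (an illustrative helper, not the sentence-transformers internals):

```python
def mean_pool(token_vectors, attention_mask):
    """Average token vectors, ignoring padding positions (mask = 0)."""
    dims = len(token_vectors[0])
    sums = [0.0] * dims
    count = 0
    for vec, m in zip(token_vectors, attention_mask):
        if m:
            count += 1
            for i in range(dims):
                sums[i] += vec[i]
    return [s / count for s in sums]

# 3 real tokens + 1 padding token
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [0.0, 0.0]]
mask = [1, 1, 1, 0]
print(mean_pool(tokens, mask))  # [3.0, 4.0] -- padding excluded from the average
```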
```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')  # 384 dimensions, fast

sentences = [
    "Machine learning is a subset of artificial intelligence",
    "Deep learning uses neural networks with many layers",
    "I enjoy hiking in the mountains",
    "The weather is nice today"
]

# Generate embeddings (batch processing, GPU if available)
embeddings = model.encode(sentences, batch_size=32, show_progress_bar=True)
print(f"Shape: {embeddings.shape}")  # (4, 384)

# Cosine similarity
similarities = cosine_similarity(embeddings)
print("\nSimilarity Matrix:")
for i, s1 in enumerate(sentences):
    for j, s2 in enumerate(sentences):
        if i < j:
            sim = similarities[i][j]
            print(f"  {s1[:40]} vs {s2[:40]}: {sim:.3f}")

# Expected output (approximate):
# ML vs Deep Learning: 0.784 (HIGH - related topics)
# ML vs hiking:        0.112 (LOW - unrelated)
# ML vs weather:       0.089 (LOW - unrelated)
```
4.3 Popular SBERT Models: Benchmarks and Recommendations
The MTEB (Massive Text Embedding Benchmark) is the standard reference for evaluating embedding models, covering 8 task types (retrieval, classification, clustering, reranking and more) across 58 datasets and 112 languages. Here is a selection of the best models for different use cases:
MTEB Benchmark: Top Models 2025
| Model | Dimensions | Parameters | MTEB Score | Speed | Best For |
|---|---|---|---|---|---|
| text-embedding-3-large | 3072 | ~570M | 64.6 | API only | Maximum quality, OpenAI API |
| e5-mistral-7b-instruct | 4096 | 7B | 66.6 | Slow | SotA local, multilingual |
| all-mpnet-base-v2 | 768 | 109M | 57.8 | Fast | Best open-source general |
| all-MiniLM-L6-v2 | 384 | 22M | 56.2 | Very fast | Production with constraints |
| bge-large-en-v1.5 | 1024 | 335M | 63.6 | Medium | Best free English model |
| paraphrase-multilingual-mpnet | 768 | 278M | 53.1 | Medium | 50+ languages including Italian |
5. Vector Similarity Metrics
Once we have embeddings, we need a way to measure how similar two vectors are. The choice of metric significantly impacts both the quality of results and computational performance.
5.1 Cosine Similarity
Cosine similarity measures the angle between two vectors, ignoring their magnitude. It is the most used metric for semantic search because it is invariant to the length of the vectors (a long document and a short one with the same content will be similar).
```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity: measures angle, invariant to magnitude"""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_dist(a, b):
    """Euclidean distance: straight-line distance in space"""
    return np.linalg.norm(a - b)

def dot_product(a, b):
    """Dot product: fast, considers both angle and magnitude"""
    return np.dot(a, b)

# Example
a = np.array([0.23, -0.45, 0.89, 0.12])
b = np.array([0.21, -0.42, 0.91, 0.15])
c = np.array([-0.89, 0.12, -0.34, 0.67])

print(f"Cosine similarity a-b: {cosine_sim(a, b):.4f}")       # ~0.999 (very similar)
print(f"Cosine similarity a-c: {cosine_sim(a, c):.4f}")       # ~-0.399 (different)
print(f"Euclidean distance a-b: {euclidean_dist(a, b):.4f}")  # ~0.051 (close)
print(f"Euclidean distance a-c: {euclidean_dist(a, c):.4f}")  # ~1.843 (far)
```
Metric Comparison
| Metric | Range | Best For | Considerations |
|---|---|---|---|
| Cosine Similarity | [-1, 1] | Most embedding models | Invariant to magnitude, 1 = identical |
| Dot Product | (-inf, +inf) | Normalized embeddings | Faster than cosine if normalized |
| Euclidean Distance | [0, +inf) | Dense embeddings with learned scale | 0 = identical, less common for NLP |
| Manhattan Distance | [0, +inf) | Sparse vectors | Less sensitive to outliers |
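For unit-length vectors the first three metrics are tightly related: squared Euclidean distance equals 2(1 - cosine), and dot product equals cosine, so all three produce the same ranking once embeddings are normalized. A quick numeric check in pure Python:

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(a, b):
    # For unit vectors, the dot product IS the cosine similarity
    return sum(x * y for x, y in zip(a, b))

a = normalize([0.23, -0.45, 0.89, 0.12])
b = normalize([0.21, -0.42, 0.91, 0.15])

sq_euclid = sum((x - y) ** 2 for x, y in zip(a, b))
# Identity: ||a - b||^2 = 2 * (1 - cos(a, b)) for unit vectors
print(abs(sq_euclid - 2 * (1 - cosine(a, b))) < 1e-12)  # True
```

This is why FAISS's `IndexFlatIP` (inner product) implements cosine search when you normalize embeddings first.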
6. Building Semantic Search with FAISS
FAISS (Facebook AI Similarity Search) is a library for efficient similarity search and dense vector clustering. It allows searching billions of vectors in milliseconds, using CPU or GPU. It is the foundation for many production RAG systems when a dedicated vector database is not needed.
6.1 Complete Semantic Search Engine
```python
import json
from typing import List, Tuple

import faiss
from sentence_transformers import SentenceTransformer

class SemanticSearchEngine:
    """Complete semantic search engine with FAISS and Sentence Transformers"""

    def __init__(self, model_name: str = 'all-MiniLM-L6-v2'):
        self.model = SentenceTransformer(model_name)
        self.dimension = self.model.get_sentence_embedding_dimension()
        self.index = None
        self.documents = []
        self.metadata = []

    def add_documents(self, texts: List[str], metadata: List[dict] = None):
        """Add documents to the search engine"""
        if metadata is None:
            metadata = [{} for _ in texts]

        # Generate embeddings
        print(f"Generating embeddings for {len(texts)} documents...")
        embeddings = self.model.encode(
            texts,
            batch_size=64,
            show_progress_bar=True,
            normalize_embeddings=True  # Important for cosine similarity!
        )

        # Initialize or expand index
        if self.index is None:
            # IndexFlatIP: Inner Product (= cosine with normalized vectors)
            self.index = faiss.IndexFlatIP(self.dimension)

        # Add to index
        self.index.add(embeddings.astype('float32'))
        self.documents.extend(texts)
        self.metadata.extend(metadata)
        print(f"Total documents in index: {self.index.ntotal}")

    def search(self, query: str, top_k: int = 5) -> List[Tuple[str, float, dict]]:
        """Search for the most relevant documents"""
        # Encode query
        query_embedding = self.model.encode(
            [query],
            normalize_embeddings=True
        ).astype('float32')

        # FAISS search
        scores, indices = self.index.search(query_embedding, top_k)

        results = []
        for score, idx in zip(scores[0], indices[0]):
            if idx != -1:  # -1 = not found
                results.append((
                    self.documents[idx],
                    float(score),  # cosine similarity
                    self.metadata[idx]
                ))
        return results

    def save(self, path: str):
        """Save index and documents to disk"""
        faiss.write_index(self.index, f"{path}.index")
        with open(f"{path}.json", 'w') as f:
            json.dump({
                'documents': self.documents,
                'metadata': self.metadata
            }, f)

    def load(self, path: str):
        """Load index from disk"""
        self.index = faiss.read_index(f"{path}.index")
        with open(f"{path}.json", 'r') as f:
            data = json.load(f)
        self.documents = data['documents']
        self.metadata = data['metadata']

# Usage example
engine = SemanticSearchEngine('all-mpnet-base-v2')

# Add corpus
corpus = [
    "RAG (Retrieval-Augmented Generation) combines LLMs with external knowledge",
    "Vector databases store embeddings and enable similarity search",
    "BERT is a bidirectional transformer model for NLP",
    "Sentence Transformers produce sentence-level semantic embeddings",
    "Python is a popular programming language for data science",
    "The LangChain framework simplifies building LLM applications",
    "Fine-tuning adapts pre-trained models to specific domains",
    "FAISS enables efficient billion-scale similarity search"
]
engine.add_documents(corpus)

# Test search
query = "How does semantic search work for AI?"
results = engine.search(query, top_k=3)

print(f"\nQuery: {query}")
print("\nTop 3 results:")
for text, score, meta in results:
    print(f"  Score: {score:.4f} | {text}")
```
6.2 FAISS Index Types
FAISS offers different index types with different quality/speed/memory tradeoffs:
```python
# 1. IndexFlatIP / IndexFlatL2: Exact search (brute force)
#    - Perfect precision, slow on large corpora
#    - Good up to ~100k vectors
index_flat = faiss.IndexFlatIP(dimension)

# 2. IndexIVFFlat: Inverted file with exact vectors
#    - Groups vectors into clusters (Voronoi cells)
#    - Search only in nearest clusters (nprobe parameter)
#    - Good for 100k - 10M vectors
quantizer = faiss.IndexFlatIP(dimension)
index_ivf = faiss.IndexIVFFlat(quantizer, dimension, 100)  # 100 clusters
index_ivf.train(training_vectors)
index_ivf.nprobe = 10  # search in 10 nearest clusters

# 3. IndexHNSW: Hierarchical Navigable Small World
#    - Graph structure, excellent speed/quality tradeoff
#    - Good for medium-large corpora
index_hnsw = faiss.IndexHNSWFlat(dimension, 32)  # M=32

# 4. IndexIVFPQ: IVF + Product Quantization
#    - Compression: reduces memory by 16-64x
#    - Slight quality loss
#    - Best for very large corpora (100M+)
index_ivfpq = faiss.IndexIVFPQ(quantizer, dimension, 100, 8, 8)

# RULE OF THUMB:
#   < 100k docs:       IndexFlatIP (exact, simple)
#   100k-10M docs:     IndexIVFFlat (fast, good recall)
#   > 10M docs:        IndexIVFPQ (compressed, high scale)
#   Real-time updates: IndexHNSW (no retraining)
```
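The rule of thumb can be encoded in a small helper (a heuristic sketch, not a hard rule: always benchmark recall and latency on your own data before committing):

```python
def recommend_faiss_index(num_docs, realtime_updates=False):
    """Map corpus size to a FAISS index family, per the rule of thumb above."""
    if realtime_updates:
        return "IndexHNSWFlat"   # graph index, no retraining on insert
    if num_docs < 100_000:
        return "IndexFlatIP"     # exact search is still fast enough
    if num_docs <= 10_000_000:
        return "IndexIVFFlat"    # clustered search, good recall
    return "IndexIVFPQ"          # compressed vectors for very large scale

print(recommend_faiss_index(50_000))       # IndexFlatIP
print(recommend_faiss_index(5_000_000))    # IndexIVFFlat
print(recommend_faiss_index(200_000_000))  # IndexIVFPQ
```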
7. Domain Fine-Tuning
Generic pre-trained models perform well across a wide range of domains, but for specific applications (legal, medical, financial, code) they can be significantly improved through fine-tuning. The idea is to adapt the model's representations so that domain-specific terms cluster correctly.
7.1 Fine-Tuning with Sentence Transformers
```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

# 1. Prepare training data
# Format: (anchor, positive, negative) or (sentence1, sentence2, label)
train_examples = [
    # Similar sentence pairs (label near 1) or dissimilar (label near 0)
    InputExample(texts=["RAG retrieval system", "document retrieval for LLMs"], label=0.9),
    InputExample(texts=["RAG retrieval system", "cooking pasta recipe"], label=0.1),
    InputExample(texts=["vector similarity search", "nearest neighbor search"], label=0.95),
    InputExample(texts=["embedding fine-tuning", "domain adaptation of models"], label=0.85),
    # ... thousands of examples from your domain
]

# 2. Load base model
model = SentenceTransformer('all-MiniLM-L6-v2')

# 3. Define DataLoader
dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# 4. Define loss function
# CosineSimilarityLoss: for labeled pairs (sentence1, sentence2, similarity)
loss = losses.CosineSimilarityLoss(model)
# Alternatively, for (anchor, positive, negative) triplets:
# loss = losses.TripletLoss(model, distance_metric=losses.TripletDistanceMetric.COSINE)

# 5. Train
# ... optionally define an EmbeddingSimilarityEvaluator with validation data
model.fit(
    train_objectives=[(dataloader, loss)],
    epochs=3,
    warmup_steps=100,
    output_path='./fine-tuned-model',
    show_progress_bar=True
)

# 6. Save and use
model.save('./fine-tuned-domain-model')
domain_model = SentenceTransformer('./fine-tuned-domain-model')
```
7.2 Strategies for Limited Data
If you have limited training data (common in specialized domains), there are several effective strategies:
Data-Efficient Fine-Tuning Strategies
- GPL (Generative Pseudo Labeling): Use an LLM (GPT-4, Claude) to generate query-document pairs from your corpus automatically, creating fine-tuning data without manual annotation.
- InPars: Similar to GPL, uses GPT-3 to generate relevant questions for each document, creating training pairs in an unsupervised manner.
- Knowledge Distillation: Train a small student model (e.g. MiniLM) to mimic a large teacher model (e.g. text-embedding-3-large) on your domain data, getting much of the large model's quality at lower cost.
- Contrastive learning with in-batch negatives: Uses other examples in the same batch as negatives, maximizing use of small datasets.
8. Choosing the Right Embedding Model
Choosing the embedding model is one of the most important decisions when building a RAG system. There is no single "best" model: the right choice depends on specific requirements.
8.1 Decision Framework
```
CHOICE DECISION TREE:

1. BUDGET / LATENCY
   - High budget, cloud API          --> text-embedding-3-large (OpenAI)
   - Medium budget, good performance --> bge-large-en-v1.5
   - Low latency required, self-hosted --> all-MiniLM-L6-v2

2. LANGUAGE
   - English only    --> bge-large-en-v1.5, all-mpnet-base-v2
   - Multilingual    --> paraphrase-multilingual-mpnet, multilingual-e5
   - Italian specific --> custom fine-tune or multilingual model

3. CORPUS SIZE
   - < 100k docs    --> any model + FAISS IndexFlat
   - 100k-10M docs  --> medium model + FAISS IVF
   - > 10M docs     --> fast model + FAISS IVFPQ or dedicated vector DB

4. DOMAIN
   - General  --> all-mpnet-base-v2 (good all-rounder)
   - Code     --> codellama-embed, code-bert
   - Medical  --> PubMedBERT, BioSentVec
   - Legal    --> legal-bert-base-uncased
   - Custom   --> fine-tune a base model on your data

5. PRIVACY (data cannot leave premises)
   - Self-hosted required --> open-source models on local infrastructure
   - No cloud at all      --> all-MiniLM-L6-v2, bge-small-en-v1.5
```
8.2 Performance vs Cost
Production Tradeoffs: Performance vs Cost
| Scenario | Recommended Model | Cost/1M tokens | Dimensions |
|---|---|---|---|
| Maximum quality (budget available) | text-embedding-3-large | $0.13 | 3072 |
| Balanced quality/cost (OpenAI) | text-embedding-3-small | $0.02 | 1536 |
| Best free self-hosted | bge-large-en-v1.5 | Free | 1024 |
| High volume, resource constrained | all-MiniLM-L6-v2 | Free | 384 |
| State-of-the-art local | e5-mistral-7b-instruct | Free (high GPU) | 4096 |
9. Embeddings in a Complete RAG Pipeline
Now let us put everything together to see how embeddings integrate into a complete RAG pipeline, from document ingestion to generating the final answer.
```python
from typing import List

import faiss
from openai import OpenAI
from sentence_transformers import SentenceTransformer

class RAGPipeline:
    """Complete RAG pipeline with semantic search"""

    def __init__(
        self,
        embedding_model: str = 'all-mpnet-base-v2',
        llm_model: str = 'gpt-4o-mini'
    ):
        self.embedder = SentenceTransformer(embedding_model)
        self.llm = OpenAI()
        self.llm_model = llm_model
        self.dim = self.embedder.get_sentence_embedding_dimension()
        self.index = faiss.IndexFlatIP(self.dim)
        self.chunks = []

    def ingest_documents(self, documents: List[str], chunk_size: int = 500):
        """Ingest and index documents"""
        all_chunks = []
        for doc in documents:
            # Simple word-window chunking (in production use recursive splitting)
            words = doc.split()
            step = chunk_size // 5  # ~100-word chunks for chunk_size=500
            for i in range(0, len(words), step):
                chunk = ' '.join(words[i:i + step])
                if chunk:
                    all_chunks.append(chunk)

        embeddings = self.embedder.encode(
            all_chunks,
            normalize_embeddings=True,
            batch_size=64
        ).astype('float32')

        self.index.add(embeddings)
        self.chunks.extend(all_chunks)
        return len(all_chunks)

    def retrieve(self, query: str, top_k: int = 5) -> List[tuple]:
        """Retrieve most relevant chunks for the query"""
        query_emb = self.embedder.encode(
            [query],
            normalize_embeddings=True
        ).astype('float32')
        scores, indices = self.index.search(query_emb, top_k)
        return [(self.chunks[i], float(s)) for s, i in zip(scores[0], indices[0]) if i != -1]

    def generate(self, query: str, top_k: int = 5) -> str:
        """Generate response with retrieved context"""
        relevant_chunks = self.retrieve(query, top_k)
        context = "\n\n".join(
            f"[Relevance: {score:.3f}]\n{chunk}"
            for chunk, score in relevant_chunks
        )
        prompt = f"""Answer the question based on the provided context.
If the context does not contain enough information, say so explicitly.

Context:
{context}

Question: {query}

Answer:"""
        response = self.llm.chat.completions.create(
            model=self.llm_model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.1,
            max_tokens=1000
        )
        return response.choices[0].message.content

# Usage
rag = RAGPipeline()
documents = ["Your documents here..."]
rag.ingest_documents(documents)
answer = rag.generate("Your question here?")
print(answer)
```
10. Best Practices and Anti-Patterns
10.1 Best Practices
Embedding Best Practices
- Always normalize embeddings before indexing and searching when using cosine similarity or an inner-product index: with unnormalized vectors, the dot product conflates magnitude with similarity and skews rankings.
- Match instruction prefix for asymmetric models (e.g. E5, BGE): use "query: " for questions and "passage: " for documents. Failing to do this can degrade quality by 10-20%.
- Chunk appropriately: embedding quality degrades with very short (<50 tokens) and very long (>512 tokens) texts. Aim for 200-400 tokens.
- Use batch processing: encode hundreds of documents at a time, not one by one. With GPU, batching can be 50-100x faster.
- Evaluate on your data: MTEB benchmarks are useful as a starting point but do not replace evaluation on your specific data and use case.
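The chunking guideline above can be sketched as a minimal word-window chunker with overlap (illustration only: it counts words rather than tokens, and production code should prefer splitting on semantic boundaries like paragraphs and sentences):

```python
def chunk_words(text, chunk_size=300, overlap=50):
    """Split text into overlapping word windows (sizes in words, not tokens)."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last window already covers the tail
    return chunks

doc = " ".join(f"w{i}" for i in range(700))
chunks = chunk_words(doc, chunk_size=300, overlap=50)
print(len(chunks))           # 3
print(chunks[1].split()[0])  # w250 -- overlap preserves context across boundaries
```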
10.2 Common Anti-Patterns
Anti-Patterns to Avoid
- Using vanilla BERT for similarity: [CLS] pooling on BERT-base without SBERT fine-tuning gives poor results. Always use models specifically fine-tuned for semantic similarity.
- Ignoring the embedding/search metric mismatch: if the model was trained with cosine similarity, do not use Euclidean distance in the index. They give different rankings.
- Never re-embedding after model change: if you change embedding model, you must re-index the entire corpus. Old embeddings are incompatible with the new model.
- Using oversized models without justification: a 7B parameter model costs 50x a 22M model in inference. Measure quality improvement before committing to expensive infrastructure.
- Ignoring multilingual limitations: English-only models used on Italian or mixed text will perform very poorly. Use multilingual models for non-English content.
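A small sketch of the metric-mismatch anti-pattern: on unnormalized vectors, cosine and Euclidean can rank the same two candidates in opposite order (toy vectors, purely illustrative):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def euclid(a, b):
    return math.dist(a, b)

query     = [1.0, 1.0]
doc_short = [0.1, 0.1]  # same direction as query, tiny magnitude
doc_other = [1.4, 0.2]  # different direction, similar magnitude

# Cosine ranks doc_short first (identical direction)...
print(cosine(query, doc_short) > cosine(query, doc_other))  # True
# ...while Euclidean ranks doc_other first (closer in absolute position)
print(euclid(query, doc_other) < euclid(query, doc_short))  # True
```

Normalizing the vectors removes the disagreement, which is why normalization plus an inner-product index is the safe default.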
Conclusions
Embeddings and semantic search are the fundamental building blocks of any advanced RAG system. We have covered the complete journey from classic Word2Vec word vectors to modern Sentence Transformers, understanding why BERT alone is not enough and how SBERT solves the problem.
The key points to remember:
- Embeddings capture semantic meaning in vector geometry
- Classic models (Word2Vec, GloVe) lack context and sentence-level representations
- BERT provides contextual embeddings but is not designed for similarity search
- Sentence Transformers (SBERT) are specifically optimized for semantic similarity
- FAISS enables efficient similarity search from thousands to billions of vectors
- Choosing the right model requires balancing quality, speed, cost and language requirements
- Domain fine-tuning can significantly improve performance on specialized corpora
In the next article we will explore Vector Databases in depth: Qdrant, Pinecone, Weaviate and Milvus, comparing architectures, performance and when to choose one over the other. We will also see how vector databases extend FAISS's capabilities with persistence, filtering and distributed scalability.
Continue the Series
- Article 1: RAG Explained - Fundamentals
- Article 2: Embeddings and Semantic Search (current)
- Article 3: Vector Database - Qdrant vs Pinecone vs Milvus
- Article 4: Hybrid Retrieval: BM25 + Vector Search
- Article 5: LangChain for RAG: Advanced Patterns
Also explore related articles: BERT and Transformers in NLP and pgvector for Semantic Search in PostgreSQL.