Embedding Models and Semantic Search: Complete Guide
In the first article of this series we explored the RAG architecture and its role in solving LLM hallucinations. The beating heart of every RAG system is retrieval: the ability to find, in a potentially huge knowledge base, the documents most relevant to a question. This capability is entirely based on embeddings and vector search.
An embedding is a numerical representation of the meaning of a text: a sequence of numbers (a vector) that captures the semantic relationships between words, phrases and documents. The quality of embeddings directly determines the quality of retrieval, and therefore the quality of the entire RAG system. Choosing the wrong embedding model means building the house foundations on sand.
In this second article of the AI Engineering and Advanced RAG series, we will take a complete journey: from the origins of embeddings with Word2Vec, through the BERT revolution, to modern Sentence Transformers. We will see how to generate embeddings, how to compare texts in vector space, how to build a semantic search engine with FAISS, and how to choose the right model for your use case.
Series Overview
| # | Article | Focus |
|---|---|---|
| 1 | RAG Explained | Fundamentals and complete architecture |
| 2 | You are here - Embeddings and Semantic Search | How texts become vectors |
| 3 | Vector Databases in Depth | Storage, indexing, similarity search |
| 4 | Hybrid Retrieval: BM25 + Vector Search | Combining keyword and semantic search |
| 5 | RAG in Production with LangChain | Practical end-to-end implementation |
| 6 | Prompt Engineering for LLMs | Templates, versioning and testing |
| 7 | Context Window Management | Optimizing LLM input |
| 8 | Multi-Agent Systems | Orchestration and coordination |
| 9 | Knowledge Graphs for AI | Structured knowledge + retrieval |
| 10 | RAG Evaluation and Monitoring | Metrics, benchmarking, production |
What You Will Learn
- What an embedding is and how it represents meaning in numerical form
- The evolution from Word2Vec to BERT to Sentence Transformers
- Why vanilla BERT does not work for similarity and how SBERT solves the problem
- How to choose the right embedding model from dozens of options
- How to implement semantic search with sentence-transformers and FAISS in Python
- Vector similarity metrics and when to use each one
- Architectural comparison of the main vector search engines
- How to fine-tune embeddings for specific domains
1. What is an Embedding
An embedding is a mathematical function that transforms a discrete object (a word, a phrase, a document, an image) into a vector of real numbers in a continuous fixed-dimensionality space. In practice, it converts human-readable text into a list of numbers understandable by the machine, while preserving the semantic relationships between the original texts.
The fundamental idea is that texts with similar meaning must have vectors that are close in space, while texts with different meaning must have distant vectors. This property is called semantic isomorphism: the structure of semantic relationships between words is preserved in the geometry of the vector space.
1.1 From One-Hot Encoding to Dense Vectors
To understand why embeddings are necessary, let us consider the simplest alternative: one-hot encoding. With a vocabulary of 50,000 words, each word is represented by a vector of 50,000 dimensions with a single 1 and everything else zeros.
```
ONE-HOT ENCODING (vocabulary of 50,000 words):

  "cat" = [0, 0, ..., 1, ..., 0, 0]   (50,000 dimensions, only one 1)
  "dog" = [0, 0, ..., 0, ..., 1, 0]   (50,000 dimensions, only one 1)

  Distance between "cat" and "dog" = same as "cat" and "refrigerator"
  No semantic information!

DENSE EMBEDDING (e.g. 384 dimensions):

  "cat" = [0.23, -0.45, 0.89, ..., 0.12]   (384 dimensions, all real numbers)
  "dog" = [0.21, -0.42, 0.91, ..., 0.15]   (384 dimensions, all real numbers)

  Distance between "cat" and "dog" = SMALL (domestic animals)
  Distance between "cat" and "refrigerator" = LARGE (different concepts)
  Meaning is captured in the geometry!
```
The problems with one-hot encoding are clear: the vectors are huge (dimensionality equal to the vocabulary), sparse (almost all zeros) and, most importantly, all orthogonal to each other. Any two words have the same distance, regardless of meaning. There is no way to distinguish "cat" from "feline" compared to "cat" from "economics".
Dense embeddings solve all three problems: the vectors are compact (a few hundred dimensions), dense (all values are meaningful) and capture semantic relationships in their geometry. Similar words have nearby vectors, and directions in space correspond to linguistic concepts.
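The contrast can be made concrete in a few lines of pure Python (the dense vectors below are hand-made and purely illustrative, not from a real model):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# One-hot: every pair of distinct words is orthogonal (cosine = 0)
vocab = ["cat", "dog", "refrigerator"]
one_hot = {w: [1.0 if i == j else 0.0 for j in range(len(vocab))]
           for i, w in enumerate(vocab)}
print(cosine(one_hot["cat"], one_hot["dog"]))           # 0.0
print(cosine(one_hot["cat"], one_hot["refrigerator"]))  # 0.0

# Dense (toy vectors): the geometry reflects meaning
dense = {
    "cat":          [0.9, 0.8, 0.1],
    "dog":          [0.8, 0.9, 0.2],
    "refrigerator": [0.1, 0.0, 0.9],
}
# "cat" is closer to "dog" than to "refrigerator"
print(cosine(dense["cat"], dense["dog"]) >
      cosine(dense["cat"], dense["refrigerator"]))  # True
```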
1.2 The Semantic Space
A fascinating property of embeddings is that semantic relationships transform into geometric relationships. The classic example is vector arithmetic: the vector "king" minus "man" plus "woman" produces a vector very close to "queen". This is not magic: it means that the space has captured the concept of "gender" as one direction and the concept of "royalty" as another direction.
```
Semantic relationships as vector operations:

  vec("king")   - vec("man")    + vec("woman") ~ vec("queen")
  vec("Paris")  - vec("France") + vec("Italy") ~ vec("Rome")
  vec("better") - vec("good")   + vec("large") ~ vec("larger")

Clusters in space:

  [cat, dog, horse, fish]     --> nearby (animals)
  [Python, Java, C++, Rust]   --> nearby (programming languages)
  [happy, joyful, cheerful]   --> very close (synonyms)
```
Fundamental Intuition
An embedding is essentially a meaning compressor. It takes the meaning of a text, with all its nuances, and compresses it into a point in multidimensional space. The position of that point relative to all other points captures all the semantic relationships the model has learned. This is the foundation on which all semantic search is built.
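A toy sketch of the vector arithmetic, with a hand-made 2-D space where one axis plays the role of "royalty" and the other "gender" (real embedding spaces have hundreds of dimensions and learned, not hand-picked, axes):

```python
import math

# Toy 2-D space: axis 0 ~ "royalty", axis 1 ~ "gender" (illustrative values)
vecs = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
}

def nearest(target, candidates):
    """Return the candidate word whose vector is most cosine-similar to target."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.hypot(*a) * math.hypot(*b))
    return max(candidates, key=lambda w: cosine(vecs[w], target))

# king - man + woman = [1.0, -1.0] --> closest to "queen"
analogy = [k - m + w for k, m, w in zip(vecs["king"], vecs["man"], vecs["woman"])]
print(nearest(analogy, ["queen", "man", "woman"]))  # queen
```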
2. Classic Word Embeddings: Word2Vec, GloVe, FastText
The modern history of embeddings begins in 2013 with Word2Vec, published by Tomas Mikolov and colleagues at Google. The revolutionary idea was simple: you can learn the meaning of a word from the context in which it appears. As linguist John Firth said in 1957: "You shall know a word by the company it keeps".
2.1 Word2Vec: CBOW and Skip-gram
Word2Vec proposes two neural architectures for learning embeddings:
- CBOW (Continuous Bag of Words): Given a window of context words, predict the central word. Example: given "the ___ barks loudly", predict "dog"
- Skip-gram: Given a central word, predict the context words. Example: given "dog", predict "the", "barks", "loudly"
```
CBOW (Continuous Bag of Words):
  Input:  context words ["the", "___", "barks", "loudly"]
  Output: target word "dog"

  Context --> [Embedding Layer] --> Average vectors --> [Softmax] --> "dog"
  Fast, good for frequent words

SKIP-GRAM:
  Input:  target word "dog"
  Output: context words ["the", "barks", "loudly"]

  "dog" --> [Embedding Layer] --> [Softmax] --> context words
  Slower, better for rare words

Typical parameters:
  - Embedding dimensions: 100-300
  - Context window: 5-10 words
  - Vocabulary: 100k-1M words
  - Training: billions of words (Wikipedia, Common Crawl)
```
2.2 GloVe and FastText
GloVe (Global Vectors for Word Representation, Stanford 2014) takes a different approach: it builds a global co-occurrence matrix and factorizes it to obtain embeddings. It captures global relationships that Word2Vec, with its local window, might miss.
FastText (Facebook 2016) extends Word2Vec by working at the subword level (character n-grams). The word "embedding" is also represented by its components: "emb", "mbe", "bed", "edd", etc. This allows generating embeddings even for words never seen during training (out-of-vocabulary words), a crucial advantage for morphologically rich languages.
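The n-gram decomposition is easy to sketch. FastText wraps each word in boundary markers `<` and `>` and extracts character n-grams of length 3 to 6 by default (the helper below is an illustration, not FastText's actual implementation):

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams with boundary markers, FastText-style."""
    padded = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams.append(padded[i:i + n])
    return grams

grams = char_ngrams("embedding", n_min=3, n_max=3)
print(grams[:4])        # ['<em', 'emb', 'mbe', 'bed']
print("edd" in grams)   # True
```

An out-of-vocabulary word still shares many n-grams with known words, which is how FastText can embed words it has never seen.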
Classic Word Embeddings Comparison
| Model | Year | Approach | Strength | Limitation |
|---|---|---|---|---|
| Word2Vec | 2013 | Local context prediction | Fast, effective | No OOV, no context |
| GloVe | 2014 | Global co-occurrence | Global relationships | No OOV, no context |
| FastText | 2016 | Character n-grams | Handles OOV | One vector per word |
2.3 The Fundamental Limitation: One Vector Per Word
All classic word embeddings share a structural limitation: they produce a single vector for each word, regardless of context. The word "bank" has the same embedding in "river bank" as in "bank account". This is a serious problem because the meaning of a word almost always depends on the context in which it appears.
Furthermore, these models operate at the level of individual words: they cannot produce an embedding for a phrase or paragraph. To represent a sentence, one must resort to rudimentary strategies like averaging the word vectors, losing information about order and syntactic structure.
Why Not Use Word2Vec for RAG
Classic word embeddings are inadequate for modern semantic search because: (1) they do not capture context, (2) they do not produce sentence-level embeddings, (3) averaging vectors loses critical information. With averaged vectors, the sentences "the dog bites the man" and "the man bites the dog" would have the same embedding. RAG requires models that understand the meaning of the entire sentence in its context.
3. Contextual Embeddings: The BERT Revolution
In 2018, Google published BERT (Bidirectional Encoder Representations from Transformers) and radically changed the landscape. BERT produces contextual embeddings: the representation of each word depends on the entire context of the sentence in which it appears. The word "bank" in "river bank" will have a different embedding from "bank account".
3.1 The BERT Architecture
BERT is based on the Transformer encoder: a neural network architecture that uses the self-attention mechanism to capture relationships between all words in a sentence simultaneously. Unlike recurrent networks (LSTM, GRU) that process text sequentially, BERT processes the entire sentence in parallel, leveraging bidirectional attention.
```
Input:  "I went to the bank to deposit money"

BERT:   [CLS] [I] [went] [to] [the] [bank] [to] [deposit] [money] [SEP]
                              |
                              v
        Each token attends to ALL other tokens (bidirectional attention)

Output: contextual vector for each token
        "bank" in context "deposit money"  --> vector for FINANCIAL bank
        "bank" in context "river, fishing" --> vector for RIVER bank
        [CLS] token = aggregated representation of the ENTIRE sentence
```
3.2 BERT Pre-training: MLM and NSP
BERT is pre-trained on two unsupervised tasks using enormous text corpora (Wikipedia + BookCorpus, 3.3 billion words):
- Masked Language Modeling (MLM): 15% of tokens are masked at random and BERT must predict them. This forces bidirectional understanding: to predict "bank" in "I went to the __ to deposit", BERT must understand both the left and right context.
- Next Sentence Prediction (NSP): BERT receives two sentences and must predict whether the second follows the first in the original text. This trains BERT to understand discourse-level relationships.
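The masking step of MLM can be sketched in a few lines. This is a simplified illustration: real BERT selects 15% of positions, then replaces 80% of those with `[MASK]`, 10% with a random token, and leaves 10% unchanged; here we only do the `[MASK]` replacement:

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=42):
    """Replace ~15% of tokens with [MASK]; return masked sequence and targets."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            targets[i] = tok  # the model must predict this token
        else:
            masked.append(tok)
    return masked, targets

tokens = "i went to the bank to deposit money".split()
masked, targets = mask_tokens(tokens)
print(masked)   # some positions replaced by [MASK]
print(targets)  # predictions use context from BOTH sides of each mask
```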
3.3 BERT Variants
After the original publication, numerous BERT variants emerged, each optimized for different use cases:
Main BERT Variants
| Model | Parameters | Strength | Use Case |
|---|---|---|---|
| BERT-base | 110M | Balanced | General NLP tasks |
| BERT-large | 340M | High performance | High-stakes tasks |
| RoBERTa | 125M | Better training | Classification, NER |
| DistilBERT | 66M | 40% smaller, 60% faster, 97% quality | Production with constraints |
| DeBERTa | 86M-1.5B | State-of-the-art NLU | Complex understanding tasks |
3.4 The Problem with BERT for Semantic Search
Despite being revolutionary, BERT has a fundamental problem for semantic search and RAG: it was designed for sequence classification and token classification, not for generating sentence-level embeddings suitable for similarity comparison.
Using the [CLS] token as a sentence representation (as many did initially) or averaging all token representations produces embeddings of very poor quality for similarity tasks. The reason is that BERT was never trained to produce sentence representations in a metric space where cosine distance is meaningful for semantic similarity.
```python
from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

def get_bert_embedding(text):
    inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # [CLS] token as sentence representation
    return outputs.last_hidden_state[:, 0, :]

emb1 = get_bert_embedding("I love machine learning")
emb2 = get_bert_embedding("I enjoy deep learning")
emb3 = get_bert_embedding("I hate broccoli")

# PROBLEM: vanilla BERT gives poor similarity results
# "machine learning" vs "deep learning" often LESS similar than expected
# "machine learning" vs "broccoli" often MORE similar than expected!
# BERT is not trained to maximize cosine similarity for related sentences
```
4. Sentence Transformers: The Solution for Semantic Search
In 2019, Nils Reimers and Iryna Gurevych published SBERT (Sentence-BERT): a BERT variant specifically fine-tuned to produce sentence embeddings that work well for semantic similarity. This is the foundation of modern semantic search and RAG retrieval.
4.1 Siamese Architecture and Triplet Loss
SBERT uses a Siamese network architecture: two identical BERT models (sharing the same weights) that process two sentences simultaneously and are trained to produce similar embeddings for semantically related sentences and different embeddings for unrelated sentences.
```
SIAMESE ARCHITECTURE:

  Sentence A ──> BERT + Pooling ──> embedding_a ──\
                                                   ──> Cosine Similarity ──> Loss
  Sentence B ──> BERT + Pooling ──> embedding_b ──/

TRAINING DATA (NLI - Natural Language Inference):
  Anchor:   "A man is playing guitar"
  Positive: "Someone is making music"   (entailment)    --> high similarity
  Negative: "A woman is cooking dinner" (contradiction) --> low similarity

TRIPLET LOSS:
  Loss = max(0, ||A-P||^2 - ||A-N||^2 + margin)
  Forces: dist(anchor, positive) + margin < dist(anchor, negative)
  Effect: semantically related sentences cluster together in space
```
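The triplet loss above is simple enough to compute by hand; here is a direct pure-Python transcription (squared Euclidean distances, margin = 1.0, toy vectors):

```python
def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(0, ||A-P||^2 - ||A-N||^2 + margin): pull the positive closer
    than the negative by at least `margin`."""
    d = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))  # squared L2
    return max(0.0, d(anchor, positive) - d(anchor, negative) + margin)

a = [1.0, 0.0]
p = [0.9, 0.1]    # close to anchor
n = [-1.0, 0.0]   # far from anchor
print(triplet_loss(a, p, n))   # 0.0 -- constraint already satisfied, no gradient

n2 = [0.8, 0.0]   # negative almost as close as the positive
print(triplet_loss(a, p, n2))  # > 0 -- training would push this negative away
```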
4.2 Pooling Strategies
To go from per-token BERT representations to a single sentence vector, SBERT uses different pooling strategies:
- Mean pooling: Average of all token vectors (best default choice)
- Max pooling: Maximum value for each dimension
- CLS pooling: Using only the [CLS] token (worst for SBERT)
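Mean pooling must skip padding tokens, which is what the attention mask is for. A minimal pure-Python sketch (an illustrative helper, not the sentence-transformers internals):

```python
def mean_pool(token_vectors, attention_mask):
    """Average token vectors, ignoring padding positions (mask = 0)."""
    dims = len(token_vectors[0])
    sums = [0.0] * dims
    count = 0
    for vec, m in zip(token_vectors, attention_mask):
        if m:
            count += 1
            for i in range(dims):
                sums[i] += vec[i]
    return [s / count for s in sums]

# 3 real tokens + 1 padding token
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [0.0, 0.0]]
mask = [1, 1, 1, 0]
print(mean_pool(tokens, mask))  # [3.0, 4.0] -- padding excluded from the average
```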
```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')  # 384 dimensions, fast

sentences = [
    "Machine learning is a subset of artificial intelligence",
    "Deep learning uses neural networks with many layers",
    "I enjoy hiking in the mountains",
    "The weather is nice today"
]

# Generate embeddings (batch processing, GPU if available)
embeddings = model.encode(sentences, batch_size=32, show_progress_bar=True)
print(f"Shape: {embeddings.shape}")  # (4, 384)

# Cosine similarity
similarities = cosine_similarity(embeddings)
print("\nSimilarity Matrix:")
for i, s1 in enumerate(sentences):
    for j, s2 in enumerate(sentences):
        if i < j:
            sim = similarities[i][j]
            print(f"  {s1[:40]} vs {s2[:40]}: {sim:.3f}")

# Expected output (approximate):
# ML vs Deep Learning: 0.784 (HIGH - related topics)
# ML vs hiking:        0.112 (LOW - unrelated)
# ML vs weather:       0.089 (LOW - unrelated)
```
4.3 Popular SBERT Models: Benchmarks and Recommendations
The MTEB (Massive Text Embedding Benchmark) is the standard reference for evaluating embedding models, covering 8 task types (retrieval, classification, clustering, reranking and more) across 58 datasets and 112 languages. Here is a selection of the best models for different use cases:
MTEB Benchmark: Top Models 2025
| Model | Dimensions | Parameters | MTEB Score | Speed | Best For |
|---|---|---|---|---|---|
| text-embedding-3-large | 3072 | ~570M | 64.6 | API only | Maximum quality, OpenAI API |
| e5-mistral-7b-instruct | 4096 | 7B | 66.6 | Slow | SotA local, multilingual |
| all-mpnet-base-v2 | 768 | 109M | 57.8 | Fast | Best open-source general |
| all-MiniLM-L6-v2 | 384 | 22M | 56.2 | Very fast | Production with constraints |
| bge-large-en-v1.5 | 1024 | 335M | 63.6 | Medium | Best free English model |
| paraphrase-multilingual-mpnet | 768 | 278M | 53.1 | Medium | 50+ languages including Italian |
5. Vector Similarity Metrics
Once we have embeddings, we need a way to measure how similar two vectors are. The choice of metric significantly impacts both the quality of results and computational performance.
5.1 Cosine Similarity
Cosine similarity measures the angle between two vectors, ignoring their magnitude. It is the most used metric for semantic search because it is invariant to the length of the vectors (a long document and a short one with the same content will be similar).
```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity: measures angle, invariant to magnitude"""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_dist(a, b):
    """Euclidean distance: straight-line distance in space"""
    return np.linalg.norm(a - b)

def dot_product(a, b):
    """Dot product: fast, considers both angle and magnitude"""
    return np.dot(a, b)

# Example
a = np.array([0.23, -0.45, 0.89, 0.12])
b = np.array([0.21, -0.42, 0.91, 0.15])
c = np.array([-0.89, 0.12, -0.34, 0.67])

print(f"Cosine similarity a-b: {cosine_sim(a, b):.4f}")       # ~0.999 (very similar)
print(f"Cosine similarity a-c: {cosine_sim(a, c):.4f}")       # ~-0.399 (different)
print(f"Euclidean distance a-b: {euclidean_dist(a, b):.4f}")  # ~0.051 (close)
print(f"Euclidean distance a-c: {euclidean_dist(a, c):.4f}")  # ~1.843 (far)
```
Metric Comparison
| Metric | Range | Best For | Considerations |
|---|---|---|---|
| Cosine Similarity | [-1, 1] | Most embedding models | Invariant to magnitude, 1 = identical |
| Dot Product | (-inf, +inf) | Normalized embeddings | Faster than cosine if normalized |
| Euclidean Distance | [0, +inf) | Dense embeddings with learned scale | 0 = identical, less common for NLP |
| Manhattan Distance | [0, +inf) | Sparse vectors | Less sensitive to outliers |
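For unit-length vectors the first three metrics are tightly related: squared Euclidean distance equals 2(1 - cosine), and dot product equals cosine, so all three produce the same ranking once embeddings are normalized. A quick numeric check in pure Python:

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(a, b):
    # For unit vectors, the dot product IS the cosine similarity
    return sum(x * y for x, y in zip(a, b))

a = normalize([0.23, -0.45, 0.89, 0.12])
b = normalize([0.21, -0.42, 0.91, 0.15])

sq_euclid = sum((x - y) ** 2 for x, y in zip(a, b))
# Identity: ||a - b||^2 = 2 * (1 - cos(a, b)) for unit vectors
print(abs(sq_euclid - 2 * (1 - cosine(a, b))) < 1e-12)  # True
```

This is why FAISS's `IndexFlatIP` (inner product) implements cosine search when you normalize embeddings first.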
6. Building Semantic Search with FAISS
FAISS (Facebook AI Similarity Search) is a library for efficient similarity search and dense vector clustering. It allows searching billions of vectors in milliseconds, using CPU or GPU. It is the foundation for many production RAG systems when a dedicated vector database is not needed.
6.1 Complete Semantic Search Engine
```python
import json
from typing import List, Tuple

import faiss
from sentence_transformers import SentenceTransformer

class SemanticSearchEngine:
    """Complete semantic search engine with FAISS and Sentence Transformers"""

    def __init__(self, model_name: str = 'all-MiniLM-L6-v2'):
        self.model = SentenceTransformer(model_name)
        self.dimension = self.model.get_sentence_embedding_dimension()
        self.index = None
        self.documents = []
        self.metadata = []

    def add_documents(self, texts: List[str], metadata: List[dict] = None):
        """Add documents to the search engine"""
        if metadata is None:
            metadata = [{} for _ in texts]

        # Generate embeddings
        print(f"Generating embeddings for {len(texts)} documents...")
        embeddings = self.model.encode(
            texts,
            batch_size=64,
            show_progress_bar=True,
            normalize_embeddings=True  # Important for cosine similarity!
        )

        # Initialize or expand index
        if self.index is None:
            # IndexFlatIP: Inner Product (= cosine with normalized vectors)
            self.index = faiss.IndexFlatIP(self.dimension)

        # Add to index
        self.index.add(embeddings.astype('float32'))
        self.documents.extend(texts)
        self.metadata.extend(metadata)
        print(f"Total documents in index: {self.index.ntotal}")

    def search(self, query: str, top_k: int = 5) -> List[Tuple[str, float, dict]]:
        """Search for the most relevant documents"""
        # Encode query
        query_embedding = self.model.encode(
            [query],
            normalize_embeddings=True
        ).astype('float32')

        # FAISS search
        scores, indices = self.index.search(query_embedding, top_k)

        results = []
        for score, idx in zip(scores[0], indices[0]):
            if idx != -1:  # -1 = not found
                results.append((
                    self.documents[idx],
                    float(score),  # cosine similarity
                    self.metadata[idx]
                ))
        return results

    def save(self, path: str):
        """Save index and documents to disk"""
        faiss.write_index(self.index, f"{path}.index")
        with open(f"{path}.json", 'w') as f:
            json.dump({
                'documents': self.documents,
                'metadata': self.metadata
            }, f)

    def load(self, path: str):
        """Load index from disk"""
        self.index = faiss.read_index(f"{path}.index")
        with open(f"{path}.json", 'r') as f:
            data = json.load(f)
        self.documents = data['documents']
        self.metadata = data['metadata']

# Usage example
engine = SemanticSearchEngine('all-mpnet-base-v2')

# Add corpus
corpus = [
    "RAG (Retrieval-Augmented Generation) combines LLMs with external knowledge",
    "Vector databases store embeddings and enable similarity search",
    "BERT is a bidirectional transformer model for NLP",
    "Sentence Transformers produce sentence-level semantic embeddings",
    "Python is a popular programming language for data science",
    "The LangChain framework simplifies building LLM applications",
    "Fine-tuning adapts pre-trained models to specific domains",
    "FAISS enables efficient billion-scale similarity search"
]
engine.add_documents(corpus)

# Test search
query = "How does semantic search work for AI?"
results = engine.search(query, top_k=3)

print(f"\nQuery: {query}")
print("\nTop 3 results:")
for text, score, meta in results:
    print(f"  Score: {score:.4f} | {text}")
```
6.2 FAISS Index Types
FAISS offers different index types with different quality/speed/memory tradeoffs:
```python
# 1. IndexFlatIP / IndexFlatL2: Exact search (brute force)
#    - Perfect precision, slow on large corpora
#    - Good up to ~100k vectors
index_flat = faiss.IndexFlatIP(dimension)

# 2. IndexIVFFlat: Inverted file with exact vectors
#    - Groups vectors into clusters (Voronoi cells)
#    - Search only in nearest clusters (nprobe parameter)
#    - Good for 100k - 10M vectors
quantizer = faiss.IndexFlatIP(dimension)
index_ivf = faiss.IndexIVFFlat(quantizer, dimension, 100)  # 100 clusters
index_ivf.train(training_vectors)
index_ivf.nprobe = 10  # search in 10 nearest clusters

# 3. IndexHNSW: Hierarchical Navigable Small World
#    - Graph structure, excellent speed/quality tradeoff
#    - Good for medium-large corpora
index_hnsw = faiss.IndexHNSWFlat(dimension, 32)  # M=32

# 4. IndexIVFPQ: IVF + Product Quantization
#    - Compression: reduces memory by 16-64x
#    - Slight quality loss
#    - Best for very large corpora (100M+)
index_ivfpq = faiss.IndexIVFPQ(quantizer, dimension, 100, 8, 8)

# RULE OF THUMB:
#   < 100k docs:       IndexFlatIP (exact, simple)
#   100k-10M docs:     IndexIVFFlat (fast, good recall)
#   > 10M docs:        IndexIVFPQ (compressed, high scale)
#   Real-time updates: IndexHNSW (no retraining)
```
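The rule of thumb can be encoded in a small helper (a heuristic sketch, not a hard rule: always benchmark recall and latency on your own data before committing):

```python
def recommend_faiss_index(num_docs, realtime_updates=False):
    """Map corpus size to a FAISS index family, per the rule of thumb above."""
    if realtime_updates:
        return "IndexHNSWFlat"   # graph index, no retraining on insert
    if num_docs < 100_000:
        return "IndexFlatIP"     # exact search is still fast enough
    if num_docs <= 10_000_000:
        return "IndexIVFFlat"    # clustered search, good recall
    return "IndexIVFPQ"          # compressed vectors for very large scale

print(recommend_faiss_index(50_000))       # IndexFlatIP
print(recommend_faiss_index(5_000_000))    # IndexIVFFlat
print(recommend_faiss_index(200_000_000))  # IndexIVFPQ
```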
7. Domain Fine-Tuning
Generic pre-trained models perform well across a wide range of domains, but for specific applications (legal, medical, financial, code) they can be significantly improved through fine-tuning. The idea is to adapt the model's representations so that domain-specific terms cluster correctly.
7.1 Fine-Tuning with Sentence Transformers
```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

# 1. Prepare training data
# Format: (anchor, positive, negative) or (sentence1, sentence2, label)
train_examples = [
    # Similar sentence pairs (label near 1) or dissimilar (label near 0)
    InputExample(texts=["RAG retrieval system", "document retrieval for LLMs"], label=0.9),
    InputExample(texts=["RAG retrieval system", "cooking pasta recipe"], label=0.1),
    InputExample(texts=["vector similarity search", "nearest neighbor search"], label=0.95),
    InputExample(texts=["embedding fine-tuning", "domain adaptation of models"], label=0.85),
    # ... thousands of examples from your domain
]

# 2. Load base model
model = SentenceTransformer('all-MiniLM-L6-v2')

# 3. Define DataLoader
dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# 4. Define loss function
# CosineSimilarityLoss: for labeled pairs (sentence1, sentence2, similarity)
loss = losses.CosineSimilarityLoss(model)
# Alternatively, for (anchor, positive, negative) triplets:
# loss = losses.TripletLoss(model, distance_metric=losses.TripletDistanceMetric.COSINE)

# 5. Train
# ... optionally define an EmbeddingSimilarityEvaluator with validation data
model.fit(
    train_objectives=[(dataloader, loss)],
    epochs=3,
    warmup_steps=100,
    output_path='./fine-tuned-model',
    show_progress_bar=True
)

# 6. Save and use
model.save('./fine-tuned-domain-model')
domain_model = SentenceTransformer('./fine-tuned-domain-model')
```
7.2 Strategies for Limited Data
If you have limited training data (common in specialized domains), there are several effective strategies:
Data-Efficient Fine-Tuning Strategies
- GPL (Generative Pseudo Labeling): Use an LLM (GPT-4, Claude) to generate query-document pairs from your corpus automatically, creating fine-tuning data without manual annotation.
- InPars: Similar to GPL, uses GPT-3 to generate relevant questions for each document, creating training pairs in an unsupervised manner.
- Knowledge Distillation: Train a small student model (e.g. MiniLM) to mimic a large teacher model (e.g. text-embedding-3-large) on your domain data, getting much of the large model's quality at lower cost.
- Contrastive learning with in-batch negatives: Uses other examples in the same batch as negatives, maximizing use of small datasets.
8. Choosing the Right Embedding Model
Choosing the embedding model is one of the most important decisions when building a RAG system. There is no single "best" model: the right choice depends on specific requirements.
8.1 Decision Framework
```
CHOICE DECISION TREE:

1. BUDGET / LATENCY
   - High budget, cloud API          --> text-embedding-3-large (OpenAI)
   - Medium budget, good performance --> bge-large-en-v1.5
   - Low latency required, self-hosted --> all-MiniLM-L6-v2

2. LANGUAGE
   - English only    --> bge-large-en-v1.5, all-mpnet-base-v2
   - Multilingual    --> paraphrase-multilingual-mpnet, multilingual-e5
   - Italian specific --> custom fine-tune or multilingual model

3. CORPUS SIZE
   - < 100k docs    --> any model + FAISS IndexFlat
   - 100k-10M docs  --> medium model + FAISS IVF
   - > 10M docs     --> fast model + FAISS IVFPQ or dedicated vector DB

4. DOMAIN
   - General  --> all-mpnet-base-v2 (good all-rounder)
   - Code     --> codellama-embed, code-bert
   - Medical  --> PubMedBERT, BioSentVec
   - Legal    --> legal-bert-base-uncased
   - Custom   --> fine-tune a base model on your data

5. PRIVACY (data cannot leave premises)
   - Self-hosted required --> open-source models on local infrastructure
   - No cloud at all      --> all-MiniLM-L6-v2, bge-small-en-v1.5
```
8.2 Performance vs Cost
Production Tradeoffs: Performance vs Cost
| Scenario | Recommended Model | Cost/1M tokens | Dimensions |
|---|---|---|---|
| Maximum quality (budget available) | text-embedding-3-large | $0.13 | 3072 |
| Balanced quality/cost (OpenAI) | text-embedding-3-small | $0.02 | 1536 |
| Best free self-hosted | bge-large-en-v1.5 | Free | 1024 |
| High volume, resource constrained | all-MiniLM-L6-v2 | Free | 384 |
| State-of-the-art local | e5-mistral-7b-instruct | Free (high GPU) | 4096 |
9. Embeddings in a Complete RAG Pipeline
Now let us put everything together to see how embeddings integrate into a complete RAG pipeline, from document ingestion to generating the final answer.
```python
from typing import List

import faiss
from openai import OpenAI
from sentence_transformers import SentenceTransformer

class RAGPipeline:
    """Complete RAG pipeline with semantic search"""

    def __init__(
        self,
        embedding_model: str = 'all-mpnet-base-v2',
        llm_model: str = 'gpt-4o-mini'
    ):
        self.embedder = SentenceTransformer(embedding_model)
        self.llm = OpenAI()
        self.llm_model = llm_model
        self.dim = self.embedder.get_sentence_embedding_dimension()
        self.index = faiss.IndexFlatIP(self.dim)
        self.chunks = []

    def ingest_documents(self, documents: List[str], chunk_size: int = 500):
        """Ingest and index documents"""
        all_chunks = []
        for doc in documents:
            # Simple word-window chunking (in production use recursive splitting)
            words = doc.split()
            step = chunk_size // 5  # ~100-word chunks for chunk_size=500
            for i in range(0, len(words), step):
                chunk = ' '.join(words[i:i + step])
                if chunk:
                    all_chunks.append(chunk)

        embeddings = self.embedder.encode(
            all_chunks,
            normalize_embeddings=True,
            batch_size=64
        ).astype('float32')

        self.index.add(embeddings)
        self.chunks.extend(all_chunks)
        return len(all_chunks)

    def retrieve(self, query: str, top_k: int = 5) -> List[tuple]:
        """Retrieve most relevant chunks for the query"""
        query_emb = self.embedder.encode(
            [query],
            normalize_embeddings=True
        ).astype('float32')
        scores, indices = self.index.search(query_emb, top_k)
        return [(self.chunks[i], float(s)) for s, i in zip(scores[0], indices[0]) if i != -1]

    def generate(self, query: str, top_k: int = 5) -> str:
        """Generate response with retrieved context"""
        relevant_chunks = self.retrieve(query, top_k)
        context = "\n\n".join(
            f"[Relevance: {score:.3f}]\n{chunk}"
            for chunk, score in relevant_chunks
        )
        prompt = f"""Answer the question based on the provided context.
If the context does not contain enough information, say so explicitly.

Context:
{context}

Question: {query}

Answer:"""
        response = self.llm.chat.completions.create(
            model=self.llm_model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.1,
            max_tokens=1000
        )
        return response.choices[0].message.content

# Usage
rag = RAGPipeline()
documents = ["Your documents here..."]
rag.ingest_documents(documents)
answer = rag.generate("Your question here?")
print(answer)
```
10. Best Practices and Anti-Patterns
10.1 Best Practices
Embedding Best Practices
- Always normalize embeddings before indexing and searching when using cosine similarity or an inner-product index: with unnormalized vectors, the dot product conflates magnitude with similarity and skews rankings.
- Match instruction prefix for asymmetric models (e.g. E5, BGE): use "query: " for questions and "passage: " for documents. Failing to do this can degrade quality by 10-20%.
- Chunk appropriately: embedding quality degrades with very short (<50 tokens) and very long (>512 tokens) texts. Aim for 200-400 tokens.
- Use batch processing: encode hundreds of documents at a time, not one by one. With GPU, batching can be 50-100x faster.
- Evaluate on your data: MTEB benchmarks are useful as a starting point but do not replace evaluation on your specific data and use case.
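The chunking guideline above can be sketched as a minimal word-window chunker with overlap (illustration only: it counts words rather than tokens, and production code should prefer splitting on semantic boundaries like paragraphs and sentences):

```python
def chunk_words(text, chunk_size=300, overlap=50):
    """Split text into overlapping word windows (sizes in words, not tokens)."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last window already covers the tail
    return chunks

doc = " ".join(f"w{i}" for i in range(700))
chunks = chunk_words(doc, chunk_size=300, overlap=50)
print(len(chunks))           # 3
print(chunks[1].split()[0])  # w250 -- overlap preserves context across boundaries
```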
10.2 Common Anti-Patterns
Anti-Patterns to Avoid
- Using vanilla BERT for similarity: [CLS] pooling on BERT-base without SBERT fine-tuning gives poor results. Always use models specifically fine-tuned for semantic similarity.
- Ignoring the embedding/search metric mismatch: if the model was trained with cosine similarity, do not use Euclidean distance in the index. They give different rankings.
- Never re-embedding after model change: if you change embedding model, you must re-index the entire corpus. Old embeddings are incompatible with the new model.
- Using oversized models without justification: a 7B parameter model costs 50x a 22M model in inference. Measure quality improvement before committing to expensive infrastructure.
- Ignoring multilingual limitations: English-only models used on Italian or mixed text will perform very poorly. Use multilingual models for non-English content.
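A small sketch of the metric-mismatch anti-pattern: on unnormalized vectors, cosine and Euclidean can rank the same two candidates in opposite order (toy vectors, purely illustrative):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def euclid(a, b):
    return math.dist(a, b)

query     = [1.0, 1.0]
doc_short = [0.1, 0.1]  # same direction as query, tiny magnitude
doc_other = [1.4, 0.2]  # different direction, similar magnitude

# Cosine ranks doc_short first (identical direction)...
print(cosine(query, doc_short) > cosine(query, doc_other))  # True
# ...while Euclidean ranks doc_other first (closer in absolute position)
print(euclid(query, doc_other) < euclid(query, doc_short))  # True
```

Normalizing the vectors removes the disagreement, which is why normalization plus an inner-product index is the safe default.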
Conclusions
Embeddings and semantic search are the fundamental building blocks of any advanced RAG system. We have covered the complete journey from classic Word2Vec word vectors to modern Sentence Transformers, understanding why BERT alone is not enough and how SBERT solves the problem.
The key points to remember:
- Embeddings capture semantic meaning in vector geometry
- Classic models (Word2Vec, GloVe) lack context and sentence-level representations
- BERT provides contextual embeddings but is not designed for similarity search
- Sentence Transformers (SBERT) are specifically optimized for semantic similarity
- FAISS enables efficient similarity search from thousands to billions of vectors
- Choosing the right model requires balancing quality, speed, cost and language requirements
- Domain fine-tuning can significantly improve performance on specialized corpora
In the next article we will explore Vector Databases in depth: Qdrant, Pinecone, Weaviate and Milvus, comparing architectures, performance and when to choose one over the other. We will also see how vector databases extend FAISS's capabilities with persistence, filtering and distributed scalability.
Continue the Series
- Article 1: RAG Explained - Fundamentals
- Article 2: Embeddings and Semantic Search (current)
- Article 3: Vector Database - Qdrant vs Pinecone vs Milvus
- Article 4: Hybrid Retrieval: BM25 + Vector Search
- Article 5: LangChain for RAG: Advanced Patterns
Also explore related articles: BERT and Transformers in NLP and pgvector for Semantic Search in PostgreSQL.