Semantic Similarity and Sentence Embeddings: Comparing Texts
How similar are two sentences? Not in the lexical sense (same words), but in the semantic sense (same meaning). "The dog chases the cat" and "The feline is being pursued by the canine" are semantically almost identical but lexically very different. Answering this question is the challenge of Semantic Similarity.
Applications are everywhere: semantic search engines, recommendation systems, content deduplication, question answering, RAG (Retrieval-Augmented Generation), chatbots, and FAQ matching. In this article we build semantic similarity systems from scratch: from cosine similarity to sentence embeddings with Sentence-BERT, to fast vector search with FAISS.
This is the ninth article in the Modern NLP: from BERT to LLMs series. This topic connects directly with the AI Engineering/RAG series where semantic embeddings are the heart of dense retrieval.
What You Will Learn
- Cosine similarity and dot product: formulas and when to use them
- Why standard BERT fails for semantic similarity and why Sentence-BERT is needed
- Sentence-BERT (SBERT): siamese architecture and training with triplet loss
- sentence-transformers models on HuggingFace: which to choose
- Semantic search on large corpora with FAISS
- Sentence embeddings for Italian and multilingual text
- Benchmarking: STS-B, SICK and evaluation metrics
- Cross-encoder vs bi-encoder: quality/speed trade-offs
- Fine-tuning a sentence transformer on your domain
- Complete implementation of a FAQ matching system
- Production-ready pipeline with caching and optimization
1. The Semantic Similarity Problem
Consider these four groups of sentences and their challenges:
Semantic Similarity Examples
- High similarity: "The bank raised interest rates" / "Interest rates were increased by the financial institution"
- Low similarity: "The bank raised interest rates" / "The cat sleeps on the sofa"
- Misleading (same words, different meaning): "She sat on the river bank" / "He went to the bank to withdraw money"
- Cross-lingual: "The dog runs fast" / "Il cane corre veloce" (same semantics, different languages)
Traditional metrics like Jaccard similarity or BM25 rely on lexical overlap and fail completely with synonyms and paraphrases. Even TF-IDF cannot capture meaning. The solution lies in semantic embeddings: dense vector representations where geometric proximity reflects semantic proximity.
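To make the failure concrete, here is a minimal sketch of Jaccard similarity applied to the paraphrase pair from the introduction: the semantically identical pair scores near zero (only "the" overlaps), while a lexically overlapping but unrelated sentence scores high.

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard similarity over lowercase word sets (pure lexical overlap)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    return len(set_a & set_b) / len(set_a | set_b)

s1 = "The dog chases the cat"
s2 = "The feline is being pursued by the canine"  # same meaning
s3 = "The dog chases the ball"                    # different meaning

print(jaccard_similarity(s1, s2))  # 0.1 - only "the" is shared
print(jaccard_similarity(s1, s3))  # 0.6 - high despite different meaning
```

The exact opposite of what a semantic metric should produce.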
1.1 Cosine Similarity: The Fundamental Metric
Cosine similarity measures the angle between two vectors in embedding space. It ranges from -1 (opposite) to 1 (identical), with 0 for orthogonal vectors. The mathematical formula is:
cos(A, B) = (A · B) / (||A|| · ||B||)
When vectors are normalized to unit norm, cosine similarity equals the dot product, making computation much more efficient on GPU hardware.
import numpy as np
import torch
from torch.nn import functional as F

def cosine_similarity(vec1, vec2):
    """Cosine similarity between two numpy vectors."""
    dot_product = np.dot(vec1, vec2)
    norm1 = np.linalg.norm(vec1)
    norm2 = np.linalg.norm(vec2)
    return dot_product / (norm1 * norm2)

# PyTorch version (efficient for batches)
def cosine_similarity_batch(emb1, emb2):
    """Cosine similarity between batches of embeddings."""
    # Normalize to unit norm
    emb1_norm = F.normalize(emb1, p=2, dim=1)
    emb2_norm = F.normalize(emb2, p=2, dim=1)
    return (emb1_norm * emb2_norm).sum(dim=1)

# Example with simple vectors
vec_a = np.array([1.0, 0.5, 0.3, 0.8])
vec_b = np.array([0.9, 0.4, 0.4, 0.7])   # similar to a
vec_c = np.array([-0.2, 0.8, -0.5, 0.1])  # different from a

print(f"sim(a, b) = {cosine_similarity(vec_a, vec_b):.4f}")  # high
print(f"sim(a, c) = {cosine_similarity(vec_a, vec_c):.4f}")  # low

# Similarity matrix for a sentence corpus
def similarity_matrix(embeddings):
    """N x N similarity matrix for a set of embeddings."""
    # Normalize
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    normalized = embeddings / norms
    # Matrix product for all pairs
    return normalized @ normalized.T
# Output: (N, N) matrix where [i,j] = sim(sentence_i, sentence_j)
1.2 Other Distance Metrics
Comparison of Similarity/Distance Metrics
| Metric | Formula | Range | Use Case |
|---|---|---|---|
| Cosine Similarity | (A · B) / (‖A‖ ‖B‖) | [-1, 1] | Standard semantic similarity |
| Euclidean Distance | ‖A − B‖ | [0, +inf) | Clustering, k-NN |
| Dot Product | A · B | (-inf, +inf) | Equals cosine on unit-norm vectors |
| Manhattan Distance | Σ \|A_i − B_i\| | [0, +inf) | Robustness to outliers |
| Pearson Correlation | cov(A, B) / (σ_A σ_B) | [-1, 1] | Evaluation on STS benchmarks |
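All five metrics from the table can be computed with numpy in a few lines; the vectors below reuse the toy example from section 1.1.

```python
import numpy as np

a = np.array([1.0, 0.5, 0.3, 0.8])
b = np.array([0.9, 0.4, 0.4, 0.7])

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))  # angle between vectors
euclidean = np.linalg.norm(a - b)                          # straight-line distance
dot = a @ b                                                # unnormalized similarity
manhattan = np.abs(a - b).sum()                            # L1 distance
pearson = np.corrcoef(a, b)[0, 1]                          # centered correlation

print(f"cosine={cosine:.4f}, euclidean={euclidean:.4f}, dot={dot:.4f}")
print(f"manhattan={manhattan:.4f}, pearson={pearson:.4f}")
```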
2. Why Standard BERT Fails for Similarity
Intuitively, we might use BERT to extract sentence embeddings and compare them. But research by Reimers & Gurevych (2019) showed that this approach is surprisingly ineffective.
The core problem is that BERT is pre-trained with Masked Language Modeling (MLM) and
Next Sentence Prediction (NSP). The [CLS] token encodes information
useful for classifying sentence pairs (NSP), but it is not optimized to produce
embeddings that reflect semantic similarity when compared via cosine similarity.
Furthermore, mean pooling over all tokens produces an anisotropic embedding space: directions are not uniformly distributed, and clusters of semantically different sentences overlap significantly.
from transformers import BertModel, BertTokenizer
import torch
import numpy as np

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

def bert_mean_pooling(text):
    """Sentence embedding with mean pooling over BERT."""
    inputs = tokenizer(text, return_tensors='pt',
                       truncation=True, max_length=128, padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean pooling (excludes padding)
    mask = inputs['attention_mask'].unsqueeze(-1)
    embeddings = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)
    return embeddings[0].numpy()

# Test: semantically similar vs different sentences
sent1 = "The weather is lovely today."
sent2 = "It's so beautiful today outside."  # similar
sent3 = "My dog bit the mailman."           # different

emb1 = bert_mean_pooling(sent1)
emb2 = bert_mean_pooling(sent2)
emb3 = bert_mean_pooling(sent3)

sim_1_2 = np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2))
sim_1_3 = np.dot(emb1, emb3) / (np.linalg.norm(emb1) * np.linalg.norm(emb3))

print(f"sim(sent1, sent2) = {sim_1_2:.4f}")  # ~0.93 - ok
print(f"sim(sent1, sent3) = {sim_1_3:.4f}")  # ~0.87 - too high!

# Problem: BERT tends to produce similar embeddings for all sentences.
# Neither mean-pooled vectors nor [CLS] (trained on NSP) are optimized
# for semantic similarity. The solution is Sentence-BERT.
BERT Performance on STS-B (Benchmark)
On the STS-B (Semantic Textual Similarity Benchmark) task, BERT with mean pooling achieves only Pearson r = 0.54, well below supervised approaches like SBERT (0.87). Even the [CLS] token alone reaches only 0.20. For semantic similarity tasks, SBERT is the correct choice.
3. Sentence-BERT (SBERT): The Solution
Sentence-BERT (Reimers and Gurevych, EMNLP 2019) solves the problem with a siamese architecture: two weight-sharing BERT instances process two sentences separately, and the loss function forces semantically similar representations to be close in vector space.
3.1 The Siamese Architecture
The key insight is that both "networks" share exactly the same weights. It is not two separate models but the same model called twice. The loss is computed on the pair of outputs:
- Regression objective: MSE between predicted cosine similarity and human score (for STS)
- Classification objective: Cross-entropy on [u, v, |u-v|] (for NLI)
- Triplet loss: margin loss on anchor/positive/negative (for paraphrase mining)
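The triplet objective listed above can be sketched in plain PyTorch. This is a toy illustration, not the library's implementation: the margin value and the use of cosine distance are assumptions (sentence-transformers' `TripletLoss` defaults to Euclidean distance).

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Margin-based triplet loss on (anchor, positive, negative) embeddings:
    pushes d(anchor, positive) at least `margin` below d(anchor, negative)."""
    d_pos = 1 - F.cosine_similarity(anchor, positive)  # cosine distance
    d_neg = 1 - F.cosine_similarity(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()

# Toy batch of 3-dim embeddings
anchor   = torch.tensor([[1.0, 0.0, 0.0]])
positive = torch.tensor([[0.9, 0.1, 0.0]])  # close to the anchor
negative = torch.tensor([[0.0, 1.0, 0.0]])  # far from the anchor

print(triplet_loss(anchor, positive, negative).item())  # 0.0 - already separated
print(triplet_loss(anchor, negative, positive).item())  # > 0 - triplet violated
```

During training, the gradient of this loss pulls positives toward their anchors and pushes negatives away, which is exactly what shapes the embedding space for similarity search.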
from sentence_transformers import SentenceTransformer, util
import torch

# Load a sentence-transformers model
# Multilingual model (includes Italian, Spanish, French, German, Chinese...)
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

# Optimized for English (higher accuracy for English-only)
# model = SentenceTransformer('all-MiniLM-L6-v2')

# Encode sentences (batch-optimized)
sentences = [
    "The weather is lovely today.",
    "It's so beautiful today outside.",
    "He drove to the stadium.",
    "La giornata è bellissima oggi.",           # Italian
    "Il tempo è meraviglioso questa mattina.",  # similar Italian
]

# Encode everything at once (much more efficient than a loop)
embeddings = model.encode(sentences, batch_size=32, show_progress_bar=False)
print(f"Embedding shape: {embeddings.shape}")  # (5, 384)

# Calculate similarities
cos_scores = util.cos_sim(embeddings, embeddings)

print("\nSimilarity matrix (pairs with score > 0.6):")
for i in range(len(sentences)):
    for j in range(i + 1, len(sentences)):
        score = cos_scores[i][j].item()
        if score > 0.6:  # show only similar pairs
            print(f"  {i+1} vs {j+1}: {score:.4f}")
            print(f"    '{sentences[i][:50]}'")
            print(f"    '{sentences[j][:50]}'")

# Pairwise similarity for specific pairs
sim = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"\nsim(EN1, EN2) = {sim:.4f}")  # ~0.85 (similar sentences)

sim_cross = util.cos_sim(embeddings[0], embeddings[3]).item()
print(f"sim(EN1, IT1) = {sim_cross:.4f}")  # ~0.75 (cross-lingual!)
4. sentence-transformers Models: Which to Choose
Main sentence-transformers Models (2024-2025)
| Model | Languages | Dim | Speed | STS-B Pearson |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | EN | 384 | Very fast | 0.834 |
| all-mpnet-base-v2 | EN | 768 | Medium | 0.869 |
| paraphrase-multilingual-MiniLM-L12-v2 | 50+ languages | 384 | Fast | 0.821 |
| paraphrase-multilingual-mpnet-base-v2 | 50+ languages | 768 | Medium | 0.853 |
| intfloat/multilingual-e5-large | 100+ languages | 1024 | Slow | 0.892 |
| text-embedding-3-small (OpenAI) | Multilingual | 1536 | API only | ~0.90 |
4.1 Model Selection: Practical Guide
The choice depends on three main factors: language, speed, and required quality.
from sentence_transformers import SentenceTransformer
import time
import numpy as np

def benchmark_model(model_name, sentences, n_runs=3):
    """Benchmark encoding speed of a sentence-transformer model."""
    model = SentenceTransformer(model_name)
    # Warmup
    model.encode(sentences[:2])
    # Measure speed
    times = []
    for _ in range(n_runs):
        start = time.time()
        embs = model.encode(sentences)
        times.append(time.time() - start)
    avg_time = np.mean(times)
    dim = embs.shape[1]
    print(f"Model: {model_name}")
    print(f"  Embedding dim: {dim}")
    print(f"  Avg encoding time ({len(sentences)} sentences): {avg_time*1000:.1f}ms")
    print(f"  Throughput: {len(sentences)/avg_time:.0f} sentences/sec")

sentences_test = [
    "The sun shines brightly over the city.",
    "It is a beautiful sunny day today.",
    "Rome is the capital city of Italy.",
    "Juventus won the championship last year.",
    "Artificial intelligence is changing the world.",
] * 20  # 100 sentences

# Benchmark multilingual models
for model_name in [
    'paraphrase-multilingual-MiniLM-L12-v2',
    'paraphrase-multilingual-mpnet-base-v2',
    'intfloat/multilingual-e5-small',
]:
    benchmark_model(model_name, sentences_test)
    print()
5. Semantic Search with FAISS
For large corpora (millions of documents), brute-force search (computing similarity with every document) is too slow. FAISS (Facebook AI Similarity Search) enables approximate nearest neighbor search in sub-linear time with different index types.
5.1 FAISS Index Types
FAISS Indexes: Speed/Accuracy Trade-offs
| Index | Type | Use Case | Recall (%) | Speed |
|---|---|---|---|---|
| IndexFlatL2 | Exact | < 100K docs | 100% | Slow |
| IndexFlatIP | Exact (cosine) | < 100K docs | 100% | Slow |
| IndexIVFFlat | Approximate | 100K - 10M | ~95% | Fast |
| IndexHNSW | Approximate | 1M+ | ~99% | Very fast |
| IndexIVFPQ | Compressed | 10M+, limited RAM | ~85% | Very fast |
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer
import time

model = SentenceTransformer('all-MiniLM-L6-v2')

# Example corpus: Wikipedia-style articles
corpus = [
    "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris.",
    "Apple Inc. is an American multinational technology company founded by Steve Jobs.",
    "Python is a high-level, general-purpose programming language.",
    "The Mediterranean diet is based on traditional foods from countries bordering the sea.",
    "Quantum computing uses quantum-mechanical phenomena such as superposition.",
    "The Amazon River is the largest river by discharge volume in the world.",
    "Artificial neural networks are computing systems inspired by biological neural networks.",
    "The Sistine Chapel ceiling was painted by Michelangelo between 1508 and 1512.",
    "Machine learning is a subset of artificial intelligence focused on algorithms.",
    "The Colosseum is an oval amphitheatre in the centre of Rome, Italy.",
]

# Encode the corpus (offline, done once)
print("Encoding corpus...")
start = time.time()
corpus_embeddings = model.encode(corpus, convert_to_numpy=True, show_progress_bar=False)
print(f"Encoded {len(corpus)} docs in {time.time()-start:.2f}s")
print(f"Embeddings shape: {corpus_embeddings.shape}")  # (10, 384)

# Build FAISS index
dim = corpus_embeddings.shape[1]  # 384

# IndexFlatIP: exact, cosine similarity with normalized vectors
index_ip = faiss.IndexFlatIP(dim)
# Normalize for cosine similarity (dot product on unit-norm vectors)
faiss.normalize_L2(corpus_embeddings)
index_ip.add(corpus_embeddings)
print(f"Index size: {index_ip.ntotal} vectors")

# IndexHNSW: approximate but very fast, good for production
# M = number of connections per node (16-64 in production)
index_hnsw = faiss.IndexHNSWFlat(dim, 32, faiss.METRIC_INNER_PRODUCT)
index_hnsw.hnsw.efConstruction = 200  # higher = better recall at build
index_hnsw.hnsw.efSearch = 128        # higher = better recall at search
index_hnsw.add(corpus_embeddings)     # populate it like the flat index

# Semantic search function
def semantic_search(query, index, corpus, model, k=3):
    """Semantic search: returns the k most similar documents to the query."""
    query_emb = model.encode([query], convert_to_numpy=True)
    faiss.normalize_L2(query_emb)
    start = time.time()
    distances, indices = index.search(query_emb, k)
    search_time = (time.time() - start) * 1000
    print(f"\nQuery: '{query}'")
    print(f"Search time: {search_time:.2f}ms")
    for rank, (dist, idx) in enumerate(zip(distances[0], indices[0]), 1):
        print(f"  {rank}. [{dist:.4f}] {corpus[idx][:80]}")
    return [(corpus[i], float(d)) for i, d in zip(indices[0], distances[0])]

# Test queries
semantic_search("ancient Roman architecture", index_ip, corpus, model)
semantic_search("programming language features", index_ip, corpus, model)
semantic_search("painting and art in Italy", index_ip, corpus, model)
5.2 Persistence and Loading the Index
import faiss
import json
import os

def build_and_save_index(corpus, model, index_path="faiss_index.bin",
                         corpus_path="corpus.json"):
    """Build and save a FAISS index to disk."""
    # Encode
    embeddings = model.encode(corpus, convert_to_numpy=True, show_progress_bar=True)
    faiss.normalize_L2(embeddings)
    dim = embeddings.shape[1]
    index = faiss.IndexFlatIP(dim)
    index.add(embeddings)
    # Save FAISS index
    faiss.write_index(index, index_path)
    # Save corpus (to retrieve texts)
    with open(corpus_path, 'w', encoding='utf-8') as f:
        json.dump(corpus, f, ensure_ascii=False, indent=2)
    print(f"Index saved: {index.ntotal} vectors -> {index_path}")
    return index

def load_index(index_path="faiss_index.bin", corpus_path="corpus.json"):
    """Load FAISS index and corpus from disk."""
    if not os.path.exists(index_path):
        raise FileNotFoundError(f"Index not found: {index_path}")
    index = faiss.read_index(index_path)
    with open(corpus_path, 'r', encoding='utf-8') as f:
        corpus = json.load(f)
    print(f"Index loaded: {index.ntotal} vectors")
    return index, corpus

# Usage
# First run: build and save
# index = build_and_save_index(my_corpus, model)
# Subsequent restarts: load directly (much faster)
# index, corpus = load_index()
6. FAQ Matching: Complete Use Case
A practical application of semantic similarity: automatic matching of user questions with existing FAQs. This pattern is the foundation of many chatbots and customer support systems.
from sentence_transformers import SentenceTransformer, util
import torch

class FAQMatcher:
    """Semantic FAQ matching via sentence embeddings."""

    def __init__(self, model_name='paraphrase-multilingual-MiniLM-L12-v2',
                 threshold=0.7):
        self.model = SentenceTransformer(model_name)
        self.threshold = threshold
        self.faqs = []
        self.faq_embeddings = None

    def load_faqs(self, faqs: list):
        """
        faqs: list of dicts with 'question', 'answer', 'category'
        """
        self.faqs = faqs
        questions = [faq['question'] for faq in faqs]
        print(f"Encoding {len(questions)} FAQs...")
        self.faq_embeddings = self.model.encode(
            questions,
            convert_to_tensor=True,
            show_progress_bar=False
        )
        print("FAQs ready for search!")

    def match(self, user_query: str, top_k: int = 3) -> list:
        """Find the FAQs most similar to the user question."""
        if self.faq_embeddings is None:
            raise ValueError("Load FAQs first with load_faqs()")
        query_emb = self.model.encode(user_query, convert_to_tensor=True)
        scores = util.cos_sim(query_emb, self.faq_embeddings)[0]
        top_k_indices = torch.topk(scores, k=min(top_k, len(self.faqs))).indices
        results = []
        for idx in top_k_indices:
            score = scores[idx].item()
            if score >= self.threshold:
                results.append({
                    'question': self.faqs[idx]['question'],
                    'answer': self.faqs[idx]['answer'],
                    'category': self.faqs[idx].get('category', 'N/A'),
                    'score': round(score, 4)
                })
        return results

    def respond(self, user_query: str) -> str:
        """Automatic response to the user question."""
        matches = self.match(user_query, top_k=1)
        if not matches:
            return f"Sorry, no answer found for '{user_query}'. Please contact support."
        best = matches[0]
        return f"[{best['category']}] {best['answer']} (Confidence: {best['score']:.2f})"

# Usage example
faqs_ecommerce = [
    {
        "question": "How can I return a product?",
        "answer": "You can return any product within 30 days of purchase by contacting support.",
        "category": "Returns"
    },
    {
        "question": "How long does shipping take?",
        "answer": "Standard delivery takes 3-5 business days; express shipping takes 24 hours.",
        "category": "Shipping"
    },
    {
        "question": "What payment methods do you accept?",
        "answer": "We accept credit cards, PayPal, bank transfer, and cash on delivery.",
        "category": "Payments"
    },
    {
        "question": "Is the product under warranty?",
        "answer": "All products come with a 2-year statutory consumer warranty.",
        "category": "Warranty"
    },
    {
        "question": "Can I track my order?",
        "answer": "Yes, you will receive an email with a tracking number once shipped.",
        "category": "Orders"
    },
]

matcher = FAQMatcher()
matcher.load_faqs(faqs_ecommerce)

# Test with paraphrased questions
test_queries = [
    "I want to send an item back",
    "When will my package arrive?",
    "Do you accept bank transfers?",
    "I need the tracking code",
    "The item broke, what should I do?",  # not exact, will map to nearest
]

print("\n=== FAQ Matching ===")
for query in test_queries:
    response = matcher.respond(query)
    print(f"\nQuestion: {query}")
    print(f"Response: {response}")
7. Cross-encoder vs Bi-encoder
There are two approaches to semantic similarity offering different quality/speed trade-offs. Understanding them is essential for choosing the right architecture.
Bi-encoder vs Cross-encoder Comparison
| Aspect | Bi-encoder (SBERT) | Cross-encoder |
|---|---|---|
| Architecture | Two separate BERTs, produces embeddings | One BERT processes the pair together |
| Speed | Very fast (pre-computed embeddings) | Slow (processes every pair) |
| Scalability | Millions of documents | Only hundreds of pairs |
| Quality | Good (~0.87 Pearson on STS-B) | Excellent (~0.92 Pearson) |
| Use case | Retrieval, semantic search | Reranking retrieved results |
| Query cost | One encoder pass + n dot products (embeddings precomputed) | n full transformer passes (one per pair) |
from sentence_transformers import SentenceTransformer, CrossEncoder, util

# Bi-encoder for initial retrieval (fast)
bi_encoder = SentenceTransformer('all-MiniLM-L6-v2')
# Cross-encoder for reranking (accurate)
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# Two-stage pipeline (best of both worlds)
def retrieval_and_rerank(query, corpus, corpus_embeddings, top_k=100, final_k=5):
    """
    Stage 1: Bi-encoder retrieval (fast, returns the top_k candidates)
    Stage 2: Cross-encoder reranking (accurate, over those candidates)
    """
    # Stage 1: Bi-encoder retrieval
    query_emb = bi_encoder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, corpus_embeddings, top_k=top_k)[0]
    # Stage 2: Cross-encoder reranking
    cross_inp = [[query, corpus[hit['corpus_id']]] for hit in hits]
    cross_scores = cross_encoder.predict(cross_inp)
    # Combine and reorder
    for hit, score in zip(hits, cross_scores):
        hit['cross_score'] = score
    hits = sorted(hits, key=lambda x: x['cross_score'], reverse=True)[:final_k]
    print(f"\nQuery: '{query}'")
    for rank, hit in enumerate(hits, 1):
        bi_score = hit['score']
        cross_score = hit['cross_score']
        doc = corpus[hit['corpus_id']][:80]
        print(f"  {rank}. [bi={bi_score:.3f}, cross={cross_score:.3f}] {doc}")
    return hits

# Encode the corpus (from section 5) once
corpus_embs = bi_encoder.encode(corpus, convert_to_tensor=True)
retrieval_and_rerank("ancient Roman buildings", corpus, corpus_embs)
8. Evaluation: STS-B and Metrics
Correct evaluation of a semantic similarity system requires standardized benchmark datasets. STS-B is the main reference for English, while multilingual benchmarks like MTEB cover Italian and other languages.
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from datasets import load_dataset
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Load STS-B for evaluation
stsb = load_dataset("mteb/stsbenchmark-sts")
val_data = stsb['validation']

# Prepare data for the evaluator
sentences1 = val_data['sentence1']
sentences2 = val_data['sentence2']
scores = [s / 5.0 for s in val_data['score']]  # normalize 0-5 to 0-1

model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

# Automatic evaluation using the built-in evaluator
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=sentences1,
    sentences2=sentences2,
    scores=scores,
    name="sts-val"
)
# Note: depending on the sentence-transformers version, evaluate() returns
# either a single float (the evaluator's main score, Spearman by default)
# or a dict of metrics
result = model.evaluate(evaluator)
print(f"STS-B validation - evaluator result: {result}")

# Manual evaluation with Pearson and Spearman correlation
emb1 = model.encode(sentences1, show_progress_bar=False)
emb2 = model.encode(sentences2, show_progress_bar=False)

from numpy.linalg import norm
cos_sims = [
    np.dot(e1, e2) / (norm(e1) * norm(e2))
    for e1, e2 in zip(emb1, emb2)
]

pearson_r, _ = pearsonr(cos_sims, scores)
spearman_r, _ = spearmanr(cos_sims, scores)
print(f"Pearson:  {pearson_r:.4f}")
print(f"Spearman: {spearman_r:.4f}")

# Error analysis: find the pairs with the largest prediction error
errors = [(abs(p - t), s1, s2, p, t)
          for p, t, s1, s2 in zip(cos_sims, scores, sentences1, sentences2)]
errors.sort(reverse=True)

print("\n=== Top 3 Errors ===")
for err, s1, s2, pred, true in errors[:3]:
    print(f"  Error: {err:.3f} | Pred: {pred:.3f} | True: {true:.3f}")
    print(f"    '{s1[:60]}'")
    print(f"    '{s2[:60]}'")
9. Fine-tuning a Sentence Transformer on Your Domain
Pre-trained models perform well on general text, but for specific domains (medical, legal, technical) it is worthwhile to fine-tune with annotated sentence pairs.
from sentence_transformers import (
    SentenceTransformer,
    InputExample,
    losses,
    evaluation
)
from torch.utils.data import DataLoader

# Training data: pairs (sentence1, sentence2, score)
# Score: 0.0 = completely different, 1.0 = identical
train_examples = [
    InputExample(texts=["Type 2 diabetes diagnosis", "Patient with chronic hyperglycemia"], label=0.85),
    InputExample(texts=["Antibiotic prescription", "Amoxicillin therapy"], label=0.80),
    InputExample(texts=["Knee surgery", "Meniscus arthroscopy"], label=0.75),
    InputExample(texts=["High blood pressure", "Arterial hypertension"], label=0.95),
    InputExample(texts=["Chest pain", "Heartburn"], label=0.30),
    InputExample(texts=["Femur fracture", "Heart attack"], label=0.05),
]

# Load base model
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

# DataLoader
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# Loss: CosineSimilarityLoss for regression on a continuous score
train_loss = losses.CosineSimilarityLoss(model)

# Validation evaluator
test_examples = [
    InputExample(texts=["Tension headache", "Stress headache"], label=0.88),
    InputExample(texts=["Gestational diabetes", "Pregnancy diabetes"], label=0.92),
]
evaluator_sentences1 = [e.texts[0] for e in test_examples]
evaluator_sentences2 = [e.texts[1] for e in test_examples]
evaluator_scores = [e.label for e in test_examples]

val_evaluator = evaluation.EmbeddingSimilarityEvaluator(
    evaluator_sentences1, evaluator_sentences2, evaluator_scores
)

# Fine-tuning
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    evaluator=val_evaluator,
    epochs=10,
    evaluation_steps=50,
    warmup_steps=100,
    output_path='./medical-sentence-transformer',
    save_best_model=True
)

print("Fine-tuning complete!")
print("Model saved to './medical-sentence-transformer'")
10. Production-Ready Pipeline
A semantic similarity system in production must handle embedding caching, incremental corpus updates, and quality monitoring.
import faiss
import numpy as np
import json
from sentence_transformers import SentenceTransformer
from pathlib import Path
from typing import List, Dict, Optional

class SemanticSearchEngine:
    """
    Production-ready semantic search engine with:
    - On-disk persistence of embeddings and index
    - Incremental updates
    - Configurable threshold
    """

    def __init__(
        self,
        model_name: str = 'paraphrase-multilingual-MiniLM-L12-v2',
        cache_dir: str = './search_cache',
        similarity_threshold: float = 0.5
    ):
        self.model = SentenceTransformer(model_name)
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(exist_ok=True)
        self.threshold = similarity_threshold
        self.documents: List[Dict] = []
        self.embeddings: Optional[np.ndarray] = None
        self.index: Optional[faiss.Index] = None

    def add_documents(self, documents: List[Dict], text_field: str = 'text'):
        """Add documents to the corpus and rebuild the index."""
        texts = [doc[text_field] for doc in documents]
        new_embeddings = self.model.encode(texts, convert_to_numpy=True, show_progress_bar=True)
        if self.embeddings is None:
            self.embeddings = new_embeddings
        else:
            self.embeddings = np.vstack([self.embeddings, new_embeddings])
        self.documents.extend(documents)
        self._rebuild_index()
        print(f"Corpus: {len(self.documents)} documents")

    def _rebuild_index(self):
        """Rebuild the FAISS index."""
        dim = self.embeddings.shape[1]
        self.index = faiss.IndexFlatIP(dim)
        embs_normalized = self.embeddings.copy()
        faiss.normalize_L2(embs_normalized)
        self.index.add(embs_normalized)

    def search(self, query: str, k: int = 5) -> List[Dict]:
        """Search for the most relevant documents for the query."""
        if self.index is None or len(self.documents) == 0:
            return []
        query_emb = self.model.encode([query], convert_to_numpy=True)
        faiss.normalize_L2(query_emb)
        distances, indices = self.index.search(query_emb, min(k, len(self.documents)))
        results = []
        for dist, idx in zip(distances[0], indices[0]):
            if dist >= self.threshold:
                result = dict(self.documents[idx])
                result['score'] = float(dist)
                results.append(result)
        return results

    def save(self):
        """Persist the search engine to disk."""
        faiss.write_index(self.index, str(self.cache_dir / 'index.faiss'))
        np.save(str(self.cache_dir / 'embeddings.npy'), self.embeddings)
        with open(self.cache_dir / 'documents.json', 'w', encoding='utf-8') as f:
            json.dump(self.documents, f, ensure_ascii=False, indent=2)
        print(f"Engine saved to {self.cache_dir}")

# Usage
engine = SemanticSearchEngine(similarity_threshold=0.6)
docs = [
    {"text": "Setting up a Python virtual environment with virtualenv.", "id": "py001", "category": "python"},
    {"text": "Installing and configuring Docker on Ubuntu.", "id": "docker001", "category": "devops"},
    {"text": "Introduction to neural networks with PyTorch.", "id": "ml001", "category": "ml"},
    {"text": "REST API security best practices.", "id": "api001", "category": "security"},
    {"text": "Optimizing SQL queries with indexes.", "id": "db001", "category": "database"},
]
engine.add_documents(docs)

# Search
results = engine.search("how to create a Python virtual environment")
for r in results:
    print(f"[{r['score']:.3f}] {r['text']}")
11. Common Errors and Anti-Patterns
Anti-Pattern: Using BERT [CLS] Directly
The [CLS] token of BERT is not optimized for semantic similarity.
Using it directly (without fine-tuning on a similarity task) produces results
much worse than SBERT. Always use a dedicated sentence-transformers model.
Anti-Pattern: Comparing Embeddings from Different Models
Embeddings from all-MiniLM-L6-v2 and
paraphrase-multilingual-mpnet-base-v2 live in completely different
vector spaces. You cannot compare embeddings produced by different models.
Always use the same model for all sentences in your corpus.
Anti-Pattern: Forgetting Normalization
When using FAISS with IndexFlatIP for cosine similarity,
you must normalize vectors to unit norm with faiss.normalize_L2()
both during indexing and during search. Forgetting this step produces incorrect
results without any explicit errors.
Best Practices: Checklist
- Use sentence-transformers instead of raw BERT for semantic similarity
- Choose multilingual models for Italian or cross-lingual content
- Always normalize vectors before FAISS IndexFlatIP indexing
- Persist embeddings to disk to avoid re-encoding on every restart
- Bi-encoder + cross-encoder pipeline for scalable retrieval + high quality
- Evaluate on STS-B or a domain-specific dataset before deploying
- Monitor similarity score distributions in production to detect drift
- Set a minimum confidence threshold to filter irrelevant matches
12. Semantic Similarity Benchmarks (MTEB 2024-2025)
The MTEB (Massive Text Embedding Benchmark) is the most comprehensive evaluation suite for embedding models, covering 56 tasks across 112 languages. It provides a single leaderboard to compare models on retrieval, clustering, classification, and semantic similarity tasks simultaneously.
Top Models on MTEB (Semantic Textual Similarity, 2025)
| Model | Params | STS Avg | Retrieval Avg | Multilingual | License |
|---|---|---|---|---|---|
| intfloat/multilingual-e5-large | 560M | 88.3 | 54.7 | Yes (100+ lang) | MIT |
| BAAI/bge-m3 | 570M | 87.6 | 57.2 | Yes (100+ lang) | MIT |
| all-mpnet-base-v2 | 109M | 86.9 | 43.8 | EN only | Apache 2.0 |
| paraphrase-multilingual-mpnet-base-v2 | 278M | 85.3 | 39.2 | Yes (50+ lang) | Apache 2.0 |
| all-MiniLM-L6-v2 | 23M | 83.4 | 41.9 | EN only | Apache 2.0 |
| text-embedding-3-small (OpenAI) | API | 89.1 | 62.3 | Yes | Proprietary |
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from datasets import load_dataset
import numpy as np
# Quick MTEB-style evaluation on STS-B
def evaluate_on_stsb(model_name: str) -> dict:
    """
    Evaluate a sentence-transformer model on the STS-B validation set.
    Returns the evaluator's main correlation score.
    """
    model = SentenceTransformer(model_name)
    stsb = load_dataset("mteb/stsbenchmark-sts", split="validation")
    sentences1 = stsb['sentence1']
    sentences2 = stsb['sentence2']
    scores = [s / 5.0 for s in stsb['score']]  # normalize 0-5 to 0-1
    evaluator = EmbeddingSimilarityEvaluator(
        sentences1=sentences1,
        sentences2=sentences2,
        scores=scores,
        name="stsb-val"
    )
    # Older sentence-transformers versions return a single float
    # (Spearman by default); newer versions return a dict of metrics
    result = model.evaluate(evaluator)
    if isinstance(result, dict):
        result = result.get('stsb-val_spearman_cosine', next(iter(result.values())))
    return {
        "model": model_name,
        "stsb_score": round(float(result), 4),
        "num_pairs": len(sentences1)
    }

# Compare multiple models
models_to_compare = [
    "all-MiniLM-L6-v2",
    "paraphrase-multilingual-MiniLM-L12-v2",
    "paraphrase-multilingual-mpnet-base-v2",
]

print("=== STS-B Validation Comparison ===")
for model_name in models_to_compare:
    try:
        result = evaluate_on_stsb(model_name)
        print(f"  {result['model']:<50s} score: {result['stsb_score']:.4f}")
    except Exception as e:
        print(f"  {model_name}: Error - {e}")
Conclusions and Next Steps
Semantic similarity with sentence embeddings is a fundamental component of many modern NLP applications: semantic search, FAQ matching, deduplication, recommendation, and RAG systems. SBERT and sentence-transformers models have made these capabilities accessible with just a few lines of code, while FAISS enables scaling to millions of documents with millisecond latency.
For Italian, multilingual models like paraphrase-multilingual-mpnet-base-v2
and intfloat/multilingual-e5-large deliver excellent performance
even in cross-lingual contexts.
Key Takeaways
- Use SBERT instead of standard BERT for semantic similarity (Pearson 0.87 vs 0.54)
- FAISS is essential for search on large corpora
- Bi-encoder + cross-encoder pipeline: retrieval speed + reranking quality
- Multilingual models for Italian: paraphrase-multilingual-mpnet-base-v2 or multilingual-e5-large
- Always evaluate on STS-B or a dataset from your domain
- Domain-specific fine-tuning with CosineSimilarityLoss for maximum quality
Continue the Series
- Article 10: NLP Monitoring in Production — drift detection and automated retraining
- Article 8: Local LoRA Fine-tuning — adapting LLMs to your domain on consumer GPUs
- Related series: AI Engineering/RAG — semantic similarity as the core of dense retrieval
- Related series: Advanced Deep Learning — triplet loss, metric learning and contrastive learning