Semantic Similarity and Sentence Embeddings: Comparing Texts
How similar are two sentences? Not in the lexical sense (same words), but in the semantic sense (same meaning). "The dog chases the cat" and "The feline is being pursued by the canine" are semantically almost identical but lexically very different. Answering this question is the challenge of Semantic Similarity.
Applications are everywhere: semantic search engines, recommendation systems, content deduplication, question answering, RAG (Retrieval-Augmented Generation), chatbots, and FAQ matching. In this article we build semantic similarity systems from scratch: from cosine similarity to sentence embeddings with Sentence-BERT, to fast vector search with FAISS.
This is the ninth article in the Modern NLP: from BERT to LLMs series. This topic connects directly with the AI Engineering/RAG series where semantic embeddings are the heart of dense retrieval.
What You Will Learn
- Cosine similarity and dot product: formulas and when to use them
- Why standard BERT fails for semantic similarity and why Sentence-BERT is needed
- Sentence-BERT (SBERT): siamese architecture and training with triplet loss
- sentence-transformers models on HuggingFace: which to choose
- Semantic search on large corpora with FAISS
- Sentence embeddings for Italian and multilingual text
- Benchmarking: STS-B, SICK and evaluation metrics
- Cross-encoder vs bi-encoder: quality/speed trade-offs
- Fine-tuning a sentence transformer on your domain
- Complete implementation of a FAQ matching system
- Production-ready pipeline with caching and optimization
1. The Semantic Similarity Problem
Consider these four groups of sentences and their challenges:
Semantic Similarity Examples
- High similarity: "The bank raised interest rates" / "Interest rates were increased by the financial institution"
- Low similarity: "The bank raised interest rates" / "The cat sleeps on the sofa"
- Misleading (same words, different meaning): "She sat on the river bank" / "He went to the bank to withdraw money"
- Cross-lingual: "The dog runs fast" / "Il cane corre veloce" (same semantics, different languages)
Traditional metrics like Jaccard similarity or BM25 rely on lexical overlap and fail completely with synonyms and paraphrases. Even TF-IDF cannot capture meaning. The solution lies in semantic embeddings: dense vector representations where geometric proximity reflects semantic proximity.
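To make the failure concrete, here is a minimal sketch of Jaccard similarity applied to the paraphrase pair from the introduction: the semantically identical pair scores near zero (only "the" overlaps), while a lexically overlapping but unrelated sentence scores high.

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard similarity over lowercase word sets (pure lexical overlap)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    return len(set_a & set_b) / len(set_a | set_b)

s1 = "The dog chases the cat"
s2 = "The feline is being pursued by the canine"  # same meaning
s3 = "The dog chases the ball"                    # different meaning

print(jaccard_similarity(s1, s2))  # 0.1 - only "the" is shared
print(jaccard_similarity(s1, s3))  # 0.6 - high despite different meaning
```

The exact opposite of what a semantic metric should produce.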
1.1 Cosine Similarity: The Fundamental Metric
Cosine similarity measures the angle between two vectors in embedding space. It ranges from -1 (opposite) to 1 (identical), with 0 for orthogonal vectors. The mathematical formula is:
cos(A, B) = (A · B) / (||A|| · ||B||)
When vectors are normalized to unit norm, cosine similarity equals the dot product, making computation much more efficient on GPU hardware.
import numpy as np
import torch
from torch.nn import functional as F

def cosine_similarity(vec1, vec2):
    """Cosine similarity between two numpy vectors."""
    dot_product = np.dot(vec1, vec2)
    norm1 = np.linalg.norm(vec1)
    norm2 = np.linalg.norm(vec2)
    return dot_product / (norm1 * norm2)

# PyTorch version (efficient for batches)
def cosine_similarity_batch(emb1, emb2):
    """Cosine similarity between batches of embeddings."""
    # Normalize to unit norm
    emb1_norm = F.normalize(emb1, p=2, dim=1)
    emb2_norm = F.normalize(emb2, p=2, dim=1)
    return (emb1_norm * emb2_norm).sum(dim=1)

# Example with simple vectors
vec_a = np.array([1.0, 0.5, 0.3, 0.8])
vec_b = np.array([0.9, 0.4, 0.4, 0.7])   # similar to a
vec_c = np.array([-0.2, 0.8, -0.5, 0.1])  # different from a

print(f"sim(a, b) = {cosine_similarity(vec_a, vec_b):.4f}")  # high
print(f"sim(a, c) = {cosine_similarity(vec_a, vec_c):.4f}")  # low

# Similarity matrix for a sentence corpus
def similarity_matrix(embeddings):
    """N x N similarity matrix for a set of embeddings."""
    # Normalize
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    normalized = embeddings / norms
    # Matrix product for all pairs
    return normalized @ normalized.T
# Output: (N, N) matrix where [i,j] = sim(sentence_i, sentence_j)
1.2 Other Distance Metrics
Comparison of Similarity/Distance Metrics
| Metric | Formula | Range | Use Case |
|---|---|---|---|
| Cosine Similarity | (A · B) / (‖A‖ ‖B‖) | [-1, 1] | Standard semantic similarity |
| Euclidean Distance | ‖A − B‖ | [0, +inf) | Clustering, k-NN |
| Dot Product | A · B | (-inf, +inf) | Equals cosine on unit-norm vectors |
| Manhattan Distance | Σ \|A_i − B_i\| | [0, +inf) | Robustness to outliers |
| Pearson Correlation | cov(A, B) / (σ_A σ_B) | [-1, 1] | Evaluation on STS benchmarks |
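All five metrics from the table can be computed with numpy in a few lines; the vectors below reuse the toy example from section 1.1.

```python
import numpy as np

a = np.array([1.0, 0.5, 0.3, 0.8])
b = np.array([0.9, 0.4, 0.4, 0.7])

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))  # angle between vectors
euclidean = np.linalg.norm(a - b)                          # straight-line distance
dot = a @ b                                                # unnormalized similarity
manhattan = np.abs(a - b).sum()                            # L1 distance
pearson = np.corrcoef(a, b)[0, 1]                          # centered correlation

print(f"cosine={cosine:.4f}, euclidean={euclidean:.4f}, dot={dot:.4f}")
print(f"manhattan={manhattan:.4f}, pearson={pearson:.4f}")
```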
2. Why Standard BERT Fails for Similarity
Intuitively, we might use BERT to extract sentence embeddings and compare them. But research by Reimers & Gurevych (2019) showed that this approach is surprisingly ineffective.
The core problem is that BERT is pre-trained with Masked Language Modeling (MLM) and
Next Sentence Prediction (NSP). The [CLS] token encodes information
useful for classifying sentence pairs (NSP), but it is not optimized to produce
embeddings that reflect semantic similarity when compared via cosine similarity.
Furthermore, mean pooling over all tokens produces an anisotropic embedding space: directions are not uniformly distributed, and clusters of semantically different sentences overlap significantly.
from transformers import BertModel, BertTokenizer
import torch
import numpy as np

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

def bert_mean_pooling(text):
    """Sentence embedding with mean pooling over BERT."""
    inputs = tokenizer(text, return_tensors='pt',
                       truncation=True, max_length=128, padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean pooling (excludes padding)
    mask = inputs['attention_mask'].unsqueeze(-1)
    embeddings = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)
    return embeddings[0].numpy()

# Test: semantically similar vs different sentences
sent1 = "The weather is lovely today."
sent2 = "It's so beautiful today outside."  # similar
sent3 = "My dog bit the mailman."           # different

emb1 = bert_mean_pooling(sent1)
emb2 = bert_mean_pooling(sent2)
emb3 = bert_mean_pooling(sent3)

sim_1_2 = np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2))
sim_1_3 = np.dot(emb1, emb3) / (np.linalg.norm(emb1) * np.linalg.norm(emb3))

print(f"sim(sent1, sent2) = {sim_1_2:.4f}")  # ~0.93 - ok
print(f"sim(sent1, sent3) = {sim_1_3:.4f}")  # ~0.87 - too high!

# Problem: BERT tends to produce similar embeddings for all sentences.
# Neither mean-pooled vectors nor [CLS] (trained on NSP) are optimized
# for semantic similarity. The solution is Sentence-BERT.
BERT Performance on STS-B (Benchmark)
On the STS-B (Semantic Textual Similarity Benchmark) task, BERT with mean pooling achieves only Pearson r = 0.54, well below supervised approaches like SBERT (0.87). Even the [CLS] token alone reaches only 0.20. For semantic similarity tasks, SBERT is the correct choice.
3. Sentence-BERT (SBERT): The Solution
Sentence-BERT (Reimers and Gurevych, EMNLP 2019) solves the problem with a siamese architecture: two weight-sharing BERT instances process two sentences separately, and the loss function forces semantically similar representations to be close in vector space.
3.1 The Siamese Architecture
The key insight is that both "networks" share exactly the same weights. It is not two separate models but the same model called twice. The loss is computed on the pair of outputs:
- Regression objective: MSE between predicted cosine similarity and human score (for STS)
- Classification objective: Cross-entropy on [u, v, |u-v|] (for NLI)
- Triplet loss: margin loss on anchor/positive/negative (for paraphrase mining)
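The triplet objective listed above can be sketched in plain PyTorch. This is a toy illustration, not the library's implementation: the margin value and the use of cosine distance are assumptions (sentence-transformers' `TripletLoss` defaults to Euclidean distance).

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Margin-based triplet loss on (anchor, positive, negative) embeddings:
    pushes d(anchor, positive) at least `margin` below d(anchor, negative)."""
    d_pos = 1 - F.cosine_similarity(anchor, positive)  # cosine distance
    d_neg = 1 - F.cosine_similarity(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()

# Toy batch of 3-dim embeddings
anchor   = torch.tensor([[1.0, 0.0, 0.0]])
positive = torch.tensor([[0.9, 0.1, 0.0]])  # close to the anchor
negative = torch.tensor([[0.0, 1.0, 0.0]])  # far from the anchor

print(triplet_loss(anchor, positive, negative).item())  # 0.0 - already separated
print(triplet_loss(anchor, negative, positive).item())  # > 0 - triplet violated
```

During training, the gradient of this loss pulls positives toward their anchors and pushes negatives away, which is exactly what shapes the embedding space for similarity search.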
from sentence_transformers import SentenceTransformer, util
import torch

# Load a sentence-transformers model
# Multilingual model (includes Italian, Spanish, French, German, Chinese...)
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

# Optimized for English (higher accuracy for English-only)
# model = SentenceTransformer('all-MiniLM-L6-v2')

# Encode sentences (batch-optimized)
sentences = [
    "The weather is lovely today.",
    "It's so beautiful today outside.",
    "He drove to the stadium.",
    "La giornata è bellissima oggi.",           # Italian
    "Il tempo è meraviglioso questa mattina.",  # similar Italian
]

# Encode everything at once (much more efficient than a loop)
embeddings = model.encode(sentences, batch_size=32, show_progress_bar=False)
print(f"Embedding shape: {embeddings.shape}")  # (5, 384)

# Calculate similarities
cos_scores = util.cos_sim(embeddings, embeddings)

print("\nSimilarity matrix (pairs with score > 0.6):")
for i in range(len(sentences)):
    for j in range(i + 1, len(sentences)):
        score = cos_scores[i][j].item()
        if score > 0.6:  # show only similar pairs
            print(f"  {i+1} vs {j+1}: {score:.4f}")
            print(f"    '{sentences[i][:50]}'")
            print(f"    '{sentences[j][:50]}'")

# Pairwise similarity for specific pairs
sim = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"\nsim(EN1, EN2) = {sim:.4f}")  # ~0.85 (similar sentences)

sim_cross = util.cos_sim(embeddings[0], embeddings[3]).item()
print(f"sim(EN1, IT1) = {sim_cross:.4f}")  # ~0.75 (cross-lingual!)
4. sentence-transformers Models: Which to Choose
Main sentence-transformers Models (2024-2025)
| Model | Languages | Dim | Speed | STS-B Pearson |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | EN | 384 | Very fast | 0.834 |
| all-mpnet-base-v2 | EN | 768 | Medium | 0.869 |
| paraphrase-multilingual-MiniLM-L12-v2 | 50+ languages | 384 | Fast | 0.821 |
| paraphrase-multilingual-mpnet-base-v2 | 50+ languages | 768 | Medium | 0.853 |
| intfloat/multilingual-e5-large | 100+ languages | 1024 | Slow | 0.892 |
| text-embedding-3-small (OpenAI) | Multilingual | 1536 | API only | ~0.90 |
4.1 Model Selection: Practical Guide
The choice depends on three main factors: language, speed, and required quality.
from sentence_transformers import SentenceTransformer
import time
import numpy as np

def benchmark_model(model_name, sentences, n_runs=3):
    """Benchmark encoding speed of a sentence-transformer model."""
    model = SentenceTransformer(model_name)
    # Warmup
    model.encode(sentences[:2])
    # Measure speed
    times = []
    for _ in range(n_runs):
        start = time.time()
        embs = model.encode(sentences)
        times.append(time.time() - start)
    avg_time = np.mean(times)
    dim = embs.shape[1]
    print(f"Model: {model_name}")
    print(f"  Embedding dim: {dim}")
    print(f"  Avg encoding time ({len(sentences)} sentences): {avg_time*1000:.1f}ms")
    print(f"  Throughput: {len(sentences)/avg_time:.0f} sentences/sec")

sentences_test = [
    "The sun shines brightly over the city.",
    "It is a beautiful sunny day today.",
    "Rome is the capital city of Italy.",
    "Juventus won the championship last year.",
    "Artificial intelligence is changing the world.",
] * 20  # 100 sentences

# Benchmark multilingual models
for model_name in [
    'paraphrase-multilingual-MiniLM-L12-v2',
    'paraphrase-multilingual-mpnet-base-v2',
    'intfloat/multilingual-e5-small',
]:
    benchmark_model(model_name, sentences_test)
    print()
5. Semantic Search with FAISS
For large corpora (millions of documents), brute-force search (computing similarity with every document) is too slow. FAISS (Facebook AI Similarity Search) enables approximate nearest neighbor search in sub-linear time with different index types.
5.1 FAISS Index Types
FAISS Indexes: Speed/Accuracy Trade-offs
| Index | Type | Use Case | Recall (%) | Speed |
|---|---|---|---|---|
| IndexFlatL2 | Exact | < 100K docs | 100% | Slow |
| IndexFlatIP | Exact (cosine) | < 100K docs | 100% | Slow |
| IndexIVFFlat | Approximate | 100K - 10M | ~95% | Fast |
| IndexHNSW | Approximate | 1M+ | ~99% | Very fast |
| IndexIVFPQ | Compressed | 10M+, limited RAM | ~85% | Very fast |
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer
import time

model = SentenceTransformer('all-MiniLM-L6-v2')

# Example corpus: Wikipedia-style articles
corpus = [
    "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris.",
    "Apple Inc. is an American multinational technology company founded by Steve Jobs.",
    "Python is a high-level, general-purpose programming language.",
    "The Mediterranean diet is based on traditional foods from countries bordering the sea.",
    "Quantum computing uses quantum-mechanical phenomena such as superposition.",
    "The Amazon River is the largest river by discharge volume in the world.",
    "Artificial neural networks are computing systems inspired by biological neural networks.",
    "The Sistine Chapel ceiling was painted by Michelangelo between 1508 and 1512.",
    "Machine learning is a subset of artificial intelligence focused on algorithms.",
    "The Colosseum is an oval amphitheatre in the centre of Rome, Italy.",
]

# Encode the corpus (offline, done once)
print("Encoding corpus...")
start = time.time()
corpus_embeddings = model.encode(corpus, convert_to_numpy=True, show_progress_bar=False)
print(f"Encoded {len(corpus)} docs in {time.time()-start:.2f}s")
print(f"Embeddings shape: {corpus_embeddings.shape}")  # (10, 384)

# Build FAISS index
dim = corpus_embeddings.shape[1]  # 384

# IndexFlatIP: exact, cosine similarity with normalized vectors
index_ip = faiss.IndexFlatIP(dim)
# Normalize for cosine similarity (dot product on unit-norm vectors)
faiss.normalize_L2(corpus_embeddings)
index_ip.add(corpus_embeddings)
print(f"Index size: {index_ip.ntotal} vectors")

# IndexHNSW: approximate but very fast, good for production
# M = number of connections per node (16-64 in production)
index_hnsw = faiss.IndexHNSWFlat(dim, 32, faiss.METRIC_INNER_PRODUCT)
index_hnsw.hnsw.efConstruction = 200  # higher = better recall at build
index_hnsw.hnsw.efSearch = 128        # higher = better recall at search
index_hnsw.add(corpus_embeddings)     # populate it like the flat index

# Semantic search function
def semantic_search(query, index, corpus, model, k=3):
    """Semantic search: returns the k most similar documents to the query."""
    query_emb = model.encode([query], convert_to_numpy=True)
    faiss.normalize_L2(query_emb)
    start = time.time()
    distances, indices = index.search(query_emb, k)
    search_time = (time.time() - start) * 1000
    print(f"\nQuery: '{query}'")
    print(f"Search time: {search_time:.2f}ms")
    for rank, (dist, idx) in enumerate(zip(distances[0], indices[0]), 1):
        print(f"  {rank}. [{dist:.4f}] {corpus[idx][:80]}")
    return [(corpus[i], float(d)) for i, d in zip(indices[0], distances[0])]

# Test queries
semantic_search("ancient Roman architecture", index_ip, corpus, model)
semantic_search("programming language features", index_ip, corpus, model)
semantic_search("painting and art in Italy", index_ip, corpus, model)
5.2 Persistence and Loading the Index
import faiss
import json
import os

def build_and_save_index(corpus, model, index_path="faiss_index.bin",
                         corpus_path="corpus.json"):
    """Build and save a FAISS index to disk."""
    # Encode
    embeddings = model.encode(corpus, convert_to_numpy=True, show_progress_bar=True)
    faiss.normalize_L2(embeddings)
    dim = embeddings.shape[1]
    index = faiss.IndexFlatIP(dim)
    index.add(embeddings)
    # Save FAISS index
    faiss.write_index(index, index_path)
    # Save corpus (to retrieve texts)
    with open(corpus_path, 'w', encoding='utf-8') as f:
        json.dump(corpus, f, ensure_ascii=False, indent=2)
    print(f"Index saved: {index.ntotal} vectors -> {index_path}")
    return index

def load_index(index_path="faiss_index.bin", corpus_path="corpus.json"):
    """Load FAISS index and corpus from disk."""
    if not os.path.exists(index_path):
        raise FileNotFoundError(f"Index not found: {index_path}")
    index = faiss.read_index(index_path)
    with open(corpus_path, 'r', encoding='utf-8') as f:
        corpus = json.load(f)
    print(f"Index loaded: {index.ntotal} vectors")
    return index, corpus

# Usage
# First run: build and save
# index = build_and_save_index(my_corpus, model)
# Subsequent restarts: load directly (much faster)
# index, corpus = load_index()
6. FAQ Matching: Complete Use Case
A practical application of semantic similarity: automatic matching of user questions with existing FAQs. This pattern is the foundation of many chatbots and customer support systems.
from sentence_transformers import SentenceTransformer, util
import torch

class FAQMatcher:
    """Semantic FAQ matching via sentence embeddings."""

    def __init__(self, model_name='paraphrase-multilingual-MiniLM-L12-v2',
                 threshold=0.7):
        self.model = SentenceTransformer(model_name)
        self.threshold = threshold
        self.faqs = []
        self.faq_embeddings = None

    def load_faqs(self, faqs: list):
        """
        faqs: list of dicts with 'question', 'answer', 'category'
        """
        self.faqs = faqs
        questions = [faq['question'] for faq in faqs]
        print(f"Encoding {len(questions)} FAQs...")
        self.faq_embeddings = self.model.encode(
            questions,
            convert_to_tensor=True,
            show_progress_bar=False
        )
        print("FAQs ready for search!")

    def match(self, user_query: str, top_k: int = 3) -> list:
        """Find the FAQs most similar to the user question."""
        if self.faq_embeddings is None:
            raise ValueError("Load FAQs first with load_faqs()")
        query_emb = self.model.encode(user_query, convert_to_tensor=True)
        scores = util.cos_sim(query_emb, self.faq_embeddings)[0]
        top_k_indices = torch.topk(scores, k=min(top_k, len(self.faqs))).indices
        results = []
        for idx in top_k_indices:
            score = scores[idx].item()
            if score >= self.threshold:
                results.append({
                    'question': self.faqs[idx]['question'],
                    'answer': self.faqs[idx]['answer'],
                    'category': self.faqs[idx].get('category', 'N/A'),
                    'score': round(score, 4)
                })
        return results

    def respond(self, user_query: str) -> str:
        """Automatic response to the user question."""
        matches = self.match(user_query, top_k=1)
        if not matches:
            return f"Sorry, no answer found for '{user_query}'. Please contact support."
        best = matches[0]
        return f"[{best['category']}] {best['answer']} (Confidence: {best['score']:.2f})"

# Usage example
faqs_ecommerce = [
    {
        "question": "How can I return a product?",
        "answer": "You can return any product within 30 days of purchase by contacting support.",
        "category": "Returns"
    },
    {
        "question": "How long does shipping take?",
        "answer": "Standard delivery takes 3-5 business days; express shipping takes 24 hours.",
        "category": "Shipping"
    },
    {
        "question": "What payment methods do you accept?",
        "answer": "We accept credit cards, PayPal, bank transfer, and cash on delivery.",
        "category": "Payments"
    },
    {
        "question": "Is the product under warranty?",
        "answer": "All products come with a 2-year statutory consumer warranty.",
        "category": "Warranty"
    },
    {
        "question": "Can I track my order?",
        "answer": "Yes, you will receive an email with a tracking number once shipped.",
        "category": "Orders"
    },
]

matcher = FAQMatcher()
matcher.load_faqs(faqs_ecommerce)

# Test with paraphrased questions
test_queries = [
    "I want to send an item back",
    "When will my package arrive?",
    "Do you accept bank transfers?",
    "I need the tracking code",
    "The item broke, what should I do?",  # not exact, will map to nearest
]

print("\n=== FAQ Matching ===")
for query in test_queries:
    response = matcher.respond(query)
    print(f"\nQuestion: {query}")
    print(f"Response: {response}")
7. Cross-encoder vs Bi-encoder
There are two approaches to semantic similarity offering different quality/speed trade-offs. Understanding them is essential for choosing the right architecture.
Bi-encoder vs Cross-encoder Comparison
| Aspect | Bi-encoder (SBERT) | Cross-encoder |
|---|---|---|
| Architecture | Two separate BERTs, produces embeddings | One BERT processes the pair together |
| Speed | Very fast (pre-computed embeddings) | Slow (processes every pair) |
| Scalability | Millions of documents | Only hundreds of pairs |
| Quality | Good (~0.87 Pearson on STS-B) | Excellent (~0.92 Pearson) |
| Use case | Retrieval, semantic search | Reranking retrieved results |
| Query cost | One encoder pass + n dot products (embeddings precomputed) | n full transformer passes (one per pair) |
from sentence_transformers import SentenceTransformer, CrossEncoder, util

# Bi-encoder for initial retrieval (fast)
bi_encoder = SentenceTransformer('all-MiniLM-L6-v2')
# Cross-encoder for reranking (accurate)
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# Two-stage pipeline (best of both worlds)
def retrieval_and_rerank(query, corpus, corpus_embeddings, top_k=100, final_k=5):
    """
    Stage 1: Bi-encoder retrieval (fast, returns the top_k candidates)
    Stage 2: Cross-encoder reranking (accurate, over those candidates)
    """
    # Stage 1: Bi-encoder retrieval
    query_emb = bi_encoder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, corpus_embeddings, top_k=top_k)[0]
    # Stage 2: Cross-encoder reranking
    cross_inp = [[query, corpus[hit['corpus_id']]] for hit in hits]
    cross_scores = cross_encoder.predict(cross_inp)
    # Combine and reorder
    for hit, score in zip(hits, cross_scores):
        hit['cross_score'] = score
    hits = sorted(hits, key=lambda x: x['cross_score'], reverse=True)[:final_k]
    print(f"\nQuery: '{query}'")
    for rank, hit in enumerate(hits, 1):
        bi_score = hit['score']
        cross_score = hit['cross_score']
        doc = corpus[hit['corpus_id']][:80]
        print(f"  {rank}. [bi={bi_score:.3f}, cross={cross_score:.3f}] {doc}")
    return hits

# Encode the corpus (from section 5) once
corpus_embs = bi_encoder.encode(corpus, convert_to_tensor=True)
retrieval_and_rerank("ancient Roman buildings", corpus, corpus_embs)
8. Evaluation: STS-B and Metrics
Correct evaluation of a semantic similarity system requires standardized benchmark datasets. STS-B is the main reference for English, while multilingual benchmarks like MTEB cover Italian and other languages.
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from datasets import load_dataset
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Load STS-B for evaluation
stsb = load_dataset("mteb/stsbenchmark-sts")
val_data = stsb['validation']

# Prepare data for the evaluator
sentences1 = val_data['sentence1']
sentences2 = val_data['sentence2']
scores = [s / 5.0 for s in val_data['score']]  # normalize 0-5 to 0-1

model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

# Automatic evaluation using the built-in evaluator
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=sentences1,
    sentences2=sentences2,
    scores=scores,
    name="sts-val"
)
# Note: depending on the sentence-transformers version, evaluate() returns
# either a single float (the evaluator's main score, Spearman by default)
# or a dict of metrics
result = model.evaluate(evaluator)
print(f"STS-B validation - evaluator result: {result}")

# Manual evaluation with Pearson and Spearman correlation
emb1 = model.encode(sentences1, show_progress_bar=False)
emb2 = model.encode(sentences2, show_progress_bar=False)

from numpy.linalg import norm
cos_sims = [
    np.dot(e1, e2) / (norm(e1) * norm(e2))
    for e1, e2 in zip(emb1, emb2)
]

pearson_r, _ = pearsonr(cos_sims, scores)
spearman_r, _ = spearmanr(cos_sims, scores)
print(f"Pearson:  {pearson_r:.4f}")
print(f"Spearman: {spearman_r:.4f}")

# Error analysis: find the pairs with the largest prediction error
errors = [(abs(p - t), s1, s2, p, t)
          for p, t, s1, s2 in zip(cos_sims, scores, sentences1, sentences2)]
errors.sort(reverse=True)

print("\n=== Top 3 Errors ===")
for err, s1, s2, pred, true in errors[:3]:
    print(f"  Error: {err:.3f} | Pred: {pred:.3f} | True: {true:.3f}")
    print(f"    '{s1[:60]}'")
    print(f"    '{s2[:60]}'")
9. Fine-tuning a Sentence Transformer on Your Domain
Pre-trained models perform well on general text, but for specific domains (medical, legal, technical) it is worthwhile to fine-tune with annotated sentence pairs.
from sentence_transformers import (
    SentenceTransformer,
    InputExample,
    losses,
    evaluation
)
from torch.utils.data import DataLoader

# Training data: pairs (sentence1, sentence2, score)
# Score: 0.0 = completely different, 1.0 = identical
train_examples = [
    InputExample(texts=["Type 2 diabetes diagnosis", "Patient with chronic hyperglycemia"], label=0.85),
    InputExample(texts=["Antibiotic prescription", "Amoxicillin therapy"], label=0.80),
    InputExample(texts=["Knee surgery", "Meniscus arthroscopy"], label=0.75),
    InputExample(texts=["High blood pressure", "Arterial hypertension"], label=0.95),
    InputExample(texts=["Chest pain", "Heartburn"], label=0.30),
    InputExample(texts=["Femur fracture", "Heart attack"], label=0.05),
]

# Load base model
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

# DataLoader
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# Loss: CosineSimilarityLoss for regression on a continuous score
train_loss = losses.CosineSimilarityLoss(model)

# Validation evaluator
test_examples = [
    InputExample(texts=["Tension headache", "Stress headache"], label=0.88),
    InputExample(texts=["Gestational diabetes", "Pregnancy diabetes"], label=0.92),
]
evaluator_sentences1 = [e.texts[0] for e in test_examples]
evaluator_sentences2 = [e.texts[1] for e in test_examples]
evaluator_scores = [e.label for e in test_examples]

val_evaluator = evaluation.EmbeddingSimilarityEvaluator(
    evaluator_sentences1, evaluator_sentences2, evaluator_scores
)

# Fine-tuning
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    evaluator=val_evaluator,
    epochs=10,
    evaluation_steps=50,
    warmup_steps=100,
    output_path='./medical-sentence-transformer',
    save_best_model=True
)

print("Fine-tuning complete!")
print("Model saved to './medical-sentence-transformer'")
10. Production-Ready Pipeline
A semantic similarity system in production must handle embedding caching, incremental corpus updates, and quality monitoring.
import faiss
import numpy as np
import json
from sentence_transformers import SentenceTransformer
from pathlib import Path
from typing import List, Dict, Optional

class SemanticSearchEngine:
    """
    Production-ready semantic search engine with:
    - On-disk persistence of embeddings and index
    - Incremental updates
    - Configurable threshold
    """

    def __init__(
        self,
        model_name: str = 'paraphrase-multilingual-MiniLM-L12-v2',
        cache_dir: str = './search_cache',
        similarity_threshold: float = 0.5
    ):
        self.model = SentenceTransformer(model_name)
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(exist_ok=True)
        self.threshold = similarity_threshold
        self.documents: List[Dict] = []
        self.embeddings: Optional[np.ndarray] = None
        self.index: Optional[faiss.Index] = None

    def add_documents(self, documents: List[Dict], text_field: str = 'text'):
        """Add documents to the corpus and rebuild the index."""
        texts = [doc[text_field] for doc in documents]
        new_embeddings = self.model.encode(texts, convert_to_numpy=True, show_progress_bar=True)
        if self.embeddings is None:
            self.embeddings = new_embeddings
        else:
            self.embeddings = np.vstack([self.embeddings, new_embeddings])
        self.documents.extend(documents)
        self._rebuild_index()
        print(f"Corpus: {len(self.documents)} documents")

    def _rebuild_index(self):
        """Rebuild the FAISS index."""
        dim = self.embeddings.shape[1]
        self.index = faiss.IndexFlatIP(dim)
        embs_normalized = self.embeddings.copy()
        faiss.normalize_L2(embs_normalized)
        self.index.add(embs_normalized)

    def search(self, query: str, k: int = 5) -> List[Dict]:
        """Search for the most relevant documents for the query."""
        if self.index is None or len(self.documents) == 0:
            return []
        query_emb = self.model.encode([query], convert_to_numpy=True)
        faiss.normalize_L2(query_emb)
        distances, indices = self.index.search(query_emb, min(k, len(self.documents)))
        results = []
        for dist, idx in zip(distances[0], indices[0]):
            if dist >= self.threshold:
                result = dict(self.documents[idx])
                result['score'] = float(dist)
                results.append(result)
        return results

    def save(self):
        """Persist the search engine to disk."""
        faiss.write_index(self.index, str(self.cache_dir / 'index.faiss'))
        np.save(str(self.cache_dir / 'embeddings.npy'), self.embeddings)
        with open(self.cache_dir / 'documents.json', 'w', encoding='utf-8') as f:
            json.dump(self.documents, f, ensure_ascii=False, indent=2)
        print(f"Engine saved to {self.cache_dir}")

# Usage
engine = SemanticSearchEngine(similarity_threshold=0.6)
docs = [
    {"text": "Setting up a Python virtual environment with virtualenv.", "id": "py001", "category": "python"},
    {"text": "Installing and configuring Docker on Ubuntu.", "id": "docker001", "category": "devops"},
    {"text": "Introduction to neural networks with PyTorch.", "id": "ml001", "category": "ml"},
    {"text": "REST API security best practices.", "id": "api001", "category": "security"},
    {"text": "Optimizing SQL queries with indexes.", "id": "db001", "category": "database"},
]
engine.add_documents(docs)

# Search
results = engine.search("how to create a Python virtual environment")
for r in results:
    print(f"[{r['score']:.3f}] {r['text']}")
11. Common Errors and Anti-Patterns
Anti-Pattern: Using BERT [CLS] Directly
The [CLS] token of BERT is not optimized for semantic similarity.
Using it directly (without fine-tuning on a similarity task) produces results
much worse than SBERT. Always use a dedicated sentence-transformers model.
Anti-Pattern: Comparing Embeddings from Different Models
Embeddings from all-MiniLM-L6-v2 and
paraphrase-multilingual-mpnet-base-v2 live in completely different
vector spaces. You cannot compare embeddings produced by different models.
Always use the same model for all sentences in your corpus.
Anti-Pattern: Forgetting Normalization
When using FAISS with IndexFlatIP for cosine similarity,
you must normalize vectors to unit norm with faiss.normalize_L2()
both during indexing and during search. Forgetting this step produces incorrect
results without any explicit errors.
Best Practices: Checklist
- Use sentence-transformers instead of raw BERT for semantic similarity
- Choose multilingual models for Italian or cross-lingual content
- Always normalize vectors before FAISS IndexFlatIP indexing
- Persist embeddings to disk to avoid re-encoding on every restart
- Bi-encoder + cross-encoder pipeline for scalable retrieval + high quality
- Evaluate on STS-B or a domain-specific dataset before deploying
- Monitor similarity score distributions in production to detect drift
- Set a minimum confidence threshold to filter irrelevant matches
12. Semantic Similarity Benchmarks (MTEB 2024-2025)
The MTEB (Massive Text Embedding Benchmark) is the most comprehensive evaluation suite for embedding models, covering 56 tasks across 112 languages. It provides a single leaderboard to compare models on retrieval, clustering, classification, and semantic similarity tasks simultaneously.
Top Models on MTEB (Semantic Textual Similarity, 2025)
| Model | Params | STS Avg | Retrieval Avg | Multilingual | License |
|---|---|---|---|---|---|
| intfloat/multilingual-e5-large | 560M | 88.3 | 54.7 | Yes (100+ lang) | MIT |
| BAAI/bge-m3 | 570M | 87.6 | 57.2 | Yes (100+ lang) | MIT |
| all-mpnet-base-v2 | 109M | 86.9 | 43.8 | EN only | Apache 2.0 |
| paraphrase-multilingual-mpnet-base-v2 | 278M | 85.3 | 39.2 | Yes (50+ lang) | Apache 2.0 |
| all-MiniLM-L6-v2 | 23M | 83.4 | 41.9 | EN only | Apache 2.0 |
| text-embedding-3-small (OpenAI) | API | 89.1 | 62.3 | Yes | Proprietary |
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from datasets import load_dataset
import numpy as np
# Quick MTEB-style evaluation on STS-B
def evaluate_on_stsb(model_name: str) -> dict:
    """
    Evaluate a sentence-transformer model on the STS-B validation set.
    Returns the evaluator's main correlation score.
    """
    model = SentenceTransformer(model_name)
    stsb = load_dataset("mteb/stsbenchmark-sts", split="validation")
    sentences1 = stsb['sentence1']
    sentences2 = stsb['sentence2']
    scores = [s / 5.0 for s in stsb['score']]  # normalize 0-5 to 0-1
    evaluator = EmbeddingSimilarityEvaluator(
        sentences1=sentences1,
        sentences2=sentences2,
        scores=scores,
        name="stsb-val"
    )
    # Older sentence-transformers versions return a single float
    # (Spearman by default); newer versions return a dict of metrics
    result = model.evaluate(evaluator)
    if isinstance(result, dict):
        result = result.get('stsb-val_spearman_cosine', next(iter(result.values())))
    return {
        "model": model_name,
        "stsb_score": round(float(result), 4),
        "num_pairs": len(sentences1)
    }

# Compare multiple models
models_to_compare = [
    "all-MiniLM-L6-v2",
    "paraphrase-multilingual-MiniLM-L12-v2",
    "paraphrase-multilingual-mpnet-base-v2",
]

print("=== STS-B Validation Comparison ===")
for model_name in models_to_compare:
    try:
        result = evaluate_on_stsb(model_name)
        print(f"  {result['model']:<50s} score: {result['stsb_score']:.4f}")
    except Exception as e:
        print(f"  {model_name}: Error - {e}")
Conclusions and Next Steps
Semantic similarity with sentence embeddings is a fundamental component of many modern NLP applications: semantic search, FAQ matching, deduplication, recommendation, and RAG systems. SBERT and sentence-transformers models have made these capabilities accessible with just a few lines of code, while FAISS enables scaling to millions of documents with millisecond latency.
For Italian, multilingual models like paraphrase-multilingual-mpnet-base-v2
and intfloat/multilingual-e5-large deliver excellent performance
even in cross-lingual contexts.
Key Takeaways
- Use SBERT instead of standard BERT for semantic similarity (Pearson 0.87 vs 0.54)
- FAISS is essential for search on large corpora
- Bi-encoder + cross-encoder pipeline: retrieval speed + reranking quality
- Multilingual models for Italian: paraphrase-multilingual-mpnet-base-v2 or multilingual-e5-large
- Always evaluate on STS-B or a dataset from your domain
- Domain-specific fine-tuning with CosineSimilarityLoss for maximum quality
Continue the Series
- Article 10: NLP Monitoring in Production — drift detection and automated retraining
- Article 8: Local LoRA Fine-tuning — adapting LLMs to your domain on consumer GPUs
- Related series: AI Engineering/RAG — semantic similarity as the core of dense retrieval
- Related series: Advanced Deep Learning — triplet loss, metric learning and contrastive learning