Hybrid Retrieval: Combining BM25 and Vector Search for Production RAG
Semantic search with dense embeddings has transformed how we retrieve information in RAG systems, but it has a fundamental limitation that surfaces consistently in production. If a user searches for "GPT-4 hallucination rate benchmark Q3 2024", an embedding model will find documents semantically close to the concept of "language model hallucination", but may completely miss the document containing that exact string. Keyword search, conversely, finds that phrase precisely, but doesn't understand that "LLM factuality issue" is conceptually identical.
Hybrid Retrieval addresses this tension directly. By combining sparse retrieval (BM25 and variants) with dense retrieval (vector search), you get a system that is both precise on exact matches and robust on semantic understanding. Published evaluations on BEIR-style benchmarks report hybrid systems improving retrieval quality by up to 48% over single-method approaches, with the largest gains on technical queries, proper nouns, and domain-specific terminology.
This article is a technical deep dive into hybrid retrieval architecture: from how BM25 works internally, to fusion methods (Reciprocal Rank Fusion, weighted fusion), cross-encoder re-ranking, and practical implementation with Qdrant and evaluation using NDCG/MRR metrics. The goal is to give you the tools to build and optimize retrieval pipelines that work in production, not just in benchmarks.
What You Will Learn
- Why semantic-only search fails and why BM25 remains essential in 2025
- BM25 internals: term frequency saturation, IDF weighting, length normalization
- Hybrid search architecture: sparse and dense in parallel
- Reciprocal Rank Fusion (RRF): formula, implementation, and tuning the k parameter
- Weighted score fusion: normalization and balancing retriever contributions
- Cross-encoder re-ranking: when to use it and how to optimize latency vs accuracy
- Implementation with Qdrant sparse vectors and Query API
- Evaluation metrics: NDCG@k, MRR, Precision@k for hybrid search
- Production pipeline with caching, monitoring, and progressive optimization
Why Semantic-Only Search Falls Short
Dense vector search is powerful for capturing latent meaning in text, but has structural vulnerabilities that become apparent with real production queries. Understanding these limits is the first step in understanding why hybrid retrieval is necessary, not optional, in serious RAG systems.
The core problem is what researchers call vocabulary mismatch: embedding models are trained on general text distributions and do not always capture the relevance of specific technical terms, acronyms, product names, software versions, or identifiers. An embedding model doesn't know that "MSMARCO-v2.1" refers to a specific dataset, or that "CVE-2024-4577" is a critical PHP vulnerability, unless it was fine-tuned on that domain.
Where Semantic Search Fails
- Version-specific queries: "Python 3.12 asyncio.TaskGroup" vs "Python async patterns"
- Unique identifiers: CVE IDs, order numbers, tax codes, ISBN
- Uncommon acronyms: domain terms, company abbreviations, regulatory codes
- Rare proper nouns: people names, small companies, geographic localities
- Very short queries: with 1-2 tokens, embeddings are not discriminative
- Recent technical terminology: models with knowledge cutoffs miss new terms
A second problem is score calibration: dense vector similarity scores (typically cosine similarity in [-1, 1] or unbounded dot product) have no absolute semantics. A document with score 0.85 is not necessarily more relevant than one with 0.82 in different contexts. This makes it hard to compare or combine scores from different systems without appropriate normalization.
Finally, semantic search suffers from semantic drift on ambiguous queries: a query like "Java" in a programming context might retrieve documents about "Java Island" if the document context isn't clear enough for the embedding model, especially with very short or decontextualized text chunks.
BM25 Internals: A Technical Refresh
BM25 (Best Match 25) is a ranking function developed in the 1990s that remains, in 2025, one of the most effective keyword information retrieval algorithms. Understanding its internals is necessary both for using it correctly and for understanding why it complements semantic search so well.
BM25 extends TF-IDF with two key mechanisms: term frequency saturation and length normalization. The full formula for scoring document D against query Q with terms {q1, ..., qn} is:
# BM25 Formula (mathematical pseudocode)
# score(D, Q) = sum over qi in Q of:
# IDF(qi) * (TF(qi, D) * (k1 + 1)) / (TF(qi, D) + k1 * (1 - b + b * |D| / avgdl))
#
# Where:
# IDF(qi) = log((N - df_i + 0.5) / (df_i + 0.5) + 1)
# TF(qi, D) = frequency of term qi in document D
# |D| = document length in terms
# avgdl = average document length across the collection
# N = total number of documents
# df_i = number of documents containing qi
# k1 = TF saturation parameter (default: 1.2-2.0)
# b = length normalization parameter (default: 0.75)
# Python implementation with rank_bm25
# NOTE: requires `pip install rank_bm25 nltk` plus the NLTK "punkt" tokenizer
# data (python -c "import nltk; nltk.download('punkt')")
from rank_bm25 import BM25Okapi
from nltk.tokenize import word_tokenize

class BM25Retriever:
    def __init__(self, corpus: list[str], k1: float = 1.5, b: float = 0.75):
        self.k1 = k1
        self.b = b
        # Tokenize and lowercase
        self.tokenized_corpus = [
            word_tokenize(doc.lower()) for doc in corpus
        ]
        self.bm25 = BM25Okapi(self.tokenized_corpus, k1=k1, b=b)
        self.corpus = corpus

    def retrieve(self, query: str, top_k: int = 20) -> list[dict]:
        tokenized_query = word_tokenize(query.lower())
        scores = self.bm25.get_scores(tokenized_query)
        ranked = sorted(
            enumerate(scores),
            key=lambda x: x[1],
            reverse=True
        )[:top_k]
        return [
            {"doc_id": idx, "text": self.corpus[idx], "score": score}
            for idx, score in ranked
            if score > 0  # Filter zero-match docs
        ]

# Usage
corpus = [
    "BM25 is a ranking function used in information retrieval",
    "Vector search uses dense embeddings for semantic similarity",
    "Hybrid search combines BM25 and vector search for better recall",
    "Python asyncio enables concurrent programming",
]
retriever = BM25Retriever(corpus, k1=1.5, b=0.75)
results = retriever.retrieve("BM25 hybrid search retrieval", top_k=3)
for r in results:
    print(f"Score: {r['score']:.4f} | Text: {r['text'][:60]}...")
The k1 parameter controls term frequency saturation: with a low k1 (0.5), the difference between 1 and 2 occurrences of a term counts almost as much as the difference between 10 and 100 occurrences; with a high k1 (2.0), TF continues to matter even at high frequencies. The b parameter controls how much to penalize long documents: b=0 disables length normalization, b=1 normalizes completely.
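The saturation behavior is easy to see by evaluating just the term-frequency component of the formula. In this standalone sketch the document length is held at the corpus average, so the length-normalization factor drops out and only the k1 effect remains:

```python
# TF component of BM25: (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * |D|/avgdl))
# With |D| == avgdl the length factor equals 1, isolating the saturation curve.
def tf_component(tf: float, k1: float, b: float = 0.75, dl_ratio: float = 1.0) -> float:
    return (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * dl_ratio))

for k1 in (0.5, 2.0):
    curve = {tf: round(tf_component(tf, k1), 3) for tf in (1, 2, 10, 100)}
    print(f"k1={k1}: {curve}")
# With k1=0.5 the curve flattens almost immediately (tf=10 and tf=100 barely differ,
# both near the asymptote k1+1 = 1.5); with k1=2.0 higher frequencies keep
# contributing noticeably longer (asymptote 3.0).
```
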
An often-overlooked aspect of BM25 is that its scores are unbounded above: a highly relevant document with many query term occurrences can score 10, 50, or 100 depending on the corpus. This creates a compatibility problem with dense vector cosine similarity scores in [-1, 1]. Hybrid fusion must handle this discrepancy.
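A tiny experiment with toy numbers makes the scale mismatch tangible: naive addition of raw scores lets BM25 swamp the dense signal entirely.

```python
# Toy scores: BM25 is unbounded above, cosine similarity lives in [-1, 1]
docs = {
    "doc_X": {"bm25": 42.0, "cosine": 0.12},  # strong keyword match, weak semantics
    "doc_Y": {"bm25": 3.5, "cosine": 0.98},   # weak keyword match, near-perfect semantics
}
naive = {d: s["bm25"] + s["cosine"] for d, s in docs.items()}
print(naive)  # doc_X "wins" purely because BM25's scale dominates the sum
```
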
Hybrid Search Architecture
The base architecture of a hybrid retrieval system runs sparse and dense search in parallel on separate indexes, then merges the results before returning them to the user (or the LLM in a RAG context). There are three architectural points where fusion can happen, each with different tradeoffs:
- Early fusion (pre-retrieval): documents are represented with a hybrid vector combining sparse and dense features before indexing. Examples: SPLADE, ColBERT in end-to-end mode. More expensive at indexing time, but more coherent at query time.
- Late fusion (post-retrieval): the two retrievers operate independently on separate indexes and results are merged at the ranking level. This is the most common and flexible approach, allowing independent component updates.
- Re-ranking stage: a separate model (cross-encoder) re-orders the fused results from late fusion. Adds latency but significantly improves precision@k.
# Base hybrid retrieval architecture with late fusion
from typing import Protocol
import asyncio

class Retriever(Protocol):
    async def search(self, query: str, top_k: int) -> list[dict]:
        """Returns list of {'doc_id': str, 'text': str, 'score': float}"""
        ...

class HybridRetriever:
    def __init__(
        self,
        sparse_retriever: Retriever,
        dense_retriever: Retriever,
        fusion_method: str = "rrf",  # "rrf" | "weighted"
        sparse_weight: float = 0.4,
        dense_weight: float = 0.6,
        top_k_per_retriever: int = 50,
    ):
        self.sparse = sparse_retriever
        self.dense = dense_retriever
        self.fusion_method = fusion_method
        self.sparse_weight = sparse_weight
        self.dense_weight = dense_weight
        self.top_k_per_retriever = top_k_per_retriever

    async def search(self, query: str, final_top_k: int = 10) -> list[dict]:
        # Parallel execution of both retrievers
        sparse_results, dense_results = await asyncio.gather(
            self.sparse.search(query, self.top_k_per_retriever),
            self.dense.search(query, self.top_k_per_retriever)
        )
        # _rrf_fusion and _weighted_fusion wrap the fusion functions
        # implemented in the following two sections
        if self.fusion_method == "rrf":
            return self._rrf_fusion(sparse_results, dense_results, final_top_k)
        elif self.fusion_method == "weighted":
            return self._weighted_fusion(sparse_results, dense_results, final_top_k)
        else:
            raise ValueError(f"Unknown fusion method: {self.fusion_method}")
Reciprocal Rank Fusion (RRF)
RRF is the most widely used fusion algorithm in hybrid search due to its simplicity, robustness, and score-scale independence. Originally proposed by Cormack, Clarke and Buettcher in 2009, it assigns each document a score based solely on its position in the ranking list of each retriever, completely ignoring the absolute score value.
The RRF formula for a document D appearing in lists L1, L2, ..., Lm is:
RRF(D) = sum over i=1..m of: 1 / (k + rank_i(D))
Where k is a constant (typically 60) that dampens the impact of top-ranked documents. If D does not appear in list i, its contribution is 0. The k=60 default was determined empirically; typical values range from 10 to 100.
# Full RRF implementation
from collections import defaultdict

def reciprocal_rank_fusion(
    result_lists: list[list[dict]],
    k: int = 60,
    id_field: str = "doc_id"
) -> list[dict]:
    """
    Merges N result lists using Reciprocal Rank Fusion.
    Args:
        result_lists: List of lists, each sorted by relevance descending
        k: Damping constant (default 60, recommended range 10-100)
        id_field: Field used as unique document identifier
    Returns:
        Merged list sorted by RRF score descending
    """
    rrf_scores = defaultdict(float)
    doc_registry = {}  # Map doc_id -> full document
    for result_list in result_lists:
        for rank, doc in enumerate(result_list, start=1):
            doc_id = doc[id_field]
            # RRF formula: 1 / (k + rank)
            rrf_scores[doc_id] += 1.0 / (k + rank)
            if doc_id not in doc_registry:
                doc_registry[doc_id] = doc
    # Sort by RRF score descending
    sorted_docs = sorted(
        rrf_scores.items(),
        key=lambda x: x[1],
        reverse=True
    )
    return [
        {**doc_registry[doc_id], "rrf_score": score}
        for doc_id, score in sorted_docs
    ]

# Usage example
sparse_results = [
    {"doc_id": "doc_A", "text": "BM25 text...", "score": 12.5},
    {"doc_id": "doc_B", "text": "...", "score": 8.3},
    {"doc_id": "doc_C", "text": "...", "score": 5.1},
]
dense_results = [
    {"doc_id": "doc_C", "text": "...", "score": 0.92},  # doc_C first in dense
    {"doc_id": "doc_A", "text": "BM25 text...", "score": 0.88},
    {"doc_id": "doc_D", "text": "...", "score": 0.85},  # Only in dense
]
fused = reciprocal_rank_fusion(
    [sparse_results, dense_results],
    k=60
)
# RRF scores:
# doc_A: 1/(60+1) + 1/(60+2) = 0.01639 + 0.01613 = 0.03252
# doc_C: 1/(60+3) + 1/(60+1) = 0.01587 + 0.01639 = 0.03226
# doc_B: 1/(60+2) = 0.01613
# doc_D: 1/(60+3) = 0.01587
for doc in fused:
    print(f"{doc['doc_id']}: RRF={doc['rrf_score']:.5f}")
The power of RRF lies in its robustness to score outliers: it doesn't matter if BM25 assigns score 100 to the first document and 50 to the second, while cosine similarity gives 0.99 and 0.97. Only the relative position matters. This makes it particularly suitable when the two retrievers have completely different scoring scales.
The k parameter controls how much weight top-ranked documents receive versus lower-ranked ones. With k=60, rank 1 gets 1/61 = 0.0164, rank 60 gets 1/120 = 0.0083: the first is worth less than twice the last. With k=10, rank 1 (1/11 = 0.091) is worth nearly 7x rank 60 (1/70 = 0.014): a more "winner-takes-all" effect. For most use cases, k=60 is an excellent starting point.
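The arithmetic above is quick to verify with a standalone check of the 1/(k + rank) contributions:

```python
def rrf_contribution(rank: int, k: int) -> float:
    """Contribution of a document at a given rank under RRF with constant k."""
    return 1.0 / (k + rank)

for k in (10, 60):
    top, bottom = rrf_contribution(1, k), rrf_contribution(60, k)
    print(f"k={k}: rank 1 -> {top:.4f}, rank 60 -> {bottom:.4f}, ratio {top / bottom:.1f}x")
# k=10: rank 1 -> 0.0909, rank 60 -> 0.0143, ratio 6.4x  (winner-takes-all)
# k=60: rank 1 -> 0.0164, rank 60 -> 0.0083, ratio 2.0x  (flatter)
```
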
Weighted Score Fusion with Normalization
Weighted fusion combines absolute scores instead of ranks, allowing control over how much weight each retriever contributes. The main challenge is score normalization: BM25 and cosine similarity live on completely different scales, so direct combination ("bm25_score * 0.4 + dense_score * 0.6") without normalization is meaningless.
# Weighted fusion with Min-Max and DBSF (z-score based) normalization
import numpy as np
from typing import Optional

def min_max_normalize(scores: list[float]) -> list[float]:
    """Normalizes scores to [0, 1] using Min-Max scaling."""
    if not scores:
        return []
    min_val = min(scores)
    max_val = max(scores)
    if max_val == min_val:
        return [1.0] * len(scores)
    return [(s - min_val) / (max_val - min_val) for s in scores]

def dbsf_normalize(scores: list[float]) -> list[float]:
    """
    Distribution-Based Score Fusion (DBSF) normalization.
    Uses mean and std for normalization more robust to outliers.
    """
    if not scores:
        return []
    mean = np.mean(scores)
    std = np.std(scores)
    if std == 0:
        return [0.5] * len(scores)
    normalized = [(s - mean) / (3 * std) + 0.5 for s in scores]
    return [max(0.0, min(1.0, n)) for n in normalized]

def weighted_fusion(
    sparse_results: list[dict],
    dense_results: list[dict],
    sparse_weight: float = 0.3,
    dense_weight: float = 0.7,
    normalization: str = "minmax",  # "minmax" | "dbsf"
    top_k: Optional[int] = None,
    id_field: str = "doc_id"
) -> list[dict]:
    """
    Combines sparse and dense results with normalized weighted fusion.
    """
    sparse_map = {d[id_field]: d for d in sparse_results}
    dense_map = {d[id_field]: d for d in dense_results}
    all_ids = set(sparse_map.keys()) | set(dense_map.keys())
    normalize_fn = min_max_normalize if normalization == "minmax" else dbsf_normalize
    if sparse_results:
        sparse_scores_norm = dict(zip(
            [d[id_field] for d in sparse_results],
            normalize_fn([d["score"] for d in sparse_results])
        ))
    else:
        sparse_scores_norm = {}
    if dense_results:
        dense_scores_norm = dict(zip(
            [d[id_field] for d in dense_results],
            normalize_fn([d["score"] for d in dense_results])
        ))
    else:
        dense_scores_norm = {}
    fused_docs = []
    for doc_id in all_ids:
        sparse_score = sparse_scores_norm.get(doc_id, 0.0)
        dense_score = dense_scores_norm.get(doc_id, 0.0)
        combined_score = sparse_weight * sparse_score + dense_weight * dense_score
        doc = sparse_map.get(doc_id) or dense_map.get(doc_id)
        fused_docs.append({
            **doc,
            "combined_score": combined_score,
            "sparse_score_norm": sparse_score,
            "dense_score_norm": dense_score,
        })
    fused_docs.sort(key=lambda x: x["combined_score"], reverse=True)
    return fused_docs[:top_k] if top_k else fused_docs

# When to use weighted vs RRF:
# - RRF: when retrievers have very different scales, as a starting point
# - Weighted + DBSF: when you want data-driven sparse/dense balance
# - Weighted + MinMax: simpler, sensitive to score outliers
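To make the normalization step concrete, here is a compact standalone walk-through on the same toy scores used in the RRF section, inlining min-max scaling rather than importing the functions above. Note that with dense_weight=0.7 the ordering differs from RRF's: the dense winner doc_C comes out on top.

```python
sparse = {"doc_A": 12.5, "doc_B": 8.3, "doc_C": 5.1}   # raw BM25 scores
dense = {"doc_C": 0.92, "doc_A": 0.88, "doc_D": 0.85}  # raw cosine similarities

def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale a score map to [0, 1] (assumes at least two distinct values)."""
    lo, hi = min(scores.values()), max(scores.values())
    return {doc_id: (v - lo) / (hi - lo) for doc_id, v in scores.items()}

s_norm, d_norm = min_max(sparse), min_max(dense)
combined = {
    doc_id: 0.3 * s_norm.get(doc_id, 0.0) + 0.7 * d_norm.get(doc_id, 0.0)
    for doc_id in set(s_norm) | set(d_norm)
}
for doc_id, score in sorted(combined.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{doc_id}: {score:.3f}")
# doc_C: 0.3*0.0 + 0.7*1.0   = 0.700
# doc_A: 0.3*1.0 + 0.7*(3/7) = 0.600
# doc_B: 0.3*(3.2/7.4)       = 0.130
# doc_D: 0.0
```
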
Cross-Encoder Re-Ranking
Fusion (RRF or weighted) produces a ranked list of candidates. But both BM25 and dense retrievers use bi-encoders: query and document are encoded separately and similarity is computed post-hoc. This is efficient but misses fine-grained query-document interactions.
Cross-encoders process query and document together through a transformer model, allowing the self-attention mechanism to capture direct interactions between query tokens and document tokens. The result is a significantly more accurate relevance score, but at a computational cost proportional to the number of (query, document) pairs being evaluated.
# Cross-encoder re-ranking with sentence-transformers
from sentence_transformers import CrossEncoder
import time
from typing import Optional
import logging

logger = logging.getLogger(__name__)

class CrossEncoderReranker:
    """
    Cross-encoder based reranker for the precision refinement stage.
    Uses ms-marco model for query-document relevance scoring.
    """
    def __init__(
        self,
        model_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2",
        max_length: int = 512,
        batch_size: int = 32,
        device: Optional[str] = None,
    ):
        self.model = CrossEncoder(
            model_name,
            max_length=max_length,
            device=device
        )
        self.batch_size = batch_size

    def rerank(
        self,
        query: str,
        documents: list[dict],
        text_field: str = "text",
        top_k: Optional[int] = None,
    ) -> list[dict]:
        """
        Re-orders documents using the cross-encoder.
        Args:
            query: Original user query
            documents: Documents to reorder (output of hybrid retriever)
            text_field: Document field containing text
            top_k: Return only top_k most relevant
        Returns:
            Re-ordered documents with "rerank_score" field added
        """
        if not documents:
            return []
        start_time = time.time()
        # Create (query, document_text) pairs for cross-encoder
        query_doc_pairs = [
            (query, doc[text_field]) for doc in documents
        ]
        # Batch inference for efficiency
        scores = self.model.predict(
            query_doc_pairs,
            batch_size=self.batch_size,
            show_progress_bar=False,
        )
        elapsed = time.time() - start_time
        logger.debug(
            f"Cross-encoder scored {len(documents)} docs in {elapsed:.3f}s "
            f"({elapsed/len(documents)*1000:.1f}ms/doc)"
        )
        reranked = [
            {**doc, "rerank_score": float(score)}
            for doc, score in zip(documents, scores)
        ]
        reranked.sort(key=lambda x: x["rerank_score"], reverse=True)
        return reranked[:top_k] if top_k else reranked

# Full pipeline: Hybrid Retrieval + Cross-Encoder Reranking
class RAGRetrievalPipeline:
    def __init__(
        self,
        hybrid_retriever,
        reranker: CrossEncoderReranker,
        retrieval_top_k: int = 50,  # Large pool for reranker input
        final_top_k: int = 5,       # Final top-K for LLM context
    ):
        self.hybrid_retriever = hybrid_retriever
        self.reranker = reranker
        self.retrieval_top_k = retrieval_top_k
        self.final_top_k = final_top_k

    async def retrieve_for_llm(self, query: str) -> list[dict]:
        """
        Full pipeline: hybrid retrieval -> cross-encoder reranking.
        Optimized to maximize precision@5 (the 5 docs passed to the LLM).
        """
        # Step 1: Hybrid retrieval with large top_k for reranker
        candidates = await self.hybrid_retriever.search(
            query, final_top_k=self.retrieval_top_k
        )
        if not candidates:
            return []
        # Step 2: Cross-encoder reranking
        reranked = self.reranker.rerank(
            query=query,
            documents=candidates,
            top_k=self.final_top_k
        )
        return reranked

# Typical performance (T4 GPU):
# - Hybrid retrieval (BM25 + HNSW): ~10-20ms
# - Cross-encoder reranking (20 docs): ~80-120ms
# - Cross-encoder reranking (50 docs): ~200-350ms
# Total pipeline: ~100-370ms depending on reranker top_k
Recommended Cross-Encoder Models (2025)
- cross-encoder/ms-marco-MiniLM-L-6-v2: Best speed/accuracy balance. MRR@10 ≈ 0.39 on the MS MARCO dev set. ~12ms/doc on GPU. Ideal for production.
- cross-encoder/ms-marco-MiniLM-L-12-v2: More accurate, ~2x slower. For high-priority queries.
- BAAI/bge-reranker-v2-m3: Multilingual, great for non-English content. Supports up to 8192 tokens. Recommended for multilingual RAG.
- Cohere Rerank API: Managed solution, ~50ms latency, excellent accuracy. Per-query cost. Great for rapid proof-of-concept.
- Jina Reranker v2: Open-source, 8192 token context, excellent on technical text.
Implementation with Qdrant Sparse + Dense Vectors
Qdrant natively supports hybrid search through its Query API with sparse vectors and the prefetch mechanism. Unlike solutions requiring separate systems for sparse and dense, Qdrant manages both in a single collection, significantly simplifying the architecture.
# Qdrant Hybrid Search: full setup and query
from qdrant_client import QdrantClient
from qdrant_client import models
from fastembed import TextEmbedding, SparseTextEmbedding

client = QdrantClient("localhost", port=6333)

# Dense model: all-MiniLM-L6-v2 (384 dims, fast)
dense_model = TextEmbedding("sentence-transformers/all-MiniLM-L6-v2")
# Sparse model: BM25 via FastEmbed
sparse_model = SparseTextEmbedding("Qdrant/bm25")

COLLECTION_NAME = "hybrid_rag_collection"

def create_hybrid_collection():
    """Create collection with sparse + dense vector support."""
    client.create_collection(
        collection_name=COLLECTION_NAME,
        vectors_config={
            "dense": models.VectorParams(
                size=384,
                distance=models.Distance.COSINE,
                on_disk=False,
            )
        },
        sparse_vectors_config={
            "sparse": models.SparseVectorParams(
                modifier=models.Modifier.IDF,  # BM25-style IDF weighting
            )
        },
    )

def index_documents(documents: list[dict]):
    """Index documents with dense and sparse vectors."""
    texts = [doc["text"] for doc in documents]
    dense_embeddings = list(dense_model.embed(texts))
    sparse_embeddings = list(sparse_model.embed(texts))
    points = []
    for i, doc in enumerate(documents):
        sparse_emb = sparse_embeddings[i]
        points.append(
            models.PointStruct(
                id=i,
                payload={"text": doc["text"], **doc.get("metadata", {})},
                vector={
                    "dense": dense_embeddings[i].tolist(),
                    "sparse": models.SparseVector(
                        indices=sparse_emb.indices.tolist(),
                        values=sparse_emb.values.tolist(),
                    )
                }
            )
        )
    client.upsert(collection_name=COLLECTION_NAME, points=points, wait=True)

def hybrid_search_qdrant(
    query: str,
    top_k: int = 10,
    prefetch_k: int = 50,
    fusion: str = "rrf",
) -> list[dict]:
    """
    Hybrid search with Qdrant Query API.
    Uses prefetch mechanism: retrieves prefetch_k candidates from each
    retriever, then fuses with RRF or DBSF.
    """
    query_dense = list(dense_model.embed([query]))[0].tolist()
    query_sparse_emb = list(sparse_model.embed([query]))[0]
    query_sparse = models.SparseVector(
        indices=query_sparse_emb.indices.tolist(),
        values=query_sparse_emb.values.tolist(),
    )
    fusion_model = (
        models.Fusion.RRF if fusion == "rrf"
        else models.Fusion.DBSF
    )
    results = client.query_points(
        collection_name=COLLECTION_NAME,
        prefetch=[
            models.Prefetch(
                query=query_sparse,
                using="sparse",
                limit=prefetch_k,
            ),
            models.Prefetch(
                query=query_dense,
                using="dense",
                limit=prefetch_k,
            ),
        ],
        query=models.FusionQuery(fusion=fusion_model),
        limit=top_k,
        with_payload=True,
    )
    return [
        {
            "doc_id": str(point.id),
            "text": point.payload.get("text", ""),
            "score": point.score,
            "payload": point.payload
        }
        for point in results.points
    ]
Evaluation: NDCG, MRR and Precision@k
Building a hybrid retrieval system without an evaluation framework is building blind. Before tuning any parameter (RRF k, sparse/dense weights, reranker threshold), you need a test dataset with ground truth and clearly defined metrics. The three most important retrieval metrics are NDCG, MRR, and Precision@k.
# Evaluation framework for hybrid retrieval
import asyncio
import numpy as np
from typing import Optional

def ndcg_at_k(
    retrieved_ids: list[str],
    relevant_ids: list[str],
    k: int,
    relevance_grades: Optional[dict] = None
) -> float:
    """
    Normalized Discounted Cumulative Gain @k.
    Measures ranking quality considering position of relevant documents.
    Values in [0, 1], 1 = perfect ranking.
    """
    relevant_set = set(relevant_ids)
    top_k = retrieved_ids[:k]
    dcg = 0.0
    for i, doc_id in enumerate(top_k):
        if relevance_grades:
            grade = relevance_grades.get(doc_id, 0)
        else:
            grade = 1.0 if doc_id in relevant_set else 0.0
        dcg += grade / np.log2(i + 2)
    if relevance_grades:
        ideal_grades = sorted(
            [relevance_grades.get(rid, 0) for rid in relevant_ids],
            reverse=True
        )[:k]
    else:
        ideal_grades = [1.0] * min(len(relevant_ids), k)
    idcg = sum(
        grade / np.log2(i + 2)
        for i, grade in enumerate(ideal_grades)
    )
    return dcg / idcg if idcg > 0 else 0.0

def mrr(
    retrieved_results: list[list[str]],
    relevant_results: list[list[str]]
) -> float:
    """
    Mean Reciprocal Rank.
    Average of the reciprocal of the first relevant result rank.
    """
    reciprocal_ranks = []
    for retrieved, relevant in zip(retrieved_results, relevant_results):
        relevant_set = set(relevant)
        rr = 0.0
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant_set:
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)
    return float(np.mean(reciprocal_ranks))

def evaluate_retriever(
    retriever_fn,
    test_queries: list[dict],
    k_values: list[int] = [1, 3, 5, 10]
) -> dict:
    """Evaluates a retriever on a test set with multiple metrics."""
    all_retrieved = []
    all_relevant = []
    ndcg_scores = {k: [] for k in k_values}
    for item in test_queries:
        query = item["query"]
        relevant_ids = item["relevant_ids"]
        results = retriever_fn(query, top_k=max(k_values))
        retrieved_ids = [r["doc_id"] for r in results]
        all_retrieved.append(retrieved_ids)
        all_relevant.append(relevant_ids)
        for k in k_values:
            ndcg = ndcg_at_k(retrieved_ids, relevant_ids, k)
            ndcg_scores[k].append(ndcg)
    metrics = {"MRR": mrr(all_retrieved, all_relevant)}
    for k in k_values:
        metrics[f"NDCG@{k}"] = np.mean(ndcg_scores[k])
    return metrics

# Ablation study: compare methods
def run_ablation_study(test_queries, sparse_retriever, dense_retriever, hybrid_retriever):
    print("=== Ablation Study: Retrieval Methods ===\n")
    # HybridRetriever.search is async and takes final_top_k; wrap it so all
    # three retrievers expose the same synchronous (query, top_k) interface
    # (wrap an async dense retriever the same way if needed)
    hybrid_sync = lambda query, top_k: asyncio.run(
        hybrid_retriever.search(query, final_top_k=top_k)
    )
    for name, retriever in [
        ("BM25 only", sparse_retriever.retrieve),
        ("Dense only", dense_retriever.search),
        ("Hybrid RRF", hybrid_sync),
    ]:
        metrics = evaluate_retriever(retriever, test_queries)
        print(f"{name}:")
        for metric, value in metrics.items():
            print(f"  {metric}: {value:.4f}")
        print()
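A hand-worked sanity check helps validate any metric implementation. This standalone example uses binary relevance and confirms the NDCG@3 value by computing DCG and IDCG directly:

```python
import math

retrieved = ["doc_A", "doc_B", "doc_C"]  # doc_B is irrelevant
relevant = {"doc_A", "doc_C"}

# DCG@3: gains discounted by log2(position + 1)
dcg = sum(1.0 / math.log2(i + 2) for i, d in enumerate(retrieved) if d in relevant)
# Ideal ordering puts both relevant docs at positions 1 and 2
idcg = 1.0 / math.log2(2) + 1.0 / math.log2(3)
ndcg = dcg / idcg
rr = 1.0  # first relevant doc sits at rank 1

print(f"DCG={dcg:.4f} IDCG={idcg:.4f} NDCG@3={ndcg:.4f} RR={rr}")
# DCG  = 1/log2(2) + 1/log2(4) = 1.0 + 0.5 = 1.5
# IDCG = 1.0 + 0.6309 = 1.6309  ->  NDCG@3 ≈ 0.9197
```
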
Best Practices and Anti-Patterns
Building an effective hybrid retrieval system requires avoiding common pitfalls that emerge in production and are not obvious from tutorials.
Hybrid Retrieval Best Practices
- Start with RRF k=60: This is the empirically most robust default. Experiment with other values only after establishing a baseline with NDCG metrics.
- top_k per retriever >= 3x final_top_k: If you want the top 5 final results, retrieve at least 15 from each retriever to give the fusion enough material.
- Consistent tokenization: BM25 and the embedding model should use the same preprocessing pipeline (lowercase, stopwords, stemming) for coherence.
- Cross-encoder on max 20-50 docs: Beyond 50 candidates, precision gains are marginal versus the added latency cost.
- Evaluate sparse and dense separately first: Before integration, measure each component's metrics. If your dense retriever is already at 90% NDCG@5, hybrid may not add value for your specific dataset.
- Cache with normalized query: Lowercase and trim the query before hashing to maximize cache hit rate.
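The cache-key point can be sketched in a few lines (a hypothetical helper; lowercasing and whitespace collapse are the only normalizations assumed here, since stemming or stopword removal would merge queries with different semantics):

```python
import hashlib
import re

def cache_key(query: str) -> str:
    """Normalize before hashing so trivially different queries share a cache entry."""
    normalized = re.sub(r"\s+", " ", query.strip().lower())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

print(cache_key("  Hybrid   Search ") == cache_key("hybrid search"))  # True
```
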
Anti-Patterns to Avoid
- Combining non-normalized scores: "BM25_score + cosine_score" without normalization produces results dominated by the retriever with the larger scale (almost always BM25).
- Running the reranker on all retrieval results: Reranking 200 docs adds 2-3 seconds of latency. Always limit the reranker to 20-50 candidates.
- Ignoring chunk quality: Hybrid retrieval doesn't fix poorly formed chunks (too short, cut mid-concept). Indexing quality is the fundamental prerequisite.
- Tuning without a test set: Changing sparse/dense weights or RRF k without measuring on a test dataset leads to overfitting on subjective impressions.
- Missing fallback handling: If the BM25 index goes offline, the system must degrade gracefully to dense-only, not crash entirely.
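Graceful degradation can be as simple as catching the sparse side's failure and returning dense-only results. This sketch uses dummy retrievers (all names are illustrative); a real implementation would plug in the actual retrievers and run fusion when both sides succeed:

```python
import asyncio

async def failing_sparse_search(query: str, top_k: int) -> list[dict]:
    # Simulates the BM25 index being offline
    raise ConnectionError("BM25 index unreachable")

async def dense_search(query: str, top_k: int) -> list[dict]:
    return [{"doc_id": "doc_A", "text": "...", "score": 0.9}][:top_k]

async def hybrid_with_fallback(query: str, top_k: int = 10) -> list[dict]:
    sparse_task = asyncio.create_task(failing_sparse_search(query, top_k))
    dense_task = asyncio.create_task(dense_search(query, top_k))
    dense_results = await dense_task
    try:
        sparse_results = await sparse_task
    except Exception:
        # Degrade to dense-only instead of failing the whole request;
        # a production system would also emit a metric/alert here
        return dense_results
    return dense_results + sparse_results  # fusion would happen here instead

results = asyncio.run(hybrid_with_fallback("hybrid search"))
print(results)  # dense-only results, because the sparse side failed
```
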
When Hybrid Retrieval Is Not Enough
Hybrid retrieval solves many problems, but not all. If retrieval metrics are still insufficient after implementation, consider these advanced approaches:
- HyDE (Hypothetical Document Embeddings): The LLM generates a hypothetical answer to the query, which is then used as the query for the retriever. Improves semantic recall on abstract or poorly-formed queries.
- Query expansion: Generate query variants (synonyms, rephrasings) with the LLM and retrieve on all of them, then merge results with RRF.
- SPLADE: Learned sparse model that produces "intelligent" sparse vectors instead of raw term frequency. More accurate than BM25 but requires ML inference.
- ColBERT/ColPali: Late interaction model that compares each query token with each document token. Superior accuracy with retrieval-stage (not reranking) latency.
- GraphRAG: Augments vector retrieval with a knowledge graph capturing structured relationships between entities. Ideal for questions requiring multi-hop reasoning.
Conclusions
Hybrid Retrieval is today the standard strategy for production RAG systems that must handle heterogeneous queries: from exact technical terms to vague conceptual questions. The BM25 + dense combination with RRF provides an already very robust baseline, which cross-encoder re-ranking brings to precision levels difficult to reach with single-method approaches.
The key to implementing it successfully is the order of operations: first build a test set with real ground truth from your domain, establish separate baselines for BM25 and dense, then experiment with fusion and measure the delta. Only with concrete metrics (NDCG@5, MRR) can you determine whether adding the reranker is worth the extra 200ms of latency for your use case.
Next Steps
- Continue with LangChain RAG Pipeline: from Document to Answer to integrate this retriever into a full pipeline with an LLM.
- Read RAG in Production: Monitoring, Evaluation, Optimization for a comprehensive evaluation and monitoring framework.
- Explore Embeddings and Semantic Search: Choosing the Right Model to deepen your understanding of optimal dense model selection for your domain.
- Consider pgvector and PostgreSQL AI if you want to implement hybrid search directly in your existing PostgreSQL database.
References
- Qdrant Hybrid Search Documentation - Query API and Sparse Vectors
- Cormack, Clarke, Buettcher (2009) - "Reciprocal Rank Fusion outperforms Condorcet and individual Rank Learning Methods"
- BEIR Benchmark - Heterogeneous Retrieval Benchmark
- sentence-transformers cross-encoder documentation
- MTEB (Massive Text Embedding Benchmark) - 2025 Leaderboard