RAG Architecture: Naive, Advanced and Modular RAG Patterns
The term "RAG" covers a broad spectrum of architectures, from the simple three-step pattern of 2023 to the modular systems of 2026 that integrate query routing, reranking, self-RAG and consistency checks. Understanding this evolution is fundamental: Naive RAG is quick to implement but produces low-quality retrievals on complex documents; Advanced RAG solves specific retrieval problems; Modular RAG offers maximum flexibility for production systems.
This guide covers the three architectures with real Python code, comparative quality metrics and criteria for choosing the right level of complexity for your use case.
What You Will Learn
- Naive RAG: basic architecture, limits and when it is sufficient
- Advanced RAG: pre-retrieval (query rewriting, HyDE), post-retrieval (reranking)
- Modular RAG: Routing, self-RAG, CRAG and composable pipelines
- RAGAS metrics to compare architectures objectively
- Complete Python code for each architecture
- Decision guide: when to advance to the next level
Naive RAG: The Basic Pattern
Naive RAG follows the index-retrieve-generate flow without optimizations:
- Index the documents as fixed-size chunks (typically 512-1024 tokens)
- Convert the query to an embedding and search for the k most similar chunks
- Concatenate the chunks into the prompt and generate the response
# Naive RAG with LangChain: complete implementation
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Qdrant
from langchain.chains import RetrievalQA
from langchain_core.prompts import PromptTemplate
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import DirectoryLoader, UnstructuredMarkdownLoader

# --- PHASE 1: Indexing ---
loader = DirectoryLoader(
    "./docs",
    glob="**/*.md",
    loader_cls=UnstructuredMarkdownLoader
)
documents = loader.load()

# Fixed chunking: the main limitation of Naive RAG.
# Note: chunk_size and chunk_overlap are measured in characters by default, not tokens.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    separators=["\n\n", "\n", ".", " "]
)
chunks = splitter.split_documents(documents)

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Qdrant.from_documents(
    chunks, embeddings,
    url="http://localhost:6333",
    collection_name="naive_rag"
)

# --- PHASES 2 + 3: Retrieval + Generation ---
NAIVE_RAG_PROMPT = PromptTemplate(
    input_variables=["context", "question"],
    template="""Answer the question based ONLY on the context provided.
If the context does not contain the answer, say "I have no information on this topic".

Context:
{context}

Question: {question}
Answer:"""
)

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# "stuff" concatenates all retrieved chunks into a single prompt
rag_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    chain_type_kwargs={"prompt": NAIVE_RAG_PROMPT},
    return_source_documents=True
)

result = rag_chain.invoke({"query": "How do I handle timeout errors?"})
print(result["result"])
Limits of Naive RAG: poor performance on ambiguous queries, retrieved chunks that are only partially relevant, no handling of cases where the retrieved documents contradict each other, and variable quality on structured documents (tables, code, lists).
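One way to see where partially relevant chunks come from is to look at what a fixed window does to sentence boundaries. The sketch below is a plain sliding-window chunker written for illustration only: `fixed_chunks` and the sample text are hypothetical, sizes are in characters, and it is deliberately cruder than `RecursiveCharacterTextSplitter`, which tries the configured separators before cutting.

```python
def fixed_chunks(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into fixed-size character windows sharing `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# Hypothetical sample text: two sentences about the same topic
text = "Timeouts are retried three times. After the third failure the request is dropped."
chunks = fixed_chunks(text, chunk_size=40, overlap=8)

for i, chunk in enumerate(chunks):
    print(i, repr(chunk))
```

The window closes in the middle of the second sentence, so the first chunk carries a complete statement plus a fragment, and the answer to "what happens after the third failure?" is spread over two chunks. Overlap softens this effect but does not remove it.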
Advanced RAG: Pre- and Post-Retrieval Optimizations
Advanced RAG adds optimizations in the pre- and post-retrieval phases. The most impactful techniques:
Pre-retrieval: Query Rewriting and HyDE
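User queries are often ambiguous or poorly worded. Query rewriting uses the LLM to reformulate the query in forms more suitable for semantic search. HyDE (Hypothetical Document Embeddings) takes a different route: the LLM writes a hypothetical passage that would answer the question, and that passage is used as the search query instead of the original question.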
# Advanced RAG: Query Rewriting + HyDE (Hypothetical Document Embeddings)
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# 1. Multi-query: generate alternative queries for broader coverage
MULTI_QUERY_PROMPT = ChatPromptTemplate.from_messages([
    ("system", """You are an information retrieval expert.
Generate 3 variants of the given query to retrieve relevant documents
from different angles. Return only the queries, one per line."""),
    ("human", "Original query: {query}")
])
multi_query_chain = MULTI_QUERY_PROMPT | llm | StrOutputParser()

def generate_multiple_queries(query: str) -> list[str]:
    result = multi_query_chain.invoke({"query": query})
    queries = [q.strip() for q in result.strip().split('\n') if q.strip()]
    return [query] + queries[:3]  # original query + up to 3 variants

# 2. HyDE: generate a hypothetical document that would contain the answer
HYDE_PROMPT = ChatPromptTemplate.from_messages([
    ("system", """Write a short technical paragraph that would answer
the following question, as if it were taken from official documentation.
Use precise technical terminology."""),
    ("human", "{query}")
])
hyde_chain = HYDE_PROMPT | llm | StrOutputParser()

def hyde_search(query: str, vectorstore, k: int = 5):
    # Generate the hypothetical document
    hypothetical_doc = hyde_chain.invoke({"query": query})
    # Search with the hypothetical document as the query (instead of the raw query)
    results = vectorstore.similarity_search(hypothetical_doc, k=k)
    return results

# 3. Multi-query retrieval with deduplication
def advanced_retrieve(query: str, vectorstore, k: int = 5) -> list:
    queries = generate_multiple_queries(query)

    # Collect results from all queries
    all_docs = []
    for q in queries:
        docs = vectorstore.similarity_search(q, k=k)
        all_docs.extend(docs)

    # Deduplicate: two chunks count as duplicates when their first 200 characters match exactly
    seen_content = set()
    unique_docs = []
    for doc in all_docs:
        content_hash = hash(doc.page_content[:200])
        if content_hash not in seen_content:
            seen_content.add(content_hash)
            unique_docs.append(doc)

    return unique_docs[:k * 2]  # return up to twice k so the reranker has candidates to choose from
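`advanced_retrieve` concatenates the per-query result lists and truncates, so documents found by the first queries can crowd out those found by later ones. One common alternative for merging several ranked lists is Reciprocal Rank Fusion (RRF). This is a swapped-in technique, not part of the pipeline above: the sketch assumes each document has a stable identifier (the `doc_*` strings are hypothetical; a chunk ID or a hash of `page_content` would play that role) and uses the conventional constant `k=60`.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    ranked_lists: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Merge ranked lists of document IDs; each appearance adds 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties keep first-seen order (sorted() is stable)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical rankings, one per query variant
rankings = [
    ["doc_a", "doc_b", "doc_c"],
    ["doc_b", "doc_a", "doc_d"],
    ["doc_b", "doc_e", "doc_a"],
]
fused = reciprocal_rank_fusion(rankings)

for doc_id, score in fused:
    print(f"{doc_id}: {score:.4f}")
```

Documents that rank well across several query variants rise to the top, and each document appears once, so deduplication comes for free. The fused list can then be cut to the candidate count the reranker expects.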
Post-retrieval: Reranking with Cross-Encoder
Vector embeddings use a "bi-encoder" representation (query and document are encoded separately): it is fast but less precise. Cross-encoder reranking (query and document scored together) improves precision by 15-25% at the cost of additional latency (typically 50-150ms).
# Post-retrieval: reranking with Cohere Rerank or a local cross-encoder
import cohere
from sentence_transformers import CrossEncoder

# Option 1: Cohere Rerank API (managed, accurate)
co = cohere.Client("your-api-key")

def rerank_with_cohere(query: str, documents: list[str], top_n: int = 5) -> list[dict]:
    response = co.rerank(
        query=query,
        documents=documents,
        top_n=top_n,
        model="rerank-v3.5"
    )
    return [
        {"content": documents[r.index], "relevance_score": r.relevance_score}
        for r in response.results
    ]

# Option 2: local cross-encoder (free, ~100MB)
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank_local(query: str, documents: list[str], top_n: int = 5) -> list[dict]:
    # Build (query, document) pairs for the cross-encoder
    pairs = [[query, doc] for doc in documents]
    scores = cross_encoder.predict(pairs)
    # Sort by descending score
    ranked = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
    return [{"content": doc, "relevance_score": float(score)} for doc, score in ranked[:top_n]]

# Full Advanced RAG: multi-query retrieval + reranking
# (uses llm and advanced_retrieve from the previous listing; hyde_search can be
# swapped in for the retrieval step)
def advanced_rag(query: str, vectorstore) -> dict:
    # 1. Expanded retrieval
    candidates = advanced_retrieve(query, vectorstore, k=8)
    candidate_texts = [doc.page_content for doc in candidates]

    # 2. Reranking
    reranked = rerank_local(query, candidate_texts, top_n=5)

    # 3. Generation with higher-quality context
    context = "\n\n---\n\n".join([r["content"] for r in reranked])
    response = llm.invoke(f"""Context:\n{context}\n\nQuestion: {query}\nAnswer:""")
    return {"answer": response.content, "sources": reranked}
Modular RAG: Modular Architecture
Modular RAG, in its 2026 form, treats each stage of the pipeline as an interchangeable module. The most important patterns:
CRAG: Corrective RAG
CRAG adds a relevance classifier: if the retrieved documents have a low score, the system performs a backup web search instead of generating with irrelevant context.
# Modular RAG: CRAG (Corrective RAG) with LangGraph
# (uses vectorstore and llm from the previous listings)
from langgraph.graph import StateGraph, END
from typing import TypedDict
from langchain_community.tools.tavily_search import TavilySearchResults

class RAGState(TypedDict):
    query: str
    documents: list
    relevance_scores: list[float]
    web_results: list
    answer: str
    retrieval_quality: str  # "high" | "low" | "ambiguous"

def retrieve(state: RAGState) -> RAGState:
    """Retrieve from the vector store"""
    docs = vectorstore.similarity_search_with_score(state["query"], k=5)
    documents = [doc for doc, _ in docs]
    scores = [float(score) for _, score in docs]
    return {**state, "documents": documents, "relevance_scores": scores}

def assess_relevance(state: RAGState) -> RAGState:
    """Assess whether the documents are relevant enough"""
    # Assumes scores are similarities (higher is better); some vector stores return
    # a distance instead, and the thresholds need calibrating per embedding model.
    scores = state["relevance_scores"]
    # No documents retrieved: treat as low quality instead of dividing by zero
    avg_score = sum(scores) / len(scores) if scores else 0.0
    if avg_score > 0.85:
        quality = "high"
    elif avg_score > 0.70:
        quality = "ambiguous"
    else:
        quality = "low"
    return {**state, "retrieval_quality": quality}

def web_search_fallback(state: RAGState) -> RAGState:
    """Fallback: web search when retrieval quality is poor"""
    search_tool = TavilySearchResults(max_results=3)
    results = search_tool.invoke(state["query"])
    return {**state, "web_results": results}

def generate_answer(state: RAGState) -> RAGState:
    """Generate the answer from whichever documents are available"""
    # .get() avoids a KeyError when the web search node was never visited
    if state["retrieval_quality"] == "low" and state.get("web_results"):
        context = "\n".join([r["content"] for r in state["web_results"]])
        source_type = "web search"
    else:
        context = "\n".join([doc.page_content for doc in state["documents"]])
        source_type = "knowledge base"
    response = llm.invoke(
        f"Context ({source_type}):\n{context}\n\nQuestion: {state['query']}\nAnswer:"
    )
    return {**state, "answer": response.content}

# Routing based on retrieval quality
def should_web_search(state: RAGState) -> str:
    return "web_search" if state["retrieval_quality"] == "low" else "generate"

# Build the graph
graph = StateGraph(RAGState)
graph.add_node("retrieve", retrieve)
graph.add_node("assess_relevance", assess_relevance)
graph.add_node("web_search", web_search_fallback)
graph.add_node("generate", generate_answer)

graph.set_entry_point("retrieve")
graph.add_edge("retrieve", "assess_relevance")
graph.add_conditional_edges(
    "assess_relevance",
    should_web_search,
    {"web_search": "web_search", "generate": "generate"}
)
graph.add_edge("web_search", "generate")
graph.add_edge("generate", END)

crag = graph.compile()

# Run
result = crag.invoke({"query": "What is the latest version of Qiskit?"})
print(result["answer"])
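The control flow of the graph above can be exercised without LangGraph, a vector store or network access by passing the three stages in as plain functions. The following is a minimal sketch under that assumption: `grade`, `run_crag` and the `fake_*` stubs are hypothetical names, the thresholds are copied from the listing, and scores are assumed to be similarities where higher is better.

```python
from typing import Callable


def grade(scores: list[float], high: float = 0.85, low: float = 0.70) -> str:
    """Map the average retrieval score to "high", "ambiguous" or "low"."""
    if not scores:
        return "low"  # nothing retrieved counts as low quality
    avg = sum(scores) / len(scores)
    if avg > high:
        return "high"
    if avg > low:
        return "ambiguous"
    return "low"


def run_crag(
    query: str,
    retrieve: Callable[[str], list[tuple[str, float]]],
    web_search: Callable[[str], list[str]],
    generate: Callable[[str, list[str]], str],
) -> dict:
    """Retrieve, grade, fall back to web search only when quality is low, then generate."""
    docs_with_scores = retrieve(query)
    quality = grade([score for _, score in docs_with_scores])
    context = [doc for doc, _ in docs_with_scores]
    source = "knowledge base"
    if quality == "low":
        web_results = web_search(query)
        if web_results:  # keep the knowledge-base context if the search returns nothing
            context, source = web_results, "web search"
    return {"quality": quality, "source": source, "answer": generate(query, context)}


# Stubs standing in for the vector store, the search tool and the LLM
def fake_retrieve(query: str) -> list[tuple[str, float]]:
    if "timeout" in query:
        return [("Retry timeouts three times.", 0.91), ("Use exponential backoff.", 0.88)]
    return [("Unrelated text.", 0.40)]


def fake_web_search(query: str) -> list[str]:
    return [f"web result for: {query}"]


def fake_generate(query: str, context: list[str]) -> str:
    return f"answer built from {len(context)} passage(s)"


in_kb = run_crag("timeout errors", fake_retrieve, fake_web_search, fake_generate)
out_of_kb = run_crag("latest Qiskit version", fake_retrieve, fake_web_search, fake_generate)
print(in_kb)
print(out_of_kb)
```

Keeping the grading step as a pure function makes it straightforward to unit-test the thresholds before wiring them into the graph.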
Quality Comparison: Naive vs Advanced vs Modular
Benchmark on an enterprise test dataset (500 questions, 50K-document knowledge base)
Metric               | Naive RAG | Advanced RAG | Modular RAG (CRAG)
---------------------|-----------|--------------|--------------------
Faithfulness         | 0.71      | 0.88         | 0.92
Answer Relevancy     | 0.74      | 0.86         | 0.89
Context Recall       | 0.65      | 0.81         | 0.84
Context Precision    | 0.72      | 0.87         | 0.88
---------------------|-----------|--------------|--------------------
Latency p50          | 850ms     | 1.4s         | 1.8s (with web fallback: 3.2s)
Cost per query       | $0.003    | $0.007       | $0.009 (avg)
---------------------|-----------|--------------|--------------------
"Hallucination rate" | 18%       | 6%           | 4%
Unanswered questions | 12%       | 8%           | 3% (web fallback)
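The trade-off in the table can be made explicit with a little arithmetic. The sketch below uses only the cost and faithfulness figures reported above and computes, for each upgrade step, the cost multiplier and the faithfulness gain; the variable and function names are illustrative.

```python
# Figures copied from the benchmark table above
cost_per_query = {"naive": 0.003, "advanced": 0.007, "modular": 0.009}
faithfulness = {"naive": 0.71, "advanced": 0.88, "modular": 0.92}


def ratio(metric: dict[str, float], source: str, target: str) -> float:
    """How many times larger the metric is at `target` than at `source`."""
    return metric[target] / metric[source]


steps = [("naive", "advanced"), ("advanced", "modular")]
for source, target in steps:
    cost_multiplier = ratio(cost_per_query, source, target)
    gain = faithfulness[target] - faithfulness[source]
    print(f"{source} -> {target}: cost x{cost_multiplier:.2f}, faithfulness +{gain:.2f}")
```

The first step buys a much larger faithfulness gain than the second, which is why measuring before each upgrade matters.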
When to Advance to the Next Level
- Naive -> Advanced: if faithfulness < 0.80 or users frequently report irrelevant responses; additional cost ~2x
- Advanced -> Modular: if your knowledge base covers only a subset of the topics requested, or if the queries range across heterogeneous topics; additional cost ~1.3x
- Keep Naive: if your knowledge base is well structured, the queries are homogeneous and faithfulness is already > 0.85 with the basic pattern
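These criteria can be encoded as a small helper so the decision is applied the same way each time the metrics are re-measured. This is a sketch, not a definitive rule: `next_level` and its parameter names are hypothetical, and the guide gives no rule for Naive RAG with faithfulness between 0.80 and 0.85, so the sketch assumes staying on Naive in that band.

```python
def next_level(
    current: str,
    faithfulness: float,
    *,
    frequent_irrelevant_answers: bool = False,
    kb_covers_subset_only: bool = False,
    heterogeneous_queries: bool = False,
) -> str:
    """Return the architecture to run next: "naive", "advanced" or "modular"."""
    if current == "naive":
        if faithfulness < 0.80 or frequent_irrelevant_answers:
            return "advanced"
        # Assumption: the 0.80-0.85 band is not covered by the guide; stay on Naive
        return "naive"
    if current == "advanced":
        if kb_covers_subset_only or heterogeneous_queries:
            return "modular"
        return "advanced"
    if current == "modular":
        return "modular"
    raise ValueError(f"unknown architecture: {current!r}")


print(next_level("naive", faithfulness=0.71))
print(next_level("advanced", faithfulness=0.88, heterogeneous_queries=True))
```

Feeding the function measured RAGAS scores rather than impressions keeps the upgrade decision tied to data.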
Conclusions
The right RAG architecture depends on the complexity of your use case. Always start with Naive RAG, measure with RAGAS and advance only when the data warrants it. Adding complexity without measurement leads to over-engineered systems that cost more without measurable improvements.
The next article delves into chunking strategies: the component of the retrieval pipeline that has the greatest impact on the quality of Naive RAG and is often overlooked.







