RAG Explained: Retrieval-Augmented Generation Fundamentals
Large Language Models (LLMs) have transformed how we interact with information. GPT-4, Claude, Gemini, and Llama can write code, summarize documents, answer complex questions, and reason about abstract problems. Yet, they suffer from a fundamental flaw that severely limits their use in professional settings: hallucinations.
When an LLM does not know the answer, it does not say "I don't know." Instead, it generates plausible-sounding text that can be entirely fabricated, delivered with the same confidence as verified facts. In enterprise, legal, or medical contexts, this behavior is unacceptable. The most effective and mature solution to this problem is called Retrieval-Augmented Generation (RAG).
In this first article of the AI Engineering and Advanced RAG series, we will start from the ground up: what RAG is, how it works, why it solves the hallucination problem, and how to build a working RAG system from scratch. By the end, you will have a solid understanding of the entire architecture and be ready to dive deeper into each component in the following articles.
Series Overview
| # | Article | Focus |
|---|---|---|
| 1 | You are here - RAG Explained | Fundamentals and full architecture |
| 2 | Embeddings and Semantic Search | How text becomes vectors |
| 3 | Vector Databases Deep Dive | Storage, indexing, similarity search |
| 4 | RAG with LangChain and Python | End-to-end practical implementation |
| 5 | Hybrid Retrieval and Reranking | Keyword + semantic search combined |
| 6 | Context Window and Prompt Engineering | Optimizing context for the LLM |
| 7 | RAG in Production | Scaling, monitoring, evaluation |
| 8 | Knowledge Graphs and RAG | Structured knowledge + retrieval |
| 9 | Multi-Agent Systems | Collaborative AI agents |
| 10 | The Future of RAG | Trends, research, and next steps |
What You Will Learn
- What RAG is and the core problem it solves for LLMs
- The complete architecture: from document preparation to answer generation
- How embeddings, vector stores, and similarity search work
- Chunking strategies and their trade-offs
- How to build a minimal working RAG system with Python
- The differences between Naive RAG and Advanced RAG
- When to use RAG vs fine-tuning vs prompt engineering
1. The Hallucination Problem in LLMs
To understand why RAG is necessary, we must first understand the problem it solves. Large Language Models are fundamentally probabilistic language models: given an input text, they predict the most likely sequence of tokens as a continuation. This architecture, built on the Transformer attention mechanism, produces remarkable results but has an intrinsic limitation: the model does not "know" anything in the human sense. It generates statistically plausible text based on patterns learned during training.
The Four Structural Problems of LLMs
| Problem | Description | Practical Impact |
|---|---|---|
| Knowledge Cutoff | Knowledge is frozen at the training date | No information about recent events or proprietary data |
| Hallucinations | Generates plausible but entirely fabricated answers | False information presented with high confidence |
| No Citations | Cannot point to the source of its claims | Impossible to verify the correctness of answers |
| No Private Data | Does not know your organization's internal documentation | Useless for domain-specific use cases without context |
Hallucinations are not an occasional bug: they are a direct consequence of the architecture. When the model lacks sufficient information to answer, it does not return an error. Instead, it generates the most probable text continuation, which can be entirely fabricated. The model was trained to produce fluent and coherent text, not to be factually accurate.
Real-World Hallucination Examples
To grasp the severity of the problem, consider real-world scenarios where hallucinations have significant consequences:
- Legal domain: An LLM might cite non-existent court rulings or fabricate statute numbers with convincing formatting
- Medical domain: It could suggest incorrect drug dosages or assert drug interactions that have never been documented
- Technical documentation: It might describe APIs with parameters that do not exist or features never implemented
- Customer support: It could invent company policies, return procedures, or warranties that do not exist
The data confirms the scope of the problem: hallucination rates vary significantly across domains, with specialized fields like medicine and law still exhibiting rates of 10-20% even with the most advanced models. RAG represents the most effective technique for mitigating this issue, with documented reductions of up to 71%.
2. What is RAG: Definition and Origins
Retrieval-Augmented Generation (RAG) is an architectural paradigm that combines an information retrieval system with a generative model (LLM) to produce answers grounded in real data. The concept was formalized in 2020 by Patrick Lewis and colleagues at Meta AI in the seminal paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks".
The core intuition is simple: instead of asking the model to "remember" all information from its training, we provide it with the relevant documents at generation time. This process is called grounding: anchoring the generation to concrete, verifiable sources.
TRADITIONAL LLM:
User question ────────────────────> [LLM] ──> Answer
|
Based ONLY on training data
Hallucination risk: HIGH
RAG (Retrieval-Augmented Generation):
User question ──> [RETRIEVAL] ──> Relevant documents
| |
| v
└────────> [LLM + Context] ──> Answer with citations
|
Based on REAL DATA retrieved
Hallucination risk: LOW
Analogy: The Open-Book Exam
Think of a university exam. A traditional LLM is like a student answering from memory: they know a lot, but might confuse details or make things up. RAG is like an open-book exam: the student can consult their notes and textbooks before answering. The answer will be more accurate and verifiable because it is grounded in concrete sources.
The concept of grounding is crucial: every claim in the generated answer can be traced back to a specific document in the knowledge base. This means:
- Answers are verifiable: you can check the source document
- Answers are updatable: updating the documents automatically updates the answers
- Answers are controllable: you decide which documents the system can consult
- Answers are traceable: you can include citations with source references
3. How RAG Works: The Complete Pipeline
A RAG system consists of two macro-phases: the indexing pipeline, which prepares documents once (or periodically), and the query pipeline, which handles user questions in real time.
3.1 Indexing Pipeline (Offline Phase)
The indexing phase transforms raw documents into a structure optimized for semantic search. It consists of four sequential steps:
[Source Documents]
| PDF, HTML, Markdown, CSV, databases, APIs, emails...
v
[1. DOCUMENT LOADING]
| Raw text extraction from source formats
| Metadata preservation (author, date, title)
v
[2. CHUNKING (Text Splitting)]
| Split into manageable fragments
| Strategies: fixed-size, semantic, recursive
| Overlap between chunks to preserve context
v
[3. EMBEDDING]
| Transform each chunk into a numeric vector
| Models: OpenAI, Sentence Transformers, Cohere
| Typical dimensions: 384, 768, 1536, 3072
v
[4. VECTOR STORE]
| Save vectors in a specialized database
| Indexing for fast search (HNSW, IVF)
| ChromaDB, Pinecone, Weaviate, Milvus, Qdrant
v
[Knowledge Base Ready for Queries]
3.2 Query Pipeline (Online Phase)
When a user asks a question, the query pipeline kicks in to find the most relevant documents and generate a grounded answer:
[User Question]
| "How do I configure OAuth authentication in our app?"
v
[1. QUERY EMBEDDING]
| The question is transformed into a vector
| Same model used during indexing!
v
[2. SIMILARITY SEARCH]
| Find the most similar chunks in the vector store
| Metrics: cosine similarity, L2, dot product
| Returns the top-k results (typically k=3..10)
v
[3. CONTEXT ASSEMBLY]
| Retrieved chunks are assembled into a context
| Ordered by relevance, deduplicated
v
[4. PROMPT CONSTRUCTION]
| Build the prompt with context + question
| Template: "Based on the following documents, answer..."
v
[5. LLM GENERATION]
| The model generates the answer based on context
| Can include citations to source documents
v
[Grounded Answer + Citations]
Fundamental Rule: Embedding Consistency
It is mandatory to use the same embedding model for both the indexing phase and the query phase. Different models produce incompatible vector spaces: vectors generated by one model cannot be compared with those from another. Switching embedding models requires complete re-indexing of all documents.
4. Document Processing: Chunking and Preparation
Chunking is one of the most critical phases in the entire RAG pipeline. The quality of results depends largely on how documents are split. Chunks that are too large dilute the semantic signal and waste space in the LLM's context window. Chunks that are too small lose the context needed to be meaningful.
4.1 Fixed-Size Chunking
The simplest strategy: split text into blocks of fixed size (e.g., 500 tokens) with optional overlap between consecutive chunks.
Fixed-Size Chunking Parameters
| Parameter | Description | Typical Value |
|---|---|---|
| chunk_size | Maximum size of each chunk in tokens or characters | 300-500 tokens |
| chunk_overlap | Overlap between consecutive chunks | 10-20% of chunk_size |
| separator | Character/string used for splitting | "\n\n", "\n", " " |
Original document (1500 tokens):
"Lorem ipsum dolor sit amet... [1500 tokens of text]"
With chunk_size=500 and overlap=50:
Chunk 1: tokens 1-500
Chunk 2: tokens 451-950     (50-token overlap with Chunk 1)
Chunk 3: tokens 901-1400    (50-token overlap with Chunk 2)
Chunk 4: tokens 1351-1500   (50-token overlap with Chunk 3)
The overlap ensures that context at chunk boundaries is not lost.
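The mechanics above can be sketched in a few lines of Python. This is a minimal illustration that splits on characters for simplicity (production splitters usually count tokens), with `fixed_size_chunks` as a hypothetical helper name:

```python
def fixed_size_chunks(text, chunk_size=500, overlap=50):
    """Split text into fixed-size chunks; each new chunk starts
    (chunk_size - overlap) characters after the previous one."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = "x" * 1500  # stand-in for a 1500-character document
chunks = fixed_size_chunks(doc, chunk_size=500, overlap=50)
print(len(chunks))      # 4
print(len(chunks[0]))   # 500
print(len(chunks[-1]))  # 150 (the final, shorter chunk)
```

With a 1500-character input this produces four chunks starting at positions 0, 450, 900, and 1350, matching the diagram above.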
4.2 Recursive Character Splitting
A more sophisticated strategy that attempts to split while respecting document structure. It uses a hierarchy of separators: first tries paragraphs ("\n\n"), then lines ("\n"), then sentences (". "), then words (" "). This preserves semantic context better than fixed-size splitting.
4.3 Semantic Chunking
The most advanced strategy: uses embeddings themselves to determine where to split. It calculates similarity between consecutive sentences and creates a new chunk when similarity drops below a threshold, indicating a topic change. This produces variable-sized but semantically coherent chunks.
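The idea can be sketched with a stand-in `embed` function (a toy bag-of-words vector over a tiny vocabulary here; a real system would call an embedding model) and a similarity threshold that triggers a split on topic change:

```python
import math

def embed(sentence):
    """Toy embedding: bag-of-words over a tiny vocabulary.
    Stand-in for a real embedding model."""
    vocab = ["cat", "dog", "pet", "gold", "price", "market"]
    words = sentence.lower().split()
    return [float(words.count(w)) for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(sentences, threshold=0.3):
    """Start a new chunk when similarity to the previous
    sentence drops below the threshold (topic change)."""
    chunks, current = [], [sentences[0]]
    for prev, sent in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(sent)) < threshold:
            chunks.append(current)
            current = []
        current.append(sent)
    chunks.append(current)
    return chunks

sentences = [
    "the cat is a pet",
    "the dog is a pet",
    "gold price rises in the market",
    "the market sets the gold price",
]
for chunk in semantic_chunks(sentences):
    print(chunk)
```

Here the similarity between the second and third sentences drops to zero, so the splitter produces two chunks: one about pets, one about gold prices.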
Chunking Strategy Comparison
| Strategy | Quality | Complexity | When to Use |
|---|---|---|---|
| Fixed-Size | Basic | Minimal | Prototyping, homogeneous documents |
| Recursive | Good | Low | General use, structured documents |
| Semantic | Excellent | High | High quality required, heterogeneous docs |
The Importance of Metadata
Every chunk should carry metadata: document title, author, date, section, page number. These metadata are essential for filtering during search and for generating accurate citations in the answer. A chunk without metadata is like a paragraph without context.
5. Embeddings: Transforming Text into Vectors
Embeddings are the mathematical heart of RAG. An embedding is a numeric representation (a vector of decimal numbers) that captures the semantic meaning of a text. Two sentences with similar meaning will have vectors that are "close" in multidimensional space, regardless of the words used.
INPUT: Text string
OUTPUT: Vector of decimal numbers (e.g., 1536 dimensions)
Example:
"The cat sleeps on the couch" --> [0.23, -0.45, 0.67, 0.12, -0.89, ...]
"The feline rests on the sofa" --> [0.22, -0.44, 0.68, 0.11, -0.88, ...]
^^ Very SIMILAR vectors (same meaning)
"The price of gold is rising" --> [-0.56, 0.78, -0.12, 0.91, 0.34, ...]
^^ Very DIFFERENT vector (different meaning)
The embedding model is a neural network trained on massive amounts of text to learn semantic relationships between words, phrases, and concepts. It does not produce a merely syntactic representation (like bag-of-words or TF-IDF), but captures the deeper meaning of the text.
Popular Embedding Models (2025-2026)
| Model | Dimensions | Provider | Approximate Cost |
|---|---|---|---|
| text-embedding-3-small | 1536 | OpenAI | ~$0.02 / 1M tokens |
| text-embedding-3-large | 3072 | OpenAI | ~$0.13 / 1M tokens |
| voyage-3-large | 1024 | Voyage AI | ~$0.06 / 1M tokens |
| all-MiniLM-L6-v2 | 384 | HuggingFace | Free (self-hosted) |
| nomic-embed-text | 768 | Ollama (local) | Free (local) |
| embed-v4 | 1024 | Cohere | ~$0.10 / 1M tokens |
The choice of embedding model depends on the use case: larger models (3072 dimensions) offer a richer representation but cost more in terms of storage and computation. For many use cases, models with 768-1536 dimensions offer an excellent balance between quality and cost.
6. Vector Store: The Database for Embeddings
A vector store (or vector database) is a specialized database for storing, indexing, and searching high-dimensional vectors. Unlike traditional databases that look for exact matches (SQL WHERE), a vector store finds the vectors most similar to the query vector.
6.1 Similarity Metrics
Search in a vector store relies on distance/similarity metrics between vectors. The three most common metrics are:
Similarity Metrics Compared
| Metric | Range | Description | When to Use |
|---|---|---|---|
| Cosine Similarity | [-1, 1] | Measures the angle between two vectors, ignores magnitude | Default for most cases (recommended) |
| Euclidean Distance (L2) | [0, +inf) | Geometric distance in space, sensitive to magnitude | When vector magnitude is meaningful |
| Dot Product | (-inf, +inf) | Scalar product, combines direction and magnitude | Already normalized vectors, Maximum Inner Product Search |
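All three metrics can be computed directly from two vectors; a minimal sketch, using the example vectors from Section 5:

```python
import math

a = [0.23, -0.45, 0.67]  # "The cat sleeps on the couch" (truncated)
b = [0.22, -0.44, 0.68]  # "The feline rests on the sofa" (truncated)

def norm(v):
    return math.sqrt(sum(x * x for x in v))

dot = sum(x * y for x, y in zip(a, b))                   # direction + magnitude
l2 = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))  # geometric distance
cos = dot / (norm(a) * norm(b))                          # angle only, in [-1, 1]

print(f"dot={dot:.4f}  L2={l2:.4f}  cosine={cos:.4f}")
```

For these nearly parallel vectors cosine similarity is very close to 1 and the L2 distance is very small, which is exactly what "same meaning, different words" looks like numerically.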
6.2 Vector Database Overview
The vector database market has exploded with RAG adoption. Here is an overview of the main tools available:
Vector Databases Compared
| Database | Type | Language | Best For |
|---|---|---|---|
| ChromaDB | Embedded / Server | Python | Prototyping, local development, small datasets |
| Pinecone | Cloud managed | Multi-lang | Production, auto-scaling, zero-ops |
| Weaviate | Self-hosted / Cloud | Go | Hybrid search, multi-tenancy, GraphQL |
| Milvus | Self-hosted / Cloud | Go / C++ | Large volumes, high performance, enterprise |
| Qdrant | Self-hosted / Cloud | Rust | Performance, advanced filtering, REST API |
| pgvector | PostgreSQL extension | C | Existing PostgreSQL stack, relational + vector data |
| FAISS | In-memory library | C++ / Python | Research, benchmarking, maximum optimization |
To get started, ChromaDB is the simplest choice: it installs via pip, works in-memory or with disk persistence, and integrates natively with LangChain. For production, Pinecone (managed) and Qdrant (self-hosted) are among the most popular options.
7. Retrieval: Finding the Relevant Documents
The retrieval phase is where the user's question gets transformed into a vector and compared against all vectors in the store to find the most relevant chunks. This process happens in milliseconds even with millions of indexed documents, thanks to approximate nearest neighbor algorithms like HNSW (Hierarchical Navigable Small World).
Query: "How do I configure OAuth in our app?"
|
v
[1] Query Embedding ──> [0.34, -0.21, 0.56, ...]
|
v
[2] Similarity Search in Vector Store
| Compares query vector against all stored vectors
| Uses cosine similarity as metric
|
v
[3] Top-K Results (e.g., k=5)
|
| Score: 0.92 - "OAuth 2.0 configuration for web apps..."
| Score: 0.87 - "Guide to OAuth authentication flows..."
| Score: 0.83 - "Setting up redirect URIs for OAuth..."
| Score: 0.76 - "Comparing OAuth and SAML for SSO..."
| Score: 0.71 - "API security with JWT tokens..."
|
v
[4] Filtering and Reranking (optional)
| Filter by metadata (date, category, source)
| Reranking with cross-encoder model
|
v
[Relevant Chunks Ready for the Prompt]
7.1 The Top-K Parameter
The top-k parameter determines how many chunks are retrieved. The choice is a trade-off:
- K too low (1-2): Risk of missing relevant information
- K too high (20+): Too much noise in context, wasted tokens, risk of confusing the model
- Optimal K (3-7): Good balance between coverage and precision
7.2 Relevance Scoring
Each result has a relevance score. Chunks with low scores should be filtered out because they add noise without value. A common threshold is 0.7 for cosine similarity: anything below gets discarded. In practice, the optimal threshold depends on the domain and should be calibrated experimentally.
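Top-k selection and threshold filtering can be sketched as a brute-force search. Real vector stores use ANN indexes like HNSW instead of scanning every vector, but the scoring logic is the same; the store contents and vectors below are illustrative toy values:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(query_vec, store, k=5, threshold=0.7):
    """Score every chunk, sort by similarity, keep top-k above the threshold."""
    scored = [(cosine(query_vec, vec), chunk) for chunk, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(s, c) for s, c in scored[:k] if s >= threshold]

# Toy store: (chunk_text, precomputed_vector) pairs
store = [
    ("OAuth 2.0 configuration for web apps", [0.9, 0.1, 0.0]),
    ("Guide to OAuth authentication flows",  [0.8, 0.2, 0.1]),
    ("Office coffee machine manual",         [0.0, 0.1, 0.9]),
]
query_vec = [1.0, 0.0, 0.0]  # pretend embedding of the user's question

for score, chunk in retrieve(query_vec, store, k=2):
    print(f"{score:.2f}  {chunk}")
```

The coffee machine chunk scores near zero and is dropped by the 0.7 threshold, so only the two OAuth chunks reach the prompt.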
8. Generation: Building the Answer with the LLM
The final phase of the RAG pipeline: retrieved chunks are assembled into a structured context and inserted into the LLM prompt alongside the user's question. The model generates the answer based exclusively (or primarily) on the provided context.
8.1 The Prompt Template
A good RAG prompt template must instruct the model to:
- Use only information from the provided context
- Acknowledge when it cannot find the answer in the documents
- Cite sources when possible
- Not fabricate information absent from the context
You are a technical assistant. Answer questions ONLY based on
the provided context. If the answer is not in the context,
reply "I could not find sufficient information in the available
documents."
CONTEXT:
---
{context}
---
QUESTION: {question}
INSTRUCTIONS:
1. Answer clearly and concisely
2. Cite the source document in square brackets [Source: doc_name]
3. If the context does not contain the answer, say so explicitly
4. Do not fabricate information not present in the context
ANSWER:
8.2 Citation Tracking
One of RAG's most valuable features is the ability to track citations. Each chunk in the context can be labeled with an identifier (e.g., [DOC-1], [DOC-2]) and the model is instructed to reference these identifiers in its answer. This allows users to verify every claim by consulting the original document.
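The labeling step can be sketched as a small assembly function run before prompt construction; the chunk texts and source names below are illustrative:

```python
def build_context(chunks):
    """Label each chunk [DOC-n] so the model can cite it by identifier."""
    lines = []
    for i, chunk in enumerate(chunks, start=1):
        lines.append(f"[DOC-{i}] (source: {chunk['source']})\n{chunk['text']}")
    return "\n\n".join(lines)

chunks = [
    {"text": "Register the app with the OAuth provider.", "source": "oauth-guide.pdf"},
    {"text": "Configure redirect URIs in config.yaml.",   "source": "oauth-guide.pdf"},
]
context = build_context(chunks)
prompt = f"CONTEXT:\n{context}\n\nQUESTION: How do I configure OAuth?\nANSWER:"
print(prompt)
```

When the model answers "Register the app first [DOC-1]", the identifier maps straight back to a source file (and, if you include it in the metadata, a page number).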
Context Window and Limits
The amount of context you can insert is limited by the LLM's context window. If retrieved chunks exceed the limit, you will need to select only the most relevant ones or summarize them. Modern models like GPT-4o (128K tokens) and Claude 3.5 (200K tokens) have very large context windows, but even so, inserting too much irrelevant context degrades answer quality (the so-called "lost in the middle" problem).
9. Complete RAG Architecture: The Big Picture
Now that we have examined each component individually, let us assemble the complete end-to-end architecture of a RAG system:
OFFLINE PHASE (Indexing)
┌──────────────────────────────────────────────────────┐
│ │
│ [Documents] ──> [Loader] ──> [Chunker] │
│ PDF, HTML Text Split into │
│ MD, CSV extraction fragments │
│ │ │
│ v │
│ [Embedding Model] │
│ Text ──> Vector │
│ │ │
│ v │
│ [Vector Store] │
│ ChromaDB, Pinecone │
│ Weaviate, Qdrant │
│ │
└──────────────────────────────────────────────────────┘
ONLINE PHASE (Query)
┌──────────────────────────────────────────────────────┐
│ │
│ [User Question] │
│ │ │
│ v │
│ [Query Embedding] ──> [Similarity Search] │
│ Same model Top-K chunks │
│ as indexing │ │
│ v │
│ [Context Assembly] │
│ Chunks + Metadata │
│ │ │
│ v │
│ [Prompt Template + Context + Question] │
│ │ │
│ v │
│ [LLM] │
│ GPT-4, Claude │
│ Llama, Gemini │
│ │ │
│ v │
│ [Grounded Answer + Citations] │
│ │
└──────────────────────────────────────────────────────┘
Components and Responsibilities
| Component | Responsibility | Typical Tools |
|---|---|---|
| Document Loader | Load documents from various sources | LangChain loaders, Unstructured, LlamaIndex |
| Text Splitter | Split documents into optimal chunks | RecursiveCharacterTextSplitter, SemanticChunker |
| Embedding Model | Transform text into semantic vectors | OpenAI Embeddings, Sentence Transformers, Cohere |
| Vector Store | Store and index vectors | ChromaDB, Pinecone, Qdrant, pgvector |
| Retriever | Find the most relevant chunks | Similarity search, MMR, hybrid retrieval |
| LLM | Generate the final answer from context | GPT-4o, Claude 3.5, Llama 3, Gemini Pro |
10. Practical Example: Minimal RAG with Python
Let us build a working RAG system with the minimum code necessary. We will use LangChain as the orchestration framework, OpenAI for embeddings and generation, and ChromaDB as the local vector store.
10.1 Project Setup
# Create a virtual environment
python -m venv rag-env
source rag-env/bin/activate # Linux/Mac
# Install dependencies
pip install langchain langchain-openai langchain-community
pip install chromadb
pip install pypdf # For loading PDFs
10.2 Indexing Pipeline
import os
from langchain_community.document_loaders import (
PyPDFLoader,
TextLoader,
DirectoryLoader
)
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
# Configure API key
os.environ["OPENAI_API_KEY"] = "sk-..."
# 1. DOCUMENT LOADING
# Load all PDFs from a directory
loader = DirectoryLoader(
"./documents/",
glob="**/*.pdf",
loader_cls=PyPDFLoader
)
documents = loader.load()
print(f"Loaded {len(documents)} documents")
# 2. CHUNKING
# Recursive splitting with overlap
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=500, # 500 characters per chunk
chunk_overlap=50, # 50 characters overlap
length_function=len,
separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = text_splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks")
# 3. EMBEDDING + 4. VECTOR STORE
# Create embeddings and save to ChromaDB
embedding_model = OpenAIEmbeddings(
model="text-embedding-3-small"
)
vectorstore = Chroma.from_documents(
documents=chunks,
embedding=embedding_model,
persist_directory="./chroma_db",
collection_name="company_documents"
)
print("Vector store created and persisted to disk")
10.3 Query Pipeline
import os
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
os.environ["OPENAI_API_KEY"] = "sk-..."
# Load the existing vector store
embedding_model = OpenAIEmbeddings(
model="text-embedding-3-small"
)
vectorstore = Chroma(
persist_directory="./chroma_db",
embedding_function=embedding_model,
collection_name="company_documents"
)
# Configure the retriever
retriever = vectorstore.as_retriever(
search_type="similarity",
search_kwargs={"k": 5} # Retrieve top 5 similar chunks
)
# Define the prompt template
prompt_template = PromptTemplate(
template="""You are an expert technical assistant.
Answer the question based ONLY on the provided context.
If you cannot find the answer in the context, say so explicitly.
CONTEXT:
{context}
QUESTION: {question}
ANSWER (with citations to source documents):""",
input_variables=["context", "question"]
)
# Create the RAG chain
llm = ChatOpenAI(
model="gpt-4o",
temperature=0 # Deterministic for factual answers
)
rag_chain = RetrievalQA.from_chain_type(
llm=llm,
chain_type="stuff", # Insert all chunks into the prompt
retriever=retriever,
return_source_documents=True,
chain_type_kwargs={
"prompt": prompt_template
}
)
# Execute a query
result = rag_chain.invoke(
{"query": "How do I configure OAuth authentication?"}
)
# Print the answer
print("ANSWER:")
print(result["result"])
print("\nSOURCES:")
for doc in result["source_documents"]:
print(f" - {doc.metadata.get('source', 'N/A')}"
f" (page {doc.metadata.get('page', 'N/A')})")
Expected Output
ANSWER:
To configure OAuth authentication in our application,
follow these steps:
1. Register the app with the OAuth provider (Google, GitHub, etc.)
2. Configure redirect URIs in config.yaml [Source: oauth-guide.pdf]
3. Implement the Authorization Code flow [Source: auth-architecture.pdf]
...
SOURCES:
- documents/oauth-guide.pdf (page 12)
- documents/auth-architecture.pdf (page 5)
- documents/security-faq.pdf (page 3)
10.4 Project Structure
rag-project/
├── documents/ # Your source documents
│ ├── oauth-guide.pdf
│ ├── architecture.pdf
│ └── faq.pdf
├── chroma_db/ # Persistent vector store (generated)
├── indexing.py # Indexing script
├── query.py # Query script
├── requirements.txt # Dependencies
└── .env # API keys (never commit!)
11. Naive RAG vs Advanced RAG
The system we built so far is what the community calls Naive RAG: a linear, straightforward implementation that works surprisingly well for many use cases. However, it has limitations that Advanced RAG techniques aim to overcome.
11.1 Limitations of Naive RAG
- Query-Document Mismatch: The user's question might use different terms than the documents. "How do I reset my password?" vs document "Credential recovery procedure"
- Chunk Boundary Issues: Relevant information might be split across two consecutive chunks
- Lost in the Middle: LLMs tend to give more weight to chunks at the beginning and end of the context, neglecting those in the middle
- Single-Hop Limitation: Cannot answer questions requiring synthesis from multiple documents
11.2 Advanced RAG Techniques
Evolution from Naive to Advanced RAG
| Technique | Problem Solved | How It Works |
|---|---|---|
| Query Rewriting | Ambiguous or poorly formed queries | The LLM rewrites the query in a form better suited for retrieval |
| HyDE | Query-Document mismatch | Generates a hypothetical document from the query, then searches for similar real docs |
| Reranking | Sub-optimal result ordering | A cross-encoder model re-orders results by relevance |
| Multi-Query | Single search perspective | Generates query variants to broaden coverage |
| Self-RAG | When retrieval is not necessary | The model autonomously decides if and when to retrieve |
| Multi-Hop RAG | Complex multi-step questions | Chain of iterative retrievals to build compound reasoning |
| Hybrid Search | Limitations of semantic-only search | Combines semantic (vector) with keyword (BM25) search |
| Graph RAG | Complex entity relationships | Uses knowledge graphs to navigate relationships and structured contexts |
NAIVE RAG:
Query ──> Embedding ──> Search ──> Top-K ──> LLM ──> Answer
ADVANCED RAG:
Query ──> [Query Analysis]
│
├──> Query Rewriting ──> Embedding ──> Search ─┐
├──> HyDE Generation ──> Embedding ──> Search ─┤
└──> Multi-Query ──────> Embedding ──> Search ─┘
│
[Merge + Deduplicate]
│
[Reranker Model]
│
[Context Compression]
│
[LLM + Citations]
│
[Verified Answer]
We will dive deep into each of these techniques in the following articles of this series, particularly in the article on Hybrid Retrieval and Reranking.
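As one concrete taste of these techniques, the "Merge + Deduplicate" step for multi-query or hybrid retrieval is commonly implemented with Reciprocal Rank Fusion (RRF). A minimal sketch, with illustrative result lists standing in for real semantic and keyword searches:

```python
def rrf_merge(ranked_lists, k=60):
    """Reciprocal Rank Fusion: each document scores the sum of
    1 / (k + rank) over every list it appears in (k=60 is the
    constant from the original RRF paper)."""
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic_results = ["doc-a", "doc-b", "doc-c"]   # from vector search
keyword_results  = ["doc-b", "doc-d", "doc-a"]   # from BM25 search

print(rrf_merge([semantic_results, keyword_results]))
# ['doc-b', 'doc-a', 'doc-d', 'doc-c']
```

Documents found by both retrievers (doc-a, doc-b) naturally rise to the top, which is exactly the behavior hybrid search wants from the fusion step.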
12. When to Use RAG: Decision Framework
RAG is not the answer to every problem. There are three primary approaches to customizing LLM behavior, each with distinct advantages and limitations. The choice depends on your specific use case.
RAG vs Fine-Tuning vs Prompt Engineering
| Criterion | Prompt Engineering | RAG | Fine-Tuning |
|---|---|---|---|
| Upfront cost | Minimal | Medium | High |
| Data required | None | Knowledge base | Training dataset |
| Update speed | Immediate (change prompt) | Fast (update documents) | Slow (re-training) |
| Latency | Low | Medium (retrieval + generation) | Low |
| Factual accuracy | Depends on model | High (grounded in documents) | Medium (depends on training) |
| Citability | None | High (tracks sources) | None |
| Complexity | Low | Medium | High |
12.1 Practical Decision Tree
Need to customize an LLM?
│
├── Does data change frequently? (documents, policies, catalogs)
│ └── YES ──> RAG
│ (Data updates without re-training)
│
├── Need to cite sources?
│ └── YES ──> RAG
│ (Answer traceability)
│
├── Is the domain stable and well-defined?
│ ├── Do you have a large training dataset?
│ │ └── YES ──> Fine-Tuning
│ │ (Deep behavior customization)
│ └── NO ──> RAG or Prompt Engineering
│
├── Only need to change tone/format/style?
│ └── YES ──> Prompt Engineering
│ (No additional infrastructure)
│
└── Need complex reasoning on proprietary data?
└── YES ──> RAG + Fine-Tuning (Hybrid Approach)
(Best of both worlds)
The Golden Rule
The recommended approach in the AI engineering community follows a ladder of increasing complexity: start with prompt engineering, then move to RAG if you need external data or citations, and consider fine-tuning only when the first two approaches prove insufficient. This progression minimizes cost and complexity while maximizing ROI.
12.2 Ideal Use Cases for RAG
- Enterprise chatbots: Answer questions based on internal documentation
- Semantic search: Find relevant documents by meaning, not just keywords
- Knowledge base Q&A: Dynamic FAQs that update with the documents
- Legal assistants: Answers grounded in regulations and case law
- Technical support: Resolutions based on past tickets and manuals
- Document analysis: Extract insights from reports, contracts, research papers
12.3 When NOT to Use RAG
- Creative tasks: Creative writing, brainstorming, idea generation (the model needs freedom)
- General conversation: Social chatbots where specific data is not needed
- Few-data tasks: If you only have a handful of documents, prompt engineering might suffice
- Ultra-low latency: When the latency budget (e.g., under 50ms) leaves no room for a retrieval step
13. The RAG Market: Numbers and Trends
RAG is no longer an academic concept: it has become a production architecture adopted at scale. The market numbers confirm its strategic importance.
RAG by the Numbers (2024-2030)
| Metric | Data |
|---|---|
| Global RAG market (2024) | ~$1.2 billion USD |
| Market projection (2030) | ~$11 billion USD |
| Annual growth rate (CAGR) | 49.1% (2025-2030) |
| Enterprise adoption | 30-60% of LLM use cases leverage RAG |
| Hallucination reduction | Up to 71% with well-implemented RAG |
| Dominant frameworks | 80.5% use FAISS or Elasticsearch |
The sectors driving adoption are legal, medical, customer support, and financial services -- all domains where factual accuracy and source citability are non-negotiable requirements. The 2025-2026 trend is the shift from experimentation to large-scale production, with growing emphasis on compliance, monitoring, and data quality.
14. Conclusions and Next Steps
In this article we built a comprehensive understanding of RAG: from the problem it solves (LLM hallucinations) to the end-to-end architecture, passing through every component of the pipeline. We saw a working practical implementation and compared RAG with its alternatives (fine-tuning, prompt engineering).
Key Takeaways
- RAG combines retrieval and generation to produce answers grounded in real data
- The indexing pipeline transforms documents into vectors: Loading, Chunking, Embedding, Storage
- The query pipeline finds and uses relevant documents: Embedding, Search, Assembly, Generation
- Chunking is one of the most critical decisions: it directly impacts retrieval quality
- Embeddings capture the semantic meaning of text in numeric vectors
- Vector stores enable similarity search across millions of documents
- Advanced RAG introduces techniques like reranking, HyDE, and multi-hop to overcome Naive RAG limitations
- RAG is ideal when you need updated, verifiable, and citable answers
Next Article: Embeddings and Semantic Search
In the next article of this series, we will dive deep into the most fascinating component of RAG: embeddings. We will explore how they work internally, how to choose the right model, how to evaluate their quality, and how to optimize semantic search. We will also cover evaluation metrics and benchmarking techniques to measure the performance of your retrieval system.
AI Engineering and Advanced RAG Series
| Article | Topic |
|---|---|
| 01 - You are here | RAG: Retrieval-Augmented Generation Explained |
| 02 - Next | Embeddings and Semantic Search Deep Dive |
| 03 | Vector Databases: Architecture and Best Practices |
| 04 | Building a RAG System with LangChain and Python |
| 05 | Hybrid Retrieval: Keyword + Semantic + Reranking |
| 06 | Context Window and Prompt Engineering for RAG |
| 07 | RAG in Production: Monitoring, Evaluation, Scaling |
| 08 | Knowledge Graphs and Structured Retrieval |
| 09 | Multi-Agent Systems and Orchestrated RAG |
| 10 | The Future of RAG: Trends and Research |