RAG in Production: Architecture, Scaling and Monitoring
Building a RAG prototype that works locally is relatively straightforward. Taking it to production, where it must handle thousands of simultaneous queries, respond in under two seconds, maintain high quality over time and not lose data, is an entirely different matter. The gap between "works on my laptop" and "works in production for 10,000 users" is enormous, and many RAG projects fail precisely at this transition.
In this article we tackle the real challenges of RAG in production: scalable architecture, optimal chunking, reranking, managing corpus updates, monitoring with RAG-specific metrics, and automated quality evaluation with frameworks like RAGAS. Every section includes executable Python code and patterns tested on real systems.
What You Will Learn
- Production-ready architecture for a scalable RAG system
- Advanced chunking strategies (recursive, semantic, sentence-window)
- Reranking pipeline with cross-encoder to improve precision
- Managing incremental corpus updates
- Monitoring with RAG-specific metrics (faithfulness, relevance, recall)
- Automated evaluation with the RAGAS framework
- Intelligent caching to optimize latency and costs
- Error handling and graceful degradation in production
Series Overview
| # | Article | Focus |
|---|---|---|
| 1 | RAG Explained | Fundamentals and architecture |
| 2 | Embeddings and Semantic Search | BERT, SBERT, FAISS |
| 3 | Vector Database | Qdrant, Pinecone, Milvus |
| 4 | Hybrid Retrieval | BM25 + vector search |
| 5 | RAG in Production (you are here) | Scaling, monitoring, evaluation |
| 6 | LangChain for RAG | Framework and advanced patterns |
| 7 | Context Window Management | Optimizing LLM input |
| 8 | Multi-Agent Systems | Orchestration and coordination |
| 9 | Prompt Engineering in Production | Templates, versioning, testing |
| 10 | Knowledge Graphs for AI | Structured knowledge in LLMs |
1. Architecture of a Production-Ready RAG System
A production RAG system is not a simple sequential pipeline: it is a distributed system with specialized components, each with different scalability, fault tolerance, and monitoring requirements. Understanding the complete architecture is the first step toward building something that holds up in production.
PRODUCTION RAG ARCHITECTURE
┌─────────────────────────────────────────────────────────┐
│ API GATEWAY │
│ (Rate limiting, auth, routing) │
└──────────────────────┬──────────────────────────────────┘
│
┌──────────┴──────────┐
│ │
┌──────▼──────┐ ┌───────▼────────┐
│ QUERY │ │ INGESTION │
│ SERVICE │ │ SERVICE │
└──────┬──────┘ └───────┬────────┘
│ │
┌───────▼──────┐ ┌───────▼────────┐
│ RETRIEVAL │ │ DOCUMENT │
│ ENGINE │ │ PROCESSOR │
│ ┌─────────┐ │ │ ┌──────────┐ │
│ │Embedding│ │ │ │Chunking │ │
│ │Cache │ │ │ │Embedding │ │
│ └────┬────┘ │ │ │Indexing │ │
│ │ │ │ └──────────┘ │
│ ┌────▼────┐ │ └───────┬────────┘
│ │Vector │ │ │
│ │Search │ │ ┌───────▼────────┐
│ └────┬────┘ │ │ VECTOR DB │
│ │ │ │ (Qdrant/Pine) │
│ ┌────▼────┐ │ └────────────────┘
│ │Reranker │ │
│ └────┬────┘ │
└───────┼──────┘
│
┌───────▼──────┐ ┌────────────────┐
│ GENERATION │ │ CACHE │
│ SERVICE │◄──►│ (Redis/ │
│ (LLM) │ │ Semantic) │
└───────┬──────┘ └────────────────┘
│
┌───────▼──────┐ ┌────────────────┐
│ MONITORING │ │ EVALUATION │
│ SERVICE │ │ SERVICE │
│ (Prometheus)│ │ (RAGAS) │
└──────────────┘ └────────────────┘
1.1 Separating Concerns: Ingestion vs Query
The fundamental pattern in a production RAG system is the separation between the ingestion plane and the query plane. These two paths have very different requirements:
Ingestion vs Query: Different Requirements
| Dimension | Ingestion Path | Query Path |
|---|---|---|
| Latency | Not critical (batch) | Critical (<2s p95) |
| Throughput | Low-medium (documents) | High (thousands req/s) |
| CPU/GPU | Embedding generation (GPU) | Query embedding + rerank (GPU) |
| Errors | Retry with backoff | Graceful fallback |
| Scaling | Horizontal batch | Horizontal stateless |
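These differences show up directly in code. Below is a minimal, framework-agnostic sketch (all function names are illustrative, not a specific library): the ingestion path retries with exponential backoff because latency is not critical, while the query path never blocks the user on a retry loop and degrades gracefully when a downstream component fails.

```python
import random
import time

def ingest_with_retry(process_batch, batch, max_retries=5, base_delay=1.0):
    """Ingestion path: latency is not critical, so retry transient
    failures with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return process_batch(batch)
        except Exception:
            if attempt == max_retries - 1:
                raise  # give up: route the batch to a dead-letter queue upstream
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))

def answer_query(retrieve, generate, query, fallback="I don't know."):
    """Query path: no retry loops; an honest fallback beats a 500 error."""
    try:
        chunks = retrieve(query)
    except Exception:
        return fallback  # retrieval is down: degrade, don't crash
    if not chunks:
        return fallback
    try:
        return generate(query, chunks)
    except Exception:
        return fallback
```

The same error on the two paths gets opposite treatment: ingestion keeps trying because nobody is waiting, the query handler answers immediately with whatever it has.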
2. Advanced Chunking Strategies
Chunking is probably the most overlooked variable in RAG systems, yet it has an enormous impact on final quality. A chunk that is too small loses context; one that is too large introduces noise and exceeds the embedding model's context window. Getting chunking right should be the first thing you tune.
2.1 Recursive Character Text Splitter
from langchain.text_splitter import RecursiveCharacterTextSplitter
from typing import List, Dict, Any
import re
class AdvancedChunker:
    def __init__(self, chunk_size=512, chunk_overlap=64, strategy="recursive"):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.strategy = strategy
        self.splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            separators=["\n\n", "\n", ". ", "! ", "? ", " ", ""]
        )

    def chunk_with_metadata(self, text: str, doc_metadata: Dict) -> List[Dict]:
        """Create chunks enriched with positional metadata"""
        chunks = self.splitter.split_text(text)
        return [
            {
                "text": chunk,
                "metadata": {
                    **doc_metadata,
                    "chunk_index": i,
                    "total_chunks": len(chunks),
                    # Short previews of neighboring chunks help re-stitch context at query time
                    "prev_chunk": chunks[i - 1][:100] if i > 0 else None,
                    "next_chunk": chunks[i + 1][:100] if i < len(chunks) - 1 else None,
                }
            }
            for i, chunk in enumerate(chunks)
        ]

    def sentence_window_split(self, text: str, window_size: int = 3) -> List[str]:
        """
        Sentence Window: index individual sentences but retrieve the
        surrounding window for context preservation.
        Improves recall while maintaining precision.
        """
        sentences = re.split(r'(?<=[.!?])\s+', text)
        sentences = [s.strip() for s in sentences if s.strip()]
        return [
            " ".join(sentences[max(0, i - window_size // 2):min(len(sentences), i + window_size // 2 + 1)])
            for i in range(len(sentences))
        ]
2.2 Parent-Child Chunking
Parent-Child Chunking indexes small chunks (child) for retrieval precision but returns large chunks (parent) to give the LLM sufficient context. This pattern is particularly effective for long documents where the relevant answer is spread across multiple paragraphs.
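A minimal sketch of the pattern (character-based splitting here is a simplification; real systems split on tokens or sentences, and the helper names are illustrative): children carry the index of their parent, so retrieval hits on small chunks can be mapped back to the large chunks before reaching the LLM.

```python
from typing import Dict, List, Tuple

def build_parent_child_chunks(text: str, parent_size: int = 2000,
                              child_size: int = 400) -> Tuple[List[str], List[Dict]]:
    """Split text into large parent chunks, then each parent into small
    children. Only the children get embedded and indexed."""
    parents = [text[i:i + parent_size] for i in range(0, len(text), parent_size)]
    children = []
    for p_idx, parent in enumerate(parents):
        for j in range(0, len(parent), child_size):
            children.append({"text": parent[j:j + child_size], "parent_index": p_idx})
    return parents, children

def resolve_parents(hits: List[Dict], parents: List[str]) -> List[str]:
    """Map child-level retrieval hits to parent chunks, deduplicating
    while preserving retrieval order."""
    seen, result = set(), []
    for hit in hits:
        p_idx = hit["parent_index"]
        if p_idx not in seen:
            seen.add(p_idx)
            result.append(parents[p_idx])
    return result
```

Precision comes from matching against the focused child text; context comes from handing the LLM the whole parent, even when the answer spans several of its paragraphs.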
3. Reranking: Improving Retrieval Precision
Reranking significantly improves retrieval quality by applying a second, more precise model to initial results. The typical flow: retrieve 50-100 candidates with fast vector search, then reorder with a precise cross-encoder, finally take the top-k.
from sentence_transformers import CrossEncoder
from typing import List, Tuple
class RerankingRetriever:
    def __init__(self, bi_encoder, vector_index, documents: List[str],
                 cross_encoder_name="cross-encoder/ms-marco-MiniLM-L-6-v2"):
        self.bi_encoder = bi_encoder
        self.index = vector_index
        self.cross_encoder = CrossEncoder(cross_encoder_name)
        self.documents = documents  # texts aligned with the vector index positions

    def retrieve_and_rerank(
        self, query: str, initial_k: int = 50, final_k: int = 5
    ) -> List[Tuple[str, float]]:
        """Two-stage retrieval: fast retrieval + precise reranking"""
        # STAGE 1: Fast bi-encoder retrieval over the whole index
        query_emb = self.bi_encoder.encode([query], normalize_embeddings=True)
        scores, indices = self.index.search(query_emb.astype('float32'), initial_k)
        candidates = [
            (self.documents[i], float(s))
            for s, i in zip(scores[0], indices[0])
            if i != -1
        ]
        if not candidates:
            return []

        # STAGE 2: Cross-encoder reranking of the candidate set
        cross_pairs = [(query, doc) for doc, _ in candidates]
        cross_scores = self.cross_encoder.predict(cross_pairs)
        reranked = sorted(
            zip([doc for doc, _ in candidates], cross_scores),
            key=lambda x: x[1], reverse=True
        )
        return reranked[:final_k]
Cross-Encoder Models for Reranking
| Model | Speed | Quality | Best Use |
|---|---|---|---|
| cross-encoder/ms-marco-MiniLM-L-6-v2 | High | Good | Production, latency-sensitive |
| cross-encoder/ms-marco-electra-base | Medium | Excellent | Balanced tradeoff |
| BAAI/bge-reranker-large | Low | State-of-the-art | Maximum quality, latency secondary |
| Cohere Rerank API | API | Excellent | Prototyping, budget available |
4. Intelligent Caching for Latency and Costs
In a production RAG system, a large percentage of queries are similar or identical. Semantic caching goes beyond an exact-match cache and reuses results for semantically similar queries, dramatically reducing LLM inference costs. A cosine-similarity threshold around 0.95 is a reasonable starting point, but tune it on your own traffic: set it too low and subtly different questions will receive the wrong cached answer.
import redis
import numpy as np
import json
import hashlib
from sentence_transformers import SentenceTransformer
from typing import Optional, Tuple
import time
class SemanticCache:
    def __init__(self, redis_url="redis://localhost:6379",
                 similarity_threshold=0.95, ttl_seconds=3600):
        self.redis = redis.from_url(redis_url)
        self.model = SentenceTransformer("all-MiniLM-L6-v2")
        self.threshold = similarity_threshold
        self.ttl = ttl_seconds

    def get(self, query: str) -> Optional[Tuple[str, float]]:
        """Check cache: exact match first, then semantic match"""
        key = f"cache:{hashlib.md5(query.encode()).hexdigest()}"
        cached = self.redis.get(key)
        if cached:
            return json.loads(cached)['response'], 1.0

        # Semantic search among cached queries (linear scan: fine for small
        # caches, move to a vector index when the cache grows large)
        query_emb = self.model.encode([query], normalize_embeddings=True)[0]
        best_score, best_response = 0.0, None
        for k in self.redis.scan_iter("cache:*"):
            data = json.loads(self.redis.get(k) or '{}')
            if not data or 'embedding' not in data:
                continue
            sim = float(np.dot(query_emb, np.array(data['embedding'])))
            if sim > best_score:
                best_score, best_response = sim, data['response']
        return (best_response, best_score) if best_score >= self.threshold else None

    def set(self, query: str, response: str):
        """Cache response together with the query embedding"""
        query_emb = self.model.encode([query], normalize_embeddings=True)[0]
        key = f"cache:{hashlib.md5(query.encode()).hexdigest()}"
        self.redis.setex(key, self.ttl, json.dumps({
            'response': response,
            'embedding': query_emb.tolist(),
            'timestamp': time.time()
        }))
5. Monitoring and Observability
A production RAG system must be observable at all levels. You must measure retrieval quality, hallucination rate, and cost per query - not just HTTP latency and uptime.
from prometheus_client import Counter, Histogram, Gauge
rag_queries_total = Counter(
    'rag_queries_total', 'Total RAG queries', ['status', 'cached']
)
rag_query_duration = Histogram(
    'rag_query_duration_seconds', 'RAG query duration by component',
    ['component'], buckets=[0.05, 0.1, 0.25, 0.5, 1.0, 2.0, 5.0]
)
rag_retrieval_top_score = Histogram(
    'rag_retrieval_top_score', 'Top-1 retrieval score',
    buckets=[0.1, 0.3, 0.5, 0.7, 0.8, 0.9, 0.95, 1.0]
)

# Quality metrics updated by the evaluation service
rag_faithfulness = Gauge('rag_faithfulness_score', 'Average faithfulness')
rag_relevance = Gauge('rag_answer_relevance', 'Average answer relevance')

# Track LLM costs per query
rag_prompt_tokens = Counter('rag_prompt_tokens_total', 'Total prompt tokens')
rag_completion_tokens = Counter('rag_completion_tokens_total', 'Total completion tokens')
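The per-component histogram is typically fed by timing each pipeline stage. A small context manager works well; this sketch takes the metric as a parameter, so it works with any object exposing Prometheus's `labels(...).observe(...)` interface (which `rag_query_duration` above does):

```python
import time
from contextlib import contextmanager

@contextmanager
def timed_stage(histogram, component: str):
    """Time a pipeline stage and record the elapsed seconds in a
    per-component histogram."""
    start = time.perf_counter()
    try:
        yield
    finally:
        histogram.labels(component=component).observe(time.perf_counter() - start)
```

In the query handler this becomes `with timed_stage(rag_query_duration, "retrieval"): ...` around each of the retrieval, rerank, and generation stages, giving you a latency breakdown per component rather than a single opaque end-to-end number.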
6. Automated Evaluation with RAGAS
RAGAS (Retrieval Augmented Generation Assessment) is the most widely used framework for automated RAG evaluation. It uses an LLM as a judge to score four fundamental quality dimensions, requiring only a set of ground-truth answers rather than per-chunk relevance annotations:
- Faithfulness: Is the response supported by the retrieved context? The most critical metric for anti-hallucination.
- Answer Relevance: Does the response actually address the question?
- Context Recall: Does the retrieved context contain everything needed to answer?
- Context Precision: Are all retrieved chunks actually relevant (no noise)?
from ragas import evaluate
from ragas.metrics import (
faithfulness, answer_relevancy,
context_recall, context_precision
)
from datasets import Dataset
from typing import List
import pandas as pd
def evaluate_rag_system(rag_pipeline, test_questions: List[str],
                        ground_truths: List[str]) -> pd.DataFrame:
    """Full RAGAS evaluation pipeline"""
    questions, answers, contexts = [], [], []
    for question in test_questions:
        chunks = rag_pipeline.retrieve(question, top_k=5)
        answer = rag_pipeline.generate(question)
        questions.append(question)
        answers.append(answer)
        contexts.append([chunk for chunk, _ in chunks])

    dataset = Dataset.from_dict({
        "question": questions,
        "answer": answers,
        "contexts": contexts,
        "ground_truth": ground_truths
    })
    results = evaluate(dataset, metrics=[
        faithfulness, answer_relevancy,
        context_recall, context_precision
    ])

    df = results.to_pandas()
    print("=== RAGAS Evaluation Report ===")
    for metric in ['faithfulness', 'answer_relevancy', 'context_recall', 'context_precision']:
        print(f"  {metric:25s} {df[metric].mean():.3f}")

    # Identify problematic queries for manual inspection
    issues = df[df['faithfulness'] < 0.5]
    if len(issues) > 0:
        print(f"\nWARNING: {len(issues)} queries with faithfulness < 0.5")
        for _, row in issues.iterrows():
            print(f"  - {row['question'][:80]}...")
    return df
# Example test set
test_questions = [
    "What is RAG and how does it work?",
    "What is the difference between BERT and Sentence Transformers?",
    "How do you choose a vector database for production?"
]
ground_truths = [
    "RAG (Retrieval-Augmented Generation) combines knowledge base search with LLM generation to reduce hallucinations.",
    "BERT produces contextual token-level embeddings, while Sentence Transformers are fine-tuned to produce sentence-level embeddings for similarity search.",
    "The choice depends on scale, latency requirements, budget, and whether managed hosting or self-hosted is needed."
]
7. Best Practices and Anti-Patterns in Production
Production-Ready RAG Checklist
- Chunking: use 400-600 token chunk size with 10-15% overlap; test different strategies on your specific corpus
- Reranking: implement a cross-encoder for queries where precision is critical; the 100-300ms extra latency is worth it
- Caching: semantic cache with 0.92-0.97 threshold reduces LLM costs by 30-60% on FAQ-like corpora
- Monitoring: track faithfulness, answer relevance, retrieval latency, and LLM token cost per query
- Evaluation: maintain a golden test set (100-200 questions with ground truth) and run RAGAS on every deploy
- Fallback: if retrieval finds nothing relevant (top score < 0.5), explicitly say "I don't know" instead of hallucinating
- Versioning: version embedding models and re-index when you change them; run parallel indexes during migration
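The fallback rule from the checklist takes only a few lines to enforce. A sketch, assuming `retrieve` returns (chunk, score) pairs sorted by descending score, with the 0.5 threshold as a tunable starting point:

```python
def answer_or_abstain(query, retrieve, generate, min_score: float = 0.5,
                      abstain_msg: str = "I don't have enough information to answer that."):
    """Refuse to answer when the best retrieved chunk scores below the
    threshold, instead of letting the LLM improvise from weak context."""
    results = retrieve(query)  # expected sorted by score, descending
    if not results or results[0][1] < min_score:
        return abstain_msg
    context = [chunk for chunk, _ in results]
    return generate(query, context)
```

An explicit "I don't know" costs you one unanswered question; a confident hallucination costs you the user's trust in every answer.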
Anti-Patterns to Avoid in Production
- No quality monitoring: monitoring only latency and uptime is not enough. RAG can technically "work" but produce incorrect answers.
- Stale corpus: a RAG system on outdated documentation is worse than no RAG at all - it confidently gives wrong information.
- Fixed top-k: adapt the number of retrieved chunks to query complexity. Complex multi-part questions need more context.
- Ignoring embedding latency: generating a query embedding takes 10-50ms. At 1000 req/s this becomes your bottleneck.
- LLM as absolute oracle: even with RAG, the generative model can produce responses that go beyond the context. Implement guardrails.
Conclusions
Taking a RAG system to production requires much more than a simple sequential pipeline. We have covered production-ready architecture separating ingestion and query planes, advanced chunking strategies, cross-encoder reranking, semantic caching, and automated quality evaluation with RAGAS.
Key takeaways:
- Separate ingestion and query paths - they have radically different requirements
- Invest in chunking: it is the most impactful and most often overlooked variable
- Implement cross-encoder reranking for precision-critical use cases
- Use semantic caching to reduce costs and latency on repetitive queries
- Measure faithfulness and answer relevance, not just latency and uptime
- Maintain a golden test set and evaluate with RAGAS on every deploy
In the next article we will explore LangChain for RAG: advanced patterns including conversational RAG, multi-hop retrieval, and tool calling.