System Design GenAI: Foundations and Architectures for Production Applications
You're trying to integrate an LLM into your application and wondering whether to use RAG, fine-tuning, or simply better prompts? You're not alone: according to a 2024 analysis, 73% of enterprise GenAI deployments fail within six months, mainly due to wrong architectural choices made in the initial design phases. The problem isn't the model: it's that many teams pick the technology before understanding the problem.
This guide gives you a practical decision framework for designing GenAI systems in production, covering the fundamental architectures (RAG, fine-tuning, prompt engineering) and the criteria for choosing the right approach for your specific use case.
What You Will Learn
- The three fundamental architectures: RAG, fine-tuning and prompt engineering
- Decision framework: when to use each approach
- System architectures for GenAI applications in production
- Technology Stack 2026: LangChain, LlamaIndex, vLLM
- Common patterns and anti-patterns to avoid
- Quality metrics for evaluating a RAG system
The 73% Problem: Why GenAI Deployments Fail
Before getting into architecture, it is crucial to understand why so many projects fail. The main causes identified in enterprise deployment post-mortems are:
- Unmanaged hallucination: the model generates plausible but false answers, and no validation system is in place
- Unacceptable latency: p99 beyond 3-5 seconds on queries users expect to be fast
- Explosive costs: no per-query cost estimate before go-live, so the budget is burned through in weeks
- Unmanaged knowledge cutoff: the model does not know recent or private company data
- Lack of traceability: impossible to know which documents a response is based on (critical in regulated contexts)
Each architecture we'll look at addresses some of these problems better than others. Knowing the trade-offs allows you to design robust systems from the start.
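The "explosive costs" failure above is the easiest one to prevent: a back-of-envelope estimate before go-live. A minimal sketch, with prices passed as parameters so you can plug in your provider's current rates (the rates in the example below are hypothetical):

```python
# Back-of-envelope cost-per-query estimate before go-live.
# Prices are parameters: look up the current rates on your provider's pricing page.
def cost_per_query(
    input_tokens: int,
    output_tokens: int,
    price_in_per_mtok: float,   # $ per 1M input tokens
    price_out_per_mtok: float,  # $ per 1M output tokens
) -> float:
    """Estimated cost in dollars for a single LLM call."""
    return (input_tokens * price_in_per_mtok
            + output_tokens * price_out_per_mtok) / 1_000_000

# Example: a RAG query with 4,000 context tokens and a 500-token answer,
# at hypothetical rates of $2.50/M input and $10/M output
per_query = cost_per_query(4_000, 500, 2.50, 10.0)
monthly = per_query * 100_000  # projected at 100k queries/month
print(f"${per_query:.4f} per query, ~${monthly:,.0f}/month at 100k queries")
```

Multiplying per-query cost by expected monthly volume before launch is exactly the calculation that the failed deployments skipped.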
The Three Fundamental Architectures
1. Prompt Engineering
The starting point for any GenAI system: structure the prompt to guide the model towards the desired output. Key techniques in 2026:
- Few-shot prompting: Provide input-output examples in the prompt
- Chain-of-thought (CoT): ask the model to reason step-by-step before answering
- System prompt: Define the behavior and context of the model
- Structured output: Force output to JSON or XML format for reliable parsing
# Example: prompt engineering with structured output
import json
from openai import OpenAI

client = OpenAI()

def analyze_ticket(ticket_text: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": """You are a triage system for support tickets.
Analyze the ticket and return JSON with:
- priority: "high" | "medium" | "low"
- category: "bug" | "feature" | "question"
- sentiment: "frustrated" | "neutral" | "positive"
- estimated_resolution_hours: integer"""
            },
            {
                "role": "user",
                "content": f"Ticket: {ticket_text}"
            }
        ],
        response_format={"type": "json_object"}
    )
    return json.loads(response.choices[0].message.content)

# Usage
result = analyze_ticket("My account has been locked since yesterday, I can't log in!")
# {"priority": "high", "category": "bug", "sentiment": "frustrated", ...}
When to use it: simple, well-defined cases, rapid prototyping, when you have no private data to integrate and the base model already knows the domain.
Limits: does not work with recent or proprietary data, hallucinates on specific facts, cost proportional to context length.
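The few-shot technique from the list above boils down to prepending input-output example pairs to the conversation. A minimal sketch (the classification task and examples are hypothetical):

```python
# Few-shot prompting: prepend user/assistant example pairs so the model
# imitates the demonstrated format before seeing the real query.
def build_few_shot_messages(
    system: str,
    examples: list[tuple[str, str]],
    query: str,
) -> list[dict]:
    messages = [{"role": "system", "content": system}]
    for user_text, assistant_text in examples:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    messages.append({"role": "user", "content": query})
    return messages

messages = build_few_shot_messages(
    system="Classify the sentiment of each review as positive or negative.",
    examples=[
        ("Great product, works perfectly.", "positive"),
        ("Broke after two days.", "negative"),
    ],
    query="Shipping was fast and the quality is excellent.",
)
# messages is ready to pass as the `messages` argument of a chat completions call
```

Note how this connects to the cost limit above: every few-shot example is paid for on every single call, which is one reason fine-tuning becomes attractive at scale.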
2. RAG — Retrieval-Augmented Generation
RAG solves the problem of knowledge cutoff and private data: instead of relying only on the model's knowledge, it retrieves relevant documents from a database and places them in the context before generation.
The basic architecture of a RAG system has four phases:
- Indexing: Documents are split into chunks, converted into embedding vectors and saved in a vector database
- Retrieval: the user's query is converted into the same embedding space, and the most similar chunks are retrieved
- Augmentation: The retrieved chunks are inserted into the prompt as context
- Generation: The LLM generates the response based on the context provided
# Minimal working RAG with LangChain and Qdrant
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_qdrant import QdrantVectorStore
from langchain.chains import RetrievalQA
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# 1. Load and chunk the documents
loader = PyPDFLoader("manuale_prodotto.pdf")
docs = loader.load()
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64
)
chunks = splitter.split_documents(docs)

# 2. Create the vector store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = QdrantVectorStore.from_documents(
    documents=chunks,
    embedding=embeddings,
    url="http://localhost:6333",
    collection_name="manuale_prodotto"
)

# 3. Create the RAG chain
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
rag_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True
)

# 4. Query
result = rag_chain.invoke({"query": "How do I configure email notifications?"})
print(result["result"])
# The answer cites the retrieved documents instead of inventing one
When to use it: company knowledge base, technical documentation, FAQ, any case where the response must be based on specific and traceable documents.
Limits: quality dependent on retrieval quality, additional latency, infrastructure overhead.
3. Fine-tuning
Fine-tuning adapts the model's behavior through additional training on domain-specific data. In 2026, the dominant paradigm is Parameter-Efficient Fine-Tuning (PEFT), with techniques such as LoRA and QLoRA that allow training on consumer hardware.
# Fine-tuning with LoRA using transformers and PEFT
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType
import torch

model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# LoRA configuration: adapts only ~0.1% of the parameters
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,              # rank of the adaptation matrices
    lora_alpha=32,     # scaling factor
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"]
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 6,815,744 || all params: 8,037,191,680
# trainable%: 0.0848%
When to use it: when you have thousands of training examples, need a very specific and consistent output format, or want to shorten the prompt by eliminating few-shot examples.
Critical anti-pattern: don't fine-tune to inject factual knowledge (dates, numbers, specific facts). The model "memorizes" without understanding and will hallucinate anyway. For knowledge, use RAG.
The Decision Framework
The most important question is not "which technology to use" but "what is my real problem". This decision tree covers 90% of use cases:
Do you have private or recent data the model doesn't know?
YES --> Consider RAG as the base
NO --> Prompt engineering may be sufficient
Is your knowledge base updated frequently?
YES --> RAG (index the new documents, no retraining)
NO --> Fine-tuning can be considered
Do you have 1000+ high-quality input-output example pairs?
YES --> Fine-tuning is a valid option
NO --> Stay within RAG + few-shot
Do you need traceability (citing sources)?
YES --> RAG is mandatory
NO --> More flexibility
Is latency critical (under 500ms)?
YES --> Fine-tuning (eliminates retrieval overhead) or aggressive caching
NO --> RAG works fine
Typical 2026 conclusion:
Start with RAG + prompt engineering
Add fine-tuning only if RAG doesn't reach the required quality
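The decision tree above can be sketched as a function. This is a simplification (real projects combine approaches, e.g. RAG plus few-shot prompting), and the parameter names are our own:

```python
# The decision tree from the framework above, as a minimal sketch.
# Returns a base recommendation; real systems often combine approaches.
def recommend_architecture(
    has_private_data: bool,
    kb_changes_often: bool,
    training_examples: int,
    needs_citations: bool,
    latency_budget_ms: int,
) -> str:
    if needs_citations:
        return "RAG"  # traceability makes RAG mandatory
    if not has_private_data:
        return "prompt engineering"  # the base model may already suffice
    if latency_budget_ms < 500:
        # retrieval overhead may not fit the latency budget
        return "fine-tuning (or RAG with aggressive caching)"
    if not kb_changes_often and training_examples >= 1000:
        return "fine-tuning"
    return "RAG"

print(recommend_architecture(True, True, 200, True, 2000))   # RAG
print(recommend_architecture(False, False, 0, False, 2000))  # prompt engineering
```

Encoding the framework as code has a side benefit: the team has to make every assumption (latency budget, example count, citation requirement) explicit before choosing a technology.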
System Architecture for Production
A production-grade GenAI system goes far beyond just "LLM + vector database". The components needed for a serious deployment:
# Minimal stack for RAG in production (Docker Compose)
services:
  api:
    image: your-genai-api:latest
    environment:
      OPENAI_API_KEY: ${OPENAI_API_KEY}
      QDRANT_URL: http://qdrant:6333
      REDIS_URL: redis://redis:6379
    depends_on:
      - qdrant
      - redis
  qdrant:
    image: qdrant/qdrant:v1.9.0
    volumes:
      - qdrant_storage:/qdrant/storage
    ports:
      - "6333:6333"
  redis:
    image: redis:7-alpine
    # Semantic cache: avoids LLM calls for similar queries
    volumes:
      - redis_data:/data
  prometheus:
    image: prom/prometheus:latest
    # Monitors: latency p50/p95/p99, cost per query, quality score
  grafana:
    image: grafana/grafana:latest
    # Dashboards: LLM performance, retrieval quality, cost tracking
volumes:
  qdrant_storage:
  redis_data:
The critical components that separate a prototype from a production system:
- Semantic caching (Redis + library like GPTCache): reduces costs by 30-60% for applications with similar recurring queries
- Observability: track each LLM call with latency, tokens used, cost and quality score — without this data you can't optimize
- Fallback strategy: what happens when OpenAI is down? Do you have a local model as a backup?
- Rate limiting and quota management: protect your budget from anomalous queries
- PII detection: before sending data to the LLM, detect and mask sensitive personal data
The Technology Stack of 2026
The GenAI ecosystem has stabilized around a few dominant players:
Recommended Stack 2026
- Orchestration: LangChain v0.3+ or LlamaIndex v0.10+ for complex RAG pipelines; LangGraph for agent workflows
- Vector Database: Qdrant (self-hosted, excellent performance), pgvector (already in PostgreSQL, under 1M vectors), Pinecone (managed, guaranteed latency)
- Inference: vLLM or TensorRT-LLM for self-hosted open source models; OpenAI/Anthropic for cloud APIs
- Embeddings: text-embedding-3-small by OpenAI (1536 dim, $0.02/1M tokens) or all-MiniLM-L6-v2 for free self-hosting
- Observability: LangSmith, Weights & Biases Weave, or Phoenix for tracing of chains
- Evaluation: RAGAS for automated RAG metrics (faithfulness, answer relevancy, context recall)
Quality Metrics for RAG Systems
How to know if your RAG system is working well? The RAGAS framework defines measurable metrics:
# Automatic evaluation with RAGAS
from ragas import evaluate
from ragas.metrics import (
    faithfulness,       # is the answer supported by the retrieved documents?
    answer_relevancy,   # does the answer address the question?
    context_recall,     # do the retrieved documents contain the needed info?
    context_precision,  # are the retrieved documents all relevant?
)
from datasets import Dataset

# Test dataset (ground truth required)
test_data = {
    "question": ["How do I configure 2FA authentication?"],
    "answer": ["To configure 2FA, go to Settings > Security..."],
    "contexts": [["2FA documentation: ...", "Security guide: ..."]],
    "ground_truth": ["2FA is configured via the mobile app in the security settings"]
}
dataset = Dataset.from_dict(test_data)

result = evaluate(dataset, metrics=[
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision
])
print(result)
# faithfulness: 0.95 (the answer doesn't invent)
# answer_relevancy: 0.88 (the answer is pertinent)
# context_recall: 0.82 (the retrieved docs cover the answer)
# context_precision: 0.91 (the retrieved docs are relevant)
Realistic targets for a production system: faithfulness > 0.85 (critical: below this threshold hallucinations are frequent), answer_relevancy > 0.80, context_recall > 0.75.
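Those targets are easy to enforce mechanically, for example as a gate in a CI pipeline or a production alert. A minimal sketch, using the thresholds above (the metric names follow the RAGAS score keys; the gating logic itself is our own):

```python
# Gate a deploy (or raise an alert) when RAGAS scores fall below target.
THRESHOLDS = {
    "faithfulness": 0.85,      # below this, hallucinations become frequent
    "answer_relevancy": 0.80,
    "context_recall": 0.75,
}

def quality_gate(scores: dict[str, float]) -> list[str]:
    """Return the metrics below target; an empty list means all targets are met."""
    return [
        name for name, minimum in THRESHOLDS.items()
        if scores.get(name, 0.0) < minimum
    ]

failures = quality_gate(
    {"faithfulness": 0.95, "answer_relevancy": 0.88, "context_recall": 0.82}
)
print("deploy blocked:" if failures else "all metrics on target", failures)
# all metrics on target []
```

Running this on every document-set change is the practical answer to the "RAG without continuous evaluation" anti-pattern discussed next.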
Anti-patterns to Avoid
- Fixed chunk size for all documents: structured documents (FAQ, API docs) require different chunking than narrative text
- Semantic retrieval only: fails on exact technical terms; use hybrid search (BM25 + semantic)
- No reranking: the top-k vectors are not necessarily the most useful; a cross-encoder improves precision by 15-20%
- RAG without continuous evaluation: quality degrades as documents change; monitor faithfulness in production
- Fine-tuning as first choice: it's expensive and slow; RAG is almost always the right first move
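On the hybrid-search point: a common way to merge a BM25 ranking with a semantic ranking, without having to reconcile their incompatible score scales, is Reciprocal Rank Fusion (RRF). A self-contained sketch (the document IDs are hypothetical):

```python
# Reciprocal Rank Fusion: each ranking contributes 1/(k + rank) per document,
# so documents that appear high in BOTH rankings rise to the top.
# k=60 is the constant conventionally used in the RRF literature.
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_top = ["doc_api", "doc_faq", "doc_guide"]       # lexical: exact term matches
semantic_top = ["doc_guide", "doc_api", "doc_blog"]  # embedding similarity
print(rrf_fuse([bm25_top, semantic_top]))
# doc_api comes first: it ranks high in both lists
```

Most vector databases in the recommended stack (Qdrant included) ship hybrid search natively, so in practice you configure it rather than implement it; the sketch just shows why fusing the two signals beats either one alone.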
Conclusions and Next Steps
System design for GenAI applications requires architectural choices that go far beyond model selection. The 2026 rule of thumb: always start with RAG + prompt engineering, measure quality with RAGAS, and add fine-tuning only if the quality gap persists after retrieval optimization.
In the next articles in this series we will explore each component in detail: selecting the right vector database, chunking strategies, hybrid search, and agent architectures with LangGraph.