System Design GenAI: Foundations and Architectures for Production Applications
You're trying to integrate an LLM into your application and wondering whether to use RAG, fine-tuning, or simply better prompts? You're not alone: according to a 2024 analysis, 73% of enterprise GenAI deployments fail within six months, mainly due to wrong architectural choices made in the initial design phases. The problem isn't the model: it's that many teams pick the technology before understanding the problem.
This guide gives you a practical decision framework for designing GenAI systems in production, covering the fundamental architectures (RAG, fine-tuning, prompt engineering) and the criteria for choosing the right approach for your specific use case.
What You Will Learn
- The three fundamental architectures: RAG, fine-tuning and prompt engineering
- Decision framework: when to use each approach
- System architectures for GenAI applications in production
- Technology Stack 2026: LangChain, LlamaIndex, vLLM
- Common patterns and anti-patterns to avoid
- Quality metrics for evaluating a RAG system
The 73% Problem: Why GenAI Deployments Fail
Before getting into architecture, it is crucial to understand why so many projects fail. The main causes identified in enterprise deployment post-mortems are:
- Unmanaged hallucination: the model generates plausible but false answers, and no validation system is in place
- Unacceptable latency: p99 beyond 3-5 seconds on queries users expect to be fast
- Explosive costs: no per-query cost estimate before go-live, so the budget is burned through in weeks
- Unmanaged knowledge cutoff: the model does not know recent or private company data
- Lack of traceability: impossible to know which documents a response is based on (critical in regulated contexts)
Each architecture we'll look at addresses some of these problems better than others. Knowing the trade-offs allows you to design robust systems from the start.
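The "explosive costs" failure above is the easiest one to prevent: a back-of-envelope estimate before go-live. A minimal sketch, with prices passed as parameters so you can plug in your provider's current rates (the rates in the example below are hypothetical):

```python
# Back-of-envelope cost-per-query estimate before go-live.
# Prices are parameters: look up the current rates on your provider's pricing page.
def cost_per_query(
    input_tokens: int,
    output_tokens: int,
    price_in_per_mtok: float,   # $ per 1M input tokens
    price_out_per_mtok: float,  # $ per 1M output tokens
) -> float:
    """Estimated cost in dollars for a single LLM call."""
    return (input_tokens * price_in_per_mtok
            + output_tokens * price_out_per_mtok) / 1_000_000

# Example: a RAG query with 4,000 context tokens and a 500-token answer,
# at hypothetical rates of $2.50/M input and $10/M output
per_query = cost_per_query(4_000, 500, 2.50, 10.0)
monthly = per_query * 100_000  # projected at 100k queries/month
print(f"${per_query:.4f} per query, ~${monthly:,.0f}/month at 100k queries")
```

Multiplying per-query cost by expected monthly volume before launch is exactly the calculation that the failed deployments skipped.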
The Three Fundamental Architectures
1. Prompt Engineering
The starting point for any GenAI system: structure the prompt to guide the model towards the desired output. Key techniques in 2026:
- Few-shot prompting: Provide input-output examples in the prompt
- Chain-of-thought (CoT): ask the model to reason step-by-step before answering
- System prompt: Define the behavior and context of the model
- Structured output: Force output to JSON or XML format for reliable parsing
# Example: prompt engineering with structured output
import json
from openai import OpenAI

client = OpenAI()

def analyze_ticket(ticket_text: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": """You are a triage system for support tickets.
Analyze the ticket and return JSON with:
- priority: "high" | "medium" | "low"
- category: "bug" | "feature" | "question"
- sentiment: "frustrated" | "neutral" | "positive"
- estimated_resolution_hours: integer"""
            },
            {
                "role": "user",
                "content": f"Ticket: {ticket_text}"
            }
        ],
        response_format={"type": "json_object"}
    )
    return json.loads(response.choices[0].message.content)

# Usage
result = analyze_ticket("My account has been locked since yesterday, I can't log in!")
# {"priority": "high", "category": "bug", "sentiment": "frustrated", ...}
When to use it: simple, well-defined cases, rapid prototyping, when you have no private data to integrate and the base model already knows the domain.
Limits: does not work with recent or proprietary data, hallucinates on specific facts, cost proportional to context length.
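The few-shot technique from the list above boils down to prepending input-output example pairs to the conversation. A minimal sketch (the classification task and examples are hypothetical):

```python
# Few-shot prompting: prepend user/assistant example pairs so the model
# imitates the demonstrated format before seeing the real query.
def build_few_shot_messages(
    system: str,
    examples: list[tuple[str, str]],
    query: str,
) -> list[dict]:
    messages = [{"role": "system", "content": system}]
    for user_text, assistant_text in examples:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    messages.append({"role": "user", "content": query})
    return messages

messages = build_few_shot_messages(
    system="Classify the sentiment of each review as positive or negative.",
    examples=[
        ("Great product, works perfectly.", "positive"),
        ("Broke after two days.", "negative"),
    ],
    query="Shipping was fast and the quality is excellent.",
)
# messages is ready to pass as the `messages` argument of a chat completions call
```

Note how this connects to the cost limit above: every few-shot example is paid for on every single call, which is one reason fine-tuning becomes attractive at scale.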
2. RAG — Retrieval-Augmented Generation
RAG solves the problem of knowledge cutoff and private data: instead of relying only on the model's knowledge, it retrieves relevant documents from a database and places them in the context before generation.
The basic architecture of a RAG system has four phases:
- Indexing: Documents are split into chunks, converted into embedding vectors and saved in a vector database
- Retrieval: the user's query is converted into the same embedding space, and the most similar chunks are retrieved
- Augmentation: The retrieved chunks are inserted into the prompt as context
- Generation: The LLM generates the response based on the context provided
# Minimal working RAG with LangChain and Qdrant
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_qdrant import QdrantVectorStore
from langchain.chains import RetrievalQA
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# 1. Load and chunk the documents
loader = PyPDFLoader("manuale_prodotto.pdf")
docs = loader.load()
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64
)
chunks = splitter.split_documents(docs)

# 2. Create the vector store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = QdrantVectorStore.from_documents(
    documents=chunks,
    embedding=embeddings,
    url="http://localhost:6333",
    collection_name="manuale_prodotto"
)

# 3. Create the RAG chain
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
rag_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True
)

# 4. Query
result = rag_chain.invoke({"query": "How do I configure email notifications?"})
print(result["result"])
# The answer cites the retrieved documents instead of inventing one
When to use it: company knowledge base, technical documentation, FAQ, any case where the response must be based on specific and traceable documents.
Limits: quality dependent on retrieval quality, additional latency, infrastructure overhead.
3. Fine-tuning
Fine-tuning adapts the model's behavior through additional training on domain-specific data. In 2026, the dominant paradigm is Parameter-Efficient Fine-Tuning (PEFT), with techniques such as LoRA and QLoRA that allow training on consumer hardware.
# Fine-tuning with LoRA using transformers and PEFT
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType
import torch

model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# LoRA configuration: adapts only ~0.1% of the parameters
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,              # rank of the adaptation matrices
    lora_alpha=32,     # scaling factor
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"]
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 6,815,744 || all params: 8,037,191,680
# trainable%: 0.0848%
When to use it: when you have thousands of training examples, need a very specific and consistent output format, or want to shorten the prompt by eliminating few-shot examples.
Critical anti-pattern: don't fine-tune to inject factual knowledge (dates, numbers, specific facts). The model "memorizes" without understanding and will hallucinate anyway. For knowledge, use RAG.
The Decision Framework
The most important question is not "which technology to use" but "what is my real problem". This decision tree covers 90% of use cases:
Do you have private or recent data the model doesn't know?
YES --> Consider RAG as the base
NO --> Prompt engineering may be sufficient
Is your knowledge base updated frequently?
YES --> RAG (index the new documents, no retraining)
NO --> Fine-tuning can be considered
Do you have 1000+ high-quality input-output example pairs?
YES --> Fine-tuning is a valid option
NO --> Stay within RAG + few-shot
Do you need traceability (citing sources)?
YES --> RAG is mandatory
NO --> More flexibility
Is latency critical (under 500ms)?
YES --> Fine-tuning (eliminates retrieval overhead) or aggressive caching
NO --> RAG works fine
Typical 2026 conclusion:
Start with RAG + prompt engineering
Add fine-tuning only if RAG doesn't reach the required quality
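The decision tree above can be sketched as a function. This is a simplification (real projects combine approaches, e.g. RAG plus few-shot prompting), and the parameter names are our own:

```python
# The decision tree from the framework above, as a minimal sketch.
# Returns a base recommendation; real systems often combine approaches.
def recommend_architecture(
    has_private_data: bool,
    kb_changes_often: bool,
    training_examples: int,
    needs_citations: bool,
    latency_budget_ms: int,
) -> str:
    if needs_citations:
        return "RAG"  # traceability makes RAG mandatory
    if not has_private_data:
        return "prompt engineering"  # the base model may already suffice
    if latency_budget_ms < 500:
        # retrieval overhead may not fit the latency budget
        return "fine-tuning (or RAG with aggressive caching)"
    if not kb_changes_often and training_examples >= 1000:
        return "fine-tuning"
    return "RAG"

print(recommend_architecture(True, True, 200, True, 2000))   # RAG
print(recommend_architecture(False, False, 0, False, 2000))  # prompt engineering
```

Encoding the framework as code has a side benefit: the team has to make every assumption (latency budget, example count, citation requirement) explicit before choosing a technology.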
System Architecture for Production
A production-grade GenAI system goes far beyond just "LLM + vector database". The components needed for a serious deployment:
# Minimal stack for RAG in production (Docker Compose)
services:
  api:
    image: your-genai-api:latest
    environment:
      OPENAI_API_KEY: ${OPENAI_API_KEY}
      QDRANT_URL: http://qdrant:6333
      REDIS_URL: redis://redis:6379
    depends_on:
      - qdrant
      - redis
  qdrant:
    image: qdrant/qdrant:v1.9.0
    volumes:
      - qdrant_storage:/qdrant/storage
    ports:
      - "6333:6333"
  redis:
    image: redis:7-alpine
    # Semantic cache: avoids LLM calls for similar queries
    volumes:
      - redis_data:/data
  prometheus:
    image: prom/prometheus:latest
    # Monitors: latency p50/p95/p99, cost per query, quality score
  grafana:
    image: grafana/grafana:latest
    # Dashboards: LLM performance, retrieval quality, cost tracking
volumes:
  qdrant_storage:
  redis_data:
The critical components that separate a prototype from a production system:
- Semantic caching (Redis + library like GPTCache): reduces costs by 30-60% for applications with similar recurring queries
- Observability: track each LLM call with latency, tokens used, cost and quality score — without this data you can't optimize
- Fallback strategy: what happens when OpenAI is down? Do you have a local model as a backup?
- Rate limiting and quota management: protect your budget from anomalous queries
- PII detection: before sending data to the LLM, detect and mask sensitive personal data
The Technology Stack of 2026
The GenAI ecosystem has stabilized around a few dominant players:
Recommended Stack 2026
- Orchestration: LangChain v0.3+ or LlamaIndex v0.10+ for complex RAG pipelines; LangGraph for agent workflows
- Vector Database: Qdrant (self-hosted, excellent performance), pgvector (already in PostgreSQL, under 1M vectors), Pinecone (managed, guaranteed latency)
- Inference: vLLM or TensorRT-LLM for self-hosted open source models; OpenAI/Anthropic for cloud APIs
- Embeddings: text-embedding-3-small by OpenAI (1536 dim, $0.02/1M tokens) or all-MiniLM-L6-v2 for free self-hosting
- Observability: LangSmith, Weights & Biases Weave, or Phoenix for tracing of chains
- Evaluation: RAGAS for automated RAG metrics (faithfulness, answer relevancy, context recall)
Quality Metrics for RAG Systems
How to know if your RAG system is working well? The RAGAS framework defines measurable metrics:
# Automatic evaluation with RAGAS
from ragas import evaluate
from ragas.metrics import (
    faithfulness,       # is the answer supported by the retrieved documents?
    answer_relevancy,   # does the answer address the question?
    context_recall,     # do the retrieved documents contain the needed info?
    context_precision,  # are the retrieved documents all relevant?
)
from datasets import Dataset

# Test dataset (ground truth required)
test_data = {
    "question": ["How do I configure 2FA authentication?"],
    "answer": ["To configure 2FA, go to Settings > Security..."],
    "contexts": [["2FA documentation: ...", "Security guide: ..."]],
    "ground_truth": ["2FA is configured via the mobile app in the security settings"]
}
dataset = Dataset.from_dict(test_data)

result = evaluate(dataset, metrics=[
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision
])
print(result)
# faithfulness: 0.95 (the answer doesn't invent)
# answer_relevancy: 0.88 (the answer is pertinent)
# context_recall: 0.82 (the retrieved docs cover the answer)
# context_precision: 0.91 (the retrieved docs are relevant)
Realistic targets for a production system: faithfulness > 0.85 (critical: below this threshold hallucinations are frequent), answer_relevancy > 0.80, context_recall > 0.75.
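Those targets are easy to enforce mechanically, for example as a gate in a CI pipeline or a production alert. A minimal sketch, using the thresholds above (the metric names follow the RAGAS score keys; the gating logic itself is our own):

```python
# Gate a deploy (or raise an alert) when RAGAS scores fall below target.
THRESHOLDS = {
    "faithfulness": 0.85,      # below this, hallucinations become frequent
    "answer_relevancy": 0.80,
    "context_recall": 0.75,
}

def quality_gate(scores: dict[str, float]) -> list[str]:
    """Return the metrics below target; an empty list means all targets are met."""
    return [
        name for name, minimum in THRESHOLDS.items()
        if scores.get(name, 0.0) < minimum
    ]

failures = quality_gate(
    {"faithfulness": 0.95, "answer_relevancy": 0.88, "context_recall": 0.82}
)
print("deploy blocked:" if failures else "all metrics on target", failures)
# all metrics on target []
```

Running this on every document-set change is the practical answer to the "RAG without continuous evaluation" anti-pattern discussed next.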
Anti-patterns to Avoid
- Fixed chunk size for all documents: structured documents (FAQ, API docs) require different chunking than narrative text
- Semantic retrieval only: fails on exact technical terms; use hybrid search (BM25 + semantic)
- No reranking: the top-k vectors are not necessarily the most useful; a cross-encoder improves precision by 15-20%
- RAG without continuous evaluation: quality degrades as documents change; monitor faithfulness in production
- Fine-tuning as first choice: it's expensive and slow; RAG is almost always the right first move
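On the hybrid-search point: a common way to merge a BM25 ranking with a semantic ranking, without having to reconcile their incompatible score scales, is Reciprocal Rank Fusion (RRF). A self-contained sketch (the document IDs are hypothetical):

```python
# Reciprocal Rank Fusion: each ranking contributes 1/(k + rank) per document,
# so documents that appear high in BOTH rankings rise to the top.
# k=60 is the constant conventionally used in the RRF literature.
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_top = ["doc_api", "doc_faq", "doc_guide"]       # lexical: exact term matches
semantic_top = ["doc_guide", "doc_api", "doc_blog"]  # embedding similarity
print(rrf_fuse([bm25_top, semantic_top]))
# doc_api comes first: it ranks high in both lists
```

Most vector databases in the recommended stack (Qdrant included) ship hybrid search natively, so in practice you configure it rather than implement it; the sketch just shows why fusing the two signals beats either one alone.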
Conclusions and Next Steps
System design for GenAI applications requires architectural choices that go far beyond model selection. The 2026 rule of thumb: always start with RAG + prompt engineering, measure quality with RAGAS, and add fine-tuning only if the quality gap persists after retrieval optimization.
In the next articles in this series we will explore each component in detail: selecting the right vector database, chunking strategies, hybrid search, and agent architectures with LangGraph.