Prompt Engineering in Production: Templates, Versioning and Testing
Prompt engineering is often treated as an experimental activity: try something, it works, move on. In production, this approach fails systematically. A prompt change can degrade response quality without anyone noticing. A prompt optimized for GPT-4 can give poor results on GPT-4o-mini. A template that works in English may fail in Italian.
In this article we treat prompt engineering as an engineering discipline: advanced techniques (Chain-of-Thought, Few-Shot, Constitutional AI), template systems with variables and composition, prompt versioning with A/B evaluation, automated testing, and quality monitoring in production. Every section includes executable Python code and patterns tested on real systems.
What You Will Learn
- Advanced techniques: Chain-of-Thought, Few-Shot Learning, Tree-of-Thought
- Template system with variables, composition and inheritance
- Prompt versioning with performance tracking
- A/B testing prompts with statistical significance
- Constitutional AI and guardrails for safe outputs
- Prompts for structured outputs (JSON, XML, Markdown)
- Automated prompt testing with LLM-as-judge
- Monitoring prompt quality in production
1. Advanced Prompting Techniques
1.1 Chain-of-Thought (CoT)
Chain-of-Thought (Wei et al., 2022) is one of the most impactful modern prompting techniques: asking the model to "show its reasoning" before giving the answer significantly increases accuracy on multi-step problems.
# STANDARD PROMPTING - poor accuracy on complex reasoning
standard_prompt = """
A customer owes $1,200. They have already paid 30%. How much do they still owe?
"""
# Typical answer: "$840" (often correct but no guarantees)
# CHAIN-OF-THOUGHT - high accuracy
cot_prompt = """
A customer owes $1,200. They have already paid 30%. How much do they still owe?
Think step by step:
1. First calculate how much they have already paid
2. Then calculate the remaining amount
3. Provide the final answer
"""
# Answer:
# "1. Already paid: $1,200 * 30/100 = $360
# 2. Remaining: $1,200 - $360 = $840
# 3. The customer still owes $840."
# ZERO-SHOT CoT - just add "Let's think step by step"
zero_shot_cot = """
A customer owes $1,200. They have already paid 30%. How much do they still owe?
Let's think step by step:
"""
# The model autonomously generates step-by-step reasoning
# Python implementation
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
cot_template = ChatPromptTemplate.from_template("""
{context}
Question: {question}
Reason step by step before answering. Structure your response as:
REASONING:
[your detailed reasoning]
FINAL ANSWER:
[concise answer]
""")
chain = cot_template | llm
response = chain.invoke({
"context": "Base price is $1000, with 22% VAT and 10% loyalty discount.",
"question": "What is the final price to pay?"
})
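The REASONING / FINAL ANSWER layout above makes the response easy to post-process. A minimal sketch of such a parser (the `parse_cot_response` helper and the sample text are illustrative, not part of LangChain):

```python
import re

def parse_cot_response(text: str) -> dict:
    """Split a CoT response into its REASONING and FINAL ANSWER sections."""
    match = re.search(
        r"REASONING:\s*(?P<reasoning>.*?)\s*FINAL ANSWER:\s*(?P<answer>.*)",
        text,
        re.DOTALL,
    )
    if not match:
        # Fallback: if the model ignored the layout, treat everything as the answer
        return {"reasoning": "", "answer": text.strip()}
    return {
        "reasoning": match.group("reasoning").strip(),
        "answer": match.group("answer").strip(),
    }

raw = """REASONING:
Base price $1000, plus 22% VAT = $1220; 10% discount on $1220 = $1098.
FINAL ANSWER:
$1098"""
parsed = parse_cot_response(raw)
print(parsed["answer"])  # $1098
```

Keeping the parsing separate from the prompt means the downstream code only ever sees the concise answer, while the full reasoning remains available for logging and debugging.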
1.2 Few-Shot Learning
Few-Shot prompting includes input-output examples in the prompt to guide model behavior. It is particularly effective for tasks with specific output formats or specialized domains where the model has limited knowledge.
from langchain_core.prompts import FewShotChatMessagePromptTemplate
from langchain_core.example_selectors import SemanticSimilarityExampleSelector
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
# Example library (sentiment classification task)
examples = [
{
"input": "Product arrived broken and support doesn't respond.",
"output": "NEGATIVE - Product and customer support issue"
},
{
"input": "Super fast delivery and product exactly as described!",
"output": "POSITIVE - Delivery and product satisfaction"
},
{
"input": "Price is average, nothing special.",
"output": "NEUTRAL - Price assessment"
},
{
"input": "Excellent quality, will definitely buy again.",
"output": "POSITIVE - High quality, customer loyalty"
},
{
"input": "Slow shipping, product ok but could have been better.",
"output": "MIXED - Shipping issue, acceptable product"
},
]
# Semantic selector: choose 3 most similar examples to the query
example_selector = SemanticSimilarityExampleSelector.from_examples(
examples, OpenAIEmbeddings(), FAISS, k=3
)
example_prompt = ChatPromptTemplate.from_messages([
("human", "{input}"),
("ai", "{output}")
])
few_shot_prompt = FewShotChatMessagePromptTemplate(
example_selector=example_selector,
example_prompt=example_prompt,
)
final_prompt = ChatPromptTemplate.from_messages([
("system", """You are an e-commerce review sentiment analyst.
Classify sentiment as: POSITIVE, NEGATIVE, NEUTRAL, or MIXED.
Always include the main category of the issue/strength."""),
few_shot_prompt,
("human", "{input}")
])
chain = final_prompt | llm
result = chain.invoke({"input": "Great product but terrible packaging."})
# Output: "MIXED - Product quality vs packaging issue"
1.3 Structured Output and Function Calling
One of the most important patterns in production: forcing the LLM to produce structured output (JSON, Pydantic models) instead of free text. This eliminates manual parsing and dramatically reduces format errors.
from pydantic import BaseModel, Field
from typing import List, Optional, Literal
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
class ProductAnalysis(BaseModel):
"""Structured product review analysis"""
sentiment: Literal["positive", "negative", "neutral", "mixed"] = Field(
description="Overall review sentiment"
)
score: int = Field(description="Score from 1 to 10", ge=1, le=10)
strengths: List[str] = Field(
description="List of mentioned strengths", default_factory=list
)
weaknesses: List[str] = Field(
description="List of mentioned weaknesses", default_factory=list
)
category: str = Field(
description="Main category (e.g. 'quality', 'shipping', 'support')"
)
suggested_response: Optional[str] = Field(
description="Suggested response for support team (if negative sentiment)",
default=None
)
# LLM with structured output
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
structured_llm = llm.with_structured_output(ProductAnalysis)
prompt = ChatPromptTemplate.from_template("""
Analyze this product review and classify the sentiment.
Review: {review}
Extract all required information precisely.""")
chain = prompt | structured_llm
# Output is already a validated Pydantic object
result: ProductAnalysis = chain.invoke({
"review": "Excellent quality product, slow shipping. Support responded quickly."
})
print(f"Sentiment: {result.sentiment}")
print(f"Score: {result.score}/10")
print(f"Strengths: {result.strengths}")
print(f"Weaknesses: {result.weaknesses}")
# Example output (structure guaranteed by the schema; exact values are illustrative):
# Sentiment: mixed
# Score: 6/10
# Strengths: ['product quality', 'responsive support']
# Weaknesses: ['slow shipping']
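The `ge`/`le` bounds on `score` are enforced by Pydantic at validation time, so an out-of-range value coming back from the model fails fast instead of propagating silently. A standalone check, with no LLM call involved:

```python
from pydantic import BaseModel, Field, ValidationError

class Scored(BaseModel):
    score: int = Field(description="Score from 1 to 10", ge=1, le=10)

# A valid value passes through unchanged
assert Scored(score=7).score == 7

# An out-of-range value raises ValidationError instead of slipping through
try:
    Scored(score=15)
    print("accepted (unexpected)")
except ValidationError:
    print("rejected: score must be between 1 and 10")
```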
2. Production Template System
In production, prompts must be managed as first-class resources: versioned, testable, composable. A robust template system allows updating prompts without modifying application code.
from dataclasses import dataclass, field
from typing import Dict, List, Optional
from datetime import datetime
import hashlib
from pathlib import Path
@dataclass
class PromptVersion:
"""A specific version of a prompt"""
version: str
template: str
variables: List[str]
description: str
created_at: datetime = field(default_factory=datetime.now)
tags: List[str] = field(default_factory=list)
performance_metrics: Dict[str, float] = field(default_factory=dict)
is_active: bool = True
@property
def template_hash(self) -> str:
return hashlib.md5(self.template.encode()).hexdigest()[:8]
def render(self, **kwargs) -> str:
"""Render the template with provided variables"""
try:
return self.template.format(**kwargs)
except KeyError as e:
raise ValueError(f"Missing variable in template: {e}")
class PromptRegistry:
"""Centralized registry for production prompt management"""
def __init__(self, storage_path: str = "./prompts"):
self.storage_path = Path(storage_path)
self.storage_path.mkdir(exist_ok=True)
self.prompts: Dict[str, List[PromptVersion]] = {}
    def register(self, name: str, template: str, variables: List[str],
                 description: str = "", version: Optional[str] = None,
                 tags: Optional[List[str]] = None) -> PromptVersion:
"""Register a new prompt version"""
if name not in self.prompts:
self.prompts[name] = []
if version is None:
version = f"v{len(self.prompts[name]) + 1:03d}"
# Deactivate previous active version
for p in self.prompts[name]:
p.is_active = False
pv = PromptVersion(
version=version, template=template, variables=variables,
description=description, tags=tags or []
)
self.prompts[name].append(pv)
return pv
    def get(self, name: str, version: Optional[str] = None) -> PromptVersion:
"""Get a prompt version (default: latest active)"""
if name not in self.prompts:
raise KeyError(f"Prompt '{name}' not found")
if version is None:
active = [p for p in self.prompts[name] if p.is_active]
if not active:
raise ValueError(f"No active version for '{name}'")
return active[-1]
for p in self.prompts[name]:
if p.version == version:
return p
raise KeyError(f"Version '{version}' not found for '{name}'")
def rollback(self, name: str, version: str) -> PromptVersion:
"""Rollback to a previous version"""
target = self.get(name, version)
for p in self.prompts[name]:
p.is_active = False
target.is_active = True
return target
# Usage
registry = PromptRegistry()
registry.register(
name="rag_answer",
template="""You are a technical assistant. Answer based ONLY on the context.
Context: {context}
Question: {question}
Answer:""",
variables=["context", "question"],
description="RAG base prompt v1"
)
registry.register(
name="rag_answer",
template="""You are a precise technical assistant. Answer based EXCLUSIVELY
on the provided context. If the context does not contain the answer, say so explicitly.
Context:
{context}
Question: {question}
Provide a concise and accurate answer:""",
variables=["context", "question"],
description="RAG prompt v2 - clearer fallback behavior"
)
prompt = registry.get("rag_answer")
rendered = prompt.render(context="...", question="...")
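The `template_hash` property gives a cheap way to tell whether a template's text actually changed between versions (useful for cache keys and audit trails). The idea in isolation, with the `short_hash` helper as a stand-in:

```python
import hashlib

def short_hash(template: str) -> str:
    """First 8 hex chars of the MD5 of the template text."""
    return hashlib.md5(template.encode()).hexdigest()[:8]

v1 = "Answer based ONLY on the context.\n{context}\n{question}"
v2 = "Answer based EXCLUSIVELY on the context.\n{context}\n{question}"

assert short_hash(v1) == short_hash(v1)   # deterministic
assert short_hash(v1) != short_hash(v2)   # any text change -> new hash
```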
3. A/B Testing Prompts
A/B testing allows comparing two prompt versions on real traffic with statistical significance. It is essential for validating that a prompt change actually improves quality rather than degrading it.
from scipy import stats
import numpy as np
import hashlib
import random
from typing import Optional
class PromptABTest:
"""A/B testing framework for prompts with statistical significance"""
def __init__(self, prompt_a, prompt_b,
traffic_split=0.5, min_samples=100):
self.prompt_a = prompt_a
self.prompt_b = prompt_b
self.traffic_split = traffic_split
self.min_samples = min_samples
self.results_a = []
self.results_b = []
    def assign_variant(self, user_id: Optional[str] = None):
"""Assign variant: deterministic if user_id provided"""
if user_id:
h = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
use_b = (h % 100) < (self.traffic_split * 100)
else:
use_b = random.random() < self.traffic_split
return ("B", self.prompt_b) if use_b else ("A", self.prompt_a)
def record_result(self, variant: str, score: float):
(self.results_b if variant == "B" else self.results_a).append(score)
def is_statistically_significant(self, alpha=0.05) -> bool:
if (len(self.results_a) < self.min_samples or
len(self.results_b) < self.min_samples):
return False
_, p_value = stats.ttest_ind(self.results_a, self.results_b)
return p_value < alpha
def get_winner(self) -> Optional[str]:
if not self.is_statistically_significant():
return None
return "B" if np.mean(self.results_b) > np.mean(self.results_a) else "A"
def report(self) -> dict:
mean_a = np.mean(self.results_a) if self.results_a else 0
mean_b = np.mean(self.results_b) if self.results_b else 0
p_value = None
if len(self.results_a) > 1 and len(self.results_b) > 1:
_, p_value = stats.ttest_ind(self.results_a, self.results_b)
winner = self.get_winner()
return {
"samples_a": len(self.results_a),
"samples_b": len(self.results_b),
"mean_score_a": mean_a,
"mean_score_b": mean_b,
"improvement_pct": ((mean_b - mean_a) / mean_a * 100) if mean_a > 0 else 0,
"p_value": p_value,
"is_significant": self.is_statistically_significant(),
"winner": winner,
"recommendation": "Deploy B" if winner == "B" else
"Keep A" if winner == "A" else "Collect more data"
}
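A quick sanity check of the two core mechanisms, deterministic bucketing and the significance test, using simulated scores rather than real traffic (the `bucket` helper mirrors `assign_variant` above; the distributions and effect size are invented for illustration):

```python
import hashlib
import numpy as np
from scipy import stats

def bucket(user_id: str, traffic_split: float = 0.5) -> str:
    """Deterministic variant assignment: same user always lands in the same bucket."""
    h = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return "B" if (h % 100) < traffic_split * 100 else "A"

# The same user is always assigned the same variant
assert all(bucket("user-42") == bucket("user-42") for _ in range(5))

# Simulated quality scores on a 1-5 scale: B is genuinely better by +0.3
rng = np.random.default_rng(0)
scores_a = rng.normal(3.5, 0.5, 200)
scores_b = rng.normal(3.8, 0.5, 200)

_, p_value = stats.ttest_ind(scores_a, scores_b)
winner = "B" if p_value < 0.05 and scores_b.mean() > scores_a.mean() else None
print(f"p={p_value:.4f}, winner={winner}")
```

With 200 samples per arm and a 0.3 effect on a 0.5 standard deviation, the test is decisively significant; with smaller effects or fewer samples, the framework correctly returns no winner and asks for more data.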
4. Automated Testing with LLM-as-Judge
LLM-as-judge is one of the most scalable patterns for evaluating prompt quality: use an LLM (often more capable than the one being tested) to automatically evaluate outputs, approximating human evaluation at a fraction of the cost.
from pydantic import BaseModel, Field
from typing import List
from langchain_openai import ChatOpenAI
import numpy as np
class JudgeScore(BaseModel):
    """Structured score from the LLM judge.
    Field descriptions are part of the schema sent to the judge model."""
    accuracy: int = Field(description="Factual accuracy (1-5)", ge=1, le=5)
    relevance: int = Field(description="Relevance to the question (1-5)", ge=1, le=5)
    clarity: int = Field(description="Response clarity (1-5)", ge=1, le=5)
    completeness: int = Field(description="Response completeness (1-5)", ge=1, le=5)
    overall: int = Field(description="Overall assessment (1-5)", ge=1, le=5)
    reasoning: str = Field(description="Explanation of the scores")
    issues: List[str] = Field(description="List of issues found", default_factory=list)
class PromptTester:
"""Automated testing framework for LLM prompts"""
def __init__(self, model_under_test, judge_model=None):
self.model = model_under_test
self.judge = judge_model or ChatOpenAI(model="gpt-4o-mini", temperature=0)
self.structured_judge = self.judge.with_structured_output(JudgeScore)
def evaluate_single(self, prompt: str, input_vars: dict,
expected_output: str = None) -> JudgeScore:
"""Evaluate a single prompt output"""
rendered = prompt.format(**input_vars)
actual_output = self.model.invoke(rendered).content
judge_prompt = f"""Evaluate this AI response on a 1-5 scale for each dimension.
Input: {input_vars.get('question', rendered[:200])}
Generated response: {actual_output}
{f"Expected response: {expected_output}" if expected_output else ""}
Evaluate objectively with specific examples for each point."""
return self.structured_judge.invoke(judge_prompt)
def run_test_suite(self, prompt: str, test_cases: list) -> dict:
"""Run a test suite on a prompt"""
scores = []
failed_cases = []
for i, tc in enumerate(test_cases):
try:
score = self.evaluate_single(
prompt=prompt,
input_vars=tc["input"],
expected_output=tc.get("expected")
)
scores.append(score)
if score.overall < 3:
failed_cases.append({"index": i, "score": score.overall,
"issues": score.issues})
except Exception as e:
failed_cases.append({"index": i, "error": str(e)})
if not scores:
return {"error": "No tests completed"}
avg_overall = np.mean([s.overall for s in scores])
return {
"total_tests": len(test_cases),
"completed": len(scores),
"avg_overall": avg_overall,
"avg_accuracy": np.mean([s.accuracy for s in scores]),
"avg_relevance": np.mean([s.relevance for s in scores]),
"pass_rate": len([s for s in scores if s.overall >= 3]) / len(scores),
"failed_cases": failed_cases,
"recommendation": "DEPLOY" if avg_overall >= 4.0
else "REVIEW" if avg_overall >= 3.0
else "REJECT"
}
# Example test suite for a RAG prompt
test_suite = [
{
"input": {
"context": "LangChain is a Python framework for building LLM applications.",
"question": "What is LangChain?"
},
"expected": "LangChain is a framework for building LLM-based applications"
},
{
"input": {
"context": "The base subscription price is $29 per month.",
"question": "What is the price of the premium subscription?"
},
"expected": "The context does not specify the premium subscription price"
},
]
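The DEPLOY / REVIEW / REJECT decision in `run_test_suite` reduces to threshold logic over the judge's overall scores. A standalone sketch with hand-picked scores and no LLM calls (the `recommend` helper is illustrative; the thresholds mirror those above):

```python
import numpy as np

def recommend(overall_scores: list) -> dict:
    """Aggregate 1-5 judge scores into a deploy recommendation."""
    avg = float(np.mean(overall_scores))
    pass_rate = sum(s >= 3 for s in overall_scores) / len(overall_scores)
    rec = "DEPLOY" if avg >= 4.0 else "REVIEW" if avg >= 3.0 else "REJECT"
    return {"avg_overall": avg, "pass_rate": pass_rate, "recommendation": rec}

print(recommend([5, 4, 4, 5, 3]))   # avg 4.2 -> DEPLOY
print(recommend([3, 3, 2, 4, 3]))   # avg 3.0 -> REVIEW
```

Keeping the aggregation separate from the judging makes the thresholds easy to tune per use case: a customer-facing prompt might require a higher bar than an internal tool.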
5. Best Practices and Anti-Patterns
Prompt Engineering in Production Best Practices
- Version prompts like code: every prompt change is a potential breaking change. Use semantic versioning, maintain a changelog, and test before deploying.
- Use structured output whenever possible: JSON/Pydantic eliminate manual parsing and format surprises. More reliable than regex parsing of free text.
- Test with edge cases: the most critical questions are out-of-distribution ones. Your test set must include edge cases, ambiguous questions, and malformed inputs.
- Chain-of-Thought for complex tasks: for any task requiring more than one reasoning step, CoT significantly improves quality (often +20-40%).
- Diverse few-shot examples: include examples covering all main cases, not just the most common. Diversity prevents systematic bias.
Anti-Patterns to Avoid
- Hardcoded prompts in code: prompts in source code are impossible to update in production without a deploy. Use a registry or configuration file.
- No testing before deploy: a prompt that works on 10 manual examples can fail on real edge cases. Build test suites with at least 50-100 diverse cases.
- High temperature for deterministic tasks: for classification, data extraction and analysis always use temperature=0. Variability is the enemy of consistency.
- Overly long unstructured prompts: beyond 500 tokens, models tend to lose information in the middle. Structure prompts with clear sections and bullet points.
Conclusions
Prompt engineering in production requires the same rigor as traditional software engineering: versioning, testing, monitoring and controlled deployment. We covered advanced techniques (CoT, Few-Shot, Structured Output), prompt management with a versioned registry, A/B testing with statistical significance, and automated evaluation with LLM-as-judge.
Key takeaways:
- Chain-of-Thought significantly improves accuracy on complex tasks
- Structured output (Pydantic) eliminates parsing errors and guarantees format
- Prompts must be versioned, tested and deployed like any code
- A/B testing with statistical significance before promoting new versions
- LLM-as-judge scales quality evaluation at reduced cost
Continue the Series
- Article 8: Multi-Agent Systems
- Article 9: Prompt Engineering in Production (current)
- Article 10: Knowledge Graphs for AI