Context Window Management: Optimizing LLM Input
The context window is the maximum number of tokens an LLM can process in a single call. GPT-4 Turbo and GPT-4o offer 128K tokens, Claude 3 200K, Gemini 1.5 Pro up to 1 million. These are enormous numbers, yet complex RAG systems and long conversations hit these limits regularly. When that happens, older context must be truncated and critical information is lost. And costs? At $10 per million input tokens, a 100K-token prompt on GPT-4 Turbo runs about $1 per call. In production, with thousands of queries per day, this quickly becomes unsustainable.
Context Window Management is the art of maximizing LLM response quality while optimizing the use of available context. It is not just about fitting everything in the window: it is about deciding what to include, how to structure it, and how much space to allocate to each component. In this article we explore all the techniques: from token counting and budgeting, to context compression, to memory management for long conversations.
What You Will Learn
- How the context window works and why it is critical for RAG
- Precise token counting with tiktoken for OpenAI and open-source models
- Context budgeting: allocating the token budget across system, history, context and query
- Context compression with LLMLingua and summarization techniques
- Memory management for long conversations (sliding window, summary memory)
- Lost in the Middle: why position in context matters
- Intelligent truncation strategies for RAG
- Monitoring token usage and cost optimization
1. How the Context Window Works
A Transformer-based LLM processes input as a sequence of tokens: text units that correspond to roughly three quarters of an English word. The context window defines the maximum number of tokens the model can handle across the entire call (prompt plus response).
# Models and their context windows (2025)
CONTEXT_WINDOWS = {
    # OpenAI
    "gpt-4o": 128_000,
    "gpt-4o-mini": 128_000,
    "gpt-3.5-turbo": 16_385,
    # Anthropic
    "claude-3-opus": 200_000,
    "claude-3-sonnet": 200_000,
    "claude-3-haiku": 200_000,
    # Google
    "gemini-1.5-pro": 1_000_000,
    "gemini-1.5-flash": 1_000_000,
    # Open Source
    "llama-3.1-8b": 128_000,
    "mistral-7b-v0.3": 32_768,
}

# Tokenization rules of thumb:
# - English: ~1 token per 4 chars (750 words ~ 1000 tokens)
# - Code: ~1 token per 3.5 chars
# - Unicode / special chars: more tokens per char

# Typical context distribution in RAG:
CONTEXT_BUDGET_EXAMPLE = {
    "total_tokens": 128_000,
    "system_prompt": 500,        # ~0.4%
    "chat_history": 10_000,      # ~8%
    "retrieved_context": 8_000,  # ~6%
    "user_query": 200,           # ~0.2%
    "safety_margin": 2_000,      # ~1.6%
    "response_space": 107_300,   # ~84% available for the response
}
1.1 The "Lost in the Middle" Problem
A surprising research finding (Liu et al., 2023, "Lost in the Middle") shows that LLMs are very good at remembering information at the beginning and end of the context, but tend to "lose" information positioned in the middle. This has direct implications for how RAG context is structured.
# Average effectiveness by position in context (Liu et al. 2023 study)
# On multi-document QA tasks with 10-20 documents:
POSITION_PERFORMANCE = {
    "first_document": 85,     # % accuracy
    "second": 82,
    "third": 78,
    # ... degradation in the middle
    "middle_of_context": 55,  # minimum!
    # ... recovery at the end
    "penultimate": 79,
    "last_document": 84,
}

# STRATEGIES to mitigate "Lost in the Middle":
# 1. Place the MOST CRITICAL information at the beginning or end
# 2. Limit documents in context (5-10 max)
# 3. Repeat crucial info at both beginning and end
# 4. Sort by decreasing relevance (most relevant first)

def sort_chunks_for_context(chunks_with_scores):
    """
    Sort chunks to maximize LLM attention.
    Strategy: most relevant first, second most relevant last.
    """
    sorted_chunks = sorted(chunks_with_scores, key=lambda x: x[1], reverse=True)
    if len(sorted_chunks) <= 2:
        return sorted_chunks
    # "Sandwich" pattern: most relevant first, second most relevant last
    reordered = [sorted_chunks[0]]      # Most relevant: first
    middle = sorted_chunks[2:]          # Less critical: middle
    reordered.extend(middle)
    reordered.append(sorted_chunks[1])  # Second most relevant: last
    return reordered
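As a quick sanity check, here is the sandwich pattern applied to four scored chunks. The filenames and scores are invented for illustration, and the function body is restated in condensed form so the snippet runs on its own.

```python
def sort_chunks_for_context(chunks_with_scores):
    """Sandwich pattern: most relevant first, second most relevant last."""
    sorted_chunks = sorted(chunks_with_scores, key=lambda x: x[1], reverse=True)
    if len(sorted_chunks) <= 2:
        return sorted_chunks
    return [sorted_chunks[0], *sorted_chunks[2:], sorted_chunks[1]]

chunks = [("pricing.md", 0.62), ("refunds.md", 0.91),
          ("shipping.md", 0.85), ("faq.md", 0.73)]
print([name for name, _ in sort_chunks_for_context(chunks)])
# → ['refunds.md', 'faq.md', 'pricing.md', 'shipping.md']
```

Note how `shipping.md`, the second-best match, moves from position 2 to the end of the context, where the model attends to it most reliably.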
2. Precise Token Counting with Tiktoken
Before managing the token budget, you need to count tokens precisely. OpenAI's tiktoken library implements the exact tokenizer used by GPT models. For open-source models, each model has its own tokenizer.
import tiktoken
from typing import List, Dict

class TokenCounter:
    """Precise token counter for different LLM models"""

    ENCODING_MAP = {
        "gpt-4o": "o200k_base",
        "gpt-4o-mini": "o200k_base",
        "gpt-4": "cl100k_base",
        "gpt-3.5-turbo": "cl100k_base",
        "text-embedding-3-small": "cl100k_base",
    }

    def __init__(self, model: str = "gpt-4o-mini"):
        self.model = model
        encoding_name = self.ENCODING_MAP.get(model, "cl100k_base")
        self.encoding = tiktoken.get_encoding(encoding_name)

    def count_tokens(self, text: str) -> int:
        """Count tokens in a text string"""
        return len(self.encoding.encode(text))

    def count_message_tokens(self, messages: List[Dict]) -> int:
        """
        Count tokens in an OpenAI messages list,
        including per-message overhead tokens.
        """
        tokens_per_message = 3  # <|start|>role<|sep|>
        tokens_per_name = 1
        tokens_reply = 3  # response starts with <|start|>assistant
        num_tokens = tokens_reply
        for message in messages:
            num_tokens += tokens_per_message
            for key, value in message.items():
                num_tokens += self.count_tokens(str(value))
                if key == "name":
                    num_tokens += tokens_per_name
        return num_tokens

    def truncate_to_limit(self, text: str, max_tokens: int) -> str:
        """Truncate text to a maximum token count.
        Note: the appended suffix adds a few tokens of its own;
        leave a small margin if the limit is strict."""
        tokens = self.encoding.encode(text)
        if len(tokens) <= max_tokens:
            return text
        return self.encoding.decode(tokens[:max_tokens]) + "... [truncated]"

    def estimate_cost(self, prompt_tokens: int, completion_tokens: int) -> dict:
        """Estimate cost for OpenAI models (2025 pricing)"""
        PRICES_PER_1M = {
            "gpt-4o": {"prompt": 5.0, "completion": 15.0},
            "gpt-4o-mini": {"prompt": 0.15, "completion": 0.60},
            "gpt-4-turbo": {"prompt": 10.0, "completion": 30.0},
        }
        prices = PRICES_PER_1M.get(self.model, {"prompt": 1.0, "completion": 3.0})
        prompt_cost = (prompt_tokens / 1_000_000) * prices["prompt"]
        completion_cost = (completion_tokens / 1_000_000) * prices["completion"]
        return {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_cost_usd": prompt_cost + completion_cost
        }

# Usage
counter = TokenCounter("gpt-4o-mini")
text = "This is an example text for RAG systems."
print(f"Tokens: {counter.count_tokens(text)}")
cost = counter.estimate_cost(prompt_tokens=5000, completion_tokens=500)
print(f"Estimated cost: ${cost['total_cost_usd']:.6f}")
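The memory-management techniques listed at the start (sliding window, summary memory) build directly on this counter. As a sketch, a sliding-window trimmer that always keeps the system message plus the most recent messages fitting a token budget might look like this. The function name and the word-count stand-in are my own; in practice you would pass `TokenCounter.count_tokens` as the `count` callable.

```python
from typing import Callable, Dict, List

def sliding_window_history(messages: List[Dict], max_tokens: int,
                           count: Callable[[str], int]) -> List[Dict]:
    """Keep the system message (index 0) plus the newest messages
    whose combined token count fits within max_tokens."""
    system, rest = messages[0], messages[1:]
    budget = max_tokens - count(system["content"])
    kept: List[Dict] = []
    for msg in reversed(rest):          # walk from newest to oldest
        cost = count(msg["content"])
        if cost > budget:
            break                        # oldest messages fall out of the window
        kept.append(msg)
        budget -= cost
    return [system] + kept[::-1]         # restore chronological order

history = [
    {"role": "system", "content": "system prompt"},
    {"role": "user", "content": "one two three"},
    {"role": "assistant", "content": "four five"},
    {"role": "user", "content": "six"},
]
# Word count as a crude token stand-in, for a deterministic demo
trimmed = sliding_window_history(history, max_tokens=5,
                                 count=lambda t: len(t.split()))
print([m["content"] for m in trimmed])  # ['system prompt', 'four five', 'six']
```

Dropping whole messages from the oldest end, rather than truncating mid-message, keeps each remaining turn coherent; summary memory (compressing the dropped turns into a short recap) is the usual complement when old context still matters.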