Context Window Management: Optimizing LLM Input
The context window is the maximum number of tokens an LLM can process in a single call. GPT-4 Turbo and GPT-4o offer 128K tokens, Claude 3 200K, Gemini 1.5 Pro up to 1 million. These are enormous numbers, yet complex RAG systems and long conversations hit these limits regularly. When that happens, older context must be truncated and critical information is lost. And costs? At $10 per million input tokens, a 100K-token prompt on GPT-4 Turbo runs about $1 per call. In production, with thousands of queries per day, this quickly becomes unsustainable.
Context Window Management is the art of maximizing LLM response quality while optimizing the use of available context. It is not just about fitting everything in the window: it is about deciding what to include, how to structure it, and how much space to allocate to each component. In this article we explore all the techniques: from token counting and budgeting, to context compression, to memory management for long conversations.
What You Will Learn
- How the context window works and why it is critical for RAG
- Precise token counting with tiktoken for OpenAI and open-source models
- Context budgeting: allocating the token budget across system, history, context and query
- Context compression with LLMLingua and summarization techniques
- Memory management for long conversations (sliding window, summary memory)
- Lost in the Middle: why position in context matters
- Intelligent truncation strategies for RAG
- Monitoring token usage and cost optimization
1. How the Context Window Works
A Transformer-based LLM processes input as a sequence of tokens: text units that correspond to roughly three quarters of an English word. The context window defines the maximum number of tokens the model can handle across the entire call (prompt plus response).
# Models and their context windows (2025)
CONTEXT_WINDOWS = {
    # OpenAI
    "gpt-4o": 128_000,
    "gpt-4o-mini": 128_000,
    "gpt-3.5-turbo": 16_385,
    # Anthropic
    "claude-3-opus": 200_000,
    "claude-3-sonnet": 200_000,
    "claude-3-haiku": 200_000,
    # Google
    "gemini-1.5-pro": 1_000_000,
    "gemini-1.5-flash": 1_000_000,
    # Open Source
    "llama-3.1-8b": 128_000,
    "mistral-7b-v0.3": 32_768,
}

# Tokenization rules of thumb:
# - English: ~1 token per 4 chars (750 words ~ 1000 tokens)
# - Code: ~1 token per 3.5 chars
# - Unicode / special chars: more tokens per char

# Typical context distribution in RAG:
CONTEXT_BUDGET_EXAMPLE = {
    "total_tokens": 128_000,
    "system_prompt": 500,        # ~0.4%
    "chat_history": 10_000,      # ~8%
    "retrieved_context": 8_000,  # ~6%
    "user_query": 200,           # ~0.2%
    "safety_margin": 2_000,      # ~1.6%
    "response_space": 107_300,   # ~84% available for the response
}
1.1 The "Lost in the Middle" Problem
A surprising research finding (Liu et al., 2023, "Lost in the Middle") shows that LLMs are very good at remembering information at the beginning and end of the context, but tend to "lose" information positioned in the middle. This has direct implications for how RAG context is structured.
# Average effectiveness by position in context (Liu et al. 2023 study)
# On multi-document QA tasks with 10-20 documents:
POSITION_PERFORMANCE = {
    "first_document": 85,     # % accuracy
    "second": 82,
    "third": 78,
    # ... degradation in the middle
    "middle_of_context": 55,  # minimum!
    # ... recovery at the end
    "penultimate": 79,
    "last_document": 84,
}

# STRATEGIES to mitigate "Lost in the Middle":
# 1. Place the MOST CRITICAL information at the beginning or end
# 2. Limit documents in context (5-10 max)
# 3. Repeat crucial info at both beginning and end
# 4. Sort by decreasing relevance (most relevant first)

def sort_chunks_for_context(chunks_with_scores):
    """
    Sort chunks to maximize LLM attention.
    Strategy: most relevant first, second most relevant last.
    """
    sorted_chunks = sorted(chunks_with_scores, key=lambda x: x[1], reverse=True)
    if len(sorted_chunks) <= 2:
        return sorted_chunks
    # "Sandwich" pattern: most relevant first, second most relevant last
    reordered = [sorted_chunks[0]]      # Most relevant: first
    middle = sorted_chunks[2:]          # Less critical: middle
    reordered.extend(middle)
    reordered.append(sorted_chunks[1])  # Second most relevant: last
    return reordered
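As a quick sanity check, here is the sandwich pattern applied to four scored chunks. The filenames and scores are invented for illustration, and the function body is restated in condensed form so the snippet runs on its own.

```python
def sort_chunks_for_context(chunks_with_scores):
    """Sandwich pattern: most relevant first, second most relevant last."""
    sorted_chunks = sorted(chunks_with_scores, key=lambda x: x[1], reverse=True)
    if len(sorted_chunks) <= 2:
        return sorted_chunks
    return [sorted_chunks[0], *sorted_chunks[2:], sorted_chunks[1]]

chunks = [("pricing.md", 0.62), ("refunds.md", 0.91),
          ("shipping.md", 0.85), ("faq.md", 0.73)]
print([name for name, _ in sort_chunks_for_context(chunks)])
# → ['refunds.md', 'faq.md', 'pricing.md', 'shipping.md']
```

Note how `shipping.md`, the second-best match, moves from position 2 to the end of the context, where the model attends to it most reliably.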
2. Precise Token Counting with Tiktoken
Before managing the token budget, you need to count tokens precisely. OpenAI's tiktoken library implements the exact tokenizer used by GPT models. For open-source models, each model has its own tokenizer.
import tiktoken
from typing import List, Dict

class TokenCounter:
    """Precise token counter for different LLM models"""

    ENCODING_MAP = {
        "gpt-4o": "o200k_base",
        "gpt-4o-mini": "o200k_base",
        "gpt-4": "cl100k_base",
        "gpt-3.5-turbo": "cl100k_base",
        "text-embedding-3-small": "cl100k_base",
    }

    def __init__(self, model: str = "gpt-4o-mini"):
        self.model = model
        encoding_name = self.ENCODING_MAP.get(model, "cl100k_base")
        self.encoding = tiktoken.get_encoding(encoding_name)

    def count_tokens(self, text: str) -> int:
        """Count tokens in a text string"""
        return len(self.encoding.encode(text))

    def count_message_tokens(self, messages: List[Dict]) -> int:
        """
        Count tokens in an OpenAI messages list,
        including per-message overhead tokens.
        """
        tokens_per_message = 3  # <|start|>role<|sep|>
        tokens_per_name = 1
        tokens_reply = 3  # response starts with <|start|>assistant
        num_tokens = tokens_reply
        for message in messages:
            num_tokens += tokens_per_message
            for key, value in message.items():
                num_tokens += self.count_tokens(str(value))
                if key == "name":
                    num_tokens += tokens_per_name
        return num_tokens

    def truncate_to_limit(self, text: str, max_tokens: int) -> str:
        """Truncate text to a maximum token count.
        Note: the appended suffix adds a few tokens of its own;
        leave a small margin if the limit is strict."""
        tokens = self.encoding.encode(text)
        if len(tokens) <= max_tokens:
            return text
        return self.encoding.decode(tokens[:max_tokens]) + "... [truncated]"

    def estimate_cost(self, prompt_tokens: int, completion_tokens: int) -> dict:
        """Estimate cost for OpenAI models (2025 pricing)"""
        PRICES_PER_1M = {
            "gpt-4o": {"prompt": 5.0, "completion": 15.0},
            "gpt-4o-mini": {"prompt": 0.15, "completion": 0.60},
            "gpt-4-turbo": {"prompt": 10.0, "completion": 30.0},
        }
        prices = PRICES_PER_1M.get(self.model, {"prompt": 1.0, "completion": 3.0})
        prompt_cost = (prompt_tokens / 1_000_000) * prices["prompt"]
        completion_cost = (completion_tokens / 1_000_000) * prices["completion"]
        return {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_cost_usd": prompt_cost + completion_cost
        }

# Usage
counter = TokenCounter("gpt-4o-mini")
text = "This is an example text for RAG systems."
print(f"Tokens: {counter.count_tokens(text)}")
cost = counter.estimate_cost(prompt_tokens=5000, completion_tokens=500)
print(f"Estimated cost: ${cost['total_cost_usd']:.6f}")
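The memory-management techniques listed at the start (sliding window, summary memory) build directly on this counter. As a sketch, a sliding-window trimmer that always keeps the system message plus the most recent messages fitting a token budget might look like this. The function name and the word-count stand-in are my own; in practice you would pass `TokenCounter.count_tokens` as the `count` callable.

```python
from typing import Callable, Dict, List

def sliding_window_history(messages: List[Dict], max_tokens: int,
                           count: Callable[[str], int]) -> List[Dict]:
    """Keep the system message (index 0) plus the newest messages
    whose combined token count fits within max_tokens."""
    system, rest = messages[0], messages[1:]
    budget = max_tokens - count(system["content"])
    kept: List[Dict] = []
    for msg in reversed(rest):          # walk from newest to oldest
        cost = count(msg["content"])
        if cost > budget:
            break                        # oldest messages fall out of the window
        kept.append(msg)
        budget -= cost
    return [system] + kept[::-1]         # restore chronological order

history = [
    {"role": "system", "content": "system prompt"},
    {"role": "user", "content": "one two three"},
    {"role": "assistant", "content": "four five"},
    {"role": "user", "content": "six"},
]
# Word count as a crude token stand-in, for a deterministic demo
trimmed = sliding_window_history(history, max_tokens=5,
                                 count=lambda t: len(t.split()))
print([m["content"] for m in trimmed])  # ['system prompt', 'four five', 'six']
```

Dropping whole messages from the oldest end, rather than truncating mid-message, keeps each remaining turn coherent; summary memory (compressing the dropped turns into a short recap) is the usual complement when old context still matters.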