Workers AI: LLM Inference and Vision Models Directly at the Edge
Workers AI saw 4,000% year-over-year growth in inference requests through Q1 2026. Learn how to run text, vision, and speech models directly in your Workers without dedicated GPUs, with latency under 200ms.
AI Inference Moved to the Edge
Until 2023, running a Large Language Model almost always meant calling an external API (OpenAI, Anthropic, Google) or deploying expensive GPUs on dedicated infrastructure. Network latency to these centralized endpoints added 200-800ms to each request, and GPU costs were prohibitive for low-volume applications.
Workers AI has changed this scenario. Cloudflare has distributed AI inference hardware (specialized GPUs) across dozens of data centers around the world. Models run on the same hardware that runs your Worker, eliminating the network round trip to external providers. The result is lower inference latency, billed per use with no infrastructure to manage.
What You Will Learn
- Models available in Workers AI: LLM, vision, speech, embeddings
- Text generation with Llama 3.1 and response streaming
- Vision models: image analysis with LLaVA
- Speech-to-text with Whisper
- Text embeddings for semantic search
- AI Gateway: caching, rate limiting and observability of AI requests
- Limits, costs and optimization strategies
Overview of Available Models
Workers AI offers a selection of open-source models optimized for inference on Cloudflare hardware. Models are identified by the prefix @cf/ or @hf/ (Hugging Face hosted):
| Category | Main models | Use case |
|---|---|---|
| Text Generation | @cf/meta/llama-3.1-8b-instruct, @cf/mistral/mistral-7b-instruct-v0.2 | Chatbots, summarization, Q&A, code generation |
| Text Generation (Large) | @cf/meta/llama-3.3-70b-instruct-fp8-fast | Complex reasoning, advanced analysis |
| Vision | @cf/llava-hf/llava-1.5-7b-hf, @cf/unum/uform-gen2-qwen-500m | Image captioning, visual Q&A |
| Speech-to-Text | @cf/openai/whisper, @cf/openai/whisper-large-v3-turbo | Audio transcription |
| Text Embeddings | @cf/baai/bge-base-en-v1.5, @cf/baai/bge-large-en-v1.5 | Semantic search, similarity, RAG |
| Image Classification | @cf/microsoft/resnet-50 | Image classification |
| Translation | @cf/meta/m2m100-1.2b | Translation across 100+ languages |
| Image Generation | @cf/stabilityai/stable-diffusion-xl-base-1.0 | Text-to-image |
Configuration: AI Binding in wrangler.toml
To use Workers AI, just add the [ai] binding to your configuration:
# wrangler.toml
name = "ai-worker"
main = "src/worker.ts"
compatibility_date = "2024-09-23"
# Binding for Workers AI
[ai]
binding = "AI"
The binding's TypeScript type must be declared in the Env interface:
// types.ts - AI binding declaration
interface Env {
AI: Ai; // Type provided by @cloudflare/workers-types
}
Text Generation with Llama 3.1
The most common use case is generating text with an instruction-following model. Let's see how to implement a chat endpoint:
// src/worker.ts - chat endpoint with Llama 3.1
export default {
async fetch(request: Request, env: Env): Promise<Response> {
if (request.method !== 'POST' || new URL(request.url).pathname !== '/chat') {
return new Response('POST /chat required', { status: 400 });
}
const { messages, stream = false } = await request.json<ChatRequest>();
// Validate the input
if (!Array.isArray(messages) || messages.length === 0) {
return Response.json({ error: 'messages array required' }, { status: 400 });
}
if (stream) {
// Streaming response: the model returns tokens as they are generated
const aiStream = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
messages,
stream: true,
max_tokens: 1024,
temperature: 0.7,
});
// Transform the AI stream into Server-Sent Events
return new Response(aiStream, {
headers: {
'Content-Type': 'text/event-stream',
'Cache-Control': 'no-cache',
'Connection': 'keep-alive',
},
});
}
// Synchronous response: waits for the full completion
const result = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
messages,
max_tokens: 1024,
temperature: 0.7,
});
return Response.json({
response: (result as AiTextGenerationOutput).response,
usage: {
// Workers AI does not yet expose token counts in the base response
model: '@cf/meta/llama-3.1-8b-instruct',
},
});
},
};
interface ChatRequest {
messages: Array<{ role: 'system' | 'user' | 'assistant'; content: string }>;
stream?: boolean;
}
interface Env {
AI: Ai;
}
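When stream: true is set, the body returned by Workers AI is a stream of Server-Sent Events whose data: payloads carry JSON with a response field containing the next token, terminated by data: [DONE]. A client-side parsing sketch for one buffered chunk — extractTokens is a hypothetical helper, not part of any SDK, and the event shape is an assumption based on the commonly documented @cf text-generation format:

```typescript
// Extract generated tokens from a buffered chunk of SSE data.
// Assumes each event line looks like: data: {"response":"<token>"}
// and the stream terminates with: data: [DONE]
function extractTokens(sseChunk: string): string[] {
  const tokens: string[] = [];
  for (const line of sseChunk.split('\n')) {
    const trimmed = line.trim();
    if (!trimmed.startsWith('data:')) continue; // skip blank lines and comments
    const payload = trimmed.slice('data:'.length).trim();
    if (payload === '[DONE]') break; // end-of-stream sentinel
    try {
      const parsed = JSON.parse(payload) as { response?: string };
      if (typeof parsed.response === 'string') tokens.push(parsed.response);
    } catch {
      // Ignore partial JSON from an event split across chunks
    }
  }
  return tokens;
}
```

A real client would also buffer across reads, carrying any incomplete trailing line over to the next chunk.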
Here is a more complete example that implements an assistant with a system prompt and robust error handling:
// src/assistant-worker.ts
const SYSTEM_PROMPT = `You are a technical assistant with expertise in cloud computing and edge computing.
Answer concisely and technically. If you do not know the answer, say so clearly.
Do not invent information. Always answer in English unless the user writes in another language.`;
export default {
async fetch(request: Request, env: Env): Promise<Response> {
if (request.method !== 'POST') {
return new Response('Method Not Allowed', { status: 405 });
}
let body: AssistantRequest;
try {
body = await request.json<AssistantRequest>();
} catch {
return Response.json({ error: 'Invalid JSON body' }, { status: 400 });
}
if (!body.question?.trim()) {
return Response.json({ error: 'question field is required' }, { status: 400 });
}
try {
const result = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
messages: [
{ role: 'system', content: SYSTEM_PROMPT },
{ role: 'user', content: body.question },
],
max_tokens: 2048,
temperature: 0.3, // Low temperature for more deterministic answers
}) as AiTextGenerationOutput;
return Response.json({
answer: result.response,
model: '@cf/meta/llama-3.1-8b-instruct',
timestamp: new Date().toISOString(),
});
} catch (err) {
console.error('AI inference error:', err);
return Response.json(
{ error: 'AI inference failed', details: (err as Error).message },
{ status: 500 }
);
}
},
};
interface AssistantRequest {
question: string;
}
interface Env {
AI: Ai;
}
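Transient inference failures (capacity, upstream timeouts) often succeed on retry. A small retry-with-exponential-backoff wrapper that the env.AI.run call above could be threaded through — withRetries and its parameters are hypothetical helpers, not part of the Workers AI API:

```typescript
// Retry an async operation with exponential backoff between attempts.
async function withRetries<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 250,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts) {
        // Backoff doubles each attempt: 250ms, 500ms, 1000ms, ...
        await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** (attempt - 1)));
      }
    }
  }
  throw lastError;
}
```

In the assistant above, the inference call could become `withRetries(() => env.AI.run(...))`.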
Vision Models: Image Analysis
Vision models allow you to analyze an input image together with a textual question. This is useful for content moderation, extracting information from scanned documents, and accessibility features:
// src/vision-worker.ts - image analysis with LLaVA
export default {
async fetch(request: Request, env: Env): Promise<Response> {
if (request.method !== 'POST') {
return new Response('Method Not Allowed', { status: 405 });
}
// Accept the image as Base64 or a URL
const body = await request.json<VisionRequest>();
let imageData: number[];
if (body.imageUrl) {
// Download the image and convert it to a byte array
const imgResponse = await fetch(body.imageUrl);
if (!imgResponse.ok) {
return Response.json({ error: 'Failed to fetch image' }, { status: 400 });
}
const buffer = await imgResponse.arrayBuffer();
imageData = Array.from(new Uint8Array(buffer));
} else if (body.imageBase64) {
// Decode Base64
const binaryString = atob(body.imageBase64);
imageData = Array.from({ length: binaryString.length }, (_, i) =>
binaryString.charCodeAt(i)
);
} else {
return Response.json({ error: 'imageUrl or imageBase64 required' }, { status: 400 });
}
const prompt = body.prompt ?? 'Describe this image in detail.';
const result = await env.AI.run('@cf/llava-hf/llava-1.5-7b-hf', {
image: imageData,
prompt,
max_tokens: 512,
}) as AiTextGenerationOutput;
return Response.json({
description: result.response,
prompt,
model: '@cf/llava-hf/llava-1.5-7b-hf',
});
},
};
interface VisionRequest {
imageUrl?: string;
imageBase64?: string;
prompt?: string;
}
interface Env {
AI: Ai;
}
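The Base64 branch above can be factored into a reusable helper; a sketch (base64ToBytes is a name chosen here, not a platform API — atob itself is available in both Workers and modern Node):

```typescript
// Decode a Base64 string into the number[] byte format
// expected by Workers AI vision models.
function base64ToBytes(b64: string): number[] {
  const binaryString = atob(b64); // throws on invalid Base64 input
  return Array.from({ length: binaryString.length }, (_, i) =>
    binaryString.charCodeAt(i),
  );
}
```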
Speech-to-Text with Whisper
Workers AI includes Whisper for audio transcription. The model accepts audio as an ArrayBuffer and returns the transcript, optionally with timestamps:
// src/transcribe-worker.ts - speech-to-text with Whisper
export default {
async fetch(request: Request, env: Env): Promise<Response> {
if (request.method !== 'POST') {
return new Response('Method Not Allowed', { status: 405 });
}
const contentType = request.headers.get('Content-Type') ?? '';
// Accept audio as multipart/form-data or application/octet-stream
let audioBuffer: ArrayBuffer;
if (contentType.includes('multipart/form-data')) {
const formData = await request.formData();
const audioFile = formData.get('audio') as File | null;
if (!audioFile) {
return Response.json({ error: 'audio field required in form data' }, { status: 400 });
}
audioBuffer = await audioFile.arrayBuffer();
} else {
// Raw binary audio
audioBuffer = await request.arrayBuffer();
}
if (audioBuffer.byteLength === 0) {
return Response.json({ error: 'Empty audio data' }, { status: 400 });
}
// Cap at 25MB (Whisper limit)
if (audioBuffer.byteLength > 25 * 1024 * 1024) {
return Response.json({ error: 'Audio file too large (max 25MB)' }, { status: 413 });
}
const result = await env.AI.run('@cf/openai/whisper', {
audio: Array.from(new Uint8Array(audioBuffer)),
}) as AiSpeechRecognitionOutput;
return Response.json({
text: result.text,
wordCount: result.text.split(/\s+/).filter(Boolean).length,
model: '@cf/openai/whisper',
});
},
};
interface Env {
AI: Ai;
}
Text Embeddings for Semantic Search
Embeddings are numeric vectors that represent the semantic meaning of a text. Workers AI includes BGE models optimized for semantic search. Combined with Vectorize (Cloudflare's vector database), they let you build a RAG pipeline entirely at the edge:
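To compare embeddings without a vector database (e.g., for a handful of documents held in memory), cosine similarity can be computed directly; a minimal sketch:

```typescript
// Cosine similarity between two embedding vectors: dot(a, b) / (|a| * |b|).
// Returns a value in [-1, 1]; 1 means identical direction.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('dimension mismatch');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Vectorize performs the same computation at scale (with indexing), which is why the query results below expose a score field.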
// src/embedding-worker.ts - embedding generation + semantic search
export default {
async fetch(request: Request, env: Env): Promise<Response> {
const url = new URL(request.url);
if (url.pathname === '/embed' && request.method === 'POST') {
const { texts } = await request.json<EmbedRequest>();
if (!Array.isArray(texts) || texts.length === 0) {
return Response.json({ error: 'texts array required' }, { status: 400 });
}
// BGE produces embeddings of 768 dimensions (base) or 1024 (large)
const result = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
text: texts,
}) as AiTextEmbeddingsOutput;
return Response.json({
embeddings: result.data,
dimensions: result.data[0]?.length ?? 0,
count: result.data.length,
model: '@cf/baai/bge-base-en-v1.5',
});
}
if (url.pathname === '/search' && request.method === 'POST') {
const { query, topK = 5 } = await request.json<SearchRequest>();
// 1. Generate the embedding for the query
const queryEmbed = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
text: [query],
}) as AiTextEmbeddingsOutput;
// 2. Semantic search on Vectorize
const results = await env.VECTORIZE.query(queryEmbed.data[0], {
topK,
returnMetadata: 'all',
});
return Response.json({
query,
results: results.matches.map((match) => ({
id: match.id,
score: match.score,
metadata: match.metadata,
})),
});
}
return new Response('Not Found', { status: 404 });
},
};
interface EmbedRequest {
texts: string[];
}
interface SearchRequest {
query: string;
topK?: number;
}
interface Env {
AI: Ai;
VECTORIZE: VectorizeIndex;
}
AI Gateway: Observability and Caching
Cloudflare AI Gateway is a transparent proxy that sits in front of your AI calls (both Workers AI and external providers such as OpenAI). It adds caching, rate limiting, logging, and automatic fallback:
// src/worker-with-gateway.ts - Workers AI via AI Gateway
export default {
async fetch(request: Request, env: Env): Promise<Response> {
const { prompt } = await request.json<{ prompt: string }>();
// Use the gateway instead of the direct binding
// The gateway adds: caching, retries, logging, rate limiting
const response = await fetch(
`https://gateway.ai.cloudflare.com/v1/${env.CF_ACCOUNT_ID}/${env.AI_GATEWAY_ID}/workers-ai/@cf/meta/llama-3.1-8b-instruct`,
{
method: 'POST',
headers: {
'Authorization': `Bearer ${env.CF_API_TOKEN}`,
'Content-Type': 'application/json',
},
body: JSON.stringify({
messages: [{ role: 'user', content: prompt }],
max_tokens: 512,
}),
}
);
if (!response.ok) {
const error = await response.text();
return Response.json({ error }, { status: response.status });
}
const result = await response.json();
return Response.json(result);
},
};
interface Env {
AI: Ai;
CF_ACCOUNT_ID: string;
CF_API_TOKEN: string;
AI_GATEWAY_ID: string;
}
Alternatively, you can configure the gateway directly on the AI binding in wrangler.toml:
# wrangler.toml with AI Gateway
[ai]
binding = "AI"
# The gateway is used automatically for all calls
# Configured in the Cloudflare dashboard
Limitations and Cost Considerations
| Model | Free tier (Neuron Units) | Paid (USD per 1K neurons) | Typical latency |
|---|---|---|---|
| Llama 3.1 8B | 10K NU/day free | $0.011 / 1K NU | ~500ms-2s (depends on tokens) |
| Llama 3.3 70B FP8 | Included in the paid plan | $0.050 / 1K NU | ~1-5s |
| Whisper | 10K NU/day free | $0.011 / 1K NU | ~1-3s per minute of audio |
| BGE embeddings | 10K NU/day free | $0.011 / 1K NU | ~50-200ms |
| Stable Diffusion XL | 10K NU/day free | $0.020 / image | ~3-10s |
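Given the per-1K-neuron rates in the table above, estimating a request's cost is a simple proportion; a sketch (the rate is a parameter, so the illustrative table values can be swapped for current pricing):

```typescript
// Estimate cost in dollars for a given neuron-unit consumption,
// given a rate expressed in dollars per 1,000 neuron units.
function estimateCost(neuronUnits: number, ratePer1kNU: number): number {
  return (neuronUnits / 1000) * ratePer1kNU;
}
```

For example, 10,000 NU at the $0.011/1K rate comes to about $0.11.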
Timeouts and CPU Limits
Workers AI inference runs outside the Worker's normal CPU budget (30s on the paid plan). However, large models like Llama 70B can take 5-15 seconds to respond. In those cases, use streaming to return tokens as they are generated so you don't exceed the client's HTTP timeout. For long inferences, consider a queue (Workers Queues) and notify the client when the job finishes.
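For the synchronous (non-streaming) path, one defensive option is racing the inference against a deadline and returning a fallback answer (from a smaller model or a canned reply). A sketch; withTimeout is a hypothetical helper, not a platform API:

```typescript
// Race a promise against a deadline; resolve with `fallback` if it expires.
async function withTimeout<T>(work: Promise<T>, ms: number, fallback: T): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<T>((resolve) => {
    timer = setTimeout(() => resolve(fallback), ms);
  });
  try {
    return await Promise.race([work, deadline]);
  } finally {
    clearTimeout(timer); // avoid a dangling timer once the race settles
  }
}
```

In a Worker this could wrap the call as `withTimeout(env.AI.run(...), 10_000, cannedAnswer)`.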
Production Pattern: RAG at the Edge
An increasingly common pattern is the RAG (Retrieval-Augmented Generation) entirely at the edge: Vectorize for retrieval, Workers AI for embedding and generation.
// src/rag-worker.ts - full RAG at the edge
export default {
async fetch(request: Request, env: Env): Promise<Response> {
if (request.method !== 'POST') return new Response('POST only', { status: 405 });
const { question } = await request.json<{ question: string }>();
// Step 1: Generate the embedding for the question
const queryEmbedding = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
text: [question],
}) as AiTextEmbeddingsOutput;
// Step 2: Retrieve the relevant chunks from the vector store
const relevant = await env.DOCS.query(queryEmbedding.data[0], {
topK: 3,
returnMetadata: 'all',
});
// Step 3: Build the context from the retrieved chunks
const context = relevant.matches
.map((m) => m.metadata?.['text'] as string ?? '')
.filter(Boolean)
.join('\n\n---\n\n');
// Step 4: Generate the answer using the context
const answer = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
messages: [
{
role: 'system',
content: `Answer the question based ONLY on the provided context.
If the context does not contain enough information, say so explicitly.
Context:
${context}`,
},
{ role: 'user', content: question },
],
max_tokens: 1024,
temperature: 0.1,
}) as AiTextGenerationOutput;
return Response.json({
question,
answer: answer.response,
sources: relevant.matches.map((m) => ({
id: m.id,
score: m.score,
title: m.metadata?.['title'],
})),
});
},
};
interface Env {
AI: Ai;
DOCS: VectorizeIndex;
}
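The worker above assumes the DOCS index was already populated: at ingestion time, documents are split into chunks, embedded with the same BGE model, and upserted into Vectorize. A minimal fixed-size chunker with overlap (a sketch; the sizes are illustrative):

```typescript
// Split text into fixed-size character chunks with overlap, so that
// sentences cut at a boundary still appear whole in a neighboring chunk.
function chunkText(text: string, chunkSize = 500, overlap = 50): string[] {
  if (overlap >= chunkSize) throw new Error('overlap must be smaller than chunkSize');
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += chunkSize - overlap) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last chunk reached the end
  }
  return chunks;
}
```

Each chunk would then be embedded via env.AI.run with bge-base-en-v1.5 and upserted with its text and title in the vector metadata, which is what the retrieval step reads back.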
Conclusions and Next Steps
Workers AI represents a paradigm shift in accessing AI inference: no GPUs to manage, no external providers to integrate, pay-per-use billing with a generous free tier. The 4,000% YoY growth through Q1 2026 reflects rapid adoption by developers looking for a simpler path to AI in their products.
The combination of Workers AI + Vectorize + Durable Objects (for managing conversation history) lets you build complete AI assistants entirely on the Cloudflare platform, with no external dependencies.
Next Articles in the Series
- Article 6: Vercel Edge Runtime — Advanced Middleware, Geolocation and A/B Testing: How Vercel uses the edge runtime with Next.js for customization and feature flags.
- Article 7: Geographic Routing at the Edge — Content Personalization and GDPR Compliance: Build geo-based logic without touching the origin server.
- Article 8: Cache API and Invalidation Strategies in Cloudflare Workers: Programmable CDN with TTL, stale-while-revalidate, and per-key purge.