AI Inference Moved to the Edge

Until 2023, running a Large Language Model almost always meant calling an external API (OpenAI, Anthropic, Google) or deploying expensive GPUs on dedicated infrastructure. The network latency to these centralized endpoints added 200-800ms to each request, and GPU costs were prohibitive for low-volume applications.

Workers AI changes this picture. Cloudflare has deployed AI inference hardware (specialized GPUs) in dozens of data centers around the world. Models run on the same hardware that runs your Worker, eliminating the network round-trip to external providers. The result is lower inference latency, billed per use, with no infrastructure to manage.

What You Will Learn

  • Models available in Workers AI: LLMs, vision, speech, embeddings
  • Text generation with Llama 3.1 and response streaming
  • Vision models: image analysis with LLaVA
  • Speech-to-text with Whisper
  • Text embeddings for semantic search
  • AI Gateway: caching, rate limiting and observability of AI requests
  • Limits, costs and optimization strategies

Overview of Available Models

Workers AI offers a selection of open-source models optimized for inference on Cloudflare hardware. Models are identified by the prefix @cf/ or @hf/ (Hugging Face hosted):

Category | Main models | Use case
Text Generation | @cf/meta/llama-3.1-8b-instruct, @cf/mistral/mistral-7b-instruct-v0.2 | Chatbots, summarization, Q&A, code generation
Text Generation (Large) | @cf/meta/llama-3.3-70b-instruct-fp8-fast | Complex reasoning, advanced analysis
Vision | @cf/llava-hf/llava-1.5-7b-hf, @cf/unum/uform-gen2-qwen-500m | Image captioning, visual Q&A
Speech-to-Text | @cf/openai/whisper, @cf/openai/whisper-large-v3-turbo | Audio transcription
Text Embeddings | @cf/baai/bge-base-en-v1.5, @cf/baai/bge-large-en-v1.5 | Semantic search, similarity, RAG
Image Classification | @cf/microsoft/resnet-50 | Image classification
Translation | @cf/meta/m2m100-1.2b | Translation across 100+ languages
Image Generation | @cf/stabilityai/stable-diffusion-xl-base-1.0 | Text-to-image

Configuration: AI Binding in wrangler.toml

To use Workers AI, add the [ai] binding to your configuration:

# wrangler.toml
name = "ai-worker"
main = "src/worker.ts"
compatibility_date = "2024-09-23"

# Binding for Workers AI
[ai]
binding = "AI"

The binding's TypeScript type must be declared in the Env interface:

// types.ts - declaration of the AI binding
interface Env {
  AI: Ai; // Type provided by @cloudflare/workers-types
}

Text Generation with Llama 3.1

The most common use case is text generation with an instruction-following model. Let's implement a chat endpoint:

// src/worker.ts - chat endpoint with Llama 3.1

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    if (request.method !== 'POST' || new URL(request.url).pathname !== '/chat') {
      return new Response('POST /chat required', { status: 400 });
    }

    const { messages, stream = false } = await request.json<ChatRequest>();

    // Validate the input
    if (!Array.isArray(messages) || messages.length === 0) {
      return Response.json({ error: 'messages array required' }, { status: 400 });
    }

    if (stream) {
      // Streaming response: the model returns tokens as they are generated
      const aiStream = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
        messages,
        stream: true,
        max_tokens: 1024,
        temperature: 0.7,
      });

      // Transform the AI stream into Server-Sent Events
      return new Response(aiStream, {
        headers: {
          'Content-Type': 'text/event-stream',
          'Cache-Control': 'no-cache',
          'Connection': 'keep-alive',
        },
      });
    }

    // Synchronous response: waits for the full completion
    const result = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
      messages,
      max_tokens: 1024,
      temperature: 0.7,
    });

    return Response.json({
      response: (result as AiTextGenerationOutput).response,
      usage: {
        // Workers AI does not yet expose token counts in the basic response
        model: '@cf/meta/llama-3.1-8b-instruct',
      },
    });
  },
};

interface ChatRequest {
  messages: Array<{ role: 'system' | 'user' | 'assistant'; content: string }>;
  stream?: boolean;
}

interface Env {
  AI: Ai;
}
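On the client side, the SSE stream above has to be parsed incrementally. A minimal sketch of a parser for the `data: {...}` event lines (the `[DONE]` sentinel and the `response` field reflect the streamed format Workers AI uses; treat the exact payload shape as an assumption to verify against the current docs):

```typescript
// Extracts the generated text tokens from a chunk of Server-Sent Events.
// Each event line looks like:  data: {"response":"<token>"}
// and the stream terminates with:  data: [DONE]
function extractTokens(sseChunk: string): string[] {
  const tokens: string[] = [];
  for (const line of sseChunk.split('\n')) {
    const trimmed = line.trim();
    if (!trimmed.startsWith('data:')) continue;
    const payload = trimmed.slice('data:'.length).trim();
    if (payload === '[DONE]') break;
    try {
      const parsed = JSON.parse(payload) as { response?: string };
      if (parsed.response) tokens.push(parsed.response);
    } catch {
      // Ignore JSON split across network chunks; a real client buffers it
    }
  }
  return tokens;
}
```

In the browser you would read `response.body` with a `ReadableStream` reader, decode each chunk with `TextDecoder`, and buffer partial lines before calling a parser like this.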

A more complete example implements an assistant with a system prompt and robust error handling:

// src/assistant-worker.ts

const SYSTEM_PROMPT = `You are a technical assistant specializing in cloud computing and edge computing.
Answer concisely and technically. If you do not know the answer, say so clearly.
Do not invent information. Always answer in the same language the user writes in.`;

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    if (request.method !== 'POST') {
      return new Response('Method Not Allowed', { status: 405 });
    }

    let body: AssistantRequest;
    try {
      body = await request.json<AssistantRequest>();
    } catch {
      return Response.json({ error: 'Invalid JSON body' }, { status: 400 });
    }

    if (!body.question?.trim()) {
      return Response.json({ error: 'question field is required' }, { status: 400 });
    }

    try {
      const result = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
        messages: [
          { role: 'system', content: SYSTEM_PROMPT },
          { role: 'user', content: body.question },
        ],
        max_tokens: 2048,
        temperature: 0.3, // Low temperature for more deterministic answers
      }) as AiTextGenerationOutput;

      return Response.json({
        answer: result.response,
        model: '@cf/meta/llama-3.1-8b-instruct',
        timestamp: new Date().toISOString(),
      });
    } catch (err) {
      console.error('AI inference error:', err);
      return Response.json(
        { error: 'AI inference failed', details: (err as Error).message },
        { status: 500 }
      );
    }
  },
};

interface AssistantRequest {
  question: string;
}

interface Env {
  AI: Ai;
}

Vision Models: Image Analysis

Vision models let you analyze an input image together with a text question. This is useful for content moderation, extracting information from scanned documents, and accessibility features:

// src/vision-worker.ts - image analysis with LLaVA

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    if (request.method !== 'POST') {
      return new Response('Method Not Allowed', { status: 405 });
    }

    // Accept the image as Base64 or a URL
    const body = await request.json<VisionRequest>();

    let imageData: number[];

    if (body.imageUrl) {
      // Download the image and convert it to a byte array
      const imgResponse = await fetch(body.imageUrl);
      if (!imgResponse.ok) {
        return Response.json({ error: 'Failed to fetch image' }, { status: 400 });
      }
      const buffer = await imgResponse.arrayBuffer();
      imageData = Array.from(new Uint8Array(buffer));
    } else if (body.imageBase64) {
      // Decode Base64
      const binaryString = atob(body.imageBase64);
      imageData = Array.from({ length: binaryString.length }, (_, i) =>
        binaryString.charCodeAt(i)
      );
    } else {
      return Response.json({ error: 'imageUrl or imageBase64 required' }, { status: 400 });
    }

    const prompt = body.prompt ?? 'Describe this image in detail.';

    const result = await env.AI.run('@cf/llava-hf/llava-1.5-7b-hf', {
      image: imageData,
      prompt,
      max_tokens: 512,
    }) as AiTextGenerationOutput;

    return Response.json({
      description: result.response,
      prompt,
      model: '@cf/llava-hf/llava-1.5-7b-hf',
    });
  },
};

interface VisionRequest {
  imageUrl?: string;
  imageBase64?: string;
  prompt?: string;
}

interface Env {
  AI: Ai;
}
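To call this endpoint with `imageBase64`, the client first needs to Base64-encode the image bytes. A minimal sketch using the standard `btoa` global, processing the array in chunks so large images don't overflow the argument limit of `String.fromCharCode` (the chunk size is an arbitrary choice):

```typescript
// Converts raw image bytes into a Base64 string suitable for the
// imageBase64 field of the vision endpoint above.
function bytesToBase64(bytes: Uint8Array): string {
  const CHUNK = 8192; // arbitrary chunk size
  let binary = '';
  for (let i = 0; i < bytes.length; i += CHUNK) {
    binary += String.fromCharCode(...bytes.subarray(i, i + CHUNK));
  }
  return btoa(binary);
}
```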

Speech-to-Text with Whisper

Workers AI includes Whisper for audio transcription. The model accepts raw audio bytes and returns the transcript, optionally with timestamps:

// src/transcribe-worker.ts - speech-to-text with Whisper

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    if (request.method !== 'POST') {
      return new Response('Method Not Allowed', { status: 405 });
    }

    const contentType = request.headers.get('Content-Type') ?? '';

    // Accetta audio come multipart/form-data o application/octet-stream
    let audioBuffer: ArrayBuffer;

    if (contentType.includes('multipart/form-data')) {
      const formData = await request.formData();
      const audioFile = formData.get('audio') as File | null;
      if (!audioFile) {
        return Response.json({ error: 'audio field required in form data' }, { status: 400 });
      }
      audioBuffer = await audioFile.arrayBuffer();
    } else {
      // Raw binary audio
      audioBuffer = await request.arrayBuffer();
    }

    if (audioBuffer.byteLength === 0) {
      return Response.json({ error: 'Empty audio data' }, { status: 400 });
    }

    // Limit to 25MB (Whisper limit)
    if (audioBuffer.byteLength > 25 * 1024 * 1024) {
      return Response.json({ error: 'Audio file too large (max 25MB)' }, { status: 413 });
    }

    const result = await env.AI.run('@cf/openai/whisper', {
      audio: Array.from(new Uint8Array(audioBuffer)),
    }) as AiSpeechRecognitionOutput;

    return Response.json({
      text: result.text,
      wordCount: result.text.split(/\s+/).filter(Boolean).length,
      model: '@cf/openai/whisper',
    });
  },
};

interface Env {
  AI: Ai;
}
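Whisper's output can also carry per-word timestamps (a `words` array with `start`/`end` in seconds, per the `AiSpeechRecognitionOutput` type; the model may also return a ready-made `vtt` field, so treat this as an illustrative sketch rather than the canonical approach). Those timestamps can be turned into WebVTT subtitle cues:

```typescript
interface TimedWord { word: string; start: number; end: number }

// Formats seconds as a WebVTT timestamp, e.g. 75.5 -> "00:01:15.500"
function vttTime(seconds: number): string {
  const totalMs = Math.round(seconds * 1000);
  const h = Math.floor(totalMs / 3600000);
  const m = Math.floor((totalMs % 3600000) / 60000);
  const s = Math.floor((totalMs % 60000) / 1000);
  const pad = (n: number) => String(n).padStart(2, '0');
  return `${pad(h)}:${pad(m)}:${pad(s)}.${String(totalMs % 1000).padStart(3, '0')}`;
}

// Groups timestamped words into cues of at most `wordsPerCue` words.
function wordsToVtt(words: TimedWord[], wordsPerCue = 7): string {
  const cues: string[] = ['WEBVTT'];
  for (let i = 0; i < words.length; i += wordsPerCue) {
    const group = words.slice(i, i + wordsPerCue);
    const text = group.map((w) => w.word).join(' ');
    cues.push(`${vttTime(group[0].start)} --> ${vttTime(group[group.length - 1].end)}\n${text}`);
  }
  return cues.join('\n\n');
}
```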

Text Embeddings for Semantic Search

Embeddings are numerical vectors that capture the semantic meaning of a text. Workers AI includes BGE models optimized for semantic search. Combined with Vectorize (Cloudflare's vector database), they let you build RAG pipelines entirely at the edge:

// src/embedding-worker.ts - embedding generation + semantic search

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const url = new URL(request.url);

    if (url.pathname === '/embed' && request.method === 'POST') {
      const { texts } = await request.json<EmbedRequest>();

      if (!Array.isArray(texts) || texts.length === 0) {
        return Response.json({ error: 'texts array required' }, { status: 400 });
      }

      // BGE generates 768-dimensional (base) or 1024-dimensional (large) embeddings
      const result = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
        text: texts,
      }) as AiTextEmbeddingsOutput;

      return Response.json({
        embeddings: result.data,
        dimensions: result.data[0]?.length ?? 0,
        count: result.data.length,
        model: '@cf/baai/bge-base-en-v1.5',
      });
    }

    if (url.pathname === '/search' && request.method === 'POST') {
      const { query, topK = 5 } = await request.json<SearchRequest>();

      // 1. Generate the embedding for the query
      const queryEmbed = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
        text: [query],
      }) as AiTextEmbeddingsOutput;

      // 2. Semantic search on Vectorize
      const results = await env.VECTORIZE.query(queryEmbed.data[0], {
        topK,
        returnMetadata: 'all',
      });

      return Response.json({
        query,
        results: results.matches.map((match) => ({
          id: match.id,
          score: match.score,
          metadata: match.metadata,
        })),
      });
    }

    return new Response('Not Found', { status: 404 });
  },
};

interface EmbedRequest {
  texts: string[];
}

interface SearchRequest {
  query: string;
  topK?: number;
}

interface Env {
  AI: Ai;
  VECTORIZE: VectorizeIndex;
}
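When you don't need a full vector index, the similarity between two BGE embeddings can be computed directly. A minimal sketch (we divide by the norms rather than assuming the vectors are pre-normalized):

```typescript
// Cosine similarity between two embedding vectors of equal length.
// Returns a value in [-1, 1]; closer to 1 means more semantically similar.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('Vectors must have the same length');
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

This is useful for small in-memory comparisons, e.g. ranking the vectors returned by the `/embed` endpoint against a query embedding without a round-trip to Vectorize.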

AI Gateway: Observability and Caching

Cloudflare AI Gateway is a transparent proxy that sits in front of your AI calls (both Workers AI and external providers such as OpenAI). It adds caching, rate limiting, logging and automatic fallback:

// src/worker-with-gateway.ts - Workers AI via AI Gateway

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const { prompt } = await request.json<{ prompt: string }>();

    // Use the gateway instead of the direct binding
    // The gateway adds: caching, retries, logging, rate limiting
    const response = await fetch(
      `https://gateway.ai.cloudflare.com/v1/${env.CF_ACCOUNT_ID}/${env.AI_GATEWAY_ID}/workers-ai/@cf/meta/llama-3.1-8b-instruct`,
      {
        method: 'POST',
        headers: {
          'Authorization': `Bearer ${env.CF_API_TOKEN}`,
          'Content-Type': 'application/json',
        },
        body: JSON.stringify({
          messages: [{ role: 'user', content: prompt }],
          max_tokens: 512,
        }),
      }
    );

    if (!response.ok) {
      const error = await response.text();
      return Response.json({ error }, { status: response.status });
    }

    const result = await response.json();
    return Response.json(result);
  },
};

interface Env {
  AI: Ai;
  CF_ACCOUNT_ID: string;
  CF_API_TOKEN: string;
  AI_GATEWAY_ID: string;
}

Alternatively, you can route calls through the gateway directly from the AI binding in wrangler.toml:

# wrangler.toml with AI Gateway
[ai]
binding = "AI"
# The gateway is applied automatically to every call
# Configured in the Cloudflare dashboard
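The binding route also accepts per-request gateway options as a third argument to env.AI.run (a `gateway` object with an `id` and cache settings; verify the exact option names against your version of @cloudflare/workers-types). A hedged sketch of a helper that builds those options:

```typescript
// Builds the per-request options object that routes an env.AI.run()
// call through a named AI Gateway. The cacheTtl default is an
// arbitrary choice for this sketch.
function gatewayOptions(gatewayId: string, cacheTtl = 3600) {
  return {
    gateway: {
      id: gatewayId,    // Gateway name from the Cloudflare dashboard
      cacheTtl,         // Cache identical requests for this many seconds
      skipCache: false, // Set true to bypass the cache for one request
    },
  };
}

// Usage (inside a Worker, sketch):
// await env.AI.run('@cf/meta/llama-3.1-8b-instruct', { messages }, gatewayOptions('my-gateway'));
```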

Limitations and Cost Considerations

Model | Free tier (Neuron Units) | Paid ($ per 1K neurons) | Typical latency
Llama 3.1 8B | 10K NU/day free | $0.011 / 1K NU | ~500ms-2s (depends on tokens)
Llama 3.3 70B FP8 | Included in the paid plan | $0.050 / 1K NU | ~1-5s
Whisper | 10K NU/day free | $0.011 / 1K NU | ~1-3s per minute of audio
BGE embeddings | 10K NU/day free | $0.011 / 1K NU | ~50-200ms
Stable Diffusion XL | 10K NU/day free | $0.020 / image | ~3-10s

Timeouts and CPU Limits

Workers AI inference runs outside the Worker's normal CPU budget (30s on the paid plan). However, large models like Llama 70B can take 5-15 seconds to respond. In those cases, use streaming to return tokens incrementally so you don't exceed the client's HTTP timeout. For long inferences, consider a queue (Workers Queues) and notify the client when the job finishes.

Production Pattern: RAG at the Edge

An increasingly common pattern is RAG (Retrieval-Augmented Generation) entirely at the edge: Vectorize for retrieval, Workers AI for embeddings and generation.

// src/rag-worker.ts - complete RAG at the edge

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    if (request.method !== 'POST') return new Response('POST only', { status: 405 });

    const { question } = await request.json<{ question: string }>();

    // Step 1: Generate the embedding for the question
    const queryEmbedding = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
      text: [question],
    }) as AiTextEmbeddingsOutput;

    // Step 2: Retrieve the relevant chunks from the vector store
    const relevant = await env.DOCS.query(queryEmbedding.data[0], {
      topK: 3,
      returnMetadata: 'all',
    });

    // Step 3: Build the context from the retrieved chunks
    const context = relevant.matches
      .map((m) => m.metadata?.['text'] as string ?? '')
      .filter(Boolean)
      .join('\n\n---\n\n');

    // Step 4: Generate the answer using the context
    const answer = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
      messages: [
        {
          role: 'system',
          content: `Answer the question based ONLY on the provided context.
If the context does not contain enough information, say so explicitly.
Context:
${context}`,
        },
        { role: 'user', content: question },
      ],
      max_tokens: 1024,
      temperature: 0.1,
    }) as AiTextGenerationOutput;

    return Response.json({
      question,
      answer: answer.response,
      sources: relevant.matches.map((m) => ({
        id: m.id,
        score: m.score,
        title: m.metadata?.['title'],
      })),
    });
  },
};

interface Env {
  AI: Ai;
  DOCS: VectorizeIndex;
}
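The retrieval side above assumes documents were already embedded and upserted into Vectorize. The ingestion side usually starts with chunking; a minimal sketch of a fixed-size chunker with overlap (the 512/64 sizes are arbitrary assumptions to tune for your corpus):

```typescript
// Splits a document into fixed-size character chunks with overlap,
// so text cut at a boundary still appears in full in the neighboring
// chunk. Character-based for simplicity; production pipelines often
// chunk by tokens or sentences instead.
function chunkText(text: string, chunkSize = 512, overlap = 64): string[] {
  if (overlap >= chunkSize) throw new Error('overlap must be smaller than chunkSize');
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += chunkSize - overlap) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

// Each chunk is then embedded with @cf/baai/bge-base-en-v1.5 and
// upserted into Vectorize as { id, values, metadata: { text } }.
```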

Conclusions and Next Steps

Workers AI represents a paradigm shift in how AI inference is accessed: no GPUs to manage, no external providers to integrate, consumption-based billing with a generous free tier. The 4000% year-over-year growth through Q1 2026 reflects rapid adoption by developers looking for a simpler path to AI in their products.

Combining Workers AI, Vectorize and Durable Objects (for managing conversation history) lets you build complete AI assistants entirely on the Cloudflare platform, with no external dependencies.

Next Articles in the Series

  • Article 6: Vercel Edge Runtime — Advanced Middleware, Geolocation and A/B Testing: How Vercel uses the edge runtime with Next.js for customization and feature flags.
  • Article 7: Geographic Routing at the Edge — Content Personalization and GDPR Compliance: Build geo-based logic without touching the main server.
  • Article 8: Cache API and Invalidation Strategies in Cloudflare Workers: Programmable CDN with TTL, stale-while-revalidate, and per-key purge.