Edge inference across a catalogue of open models — LLMs, embeddings, image generation, speech-to-text. No GPU management, no model hosting. One binding, one env.AI.run() call.
Workers AI is Cloudflare's serverless inference platform. You pick a model from their catalogue (Llama, Mistral, Whisper, Stable Diffusion, BAAI embeddings, and more), call env.AI.run() from your Worker, and Cloudflare routes the request to GPUs in their network. No model hosting, no VRAM calculations, no queue management.
The models are open-source — no API keys for OpenAI or Anthropic. You pay per token (for LLMs) or per request (for other models), and there's a generous free tier. The trade-off: model selection is limited to what Cloudflare offers, and you can't fine-tune or deploy custom models.
For structured output, Workers AI supports JSON schema enforcement on compatible models — the model's output is constrained to match your schema, eliminating parse failures.
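As a sketch of what schema enforcement looks like in practice, the snippet below passes a `response_format` with a JSON schema to a schema-capable model. The invoice field names and the `extractInvoice` helper are illustrative, not from the docs.

```javascript
// Hypothetical invoice schema -- field names are illustrative.
const invoiceSchema = {
  type: "object",
  properties: {
    vendor: { type: "string" },
    total: { type: "number" },
    currency: { type: "string" },
  },
  required: ["vendor", "total", "currency"],
};

// Sketch of a Worker helper, assuming a model that supports json_schema output.
async function extractInvoice(env, text) {
  const result = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
    messages: [{ role: "user", content: `Extract the invoice fields: ${text}` }],
    response_format: { type: "json_schema", json_schema: invoiceSchema },
  });
  // With enforcement, the output conforms to invoiceSchema -- no parse fallback needed.
  return result.response;
}
```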
Call a language model with a prompt. Supports system messages, streaming, and JSON schema enforcement.
```toml
# wrangler.toml
[ai]
binding = "AI"
```

```js
// Worker code
const response = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "Summarise this invoice." },
  ],
  max_tokens: 512,
});
```
Generate vector embeddings for text. Pairs with Vectorize for RAG pipelines.
```js
const embeddings = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
  text: ["Cloudflare Workers run on V8 isolates"],
});
// embeddings.data[0] is a 768-dim float array

// Insert into Vectorize
await env.VECTORS.upsert([{
  id: "doc-1",
  values: embeddings.data[0],
  metadata: { source: "workers-explainer" },
}]);
```
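The retrieval half of the RAG pipeline is a sketch along the same lines: embed the question with the same model, then query Vectorize for the nearest chunks. The `VECTORS` binding name matches the upsert example; `topK` and `returnMetadata` are assumptions about the Vectorize query options.

```javascript
// Sketch: retrieve the closest stored chunks for a user question.
async function retrieve(env, question) {
  // Embed the query with the SAME model used at index time.
  const q = await env.AI.run("@cf/baai/bge-base-en-v1.5", { text: [question] });

  // Nearest-neighbour search over the Vectorize index.
  const result = await env.VECTORS.query(q.data[0], {
    topK: 3,
    returnMetadata: true,
  });

  // Hand the matched sources to the LLM prompt, or return them directly.
  return result.matches.map((m) => m.metadata.source);
}
```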
Stream LLM output token-by-token for responsive UIs. The response is a ReadableStream of server-sent events.
```js
const stream = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
  messages: [{ role: "user", content: "Explain KV vs D1" }],
  stream: true,
});

return new Response(stream, {
  headers: { "Content-Type": "text/event-stream" },
});
```
You can't deploy custom or fine-tuned models. If you need GPT-4, Claude, or a private model, use AI Gateway to proxy to those providers instead.
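A minimal sketch of the AI Gateway route: the Worker calls the provider's normal API, but through the gateway's URL so requests get logging, caching, and rate limiting. `ACCOUNT_ID`, `GATEWAY_ID`, and `OPENAI_API_KEY` are placeholder bindings you'd configure yourself.

```javascript
// Build the gateway path for a given provider and endpoint.
function gatewayUrl(accountId, gatewayId, provider, endpoint) {
  return `https://gateway.ai.cloudflare.com/v1/${accountId}/${gatewayId}/${provider}/${endpoint}`;
}

// Sketch of proxying a chat completion to OpenAI via the gateway.
async function callOpenAI(env, prompt) {
  const url = gatewayUrl(env.ACCOUNT_ID, env.GATEWAY_ID, "openai", "chat/completions");
  const res = await fetch(url, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "gpt-4o",
      messages: [{ role: "user", content: prompt }],
    }),
  });
  return res.json();
}
```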
Requests are routed to the nearest GPU-equipped PoP, which isn't every PoP. From some regions (including parts of Africa), this adds noticeable latency.
Popular models have concurrency limits. Burst traffic can hit 429s. Use Queues to smooth demand for batch workloads.
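Smoothing bursts with Queues can be sketched as a producer/consumer pair in one Worker. The `INFERENCE_QUEUE` binding name and the message shape are assumptions; batch size and retry backoff would be tuned in `wrangler.toml`.

```javascript
// Sketch: enqueue inference jobs instead of calling the model on the hot path.
const worker = {
  // Producer: accept the job and return immediately.
  async fetch(request, env) {
    const job = await request.json();
    await env.INFERENCE_QUEUE.send(job);
    return new Response("queued", { status: 202 });
  },

  // Consumer: drain at a controlled rate; a 429 triggers a retry with backoff.
  async queue(batch, env) {
    for (const msg of batch.messages) {
      try {
        await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
          messages: [{ role: "user", content: msg.body.prompt }],
        });
        msg.ack();
      } catch (err) {
        msg.retry(); // redelivered later instead of dropped
      }
    }
  },
};

// In a real Worker module: export default worker;
```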