Edge inference across a catalogue of open models — LLMs, embeddings, image generation, speech-to-text. No GPU management, no model hosting. One binding, one env.AI.run() call.
Workers AI is Cloudflare's serverless inference platform. You pick a model from their catalogue (Llama, Mistral, Whisper, Stable Diffusion, BAAI embeddings, and more), call env.AI.run() from your Worker, and Cloudflare routes the request to GPUs in their network. No model hosting, no VRAM calculations, no queue management.
The models are open-source — no API keys for OpenAI or Anthropic. You pay per token (for LLMs) or per request (for other models), and there's a generous free tier. The trade-off: model selection is limited to what Cloudflare offers, and you can't fine-tune or deploy custom models.
For structured output, Workers AI supports JSON schema enforcement on compatible models — the model's output is constrained to match your schema, eliminating parse failures.
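As a sketch of what schema enforcement looks like in practice, the snippet below passes a `response_format` with a JSON schema to a schema-capable model. The invoice field names and the `extractInvoice` helper are illustrative, not from the docs.

```javascript
// Hypothetical invoice schema -- field names are illustrative.
const invoiceSchema = {
  type: "object",
  properties: {
    vendor: { type: "string" },
    total: { type: "number" },
    currency: { type: "string" },
  },
  required: ["vendor", "total", "currency"],
};

// Sketch of a Worker helper, assuming a model that supports json_schema output.
async function extractInvoice(env, text) {
  const result = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
    messages: [{ role: "user", content: `Extract the invoice fields: ${text}` }],
    response_format: { type: "json_schema", json_schema: invoiceSchema },
  });
  // With enforcement, the output conforms to invoiceSchema -- no parse fallback needed.
  return result.response;
}
```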
Call a language model with a prompt. Supports system messages, streaming, and JSON schema enforcement.
```toml
# wrangler.toml
[ai]
binding = "AI"
```

```js
// Worker code
const response = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "Summarise this invoice." },
  ],
  max_tokens: 512,
});
```
Generate vector embeddings for text. Pairs with Vectorize for RAG pipelines.
```js
const embeddings = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
  text: ["Cloudflare Workers run on V8 isolates"],
});
// embeddings.data[0] is a 768-dim float array

// Insert into Vectorize
await env.VECTORS.upsert([{
  id: "doc-1",
  values: embeddings.data[0],
  metadata: { source: "workers-explainer" },
}]);
```
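The retrieval half of the RAG pipeline is a sketch along the same lines: embed the question with the same model, then query Vectorize for the nearest chunks. The `VECTORS` binding name matches the upsert example; `topK` and `returnMetadata` are assumptions about the Vectorize query options.

```javascript
// Sketch: retrieve the closest stored chunks for a user question.
async function retrieve(env, question) {
  // Embed the query with the SAME model used at index time.
  const q = await env.AI.run("@cf/baai/bge-base-en-v1.5", { text: [question] });

  // Nearest-neighbour search over the Vectorize index.
  const result = await env.VECTORS.query(q.data[0], {
    topK: 3,
    returnMetadata: true,
  });

  // Hand the matched sources to the LLM prompt, or return them directly.
  return result.matches.map((m) => m.metadata.source);
}
```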
Stream LLM output token-by-token for responsive UIs. The response is a ReadableStream of server-sent events.
```js
const stream = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
  messages: [{ role: "user", content: "Explain KV vs D1" }],
  stream: true,
});

return new Response(stream, {
  headers: { "Content-Type": "text/event-stream" },
});
```
You can't deploy custom or fine-tuned models. If you need GPT-4, Claude, or a private model, use AI Gateway to proxy to those providers instead.
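A minimal sketch of the AI Gateway route: the Worker calls the provider's normal API, but through the gateway's URL so requests get logging, caching, and rate limiting. `ACCOUNT_ID`, `GATEWAY_ID`, and `OPENAI_API_KEY` are placeholder bindings you'd configure yourself.

```javascript
// Build the gateway path for a given provider and endpoint.
function gatewayUrl(accountId, gatewayId, provider, endpoint) {
  return `https://gateway.ai.cloudflare.com/v1/${accountId}/${gatewayId}/${provider}/${endpoint}`;
}

// Sketch of proxying a chat completion to OpenAI via the gateway.
async function callOpenAI(env, prompt) {
  const url = gatewayUrl(env.ACCOUNT_ID, env.GATEWAY_ID, "openai", "chat/completions");
  const res = await fetch(url, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "gpt-4o",
      messages: [{ role: "user", content: prompt }],
    }),
  });
  return res.json();
}
```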
Requests are routed to the nearest GPU-equipped PoP, which isn't every PoP. From some regions (including parts of Africa), this adds noticeable latency.
Popular models have concurrency limits. Burst traffic can hit 429s. Use Queues to smooth demand for batch workloads.
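Smoothing bursts with Queues can be sketched as a producer/consumer pair in one Worker. The `INFERENCE_QUEUE` binding name and the message shape are assumptions; batch size and retry backoff would be tuned in `wrangler.toml`.

```javascript
// Sketch: enqueue inference jobs instead of calling the model on the hot path.
const worker = {
  // Producer: accept the job and return immediately.
  async fetch(request, env) {
    const job = await request.json();
    await env.INFERENCE_QUEUE.send(job);
    return new Response("queued", { status: 202 });
  },

  // Consumer: drain at a controlled rate; a 429 triggers a retry with backoff.
  async queue(batch, env) {
    for (const msg of batch.messages) {
      try {
        await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
          messages: [{ role: "user", content: msg.body.prompt }],
        });
        msg.ack();
      } catch (err) {
        msg.retry(); // redelivered later instead of dropped
      }
    }
  },
};

// In a real Worker module: export default worker;
```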