Technology · Cloudflare · Skill Node

Workers AI.

Edge inference across a catalogue of open models — LLMs, embeddings, image generation, speech-to-text. No GPU management, no model hosting. One binding, one env.AI.run() call.

Technology · Edge AI Inference · Open Tier · Last updated Apr 2026

GPU inference without GPU management.

Workers AI is Cloudflare's serverless inference platform. You pick a model from their catalogue (Llama, Mistral, Whisper, Stable Diffusion, BAAI embeddings, and more), call env.AI.run() from your Worker, and Cloudflare routes the request to GPUs in their network. No model hosting, no VRAM calculations, no queue management.

The models are open-source — no API keys for OpenAI or Anthropic. You pay per token (for LLMs) or per request (for other models), and there's a generous free tier. The trade-off: model selection is limited to what Cloudflare offers, and you can't fine-tune or deploy custom models.

For structured output, Workers AI supports JSON schema enforcement on compatible models — the model's output is constrained to match your schema, eliminating parse failures.
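A minimal sketch of schema-constrained output, assuming a model that supports the `response_format` option; the helper name and schema fields are illustrative, not part of any API:

```javascript
// Illustrative JSON schema for the fields we want back.
const invoiceSchema = {
  type: "object",
  properties: {
    vendor: { type: "string" },
    total: { type: "number" },
  },
  required: ["vendor", "total"],
};

// Hypothetical helper: ask the model to extract fields, constrained to the schema.
async function extractInvoice(env, text) {
  return env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
    messages: [{ role: "user", content: `Extract the vendor and total from: ${text}` }],
    response_format: { type: "json_schema", json_schema: invoiceSchema },
  });
}
```

Because the output is constrained to the schema, the response can be consumed directly instead of being wrapped in try/catch around `JSON.parse`.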

Binding, model catalogue, and structured output.

01

Text generation (LLM)

Call a language model with a prompt. Supports system messages, streaming, and JSON schema enforcement.

# wrangler.toml
[ai]
binding = "AI"

// Worker code
const response = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "Summarise this invoice." },
  ],
  max_tokens: 512,
});
02

Embeddings

Generate vector embeddings for text. Pairs with Vectorize for RAG pipelines.

const embeddings = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
  text: ["Cloudflare Workers run on V8 isolates"],
});
// embeddings.data[0] is a 768-dim float array

// Insert into Vectorize
await env.VECTORS.upsert([{
  id: "doc-1",
  values: embeddings.data[0],
  metadata: { source: "workers-explainer" },
}]);
03

Streaming responses

Stream LLM output token-by-token for responsive UIs. The response is a ReadableStream of server-sent events.

const stream = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
  messages: [{ role: "user", content: "Explain KV vs D1" }],
  stream: true,
});

return new Response(stream, {
  headers: { "Content-Type": "text/event-stream" },
});

Where it bites you.

Model selection

Limited to Cloudflare's catalogue

You can't deploy custom or fine-tuned models. If you need GPT-4, Claude, or a private model, use AI Gateway to proxy to those providers instead.
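A sketch of the proxy path, assuming the documented AI Gateway URL shape (`gateway.ai.cloudflare.com/v1/{account}/{gateway}/{provider}/…`); the binding names (`ACCOUNT_ID`, `GATEWAY_ID`, `OPENAI_API_KEY`) are placeholders you would supply via Worker secrets/vars:

```javascript
// Build a gateway URL for an external provider.
function gatewayUrl(accountId, gatewayId, provider, path) {
  return `https://gateway.ai.cloudflare.com/v1/${accountId}/${gatewayId}/${provider}/${path}`;
}

// Hypothetical helper: call OpenAI through AI Gateway so requests
// are logged, cached, and rate-limited at the gateway.
async function callOpenAIViaGateway(env, messages) {
  const url = gatewayUrl(env.ACCOUNT_ID, env.GATEWAY_ID, "openai", "chat/completions");
  return fetch(url, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model: "gpt-4o", messages }),
  });
}
```

You still pay the external provider per token; the gateway adds observability and caching, not inference.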

Latency

GPU routing adds latency vs local inference

Requests are routed to the nearest GPU-equipped PoP, which isn't every PoP. From some regions (including parts of Africa), this adds noticeable latency.

Rate limits

Concurrent request limits per model

Popular models have concurrency limits. Burst traffic can hit 429s. Use Queues to smooth demand for batch workloads.
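For interactive paths where a Queue is overkill, a retry wrapper with exponential backoff is a reasonable fallback. A minimal sketch, assuming the thrown error's message mentions the 429 / capacity condition (check what your binding actually throws):

```javascript
// Retry env.AI.run with exponential backoff on capacity errors.
// The 429/capacity check below is an assumption about the error shape.
async function runWithRetry(env, model, input, attempts = 3) {
  let delay = 250; // ms
  for (let i = 0; i < attempts; i++) {
    try {
      return await env.AI.run(model, input);
    } catch (err) {
      const lastAttempt = i === attempts - 1;
      if (lastAttempt || !/429|capacity/i.test(String(err))) throw err;
      await new Promise((resolve) => setTimeout(resolve, delay));
      delay *= 2; // 250ms, 500ms, 1s, ...
    }
  }
}
```

For batch workloads, prefer Queues: backoff only spreads a burst, it doesn't bound concurrency.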

When it fits. When it doesn't.

✓ Use it when
  • You want zero-ops AI inference. No GPU provisioning, no model deployment, no CUDA drivers.
  • Open models are sufficient. Llama, Mistral, Whisper, and BAAI embeddings cover many use cases.
  • Building an all-Cloudflare RAG stack. Workers AI + Vectorize + D1/R2 is a complete pipeline with no external dependencies.
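The query path of that RAG stack can be sketched in one function: embed the question, find nearby chunks in Vectorize, feed them to an LLM. The binding names (`AI`, `VECTORS`) and the `text` metadata field are assumptions about your setup:

```javascript
// Sketch: all-Cloudflare RAG query path (bindings and metadata fields assumed).
async function answer(env, question) {
  // 1. Embed the question (768-dim vector).
  const emb = await env.AI.run("@cf/baai/bge-base-en-v1.5", { text: [question] });

  // 2. Retrieve the nearest chunks from Vectorize.
  const result = await env.VECTORS.query(emb.data[0], { topK: 3, returnMetadata: true });
  const context = result.matches.map((m) => m.metadata?.text ?? "").join("\n");

  // 3. Answer with the retrieved context as grounding.
  return env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
    messages: [
      { role: "system", content: `Answer using only this context:\n${context}` },
      { role: "user", content: question },
    ],
  });
}
```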
