Cloudflare AI Gateway is the observability and resilience layer that sits between a Worker and Azure OpenAI: caching, retries, prompt logs, cost dashboards, and per-tenant rate limits, all from changing one URL. It's the natural upgrade from the direct-fetch pattern once the workload reaches the volume where you'd otherwise build a custom proxy.
The direct-fetch pattern works. It's lean, fast, and easy to reason about. What it lacks is the operational layer that any LLM workload eventually needs — a place to see prompts and responses for debugging, a cache that absorbs duplicate requests, retry logic that survives a 429 burst, fallback to a backup model when the primary deployment is down, and per-tenant rate limits in a multi-tenant SaaS.
You can write all of that. Most teams do, eventually. AI Gateway ships it as a Cloudflare service that sits between your Worker and Azure OpenAI. The change is a one-line edit: instead of fetch('https://aoai-imbila-sa.openai.azure.com/...'), you fetch https://gateway.ai.cloudflare.com/v1/<account>/<gateway>/azure-openai/aoai-imbila-sa/.... The request body, the api-version, the auth header — all unchanged. The response is the same shape. The dashboard now has graphs.
The cost is one extra hop. From a JNB Worker that hop is roughly 5 ms — small enough that the operational benefits dominate. The default decision in 2026 for any serious LLM workload on Cloudflare is to route through AI Gateway. Direct fetch becomes the path you take only when the gateway adds friction (highly custom retry semantics, very strict latency budgets, or a prototype where the dashboards aren't worth the URL change).
Operational visibility, free caching, and a per-tenant rate-limit layer aren't decorations — they're the difference between a Worker that works and a Worker that's debuggable, reliable, and cost-controllable in production.
Start with the URL swap (concept 01) and you immediately get logs and a cost dashboard. Layer in cache, retry, and per-tenant rate limits as the workload demands them. The order matches the order most teams adopt them in production.
One wrangler secret put and one URL change. The Worker code from the direct-fetch pattern still works — the gateway forwards the request to Azure OpenAI with the api-key header you set, returns the body, and logs every request and response. Open the AI Gateway dashboard and there are graphs by model, deployment, latency, and cost.
```js
// Before — direct fetch (no observability)
// const url = `${env.AOAI_ENDPOINT}/openai/deployments/${env.AOAI_DEPLOYMENT}/chat/completions?api-version=2025-01-01-preview`;

// After — same shape, gateway in front
const url = `https://gateway.ai.cloudflare.com/v1/${env.CF_ACCOUNT_ID}/${env.GATEWAY_ID}/azure-openai/${env.AOAI_RESOURCE}/${env.AOAI_DEPLOYMENT}/chat/completions?api-version=2025-01-01-preview`;

// Body, headers, auth — completely unchanged
const res = await fetch(url, {
  method: 'POST',
  headers: { 'api-key': env.AOAI_KEY, 'Content-Type': 'application/json' },
  body: JSON.stringify({ messages, max_tokens: 800 }),
});
```
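For completeness, a minimal sketch of the wrangler side, assuming the binding names used in the snippet above. Every value is a placeholder except the resource name already used on this page; the gateway slug comes from the dashboard when you create the gateway.

```toml
# wrangler.toml — illustrative [vars] matching the snippet above
[vars]
CF_ACCOUNT_ID = "<account-id>"
GATEWAY_ID = "<gateway-slug>"         # created once in the AI Gateway dashboard
AOAI_RESOURCE = "aoai-imbila-sa"
AOAI_DEPLOYMENT = "<deployment-name>"

# The Azure OpenAI key never goes in [vars]; set it as a secret instead:
#   npx wrangler secret put AOAI_KEY
```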
Idempotent prompts (RAG-grounded Q&A, classification, structured extraction) repeat. The gateway hashes the request body and serves a cached response when the next identical request lands. You opt in per-request via the cf-aig-cache-ttl header — set 0 for chat conversations where every response should be fresh, set 3600 for "what's the return policy" prompts where the answer hasn't changed in a year. No KV setup, no hash function, no sha256 on the hot path.
```js
const res = await fetch(url, {
  method: 'POST',
  headers: {
    'api-key': env.AOAI_KEY,
    'Content-Type': 'application/json',
    'cf-aig-cache-ttl': '3600',   // 1 hour
    'cf-aig-cache-key': tenantId, // scope cache per tenant
  },
  body: JSON.stringify({ messages, max_tokens: 800, temperature: 0 }),
});

// Inspect the headers to see if it was a hit
const hit = res.headers.get('cf-aig-cache-status'); // "HIT" | "MISS" | "BYPASS"
```
SA North's tighter quota means 429s are routine. With direct fetch you write the retry loop yourself (concept 03 of the direct-fetch pattern). With AI Gateway, you set retry headers on the request and the gateway honours retry-after-ms from Azure, falls back to exponential backoff, and only surfaces the failure to your Worker if all attempts fail. The Worker stays simple; the resilience lives in the gateway.
```js
const res = await fetch(url, {
  method: 'POST',
  headers: {
    'api-key': env.AOAI_KEY,
    'Content-Type': 'application/json',
    'cf-aig-max-attempts': '3',
    'cf-aig-retry-delay': '500',     // initial ms
    'cf-aig-backoff': 'exponential',
  },
  body: JSON.stringify({ messages }),
});

// Inspect what the gateway did
const attempts = res.headers.get('cf-aig-attempts'); // "1" | "2" | "3"
```
For a multi-tenant SaaS, the gateway's per-key rate limit is the cleanest abstraction. Set a daily token budget per tenant in the dashboard; the gateway enforces it without your Worker tracking counters. For resilience against an Azure OpenAI region outage, configure a fallback — the gateway tries SA North first, falls back to West Europe on 5xx, and surfaces a header telling the Worker which provider answered. Useful for non-regulated workloads where availability outranks strict residency.
```js
// Tenant identity → cache key + rate limit bucket
const res = await fetch(url, {
  method: 'POST',
  headers: {
    'api-key': env.AOAI_KEY,
    'Content-Type': 'application/json',
    'cf-aig-metadata': JSON.stringify({ tenantId, userId }),
    'cf-aig-cache-key': tenantId,
  },
  body: JSON.stringify({ messages }),
});

// Rate limit hit → 429 from the gateway, before AOAI is even called
if (res.status === 429 && res.headers.get('cf-aig-source') === 'rate-limit') {
  return new Response('Tenant quota exceeded', { status: 429 });
}

// Fallback: gateway tried SA North, failed, served from West Europe
const answeredBy = res.headers.get('cf-aig-provider'); // "azure-openai-sa" | "azure-openai-eu"
```
There are four shapes of "thing between your Worker and the LLM", plus the direct-fetch baseline with nothing in between. They make different trade-offs on observability, latency, and the cost of standing them up.
| Pattern | Latency overhead | Built-in observability | Caching | Setup cost |
|---|---|---|---|---|
| AI Gateway → AOAI (this page) | ~5 ms | Yes — dashboards, logs | Built-in, header-driven | One URL change |
| Direct fetch → AOAI | ~0 ms | DIY (Analytics Engine, Logpush) | Manual (KV + sha256) | Lower up front, higher long-tail |
| Custom Worker proxy | ~2–8 ms | As much as you build | As much as you build | High — you own the code forever |
| Helicone / LangSmith proxy | ~50–150 ms | Strong, vendor-specific | Vendor-dependent | Account + URL change · cross-border |
| Azure API Management → AOAI | ~20 ms | Strong, Azure-native | Built-in | Azure-shop tax — APIM is a project |
Some shapes of LLM workload are AI-Gateway-shaped on day one. Others get routed through it the second time the team has to debug a prompt that misbehaved in production.
One Worker, many customers, separate spending caps. The gateway's cf-aig-metadata + per-key rate limit replaces an in-Worker counter that would otherwise need a Durable Object for serialisation.
The same five questions account for ~40% of customer support traffic. Cache them at the gateway with a 1-hour TTL keyed by question hash, save tokens, drop p50 to ~50 ms for repeat questions.
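One way to build that question-hash key, sketched here with the Web Crypto API that Workers expose. The questionCacheKey helper and the question variable are illustrative names; the cache headers are the same ones shown in the caching example above.

```js
// Illustrative helper: derive a stable cache key from the question text
async function questionCacheKey(question) {
  const normalised = question.trim().toLowerCase().replace(/\s+/g, ' ');
  const digest = await crypto.subtle.digest('SHA-256', new TextEncoder().encode(normalised));
  return [...new Uint8Array(digest)].map((b) => b.toString(16).padStart(2, '0')).join('');
}

const res = await fetch(url, {
  method: 'POST',
  headers: {
    'api-key': env.AOAI_KEY,
    'Content-Type': 'application/json',
    'cf-aig-cache-ttl': '3600',                           // 1-hour TTL from the blurb above
    'cf-aig-cache-key': await questionCacheKey(question), // identical questions share a cache entry
  },
  body: JSON.stringify({ messages, temperature: 0 }),
});
```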
Tag requests via cf-aig-metadata with the prompt version, then filter the dashboard. No A/B framework, no analytics pipeline — the gateway logs are the experiment.
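A sketch of what that tagging looks like in the Worker, assuming two prompt variants. SYSTEM_PROMPT_V1, SYSTEM_PROMPT_V2, and userQuestion are illustrative names; the promptVersion field inside cf-aig-metadata is what you filter the gateway logs on.

```js
// Illustrative sketch: split traffic between two system prompts and tag each request
const promptVersion = Math.random() < 0.5 ? 'v1' : 'v2';
const systemPrompt = promptVersion === 'v1' ? SYSTEM_PROMPT_V1 : SYSTEM_PROMPT_V2;

const res = await fetch(url, {
  method: 'POST',
  headers: {
    'api-key': env.AOAI_KEY,
    'Content-Type': 'application/json',
    'cf-aig-metadata': JSON.stringify({ promptVersion }), // filter the dashboard on this field
  },
  body: JSON.stringify({
    messages: [
      { role: 'system', content: systemPrompt },
      { role: 'user', content: userQuestion },
    ],
    max_tokens: 800,
  }),
});
```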
Set a budget in the gateway, hook a webhook to Slack. Far easier than parsing Azure billing exports — the gateway sees every token by deployment and sums them in real time.
Non-regulated workloads where availability outranks residency: configure SA North as primary, West Europe as fallback. Gateway switches on 5xx, surfaces which provider answered via response header.
Logpush stream from AI Gateway → R2 → S3-compatible bucket → SIEM ingestion. Append-only audit log of every prompt and response that left the Worker, in country, retained for the regulatory window.
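A sketch of the first hop only, creating the Logpush job against the account-scoped Logpush API with an R2 destination. The dataset name is left as a placeholder to confirm against Cloudflare's Logpush docs; the bucket name, job name, and environment variables are illustrative.

```js
// Illustrative one-off Node script (not part of the Worker): create a Logpush job
// that streams gateway logs into an R2 bucket. The dataset value is a placeholder,
// not a confirmed name; check the Logpush datasets list before running this.
const { CF_ACCOUNT_ID, CF_API_TOKEN, R2_ACCESS_KEY_ID, R2_SECRET_ACCESS_KEY } = process.env;

const destination =
  `r2://prompt-audit-logs/ai-gateway` +
  `?account-id=${CF_ACCOUNT_ID}&access-key-id=${R2_ACCESS_KEY_ID}&secret-access-key=${R2_SECRET_ACCESS_KEY}`;

const resp = await fetch(`https://api.cloudflare.com/client/v4/accounts/${CF_ACCOUNT_ID}/logpush/jobs`, {
  method: 'POST',
  headers: { Authorization: `Bearer ${CF_API_TOKEN}`, 'Content-Type': 'application/json' },
  body: JSON.stringify({
    name: 'ai-gateway-prompt-audit',
    dataset: '<ai-gateway-dataset>', // placeholder
    destination_conf: destination,
  }),
});
console.log(resp.status, await resp.json());
```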
AI Gateway shipped late. Cloudflare's bet was that LLM workloads would settle into a small set of providers with a shared shape — and that one well-implemented proxy would be more useful than asking developers to build their own.
AI Gateway is a default-yes for the workloads it's designed for. The cases where direct fetch wins are narrower than they used to be — most of them turn into "use the gateway and configure around the friction" once you write the code.
AI Gateway is rarely the whole story. It sits between the Worker that calls it and the AOAI deployment it routes to, and it depends on Logpush for the audit story, R2 for log storage, and the workload-specific Worker pattern for what's calling it.
The upgrade path from a working direct-fetch Worker to a production-grade LLM caller. Concepts 01–04 stack the same way the underlying explainer does — start with the URL swap for instant observability, layer in cache for cost, retry for resilience, and per-tenant metadata for SaaS posture. Skip the gateway entirely for sub-50 ms latency budgets, strict private-endpoint deployments, or SARB workloads where any Cloudflare-side processing is contentious.
When the agent-context API ships, this node will also expose the matching wrangler.toml, the gateway-creation Terraform, and the Logpush configuration for compliance-grade prompt audit.
Cloudflare's docs cover the service well. Microsoft's Azure OpenAI reference is still the source of truth for the request body, since AI Gateway is transparent to it. Skill tree links round out the picture.
Reference Worker with all four concepts wired, gateway-creation Terraform, and the Logpush configuration for prompt audit. Shipping with the know.2nth.ai Worker API.