Azure OpenAI from Workers + AI Gateway
Technology · Microsoft · Implementation Node

The proxy
you don't have to write.

Cloudflare AI Gateway is the observability and resilience layer that sits between a Worker and Azure OpenAI. Caching, retries, prompt logs, cost dashboards, and per-tenant rate limits — all by changing one URL. The natural upgrade from the direct-fetch pattern when the workload reaches the volume where you'd otherwise build a custom proxy.

Implementation Node · AI Gateway · Azure OpenAI · SA North · Last updated May 2026

Same auth, same body. Different host.

The direct-fetch pattern works. It's lean, fast, and easy to reason about. What it lacks is the operational layer that any LLM workload eventually needs — a place to see prompts and responses for debugging, a cache that absorbs duplicate requests, retry logic that survives a 429 burst, fallback to a backup model when the primary deployment is down, and per-tenant rate limits in a multi-tenant SaaS.

You can write all of that. Most teams do, eventually. AI Gateway ships it as a Cloudflare service that sits between your Worker and Azure OpenAI. The change is a one-line edit: instead of fetch('https://aoai-imbila-sa.openai.azure.com/...'), you fetch https://gateway.ai.cloudflare.com/v1/<account>/<gateway>/azure-openai/aoai-imbila-sa/.... The request body, the api-version, the auth header — all unchanged. The response is the same shape. The dashboard now has graphs.

The cost is one extra hop. From a JNB Worker that hop is roughly 5 ms — small enough that the operational benefits dominate. The default decision in 2026 for any serious LLM workload on Cloudflare is to route through AI Gateway. Direct fetch becomes the path you take only when the gateway adds friction (highly custom retry semantics, very strict latency budgets, or a prototype where the dashboards aren't worth the URL change).

01 Worker · JNB colo
02 AI Gateway · cache, retry, log
03 Azure OpenAI · southafricanorth
04 Stream back · SSE pass-through

Numbers that pay back the extra hop.

Operational visibility, free caching, and a per-tenant rate-limit layer aren't decorations — they're the difference between a Worker that works and a Worker that's debuggable, reliable, and cost-controllable in production.

+5 ms · latency over direct fetch (JNB Worker → JNB Gateway → SA-North AOAI)
100 % · prompt + response logs (all traffic, queryable, on the free tier)
~30–60 % · typical cache hit rate (idempotent prompts, RAG-grounded Q&A)
$0 · free-tier cost per month (caching, logs, retry all included)

Four moves. The first is one line.

Start with the URL swap (concept 01) and you immediately get logs and a cost dashboard. Layer in cache, retry, and per-tenant rate limits as the workload demands them. The sequence matches the order in which most teams adopt them in production.

01

The URL swap — instant observability

One wrangler secret put and one URL change. The Worker code from the direct-fetch pattern still works — the gateway forwards the request to Azure OpenAI with the api-key header you set, returns the body, and logs every request and response. Open the AI Gateway dashboard and there are graphs by model, deployment, latency, and cost.

// Before — direct fetch (no observability)
const url = `${env.AOAI_ENDPOINT}/openai/deployments/${env.AOAI_DEPLOYMENT}/chat/completions?api-version=2025-01-01-preview`;

// After — same shape, gateway in front
const url = `https://gateway.ai.cloudflare.com/v1/${env.CF_ACCOUNT_ID}/${env.GATEWAY_ID}/azure-openai/${env.AOAI_RESOURCE}/${env.AOAI_DEPLOYMENT}/chat/completions?api-version=2025-01-01-preview`;

// Body, headers, auth — completely unchanged
const res = await fetch(url, {
  method: 'POST',
  headers: { 'api-key': env.AOAI_KEY, 'Content-Type': 'application/json' },
  body: JSON.stringify({ messages, max_tokens: 800 }),
});
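Because the gateway returns the body unchanged, parsing is identical to direct fetch. A minimal sketch of reading the result, assuming the standard chat-completions response shape (`choices[0].message.content`); `extractReply` is a hypothetical helper, not part of any API here:

```javascript
// Hypothetical helper: pull the assistant text out of a chat-completions
// response body. The shape is Azure OpenAI's standard one; the gateway
// passes it through untouched.
function extractReply(body) {
  const choice = body?.choices?.[0];
  if (!choice?.message?.content) {
    throw new Error(`No completion in response: ${JSON.stringify(body)}`);
  }
  return choice.message.content;
}

// Usage after the fetch above:
//   const reply = extractReply(await res.json());
```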
02

Cache control headers — let the gateway absorb duplicates

Idempotent prompts (RAG-grounded Q&A, classification, structured extraction) repeat. The gateway hashes the request body and serves a cached response when the next identical request lands. You opt in per-request via the cf-aig-cache-ttl header — set 0 for chat conversations where every response should be fresh, set 3600 for "what's the return policy" prompts where the answer hasn't changed in a year. No KV setup, no hash function, no sha256 on the hot path.

const res = await fetch(url, {
  method: 'POST',
  headers: {
    'api-key': env.AOAI_KEY,
    'Content-Type': 'application/json',
    'cf-aig-cache-ttl': '3600',    // 1 hour
    'cf-aig-cache-key': tenantId,  // scope cache per tenant
  },
  body: JSON.stringify({ messages, max_tokens: 800, temperature: 0 }),
});

// Inspect the headers to see if it was a hit
const hit = res.headers.get('cf-aig-cache-status');  // "HIT" | "MISS" | "BYPASS"
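One way to keep the "fresh for chat, cached for FAQ" decision in one place is a small TTL policy function. A sketch; the request-class names here are illustrative, not part of the gateway API:

```javascript
// Illustrative TTL policy: map a request class to a cf-aig-cache-ttl value.
// '0' disables caching for that request; any other value is seconds.
function cacheTtlFor(kind) {
  switch (kind) {
    case 'chat':           return '0';     // conversational, always fresh
    case 'classification': return '3600';  // deterministic at temperature 0
    case 'faq':            return '86400'; // answers change rarely
    default:               return '0';     // unknown, so safest is no cache
  }
}
```

Pass the result as the `cf-aig-cache-ttl` header in the fetch above.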
03

Built-in retry — gateway handles the 429s for you

SA North's tighter quota means 429s are routine. With direct fetch you write the retry loop yourself (see concept 03 there). With AI Gateway, you set retry headers on the request and the gateway honours retry-after-ms from Azure, falls back to exponential backoff, and only surfaces the failure to your Worker if all attempts fail. The Worker stays simple; the resilience lives in the gateway.

const res = await fetch(url, {
  method: 'POST',
  headers: {
    'api-key': env.AOAI_KEY,
    'Content-Type': 'application/json',
    'cf-aig-max-attempts': '3',
    'cf-aig-retry-delay': '500',             // initial ms
    'cf-aig-backoff': 'exponential',
  },
  body: JSON.stringify({ messages }),
});

// Inspect what the gateway did
const attempts = res.headers.get('cf-aig-attempts');   // "1" | "2" | "3"
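For contrast, here is the arithmetic the gateway's exponential backoff replaces, using the same 500 ms initial delay the headers above configure. The exact schedule the gateway uses internally is an assumption; this is the textbook form:

```javascript
// What 'cf-aig-backoff: exponential' computes for you: the delay before
// retry attempt n (1-indexed), doubling from an initial delay in ms.
function backoffDelay(attempt, initialMs = 500) {
  return initialMs * 2 ** (attempt - 1);
}

// With the headers above (3 attempts, 500 ms initial), the waits are
// roughly 500 ms, then 1000 ms, before the failure surfaces; any
// retry-after-ms that Azure sends takes precedence.
```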
04

Per-tenant rate limits and fallback routing

For a multi-tenant SaaS, the gateway's per-key rate limit is the cleanest abstraction. Set a daily token budget per tenant in the dashboard; the gateway enforces it without your Worker tracking counters. For resilience against an Azure OpenAI region outage, configure a fallback — the gateway tries SA North first, falls back to West Europe on 5xx, and surfaces a header telling the Worker which provider answered. Useful for non-regulated workloads where availability outranks strict residency.

// Tenant identity → cache key + rate limit bucket
const res = await fetch(url, {
  method: 'POST',
  headers: {
    'api-key': env.AOAI_KEY,
    'cf-aig-metadata': JSON.stringify({ tenantId, userId }),
    'cf-aig-cache-key': tenantId,
  },
  body: JSON.stringify({ messages }),
});

// Rate limit hit → 429 from the gateway, before AOAI is even called
if (res.status === 429 && res.headers.get('cf-aig-source') === 'rate-limit') {
  return new Response('Tenant quota exceeded', { status: 429 });
}

// Fallback: gateway tried SA North, failed, served from West Europe
const answeredBy = res.headers.get('cf-aig-provider');  // "azure-openai-sa" | "azure-openai-eu"
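The tenant-scoped headers tend to be needed at every call site, so it is worth centralising them. A sketch; `tenantHeaders` is a hypothetical helper, not a gateway API:

```javascript
// Hypothetical helper: build the per-tenant header set used above, so every
// call site tags requests consistently for rate limiting and cache scoping.
function tenantHeaders(apiKey, tenantId, userId) {
  return {
    'api-key': apiKey,
    'Content-Type': 'application/json',
    'cf-aig-metadata': JSON.stringify({ tenantId, userId }),
    'cf-aig-cache-key': tenantId,
  };
}
```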

AI Gateway vs the alternatives.

There are four shapes of "thing between your Worker and the LLM". They make different trade-offs on observability, latency, and the cost of standing them up.

AI Gateway → AOAI (this page): ~5 ms overhead · built-in dashboards and logs · built-in, header-driven caching · setup is one URL change
Direct fetch → AOAI: ~0 ms · DIY observability (Analytics Engine, Logpush) · manual caching (KV + sha256) · lower cost up front, higher long-tail
Custom Worker proxy: ~2–8 ms · as much observability as you build · as much caching as you build · high setup, you own the code forever
Helicone / LangSmith proxy: ~50–150 ms · strong but vendor-specific · vendor-dependent caching · account + URL change, traffic crosses the border
Azure API Management → AOAI: ~20 ms · strong, Azure-native · built-in · the Azure-shop tax: APIM is a project in itself

Workloads that always end up routed through it.

Some shapes of LLM workload are AI-Gateway-shaped on day one. Others get routed through it the second time the team has to debug a prompt that misbehaved in production.

Multi-tenant SaaS

Per-tenant token budget enforcement

One Worker, many customers, separate spending caps. The gateway's cf-aig-metadata + per-key rate limit replaces an in-Worker counter that would otherwise need a Durable Object for serialisation.

Customer support chat

Cache "what's your return policy"

The same five questions account for ~40% of customer support traffic. Cache them at the gateway with a 1-hour TTL keyed by question hash, save tokens, drop p50 to ~50 ms for repeat questions.
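Caching by question only pays off if near-identical questions map to the same cache key. A sketch of a key normaliser, assuming case and whitespace are the main sources of variation; `cacheKeyFor` is illustrative, not a gateway API:

```javascript
// Normalise a question before using it as a cf-aig-cache-key, so
// "What's your return policy?" and " what's your RETURN policy? " hit
// the same cache entry, scoped per tenant.
function cacheKeyFor(tenantId, question) {
  const normalised = question.trim().toLowerCase().replace(/\s+/g, ' ');
  return `${tenantId}:${normalised}`;
}
```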

Prompt engineering

Compare two system prompts in production

Tag requests via cf-aig-metadata with the prompt version, then filter the dashboard. No A/B framework, no analytics pipeline — the gateway logs are the experiment.

Cost governance

Slack alert when monthly spend hits 80%

Set a budget in the gateway, hook a webhook to Slack. Far easier than parsing Azure billing exports — the gateway sees every token by deployment and sums them in real time.
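The Slack half of that is a few lines in a Worker. A sketch; the budget-alert payload fields used here are an assumed example, not a documented gateway schema:

```javascript
// Hypothetical: turn a budget-alert webhook payload into a Slack message
// body, ready to POST to an incoming-webhook URL.
function slackAlert(alert) {
  return {
    text: `AI Gateway budget alert: ${alert.gateway} at ${alert.percentUsed}% ` +
          `of monthly budget ($${alert.spend} of $${alert.budget}).`,
  };
}
```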

Resilience

Fallback to West Europe on SA outage

Non-regulated workloads where availability outranks residency: configure SA North as primary, West Europe as fallback. Gateway switches on 5xx, surfaces which provider answered via response header.

Prompt audit

Compliance log of every customer prompt

Logpush stream from AI Gateway → R2 → S3-compatible bucket → SIEM ingestion. Append-only audit log of every prompt and response that left the Worker, in country, retained for the regulatory window.

From "log my LLM calls" to default infrastructure.

AI Gateway shipped late. Cloudflare's bet was that LLM workloads would settle into a small set of providers with a shared shape — and that one well-implemented proxy would be more useful than asking developers to build their own.

2023 Q3
AI Gateway beta
Launched alongside Workers AI. Initial coverage: OpenAI, Anthropic, Workers AI. Logs and basic caching. Not yet production-grade.
2024 Q1
Azure OpenAI provider · cost dashboards
Azure OpenAI as a first-class provider. Per-token cost graphs by deployment. The pattern on this page becomes possible.
2024 Q3
GA · Logpush · Universal Endpoint
Production-grade. Logpush to R2 / S3 / Splunk. Universal Endpoint lets one URL fan out to multiple providers based on policy.
2025
Per-key rate limiting · fallback chains · evaluations
Multi-tenant SaaS pattern lands. Provider fallback chains for resilience. Built-in eval framework for prompt regression tests.
2026
Default infrastructure
For new LLM workloads on Cloudflare, AI Gateway is the default. Direct fetch is the path you take only when the gateway adds friction.

When to route through it. When the +5 ms costs you.

AI Gateway is a default-yes for the workloads it's designed for. The cases where direct fetch wins are narrower than they used to be — most of them turn into "use the gateway and configure around the friction" once you write the code.

✓ Use it when
  • The workload is in production. If real users see the responses, you'll need logs to debug. The gateway's logs are free. Writing your own with Analytics Engine isn't.
  • Multi-tenant SaaS. Per-key rate limits replace a Durable Object counter. The cost-attribution dashboards replace a billing pipeline.
  • Idempotent prompts repeat. RAG-grounded Q&A, classification, structured extraction. Even a 30% cache hit rate pays back the +5 ms many times over.
  • You need cost governance. Budget alerts, per-tenant attribution, model-mix dashboards. All of it shipped.
  • Compliance wants a prompt audit log. Logpush → R2 in country, append-only, retained for the regulatory window. The cleanest path available.

Where this node sits in the chain.

AI Gateway is rarely the whole story. It sits between the Worker that calls it and the AOAI deployment it routes to, and it depends on Logpush for the audit story, R2 for log storage, and the workload-specific Worker pattern for what's calling it.

For agents loading this context

What this node gives you

The upgrade path from a working direct-fetch Worker to a production-grade LLM caller. Concepts 01–04 stack the same way the underlying explainer does — start with the URL swap for instant observability, layer in cache for cost, retry for resilience, and per-tenant metadata for SaaS posture. Skip the gateway entirely for sub-50 ms latency budgets, strict private-endpoint deployments, or SARB workloads where any Cloudflare-side processing is contentious.

When the agent-context API ships, this node will also expose the matching wrangler.toml, the gateway-creation Terraform, and the Logpush configuration for compliance-grade prompt audit.

Go deeper.

Cloudflare's docs cover the service well. Microsoft's Azure OpenAI reference is still the source of truth for the request body, since AI Gateway is transparent to it. Skill-tree links round out the picture.

Agent context

Load this node into your agent

Reference Worker with all four concepts wired, gateway-creation Terraform, and the Logpush configuration for prompt audit. Shipping with the know.2nth.ai Worker API.