Cloudflare AI Gateway is the observability and resilience layer that sits between a Worker and Azure OpenAI: caching, retries, prompt logs, cost dashboards, and per-tenant rate limits, all from changing one URL. It's the natural upgrade from the direct-fetch pattern once the workload reaches the volume where you'd otherwise build a custom proxy.
The direct-fetch pattern works. It's lean, fast, and easy to reason about. What it lacks is the operational layer that any LLM workload eventually needs — a place to see prompts and responses for debugging, a cache that absorbs duplicate requests, retry logic that survives a 429 burst, fallback to a backup model when the primary deployment is down, and per-tenant rate limits in a multi-tenant SaaS.
You can write all of that. Most teams do, eventually. AI Gateway ships it as a Cloudflare service that sits between your Worker and Azure OpenAI. The change is a one-line edit: instead of fetch('https://aoai-imbila-sa.openai.azure.com/...'), you fetch https://gateway.ai.cloudflare.com/v1/<account>/<gateway>/azure-openai/aoai-imbila-sa/.... The request body, the api-version, the auth header — all unchanged. The response is the same shape. The dashboard now has graphs.
The cost is one extra hop. From a JNB Worker that hop is roughly 5 ms — small enough that the operational benefits dominate. The default decision in 2026 for any serious LLM workload on Cloudflare is to route through AI Gateway. Direct fetch becomes the path you take only when the gateway adds friction (highly custom retry semantics, very strict latency budgets, or a prototype where the dashboards aren't worth the URL change).
Operational visibility, free caching, and a per-tenant rate-limit layer aren't decorations — they're the difference between a Worker that works and a Worker that's debuggable, reliable, and cost-controllable in production.
Start with the URL swap (concept 01) and you immediately get logs and a cost dashboard. Layer in cache, retry, and per-tenant rate limits as the workload demands them. The order matches the order most teams adopt them in production.
One wrangler secret put and one URL change. The Worker code from the direct-fetch pattern still works — the gateway forwards the request to Azure OpenAI with the api-key header you set, returns the body, and logs every request and response. Open the AI Gateway dashboard and there are graphs by model, deployment, latency, and cost.
```js
// Before — direct fetch (no observability)
// const url = `${env.AOAI_ENDPOINT}/openai/deployments/${env.AOAI_DEPLOYMENT}/chat/completions?api-version=2025-01-01-preview`;

// After — same shape, gateway in front
const url = `https://gateway.ai.cloudflare.com/v1/${env.CF_ACCOUNT_ID}/${env.GATEWAY_ID}/azure-openai/${env.AOAI_RESOURCE}/${env.AOAI_DEPLOYMENT}/chat/completions?api-version=2025-01-01-preview`;

// Body, headers, auth — completely unchanged
const res = await fetch(url, {
  method: 'POST',
  headers: { 'api-key': env.AOAI_KEY, 'Content-Type': 'application/json' },
  body: JSON.stringify({ messages, max_tokens: 800 }),
});
```
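For completeness, a minimal sketch of the wrangler side, assuming the binding names used in the snippet above. Every value is a placeholder except the resource name already used on this page; the gateway slug comes from the dashboard when you create the gateway.

```toml
# wrangler.toml — illustrative [vars] matching the snippet above
[vars]
CF_ACCOUNT_ID = "<account-id>"
GATEWAY_ID = "<gateway-slug>"         # created once in the AI Gateway dashboard
AOAI_RESOURCE = "aoai-imbila-sa"
AOAI_DEPLOYMENT = "<deployment-name>"

# The Azure OpenAI key never goes in [vars]; set it as a secret instead:
#   npx wrangler secret put AOAI_KEY
```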
Idempotent prompts (RAG-grounded Q&A, classification, structured extraction) repeat. The gateway hashes the request body and serves a cached response when the next identical request lands. You opt in per-request via the cf-aig-cache-ttl header — set 0 for chat conversations where every response should be fresh, set 3600 for "what's the return policy" prompts where the answer hasn't changed in a year. No KV setup, no hash function, no sha256 on the hot path.
```js
const res = await fetch(url, {
  method: 'POST',
  headers: {
    'api-key': env.AOAI_KEY,
    'Content-Type': 'application/json',
    'cf-aig-cache-ttl': '3600',   // 1 hour
    'cf-aig-cache-key': tenantId, // scope cache per tenant
  },
  body: JSON.stringify({ messages, max_tokens: 800, temperature: 0 }),
});

// Inspect the headers to see if it was a hit
const hit = res.headers.get('cf-aig-cache-status'); // "HIT" | "MISS" | "BYPASS"
```
SA North's tighter quota means 429s are routine. With direct fetch you write the retry loop yourself (concept 03 of the direct-fetch pattern). With AI Gateway, you set retry headers on the request and the gateway honours retry-after-ms from Azure, falls back to exponential backoff, and only surfaces the failure to your Worker if all attempts fail. The Worker stays simple; the resilience lives in the gateway.
```js
const res = await fetch(url, {
  method: 'POST',
  headers: {
    'api-key': env.AOAI_KEY,
    'Content-Type': 'application/json',
    'cf-aig-max-attempts': '3',
    'cf-aig-retry-delay': '500',     // initial ms
    'cf-aig-backoff': 'exponential',
  },
  body: JSON.stringify({ messages }),
});

// Inspect what the gateway did
const attempts = res.headers.get('cf-aig-attempts'); // "1" | "2" | "3"
```
For a multi-tenant SaaS, the gateway's per-key rate limit is the cleanest abstraction. Set a daily token budget per tenant in the dashboard; the gateway enforces it without your Worker tracking counters. For resilience against an Azure OpenAI region outage, configure a fallback — the gateway tries SA North first, falls back to West Europe on 5xx, and surfaces a header telling the Worker which provider answered. Useful for non-regulated workloads where availability outranks strict residency.
```js
// Tenant identity → cache key + rate limit bucket
const res = await fetch(url, {
  method: 'POST',
  headers: {
    'api-key': env.AOAI_KEY,
    'Content-Type': 'application/json',
    'cf-aig-metadata': JSON.stringify({ tenantId, userId }),
    'cf-aig-cache-key': tenantId,
  },
  body: JSON.stringify({ messages }),
});

// Rate limit hit → 429 from the gateway, before AOAI is even called
if (res.status === 429 && res.headers.get('cf-aig-source') === 'rate-limit') {
  return new Response('Tenant quota exceeded', { status: 429 });
}

// Fallback: gateway tried SA North, failed, served from West Europe
const answeredBy = res.headers.get('cf-aig-provider'); // "azure-openai-sa" | "azure-openai-eu"
```
There are four shapes of "thing between your Worker and the LLM", plus the direct-fetch baseline with nothing in between. They make different trade-offs on observability, latency, and the cost of standing them up.
| Pattern | Latency overhead | Built-in observability | Caching | Setup cost |
|---|---|---|---|---|
| AI Gateway → AOAI (this page) | ~5 ms | Yes — dashboards, logs | Built-in, header-driven | One URL change |
| Direct fetch → AOAI | ~0 ms | DIY (Analytics Engine, Logpush) | Manual (KV + sha256) | Lower up front, higher long-tail |
| Custom Worker proxy | ~2–8 ms | As much as you build | As much as you build | High — you own the code forever |
| Helicone / LangSmith proxy | ~50–150 ms | Strong, vendor-specific | Vendor-dependent | Account + URL change · cross-border |
| Azure API Management → AOAI | ~20 ms | Strong, Azure-native | Built-in | Azure-shop tax — APIM is a project |
Some shapes of LLM workload are AI-Gateway-shaped on day one. Others get routed through it the second time the team has to debug a prompt that misbehaved in production.
One Worker, many customers, separate spending caps. The gateway's cf-aig-metadata + per-key rate limit replaces an in-Worker counter that would otherwise need a Durable Object for serialisation.
The same five questions account for ~40% of customer support traffic. Cache them at the gateway with a 1-hour TTL keyed by question hash, save tokens, drop p50 to ~50 ms for repeat questions.
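One way to build that question-hash key, sketched here with the Web Crypto API that Workers expose. The questionCacheKey helper and the question variable are illustrative names; the cache headers are the same ones shown in the caching example above.

```js
// Illustrative helper: derive a stable cache key from the question text
async function questionCacheKey(question) {
  const normalised = question.trim().toLowerCase().replace(/\s+/g, ' ');
  const digest = await crypto.subtle.digest('SHA-256', new TextEncoder().encode(normalised));
  return [...new Uint8Array(digest)].map((b) => b.toString(16).padStart(2, '0')).join('');
}

const res = await fetch(url, {
  method: 'POST',
  headers: {
    'api-key': env.AOAI_KEY,
    'Content-Type': 'application/json',
    'cf-aig-cache-ttl': '3600',                           // 1-hour TTL from the blurb above
    'cf-aig-cache-key': await questionCacheKey(question), // identical questions share a cache entry
  },
  body: JSON.stringify({ messages, temperature: 0 }),
});
```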
Tag requests via cf-aig-metadata with the prompt version, then filter the dashboard. No A/B framework, no analytics pipeline — the gateway logs are the experiment.
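A sketch of what that tagging looks like in the Worker, assuming two prompt variants. SYSTEM_PROMPT_V1, SYSTEM_PROMPT_V2, and userQuestion are illustrative names; the promptVersion field inside cf-aig-metadata is what you filter the gateway logs on.

```js
// Illustrative sketch: split traffic between two system prompts and tag each request
const promptVersion = Math.random() < 0.5 ? 'v1' : 'v2';
const systemPrompt = promptVersion === 'v1' ? SYSTEM_PROMPT_V1 : SYSTEM_PROMPT_V2;

const res = await fetch(url, {
  method: 'POST',
  headers: {
    'api-key': env.AOAI_KEY,
    'Content-Type': 'application/json',
    'cf-aig-metadata': JSON.stringify({ promptVersion }), // filter the dashboard on this field
  },
  body: JSON.stringify({
    messages: [
      { role: 'system', content: systemPrompt },
      { role: 'user', content: userQuestion },
    ],
    max_tokens: 800,
  }),
});
```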
Set a budget in the gateway, hook a webhook to Slack. Far easier than parsing Azure billing exports — the gateway sees every token by deployment and sums them in real time.
Non-regulated workloads where availability outranks residency: configure SA North as primary, West Europe as fallback. Gateway switches on 5xx, surfaces which provider answered via response header.
Logpush stream from AI Gateway → R2 → S3-compatible bucket → SIEM ingestion. Append-only audit log of every prompt and response that left the Worker, in country, retained for the regulatory window.
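A sketch of the first hop only, creating the Logpush job against the account-scoped Logpush API with an R2 destination. The dataset name is left as a placeholder to confirm against Cloudflare's Logpush docs; the bucket name, job name, and environment variables are illustrative.

```js
// Illustrative one-off Node script (not part of the Worker): create a Logpush job
// that streams gateway logs into an R2 bucket. The dataset value is a placeholder,
// not a confirmed name; check the Logpush datasets list before running this.
const { CF_ACCOUNT_ID, CF_API_TOKEN, R2_ACCESS_KEY_ID, R2_SECRET_ACCESS_KEY } = process.env;

const destination =
  `r2://prompt-audit-logs/ai-gateway` +
  `?account-id=${CF_ACCOUNT_ID}&access-key-id=${R2_ACCESS_KEY_ID}&secret-access-key=${R2_SECRET_ACCESS_KEY}`;

const resp = await fetch(`https://api.cloudflare.com/client/v4/accounts/${CF_ACCOUNT_ID}/logpush/jobs`, {
  method: 'POST',
  headers: { Authorization: `Bearer ${CF_API_TOKEN}`, 'Content-Type': 'application/json' },
  body: JSON.stringify({
    name: 'ai-gateway-prompt-audit',
    dataset: '<ai-gateway-dataset>', // placeholder
    destination_conf: destination,
  }),
});
console.log(resp.status, await resp.json());
```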
AI Gateway shipped late. Cloudflare's bet was that LLM workloads would settle into a small set of providers with a shared shape — and that one well-implemented proxy would be more useful than asking developers to build their own.
AI Gateway is a default-yes for the workloads it's designed for. The cases where direct fetch wins are narrower than they used to be — most of them turn into "use the gateway and configure around the friction" once you write the code.
AI Gateway is rarely the whole story. It sits between the Worker that calls it and the AOAI deployment it routes to, and it depends on Logpush for the audit story, R2 for log storage, and the workload-specific Worker pattern for what's calling it.
The upgrade path from a working direct-fetch Worker to a production-grade LLM caller. Concepts 01–04 stack the same way the underlying explainer does — start with the URL swap for instant observability, layer in cache for cost, retry for resilience, and per-tenant metadata for SaaS posture. Skip the gateway entirely for sub-50 ms latency budgets, strict private-endpoint deployments, or SARB workloads where any Cloudflare-side processing is contentious.
When the agent-context API ships, this node will also expose the matching wrangler.toml, the gateway-creation Terraform, and the Logpush configuration for compliance-grade prompt audit.
Cloudflare's docs cover the service well. Microsoft's Azure OpenAI reference is still the source of truth for the request body, since AI Gateway is transparent to it. Skill tree links round out the picture.
Reference Worker with all four concepts wired, gateway-creation Terraform, and the Logpush configuration for prompt audit. Shipping with the know.2nth.ai Worker API.