The headless pattern under the SA in-country LLM stack — call gpt-4o in southafricanorth from a JNB-colocated Worker with sub-millisecond cold starts, full prompt-loop control, and the throttle-handling discipline the region demands. When the Copilot UX is overhead and you'd rather own the request shape end-to-end.
fetch(), the way Cloudflare engineers prefer.

The Copilot in South Africa node lays out three patterns for an in-country LLM stack on Microsoft. This page is the one Cloudflare-shop engineers actually deploy — a Worker calls Azure OpenAI directly, with no Copilot UX layer, no Power Platform environment, and no Dataverse. Just a JNB-colocated isolate making an HTTPS POST to aoai-imbila-sa.openai.azure.com.
The benefit isn't theoretical. From the JNB Cloudflare colo to the SA North Azure region the round trip is roughly 20 ms p50 — the same latency budget you'd hit talking to your own database. Cold starts are zero because Workers runs in V8 isolates. The whole prompt loop — auth, request shape, streaming, retries, caching — is yours to control, with no managed-orchestration layer making decisions silently on your behalf.
The trade-off is also honest: you write the production-readiness layer yourself. Copilot Studio handles throttling, fallback, conversation state, and Bing-grounding-disabled-by-policy out of the box. A raw Worker handles none of that until you write it. The rest of this page is what "writing it" looks like.
Latency, cost, and quota all bend in your favour when the caller and the LLM both sit in country. The shape of the workload — short, stateless, request-scoped — is also exactly what V8 isolates are good at.
Start with concept 01 — the basic shape works for a prototype. Layer in the rest as the workload demands. There's no point KV-caching a chat response if your throughput is twelve requests a day.
api-key auth and a fetch

Azure OpenAI uses the same REST shape as OpenAI's own API: a different base URL, an api-version query parameter, and an api-key header instead of Authorization: Bearer. That's the entire delta. From a Worker you don't even need an SDK — fetch() is enough, and the response body is already a ReadableStream ready to hand back.
```ts
// src/index.ts — the smallest useful Worker that calls SA-North gpt-4o
type Env = {
  AOAI_ENDPOINT: string;   // https://aoai-imbila-sa.openai.azure.com
  AOAI_DEPLOYMENT: string; // gpt-4o-za
  AOAI_KEY: string;        // wrangler secret put AOAI_KEY
};

export default {
  async fetch(req: Request, env: Env): Promise<Response> {
    const { messages } = await req.json();
    const url = `${env.AOAI_ENDPOINT}/openai/deployments/${env.AOAI_DEPLOYMENT}/chat/completions?api-version=2025-01-01-preview`;
    const aoai = await fetch(url, {
      method: 'POST',
      headers: { 'api-key': env.AOAI_KEY, 'Content-Type': 'application/json' },
      body: JSON.stringify({ messages, max_tokens: 800, temperature: 0.3 }),
    });
    return new Response(aoai.body, { status: aoai.status, headers: aoai.headers });
  },
};
```
Chat UX needs first-token-fast — the user wants to see the response start within ~300 ms of pressing send, not wait for the whole completion. Set stream: true on the request, and Azure responds with text/event-stream chunks. A Worker doesn't need to parse them. The ReadableStream from fetch() can be returned directly as the response body — Workers does no buffering, so the user's browser sees tokens as fast as they leave Azure.
```ts
// Same handler, streaming variant
const aoai = await fetch(url, {
  method: 'POST',
  headers: { 'api-key': env.AOAI_KEY, 'Content-Type': 'application/json' },
  body: JSON.stringify({
    messages,
    stream: true,
    max_tokens: 800,
    stream_options: { include_usage: true }, // final chunk has token counts
  }),
});

// Pass the SSE stream straight through. No JSON parsing, no buffering.
return new Response(aoai.body, {
  headers: {
    'Content-Type': 'text/event-stream',
    'Cache-Control': 'no-cache, no-transform',
    'X-Accel-Buffering': 'no',
  },
});
```
retry-after-ms

SA North's default token-per-minute quota is roughly 30–50% of the equivalent quota in East US. Burst traffic that auto-scales fine in US regions hits 429s here. Azure OpenAI returns a retry-after-ms header on throttle responses — honour it, fall back to exponential backoff if the header is missing, and cap retries at three. For idempotent prompts, layer KV caching on top so repeat requests are served at zero cost.
```ts
async function chatWithRetry(
  env: Env,
  messages: unknown,
  attempt = 0
): Promise<Response> {
  const url = `${env.AOAI_ENDPOINT}/openai/deployments/${env.AOAI_DEPLOYMENT}/chat/completions?api-version=2025-01-01-preview`;
  const res = await fetch(url, {
    method: 'POST',
    headers: { 'api-key': env.AOAI_KEY, 'Content-Type': 'application/json' },
    body: JSON.stringify({ messages, max_tokens: 800 }),
  });
  if (res.status === 429 && attempt < 3) {
    const hdr = res.headers.get('retry-after-ms');
    const wait = hdr ? Number(hdr) : 500 * 2 ** attempt; // 500, 1000, 2000
    await new Promise(r => setTimeout(r, wait));
    return chatWithRetry(env, messages, attempt + 1);
  }
  return res;
}

// Hex SHA-256 via Web Crypto — used for the cache key below.
async function sha256(text: string): Promise<string> {
  const digest = await crypto.subtle.digest('SHA-256', new TextEncoder().encode(text));
  return [...new Uint8Array(digest)].map(b => b.toString(16).padStart(2, '0')).join('');
}

// Idempotent prompt? Hash + KV.
const key = `aoai:${await sha256(JSON.stringify(messages))}`;
const cached = await env.CACHE.get(key);
if (cached) return new Response(cached, { headers: { 'Content-Type': 'application/json' } });

const fresh = await chatWithRetry(env, messages);
const body = await fresh.text();
await env.CACHE.put(key, body, { expirationTtl: 3600 });
return new Response(body);
```
API keys are fine for a prototype. For production — especially SARB-strict workloads — the better pattern is OAuth client credentials with a service principal in Entra ID. The Worker fetches an access token from login.microsoftonline.com, caches it in KV until ~5 minutes before expiry, and presents it as Authorization: Bearer <token> on each AOAI call. Rotate the client secret in Key Vault without redeploying the Worker. Workers can't have an Azure managed identity directly — this is the closest equivalent.
```ts
async function getAzureToken(env: Env): Promise<string> {
  const cached = await env.CACHE.get('aoai:token');
  if (cached) return cached;

  const res = await fetch(
    `https://login.microsoftonline.com/${env.AZURE_TENANT_ID}/oauth2/v2.0/token`,
    {
      method: 'POST',
      headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
      body: new URLSearchParams({
        client_id: env.AZURE_CLIENT_ID,
        client_secret: env.AZURE_CLIENT_SECRET,
        scope: 'https://cognitiveservices.azure.com/.default',
        grant_type: 'client_credentials',
      }),
    }
  );
  const { access_token, expires_in } = await res.json() as any;

  // Cache 5 min short of expiry — Entra tokens are 60-min by default
  await env.CACHE.put('aoai:token', access_token, {
    expirationTtl: expires_in - 300,
  });
  return access_token;
}

// Use it on the AOAI call
const token = await getAzureToken(env);
const aoai = await fetch(url, {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${token}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({ messages }),
});
```
A Worker calling Azure OpenAI is one of several callable shapes. The other shapes have their place — knowing where the boundary sits is half the architecture decision.
| Caller pattern | Latency to model | SA-resident | Cost shape | Best for |
|---|---|---|---|---|
| Worker → AOAI (this page) | ~20 ms p50 | Yes (model in SA North) | Token-metered + Workers requests | Headless apps, customer-facing chat, RAG APIs |
| Worker → AI Gateway → AOAI | ~25 ms p50 | Yes (gateway + model) | Same + free gateway tier | Same as above plus telemetry, caching, fallback |
| Worker → Workers AI (env.AI) | ~10 ms p50 | Yes (Cloudflare hosted) | Per-neuron, cheaper | Routing/classification with Llama 3.1; not gpt-4o quality |
| Worker → Anthropic / OpenAI direct | ~180 ms p50 | No | Token-metered | Non-regulated workloads needing Claude or o3 |
| Lambda (af-south-1) → AOAI | ~30 ms p50 | Yes (both in SA) | Per-invocation + cold starts | Teams already deep in AWS |
| Azure Function (SA North) → AOAI | ~5 ms p50 | Yes (same region) | Per-execution + plan | Single-vendor Azure shops |
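If the AI Gateway row in the table becomes attractive later, the migration from the direct pattern is a base-URL swap — the api-key header, body, and streamed response all stay the same. A minimal sketch, assuming a gateway named `sa-llm` (hypothetical) and AI Gateway's azure-openai provider path shape:

```typescript
// Build the AI Gateway proxy URL for an Azure OpenAI deployment.
// Assumed path shape (AI Gateway azure-openai provider):
//   /v1/{account_id}/{gateway}/azure-openai/{resource}/{deployment}/...
function buildGatewayUrl(
  accountId: string,
  gateway: string,
  resource: string,   // AOAI resource name, e.g. "aoai-imbila-sa" — not the full host
  deployment: string, // e.g. "gpt-4o-za"
  apiVersion = '2025-01-01-preview'
): string {
  return `https://gateway.ai.cloudflare.com/v1/${accountId}/${gateway}` +
    `/azure-openai/${resource}/${deployment}/chat/completions?api-version=${apiVersion}`;
}
```

Everything else in concept 01 is unchanged; the gateway adds telemetry, caching, and fallback without touching the request shape.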
These are the workload shapes where the headless Worker pattern wins over Copilot Studio. Customer-facing, programmatic, or volumetric — and almost always in a Cloudflare-first stack.
Worker pulls top-k from Vectorize, stuffs context into a gpt-4o prompt, streams the answer back. Sub-100 ms first-token, single binding chain, no Microsoft tenant in the picture at all.
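A sketch of that hop, under stated assumptions: a Vectorize binding named `VECTORIZE`, chunks stored with a `text` metadata field, and a query vector already computed — none of which the page above pins down. The prompt-assembly step is pure and shown in full; the binding calls are sketched in comments.

```typescript
// Hypothetical metadata shape: each stored vector carries its source text.
type Match = { metadata?: { text?: string } };

// Pure helper: fold top-k chunks into a grounded chat payload.
function buildGroundedMessages(question: string, chunks: string[]) {
  return [
    {
      role: 'system',
      content: `Answer only from the context below.\n\n${chunks
        .map((c, i) => `[${i + 1}] ${c}`)
        .join('\n')}`,
    },
    { role: 'user', content: question },
  ];
}

// Inside the Worker handler (bindings assumed):
//   const { matches } = await env.VECTORIZE.query(queryVector, { topK: 5, returnMetadata: true });
//   const chunks = matches.map((m: Match) => m.metadata?.text ?? '').filter(Boolean);
//   const messages = buildGroundedMessages(question, chunks);
//   // ...then the gpt-4o call from concept 01/02, streamed back.
```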
End-users aren't Microsoft licensees. Embed a chat widget on a public site, hit a Worker, hit AOAI. Copilot Studio's per-user licensing makes no sense here; raw Worker + token metering does.
Cron-triggered Worker reads new SharePoint or D1 rows, fans out to text-embedding-3-large in SA North, writes vectors into Vectorize. Cheaper than per-user Copilot indexing for back-office corpora.
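The embeddings endpoint accepts an array of inputs per call, so the cron Worker should batch rows before fanning out. A sketch of the batching step — the batch size of 16 is an arbitrary choice for illustration, not an Azure limit, and `readNewRows`/`embed` are assumed helpers:

```typescript
// Split rows into fixed-size batches for the embeddings endpoint's `input` array.
function toBatches<T>(rows: T[], size = 16): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < rows.length; i += size) out.push(rows.slice(i, i + size));
  return out;
}

// Cron entry point — same endpoint shape as concept 01, with /embeddings
// in place of /chat/completions:
// export default {
//   async scheduled(_ev: ScheduledEvent, env: Env) {
//     const rows = await readNewRows(env);        // D1 or SharePoint delta (assumed helper)
//     for (const batch of toBatches(rows)) {
//       const vectors = await embed(env, batch);  // POST { input: batch } to text-embedding-3-large
//       await env.VECTORIZE.upsert(vectors);
//     }
//   },
// };
```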
Producer drops jobs into Cloudflare Queues, consumer Worker pulls and calls AOAI with retry. Survives load shedding, throttle bursts, and Azure capacity wobbles without blocking a user-facing request.
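A sketch of the consumer side, assuming a job payload of `{ messages }` and reusing `chatWithRetry` from concept 03. The delay calculation is pure and shown in full; the 300-second ceiling is an arbitrary safety cap, not a Queues limit.

```typescript
// Map Azure's retry-after-ms header (or the attempt count) to Queues' delaySeconds.
function retryDelaySeconds(retryAfterMs: string | null, attempts: number): number {
  const ms = retryAfterMs ? Number(retryAfterMs) : 1000 * 2 ** attempts;
  return Math.min(Math.ceil(ms / 1000), 300); // cap the wait at 5 min (arbitrary ceiling)
}

// Consumer sketch — on a throttle, re-queue with the delay Azure asked for.
// export default {
//   async queue(batch: MessageBatch<{ messages: unknown }>, env: Env) {
//     for (const msg of batch.messages) {
//       const res = await chatWithRetry(env, msg.body.messages);
//       if (res.status === 429) {
//         msg.retry({ delaySeconds: retryDelaySeconds(res.headers.get('retry-after-ms'), msg.attempts) });
//       } else {
//         msg.ack();
//       }
//     }
//   },
// };
```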
Each tenant supplies their own AOAI endpoint and key (so the data is in their Azure subscription, not yours). Worker dispatches per request.headers.get('X-Tenant') — clean POPIA boundary, single deploy.
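A sketch of the dispatch step, assuming tenant configs live in a KV namespace under hypothetical `tenant:<id>` keys. The lookup is injected as a function so the routing logic stays pure; in the Worker it would be `id => env.TENANTS.get(`tenant:${id}`)`.

```typescript
type TenantConfig = { endpoint: string; deployment: string; key: string };

// Resolve the tenant from the request header; reject unknown tenants before any AOAI call.
async function resolveTenant(
  getConfig: (id: string) => Promise<string | null>,
  req: { headers: { get(name: string): string | null } }
): Promise<TenantConfig | null> {
  const id = req.headers.get('X-Tenant');
  if (!id || !/^[a-z0-9-]+$/.test(id)) return null; // never interpolate a raw header into a KV key
  const raw = await getConfig(id);
  return raw ? (JSON.parse(raw) as TenantConfig) : null;
}
```

From there the handler builds the concept-01 URL from the tenant's own endpoint and deployment, so every token stays inside that tenant's Azure subscription.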
User asks "show me overdue POs from supplier X". Worker calls AOAI to extract structured intent, then runs the GraphQL/REST against the ERP, then calls AOAI again to summarise. The Worker is the orchestration point.
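The fragile step in that chain is the first hop — the "structured intent" comes back from the model as text and must be validated before it touches the ERP. A defensive parse, with a hypothetical `POQuery` shape standing in for whatever the ERP query actually needs:

```typescript
// Hypothetical intent shape for the purchase-order example.
type POQuery = { supplier: string; status: 'overdue' | 'open' | 'closed' };

// Validate the model's JSON before building an ERP query — never trust the shape blindly.
function parseIntent(raw: string): POQuery | null {
  try {
    const obj = JSON.parse(raw);
    if (typeof obj.supplier !== 'string') return null;
    if (!['overdue', 'open', 'closed'].includes(obj.status)) return null;
    return { supplier: obj.supplier, status: obj.status };
  } catch {
    return null; // model returned prose or malformed JSON — fall back or re-prompt
  }
}
```

On a `null` result the Worker can re-prompt with a stricter instruction rather than passing garbage downstream.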
Workers wasn't designed for LLM workloads. It accidentally became one of the best places to host them when streaming, sub-millisecond cold starts, and bindings turned out to matter more than CPU time.
southafricanorth. SSE response shape settles across providers — the pass-through pattern in concept 02 becomes idiomatic.

A direct Worker call is the right answer for a specific shape of workload. Reach for Copilot Studio, AI Gateway, or Workflows when the constraints push past what an isolate is designed to do.
A Worker calling AOAI is rarely the whole architecture. It's the central hop in a chain that pulls on Cloudflare bindings, Microsoft regions, and ERP grounding. Here's the dependency footprint.
text-embedding-3-large, query top-k from the same Worker that calls gpt-4o.

A working scaffold for any "Cloudflare Worker calls Azure OpenAI in South Africa" prompt. The four concepts in section 03 stack — start with concept 01 to bootstrap, layer in 02 when the UX needs streaming, 03 when the throughput needs throttle handling, and 04 when the auth posture needs OAuth instead of api-key. Skip 04 entirely for prototypes; reach for AI Gateway instead of writing your own retry layer if observability matters.
When the agent-context API ships, this node will also expose a wrangler.toml scaffold, a complete src/index.ts reference implementation with all four concepts wired together, and the Azure CLI sequence for the corresponding AOAI deployment.
Microsoft's REST reference for the request shape, Cloudflare's for the runtime, and the SKILL.md companions for the tree depth. Skip the Medium tutorials — most of them got the api-version wrong.
Reference src/index.ts with all four concepts wired, wrangler.toml with KV + secret bindings, and the Azure CLI sequence for the matching AOAI deployment. Shipping with the know.2nth.ai Worker API.