Technology · Microsoft · Implementation Node

Azure OpenAI from a Worker.

The headless pattern under the SA in-country LLM stack — call gpt-4o in southafricanorth from a JNB-colocated Worker with sub-millisecond cold starts, full prompt-loop control, and the throttle-handling discipline the region demands. When the Copilot UX is overhead and you'd rather own the request shape end-to-end.

Implementation · Cloudflare Workers · Azure OpenAI SA North · Last updated May 2026

No Copilot UX. Just fetch(), the way Cloudflare engineers prefer.

The Copilot in South Africa node lays out three patterns for an in-country LLM stack on Microsoft. This page is the one Cloudflare-shop engineers actually deploy — a Worker calls Azure OpenAI directly, no Copilot UX layer, no Power Platform environment, no Dataverse. Just a JNB-colocated isolate making an HTTPS POST to aoai-imbila-sa.openai.azure.com.

The benefit isn't theoretical. From the JNB Cloudflare colo to the SA North Azure region the round trip is roughly 20 ms p50 — the same latency budget you'd hit talking to your own database. Cold starts are zero because Workers runs in V8 isolates. The whole prompt loop — auth, request shape, streaming, retries, caching — is yours to control, with no managed-orchestration layer making decisions silently on your behalf.

The trade-off is also honest: you write the production-readiness layer yourself. Copilot Studio handles throttling, fallback, conversation state, and policy controls such as disabling Bing grounding out of the box. A raw Worker handles none of that until you write it. The rest of this page is what "writing it" looks like.

01 Browser · SA user
02 JNB colo · Worker isolate
03 KV / D1 · RAG, sessions
04 Azure OpenAI · southafricanorth
05 Stream back · SSE pass-through

⬩ All four hops resolve inside the SA Geo · ~20 ms p50 round trip ⬩

The numbers that make a Worker the right caller.

Latency, cost, and quota all bend in your favour when the caller and the LLM both sit in country. The shape of the workload — short, stateless, request-scoped — is also exactly what V8 isolates are good at.

~20 ms · JNB Worker → SA North AOAI p50 · Same building, basically
~0 ms · Worker cold start · V8 isolate, not a container
100 k · Free Worker requests per day · Per account, before metering kicks in
30–50 k · SA North TPM default quota · ~30–50% of East US — plan for throttle

Four production-grade patterns. Build up, don't pre-optimise.

Start with concept 01 — the basic shape works for a prototype. Layer in the rest as the workload demands them. There's no point KV-caching a chat response if your throughput is twelve requests a day.

01

The basic shape — api-key auth and a fetch

Azure OpenAI uses the same REST shape as OpenAI's own API. Different base URL, an api-version query parameter, and an api-key header instead of Authorization: Bearer. That's the entire delta. From a Worker, you don't even need an SDK — fetch() is enough, and the response body is already a ReadableStream ready to hand back.

// src/index.ts — the smallest useful Worker that calls SA-North gpt-4o
type Env = {
  AOAI_ENDPOINT: string;     // https://aoai-imbila-sa.openai.azure.com
  AOAI_DEPLOYMENT: string;   // gpt-4o-za
  AOAI_KEY: string;          // wrangler secret put AOAI_KEY
};

export default {
  async fetch(req: Request, env: Env): Promise<Response> {
    const { messages } = await req.json() as { messages: unknown[] };
    const url = `${env.AOAI_ENDPOINT}/openai/deployments/${env.AOAI_DEPLOYMENT}/chat/completions?api-version=2025-01-01-preview`;

    const aoai = await fetch(url, {
      method: 'POST',
      headers: { 'api-key': env.AOAI_KEY, 'Content-Type': 'application/json' },
      body: JSON.stringify({ messages, max_tokens: 800, temperature: 0.3 }),
    });

    return new Response(aoai.body, { status: aoai.status, headers: aoai.headers });
  }
};
02

Streaming SSE — pass the body straight through

Chat UX needs first-token-fast — the user wants to see the response start within ~300 ms of pressing send, not wait for the whole completion. Set stream: true on the request, and Azure responds with text/event-stream chunks. A Worker doesn't need to parse them. The ReadableStream from fetch() can be returned directly as the response body — Workers does no buffering, so the user's browser sees tokens as fast as they leave Azure.

// Same handler, streaming variant
const aoai = await fetch(url, {
  method: 'POST',
  headers: { 'api-key': env.AOAI_KEY, 'Content-Type': 'application/json' },
  body: JSON.stringify({
    messages,
    stream: true,
    max_tokens: 800,
    stream_options: { include_usage: true },  // final chunk has token counts
  }),
});

// Pass the SSE stream straight through. No JSON parsing, no buffering.
return new Response(aoai.body, {
  headers: {
    'Content-Type': 'text/event-stream',
    'Cache-Control': 'no-cache, no-transform',
    'X-Accel-Buffering': 'no',
  },
});
03

Throttle handling — exponential backoff with retry-after-ms

SA North's default token-per-minute quota is roughly 30–50% of the equivalent quota in East US. Burst traffic that auto-scales fine in US regions hits 429s here. Azure OpenAI returns a retry-after-ms header on throttle responses — honour it, fall back to exponential backoff if the header is missing, and cap retries at three. For idempotent prompts, layer KV caching on top so repeated prompts are served from cache at zero token cost.

async function chatWithRetry(
  env: Env, messages: unknown, attempt = 0
): Promise<Response> {
  const url = `${env.AOAI_ENDPOINT}/openai/deployments/${env.AOAI_DEPLOYMENT}/chat/completions?api-version=2025-01-01-preview`;
  const res = await fetch(url, {
    method: 'POST',
    headers: { 'api-key': env.AOAI_KEY, 'Content-Type': 'application/json' },
    body: JSON.stringify({ messages, max_tokens: 800 }),
  });

  if (res.status === 429 && attempt < 3) {
    const hdr = res.headers.get('retry-after-ms');
    const wait = hdr ? Number(hdr) : 500 * 2 ** attempt;  // 500, 1000, 2000
    await new Promise(r => setTimeout(r, wait));
    return chatWithRetry(env, messages, attempt + 1);
  }
  return res;
}

// Idempotent prompt? Hash + KV. Assumes a CACHE KV binding on Env.
async function sha256(s: string): Promise<string> {
  const digest = await crypto.subtle.digest('SHA-256', new TextEncoder().encode(s));
  return [...new Uint8Array(digest)].map(b => b.toString(16).padStart(2, '0')).join('');
}

const key = `aoai:${await sha256(JSON.stringify(messages))}`;
const cached = await env.CACHE.get(key);
if (cached) return new Response(cached, { headers: { 'Content-Type': 'application/json' } });

const fresh = await chatWithRetry(env, messages);
const body = await fresh.text();
if (fresh.ok) await env.CACHE.put(key, body, { expirationTtl: 3600 });  // never cache a 429/5xx
return new Response(body, { status: fresh.status, headers: { 'Content-Type': 'application/json' } });
04

Auth upgrade — Entra ID OAuth with KV-cached token

API keys are fine for a prototype. For production — especially SARB-strict workloads — the better pattern is OAuth client credentials with a service principal in Entra ID. The Worker fetches an access token from login.microsoftonline.com, caches it in KV until ~5 minutes before expiry, and presents it as Authorization: Bearer <token> on each AOAI call. Rotate the client secret in Key Vault without redeploying the Worker. Workers can't have an Azure managed identity directly — this is the closest equivalent.

// Assumes Env is extended with AZURE_TENANT_ID, AZURE_CLIENT_ID, AZURE_CLIENT_SECRET
// (the secret via wrangler secret put) alongside the CACHE KV binding.
async function getAzureToken(env: Env): Promise<string> {
  const cached = await env.CACHE.get('aoai:token');
  if (cached) return cached;

  const res = await fetch(
    `https://login.microsoftonline.com/${env.AZURE_TENANT_ID}/oauth2/v2.0/token`,
    {
      method: 'POST',
      headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
      body: new URLSearchParams({
        client_id: env.AZURE_CLIENT_ID,
        client_secret: env.AZURE_CLIENT_SECRET,
        scope: 'https://cognitiveservices.azure.com/.default',
        grant_type: 'client_credentials',
      }),
    }
  );
  const { access_token, expires_in } = await res.json() as { access_token: string; expires_in: number };

  // Cache 5 min short of expiry — Entra tokens are 60-min by default
  await env.CACHE.put('aoai:token', access_token, {
    expirationTtl: expires_in - 300,
  });
  return access_token;
}

// Use it on the AOAI call
const token = await getAzureToken(env);
const aoai = await fetch(url, {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${token}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({ messages }),
});

Worker → AOAI vs the alternatives.

A Worker calling Azure OpenAI is one of several callable shapes. The other shapes have their place — knowing where the boundary sits is half the architecture decision.

| Caller pattern | Latency to model | SA-resident | Cost shape | Best for |
| --- | --- | --- | --- | --- |
| Worker → AOAI (this page) | ~20 ms p50 | Yes (model in SA North) | Token-metered + Workers requests | Headless apps, customer-facing chat, RAG APIs |
| Worker → AI Gateway → AOAI | ~25 ms p50 | Yes (gateway + model) | Same + free gateway tier | Same as above plus telemetry, caching, fallback |
| Worker → Workers AI (env.AI) | ~10 ms p50 | Yes (Cloudflare hosted) | Per-neuron, cheaper | Routing/classification with Llama 3.1; not gpt-4o quality |
| Worker → Anthropic / OpenAI direct | ~180 ms p50 | No | Token-metered | Non-regulated workloads needing Claude or o3 |
| Lambda (af-south-1) → AOAI | ~30 ms p50 | Yes (both in SA) | Per-invocation + cold starts | Teams already deep in AWS |
| Azure Function (SA North) → AOAI | ~5 ms p50 | Yes (same region) | Per-execution + plan | Single-vendor Azure shops |

What we actually deploy this for.

These are the workload shapes where the headless Worker pattern wins over Copilot Studio. Customer-facing, programmatic, or volumetric — and almost always in a Cloudflare-first stack.

Headless RAG API

Vectorize-grounded Q&A endpoint

Worker pulls top-k from Vectorize, stuffs context into a gpt-4o prompt, streams the answer back. Sub-100 ms first-token, single binding chain, no Microsoft tenant in the picture at all.
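A minimal sketch of that chain, assuming Env also carries a VECTORIZE binding, a text-embedding-3-large deployment, and chunk text stored in vector metadata — all names are illustrative, not a reference implementation:

// Hypothetical sketch — assumes Env has VECTORIZE: VectorizeIndex; deployment names are illustrative.
async function ragAnswer(question: string, env: Env): Promise<Response> {
  // 1. Embed the question using the concept-01 auth shape.
  const embRes = await fetch(
    `${env.AOAI_ENDPOINT}/openai/deployments/text-embedding-3-large/embeddings?api-version=2025-01-01-preview`,
    {
      method: 'POST',
      headers: { 'api-key': env.AOAI_KEY, 'Content-Type': 'application/json' },
      body: JSON.stringify({ input: question }),
    }
  );
  const { data } = await embRes.json() as { data: { embedding: number[] }[] };

  // 2. Pull top-k chunks from Vectorize.
  const { matches } = await env.VECTORIZE.query(data[0].embedding, { topK: 5, returnMetadata: true });
  const context = matches.map(m => m.metadata?.text ?? '').join('\n---\n');

  // 3. Grounded completion, streamed straight back (concept 02).
  const aoai = await fetch(
    `${env.AOAI_ENDPOINT}/openai/deployments/${env.AOAI_DEPLOYMENT}/chat/completions?api-version=2025-01-01-preview`,
    {
      method: 'POST',
      headers: { 'api-key': env.AOAI_KEY, 'Content-Type': 'application/json' },
      body: JSON.stringify({
        stream: true,
        messages: [
          { role: 'system', content: `Answer only from this context:\n${context}` },
          { role: 'user', content: question },
        ],
      }),
    }
  );
  return new Response(aoai.body, { headers: { 'Content-Type': 'text/event-stream' } });
}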

Customer-facing chat

SaaS without M365 in the mix

End-users aren't Microsoft licensees. Embed a chat widget on a public site, hit a Worker, hit AOAI. Copilot Studio's per-user licensing makes no sense here; raw Worker + token metering does.

Embedding worker

Bulk document indexing

Cron-triggered Worker reads new SharePoint or D1 rows, fans out to text-embedding-3-large in SA North, writes vectors into Vectorize. Cheaper than per-user Copilot indexing for back-office corpora.
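A sketch of the cron half, assuming a D1 binding DB with a docs(id, body, indexed) table and the VECTORIZE binding above — the schema is hypothetical:

// Hypothetical sketch — assumes Env has DB: D1Database and VECTORIZE: VectorizeIndex.
export default {
  async scheduled(_ctrl: ScheduledController, env: Env): Promise<void> {
    const { results } = await env.DB
      .prepare('SELECT id, body FROM docs WHERE indexed = 0 LIMIT 50')
      .all<{ id: string; body: string }>();

    for (const row of results) {
      // Embed each new document in SA North (concept-01 auth shape).
      const res = await fetch(
        `${env.AOAI_ENDPOINT}/openai/deployments/text-embedding-3-large/embeddings?api-version=2025-01-01-preview`,
        {
          method: 'POST',
          headers: { 'api-key': env.AOAI_KEY, 'Content-Type': 'application/json' },
          body: JSON.stringify({ input: row.body }),
        }
      );
      const { data } = await res.json() as { data: { embedding: number[] }[] };

      await env.VECTORIZE.upsert([{ id: row.id, values: data[0].embedding, metadata: { text: row.body } }]);
      await env.DB.prepare('UPDATE docs SET indexed = 1 WHERE id = ?').bind(row.id).run();
    }
  },
};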

Async batch summarisation

Queue-driven document processing

Producer drops jobs into Cloudflare Queues, consumer Worker pulls and calls AOAI with retry. Survives load shedding, throttle bursts, and Azure capacity wobbles without blocking a user-facing request.
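A sketch of the consumer side, reusing the concept-03 chatWithRetry helper; the SummariseJob shape and the docs table are hypothetical:

// Hypothetical sketch — assumes Env has DB: D1Database; the job shape is illustrative.
type SummariseJob = { docId: string; text: string };

export default {
  async queue(batch: MessageBatch<SummariseJob>, env: Env): Promise<void> {
    for (const msg of batch.messages) {
      const res = await chatWithRetry(env, [
        { role: 'system', content: 'Summarise the document in three bullet points.' },
        { role: 'user', content: msg.body.text },
      ]);

      if (res.ok) {
        const json = await res.json() as { choices: { message: { content: string } }[] };
        await env.DB.prepare('UPDATE docs SET summary = ? WHERE id = ?')
          .bind(json.choices[0].message.content, msg.body.docId).run();
        msg.ack();
      } else {
        msg.retry();  // still throttled after three attempts — let Queues redeliver later
      }
    }
  },
};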

Multi-tenant SaaS

Per-tenant BYO Azure OpenAI

Each tenant supplies their own AOAI endpoint and key (so the data is in their Azure subscription, not yours). Worker dispatches per request.headers.get('X-Tenant') — clean POPIA boundary, single deploy.
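A sketch of the dispatch, assuming each tenant's endpoint, deployment, and key sit in KV under tenant:<id> — the key layout is illustrative:

// Hypothetical sketch — the tenant:<id> KV layout is illustrative.
type TenantConfig = { endpoint: string; deployment: string; key: string };

async function tenantFetch(req: Request, env: Env): Promise<Response> {
  const tenant = req.headers.get('X-Tenant');
  if (!tenant) return new Response('missing X-Tenant', { status: 400 });

  const cfg = await env.CACHE.get<TenantConfig>(`tenant:${tenant}`, 'json');
  if (!cfg) return new Response('unknown tenant', { status: 403 });

  const { messages } = await req.json() as { messages: unknown[] };
  // Tenant's own subscription: their endpoint, their key, their data boundary.
  return fetch(
    `${cfg.endpoint}/openai/deployments/${cfg.deployment}/chat/completions?api-version=2025-01-01-preview`,
    {
      method: 'POST',
      headers: { 'api-key': cfg.key, 'Content-Type': 'application/json' },
      body: JSON.stringify({ messages, max_tokens: 800 }),
    }
  );
}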

ERP integration

Sage X3 / ERPNext natural-language layer

User asks "show me overdue POs from supplier X". The Worker calls AOAI to extract structured intent, runs the GraphQL/REST query against the ERP, then calls AOAI again to summarise. The Worker is the orchestration point.
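A sketch of the three hops, again via chatWithRetry; the ERP_URL/ERP_TOKEN Env fields and the intent schema are hypothetical:

// Hypothetical sketch — ERP_URL / ERP_TOKEN Env fields and the intent schema are illustrative.
async function erpAnswer(question: string, env: Env): Promise<string> {
  // Hop 1: extract structured intent as JSON.
  const intentRes = await chatWithRetry(env, [
    { role: 'system', content: 'Extract JSON: {"entity": string, "filter": string}.' },
    { role: 'user', content: question },
  ]);
  const intentJson = await intentRes.json() as { choices: { message: { content: string } }[] };
  const intent = JSON.parse(intentJson.choices[0].message.content) as { entity: string; filter: string };

  // Hop 2: run the query against the ERP's REST layer.
  const erpRes = await fetch(
    `${env.ERP_URL}/api/${intent.entity}?filter=${encodeURIComponent(intent.filter)}`,
    { headers: { Authorization: `Bearer ${env.ERP_TOKEN}` } }
  );
  const rows = await erpRes.json();

  // Hop 3: summarise the rows for the user.
  const summary = await chatWithRetry(env, [
    { role: 'system', content: 'Summarise these ERP rows for a finance user.' },
    { role: 'user', content: JSON.stringify(rows) },
  ]);
  const out = await summary.json() as { choices: { message: { content: string } }[] };
  return out.choices[0].message.content;
}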

From "edge cache rule" to "production LLM caller".

Workers wasn't designed for LLM workloads. It accidentally became one of the best places to host them when streaming, sub-millisecond cold starts, and bindings turned out to matter more than CPU time.

2017
Workers beta (GA March 2018)
JavaScript at the edge via V8 isolates. Initially used as smarter CDN rules — nobody was calling LLMs from the edge yet.
2020
Workers Unbound · longer CPU budgets
CPU time bumped to 30 seconds (later 5 minutes on paid plans). Long-running streaming responses became feasible.
2022
D1 · R2 · Queues join the earlier KV
The state primitives that turn a Worker from "fancy proxy" into "application platform". Token caching, RAG corpus storage, and async LLM jobs all become single-binding code.
2023
Azure OpenAI in SA North · streaming standardised
gpt-3.5 then gpt-4 deploy in southafricanorth. SSE response shape settles across providers — the pass-through pattern in concept 02 becomes idiomatic.
2023
AI Gateway GA
Cloudflare ships an LLM-aware reverse proxy with caching, retry, rate limit, and observability — a free upgrade for the pattern on this page.
2024
gpt-4o in SA North
Multimodal frontier model in country. The Worker → SA-AOAI pattern crosses the threshold from "viable" to "default" for SA stacks.
2025
Workflows GA · Containers beta
Durable LLM pipelines for jobs that span minutes; Containers for the rare workload that needs a real OS. The Worker stays the front door.
2026
Current state
Worker → AOAI in SA North is the production default for new in-country LLM apps in our stack. This page documents how we wire it.

When to call AOAI direct. When to route through something else.

A direct Worker call is the right answer for a specific shape of workload. Reach for Copilot Studio, AI Gateway, or Workflows when the constraints push past what an isolate is designed to do.

✓ Use it when
  • The stack is already on Cloudflare. Pages, Workers, D1, KV are the substrate. Adding an LLM means one more binding, not a second cloud account.
  • The user isn't a Microsoft licensee. Customer-facing app, public website, mobile chat. Copilot Studio's per-user licensing model doesn't fit.
  • You want streaming SSE end-to-end. Worker → AOAI streaming pass-through is one of the cleanest first-token-fast paths going.
  • The prompt loop needs custom orchestration. RAG, tool calling, multi-step reasoning where you want to see and shape every hop. Copilot Studio's orchestration is opinionated; raw Workers is yours.
  • Volume justifies the metering complexity. Above ~50k requests/month, the Worker + AOAI direct model is cheaper than per-user Copilot Studio licensing.

Where this node plugs into the rest of the tree.

A Worker calling AOAI is rarely the whole architecture. It's the central hop in a chain that pulls on Cloudflare bindings, Microsoft regions, and ERP grounding. Here's the dependency footprint.

For agents loading this context

What this node gives you

A working scaffold for any "Cloudflare Worker calls Azure OpenAI in South Africa" prompt. The four concepts in section 03 stack — start with concept 01 to bootstrap, layer in 02 when the UX needs streaming, 03 when the throughput needs throttle handling, 04 when the auth posture needs OAuth instead of an api-key. Skip 04 entirely for prototypes; reach for AI Gateway instead of writing your own retry layer if observability matters.

When the agent-context API ships, this node will also expose a wrangler.toml scaffold, a complete src/index.ts reference implementation with all four concepts wired together, and the Azure CLI sequence for the corresponding AOAI deployment.
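Until then, a minimal wrangler.toml sketch of the bindings the four concepts assume — names match the snippets above, ids are placeholders:

# Minimal sketch — binding names match the snippets above; ids are placeholders.
name = "aoai-sa-worker"
main = "src/index.ts"
compatibility_date = "2025-01-01"

[vars]
AOAI_ENDPOINT = "https://aoai-imbila-sa.openai.azure.com"
AOAI_DEPLOYMENT = "gpt-4o-za"
AZURE_TENANT_ID = "<tenant-guid>"
AZURE_CLIENT_ID = "<app-registration-guid>"

[[kv_namespaces]]
binding = "CACHE"
id = "<kv-namespace-id>"

# Secrets stay out of [vars]:
#   wrangler secret put AOAI_KEY
#   wrangler secret put AZURE_CLIENT_SECRET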

Go deeper.

Microsoft's REST reference for the request shape, Cloudflare's for the runtime, and the SKILL.md companions for the tree depth. Skip the Medium tutorials — most of them got the api-version wrong.

Agent context

Load this node into your agent

Reference src/index.ts with all four concepts wired, wrangler.toml with KV + secret bindings, and the Azure CLI sequence for the matching AOAI deployment. Shipping with the know.2nth.ai Worker API.