The headless pattern under the SA in-country LLM stack — call gpt-4o in southafricanorth from a JNB-colocated Worker with sub-millisecond cold starts, full prompt-loop control, and the throttle-handling discipline the region demands. When the Copilot UX is overhead and you'd rather own the request shape end-to-end.
fetch(), the way Cloudflare engineers prefer.

The Copilot in South Africa node lays out three patterns for an in-country LLM stack on Microsoft. This page is the one Cloudflare-shop engineers actually deploy — a Worker calls Azure OpenAI directly, with no Copilot UX layer, no Power Platform environment, and no Dataverse. Just a JNB-colocated isolate making an HTTPS POST to aoai-imbila-sa.openai.azure.com.
The benefit isn't theoretical. From the JNB Cloudflare colo to the SA North Azure region the round trip is roughly 20 ms p50 — the same latency budget you'd hit talking to your own database. Cold starts are zero because Workers runs in V8 isolates. The whole prompt loop — auth, request shape, streaming, retries, caching — is yours to control, with no managed-orchestration layer making decisions silently on your behalf.
The trade-off is also honest: you write the production-readiness layer yourself. Copilot Studio handles throttling, fallback, conversation state, and Bing-grounding-disabled-by-policy out of the box. A raw Worker handles none of that until you write it. The rest of this page is what "writing it" looks like.
Latency, cost, and quota all bend in your favour when the caller and the LLM both sit in country. The shape of the workload — short, stateless, request-scoped — is also exactly what V8 isolates are good at.
Start with concept 01 — the basic shape works for a prototype. Layer in the rest as the workload demands. There's no point KV-caching a chat response if your throughput is twelve requests a day.
api-key auth and a fetch

Azure OpenAI uses the same REST shape as OpenAI's own API: a different base URL, an api-version query parameter, and an api-key header instead of Authorization: Bearer. That's the entire delta. From a Worker you don't even need an SDK — fetch() is enough, and the response body is already a ReadableStream ready to hand back.
```ts
// src/index.ts — the smallest useful Worker that calls SA-North gpt-4o
type Env = {
  AOAI_ENDPOINT: string;   // https://aoai-imbila-sa.openai.azure.com
  AOAI_DEPLOYMENT: string; // gpt-4o-za
  AOAI_KEY: string;        // wrangler secret put AOAI_KEY
};

export default {
  async fetch(req: Request, env: Env): Promise<Response> {
    const { messages } = await req.json();
    const url = `${env.AOAI_ENDPOINT}/openai/deployments/${env.AOAI_DEPLOYMENT}/chat/completions?api-version=2025-01-01-preview`;
    const aoai = await fetch(url, {
      method: 'POST',
      headers: { 'api-key': env.AOAI_KEY, 'Content-Type': 'application/json' },
      body: JSON.stringify({ messages, max_tokens: 800, temperature: 0.3 }),
    });
    return new Response(aoai.body, { status: aoai.status, headers: aoai.headers });
  },
};
```
Chat UX needs first-token-fast — the user wants to see the response start within ~300 ms of pressing send, not wait for the whole completion. Set stream: true on the request, and Azure responds with text/event-stream chunks. A Worker doesn't need to parse them. The ReadableStream from fetch() can be returned directly as the response body — Workers does no buffering, so the user's browser sees tokens as fast as they leave Azure.
```ts
// Same handler, streaming variant
const aoai = await fetch(url, {
  method: 'POST',
  headers: { 'api-key': env.AOAI_KEY, 'Content-Type': 'application/json' },
  body: JSON.stringify({
    messages,
    stream: true,
    max_tokens: 800,
    stream_options: { include_usage: true }, // final chunk has token counts
  }),
});

// Pass the SSE stream straight through. No JSON parsing, no buffering.
return new Response(aoai.body, {
  headers: {
    'Content-Type': 'text/event-stream',
    'Cache-Control': 'no-cache, no-transform',
    'X-Accel-Buffering': 'no',
  },
});
```
retry-after-ms

SA North's default token-per-minute quota is roughly 30–50% of the equivalent quota in East US. Burst traffic that auto-scales fine in US regions hits 429s here. Azure OpenAI returns a retry-after-ms header on throttle responses — honour it, fall back to exponential backoff if the header is missing, and cap retries at three. For idempotent prompts, layer KV caching on top so repeat requests are served at zero cost.
```ts
async function chatWithRetry(
  env: Env,
  messages: unknown,
  attempt = 0
): Promise<Response> {
  const url = `${env.AOAI_ENDPOINT}/openai/deployments/${env.AOAI_DEPLOYMENT}/chat/completions?api-version=2025-01-01-preview`;
  const res = await fetch(url, {
    method: 'POST',
    headers: { 'api-key': env.AOAI_KEY, 'Content-Type': 'application/json' },
    body: JSON.stringify({ messages, max_tokens: 800 }),
  });
  if (res.status === 429 && attempt < 3) {
    const hdr = res.headers.get('retry-after-ms');
    const wait = hdr ? Number(hdr) : 500 * 2 ** attempt; // 500, 1000, 2000
    await new Promise(r => setTimeout(r, wait));
    return chatWithRetry(env, messages, attempt + 1);
  }
  return res;
}

// Hex SHA-256 via Web Crypto — used for the cache key below.
async function sha256(text: string): Promise<string> {
  const digest = await crypto.subtle.digest('SHA-256', new TextEncoder().encode(text));
  return [...new Uint8Array(digest)].map(b => b.toString(16).padStart(2, '0')).join('');
}

// Idempotent prompt? Hash + KV.
const key = `aoai:${await sha256(JSON.stringify(messages))}`;
const cached = await env.CACHE.get(key);
if (cached) return new Response(cached, { headers: { 'Content-Type': 'application/json' } });

const fresh = await chatWithRetry(env, messages);
const body = await fresh.text();
await env.CACHE.put(key, body, { expirationTtl: 3600 });
return new Response(body);
```
API keys are fine for a prototype. For production — especially SARB-strict workloads — the better pattern is OAuth client credentials with a service principal in Entra ID. The Worker fetches an access token from login.microsoftonline.com, caches it in KV until ~5 minutes before expiry, and presents it as Authorization: Bearer <token> on each AOAI call. Rotate the client secret in Key Vault without redeploying the Worker. Workers can't have an Azure managed identity directly — this is the closest equivalent.
```ts
async function getAzureToken(env: Env): Promise<string> {
  const cached = await env.CACHE.get('aoai:token');
  if (cached) return cached;

  const res = await fetch(
    `https://login.microsoftonline.com/${env.AZURE_TENANT_ID}/oauth2/v2.0/token`,
    {
      method: 'POST',
      headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
      body: new URLSearchParams({
        client_id: env.AZURE_CLIENT_ID,
        client_secret: env.AZURE_CLIENT_SECRET,
        scope: 'https://cognitiveservices.azure.com/.default',
        grant_type: 'client_credentials',
      }),
    }
  );
  const { access_token, expires_in } = await res.json() as any;

  // Cache 5 min short of expiry — Entra tokens are 60-min by default
  await env.CACHE.put('aoai:token', access_token, {
    expirationTtl: expires_in - 300,
  });
  return access_token;
}

// Use it on the AOAI call
const token = await getAzureToken(env);
const aoai = await fetch(url, {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${token}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({ messages }),
});
```
A Worker calling Azure OpenAI is one of several callable shapes. The other shapes have their place — knowing where the boundary sits is half the architecture decision.
| Caller pattern | Latency to model | SA-resident | Cost shape | Best for |
|---|---|---|---|---|
| Worker → AOAI (this page) | ~20 ms p50 | Yes (model in SA North) | Token-metered + Workers requests | Headless apps, customer-facing chat, RAG APIs |
| Worker → AI Gateway → AOAI | ~25 ms p50 | Yes (gateway + model) | Same + free gateway tier | Same as above plus telemetry, caching, fallback |
| Worker → Workers AI (env.AI) | ~10 ms p50 | Yes (Cloudflare hosted) | Per-neuron, cheaper | Routing/classification with Llama 3.1; not gpt-4o quality |
| Worker → Anthropic / OpenAI direct | ~180 ms p50 | No | Token-metered | Non-regulated workloads needing Claude or o3 |
| Lambda (af-south-1) → AOAI | ~30 ms p50 | Yes (both in SA) | Per-invocation + cold starts | Teams already deep in AWS |
| Azure Function (SA North) → AOAI | ~5 ms p50 | Yes (same region) | Per-execution + plan | Single-vendor Azure shops |
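If the AI Gateway row in the table becomes attractive later, the migration from the direct pattern is a base-URL swap — the api-key header, body, and streamed response all stay the same. A minimal sketch, assuming a gateway named `sa-llm` (hypothetical) and AI Gateway's azure-openai provider path shape:

```typescript
// Build the AI Gateway proxy URL for an Azure OpenAI deployment.
// Assumed path shape (AI Gateway azure-openai provider):
//   /v1/{account_id}/{gateway}/azure-openai/{resource}/{deployment}/...
function buildGatewayUrl(
  accountId: string,
  gateway: string,
  resource: string,   // AOAI resource name, e.g. "aoai-imbila-sa" — not the full host
  deployment: string, // e.g. "gpt-4o-za"
  apiVersion = '2025-01-01-preview'
): string {
  return `https://gateway.ai.cloudflare.com/v1/${accountId}/${gateway}` +
    `/azure-openai/${resource}/${deployment}/chat/completions?api-version=${apiVersion}`;
}
```

Everything else in concept 01 is unchanged; the gateway adds telemetry, caching, and fallback without touching the request shape.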
These are the workload shapes where the headless Worker pattern wins over Copilot Studio. Customer-facing, programmatic, or volumetric — and almost always in a Cloudflare-first stack.
Worker pulls top-k from Vectorize, stuffs context into a gpt-4o prompt, streams the answer back. Sub-100 ms first-token, single binding chain, no Microsoft tenant in the picture at all.
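A sketch of that hop, under stated assumptions: a Vectorize binding named `VECTORIZE`, chunks stored with a `text` metadata field, and a query vector already computed — none of which the page above pins down. The prompt-assembly step is pure and shown in full; the binding calls are sketched in comments.

```typescript
// Hypothetical metadata shape: each stored vector carries its source text.
type Match = { metadata?: { text?: string } };

// Pure helper: fold top-k chunks into a grounded chat payload.
function buildGroundedMessages(question: string, chunks: string[]) {
  return [
    {
      role: 'system',
      content: `Answer only from the context below.\n\n${chunks
        .map((c, i) => `[${i + 1}] ${c}`)
        .join('\n')}`,
    },
    { role: 'user', content: question },
  ];
}

// Inside the Worker handler (bindings assumed):
//   const { matches } = await env.VECTORIZE.query(queryVector, { topK: 5, returnMetadata: true });
//   const chunks = matches.map((m: Match) => m.metadata?.text ?? '').filter(Boolean);
//   const messages = buildGroundedMessages(question, chunks);
//   // ...then the gpt-4o call from concept 01/02, streamed back.
```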
End-users aren't Microsoft licensees. Embed a chat widget on a public site, hit a Worker, hit AOAI. Copilot Studio's per-user licensing makes no sense here; raw Worker + token metering does.
Cron-triggered Worker reads new SharePoint or D1 rows, fans out to text-embedding-3-large in SA North, writes vectors into Vectorize. Cheaper than per-user Copilot indexing for back-office corpora.
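The embeddings endpoint accepts an array of inputs per call, so the cron Worker should batch rows before fanning out. A sketch of the batching step — the batch size of 16 is an arbitrary choice for illustration, not an Azure limit, and `readNewRows`/`embed` are assumed helpers:

```typescript
// Split rows into fixed-size batches for the embeddings endpoint's `input` array.
function toBatches<T>(rows: T[], size = 16): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < rows.length; i += size) out.push(rows.slice(i, i + size));
  return out;
}

// Cron entry point — same endpoint shape as concept 01, with /embeddings
// in place of /chat/completions:
// export default {
//   async scheduled(_ev: ScheduledEvent, env: Env) {
//     const rows = await readNewRows(env);        // D1 or SharePoint delta (assumed helper)
//     for (const batch of toBatches(rows)) {
//       const vectors = await embed(env, batch);  // POST { input: batch } to text-embedding-3-large
//       await env.VECTORIZE.upsert(vectors);
//     }
//   },
// };
```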
Producer drops jobs into Cloudflare Queues, consumer Worker pulls and calls AOAI with retry. Survives load shedding, throttle bursts, and Azure capacity wobbles without blocking a user-facing request.
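A sketch of the consumer side, assuming a job payload of `{ messages }` and reusing `chatWithRetry` from concept 03. The delay calculation is pure and shown in full; the 300-second ceiling is an arbitrary safety cap, not a Queues limit.

```typescript
// Map Azure's retry-after-ms header (or the attempt count) to Queues' delaySeconds.
function retryDelaySeconds(retryAfterMs: string | null, attempts: number): number {
  const ms = retryAfterMs ? Number(retryAfterMs) : 1000 * 2 ** attempts;
  return Math.min(Math.ceil(ms / 1000), 300); // cap the wait at 5 min (arbitrary ceiling)
}

// Consumer sketch — on a throttle, re-queue with the delay Azure asked for.
// export default {
//   async queue(batch: MessageBatch<{ messages: unknown }>, env: Env) {
//     for (const msg of batch.messages) {
//       const res = await chatWithRetry(env, msg.body.messages);
//       if (res.status === 429) {
//         msg.retry({ delaySeconds: retryDelaySeconds(res.headers.get('retry-after-ms'), msg.attempts) });
//       } else {
//         msg.ack();
//       }
//     }
//   },
// };
```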
Each tenant supplies their own AOAI endpoint and key (so the data is in their Azure subscription, not yours). Worker dispatches per request.headers.get('X-Tenant') — clean POPIA boundary, single deploy.
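A sketch of the dispatch step, assuming tenant configs live in a KV namespace under hypothetical `tenant:<id>` keys. The lookup is injected as a function so the routing logic stays pure; in the Worker it would be `id => env.TENANTS.get(`tenant:${id}`)`.

```typescript
type TenantConfig = { endpoint: string; deployment: string; key: string };

// Resolve the tenant from the request header; reject unknown tenants before any AOAI call.
async function resolveTenant(
  getConfig: (id: string) => Promise<string | null>,
  req: { headers: { get(name: string): string | null } }
): Promise<TenantConfig | null> {
  const id = req.headers.get('X-Tenant');
  if (!id || !/^[a-z0-9-]+$/.test(id)) return null; // never interpolate a raw header into a KV key
  const raw = await getConfig(id);
  return raw ? (JSON.parse(raw) as TenantConfig) : null;
}
```

From there the handler builds the concept-01 URL from the tenant's own endpoint and deployment, so every token stays inside that tenant's Azure subscription.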
User asks "show me overdue POs from supplier X". Worker calls AOAI to extract structured intent, then runs the GraphQL/REST against the ERP, then calls AOAI again to summarise. The Worker is the orchestration point.
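The fragile step in that chain is the first hop — the "structured intent" comes back from the model as text and must be validated before it touches the ERP. A defensive parse, with a hypothetical `POQuery` shape standing in for whatever the ERP query actually needs:

```typescript
// Hypothetical intent shape for the purchase-order example.
type POQuery = { supplier: string; status: 'overdue' | 'open' | 'closed' };

// Validate the model's JSON before building an ERP query — never trust the shape blindly.
function parseIntent(raw: string): POQuery | null {
  try {
    const obj = JSON.parse(raw);
    if (typeof obj.supplier !== 'string') return null;
    if (!['overdue', 'open', 'closed'].includes(obj.status)) return null;
    return { supplier: obj.supplier, status: obj.status };
  } catch {
    return null; // model returned prose or malformed JSON — fall back or re-prompt
  }
}
```

On a `null` result the Worker can re-prompt with a stricter instruction rather than passing garbage downstream.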
Workers wasn't designed for LLM workloads. It accidentally became one of the best places to host them when streaming, sub-millisecond cold starts, and bindings turned out to matter more than CPU time.
southafricanorth. SSE response shape settles across providers — the pass-through pattern in concept 02 becomes idiomatic.

A direct Worker call is the right answer for a specific shape of workload. Reach for Copilot Studio, AI Gateway, or Workflows when the constraints push past what an isolate is designed to do.
A Worker calling AOAI is rarely the whole architecture. It's the central hop in a chain that pulls on Cloudflare bindings, Microsoft regions, and ERP grounding. Here's the dependency footprint.
text-embedding-3-large, query top-k from the same Worker that calls gpt-4o.

A working scaffold for any "Cloudflare Worker calls Azure OpenAI in South Africa" prompt. The four concepts in section 03 stack — start with concept 01 to bootstrap, layer in 02 when the UX needs streaming, 03 when the throughput needs throttle handling, and 04 when the auth posture needs OAuth instead of api-key. Skip 04 entirely for prototypes; reach for AI Gateway instead of writing your own retry layer if observability matters.
When the agent-context API ships, this node will also expose a wrangler.toml scaffold, a complete src/index.ts reference implementation with all four concepts wired together, and the Azure CLI sequence for the corresponding AOAI deployment.
Microsoft's REST reference for the request shape, Cloudflare's for the runtime, and the SKILL.md companions for the tree depth. Skip the Medium tutorials — most of them got the api-version wrong.
Reference src/index.ts with all four concepts wired, wrangler.toml with KV + secret bindings, and the Azure CLI sequence for the matching AOAI deployment. Shipping with the know.2nth.ai Worker API.