A proxy layer between your code and LLM providers. Token metering, response caching, rate limiting, and fallback routing — so you can control costs and reliability without changing your application code.
AI Gateway sits between your Worker (or any HTTP client) and your LLM provider — OpenAI, Anthropic, Workers AI, Azure OpenAI, and others. You change one URL in your code (point to the gateway instead of the provider directly), and you get caching, logging, rate limiting, and fallback routing for free.
The most valuable feature: response caching. Identical prompts hit the cache instead of the provider, saving tokens and latency. For applications with repetitive queries (FAQ bots, template-based generation), this can cut LLM costs by 30-60%.
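With caching enabled on the gateway, you can also hint cache behavior per request. A minimal sketch, assuming the gateway honors a `cf-aig-cache-ttl` request header (verify the exact header name against current Cloudflare docs); `fetchImpl` is injected here only so the call is testable without a network:

```javascript
// Sketch: ask the gateway to cache this response for one hour.
// `cf-aig-cache-ttl` is an assumption — confirm against current docs.
async function cachedCompletion(baseURL, apiKey, prompt, fetchImpl = fetch) {
  return fetchImpl(`${baseURL}/chat/completions`, {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${apiKey}`,
      "Content-Type": "application/json",
      "cf-aig-cache-ttl": "3600", // cache TTL in seconds, per request
    },
    body: JSON.stringify({
      model: "gpt-4o",
      messages: [{ role: "user", content: prompt }],
    }),
  });
}
```

Identical requests within the TTL are then served from cache without spending tokens.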
Fallback routing lets you define a chain: try Workers AI first, fall back to OpenAI if it fails, fall back to Anthropic if both fail. Your application code doesn't change — the gateway handles the routing.
Replace the OpenAI base URL with your AI Gateway URL. Everything else stays the same — same SDK, same auth headers, same request format.
```js
// Before: direct to OpenAI
// const baseURL = "https://api.openai.com/v1";

// After: through AI Gateway
const baseURL = "https://gateway.ai.cloudflare.com/v1/{account}/{gateway}/openai";

// Same request as before — only the base URL changed
const response = await fetch(`${baseURL}/chat/completions`, {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${env.OPENAI_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "gpt-4o",
    messages: [{ role: "user", content: prompt }],
  }),
});
```
Define a fallback chain. The gateway tries each provider in order and returns the first successful response.
```js
// Universal endpoint with fallback
const response = await fetch(
  `https://gateway.ai.cloudflare.com/v1/{account}/{gateway}`,
  {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify([
      {
        provider: "workers-ai",
        endpoint: "@cf/meta/llama-3.1-8b-instruct",
        headers: {},
        query: { messages: [{ role: "user", content: prompt }] },
      },
      {
        provider: "openai",
        endpoint: "chat/completions",
        headers: { authorization: `Bearer ${env.OPENAI_KEY}` },
        query: {
          model: "gpt-4o-mini",
          messages: [{ role: "user", content: prompt }],
        },
      },
    ]),
  }
);
```
Even a single character difference in the request is a cache miss. For dynamic prompts that embed user input, cache hit rates will be low; caching pays off most for template-based prompts and fixed system prompts.
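Since the cache keys on the exact request, normalizing prompts before sending raises hit rates. A minimal sketch, assuming your application can tolerate whitespace and case normalization of interpolated values (`cacheFriendlyPrompt` is a hypothetical helper, not part of the gateway):

```javascript
// Fill a fixed template so equivalent inputs produce byte-identical prompts,
// which the gateway can then serve from cache.
function cacheFriendlyPrompt(template, vars) {
  const filled = template.replace(/\{(\w+)\}/g, (_, key) =>
    String(vars[key] ?? "").trim().toLowerCase()
  );
  // Collapse runs of whitespace — "a  b" and "a b" should hit the same entry.
  return filled.replace(/\s+/g, " ").trim();
}
```

With this, `{topic: "  Workers  "}` and `{topic: "workers"}` produce the same request body and therefore the same cache entry.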
Routing through the gateway adds a small amount of latency (typically <10ms). For latency-critical streaming, measure the impact.
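One way to measure it: time the same request against the provider's direct URL and the gateway URL and compare. A rough sketch (a single-sample timing helper; real measurements should average over many requests):

```javascript
// Time an async operation; use with the same request against both base URLs
// and compare the difference.
async function timeIt(fn) {
  const start = performance.now();
  const result = await fn();
  return { result, ms: performance.now() - start };
}

// Usage sketch:
// const direct  = await timeIt(() => fetch(directURL,  opts));
// const gateway = await timeIt(() => fetch(gatewayURL, opts));
// console.log(`gateway overhead ≈ ${gateway.ms - direct.ms} ms`);
```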
When a response is served from cache, it arrives as a complete response, not a stream. Your frontend code needs to handle both cases.
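One way to handle both cases is to branch on the response's content type: streamed responses arrive as SSE `data:` lines, cached ones as a single JSON document. A sketch assuming an OpenAI-style chat-completions response shape (note it buffers the whole stream for simplicity; a real frontend would render chunks incrementally):

```javascript
// Return the completion text whether the gateway streamed it or served it
// whole from cache.
async function readCompletion(response) {
  const contentType = response.headers.get("content-type") ?? "";

  if (!contentType.includes("text/event-stream")) {
    // Cached (or non-streaming) response: one complete JSON body.
    const body = await response.json();
    return body.choices[0].message.content;
  }

  // Streamed response: SSE "data:" lines carrying content deltas.
  let text = "";
  for (const line of (await response.text()).split("\n")) {
    if (!line.startsWith("data: ")) continue;
    const payload = line.slice("data: ".length);
    if (payload === "[DONE]") continue;
    text += JSON.parse(payload).choices[0].delta?.content ?? "";
  }
  return text;
}
```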