Technology · Cloudflare · Skill Node

AI Gateway.

A proxy layer between your code and LLM providers. Token metering, response caching, rate limiting, and fallback routing — so you can control costs and reliability without changing your application code.

Technology · AI Proxy · Open Tier · Last updated · Apr 2026

Observability and control for your LLM spend.

AI Gateway sits between your Worker (or any HTTP client) and your LLM provider — OpenAI, Anthropic, Workers AI, Azure OpenAI, and others. You change one URL in your code (point to the gateway instead of the provider directly), and you get caching, logging, rate limiting, and fallback routing for free.

The most valuable feature: response caching. Identical prompts hit the cache instead of the provider, saving tokens and latency. For applications with repetitive queries (FAQ bots, template-based generation), this can cut LLM costs by 30-60%.
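Cache behavior can be tuned per request with gateway request headers. A minimal sketch, assuming the `cf-aig-cache-ttl` and `cf-aig-skip-cache` header names from Cloudflare's AI Gateway header set (verify against current docs); `gatewayHeaders` is a hypothetical helper:

```javascript
// Sketch: build per-request headers that control AI Gateway caching.
// cf-aig-cache-ttl / cf-aig-skip-cache are gateway-level headers;
// the provider never sees them.
function gatewayHeaders({ apiKey, cacheTtlSeconds = 3600, skipCache = false }) {
  const headers = {
    "Authorization": `Bearer ${apiKey}`,
    "Content-Type": "application/json",
    // Serve identical requests from cache for this many seconds.
    "cf-aig-cache-ttl": String(cacheTtlSeconds),
  };
  if (skipCache) {
    // Force a fresh provider call (e.g. for freshness-sensitive prompts).
    headers["cf-aig-skip-cache"] = "true";
  }
  return headers;
}
```

Pass the result as the `headers` option of the same `fetch` call shown below; nothing else changes.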

Fallback routing lets you define a chain: try Workers AI first, fall back to OpenAI if it fails, fall back to Anthropic if both fail. Your application code doesn't change — the gateway handles the routing.

One URL change, multiple superpowers.

01

Proxying OpenAI through the gateway

Replace the OpenAI base URL with your AI Gateway URL. Everything else stays the same — same SDK, same auth headers, same request format.

// Before: direct to OpenAI
const baseURL = "https://api.openai.com/v1";

// After: through AI Gateway
const baseURL = "https://gateway.ai.cloudflare.com/v1/{account}/{gateway}/openai";

// Same OpenAI SDK call
const response = await fetch(`${baseURL}/chat/completions`, {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${env.OPENAI_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "gpt-4o",
    messages: [{ role: "user", content: prompt }],
  }),
});

02

Fallback routing

Define a fallback chain. The gateway tries each provider in order and returns the first successful response.

// Universal endpoint with fallback
const response = await fetch(
  `https://gateway.ai.cloudflare.com/v1/{account}/{gateway}`,
  {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify([
      {
        provider: "workers-ai",
        endpoint: "@cf/meta/llama-3.1-8b-instruct",
        // Workers AI still needs a Cloudflare API token here;
        // env.CF_API_TOKEN is an assumed binding name.
        headers: { authorization: `Bearer ${env.CF_API_TOKEN}` },
        query: { messages: [{ role: "user", content: prompt }] },
      },
      {
        provider: "openai",
        endpoint: "chat/completions",
        headers: { authorization: `Bearer ${env.OPENAI_KEY}` },
        query: { model: "gpt-4o-mini", messages: [{ role: "user", content: prompt }] },
      },
    ]),
  }
);

Where it bites you.

Cache key

Caching is by exact prompt match

Even a single character difference is a cache miss. For dynamic prompts with user input, caching hit rates will be low. Best for template-based or system prompts.
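Because the cache keys on the exact request body, normalizing user input before it enters the prompt makes semantically identical requests byte-identical. A sketch; `normalizePrompt` and `buildCacheFriendlyBody` are hypothetical helpers, not part of AI Gateway:

```javascript
// Sketch: stabilize the cache key by normalizing user text before it is
// embedded in a fixed prompt template.
function normalizePrompt(text) {
  return text
    .trim()
    .replace(/\s+/g, " "); // collapse runs of whitespace into one space
}

function buildCacheFriendlyBody(userText) {
  // Fixed template + normalized input → identical bodies → cache hits.
  return JSON.stringify({
    model: "gpt-4o-mini",
    messages: [
      { role: "system", content: "Answer the FAQ question concisely." },
      { role: "user", content: normalizePrompt(userText) },
    ],
  });
}
```

Whether more aggressive normalization (lowercasing, punctuation stripping) is safe depends on your domain; anything that changes meaning will change answers, not just cache keys.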

Added latency

Gateway adds a hop

Routing through the gateway adds a small amount of latency (typically <10ms). For latency-critical streaming, measure the impact.
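One way to measure that impact is to time the same request direct versus via the gateway. A minimal sketch; `timed` is a hypothetical wrapper around any async call:

```javascript
// Sketch: time an async operation, e.g. the same completion request sent
// to the provider directly vs. through the gateway URL.
async function timed(fn) {
  const start = performance.now();
  const result = await fn();
  return { result, ms: performance.now() - start };
}

// Usage idea (URLs as in the examples above):
//   const direct  = await timed(() => fetch(directURL, opts));
//   const gateway = await timed(() => fetch(gatewayURL, opts));
//   console.log(gateway.ms - direct.ms); // added latency for this request
```

Run it enough times to average out provider-side variance, which usually dwarfs the gateway hop.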

Streaming + caching

Cached responses don't stream

When a response is served from cache, it arrives as a complete response, not a stream. Your frontend code needs to handle both cases.
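One way to handle both cases at a single call site is to branch on the response content type. A sketch assuming OpenAI-style response shapes; `readCompletion` is a hypothetical helper:

```javascript
// Sketch: consume either a streamed (SSE) or a cached (complete JSON)
// response from the same fetch call.
async function readCompletion(response) {
  const contentType = response.headers.get("content-type") || "";
  if (contentType.includes("text/event-stream")) {
    // Streaming path: accumulate chunks as they arrive.
    let text = "";
    const reader = response.body.getReader();
    const decoder = new TextDecoder();
    for (;;) {
      const { done, value } = await reader.read();
      if (done) break;
      text += decoder.decode(value, { stream: true });
    }
    return { streamed: true, text };
  }
  // Cached path: the gateway returns one complete JSON body.
  const json = await response.json();
  return { streamed: false, text: json.choices[0].message.content };
}
```

In a real UI you would render streamed chunks incrementally instead of accumulating them, but the branch point is the same.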

When it fits. When it doesn't.

✓ Use it when
  • You need LLM cost visibility. Token metering per gateway, per provider, per model — without building your own tracking.
  • Repetitive prompts are common. FAQ bots, template-based generation, cached summaries — caching pays for itself immediately.
  • You want provider resilience. Fallback routing means one provider's outage doesn't take down your application.
