A proxy layer between your code and LLM providers. Token metering, response caching, rate limiting, and fallback routing — so you can control costs and reliability without changing your application code.
AI Gateway sits between your Worker (or any HTTP client) and your LLM provider — OpenAI, Anthropic, Workers AI, Azure OpenAI, and others. You change one URL in your code (point to the gateway instead of the provider directly), and you get caching, logging, rate limiting, and fallback routing for free.
The most valuable feature: response caching. Identical prompts hit the cache instead of the provider, saving tokens and latency. For applications with repetitive queries (FAQ bots, template-based generation), this can cut LLM costs by 30-60%.
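With caching enabled on the gateway, you can also hint cache behavior per request. A minimal sketch, assuming the gateway honors a `cf-aig-cache-ttl` request header (verify the exact header name against current Cloudflare docs); `fetchImpl` is injected here only so the call is testable without a network:

```javascript
// Sketch: ask the gateway to cache this response for one hour.
// `cf-aig-cache-ttl` is an assumption — confirm against current docs.
async function cachedCompletion(baseURL, apiKey, prompt, fetchImpl = fetch) {
  return fetchImpl(`${baseURL}/chat/completions`, {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${apiKey}`,
      "Content-Type": "application/json",
      "cf-aig-cache-ttl": "3600", // cache TTL in seconds, per request
    },
    body: JSON.stringify({
      model: "gpt-4o",
      messages: [{ role: "user", content: prompt }],
    }),
  });
}
```

Identical requests within the TTL are then served from cache without spending tokens.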
Fallback routing lets you define a chain: try Workers AI first, fall back to OpenAI if it fails, fall back to Anthropic if both fail. Your application code doesn't change — the gateway handles the routing.
Replace the OpenAI base URL with your AI Gateway URL. Everything else stays the same — same SDK, same auth headers, same request format.
```js
// Before: direct to OpenAI
// const baseURL = "https://api.openai.com/v1";

// After: through AI Gateway
const baseURL = "https://gateway.ai.cloudflare.com/v1/{account}/{gateway}/openai";

// Same request as before — only the base URL changed
const response = await fetch(`${baseURL}/chat/completions`, {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${env.OPENAI_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "gpt-4o",
    messages: [{ role: "user", content: prompt }],
  }),
});
```
Define a fallback chain. The gateway tries each provider in order and returns the first successful response.
```js
// Universal endpoint with fallback
const response = await fetch(
  `https://gateway.ai.cloudflare.com/v1/{account}/{gateway}`,
  {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify([
      {
        provider: "workers-ai",
        endpoint: "@cf/meta/llama-3.1-8b-instruct",
        headers: {},
        query: { messages: [{ role: "user", content: prompt }] },
      },
      {
        provider: "openai",
        endpoint: "chat/completions",
        headers: { authorization: `Bearer ${env.OPENAI_KEY}` },
        query: {
          model: "gpt-4o-mini",
          messages: [{ role: "user", content: prompt }],
        },
      },
    ]),
  }
);
```
Even a single character difference in the request is a cache miss. For dynamic prompts that embed user input, cache hit rates will be low; caching pays off most for template-based prompts and fixed system prompts.
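Since the cache keys on the exact request, normalizing prompts before sending raises hit rates. A minimal sketch, assuming your application can tolerate whitespace and case normalization of interpolated values (`cacheFriendlyPrompt` is a hypothetical helper, not part of the gateway):

```javascript
// Fill a fixed template so equivalent inputs produce byte-identical prompts,
// which the gateway can then serve from cache.
function cacheFriendlyPrompt(template, vars) {
  const filled = template.replace(/\{(\w+)\}/g, (_, key) =>
    String(vars[key] ?? "").trim().toLowerCase()
  );
  // Collapse runs of whitespace — "a  b" and "a b" should hit the same entry.
  return filled.replace(/\s+/g, " ").trim();
}
```

With this, `{topic: "  Workers  "}` and `{topic: "workers"}` produce the same request body and therefore the same cache entry.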
Routing through the gateway adds a small amount of latency (typically <10ms). For latency-critical streaming, measure the impact.
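One way to measure it: time the same request against the provider's direct URL and the gateway URL and compare. A rough sketch (a single-sample timing helper; real measurements should average over many requests):

```javascript
// Time an async operation; use with the same request against both base URLs
// and compare the difference.
async function timeIt(fn) {
  const start = performance.now();
  const result = await fn();
  return { result, ms: performance.now() - start };
}

// Usage sketch:
// const direct  = await timeIt(() => fetch(directURL,  opts));
// const gateway = await timeIt(() => fetch(gatewayURL, opts));
// console.log(`gateway overhead ≈ ${gateway.ms - direct.ms} ms`);
```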
When a response is served from cache, it arrives as a complete response, not a stream. Your frontend code needs to handle both cases.
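One way to handle both cases is to branch on the response's content type: streamed responses arrive as SSE `data:` lines, cached ones as a single JSON document. A sketch assuming an OpenAI-style chat-completions response shape (note it buffers the whole stream for simplicity; a real frontend would render chunks incrementally):

```javascript
// Return the completion text whether the gateway streamed it or served it
// whole from cache.
async function readCompletion(response) {
  const contentType = response.headers.get("content-type") ?? "";

  if (!contentType.includes("text/event-stream")) {
    // Cached (or non-streaming) response: one complete JSON body.
    const body = await response.json();
    return body.choices[0].message.content;
  }

  // Streamed response: SSE "data:" lines carrying content deltas.
  let text = "";
  for (const line of (await response.text()).split("\n")) {
    if (!line.startsWith("data: ")) continue;
    const payload = line.slice("data: ".length);
    if (payload === "[DONE]") continue;
    text += JSON.parse(payload).choices[0].delta?.content ?? "";
  }
  return text;
}
```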