Announcing Envoy AI Gateway 1.0: A Stable Foundation for Enterprise AI Traffic

Learn more

Your AI Bill Is an AI Gateway Problem: How Enterprises Cut Spend Without Cutting Usage

An AI gateway — also called an LLM gateway — is where teams control AI cost, routing, caching, and governance. Why your bill is a gateway problem, why per-agent caps aren't enough, and why showback and chargeback are the real ROI lever.

Your AI Bill Is an AI Gateway Problem: How Enterprises Cut Spend Without Cutting Usage

By David Wang, Head of Product, Tetrate

An AI gateway — also called an LLM gateway — is where AI cost, reliability, and governance get controlled.

Teams are halving their AI spend while their token usage keeps climbing. They do it with a gateway: one layer in front of every model that applies routing, caching, and cost control before a request reaches a provider. Spend and usage come apart. Usage rises with adoption while the bill flattens, because each token is cheaper and goes to a model that fits the task.

Most teams reach for the cap first, but caps aren’t sustainable, because the spend isn’t coming from the people near their caps. At one company that published its numbers, more than 90% of employees never hit a usage limit at all. Capping them adds alerts and friction without moving the bill. The savings that last come from systemic change at the gateway: how calls are routed, whether they need to be sent at all, and whether teams can see what they spend.

The short version

  • An AI gateway (or LLM gateway) is the one place AI cost, reliability, and governance get controlled. In 2026 that makes it core infrastructure, not an add-on.
  • The savings that last come from systemic change at the gateway — routing and caching — not from caps.
  • Per-agent and per-harness controls help, but can’t be the only lever: they push work onto every team, miss the real cost, and are easy to bypass.
  • The lever that compounds is financial accountability — showback, then chargeback — run on attribution only a shared gateway produces. It’s the cloud FinOps playbook again.
  • A managed gateway is low-lift to adopt and useful on day one. Tetrate Agent Router Enterprise runs as one, and also governs the tools and actions agents add.

What is an AI gateway? (And is it the same as an LLM gateway?)

An AI gateway is a proxy between your applications and the model providers you call. It exposes one endpoint and one format — usually OpenAI-compatible — and translates each request to whichever model you’ve configured. Because every request passes through it, the gateway is the natural home for the concerns that otherwise leak into application code: routing, caching, retries and failover, PII redaction, logging, and budget enforcement.

You’ll also see this called an LLM gateway. The two terms describe the same category. “LLM gateway” is the older name, from when the traffic was mostly text completions; “AI gateway” is the broader term winning now, because the layer governs agents, tools, and multimodal traffic too. Treat them as synonyms throughout this post. We use “AI gateway” because that’s where the category is headed.

The average enterprise now calls several providers, and the leaderboard turns over every few months as OpenAI, Anthropic, Google, and the open-weight labs ship new versions. A request that wants a frontier model for planning can take a cheaper open-weight model for execution. That logic doesn’t belong in fifty services. It belongs in the one gateway they already route through.

The AI gateway has made the same trip the API gateway made a decade ago: from a convenience that shaved a few milliseconds to the load-bearing layer where the hard cross-cutting problems live. What’s different now is the traffic itself: these are agents, calling tools, taking actions, and spending money.

Without an AI gateway, every team builds its own

A team wants an LLM, so it wires up its own keys, its own routing, its own logging. The next team does the same, differently. A quarter later you have a dozen integrations, a dozen billing relationships, and a dozen logging conventions, and nobody can see the whole picture. Security can see it least of all.

The costs of that sprawl are specific:

  • Unattributed spend. Token cost lands on a shared key with no per-team, per-agent, or per-project breakdown. You can’t flag the team burning budget or find the one that isn’t adopting.
  • Uncontained outages. With no failover across providers, one provider blip takes down agents on unrelated teams at the same time.
  • Ungoverned actions. Logs are scattered across provider dashboards. Reconstructing why a multi-agent chain did something takes days, and there’s no consistent place to stop a risky tool call before it runs.

Whatever you fix at the gateway, you fix once, for every team and tool.

Why agent- and harness-level cost controls aren’t enough

The sharpest teams don’t stop at defaults. They do real cost work inside the agent and the coding harness: preprocessing prompts, routing each step to a model that fits, keeping context lean, setting per-agent budgets. It pays off, and you should do it. It just can’t be the only way you manage cost.

The first problem is that it pushes the work onto every team. Tuning limits and defaults in each harness means every team and developer has to set them, maintain them, and keep them in sync. It becomes one more setting to standardize on, and standardization-by-request doesn’t hold: configs drift, teams diverge, and you’re back to the sprawl you were escaping. Define a limit once at the gateway and it holds everywhere; define it per app and you police drift forever.

The second is that per-user caps don’t track cost. A request-count limit means little when one long-context call can cost as much as fifty ordinary ones, and every user can sit under their own cap while the org bill still runs past budget. OWASP lists “unbounded consumption” — denial of wallet — as its own risk class for LLM apps for this reason.

The third is that controls inside the app are opt-in and bypassable. A direct call to the provider skips every limit set in the harness, and the more places AI shows up — IDEs, CI, internal tools, agents — the more spend lives outside whatever you instrumented. That is how shadow AI grows: a Cloud Security Alliance survey found 82% of organizations turned up an AI agent or workflow in the past year that security and IT hadn’t sanctioned.

Harness-level tuning helps, but the controls that hold — defaults, routing, caching, budgets, attribution — belong one layer down, at the gateway, where they apply by default and don’t wait on anyone to configure them.

How an AI gateway cuts AI costs: routing and caching

Two levers move the bill, and both live at the gateway: routing and caching. They also interact in ways that aren’t obvious until you’ve run them in production.

Cheaper defaults. Defaults decide most of the spend. Set a capable, cheaper model as the default and reserve frontier models for the work that needs them, and cost drops without anyone losing access to anything.

Prompt caching. A cache hit needs the prompt prefix to match exactly, so the move is to build a long, stable prefix and hold it across turns. Then each request pays full rate only on its new tokens and reads the rest from cache. On conversational and agent workloads, hit rate is the single biggest driver of cost, and the gap between a naive setup and a tuned one is wide.

Cache-aware routing. This is the part most routing logic gets wrong. The naive approach scores each turn on its own and sends it to whichever model fits — sensible in isolation, expensive in practice, because the cache is per-model and switching mid-conversation throws away the warm cache you were riding. A good gateway weighs cache state against task difficulty: a conversation keeps its model while the cache is warm, and the chance to re-route returns only when it goes quiet long enough for the cache to expire. Route for the model alone and you save on one turn while paying for it on the next ten.

Semantic caching. For repetitive, FAQ-shaped traffic, the gateway can recognize that a new prompt is meaningfully the same as one it already answered and return the stored response without calling a model at all. The cheapest token is the one you never send. It carries quality risk, so you gate it to workloads where similar inputs have similar correct answers, but where it fits it removes whole classes of calls from the bill.

All of these are gateway capabilities; they hold no matter which models sit behind them.

This is what a healthy cost curve looks like: usage climbing with adoption, spend flattening, because the tokens are cheaper and sent to the model that fits.

Controlling AI spend: showback, then chargeback

Routing and caching cut what each call costs. What changes how much gets spent in the first place is accountability — showback, and eventually chargeback. Visibility alone won’t do it. A cost report a team can skim and ignore changes nothing.

This is the cloud playbook, replayed for AI. When cloud spend grew into a monthly invoice nobody owned, finance saw one number, engineering saw dashboards, and the gap between who used the cloud and who paid for it became real budget risk. The discipline that closed it, FinOps, came down to one move: attribute each cost to the team that created it, show that team its share, and once the numbers are trusted, put the cost on its budget. The FinOps Foundation’s surveys still put wasted cloud spend at around a quarter of the total, which is why cost allocation stays near the top of the practice’s priorities.

Showback and chargeback are not the same step. Showback reports a team’s spend while the bill stays central. Chargeback moves that spend onto the team’s own budget. Showback moves information; chargeback moves money. When teams watch cost land on their own budget, wasteful defaults get fixed and idle jobs get shut off; a report rarely does that on its own.

The cloud experience also teaches the order. Start with showback, let teams trust the numbers, and move to chargeback once attribution is accurate enough to defend — usually once you can map north of 90% of spend to an owner. Push chargeback onto shaky data too early and you lose the room. Showback isn’t a phase you outgrow, either; most mature organizations keep it as their primary model and reserve chargeback for where it matters most.

The gateway is what makes this work for AI. Every model and tool call passes through it with an identity attached, so it’s the one place that can produce the per-team, per-agent attribution showback and chargeback run on. Without that shared layer, you’re back to one AI invoice no one owns.

AI governance at the gateway: resilience, audit, and tools

Teams adopt a gateway to control spend. They keep it because it’s the only place that answers the questions that arrive the moment agents touch production data.

Resilience. Cross-provider failover and circuit breaking turn a hard dependency on one vendor into a setup that degrades gracefully instead of going dark.

Identity and audit. Every request carries an authenticated identity through SSO, and every tool call, retrieval, and decision lands in an immutable record of what data was used and who approved the output. With EU AI Act obligations now phasing in and audit expectations rising across SOC 2 and GDPR, the teams that built structured logging and data-residency routing in early will certify with less friction.

Tool governance. Agents don’t just call models; they call tools through MCP. The gateway is where you curate which tools an agent can reach, authenticate per profile, and filter what’s exposed — so a tool call is governed the same way a model call is.

The thread running through all of it: the request a model makes is invisible to the security stack your company built for humans. The gateway is where it becomes visible, and governable.

Build or buy an AI gateway?

The most sophisticated AI teams have built their own gateways, and the work is real: routing that accounts for cache state, budgets reserved atomically under concurrency, attribution clean enough to charge back on, logging that stays off the critical path, an eval harness to keep model quality from drifting. That’s distributed-systems engineering, and a standing commitment.

Most teams shouldn’t take it on. The gateway is load-bearing now, and a self-hosted proxy that bottlenecks past a few hundred requests per second, leaks memory until it restarts, or stalls under sustained agent loops becomes the thing that takes your AI down. Building from scratch makes sense for unusual latency, compliance, or routing needs; for nearly everyone else in 2026, the gateway is something to adopt.

Adopting an AI gateway: a one-line change, useful on day one

A common objection is that the team isn’t ready — no FinOps practice, no governance program, not enough scale to justify infrastructure. That has the order backwards.

A managed gateway is a one-line change: point your OpenAI-compatible clients at a new base URL and they keep working, with no data plane to stand up, scale, or patch. The “heavy infrastructure” picture comes from self-hosting a proxy; it doesn’t apply to a hosted service.

The payoff doesn’t wait for tuning, either. The moment traffic routes through it, you have one endpoint, failover across providers, and per-team visibility into spend — before you write a routing rule or set a cache. Optimization comes later; resilience and visibility come on day one.

Readiness comes out of running the gateway. You don’t need a mature cost or governance program first, because the gateway is what produces the attribution and controls those programs are built on. Route through it now, and turn on attribution, policy, and MCP governance later, without redeploying.

Tetrate Agent Router Enterprise: an enterprise AI gateway

A basic gateway governs model calls. It doesn’t see the tool calls and actions an agent takes once it’s running. Tetrate Agent Router Enterprise governs both, on one control plane.

You adopt it with a one-line base-URL change: point your OpenAI-compatible clients at it and they keep working. It’s built on Envoy AI Gateway, which Tetrate co-created and maintains, so it runs on the same Envoy-and-Go stack that already carries production traffic at scale. In Tetrate’s benchmarks it adds near-zero latency under sustained load, where Python proxies degrade.

It does three jobs in one place. It routes across models with failover and attributes cost per team, agent, and project, with showback and chargeback. It runs an MCP gateway that controls which tools an agent reaches, authenticated per profile. And it enforces runtime guardrails — PII redaction and policy — on every request, each carrying an identity through SSO. You turn these on as you need them.

Run it hosted, with no data plane to operate, or, where regulation requires it, with the data plane inside your own perimeter and the control plane shared across teams.

The same control plane that routes traffic and attributes spend is where guardrails run and where a risky action is held for approval before it executes. Cost control and governance come from one layer.

AI gateway FAQ

What is an AI gateway? An AI gateway is a proxy layer between your applications and your model providers. It exposes a single, unified API and centralizes routing, caching, failover, cost tracking, logging, and policy enforcement so those concerns don’t get reimplemented in every service.

Is an AI gateway the same as an LLM gateway? Yes — two names for the same category. “LLM gateway” is the older, engineer-coded term from when the traffic was mostly text completions. “AI gateway” is the broader term now winning, because the layer governs agents, tools, and multimodal traffic, not just language-model calls. Use whichever your team searches for; they point at the same infrastructure.

Does an AI gateway govern what agents do, or only the model calls? A basic gateway governs model calls. Governing the tools an agent calls (via MCP) and the actions it takes requires a gateway that extends to them. Tetrate Agent Router Enterprise does this on one control plane — model routing, MCP tool access, and runtime guardrails together — with identity on every request.

Aren’t we too early for an AI gateway? Probably not. A managed gateway is a one-line base-URL change with no infrastructure to run, and it pays off immediately: one endpoint, cross-provider failover, and per-team spend visibility before you tune any routing or caching. You don’t need a mature cost or governance program first — the gateway is what produces the attribution and controls those programs need.

Can I control AI costs with rate limits and per-team caps? They help, but they can’t be the only lever. Per-user caps don’t track real cost — one long-context call can cost as much as fifty ordinary ones, and everyone can stay under their cap while the org bill overshoots. Controls set inside each app or harness are opt-in and bypassable, and they push configuration work onto every team, which drifts over time. Durable cost control lives at the gateway — defaults, routing, caching, and budgets applied to every team by default — paired with attribution so spend has an owner.

What’s the difference between showback and chargeback for AI spend? Showback reports a team’s AI spend while the cost stays on a central budget; chargeback moves that cost onto the team’s own budget. Showback builds awareness; chargeback adds the financial signal that changes behavior. It mirrors the cloud FinOps playbook: start with showback, then move to chargeback once attribution is accurate — usually once you can map north of 90% of spend to an owner. The gateway is what produces the per-team, per-agent attribution both models need.

Do I need an AI gateway? Usually once you call a second model provider or your monthly token spend becomes meaningful. The signal is concrete: you’ve copy-pasted retry-and-failover code into three services, or no one can answer what your AI costs broken down by feature or team.

Should I build or buy an AI gateway? Buy or adopt, for nearly everyone in 2026. Building means an ongoing commitment to cache-aware routing, concurrency-safe budgets, off-critical-path logging, and quality evaluation. Build only for unusual latency, compliance, or routing needs a mature gateway can’t meet.

What is Tetrate Agent Router Enterprise? An enterprise AI gateway built on Envoy AI Gateway. It provides model routing with failover, cost attribution with showback and chargeback, an MCP gateway for tool access, and runtime guardrails, in a dedicated instance with enterprise SSO. It’s OpenAI-compatible, adopted with a one-line base-URL change, and runs hosted or with an on-premises data plane for regulated industries.

Product background Product background for tablets
Building AI agents

Agent Router Enterprise provides a managed AI Gateway, MCP Gateway, and AI Guardrails in your dedicated instance. Graduate agents from prototype to production with consistent model access, governed tool use, and runtime supervision — built on Envoy AI Gateway by its creators.

  • AI Gateway – Unified model catalog with automatic fallback across providers
  • MCP Gateway – Curated tool access with per-profile authentication and filtering
  • AI Guardrails – Enforce policies, prevent data loss, and supervise agent behavior
  • Learn more
    Replacing NGINX Ingress

    Tetrate Enterprise Gateway for Envoy (TEG) is the enterprise-ready replacement for NGINX Ingress Controller. Built on Envoy Gateway and the Kubernetes Gateway API, TEG delivers advanced traffic management, security, and observability without vendor lock-in.

  • 100% upstream Envoy Gateway – CVE-protected builds
  • Kubernetes Gateway API native – Modern, portable, and extensible ingress
  • Enterprise-grade support – 24/7 production support from Envoy experts
  • Learn more
    Decorative CTA background pattern background background
    Tetrate logo in the CTA section Tetrate logo in the CTA section for mobile

    Ready to enhance your
    network

    with more
    intelligence?