What Happens to Your AI Agents When Anthropic or OpenAI Goes Down? Building LLM Failover That Actually Works

What happens to your AI agents when a model provider goes down?

The short answer: in most organizations today, every agent fails simultaneously, every on-call engineer gets paged at once, and recovery is manual. That’s because each team wired its agents directly to a provider’s SDK, so a single provider incident has a blast radius spanning every AI workload in the company. The fix is architectural: route all model traffic through a gateway with policy-driven, multi-provider failover, so a provider outage becomes a routing event absorbed in one place instead of an incident distributed to every team.

This post covers the failure modes that actually occur, why per-app retries don’t solve them, and the failover patterns that work in production. For gateway terminology, see our explainer on LLM gateway vs AI gateway vs MCP gateway.

How often do LLM providers actually fail?

Often enough that resilience is a design requirement, not paranoia. Every major provider has status-page incidents: elevated error rates, latency spikes, regional degradation, and occasional full outages. And hard downtime is only the most visible failure mode. Production teams hit four distinct ones:

Outages and elevated error rates. The provider returns 5xx errors or times out.
Capacity throttling. You receive 429s during provider-side congestion even though you’re within your own limits.
Silent model updates. The provider revs a model version and your agent’s behavior changes overnight with no error at all. This one is insidious because nothing alerts.
Regional and capability gaps. A model is degraded or unavailable in one region while fine in another.

If your AI agents are in revenue-critical or customer-facing paths, “the provider was down” is no more acceptable an answer than it would be for your database.

Why retries inside each application aren’t a resilience strategy

The default pattern is each team adding retry logic around its provider SDK. It fails for predictable reasons:

Retrying into an outage amplifies it. Hammering a degraded provider with retries adds load and delays your own recovery.
Every team builds it differently. Dozens of bespoke retry/fallback implementations mean dozens of different behaviors during an incident, none of them observable in one place.
No cross-provider escape hatch. Retries against the same provider can’t help when the provider itself is the problem. Real failover means a second provider, with credentials, model mapping, and routing already in place.
It can’t handle silent updates. Retries only respond to errors. Version pinning and policy-controlled rollover are gateway functions.
The blast radius stays global. Even with perfect per-app retries, a provider incident still pages every team simultaneously because nothing is absorbed centrally.
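
The amplification point can be made concrete with a little arithmetic. The sketch below assumes a simplified model in which every attempt fails independently with the same probability and each client retries immediately up to a fixed limit; the function name and the numbers are illustrative only.

```python
def expected_attempts(failure_rate: float, max_retries: int) -> float:
    """Average upstream calls per logical request.

    Attempt i+1 only happens if the first i attempts all failed, so the
    expectation is 1 + p + p^2 + ... + p^max_retries.
    """
    return sum(failure_rate ** i for i in range(max_retries + 1))


# Healthy provider: one call per request.
healthy = expected_attempts(0.0, max_retries=3)
# Degraded provider failing half of all calls.
degraded = expected_attempts(0.5, max_retries=3)
# Full outage: every request turns into 1 + max_retries calls.
outage = expected_attempts(1.0, max_retries=3)
```

Under these assumptions, a client with three retries sends four times its normal traffic into a provider that is fully down, which is the load amplification described above.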

Resilience plumbing copied into every agent is exactly the kind of repeated infrastructure that belongs in a shared layer. See our 2026 enterprise AI gateway comparison for how vendors handle failover differently.

What does production-grade LLM failover look like?

Five patterns, all enforced at the gateway so every agent inherits them without code changes:

1. Cross-provider failover for the same model

The cleanest failover keeps the model constant and changes the provider. Frontier models are increasingly available through multiple channels: Anthropic’s models via the Anthropic API, AWS Bedrock, and Google Vertex AI; OpenAI’s via OpenAI and Azure OpenAI. Failing over from Anthropic’s API to the same Claude model on Vertex preserves agent behavior, prompts, and evaluation assumptions. Your agents shouldn’t notice anything except continuity.
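
A minimal sketch of same-model failover, assuming each channel exposes the same model under its own identifier. The `Channel` type, the `ProviderError` exception, the channel names, and the model identifiers are all hypothetical, and the transports are stand-ins so the example runs without network access. A real gateway would do this in its routing layer rather than in application code.

```python
from dataclasses import dataclass
from typing import Callable


class ProviderError(Exception):
    """Raised by a channel when the upstream call fails (5xx, timeout, 429)."""


@dataclass
class Channel:
    name: str                         # label for logs and attribution
    model_id: str                     # this channel's identifier for the same model
    call: Callable[[str, str], str]   # (model_id, prompt) -> completion text


def complete_with_failover(channels: list[Channel], prompt: str) -> tuple[str, str]:
    """Try each channel in order and return (channel_name, completion)."""
    errors = []
    for channel in channels:
        try:
            return channel.name, channel.call(channel.model_id, prompt)
        except ProviderError as exc:
            errors.append(f"{channel.name}: {exc}")
    raise ProviderError("all channels failed: " + "; ".join(errors))


# Stand-in transports (assumptions, not real SDK calls).
def failing_call(model_id: str, prompt: str) -> str:
    raise ProviderError("503 from upstream")


def healthy_call(model_id: str, prompt: str) -> str:
    return f"[{model_id}] reply to: {prompt}"


channels = [
    Channel("primary-api", "example-model-v1", failing_call),
    Channel("secondary-cloud", "example-model-v1@cloud", healthy_call),
]
served_by, text = complete_with_failover(channels, "ping")
```

The caller passes one prompt and receives one completion; only the returned channel name reveals that the request was served by the second channel.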

2. Cross-model fallback chains

When the same model isn’t available elsewhere, define an ordered fallback to an approved alternative model, with the substitution logged. The key word is approved: fallback targets should come from a curated catalog, not whatever the proxy finds, because an unvetted model substitution in a regulated workflow is its own incident.
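
One way to express an approved fallback chain is a lookup against a curated catalog, with each substitution logged. The sketch below is an assumption about how such a catalog could be represented; the model names, the `APPROVED_FALLBACKS` mapping, and the logger name are hypothetical.

```python
import logging

logger = logging.getLogger("gateway.fallback")

# Hypothetical curated catalog: requested model -> ordered, pre-approved substitutes.
APPROVED_FALLBACKS: dict[str, list[str]] = {
    "model-a": ["model-b", "model-c"],
    "model-b": [],  # no approved substitute: fail rather than improvise
}


def resolve_model(requested: str, available: set[str]) -> str:
    """Return the requested model if available, else the first approved substitute."""
    if requested in available:
        return requested
    for candidate in APPROVED_FALLBACKS.get(requested, []):
        if candidate in available:
            # The substitution is logged so it can be audited afterwards.
            logger.warning("model substitution: %s -> %s", requested, candidate)
            return candidate
    raise LookupError(f"no approved fallback available for {requested!r}")


# "model-x" is available but not in the catalog, so it is never chosen.
chosen = resolve_model("model-a", available={"model-c", "model-x"})
```

Raising an error when no approved target exists is deliberate in this sketch: it keeps an unvetted model from being substituted silently.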

3. Model version pinning

Pin production workloads to a specific model version so provider-side updates don’t silently change behavior. Roll versions forward deliberately, through the gateway, after evaluation, for all agents at once.
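
As an illustration, pinning can be modeled as a registry that maps the alias agents request to an exact version, and that only moves the pin after an evaluation has passed. The `PinRegistry` class, the alias, and the version strings below are hypothetical and are not taken from any provider or gateway.

```python
from dataclasses import dataclass, field


@dataclass
class PinRegistry:
    pins: dict[str, str] = field(default_factory=dict)
    history: list[tuple[str, str, str]] = field(default_factory=list)

    def resolve(self, alias: str) -> str:
        """Agents request an alias; the gateway substitutes the pinned version."""
        try:
            return self.pins[alias]
        except KeyError:
            raise KeyError(f"alias {alias!r} has no pinned version") from None

    def roll_forward(self, alias: str, new_version: str, eval_passed: bool) -> None:
        """Move the pin only after an evaluation run has passed."""
        if not eval_passed:
            raise ValueError("refusing to roll forward without a passing evaluation")
        old = self.pins.get(alias, "<unpinned>")
        self.pins[alias] = new_version
        self.history.append((alias, old, new_version))  # audit trail of rollovers


registry = PinRegistry({"chat-default": "example-model-2025-01-15"})
before = registry.resolve("chat-default")
registry.roll_forward("chat-default", "example-model-2025-06-01", eval_passed=True)
after = registry.resolve("chat-default")
```

Because every agent resolves the alias through the same registry, a single roll-forward changes the version for all of them at once.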

4. Load balancing and health-based routing

Distribute traffic across providers and regions based on health signals and latency, with automatic cooldown of degraded targets, so you shed load away from a failing provider before it becomes a full outage for you.
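
The cooldown behavior can be sketched as a round-robin router that skips any target with too many consecutive failures until a timer expires. The class names, the threshold, and the cooldown length are illustrative assumptions; the current time is passed in explicitly so the behavior is deterministic. Latency-based weighting is omitted for brevity.

```python
from dataclasses import dataclass


@dataclass
class Target:
    name: str
    consecutive_failures: int = 0
    cooldown_until: float = 0.0  # timestamp before which the target is skipped


class HealthRouter:
    def __init__(self, names: list[str], failure_threshold: int = 3,
                 cooldown_seconds: float = 30.0):
        self.targets = [Target(n) for n in names]
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self._next = 0

    def pick(self, now: float) -> str:
        """Round-robin over targets that are not cooling down."""
        for _ in range(len(self.targets)):
            target = self.targets[self._next % len(self.targets)]
            self._next += 1
            if now >= target.cooldown_until:
                return target.name
        raise RuntimeError("no healthy targets")

    def report(self, name: str, ok: bool, now: float) -> None:
        """Record the outcome of a request sent to `name`."""
        target = next(t for t in self.targets if t.name == name)
        if ok:
            target.consecutive_failures = 0
            return
        target.consecutive_failures += 1
        if target.consecutive_failures >= self.failure_threshold:
            target.cooldown_until = now + self.cooldown_seconds
            target.consecutive_failures = 0


router = HealthRouter(["provider-a", "provider-b"], failure_threshold=2,
                      cooldown_seconds=10.0)
router.report("provider-a", ok=False, now=0.0)
router.report("provider-a", ok=False, now=0.0)   # threshold reached: cooldown starts
during_cooldown = [router.pick(now=t) for t in (1.0, 2.0, 3.0)]
after_cooldown = router.pick(now=11.0)
```

While `provider-a` is cooling down, all traffic goes to `provider-b`; once the cooldown expires, `provider-a` rejoins the rotation and has to prove itself healthy again.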

5. Budget-aware and region-aware policy

Failover must respect the same policies as the primary path: regulated workloads pinned to approved regions, budgets enforced on whichever provider serves the request, and identity carried through so attribution survives the switch.
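
A sketch of what policy-aware failover could mean in code: candidate routes are filtered by the same region and budget rules as the primary path before any of them is tried. The `Route` and `Policy` types, the region labels, and the cost figures are hypothetical and do not reflect real pricing. Identity propagation is left out of this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    provider: str
    region: str
    cost_per_1k_tokens: float  # illustrative numbers only


@dataclass(frozen=True)
class Policy:
    allowed_regions: frozenset[str]
    remaining_budget: float


def eligible_routes(routes: list[Route], policy: Policy, est_tokens: int) -> list[Route]:
    """Keep only failover routes that satisfy region and budget policy."""
    eligible = []
    for route in routes:
        if route.region not in policy.allowed_regions:
            continue  # regulated workload must stay in approved regions
        est_cost = route.cost_per_1k_tokens * est_tokens / 1000
        if est_cost > policy.remaining_budget:
            continue  # budget applies to whichever provider serves the request
        eligible.append(route)
    return eligible


routes = [
    Route("primary", "eu-west", 0.01),
    Route("secondary", "us-east", 0.01),    # wrong region for this workload
    Route("tertiary", "eu-central", 0.05),  # in region, but over budget
]
policy = Policy(allowed_regions=frozenset({"eu-west", "eu-central"}),
                remaining_budget=0.05)
allowed = eligible_routes(routes, policy, est_tokens=2000)
```

Filtering before routing means a failover can never select a target that the primary path would have been forbidden to use.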

Tetrate Agent Router Enterprise provides continuous runtime governance for GenAI systems. Enforce policies, control costs, and maintain compliance at the infrastructure layer — without touching application code.


How do you test failover before you need it?

Treat it like any other disaster-recovery capability:

Simulate provider failure at the gateway (force errors for a provider) in a staging environment and verify agents continue on the secondary path.
Measure mean time to failover. This belongs on the platform team’s dashboard alongside uptime. If failover takes a manual config change, it isn’t failover.
Reconcile after the drill: token attribution, logs, and behavior on the secondary provider should match policy.
Run the drill quarterly. Provider channels, models, and credentials drift. A failover path that hasn’t been exercised in six months is a hypothesis.
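
The drill can be scripted along these lines. `FakeGateway` is a stand-in for a staging gateway and is entirely hypothetical, including its `force_errors` hook; elapsed time is simulated from the probe interval rather than measured, so the sketch runs instantly and deterministically.

```python
class FakeGateway:
    """Stand-in for a staging gateway that fails over after `detect_after` errors."""

    def __init__(self, detect_after: int = 3):
        self.detect_after = detect_after
        self.primary_forced_down = False
        self.errors_seen = 0

    def force_errors(self, provider: str) -> None:
        """Simulate a provider failure (hypothetical fault-injection hook)."""
        if provider == "primary":
            self.primary_forced_down = True

    def send(self, prompt: str) -> str:
        """Return the name of the provider that served the request."""
        if not self.primary_forced_down:
            return "primary"
        if self.errors_seen < self.detect_after:
            self.errors_seen += 1
            raise ConnectionError("primary unavailable")
        return "secondary"


def time_to_failover(gateway: FakeGateway, probe_interval_s: float = 0.5,
                     max_probes: int = 50) -> float:
    """Inject a primary failure; return simulated seconds until the secondary serves."""
    gateway.force_errors("primary")
    for probe in range(max_probes):
        try:
            if gateway.send("drill probe") == "secondary":
                return probe * probe_interval_s
        except ConnectionError:
            pass  # still failing; probe again after the interval
    raise TimeoutError("secondary path never took over")


seconds = time_to_failover(FakeGateway(detect_after=3))
```

In a real drill the probes would go to the staging gateway and the clock would be real, but the shape is the same: inject the failure, probe until the secondary path answers, and record the elapsed time as the failover metric.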

Why this is a gateway problem, not an application problem

Everything above (provider credentials for multiple channels, health tracking, version pinning, approved catalogs, policy-aware routing) is shared state and shared policy. The only place it can live coherently is a control point that all AI traffic flows through. That’s the gateway’s job. It’s also why gateway reliability itself matters: a failover layer that crashes under load just relocates your single point of failure, a pattern production users have reported with lightweight Python proxies under sustained backpressure. For context on that reliability pattern, see our analysis of the LiteLLM supply chain incident and migration guide.

Tetrate Agent Router Enterprise implements these patterns on Envoy, the proxy technology that has handled failover for mission-critical enterprise traffic for a decade, with policy-driven multi-provider routing, same-model cross-provider failover, version pinning, and health-based load balancing managed from one control plane and enforced at data planes in your own VPC or on-prem.

Frequently asked questions

Does multi-provider failover require maintaining multiple provider contracts? You need credentials for each failover channel, but bring-your-own-key support means existing contracts carry over, and cloud-marketplace channels (Bedrock, Vertex, Azure) often fall under agreements you already have.

Will responses differ when I fail over? With same-model cross-provider failover, behavior is essentially preserved. With cross-model fallback, expect differences: that’s why fallback targets should be evaluated and approved in advance, and substitutions logged.

How is this different from OpenRouter-style routing? Hosted aggregators provide multi-model routing through their cloud, which adds their infrastructure as a dependency and offers limited policy control. Enterprise failover runs inside your perimeter, under your policies, with your identity and budget context attached.

What’s the latency cost of a gateway in the path? A well-engineered gateway adds low single-digit milliseconds, which is noise against LLM inference times measured in hundreds of milliseconds to seconds. The resilience and observability gains dominate.

Tetrate Agent Router Enterprise delivers policy-driven failover across providers and models, built on the CNCF-backed Envoy AI Gateway. Book a demo and we’ll run a provider-failure simulation against your traffic patterns.

Sources

Provider status pages and incident histories (OpenAI, Anthropic, major cloud AI platforms)
Production reports on Python proxy reliability under sustained load
Enterprise gateway failover pattern documentation
