Business Continuity: Failover that is Fast and Predictable

When production breaks, speed to a healthy path is what counts. Learn how to implement fast, predictable failover in Kubernetes using service-level routing guided by real-time locality, capacity, and health.


When production breaks, speed to a healthy path is what counts, and in Kubernetes that requires service-level failover guided by real-time locality, capacity, and health. Too many teams still reach for DNS flips and runbooks: DNS time-to-live (TTL) values linger, cutovers drift, people scramble, and recovery time stretches beyond what the incident requires. In multi-cluster and multi-region applications, failures surface at different layers, which produces inconsistent experiences and forces manual judgment in the middle of an incident.

Most teams already do failover, but much of it comes from a VM playbook: VIPs, host pairs, and DNS steering around static endpoints. Kubernetes changes the ground rules. Pods are ephemeral, IPs churn, and the unit of recovery is a service, not a host. Failover that works well on VMs becomes uneven in clusters because traffic needs to move by topology and health across zones and regions, not by lifting a virtual IP. GSLB lacks visibility into per-service pod health, and DNS caching slows failover, so the platform itself should make the routing decisions using simple rules based on health and locality.

What good looks like: move failover out of DNS and app code and into platform policy. Traffic stays local in steady state. When health drops, routing prefers the next healthy zone, then the next region, using locality rules and live health checks. The same approach applies at the edge and for east-west traffic. Teams have one view of topology, capacity, and health, so decisions are repeatable.

Here is how the problems map to the approach and what you get out of it.

| Problem in today's failover | What good looks like | How to implement it | Outcome |
| --- | --- | --- | --- |
| DNS flips and runbooks stretch RTO and invite errors | Failover happens by policy based on health and priority | Topology-aware global routing with health checks and outlier detection | Faster, deterministic recovery |
| Multi-cluster and multi-region behavior is inconsistent | The same rules apply in every cluster and region | Central policy with distribution to local control planes | Predictable failover and failback everywhere |
| Edge and east-west paths are handled differently | Standard patterns at the edge and for east-west traffic | Tiered gateways with reusable templates | Repeatable operations and fewer one-offs |
| Operators lack a single view of topology and capacity | Decisions are guided by live visibility | Unified views for topology, health, and capacity headroom | Quicker diagnosis and safer failback |

Together, these patterns turn failover from a manual event into repeatable platform behavior.

How to implement this with open source

You can build this model with Istio and Envoy Gateway. Set up multi-cluster service discovery, deploy tiered gateways for edge and east-west traffic, and enable locality-aware load balancing with a clear order from zone to zone to region. Turn on active health checks at the gateway and outlier detection so unhealthy endpoints are ejected quickly. Configure priorities, thresholds, hold-down timers, and connection draining so new sessions move to healthy capacity and existing sessions complete cleanly. Manage configuration through Git and a workflow like Argo CD. Run regular drills and measure RTO, error rate during cutover, and time to stable failback.
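As a rough sketch of the locality and ejection settings described above: an Istio DestinationRule can express the zone-then-region failover order, with outlier detection ejecting unhealthy endpoints (in Istio, outlier detection must be set for locality failover to activate), while an Envoy Gateway BackendTrafficPolicy enables active health checks at the gateway tier. The service name `checkout`, namespace `shop`, route name `checkout-route`, and the exact thresholds are placeholders to adapt to your environment, and the Envoy Gateway example assumes its `v1alpha1` policy API.

```yaml
# Istio: keep traffic local in steady state; when health drops, prefer
# the next healthy zone, then the next region.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: checkout-failover
  namespace: shop
spec:
  host: checkout.shop.svc.cluster.local
  trafficPolicy:
    loadBalancer:
      localityLbSetting:
        enabled: true
        failoverPriority:
          - "topology.kubernetes.io/zone"
          - "topology.kubernetes.io/region"
    outlierDetection:          # required for locality failover to take effect
      consecutive5xxErrors: 5
      interval: 10s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
---
# Envoy Gateway: active health checks so the gateway stops sending
# new sessions to endpoints that fail their probes.
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: checkout-health
  namespace: shop
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: checkout-route
  healthCheck:
    active:
      type: HTTP
      interval: 3s
      timeout: 1s
      unhealthyThreshold: 3
      healthyThreshold: 1
      http:
        path: /healthz
```

Keeping manifests like these in Git and syncing them with Argo CD means every cluster applies the same policy, and priorities or thresholds can be tuned per environment without touching DNS.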

How to implement this with Tetrate Service Bridge (TSB)

TSB allows you to achieve the same model through configuration. Register clusters, apply standard edge and east-west gateway templates, set locality and regional priorities in one place, and enable health checks and outlier detection. Use topology and service views to validate drills and failback. If you want to see this mapped to your own regions and SLOs, contact us to request a demo.

Teams that run this model recover faster and with less variance because routing follows live health and locality rules instead of ad-hoc DNS edits. Incidents get easier to manage because every cluster and region behaves the same way. Resilience and cost align better since traffic stays local in steady state and only expands to other zones or regions when needed.

Learn more about Tetrate Service Bridge to see how it can help you implement fast, predictable failover in your environment.

Contact us to learn how Tetrate can help your journey. Follow us on LinkedIn for the latest updates and best practices.
