Business Continuity: Failover that is Fast and Predictable

When production breaks, speed to a healthy path is what counts. Learn how to implement fast, predictable failover in Kubernetes using service-level routing guided by real-time locality, capacity, and health.

When production breaks, speed to a healthy path is what counts, and in Kubernetes that requires service-level failover guided by real-time locality, capacity, and health. Too many teams still reach for DNS flips and runbooks. Cached DNS records linger until their time-to-live (TTL) expires, cutovers drift, people scramble, and recovery takes longer than the incident requires. In multi-cluster and multi-region applications, failures surface at different layers, which produces inconsistent experiences and forces manual judgment in the middle of an incident.

Most teams already do failover, but much of it comes from a VM playbook: VIPs, host pairs, and DNS steering around static endpoints. Kubernetes changes the ground rules. Pods are ephemeral, IPs churn, and the unit of recovery is a service, not a host. Failover that works well in VMs becomes uneven in clusters because traffic needs to move by topology and health across zones and regions, not by lifting a virtual IP. DNS-based global server load balancing (GSLB) is a poor fit here: it lacks visibility into per-service pod health, and DNS caching slows failover. That is why the platform itself should make the routing decisions, using simple rules based on health and locality.

What good looks like: move failover out of DNS and app code and into platform policy. Traffic stays local in steady state. When health drops, routing prefers the next healthy zone, then the next region, using locality rules and live health checks. The same approach applies at the edge and for east-west traffic. Teams have one view of topology, capacity, and health, so decisions are repeatable.
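The preference order described above (local zone first, then another zone in the same region, then another region) can be sketched in a few lines of Python. This is an illustration of the decision rule only, not how any particular proxy implements it: the `Endpoint` type, the `locality_priority` and `pick_endpoints` functions, and the addresses are all hypothetical. It is also all-or-nothing, whereas a real proxy such as Envoy shifts traffic gradually as health within a priority level drops.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    """Hypothetical endpoint record; field names are assumptions for this sketch."""
    address: str
    region: str
    zone: str
    healthy: bool


def locality_priority(client_region: str, client_zone: str, ep: Endpoint) -> int:
    """Lower is better: 0 = same zone, 1 = same region, 2 = other region."""
    if ep.region == client_region and ep.zone == client_zone:
        return 0
    if ep.region == client_region:
        return 1
    return 2


def pick_endpoints(client_region: str, client_zone: str,
                   endpoints: list[Endpoint]) -> list[Endpoint]:
    """Return the healthy endpoints in the closest locality that has any."""
    healthy = [ep for ep in endpoints if ep.healthy]
    if not healthy:
        return []
    best = min(locality_priority(client_region, client_zone, ep) for ep in healthy)
    return [ep for ep in healthy
            if locality_priority(client_region, client_zone, ep) == best]


# Illustrative data: the local zone is down, so traffic moves to the next zone.
endpoints = [
    Endpoint("10.0.0.1", "us-east", "us-east-a", healthy=False),
    Endpoint("10.0.1.1", "us-east", "us-east-b", healthy=True),
    Endpoint("10.1.0.1", "us-west", "us-west-a", healthy=True),
]
chosen = pick_endpoints("us-east", "us-east-a", endpoints)
print([ep.address for ep in chosen])
```

Because the rule depends only on locality labels and health, the same function gives the same answer in every cluster, which is the repeatability the policy-based approach is after.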

Here is how the problems map to the approach and what you get out of it.

| Problem in today’s failover | What good looks like | How to implement it | Outcome |
| --- | --- | --- | --- |
| DNS flips and runbooks stretch RTO and invite errors | Failover happens by policy based on health and priority | Topology-aware global routing with health checks and outlier detection | Faster, deterministic recovery |
| Multi-cluster and multi-region behavior is inconsistent | The same rules apply in every cluster and region | Central policy with distribution to local control planes | Predictable failover and failback everywhere |
| Edge and east-west paths are handled differently | Standard patterns at the edge and for east-west traffic | Tiered gateways with reusable templates | Repeatable operations and fewer one-offs |
| Operators lack a single view of topology and capacity | Decisions are guided by live visibility | Unified views for topology, health, and capacity headroom | Quicker diagnosis and safer failback |

Together, these patterns turn failover from a manual event into repeatable platform behavior.

How to implement this with open source

You can build this model with Istio and Envoy Gateway. Set up multi-cluster service discovery, deploy tiered gateways for edge and east-west traffic, and enable locality-aware load balancing with a clear order: local zone first, then the next zone, then the next region. Turn on active health checks at the gateway and outlier detection so unhealthy endpoints are ejected quickly. Configure priorities, thresholds, hold-down timers, and connection draining so new sessions move to healthy capacity and existing sessions complete cleanly. Manage configuration through Git and a GitOps workflow such as Argo CD. Run regular drills and measure recovery time (RTO), error rate during cutover, and time to stable failback.
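As a rough illustration of the locality and outlier-detection part of that setup, the Python sketch below builds an Istio `DestinationRule` and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The rule name, host, region names, and every threshold are placeholder assumptions, not recommendations, and the helper `locality_failover_rule` is hypothetical. The sketch covers only region failover and passive outlier detection; gateway health checks, hold-down timers, and connection draining are configured separately. Istio requires outlier detection to be set for locality failover to take effect, which is why both appear in the same rule.

```python
import json


def locality_failover_rule(name: str, host: str,
                           region_failover: list[tuple[str, str]]) -> dict:
    """Build a DestinationRule dict; region_failover is (from_region, to_region) pairs."""
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "DestinationRule",
        "metadata": {"name": name},
        "spec": {
            "host": host,
            "trafficPolicy": {
                "loadBalancer": {
                    "localityLbSetting": {
                        "enabled": True,
                        # Where traffic goes when a region has no healthy endpoints.
                        "failover": [
                            {"from": src, "to": dst} for src, dst in region_failover
                        ],
                    }
                },
                # Passive health: eject endpoints that keep returning 5xx.
                # All values below are illustrative placeholders.
                "outlierDetection": {
                    "consecutive5xxErrors": 5,
                    "interval": "10s",
                    "baseEjectionTime": "30s",
                    "maxEjectionPercent": 100,
                },
            },
        },
    }


# Hypothetical service and regions, for illustration only.
rule = locality_failover_rule(
    "checkout-failover",
    "checkout.prod.svc.cluster.local",
    [("us-east-1", "us-west-2"), ("us-west-2", "us-east-1")],
)
manifest = json.dumps(rule, indent=2)
print(manifest)
```

Generating the manifest from code is optional; the point is that the failover order lives in one reviewed artifact in Git rather than in a runbook.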

How to implement this with Tetrate Service Bridge (TSB)

TSB allows you to achieve the same model through configuration. Register clusters, apply standard edge and east-west gateway templates, set locality and regional priorities in one place, and enable health checks and outlier detection. Use topology and service views to validate drills and failback. If you want to see this mapped to your own regions and SLOs, contact us to request a demo.

Teams that run this model recover faster and with less variance because routing follows live health and locality rules instead of ad-hoc DNS edits. Incidents get easier to manage because every cluster and region behaves the same way. Resilience and cost align better since traffic stays local in steady state and only expands to other zones or regions when needed.
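To check claims like "faster and with less variance" against your own drills, it helps to compute the same numbers every time. The sketch below is one hypothetical way to do that from a series of probe results. It assumes recovery time is measured from the first failed probe to the start of the first run of `stable_window` consecutive successes, and that the cutover error rate is the share of failed probes in between. The `Sample` type, the `drill_metrics` function, and the sample data are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Sample:
    """One synthetic probe result from a failover drill (hypothetical format)."""
    t: float   # seconds since the drill started
    ok: bool   # True if the probe request succeeded


def drill_metrics(samples: list[Sample],
                  stable_window: int = 3) -> tuple[Optional[float], Optional[float]]:
    """Return (recovery_seconds, cutover_error_rate), or (None, None) if there
    was no failure or no stable recovery within the samples."""
    first_fail = next((i for i, s in enumerate(samples) if not s.ok), None)
    if first_fail is None:
        return None, None
    # Recovery starts at the first index after the failure where
    # stable_window consecutive probes all succeed.
    for i in range(first_fail + 1, len(samples) - stable_window + 1):
        if all(s.ok for s in samples[i:i + stable_window]):
            cutover = samples[first_fail:i]
            errors = sum(1 for s in cutover if not s.ok)
            return samples[i].t - samples[first_fail].t, errors / len(cutover)
    return None, None


# Illustrative drill: one probe per second, failure at t=2, one flap at t=5.
pattern = [True, True, False, False, True, False, True, True, True, True]
samples = [Sample(t=float(i), ok=ok) for i, ok in enumerate(pattern)]
recovery_seconds, error_rate = drill_metrics(samples)
print(recovery_seconds, error_rate)
```

Requiring several consecutive successes before declaring recovery keeps a brief flap from being counted as a stable failover.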

Learn more about Tetrate Service Bridge to see how it can help you implement fast, predictable failover in your environment.

Contact us to learn how Tetrate can help your journey. Follow us on LinkedIn for the latest updates and best practices.
