Modern multi-service, multi-cloud applications can be fragile, particularly when running in highly-automated infrastructure with many moving parts. To make them robust in the face of production traffic, you need to eliminate single points of failure, and replicate services across clusters and cloud regions.
Tetrate Service Bridge makes your applications highly available by dynamically controlling traffic into and between your infrastructure. Let’s see how this is done, considering North-South and East-West traffic, as well as optimizations for DNS GSLB integrations.
Setting the Scene
The examples presented here use AWS EKS clusters to host services, and AWS Route 53 to manage DNS. They can be adapted to any Kubernetes platform (on-prem, cloud-hosted, even hybrid) and to any suitable DNS service.
As an example for these use cases, we will use the Istio Bookinfo application. Bookinfo uses multiple dependent services (productpage, details, reviews, ratings) to create a simple application:
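The application itself is the standard Istio sample, so if you want to reproduce the setup, a typical way to deploy it into each workload cluster is the following (assuming a bookinfo namespace with sidecar injection enabled; substitute your Istio release branch):

% kubectl apply -n bookinfo -f https://raw.githubusercontent.com/istio/istio/release-1.20/samples/bookinfo/platform/kube/bookinfo.yaml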
The application is deployed into two AWS EKS workload clusters (cluster-work1 and cluster-work2), each in a different AWS region (us-east-1 and us-west-1). Each region also contains an edge cluster (cluster-edge1 and cluster-edge2) with an Edge Gateway:
The Edge Gateway IPs are published using DNS records in Route 53 (other GSLB solutions will work similarly):
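You can inspect the records that the controller maintains directly with the AWS CLI (the hosted zone ID and hostname below are illustrative):

% aws route53 list-resource-record-sets --hosted-zone-id Z0EXAMPLE12345 \
    --query "ResourceRecordSets[?Name=='bookinfo.example.com.']"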
In this blog, we’ll explore how to make the application highly available using Tetrate Service Bridge and the underlying Istio data plane. We’ll consider several possible failure scenarios:
Failure Recovery Scenarios and Traffic Optimizations
North-South Failures (traffic flow from client to workload cluster)
- Stop publishing the application from the workload cluster
  - Example: Delete application’s Ingress resource on cluster-work1
  - Response: Switch traffic to cluster-work2
- A connectivity failure to the workload cluster
  - Example: Scale Ingress Gateway to 0 replicas on cluster-work1
  - Response: Switch traffic to cluster-work2
- Stop publishing the application from the Edge Gateway
  - Example: Delete application’s Edge Ingress resource on cluster-edge1
  - Response: Tetrate removes DNS entry for Edge Gateway on cluster-edge1
- A complete failure of the edge cluster
  - Example: Scale the Edge Gateway to 0 replicas on cluster-edge1
  - Response: DNS health checks detect the failure and update DNS to retire cluster-edge1
East-West Failures (traffic flows within the workload clusters)
- An internal failure within the application
  - Example: Scale an application component to 0 replicas on cluster-work1
  - Response: Immediately route internal traffic to the component on cluster-work2
- The application is deleted from a workload cluster
  - Example: Delete all application components from cluster-work1
  - Response: Immediately route internal traffic to the application on cluster-work2
Traffic Optimizations (optimize traffic flow when a failure occurs)
- A workload cluster fails or cannot respond to application requests
  - Example: Delete Ingress resource or scale Ingress Gateway to 0 replicas on cluster-work1
  - Response: Switch traffic to cluster-work2. Retire DNS entry for local Edge Gateway (cluster-edge1) to avoid latency and transit costs
North-South High Availability
North-South traffic refers to traffic entering your application environment, generally from remote, over-the-internet clients. North-South traffic typically passes through a series of gateways and is then routed to the entry point for an application.
We host the application in two EKS workload clusters, cluster-work1 and cluster-work2, running in two AWS regions. In each region, we also deploy an Edge Gateway in dedicated clusters cluster-edge1 and cluster-edge2.
The purpose of the Edge Gateway is to receive and terminate traffic from external, internet-based clients. The Edge Gateway then load-balances the requests across the working workload clusters. You can use a DNS GSLB solution to distribute traffic across the edge clusters; in this example, we have used Tetrate’s AWS-controller to drive Route 53 DNS, but other solutions may also be used.
In general, all gateways will use locality-prioritized load balancing to favor targets located in the same cloud region, only failing over if the local targets have failed.
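Tetrate manages this configuration for you across the mesh; at the Istio level, locality-prioritized failover corresponds to settings like the following DestinationRule sketch (the service, regions and thresholds here are illustrative, not taken from the deployment described above):

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: reviews-locality
  namespace: bookinfo
spec:
  host: reviews.bookinfo.svc.cluster.local
  trafficPolicy:
    loadBalancer:
      localityLbSetting:
        enabled: true
        failover:
        - from: us-east-1        # prefer endpoints in the local region
          to: us-west-1          # fail over to the remote region when local endpoints are unhealthy
    outlierDetection:            # passive health checking; required for locality failover to take effect
      consecutive5xxErrors: 3
      interval: 10s
      baseEjectionTime: 30s

Istio only honors locality failover when outlier detection is enabled on the destination, which is why the sketch includes both settings.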
Why use Edge Gateways rather than exposing the workload clusters directly? At first glance, the architecture does look wasteful, and the edge clusters appear redundant. The value of the edge clusters becomes clear when you consider that:
- As you scale to multiple applications and multiple clusters in each region, the complexity of exposing each workload cluster and tracking which applications are present on each becomes overwhelming. A small number of Edge clusters eliminates this complexity.
- Workload clusters can be fluid, with frequently-changing configuration and unpredictable scaling events. Errors and availability problems are more likely on Workload clusters than on Edge clusters.
A frontend tier of edge gateways provides a stable entry point for all applications and clusters, and acts as a buffer to inspect and filter traffic, so that only traffic for known, published applications is forwarded to the workload clusters. Network Reachability and firewall rules ensure that the Workload clusters can only be reached from the downstream Edge clusters.
Failover Scenario One: Application Unpublished from a Work Cluster
Provoke this scenario by deleting the bookinfo Gateway resource from cluster-work1:
% tctl delete -f bookinfo-ingress-1.yaml
This scenario models an issue where a deployment fails and the Gateway resource is not published. The Tetrate solution quickly learns the new application topology and reconfigures the Edge Gateways to forward traffic to cluster-work2:
Requests for the application to cluster-work1 may fail for approximately 1-2 seconds while the configuration change propagates. Once it has propagated, the Edge Gateways no longer send requests to cluster-work1. When the Gateway resource is re-published in cluster-work1, edgegw-1 quickly reverts to using that new, local instance.
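For reference, bookinfo-ingress-1.yaml is a TSB IngressGateway resource along the lines of the sketch below. Treat the field names as an approximation of TSB’s gateway.tsb.tetrate.io/v2 API (they vary between TSB releases), and the workspace, hostname and labels as placeholders rather than values from this deployment:

apiVersion: gateway.tsb.tetrate.io/v2
kind: IngressGateway
metadata:
  organization: tetrate
  tenant: tetrate
  workspace: bookinfo-ws
  group: bookinfo-gw
  name: bookinfo-ingress-1
spec:
  workloadSelector:
    namespace: bookinfo
    labels:
      app: ingressgw-1           # the Ingress Gateway deployment in cluster-work1
  http:
  - name: bookinfo
    port: 80
    hostname: bookinfo.example.com
    routing:
      rules:
      - route:
          host: bookinfo/productpage.bookinfo.svc.cluster.local
          port: 9080

Deleting this resource (or failing to publish it during a deployment) is what scenario one simulates.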
Failover Scenario Two: Complete Ingress Failure at a Work Cluster
Provoke this scenario by scaling the Ingress Gateway deployment down to 0 replicas. The AWS load balancer for cluster-work1 continues to accept traffic, but has nowhere to forward it to:
% kubectl scale deployment ingressgw-1 -n bookinfo --replicas=0
This scenario models a total failure of cluster-work1. The Edge Gateway edgegw-1 identifies that the target is not responding, and fails over to the remote cluster:
Connection attempts to cluster-work1 will fail for approximately 2-4 seconds while the failure of cluster-work1 is confirmed. Infrastructure errors like this are handled cautiously (to avoid flip-flopping when isolated errors occur), so they take slightly longer to detect and recover from.
Failover Scenario Three: Application Unpublished from an Edge Cluster
Provoke this scenario by deleting the bookinfo Gateway resource from cluster-edge1. The Edge Gateway on cluster-edge1 will not serve requests for the application:
% tctl delete -f bookinfo-edge-1.yaml
This scenario models an unusual configuration error where the application is unpublished from an Edge Gateway. Using the AWS-controller, the Tetrate solution retires the DNS record for the bookinfo Edge Gateway resource on cluster-edge1:
The Tetrate solution quickly updates Route 53; in our testing, we observed 30-90 seconds of downtime before the DNS is updated. Generally, modern web browsers will re-resolve DNS entries if the initially-resolved IP addresses do not respond, so the impact on many clients is minimized.
When the application is re-published on edgegw-1, the Tetrate solution notices and adds edgegw-1 back to the Route 53 DNS record.
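A simple way to observe the change is to query the record repeatedly while the failover happens (the hostname is illustrative):

% dig +short bookinfo.example.com

Depending on your Route 53 routing policy you may see one or both regions in the answers; once the record for cluster-edge1 is retired, its addresses disappear from the responses until the application is re-published.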
Failover Scenario Four: Edge Gateway Fails on Edge Cluster
Provoke this scenario by scaling down the Edge Gateway deployment to 0 replicas. Initially, clients will continue to send traffic to cluster-edge1.
% kubectl scale deployment edgegw-1 -n edge --replicas=0
This scenario models a total failure of one of the Edge Gateways. We would hope that this scenario is very rare, because the Edge Gateway clusters are simple, stable and persistent. The Tetrate solution cannot retire the DNS entry for that Edge cluster (because the cluster has failed), but the AWS Route 53 health check detects that the endpoint is not functioning and takes it out of the DNS RR replies:
The downtime from this event varies, depending on the speed of the AWS Route 53 health checks, frequency of DNS updates, propagation time and client caching. In testing, the public DNS records are updated within 90 seconds of the error being provoked.
When the Edge Gateway is restored, the health checks detect that it is operating correctly and the gateway’s IPs are added back to the Route 53 DNS records.
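You can also ask Route 53 directly what its checkers observe during the outage (the health-check ID below is a placeholder for whichever check your GSLB configuration created):

% aws route53 get-health-check-status --health-check-id 6230d36f-0000-0000-0000-000000000000

The per-checker report switches to failure before the record is withdrawn, which is a useful way to confirm the detection side of the timings above.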
East-West High Availability
East-West traffic refers to traffic flowing within your application environment, between its dependent services. East-West traffic may be contained to a single environment (e.g. cluster) or may flow from one internal cluster to another.
What happens if a component within the bookinfo application fails? This case is covered by East-West failover.
The Tetrate solution prepares the environment so that if a local service instance fails, the Istio sidecar proxies immediately send requests over a secure mTLS connection to a remote service instance. This service instance is accessed through a Tetrate East-West gateway running on the remote cluster.
In this failover configuration, Tetrate configures the routing to favor the local cluster whenever possible, and only fails over when all local service instances fail. You can also use East-West gateways to implement secure cross-cluster connectivity, consuming a remotely-located service as if it were local. This is all achieved through the magic of Tetrate’s cross-cluster Service Registry, mTLS everywhere, and identity propagation.
How Does Tetrate’s Identity Propagation Work?
Identity Propagation is necessary because Istio security rules are based on the source and destination for traffic, as defined by the SPIFFE identities of each party. Within a single cluster, the identities are well-defined.
During a failover event, traffic is forwarded securely and automatically to a remote service instance using intermediate proxies (such as the East-West gateway). From the destination’s perspective, the source of the traffic would be the identity of the last gateway in the chain, meaning that security rules and logging would be compromised.
The Tetrate solution uses a custom module (Envoy WASM extension) to inject a signed copy of the originating identity into the request, and to restore this identity when security rules are applied at the destination.
In the diagram, the ‘ratings’ service in cluster-work1 has just failed. The ratings service in cluster-work2 receives traffic from the local East-West gateway (which would otherwise appear as an unknown ‘External Service’) and deduces that the traffic is coming from the reviews-v2 client in cluster-work1.
The result is that security policies are correctly and intuitively applied to traffic across all clusters, no matter how many intermediate proxies are used to forward the request. The Tetrate solution automatically considers and resolves failover concerns, dramatically simplifying the task of creating accurate, concise security rules.
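To make this concrete, consider the kind of rule that benefits: a plain Istio AuthorizationPolicy on the ratings service that only admits requests from the reviews service account (an illustrative sketch; the service-account name follows the Bookinfo defaults). Because the originating identity is restored at the destination, a policy like this continues to match during cross-cluster failover instead of seeing only the East-West gateway’s identity:

apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: ratings-allow-reviews
  namespace: bookinfo
spec:
  selector:
    matchLabels:
      app: ratings               # applies to the ratings workloads
  action: ALLOW
  rules:
  - from:
    - source:
        # SPIFFE identity of the reviews workloads, regardless of which cluster they run in
        principals: ["cluster.local/ns/bookinfo/sa/bookinfo-reviews"]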
Failover Scenario One: Application Component Failure
Provoke this scenario by scaling the Details deployment to 0 replicas in cluster-work1:
% kubectl scale deployment details-v1 -n bookinfo --replicas=0
This scenario models an issue when a single service in the cluster fails, perhaps due to resource starvation, a failed deployment or an application error. The Tetrate solution immediately ensures that requests for the details service are routed to the remaining working instance on cluster-work2:
Detection and failover happens almost instantaneously, so any errors in the productpage application are very short-lived. Identity propagation ensures that any access control policies are correctly applied, so the application continues to function and there’s no need to manually configure ‘back doors’ to enable cross-cluster traffic.
When the failed service recovers, the Tetrate solution quickly switches back to the recovered local instance of the service.
Failover Scenario Two: Application Failure
We provoke this scenario by deleting the bookinfo application from cluster-work1, and test behavior by sending traffic to the Edge Gateway in that cloud region:
% kubectl delete -n bookinfo -f bookinfo-app.yaml
service "details" deleted
serviceaccount "bookinfo-details" deleted
deployment.apps "details-v1" deleted
service "ratings" deleted
serviceaccount "bookinfo-ratings" deleted
deployment.apps "ratings-v1" deleted
service "reviews" deleted
serviceaccount "bookinfo-reviews" deleted
deployment.apps "reviews-v1" deleted
deployment.apps "reviews-v2" deleted
deployment.apps "reviews-v3" deleted
service "productpage" deleted
serviceaccount "bookinfo-productpage" deleted
deployment.apps "productpage-v1" deleted
This scenario models an issue where the entire application fails. While the application is being deleted, it returns some application-level errors; as soon as the productpage entry-point is deleted, the Tetrate solution immediately ensures that the Ingress Gateway in cluster-work1 forwards requests securely to the application instance in cluster-work2:
When the failed application is restored, the Tetrate solution quickly switches back to the recovered local instance of the application.
Traffic Optimizations
There is one more optimization that the Tetrate solution can bring to bear, to eliminate the latency and transit costs of cross-region traffic.
You may have noticed that some failure scenarios could result in one workload cluster in one region being unavailable. Nevertheless, the edge gateway in that same region continues to function and responds to requests for the application. It sends requests to a functioning workload cluster in a remote region:
In the Edge Ingress resource, you can create a special health check request for the application. When a health check request is received, it bypasses Tetrate’s automated failover and is sent to the local workload cluster. If the local workload cluster fails, the health check request will also fail, even though regular application requests continue to succeed (they are failed over to the remote workload cluster).
Add this health check to your GSLB solution:
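With Route 53, that typically means creating a health check that calls the application’s health check path through each Edge Gateway and attaching it to that gateway’s DNS record. A sketch using the AWS CLI (the IP, hostname, path and thresholds are placeholders; the path must match the health check request you defined in the Edge Ingress resource):

% aws route53 create-health-check --caller-reference bookinfo-edge1-hc \
    --health-check-config IPAddress=203.0.113.10,Port=443,Type=HTTPS,ResourcePath=/bookinfo/healthcheck,FullyQualifiedDomainName=bookinfo.example.com,RequestInterval=10,FailureThreshold=2

The Id returned by this call is then set as the HealthCheckId on the Route 53 record for that Edge Gateway, so the record is withdrawn when the check fails.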
The effect of this health check is to retire the Edge Gateway in the cloud region where the workload cluster has failed. It is no longer served in DNS responses, and clients are only directed to edge gateways that are in the same region as the working instances of your application.
What Have We Achieved?
We have seen how the Tetrate solution can achieve a very high degree of availability for your applications, across clusters and clouds, in the face of a wide variety of possible scenarios. Infrastructure failures, failed deployments, internal errors – all such scenarios are addressed and managed by the Tetrate solution. For all scenarios other than a catastrophic edge gateway failure (where failover is governed by DNS), failover is almost immediate and the impacts are minimized.
In every case, high availability is configured as a property of the platform, not of the application. This means that application teams do not need to modify their applications or deployment pipelines in any way; high availability will be achieved without any actions on their part.
The Tetrate solution scales seamlessly to multiple cloud regions, and to multiple workload clusters in some or all of those regions. It also scales seamlessly to multiple applications, each with its own FQDN, performing health checks and failover individually for each application.
###
If you’re new to service mesh, Tetrate has a bunch of free online courses available at Tetrate Academy that will quickly get you up to speed with Istio and Envoy.
Are you using Kubernetes? Tetrate Enterprise Gateway for Envoy (TEG) is the easiest way to get started with Envoy Gateway for production use cases. Get the power of Envoy Proxy in an easy-to-consume package managed by the Kubernetes Gateway API. Learn more ›
Getting started with Istio? If you’re looking for the surest way to get to production with Istio, check out Tetrate Istio Subscription. Tetrate Istio Subscription has everything you need to run Istio and Envoy in highly regulated and mission-critical production environments. It includes Tetrate Istio Distro, a 100% upstream distribution of Istio and Envoy that is FIPS-verified and FedRAMP ready. For teams requiring open source Istio and Envoy without proprietary vendor dependencies, Tetrate offers the ONLY 100% upstream Istio enterprise support offering.
Need global visibility for Istio? TIS+ is a hosted Day 2 operations solution for Istio designed to simplify and enhance the workflows of platform and support teams. Key features include: a global service dashboard, multi-cluster visibility, service topology visualization, and workspace-based access control.