Modern multi-service, multi-cloud applications can be fragile, particularly when running in highly-automated infrastructure with many moving parts. To make them robust in the face of production traffic, you need to eliminate single points of failure, and replicate services across clusters and cloud regions.
Tetrate Service Bridge makes your applications highly available by dynamically controlling traffic into and within your infrastructure. Let’s see how this is done, considering North-South and East-West traffic, as well as optimizations for DNS GSLB integrations.
Setting the Scene
The examples presented here use AWS EKS clusters to host services, and AWS Route 53 to manage DNS. They can be adapted to any Kubernetes platform (on-prem, cloud-hosted, even hybrid) and to any suitable DNS service.
As an example for these use cases, we will use the Istio Bookinfo application. Bookinfo uses multiple dependent services (productpage, details, reviews, ratings) to create a simple application:
The application is deployed into two AWS EKS workload clusters (cluster-work1 and cluster-work2), each in a different AWS region (us-east-1 and us-west-1). Each region also contains an edge cluster (cluster-edge1 and cluster-edge2) with an Edge Gateway:
The Edge Gateway IPs are published using DNS records in Route 53 (other GSLB solutions will work similarly):
In this blog, we’ll explore how to make the application highly available using Tetrate Service Bridge and the underlying Istio data plane. We’ll consider several possible failure scenarios:
Failure Recovery Scenarios and Traffic Optimizations
North-South Failures (traffic flow from client to workload cluster)
- Stop publishing the application from the workload cluster
  - Example: Delete the application’s Ingress resource on cluster-work1
  - Response: Switch traffic to cluster-work2
- A connectivity failure to the workload cluster
  - Example: Scale the Ingress Gateway to 0 replicas on cluster-work1
  - Response: Switch traffic to cluster-work2
- Stop publishing the application from the Edge Gateway
  - Example: Delete the application’s Edge Ingress resource on cluster-edge1
  - Response: Tetrate removes the DNS entry for the Edge Gateway on cluster-edge1
- A complete failure of the edge cluster
  - Example: Scale the Edge Gateway to 0 replicas on cluster-edge1
  - Response: DNS health checks detect the failure and update DNS to retire cluster-edge1

East-West Failures (traffic flows within the workload clusters)
- An internal failure within the application
  - Example: Scale an application component to 0 replicas on cluster-work1
  - Response: Immediately route internal traffic to the component on cluster-work2
- The application is deleted from a workload cluster
  - Example: Delete all application components from cluster-work1
  - Response: Immediately route internal traffic to the application on cluster-work2

Traffic Optimizations (optimize traffic flow when a failure occurs)
- A workload cluster fails or cannot respond to application requests
  - Example: Delete the Ingress resource or scale the Ingress Gateway to 0 replicas on cluster-work1
  - Response: Switch traffic to cluster-work2, and retire the DNS entry for the local Edge Gateway (cluster-edge1) to avoid cross-region latency and transit costs
North-South High Availability
North-South traffic refers to traffic entering your application environment, generally from remote, over-the-internet clients. North-South traffic typically passes through a series of gateways and is then routed to the entry point for an application.
We host the application in two EKS workload clusters, cluster-work1 and cluster-work2, running in two AWS regions. In each region, we also deploy an Edge Gateway in dedicated clusters cluster-edge1 and cluster-edge2.
The purpose of the Edge Gateway is to receive and terminate traffic from external, internet-based clients. The Edge Gateway then load-balances the requests across the working workload clusters. You can use a DNS GSLB solution to distribute traffic across the edge clusters; in this example, we have used Tetrate’s AWS-controller to drive Route 53 DNS, but other solutions may also be used.
In general, all gateways will use locality-prioritized load balancing to favor targets located in the same cloud region, only failing over if the local targets have failed.
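TSB drives this locality preference for you through the underlying Istio data plane. As a rough, hand-written illustration of the mechanism, a DestinationRule along the following lines keeps traffic in-region and only fails over when the local endpoints are ejected as unhealthy (the host, region names and thresholds here are assumptions for the Bookinfo example, not the exact configuration TSB generates):

# Illustrative only: locality-aware failover for the productpage service.
# TSB manages the equivalent settings automatically; values are assumptions.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: productpage-locality
  namespace: bookinfo
spec:
  host: productpage.bookinfo.svc.cluster.local
  trafficPolicy:
    loadBalancer:
      localityLbSetting:
        enabled: true
        failover:
        - from: us-east-1    # prefer local endpoints in us-east-1...
          to: us-west-1      # ...and fail over to us-west-1 only when they are unhealthy
        - from: us-west-1
          to: us-east-1
    outlierDetection:        # required for locality failover to take effect
      consecutive5xxErrors: 3
      interval: 10s
      baseEjectionTime: 60s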
Why use Edge Gateways rather than exposing the workload clusters directly? At first glance, the architecture does look wasteful, and the edge clusters appear redundant. The value of the edge clusters becomes clear when you consider that:
- As you scale to multiple applications and multiple clusters in each region, the complexity of exposing each workload cluster and tracking which applications are present on each becomes overwhelming. A small number of Edge clusters eliminates this complexity.
- Workload clusters can be fluid, with frequently-changing configuration and unpredictable scaling events. Errors and availability problems are therefore more likely on Workload clusters than on Edge clusters.
A frontend tier of edge gateways provides a stable entry point for all applications and clusters, and acts as a buffer to inspect and filter traffic, so that only traffic for known, published applications is forwarded to the workload clusters. Network Reachability and firewall rules ensure that the Workload clusters can only be reached from the downstream Edge clusters.
Failover Scenario One: Application Unpublished from a Work Cluster
Provoke this scenario by deleting the bookinfo Gateway resource from cluster-work1:
% tctl delete -f bookinfo-ingress-1.yaml
This scenario models an issue where a deployment fails and the Gateway resource is not published. The Tetrate solution quickly learns the new application topology and reconfigures the Edge Gateways to forward traffic to cluster-work2:
Requests for the application to cluster-work1 may fail for approximately 1-2 seconds while the configuration change is propagated. Once propagated, the Edge Gateways do not send requests to cluster-work1. When the Gateway resource is re-published in cluster-work1, edgegw-1 quickly detects it and reverts to the new, local instance.
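For context, bookinfo-ingress-1.yaml is a TSB resource managed with tctl, so its exact schema is not shown here. In plain Istio terms, "publishing" the application from a workload cluster amounts to something like the Gateway and VirtualService sketch below; the hostname, selector label and certificate name are assumptions:

# Rough Istio-level equivalent of publishing Bookinfo from cluster-work1.
# Names below (hostname, labels, secret) are assumptions for illustration.
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: bookinfo-gateway
  namespace: bookinfo
spec:
  selector:
    app: ingressgw-1               # assumed label on the cluster-work1 Ingress Gateway pods
  servers:
  - port:
      number: 443
      name: https
      protocol: HTTPS
    tls:
      mode: SIMPLE
      credentialName: bookinfo-cert   # assumed TLS secret
    hosts:
    - bookinfo.example.com            # assumed application FQDN
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: bookinfo
  namespace: bookinfo
spec:
  hosts:
  - bookinfo.example.com
  gateways:
  - bookinfo-gateway
  http:
  - route:
    - destination:
        host: productpage.bookinfo.svc.cluster.local
        port:
          number: 9080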
Failover Scenario Two: Complete Ingress Failure at a Work Cluster
Provoke this scenario by scaling the Ingress Gateway deployment down to 0 replicas. The AWS load balancer for cluster-work1 continues to accept traffic, but has nowhere to forward it to:
% kubectl scale deployment ingressgw-1 -n bookinfo --replicas=0
This scenario models a total failure of cluster-work1. The Edge Gateway edgegw-1 identifies that the target is not responding, and fails over to the remote cluster:
Connection attempts to cluster-work1 will fail for approximately 2-4 seconds while the failure of cluster-work1 is established. Infrastructure errors like this are handled with caution (to avoid flip-flopping when isolated errors occur), and so take slightly longer to detect and to recover from.
Failover Scenario Three: Application Unpublished from an Edge Cluster
Provoke this scenario by deleting the bookinfo Gateway resource from cluster-edge1. The Edge Gateway on cluster-edge1 will not serve requests for the application:
% tctl delete -f bookinfo-edge-1.yaml
This scenario models an unusual configuration error where the application is unpublished from an Edge Gateway. Using the AWS-controller, the Tetrate solution retires the DNS record for the bookinfo Edge Gateway resource on cluster-edge1:
The Tetrate solution quickly updates Route 53; in our testing, we observed 30 to 90 seconds of downtime before the DNS change fully propagated. Generally, modern web browsers will attempt to re-resolve DNS entries if their first requests hit non-responsive IP addresses, so the impact on many clients is minimized.
When the application is re-published on edgegw-1, the Tetrate solution notices and adds edgegw-1 back to the Route 53 DNS record.
Failover Scenario Four: Edge Gateway Fails on Edge Cluster
Provoke this scenario by scaling down the Edge Gateway deployment to 0 replicas. Initially, clients will continue to send traffic to cluster-edge1.
% kubectl scale deployment edgegw-1 -n edge --replicas=0
This scenario models a total failure of one of the Edge Gateways. We would hope that this scenario is very rare, because the Edge Gateway clusters are simple, stable and persistent. The Tetrate solution cannot retire the DNS entry for that Edge cluster (because the cluster has failed), but the AWS Route 53 health check detects that the endpoint is not functioning and takes it out of the DNS RR replies:
The downtime from this event varies, depending on the speed of the AWS Route 53 health checks, frequency of DNS updates, propagation time and client caching. In testing, the public DNS records are updated within 90 seconds of the error being provoked.
When the Edge Gateway is restored, the health checks detect that it is operating correctly and the gateway’s IPs are added back to the Route 53 DNS records.
East-West High Availability
East-West traffic refers to traffic flowing within your application environment, between its dependent services. East-West traffic may be contained to a single environment (e.g. cluster) or may flow from one internal cluster to another.
What happens if a component within the bookinfo application fails? This case is covered by East-West failover.
The Tetrate solution prepares the environment so that if a local service instance were to fail, the Istio sidecar proxies immediately send requests over a secure, mTLS connection to a remote service instance. This service instance is accessed through a Tetrate East-West gateway running on the remote cluster.
In this failover configuration, Tetrate configures the routing to favor the local cluster whenever possible, and only fails over when all local service instances fail. You can also use East-West gateways to implement secure cross-cluster connectivity, consuming a remotely-located service as if it were local. This is all achieved through the magic of Tetrate’s cross-cluster Service Registry, mTLS everywhere, and identity propagation.
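TSB provisions and manages the East-West gateways for you. In plain Istio multi-cluster terms, the cross-cluster path relies on a gateway that passes mTLS traffic straight through to the destination sidecar, along the lines of this sketch (the selector label and port are the conventional Istio defaults, assumed here):

# Illustrative Istio-style east-west gateway: mTLS traffic from the remote
# cluster is passed through (not terminated) to the destination workload.
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: cross-network-gateway
  namespace: istio-system
spec:
  selector:
    istio: eastwestgateway       # assumed label on the East-West gateway pods
  servers:
  - port:
      number: 15443              # conventional multi-cluster port, assumed here
      name: tls
      protocol: TLS
    tls:
      mode: AUTO_PASSTHROUGH     # SNI-based passthrough; mTLS stays end-to-end
    hosts:
    - "*.local"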
How Does Tetrate’s Identity Propagation Work?
Identity Propagation is necessary because Istio security rules are based on the source and destination for traffic, as defined by the SPIFFE identities of each party. Within a single cluster, the identities are well-defined.
During a failover event, traffic is forwarded securely and automatically to a remote service instance through intermediate proxies (such as the East-West gateway). Without identity propagation, the destination would see the traffic as coming from the identity of the last gateway in the chain, compromising security rules and logging.
The Tetrate solution uses a custom module (Envoy WASM extension) to inject a signed copy of the originating identity into the request, and to restore this identity when security rules are applied at the destination.
In the diagram, the ‘ratings’ service in cluster-work1 has just failed. The ratings service in cluster-work2 receives traffic from the local East-West gateway (from an unknown, ‘External Service’) and deduces that the traffic is coming from the reviews-v2 client in cluster-work1.
The result is that security policies are correctly and intuitively applied to traffic across all clusters, no matter how many intermediate proxies are used to forward the request. The Tetrate solution automatically considers and resolves failover concerns, dramatically simplifying the task of creating accurate, concise security rules.
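To make this concrete, consider a hedged example: an authorization policy on the ratings service that only admits the reviews workloads. Without identity propagation, failover traffic arriving through the East-West gateway would carry the gateway’s identity and be denied; with identity propagation, a policy like this continues to work across clusters. The cluster.local trust domain is the Istio default, assumed here; the bookinfo-reviews service account is the standard Bookinfo one:

# Example policy only: admit the reviews workloads to the ratings service.
# Identity propagation lets the originating SPIFFE identity, not the
# East-West gateway's identity, be evaluated against these principals.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: ratings-allow-reviews
  namespace: bookinfo
spec:
  selector:
    matchLabels:
      app: ratings
  action: ALLOW
  rules:
  - from:
    - source:
        principals:
        - cluster.local/ns/bookinfo/sa/bookinfo-reviews   # trust domain assumed to be cluster.local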
Failover Scenario One: Application Component Failure
Provoke this scenario by scaling the Details deployment to 0 replicas in cluster-work1:
% kubectl scale deployment details-v1 -n bookinfo --replicas=0
This scenario models an issue when a single service in the cluster fails, perhaps due to resource starvation, a failed deployment or an application error. The Tetrate solution immediately ensures that requests for the details service are routed to the remaining working instance on cluster-work2:
Detection and failover happens almost instantaneously, so any errors in the productpage application are very short-lived. Identity propagation ensures that any access control policies are correctly applied, so the application continues to function and there’s no need to manually configure ‘back doors’ to enable cross-cluster traffic.
When the failed service recovers, the Tetrate solution quickly switches back to the recovered local instance of the service.
Failover Scenario Two: Application Failure
We provoke this scenario by deleting the bookinfo application from cluster-work1, and test behavior by sending traffic to the Edge Gateway in that cloud region:
% kubectl delete -n bookinfo -f bookinfo-app.yaml
service "details" deleted
serviceaccount "bookinfo-details" deleted
deployment.apps "details-v1" deleted
service "ratings" deleted
serviceaccount "bookinfo-ratings" deleted
deployment.apps "ratings-v1" deleted
service "reviews" deleted
serviceaccount "bookinfo-reviews" deleted
deployment.apps "reviews-v1" deleted
deployment.apps "reviews-v2" deleted
deployment.apps "reviews-v3" deleted
service "productpage" deleted
serviceaccount "bookinfo-productpage" deleted
deployment.apps "productpage-v1" deleted
This scenario models an issue where the entire application fails. While the application is being deleted, it returns some application-level errors; as soon as the productpage entry-point is deleted, the Tetrate solution immediately ensures that the Ingress Gateway in cluster-work1 forwards requests securely to the application instance in cluster-work2:
When the failed application is restored, the Tetrate solution quickly switches back to the recovered local instance of the application.
Traffic Optimizations
There is one more optimization that the Tetrate solution can bring to bear, to eliminate the latency and transit costs of cross-region traffic.
You may have noticed that some failure scenarios could result in one workload cluster in one region being unavailable. Nevertheless, the edge gateway in that same region continues to function and responds to requests for the application. It sends requests to a functioning workload cluster in a remote region:
In the Edge Ingress resource, you can create a special health check request for the application. When a health check request is received, it bypasses Tetrate’s automated failover and is sent to the local workload cluster. If the local workload cluster fails, the health check request will also fail even though regular application requests continue to function.
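The exact syntax lives in the Edge Ingress resource, so the sketch below is purely an illustration of the idea in plain Istio terms: a dedicated health-check path that is pinned to the local workload cluster, with no failover target, so it begins failing as soon as that cluster stops serving the application. The hostname, path and destination host are assumptions:

# Illustration only: a GSLB health-check route pinned to the local cluster.
# The real configuration is expressed in the TSB Edge Ingress resource.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: bookinfo-gslb-healthcheck
  namespace: edge
spec:
  hosts:
  - bookinfo.example.com           # assumed application FQDN
  gateways:
  - bookinfo-edge-gateway          # assumed Edge Gateway resource name
  http:
  - match:
    - uri:
        exact: /gslb-health        # assumed path probed by the Route 53 health check
    route:
    - destination:
        # Hypothetical host (e.g. a ServiceEntry) that points only at the
        # same-region workload cluster's Ingress Gateway; no failover target.
        host: ingressgw.cluster-work1.internal
        port:
          number: 443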
Add this health check to your GSLB solution:
The effect of this health check is to retire the Edge Gateway in the cloud region where the workload cluster has failed. It is no longer served in DNS responses, and clients are only directed to edge gateways that are in the same region as the working instances of your application.
What Have We Achieved?
We have seen how the Tetrate solution can achieve a very high degree of availability for your applications, across clusters and clouds, in the face of a wide variety of possible scenarios. Infrastructure failures, failed deployments, internal errors – all such scenarios are addressed and managed by the Tetrate solution. For all scenarios other than a catastrophic edge gateway failure (where failover is governed by DNS), failover is almost immediate and the impacts are minimized.
In every case, high availability is configured as a property of the platform, not of the application. This means that application teams do not need to modify their applications or deployment pipelines in any way; high availability will be achieved without any actions on their part.
The Tetrate solution scales seamlessly to multiple cloud regions, and to multiple workload clusters in some or all of those regions. It also scales seamlessly to multiple different applications, each with its own FQDN, performing health checks and failover individually for each application.
###
If you’re new to service mesh, Tetrate has a bunch of free online courses available at Tetrate Academy that will quickly get you up to speed with Istio and Envoy.
Are you using Kubernetes? Tetrate Enterprise Gateway for Envoy (TEG) is the easiest way to get started with Envoy Gateway for production use cases. Get the power of Envoy Proxy in an easy-to-consume package managed by the Kubernetes Gateway API. Learn more ›
Getting started with Istio? If you’re looking for the surest way to get to production with Istio, check out Tetrate Istio Subscription. Tetrate Istio Subscription has everything you need to run Istio and Envoy in highly regulated and mission-critical production environments. It includes Tetrate Istio Distro, a 100% upstream distribution of Istio and Envoy that is FIPS-verified and FedRAMP ready. For teams requiring open source Istio and Envoy without proprietary vendor dependencies, Tetrate offers the ONLY 100% upstream Istio enterprise support offering.
Get a Demo