Tetrate solves a critical production issue on a cluster serving millions of users

Critical Production Issue: Resolving CrashLoopBackOff in Istio on EKS

The Problem

A U.S. federal customer encountered a critical issue while updating the AMIs of Amazon EKS nodes across more than 30 clusters running Istio. In one of these clusters, sidecar-injected workloads failed to start and went into CrashLoopBackOff. This Severity 1 incident posed a direct threat to the resiliency of a platform serving millions of users worldwide.

One key component of the customer’s production infrastructure is its ability to mirror traffic across sibling clusters. During maintenance, affected clusters are gracefully isolated from traffic. However, given strict security compliance and limited available production infrastructure, quickly deploying fallback clusters is not always feasible.

With one sibling cluster inoperable, even without active traffic, the system’s resiliency was weakened. This raised urgent concerns: Would the same issue occur in other clusters? Would errors surface once traffic resumed? A swift and decisive resolution was needed.

The Investigation

The issue first presented as Istiod taking an unusually long time to initialize, generating the following event:

istio-system    28m    Warning   Unhealthy   pod/istiod-74b69db89d-226lp    Readiness probe failed: HTTP probe failed with status code: 503

Additionally, no injector webhooks were available, causing pods to fail during initialization:

Error creating: Internal error occurred: failed calling webhook "sidecar-injector.istio.io": failed to call webhook: Post "https://istiod.istio-system.svc:443/inject?timeout=10s": no endpoints available for service "istiod"

While Istiod’s startup delays were temporary, they signaled a deeper issue. Once Istiod was up and sidecars were being injected, the pods still failed to reach a ready state: the sidecar process was aborting before it became operational.

Tetrate Support Steps In

The customer’s team was focused on upgrading other clusters and had limited internal bandwidth to diagnose the issue, so a Severity 1 ticket was raised with Tetrate Support. Within six minutes, a Tetrate engineer was reviewing logs, and within 20 minutes, two additional experts had joined a live debugging session.

The following logs from the sidecars revealed key details:

info    Starting proxy agent
info    starting
info    Status server has successfully terminated
info    Agent draining Proxy
warn    ca    ca request failed, starting attempt 1 in 99.993416ms
warn    Aborting proxy

However, all Certificate Signing Request (CSR) events appeared successful:

default       21m      Normal    Signed    certificatesigningrequest/csr-zzwwvn2xw7lcq2fb95   The CSR has been signed

In this setup, Istio offloaded the CA role to an external service (e.g., Cert-Manager, AWS ACM, or HashiCorp Vault). Each istio-agent in a sidecar sent a certificate signing request to Istiod, which forwarded it to the external CA. The CA then signed the certificate and returned it for secure TLS communication.

Because the CSR events reported success while sidecars were still failing, the team suspected an issue with the custom CA. Further investigation confirmed that the CA was overwhelmed, struggling to handle the sudden influx of certificate requests as hundreds of new pods spun up simultaneously. The default Istiod CA request timeout of 10 seconds exacerbated the issue, leading to a “thundering herd” effect.
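To see why a burst of simultaneous requests interacts badly with a fixed timeout, consider a deliberately simplified model: a CA that signs certificates at a constant rate and serves requests in arrival order. The signing rate and pod counts below are hypothetical illustrations, not measurements from this incident; only the 10-second timeout comes from the description above.

```python
import math


def timed_out_requests(pending: int, signs_per_second: float,
                       timeout_s: float = 10.0) -> int:
    """Count requests that would exceed the timeout in a simple FIFO model.

    Assumptions (illustrative only): all `pending` requests arrive at once,
    and the CA signs them one after another at a constant rate, so the k-th
    request completes after k / signs_per_second seconds.
    """
    served_in_time = math.floor(signs_per_second * timeout_s)
    return max(0, pending - served_in_time)


# Hypothetical burst: 600 pods start together, the CA signs 20 certs/second.
burst = timed_out_requests(pending=600, signs_per_second=20)

# Hypothetical controlled restart: only 100 pods request certificates at once.
controlled = timed_out_requests(pending=100, signs_per_second=20)

print(burst, controlled)  # prints: 400 0
```

In this model, every request that times out is likely to be retried, so the backlog can regenerate instead of draining. That is the general shape of a thundering herd, and it suggests why reducing the number of concurrent requesters helps.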

The Solution

To mitigate the load and stabilize the cluster, Tetrate engineers implemented a three-step approach:

  1. Scaled Down Istiod: The number of Istiod replicas was temporarily reduced from 40 to 1 to slow down internal Istio processes, easing the CA’s workload.
  2. Cleared Failed CSRs: Many CSRs were stuck in the Approved, Failed state. These were identified and deleted.
  3. Controlled Pod Restarts: Instead of letting all pods restart simultaneously, the team restarted them gradually, in a controlled manner, to prevent another surge in CA requests.
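Steps 2 and 3 can be sketched in a few lines of Python. This is a minimal illustration, not the tooling Tetrate engineers actually used: it assumes the CSR list has been exported as JSON (for example with kubectl get csr -o json), and the function names, workload names, and batch size are all hypothetical.

```python
import json


def stuck_csr_names(csr_list: dict) -> list[str]:
    """Return names of CSRs whose conditions include both Approved and Failed."""
    names = []
    for item in csr_list.get("items", []):
        conditions = item.get("status", {}).get("conditions", [])
        types = {c.get("type") for c in conditions}
        if {"Approved", "Failed"} <= types:
            names.append(item["metadata"]["name"])
    return names


def restart_batches(workloads: list[str], batch_size: int) -> list[list[str]]:
    """Split workloads into fixed-size batches to be restarted one at a time."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [workloads[i:i + batch_size]
            for i in range(0, len(workloads), batch_size)]


# Hypothetical input shaped like the JSON list of CertificateSigningRequests.
example = json.loads("""
{"items": [
  {"metadata": {"name": "csr-a"},
   "status": {"conditions": [{"type": "Approved"}, {"type": "Failed"}]}},
  {"metadata": {"name": "csr-b"},
   "status": {"conditions": [{"type": "Approved"}]}}
]}
""")

print(stuck_csr_names(example))  # prints: ['csr-a']

# Hypothetical workload names; each batch would be restarted and allowed to
# become ready before the next one starts.
print(restart_batches(["orders", "billing", "search", "auth", "cart"], 2))
```

The names returned by the first function are candidates for deletion. Each batch from the second could be restarted with kubectl rollout restart and confirmed with kubectl rollout status before moving on, which keeps the number of concurrent certificate requests bounded.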

Results

The issue was resolved in less than an hour, restoring the production cluster to full resiliency. The root cause was not a direct Istio malfunction but a dependency failure: an overloaded external CA.

By quickly identifying and addressing the issue, Tetrate Support helped the customer avoid a major disruption to their production environment, ensuring continued reliability for millions of users. This case underscores the importance of understanding the entire service mesh ecosystem, not just Istio itself, when diagnosing complex infrastructure failures.
