
Critical Production Issue: Resolving CrashLoopBackOff in Istio on EKS
The Problem
A U.S. federal customer encountered a critical issue while updating Amazon EKS nodes’ AMIs across more than 30 clusters running Istio. In one of these clusters, sidecar-injected workloads failed to start, showing a CrashLoopBackOff error. This Severity 1 incident posed a direct threat to business resiliency for a platform serving millions of users worldwide.
One key component of the customer’s production infrastructure is its ability to mirror traffic across sibling clusters. During maintenance, affected clusters are gracefully isolated from traffic. However, given strict security compliance and limited available production infrastructure, quickly deploying fallback clusters is not always feasible.
With one sibling cluster inoperable, the system’s resiliency was weakened even though that cluster was carrying no active traffic. This raised urgent concerns: Would the same issue occur in other clusters? Would errors surface once traffic resumed? A swift and decisive resolution was needed.
The Investigation
The issue first presented as Istiod taking an unusually long time to initialize, generating the following event:
istio-system 28m Warning Unhealthy pod/istiod-74b69db89d-226lp Readiness probe failed: HTTP probe failed with status code: 503
Additionally, the sidecar injection webhook had no available endpoints, so new pods could not be created:
Error creating: Internal error occurred: failed calling webhook "sidecar-injector.istio.io": failed to call webhook: Post "https://istiod.istio-system.svc:443/inject?timeout=10s": no endpoints available for service "istiod"
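Both symptoms can be confirmed with routine kubectl checks such as the following (an illustrative sketch, not the exact commands run during the incident; the istio-system namespace and app=istiod label match a default Istio install and may differ in other environments):

# Is Istiod passing its readiness probe?
kubectl -n istio-system get pods -l app=istiod

# Does the istiod Service have any ready endpoints backing the injection webhook?
kubectl -n istio-system get endpoints istiod

# Recent warning events across the cluster
kubectl get events -A --field-selector type=Warning --sort-by=.lastTimestamp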
While Istiod’s startup delays were temporary, they signaled a deeper issue. Once it was up and sidecars were injected, the pods still failed to reach a ready state. The sidecar process was aborting before becoming operational.
Tetrate Support Steps In
With the customer’s team focused on upgrading other clusters, they had limited internal bandwidth to diagnose the issue. A Severity 1 ticket was raised with Tetrate Support. Within six minutes, a Tetrate engineer was reviewing logs, and within 20 minutes, two additional experts had joined a live debugging session.
The following logs from the sidecars revealed key details:
info  Starting proxy agent
info  starting
info  Status server has successfully terminated
info  Agent draining Proxy
warn  ca  ca request failed, starting attempt 1 in 99.993416ms
warn  Aborting proxy
However, all Certificate Signing Request (CSR) events appeared successful:
default 21m Normal Signed certificatesigningrequest/csr-zzwwvn2xw7lcq2fb95 The CSR has been signed
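Both data points come from standard commands, for example (a sketch; the namespace and pod name are placeholders for a failing workload):

# Sidecar (istio-proxy) logs from a failing, sidecar-injected pod
kubectl -n <app-namespace> logs <failing-pod> -c istio-proxy

# CertificateSigningRequests and their current conditions
kubectl get csr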
In this setup, Istio offloaded the CA role to an external service (e.g., Cert-Manager, AWS ACM, or HashiCorp Vault). Each istio-agent in a sidecar requested a certificate from Istiod, which forwarded the signing request to the external CA; the CA then signed and returned the certificate used for secure TLS communication within the mesh.
Although the Kubernetes CSR objects showed as signed, the sidecars’ CA requests were still failing, so the team suspected an issue with the custom CA. Further investigation confirmed that the CA was overwhelmed, struggling to handle the sudden influx of certificate requests as hundreds of new pods spun up simultaneously. The default Istiod CA request timeout of 10 seconds exacerbated the issue, leading to a “thundering herd” effect.
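A rough way to confirm such a backlog is to tally CSRs by condition (a sketch; adjust the parsing to how conditions render in your kubectl version):

# Count CSRs by their CONDITION column to gauge the load hitting the external CA
kubectl get csr --no-headers | awk '{print $NF}' | sort | uniq -c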
The Solution
To mitigate the load and stabilize the cluster, Tetrate engineers implemented a three-step approach (a command-level sketch follows the list):
- Scaled Down Istiod: The number of Istiod replicas was temporarily reduced from 40 to 1 to slow down internal Istio processes, easing the CA’s workload.
- Cleared Failed CSRs: Many CSRs were stuck in the Approved, Failed state. These were identified and deleted.
- Controlled Pod Restarts: Instead of allowing all pods to restart simultaneously, they were gradually restarted in a controlled manner to prevent another surge in CA requests.
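The steps above translate roughly to commands like the following (an illustrative sketch, not the exact runbook; the deployment name istiod matches a default install, while the application namespace, CSR condition string, and rollout timeout are assumptions to verify against your cluster):

# 1. Temporarily scale Istiod down to reduce concurrent load on the external CA
kubectl -n istio-system scale deployment/istiod --replicas=1

# 2. Delete CSRs stuck in the Approved,Failed state (check how the condition renders in your output first)
kubectl get csr --no-headers | awk '$NF=="Approved,Failed" {print $1}' | xargs -r kubectl delete csr

# 3. Restart workloads gradually, waiting for each rollout to complete before starting the next
for d in $(kubectl -n <app-namespace> get deploy -o name); do
  kubectl -n <app-namespace> rollout restart "$d"
  kubectl -n <app-namespace> rollout status "$d" --timeout=10m
done

Once the CA had caught up with the backlog, Istiod could be scaled back to its normal replica count.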
Results
The issue was resolved in less than an hour, restoring the production cluster to full resiliency. The root cause was not a direct Istio malfunction but rather a dependency failure—an overloaded external CA.
By quickly identifying and addressing the issue, Tetrate Support helped the customer avoid a major disruption to their production environment, ensuring continued reliability for millions of users. This case underscores the importance of understanding the entire service mesh ecosystem, not just Istio itself, when diagnosing complex infrastructure failures.