Tetrate solves a critical production issue on a cluster serving millions of users

A U.S. federal customer encountered a critical issue while updating Amazon EKS node AMIs across more than 30 clusters running Istio.

Critical Production Issue: Resolving CrashLoopBackOff in Istio on EKS

The Problem

A U.S. federal customer encountered a critical issue while updating the AMIs of its Amazon EKS nodes across more than 30 clusters running Istio. In one of these clusters, sidecar-injected workloads failed to start and went into CrashLoopBackOff. This Severity 1 incident posed a direct threat to business resiliency for a platform serving millions of users worldwide.

A key capability of the customer’s production infrastructure is mirroring traffic across sibling clusters. During maintenance, affected clusters are gracefully isolated from traffic. However, given strict security compliance requirements and limited production infrastructure, quickly deploying fallback clusters is not always feasible.

With one sibling cluster inoperable, even without active traffic, the system’s resiliency was weakened. This raised urgent concerns: Would the same issue occur in other clusters? Would errors surface once traffic resumed? A swift and decisive resolution was needed.

The Investigation

The issue first presented as Istiod taking an unusually long time to initialize, generating the following event:

istio-system    28m    Warning   Unhealthy   pod/istiod-74b69db89d-226lp    Readiness probe failed: HTTP probe failed with status code: 503

Additionally, no injector webhooks were available, causing pods to fail during initialization:

Error creating: Internal error occurred: failed calling webhook "sidecar-injector.istio.io": failed to call webhook: Post "https://istiod.istio-system.svc:443/inject?timeout=10s": no endpoints available for service "istiod"

While Istiod’s startup delays were temporary, they signaled a deeper issue. Once Istiod was up and sidecars were being injected, the pods still failed to reach a ready state. The sidecar process was aborting before becoming operational.

Tetrate Support Steps In

With the customer’s team focused on upgrading other clusters, they had limited internal bandwidth to diagnose the issue. A Severity 1 ticket was raised with Tetrate Support. Within six minutes, a Tetrate engineer was reviewing logs, and within 20 minutes, two additional experts had joined a live debugging session.

The following logs from the sidecars revealed key details:

info    Starting proxy agent
info    starting
info    Status server has successfully terminated
info    Agent draining Proxy
warn    ca    ca request failed, starting attempt 1 in 99.993416ms
warn    Aborting proxy

However, all Certificate Signing Request (CSR) events appeared successful:

default       21m      Normal    Signed    certificatesigningrequest/csr-zzwwvn2xw7lcq2fb95   The CSR has been signed

In this setup, Istio offloaded the CA role to an external service (e.g., Cert-Manager, AWS ACM, or HashiCorp Vault). Each istio-agent in the sidecar sent a certificate signing request to Istiod, which forwarded it to the external CA. The CA then signed the certificate and returned it for use in secure TLS communication.

Because the CSRs were being signed but the sidecars still reported failed CA requests, the team suspected a problem with the external CA. Further investigation confirmed that the CA was overwhelmed by the sudden influx of certificate requests as hundreds of new pods spun up simultaneously. The default Istiod CA request timeout of 10 seconds exacerbated the issue: requests that were not answered in time failed and were retried, adding to the load and producing a “thundering herd” effect.
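To make the “thundering herd” dynamic concrete, here is a toy model in Python. It is an illustration under simplifying assumptions, not a reconstruction of the customer’s CA: time advances in rounds of one timeout window, the CA signs a fixed number of requests per round in FIFO order, and every pod without a certificate abandons its old request and sends a new one each round. The function name, the capacity figure, and the pod counts are all hypothetical.

```python
from collections import deque


def simulate(arrivals, capacity, rounds):
    """Return how many pods obtain a certificate within `rounds` rounds.

    arrivals: number of new pods starting in each round (hypothetical).
    capacity: requests the CA can sign per round (hypothetical).
    A signed request only helps if the pod is still waiting on it, i.e.
    it was sent in the current round; older requests have timed out.
    """
    queue = deque()  # each entry is the round in which the request was sent
    waiting = 0      # pods that still have no certificate
    ready = 0        # pods that received a certificate in time

    for r in range(rounds):
        if r < len(arrivals):
            waiting += arrivals[r]
        # Every waiting pod sends (or re-sends) a request this round.
        queue.extend([r] * waiting)
        # The CA works through the queue in order, stale requests included.
        for _ in range(min(capacity, len(queue))):
            sent_in = queue.popleft()
            if sent_in == r and waiting > 0:
                waiting -= 1
                ready += 1
            # Otherwise the CA signed a request nobody is waiting for.
    return ready


# 300 pods at once against a CA that signs 100 requests per round:
herd = simulate([300], capacity=100, rounds=6)
# The same 300 pods released in three groups of 100:
staggered = simulate([100, 100, 100], capacity=100, rounds=6)
print(herd, staggered)
```

In this model the simultaneous start leaves the CA permanently busy signing requests that have already timed out, which is consistent with CSRs being reported as signed while sidecars keep failing. Releasing the same pods in groups that fit the CA’s capacity lets every request complete.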

The Solution

To mitigate the load and stabilize the cluster, Tetrate engineers implemented a three-step approach:

  1. Scaled Down Istiod: The number of Istiod replicas was temporarily reduced from 40 to 1 to slow down internal Istio processes, easing the CA’s workload.
  2. Cleared Failed CSRs: Many CSRs were stuck in the Approved, Failed state. These were identified and deleted.
  3. Controlled Pod Restarts: Instead of letting all pods restart simultaneously, the team restarted them gradually, in a controlled manner, to prevent another surge in CA requests.
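Steps 2 and 3 above can be sketched in Python as follows. This is a minimal sketch, not the tooling Tetrate used: it assumes the CSR list is the parsed JSON output of `kubectl get csr -o json`, and the helper names, batch size, and pause length are hypothetical. The `restart` callback is supplied by the caller, so the sketch itself makes no changes to a cluster.

```python
import time
from typing import Callable, Iterable, List


def stuck_csr_names(csr_list: dict) -> List[str]:
    """Return names of CSRs that carry both an Approved and a Failed condition.

    csr_list is assumed to be the parsed JSON of `kubectl get csr -o json`.
    """
    names = []
    for item in csr_list.get("items", []):
        conditions = item.get("status", {}).get("conditions", [])
        types = {c.get("type") for c in conditions}
        if {"Approved", "Failed"} <= types:
            names.append(item["metadata"]["name"])
    return names


def restart_in_batches(
    workloads: Iterable[str],
    restart: Callable[[str], None],
    batch_size: int = 5,        # hypothetical value
    pause_sec: float = 30.0,    # hypothetical value
    sleep: Callable[[float], None] = time.sleep,
) -> List[List[str]]:
    """Restart workloads a few at a time, pausing between batches.

    Returns the batches in the order they were restarted.
    """
    items = list(workloads)
    batches = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
    for index, batch in enumerate(batches):
        for name in batch:
            restart(name)
        # Give the CA time to catch up before the next batch starts.
        if index < len(batches) - 1:
            sleep(pause_sec)
    return batches
```

In practice, `restart` might wrap a command such as `kubectl rollout restart deployment/<name> -n <namespace>`, and the names returned by `stuck_csr_names` could be passed to `kubectl delete csr`. The right batch size and pause depend on how quickly the external CA can sign requests.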

Results

The issue was resolved in less than an hour, restoring the production cluster to full resiliency. The root cause was not a direct Istio malfunction but a dependency failure: an overloaded external CA.

By quickly identifying and addressing the issue, Tetrate Support helped the customer avoid a major disruption to their production environment, ensuring continued reliability for millions of users. This case underscores the importance of understanding the entire service mesh ecosystem, not just Istio itself, when diagnosing complex infrastructure failures.
