
Business Challenge
Upgrading software infrastructure is critical for leveraging new features, maintaining security, and ensuring optimal performance. However, such upgrades often present unexpected challenges that can disrupt business operations. A global enterprise aiming to enhance its service mesh capabilities faced severe issues while upgrading its Istio Gateway from version 1.20.6 to 1.22.2. Addressing this challenge was vital to maintaining service reliability and avoiding downtime for critical applications.
Technical Problem
The customer encountered significant technical hurdles while upgrading their Istio Gateway. Despite updating the Gateway deployment to version 1.22.2, the Gateway pods failed to become ready, rendering them non-functional. Logs revealed a CERTIFICATE_VERIFY_FAILED error during mutual TLS (mTLS) communication with upstream services. Additionally, the Envoy configuration was not being applied correctly, leading to incomplete listener and endpoint setups. These issues severely impacted their ability to deploy critical workloads, delaying their production timeline and risking SLA violations for internal teams. The complexity of the upgrade and lack of visibility into root causes compounded the challenge, necessitating immediate resolution.
Resolution
To address the issue, our team followed a structured approach:
1. Initial Diagnosis
We began by gathering detailed logs from both Istiod (control plane) and the Gateway pods. Debug logs from Istiod indicated that it was sending certificates to the Gateway, but connections were subsequently being terminated. Using Istio’s diagnostic tools (istioctl bug-report and istioctl proxy-status), we confirmed that the Gateway’s Envoy configuration was incomplete, with no listeners or endpoints applied.
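The checks in this step could be scripted roughly as follows. This is a minimal sketch, not the exact commands run during the engagement: the `istio-ingress` namespace, the `istio=ingressgateway` label selector, the `istio-system` control-plane namespace, and the helper name `diagnose_gateway` are all assumptions to adapt to your installation.

```shell
# Hypothetical defaults: adjust the namespace and label selector to your install.
GATEWAY_NS="${GATEWAY_NS:-istio-ingress}"
GATEWAY_SELECTOR="${GATEWAY_SELECTOR:-istio=ingressgateway}"

diagnose_gateway() {
  # Sync state of every proxy as seen by Istiod.
  istioctl proxy-status

  # Name of the first Gateway pod matching the selector.
  pod="$(kubectl get pods -n "$GATEWAY_NS" -l "$GATEWAY_SELECTOR" \
    -o jsonpath='{.items[0].metadata.name}')"

  # Listeners and endpoints that Envoy actually holds for that pod.
  istioctl proxy-config listeners "$pod" -n "$GATEWAY_NS"
  istioctl proxy-config endpoints "$pod" -n "$GATEWAY_NS"

  # Recent control-plane logs (assumes Istiod runs in istio-system).
  kubectl logs -n istio-system deploy/istiod --tail=200
}
```

Calling `diagnose_gateway` prints the proxy sync status followed by the listener and endpoint tables. Running `istioctl bug-report` separately collects a fuller archive of cluster state for offline analysis.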
2. Connectivity Verification
To eliminate network connectivity as a potential root cause, we deployed diagnostic tools (e.g., netcat and curl) in the same namespace as the Gateway. These tests confirmed that the Gateway could reach the Istiod control plane without any issues.
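A connectivity check of this kind might look like the sketch below, run from a temporary pod in the Gateway's namespace. The service name `istiod.istio-system.svc` and the helper name `check_istiod_reachable` are assumptions. Ports 15012 (TLS xDS) and 15014 (monitoring) are Istiod's usual defaults, and netcat flag support varies between variants.

```shell
# Assumed Istiod service address: adjust for revisioned or renamed installs.
ISTIOD_HOST="${ISTIOD_HOST:-istiod.istio-system.svc}"

check_istiod_reachable() {
  # TCP-level reachability of the TLS xDS port the Gateway connects to.
  nc -vz -w 5 "$ISTIOD_HOST" 15012 || return 1

  # HTTP-level reachability of the monitoring port.
  curl -sS --max-time 5 "http://$ISTIOD_HOST:15014/version" || return 1
}
```

A zero exit status from `check_istiod_reachable` indicates that both ports answered, which rules out basic network policy or DNS problems between the Gateway namespace and the control plane.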
3. Configuration Analysis
A detailed review of the Gateway’s deployment YAML and associated Helm charts revealed that no critical configuration parameters were missing. However, we identified a significant behavioral change introduced in Istio 1.21: the AUTO_SNI feature was enabled by default. This feature automatically sets the Server Name Indication (SNI) for upstream TLS communication based on the request’s :authority header.
This change caused the Gateway to send an unintended SNI when initiating TLS communication with upstream services, leading to a certificate validation failure. Previously, the SNI was explicitly defined, ensuring compatibility with the customer’s certificate setup. With AUTO_SNI enabled, the Gateway derived the SNI dynamically from incoming requests, which in some cases did not match the expected certificate Subject Alternative Name (SAN) of the upstream service.
For example, if a client requested qa-pod1-mdm.upgrade.infaqa.com, AUTO_SNI could cause the Gateway to send qa-pod1-mdm.upgrade.infaqa.com as the SNI instead of the expected value, datastore-service.mdmnext-qa-upgr1.svc.cluster.local. Because the upstream service’s certificate did not include the client-facing hostname, the TLS handshake failed with a CERTIFICATE_VERIFY_FAILED error. With AUTO_SNI disabled, the Gateway reverts to the previous behavior and sends the explicitly configured SNI for mutual TLS communication.
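One way to observe a mismatch like this is to present each candidate SNI to the upstream and compare it with the Subject Alternative Names on the certificate returned. The sketch below is illustrative only: the helper name `show_sans_for_sni` is hypothetical, the upstream port is not given in this write-up and must be supplied by the caller, and the `-ext` option assumes OpenSSL 1.1.1 or later.

```shell
show_sans_for_sni() {
  # $1 = host:port of the upstream service, $2 = SNI value to present.
  # Prints the Subject Alternative Names of the certificate the server returns.
  openssl s_client -connect "$1" -servername "$2" </dev/null 2>/dev/null \
    | openssl x509 -noout -ext subjectAltName
}
```

Calling the helper once with the client-facing hostname and once with the in-cluster service name as the second argument shows whether either value appears among the returned SANs. A value that is absent would fail verification when sent as the SNI.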
4. Hypothesis Testing
To validate the impact of AUTO_SNI, we temporarily disabled the feature by setting the ENABLE_AUTO_SNI environment variable to false in the Istiod deployment configuration. With the flag off, Istiod no longer configures Envoy to derive the upstream SNI from the incoming request. We then exercised the Gateway with a client-certificate request of the following form:
$ curl -v --cacert ca.crt --cert rrqa.crt --key rrqa.key 'https://mks-demo-service-engineer-tms-mks.ingress.cluster.partner.com:8443'
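The flag change described in this step could be applied along these lines. This is a sketch under assumptions rather than the exact procedure used: it presumes a non-revisioned control plane whose deployment is named `istiod` in the `istio-system` namespace, and the helper name `disable_auto_sni` is hypothetical.

```shell
# Assumed control-plane namespace: adjust to your install.
ISTIOD_NS="${ISTIOD_NS:-istio-system}"

disable_auto_sni() {
  # Set the feature flag on the Istiod deployment; this triggers a rolling restart.
  kubectl set env deployment/istiod -n "$ISTIOD_NS" ENABLE_AUTO_SNI=false

  # Wait for the new Istiod pods to come up before testing.
  kubectl rollout status deployment/istiod -n "$ISTIOD_NS" --timeout=180s

  # Confirm the variable is now present on the deployment.
  kubectl set env deployment/istiod -n "$ISTIOD_NS" --list | grep ENABLE_AUTO_SNI
}
```

A manual edit like this may be overwritten the next time Helm or an operator reconciles the control plane, so a lasting change would normally belong in the installation values instead.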
5. Resolution Verification
After disabling AUTO_SNI, the Gateway pods became ready, and Envoy received the correct configuration, including listeners and endpoints. Subsequent tests confirmed that mutual TLS communication with upstream services was successful, and the CERTIFICATE_VERIFY_FAILED errors were resolved.
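Readiness after the change could be confirmed with a short helper such as the one below. The namespace, label selector, timeout, and the name `verify_gateway_ready` are assumptions chosen for illustration.

```shell
# Hypothetical defaults: adjust the namespace and label selector to your install.
GATEWAY_NS="${GATEWAY_NS:-istio-ingress}"
GATEWAY_SELECTOR="${GATEWAY_SELECTOR:-istio=ingressgateway}"

verify_gateway_ready() {
  # Block until every matching Gateway pod reports Ready, or fail after the timeout.
  kubectl wait pod -n "$GATEWAY_NS" -l "$GATEWAY_SELECTOR" \
    --for=condition=Ready --timeout=120s || return 1

  # Show the sync state of each proxy as reported by Istiod.
  istioctl proxy-status
}
```

If `kubectl wait` times out, the helper returns non-zero, which makes it usable as a gate in an upgrade script.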
6. Final Testing
To ensure stability, we conducted additional tests by simulating production traffic across various scenarios. The Gateway’s behavior remained consistent, with no further errors observed. We also documented the solution for future reference.
Business Impact
The issue was resolved within 48 hours, restoring full functionality to the customer’s service mesh. Disabling the AUTO_SNI feature allowed the customer to proceed with their Istio Gateway upgrade, avoiding critical delays to their production rollout. This solution ensured uninterrupted service delivery and preserved their SLA commitments, avoiding potential financial and reputational impacts.