Istio Gateway Upgrade Challenges – How We Solved TLS Issues and Ensured Seamless Service Delivery

Business Challenge
Upgrading software infrastructure is critical for leveraging new features, maintaining security, and ensuring optimal performance. However, such upgrades often present unexpected challenges that can disrupt business operations. A global enterprise aiming to enhance its service mesh capabilities faced severe issues while upgrading its Istio Gateway from version 1.20.6 to 1.22.2. Addressing this challenge was vital to maintaining service reliability and avoiding downtime for critical applications.
Technical Problem
The customer encountered significant technical hurdles while upgrading their Istio Gateway. Despite updating the Gateway deployment to version 1.22.2, the Gateway pods failed to become ready, rendering them non-functional. Logs revealed a CERTIFICATE_VERIFY_FAILED error during mutual TLS (mTLS) communication with upstream services. Additionally, the Envoy configuration was not being applied correctly, leading to incomplete listener and endpoint setups. These issues severely impacted their ability to deploy critical workloads, delaying their production timeline and risking SLA violations for internal teams. The complexity of the upgrade and lack of visibility into root causes compounded the challenge, necessitating immediate resolution.
Resolution
To address the issue, our team followed a structured approach:
1. Initial Diagnosis
We began by gathering detailed logs from both Istiod (control plane) and the Gateway pods. Debug logs from Istiod indicated that it was sending certificates to the Gateway, but connections were subsequently being terminated. Using Istio’s diagnostic tools (istioctl bug-report and istioctl proxy-status), we confirmed that the Gateway’s Envoy configuration was incomplete, with no listeners or endpoints applied.
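The checks in this step can be sketched with standard istioctl and kubectl commands. This is a minimal sketch, not the exact commands run for the customer: the namespace and pod name are placeholder assumptions, and the proxy-config calls are our addition for confirming that Envoy holds no listeners or endpoints.

```shell
# Placeholder values (assumptions): replace with your Gateway namespace and pod.
GATEWAY_NS="${GATEWAY_NS:-istio-ingress}"
GATEWAY_POD="${GATEWAY_POD:-istio-ingressgateway-0}"

inspect_gateway() {
  # Sync state of every proxy as seen by Istiod.
  istioctl proxy-status

  # Listeners and endpoints Envoy has actually received; an empty result
  # matches the "incomplete configuration" symptom described above.
  istioctl proxy-config listeners "$GATEWAY_POD" -n "$GATEWAY_NS"
  istioctl proxy-config endpoints "$GATEWAY_POD" -n "$GATEWAY_NS"

  # Recent proxy logs, filtered for the TLS verification error.
  kubectl logs "$GATEWAY_POD" -n "$GATEWAY_NS" --tail=200 \
    | grep CERTIFICATE_VERIFY_FAILED \
    || echo "no CERTIFICATE_VERIFY_FAILED lines in recent logs"
}
```

Calling inspect_gateway prints each view in turn. Running istioctl bug-report separately collects a fuller archive of cluster state for offline analysis.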
2. Connectivity Verification
To eliminate network connectivity as a potential root cause, we deployed diagnostic tools (e.g., netcat and curl) in the same namespace as the Gateway. These tests confirmed that the Gateway could reach the Istiod control plane without any issues.
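A reachability check of this kind might look like the sketch below, run from a debug pod in the Gateway's namespace. The service name is the default for a non-revisioned install and is an assumption here, as is an nc build that supports the -z and -w flags.

```shell
# Assumed default Istiod service name; revisioned installs use istiod-<revision>.
ISTIOD_HOST="${ISTIOD_HOST:-istiod.istio-system.svc}"

check_istiod_ports() {
  # 15012: xDS and certificate signing over TLS; 15014: control plane monitoring.
  for port in 15012 15014; do
    if nc -z -w 5 "$ISTIOD_HOST" "$port"; then
      echo "reachable: $ISTIOD_HOST:$port"
    else
      echo "UNREACHABLE: $ISTIOD_HOST:$port"
    fi
  done
}
```

A successful TCP connection only rules out network policy and DNS problems. It says nothing about whether the TLS handshake on top of it succeeds, which is why the investigation continued past this step.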
3. Configuration Analysis
A detailed review of the Gateway’s deployment YAML and associated Helm charts revealed that no critical configuration parameters were missing. However, we identified a significant behavioral change introduced in Istio 1.21: the AUTO_SNI feature was enabled by default. This feature automatically sets the Server Name Indication (SNI) for upstream TLS communication based on the request’s :authority header.
This change caused the Gateway to send an unintended SNI when initiating TLS communication with upstream services, leading to a certificate validation failure. Previously, the SNI was explicitly defined, ensuring compatibility with the customer’s certificate setup. With AUTO_SNI enabled, the Gateway derived the SNI dynamically from incoming requests, which in some cases did not match the expected certificate Subject Alternative Name (SAN) of the upstream service.
For example, if a client requested qa-pod1-mdm.upgrade.infaqa.com, Istio’s AUTO_SNI feature could cause the Gateway to send qa-pod1-mdm.upgrade.infaqa.com as the SNI instead of datastore-service.mdmnext-qa-upgr1.svc.cluster.local, which was the expected value. Since the upstream service’s certificate did not include this unintended SNI, the TLS handshake failed, resulting in a CERTIFICATE_VERIFY_FAILED error. By disabling AUTO_SNI, the Gateway reverted to the expected behavior, ensuring that the correct SNI was used for mutual TLS communication.
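The mismatch can be illustrated with a small, deliberately simplified check of a server name against a certificate's SAN entries. This is an illustration of the idea, not Envoy's actual validation code: it handles only exact matches and single-label wildcards, is case-sensitive, and assumes the upstream certificate carries the single SAN shown.

```shell
# Returns 0 if the server name ($1) is covered by any SAN entry ($2...).
# Simplified: exact match or single-label wildcard, case-sensitive.
sni_matches_san() {
  name=$1
  shift
  for entry in "$@"; do
    case "$entry" in
      "*."*)
        # "*.example.com" covers "a.example.com" but not "a.b.example.com".
        if [ "$name" != "${name#*.}" ] && [ "${name#*.}" = "${entry#??}" ]; then
          return 0
        fi
        ;;
      *)
        if [ "$name" = "$entry" ]; then
          return 0
        fi
        ;;
    esac
  done
  return 1
}

# Assumed SAN of the upstream certificate, taken from the example above.
expected_san="datastore-service.mdmnext-qa-upgr1.svc.cluster.local"

for candidate in "qa-pod1-mdm.upgrade.infaqa.com" "$expected_san"; do
  if sni_matches_san "$candidate" "$expected_san"; then
    echo "$candidate: covered by certificate"
  else
    echo "$candidate: NOT covered, verification would fail"
  fi
done
```

The external hostname derived from the :authority header is reported as not covered, while the in-cluster service name is covered. In practice, the SAN list of a real certificate can be read with openssl x509 -noout -ext subjectAltName (OpenSSL 1.1.1 or later).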
4. Hypothesis Testing
To validate the impact of AUTO_SNI, we temporarily disabled the feature by setting the ENABLE_AUTO_SNI environment variable to false in the Istiod deployment configuration. This change stopped Istiod from configuring the Gateway to derive the SNI automatically for upstream connections. The Gateway could then be exercised with a client-certificate request of the following form:
$ curl -v --cacert ca.crt --cert rrqa.crt --key rrqa.key 'https://mks-demo-service-engineer-tms-mks.ingress.cluster.partner.com:8443'
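The flag change described in this step could be applied roughly as follows. This is a sketch under assumptions: the deployment and namespace names are the Istio defaults, not confirmed customer values, and a direct edit like this suits a temporary test rather than a permanent setting.

```shell
# Assumed defaults: Istiod runs as deployment "istiod" in "istio-system".
ISTIOD_NS="${ISTIOD_NS:-istio-system}"
ISTIOD_DEPLOY="${ISTIOD_DEPLOY:-istiod}"

disable_auto_sni() {
  # Add or update the environment variable on the Istiod pod template.
  kubectl -n "$ISTIOD_NS" set env "deployment/$ISTIOD_DEPLOY" ENABLE_AUTO_SNI=false

  # Wait for the restarted control plane before re-checking the Gateway.
  kubectl -n "$ISTIOD_NS" rollout status "deployment/$ISTIOD_DEPLOY" --timeout=120s
}
```

On Helm-managed installations a manual edit is likely to be overwritten by the next upgrade, so the same variable would normally be carried in the chart values instead (for the istiod chart, under pilot.env).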
5. Resolution Verification
After disabling AUTO_SNI, the Gateway pods became ready, and Envoy received the correct configuration, including listeners and endpoints. Subsequent tests confirmed that mutual TLS communication with upstream services was successful, and the CERTIFICATE_VERIFY_FAILED errors were resolved.
6. Final Testing
To ensure stability, we conducted additional tests by simulating production traffic across various scenarios. The Gateway’s behavior remained consistent, with no further errors observed. We also ensured that the solution was documented for future reference.
Business Impact
The issue was resolved within 48 hours, restoring full functionality to the customer’s service mesh. Disabling the AUTO_SNI feature allowed the customer to proceed with their Istio Gateway upgrade, avoiding critical delays to their production rollout. This solution ensured uninterrupted service delivery and preserved their SLA commitments, avoiding potential financial and reputational impacts.