The Istio service mesh is packed with features that make hundreds of companies’ Kubernetes environments more secure, agile, and resilient. All of those features are orchestrated by Istiod, a scalable, stateless, and loosely coupled component at the heart of Istio. It constantly receives updates from the Kubernetes API, pushes configuration and updates to the sidecars, and acts as their certificate authority (CA). And it generates lots of metrics.
While istiod works well out of the box, it may not keep doing so without proper maintenance: there are cases where it will not operate at peak performance, especially once the mesh extends across teams in an organization and the load on its shoulders grows heavier.
This is why service mesh operators must watch the key metrics istiod generates closely: they help prevent issues and make it possible to determine whether a given problem is related to the mesh at all. As an extra networking layer, Istio sits in every hop of your data path, and you need the tools and information to answer the question: “Is this issue I’m experiencing related to the service mesh?”
Besides the control plane, Istio has a data plane composed of the Envoy sidecars injected into the pods and the gateways. I will cover its key metrics in a follow-up article on the Istio data plane (watch this space).
Tetrate offers an enterprise-ready, 100% upstream distribution of Istio, Tetrate Istio Subscription (TIS). TIS is the easiest way to get started with Istio for production use cases. TIS+, a hosted Day 2 operations solution for Istio, adds a global service registry, unified Istio metrics dashboard, and self-service troubleshooting.
Get access now ›
What to Look For: Istio Golden Metrics
A quick way to get a sense of the scope of the question “what should I observe in istiod?” is to run this command:
$ kubectl exec $(kubectl get po -l app=istiod -oname) -- curl -s localhost:15014/metrics | grep -v '^#' | awk -F'{' '{print $1}' | sort | uniq | wc -l
87
That’s 87 unique metrics, and the real number is higher still: saturation and other specific metrics are not counted here because they are null in my cluster. A good rule of thumb is to focus on the metrics that account for latency, traffic, errors, and saturation. That combination is known as the Golden Metrics.
Istio Latency Metrics
istiod’s latency is about how long its messages take to be transmitted to and digested by the sidecars. That channel carries all of your Istio Custom Resource Definition (CRD) configuration, new pods’ IPs, deletions of Kubernetes services, and the startup configuration batch for newly created sidecars, amongst other data.
A must-follow metric is pilot_proxy_convergence_time_bucket, as it indicates how long it takes for istiod configuration to be live in the sidecars. This metric has helped our customers spot performance issues in their control plane and adjust before they impact any app.
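If Prometheus is scraping istiod, a percentile over this histogram is an easy way to keep an eye on it. The sketch below is just that, a sketch: it assumes the Istio Prometheus addon is reachable at prometheus.istio-system:9090, so adjust the URL and selectors to your setup.
# P99 time for a config change to converge on the sidecars, over the last 5 minutes
$ curl -sG http://prometheus.istio-system:9090/api/v1/query \
    --data-urlencode 'query=histogram_quantile(0.99, sum by (le) (rate(pilot_proxy_convergence_time_bucket[5m])))'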
A second recommendation is pilot_xds_push_time_bucket, especially at the EDS (endpoint discovery service) dimension, as this config stream is in charge of transmitting changes in pods’ IPs. Just imagine a pod being deleted and it taking 5 seconds for all the other running sidecars to update: requests will keep being sent to an IP:port that is no longer listening. In some cases, transient timeouts and increased tail latency can be tracked down by understanding this metric.
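The same pattern works here; the sketch below assumes the same Prometheus endpoint and that a type label carries the EDS dimension described above.
# P99 xDS push time for endpoint (EDS) updates (type label assumed)
$ curl -sG http://prometheus.istio-system:9090/api/v1/query \
    --data-urlencode 'query=histogram_quantile(0.99, sum by (le) (rate(pilot_xds_push_time_bucket{type="eds"}[5m])))'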
Istio Traffic Metrics
Traffic is all about how much activity is going through istiod. It is a good indicator of how often service mesh users are changing configuration, like adding new routes, but, crucially, also of how intense change is in the cluster, as it correlates with pod and Kubernetes service churn. This is valuable because it provides a single place to measure the footprint of cluster changes, at both the service and the Istio config level, to analyze them in context with other relevant data, and to plan for the future.
A key metric is pilot_xds_pushes, which by default includes a type dimension mapped to CDS, LDS, RDS, and EDS. The first three relate to Istio configurations such as VirtualServices, Gateways, or DestinationRules, whilst the last closely follows changes in pod IPs, or Endpoints. This metric is key because it helps you understand which changes istiod is having to process at any time, and how intense they are.
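To watch that intensity per xDS type, a simple per-type rate is usually enough; the query below assumes the same Prometheus endpoint as before.
# xDS pushes per second, broken down by type (cds, lds, rds, eds)
$ curl -sG http://prometheus.istio-system:9090/api/v1/query \
    --data-urlencode 'query=sum by (type) (rate(pilot_xds_pushes[5m]))'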
Istio Error Metrics
This is a serious category. Any error in istiod needs to be addressed with the utmost diligence, even though, thanks to the loosely coupled architecture, the sidecars keep working normally if they lose their connection to it, as long as there are no pod or CRD changes.
Two metrics require attention here. First, sidecar_injection_failure_total keeps track of an istiod function we haven’t discussed yet: the mutating webhook that changes pod specs at creation time to insert Istio machinery such as sidecars and init containers. Second, pilot_xds_write_timeout indicates that istiod-to-sidecar pushes could not be processed in time by one of the parties.
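Both counters should stay flat; one hedged way to check is to look at their recent increase, again assuming the same Prometheus endpoint.
# Sidecar injection failures in the last hour (anything above zero deserves a look)
$ curl -sG http://prometheus.istio-system:9090/api/v1/query \
    --data-urlencode 'query=sum(increase(sidecar_injection_failure_total[1h]))'
# istiod-to-sidecar push write timeouts in the last hour
$ curl -sG http://prometheus.istio-system:9090/api/v1/query \
    --data-urlencode 'query=sum(increase(pilot_xds_write_timeout[1h]))'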
Istio Saturation Metrics
Last but not least, a likely scenario when proper monitoring is not in place is istiod being starved of resources. CPU and memory allocations, as well as horizontal pod autoscaling (HPA), should be watched closely and adjusted based on past patterns. Events such as new app version rollouts, node upgrades that trigger large evictions, and past peak-traffic periods are the watermarks to use as a reference when periodically adjusting resources and HPA settings.
One of our customers was once experiencing disruption to traffic when rolling out new app versions, something that had never happened before: a confusing and unexpected situation in a live production environment. By inspecting container_cpu_usage_seconds_total we pinpointed that something was not right with the control plane, which was likely being throttled by high CPU usage. Digging a little further, a second issue surfaced: the HPA’s desired replicas were already at the maximum allowed. A quick adjustment brought the cluster back to its regular operation.
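A sketch of the kind of checks involved, assuming istiod runs in the istio-system namespace, its HPA keeps the default istiod name, and cAdvisor metrics are scraped by the same Prometheus.
# istiod CPU usage in cores, to compare against its CPU limit
$ curl -sG http://prometheus.istio-system:9090/api/v1/query \
    --data-urlencode 'query=sum(rate(container_cpu_usage_seconds_total{namespace="istio-system", pod=~"istiod-.*"}[5m]))'
# Is the HPA already pinned at its maximum?
$ kubectl -n istio-system get hpa istiod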
istiod metrics like process_virtual_memory_bytes need to be combined with kube-state-metrics to give you an intuitive reading of current istiod memory saturation.
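One hedged way to combine the two, assuming both istiod and kube-state-metrics are scraped by the same Prometheus, that istiod’s metrics carry an app="istiod" label from scrape relabeling, and that a memory limit is set on the discovery container:
# istiod virtual memory as a fraction of the configured memory limit (selectors are assumptions)
$ curl -sG http://prometheus.istio-system:9090/api/v1/query \
    --data-urlencode 'query=process_virtual_memory_bytes{app="istiod"} / scalar(sum(kube_pod_container_resource_limits{namespace="istio-system", container="discovery", resource="memory"}))'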
Parting Thoughts
Once you have the proper dashboards to observe the metrics discussed above and the team understands their importance, you will unlock faster and more accurate diagnosis, better resilience, and happier end users.
###
If you’re new to service mesh, Tetrate has a bunch of free online courses available at Tetrate Academy that will quickly get you up to speed with Istio and Envoy.
Are you using Kubernetes? Tetrate Enterprise Gateway for Envoy (TEG) is the easiest way to get started with Envoy Gateway for production use cases. Get the power of Envoy Proxy in an easy-to-consume package managed by the Kubernetes Gateway API. Learn more ›
Getting started with Istio? If you’re looking for the surest way to get to production with Istio, check out Tetrate Istio Subscription. Tetrate Istio Subscription has everything you need to run Istio and Envoy in highly regulated and mission-critical production environments. It includes Tetrate Istio Distro, a 100% upstream distribution of Istio and Envoy that is FIPS-verified and FedRAMP ready. For teams requiring open source Istio and Envoy without proprietary vendor dependencies, Tetrate offers the ONLY 100% upstream Istio enterprise support offering.
Need global visibility for Istio? TIS+ is a hosted Day 2 operations solution for Istio designed to simplify and enhance the workflows of platform and support teams. Key features include: a global service dashboard, multi-cluster visibility, service topology visualization, and workspace-based access control.
Get a Demo