Istio recently announced “ambient mesh”—an experimental, “sidecar-less” deployment model for Istio. We’ve written about sidecar vs. sidecar-less recently in the context of getting the most performance and resiliency out of the service mesh. In this article, we’ll present our take on ambient mesh in particular.
If you want to get started with a production-ready Istio distribution today, try Tetrate Istio Distro (TID). TID is a vetted, upstream distribution of Istio that is simple to install, manage, and upgrade with FIPS-certified builds available for FedRAMP environments. If you need a unified and consistent way to secure and manage services across a fleet of applications, check out Tetrate Service Bridge (TSB), our comprehensive edge-to-workload application connectivity platform built on Istio and Envoy.
What Is Ambient Mesh?
Ambient mesh is an experimental new deployment model recently introduced to Istio. It splits the duties currently performed by the Envoy sidecar into two separate components: a node-level component for encryption (called “ztunnel”) and an L7 Envoy instance deployed per service for all other processing (called “waypoint”). The motivation for the ambient model is to gain efficiency through potentially better lifecycle and resource management.
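To make the difference in deployment shape concrete, here is a minimal sketch that counts proxies under each model. The `Cluster` type, its fields, and the one-waypoint-per-service default are assumptions made for illustration only; they are not Istio APIs or Istio defaults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    """Hypothetical cluster shape, used only for this illustration."""
    nodes: int
    services: int
    pods_per_service: int
    waypoints_per_service: int = 1  # assumed replica count, not an Istio default


def sidecar_proxy_count(c: Cluster) -> int:
    # Sidecar model: one Envoy (handling L4 and L7) runs next to every pod.
    return c.services * c.pods_per_service


def ambient_proxy_count(c: Cluster) -> dict:
    # Ambient model: one ztunnel (L4) per node, plus waypoint Envoys (L7) per service.
    ztunnels = c.nodes
    waypoints = c.services * c.waypoints_per_service
    return {"ztunnel": ztunnels, "waypoint": waypoints, "total": ztunnels + waypoints}


if __name__ == "__main__":
    # A few services, each scaled out widely.
    wide = Cluster(nodes=20, services=5, pods_per_service=100)
    # Many services, each with a single pod.
    narrow = Cluster(nodes=20, services=200, pods_per_service=1)
    for name, cluster in (("wide", wide), ("narrow", narrow)):
        print(name, sidecar_proxy_count(cluster), ambient_proxy_count(cluster))
```

With these made-up shapes, the wide cluster drops from 500 sidecars to 25 ambient proxies, while the narrow cluster goes from 200 sidecars to 220 ambient proxies. Proxy count is only a rough proxy for resource use, since a waypoint handles more traffic than a single sidecar and would need to be sized accordingly.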
Why Should I Care about Ambient Mesh?
For the majority of service mesh users, the exact deployment model of the Istio data plane is a choice that you probably don’t need to think too hard about. The default is probably fine. For some service mesh users, specifically those who have large, horizontally-scaled footprints of a small number of services—where the waypoint architecture gains the most efficiency—the ambient mesh model will be useful as it matures into production-ready infrastructure software.
- The ambient mesh deployment model makes some trade-offs versus the sidecar model, especially regarding lifecycle management, resource utilization, troubleshooting, and security posture. One is not clearly better than the other.
- Ambient mesh is an experiment that won’t be production ready until 2023 at the earliest, so don’t build on it yet. Today, it performs worse than sidecars, supports fewer features, and has undefined behavior with widely used technology like CNI. However, we expect this situation to improve rapidly as the implementation is hardened in the coming months.
- Much of the mesh functionality you care about (per-request traffic management and security controls, distributed tracing, and application-level RED metrics) happens at L7. It’s not yet clear how broadly applicable the L4-only part of ambient mesh will be and to what extent breaking out these data plane duties will help drive mesh adoption.
The rest of this post is our take on the tradeoffs of the ambient model compared to Istio’s existing sidecar deployment model and which one might be right for you and when.
How Do L4 and L7 Processing Work in Istio?
Since ambient splits L4 and L7 processing, it’s important to understand exactly what mesh behavior happens in each layer:
| Capability | L4 | L7 |
|---|---|---|
| Service-to-service authentication | SPIFFE, via mTLS certs. Istio issues a short-lived X.509 certificate that encodes the pod’s service account identity. | N/A: service identity in Istio is based on TLS only. |
| Service-to-service authorization | Network-based authorization, plus identity-based policy. | Full policy. |
| End-user authentication | N/A: we can’t apply per-user settings. | Local authentication of JWTs, support for remote authentication via OAuth and OIDC flows. |
| End-user authorization | N/A: see above. | Service-to-service policies can be extended to require end-user credentials with specific scopes, issuers, principal, audiences, etc., but they cannot be used for full user-to-resource access control. Full user-to-resource access should be implemented using external authorization. |
| Envoy’s External Authorization API (ext_authz) | Cannot perform any per-request policy; the ext_authz API is only configurable for L7 traffic. | Enforce per-request policy with decisions from an external service, e.g. OPA. |
| Logging | Basic network information: network 5-tuple, bytes sent/received, etc. See Envoy docs. | Full request metadata logging, in addition to basic network information. |
| Tracing | Not today; possible eventually with HBONE. | Envoy participates in distributed tracing. See Istio overview on tracing. |
| Metrics | TCP only (bytes sent/received, number of packets, etc.). | L7 RED metrics: rate of requests, rate of errors, request duration (latency). |
| Load balancing | Connection level only. See TCP traffic shifting task. | Per request, enabling e.g. canary deployments, gRPC traffic, etc. See HTTP traffic shifting task. |
| Circuit breaking | TCP only. | HTTP settings in addition to TCP. |
| Outlier detection | On connection establishment/failure. | On request success/failure. |
| Rate limiting | Rate limit on L4 connection data only, on connection establishment, with global and local rate limiting options. | Rate limit on L7 request metadata, per request. |
| Timeouts | Connection establishment only (connection keep-alive is configured via circuit breaking settings). | Per request. |
| Retries | Retry connection establishment. | Retry per request failure. |
| Fault injection | N/A: fault injection cannot be configured on TCP connections. | Full application and connection-level faults (timeouts, delays, specific response codes). |
| Traffic mirroring | N/A: HTTP only. | Percentage-based mirroring of requests to multiple backends. |
It’s worth remembering that a proxy operating at L7 can do everything in the L4 and L7 columns, while a proxy operating at L4 can perform only the L4 column. With a clear understanding of what happens where—and the limitations of L4 vs L7—we can look at the tradeoffs that ambient mesh makes compared to the sidecar model.
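As a rough illustration of how to read the table, the sketch below encodes a handful of its rows as a lookup from feature to the lowest layer that can provide it, then reports which requested features would force an L7 proxy (a waypoint in the ambient model). The feature names and the `MINIMUM_LAYER` mapping are shorthand invented for this example, not Istio configuration keys.

```python
# Lowest layer that can provide each feature, paraphrased from the table above.
# The keys are illustrative labels, not Istio or Envoy identifiers.
MINIMUM_LAYER = {
    "service-to-service authentication": "L4",
    "tcp metrics": "L4",
    "connection-level load balancing": "L4",
    "connection timeouts": "L4",
    "end-user authentication": "L7",
    "ext_authz": "L7",
    "red metrics": "L7",
    "per-request load balancing": "L7",
    "per-request retries": "L7",
    "fault injection": "L7",
    "traffic mirroring": "L7",
}


def features_requiring_l7(features):
    """Return the requested features that an L4-only proxy cannot provide."""
    unknown = [f for f in features if f not in MINIMUM_LAYER]
    if unknown:
        raise KeyError(f"not in the illustrative table: {unknown}")
    return [f for f in features if MINIMUM_LAYER[f] == "L7"]


if __name__ == "__main__":
    encryption_only = ["service-to-service authentication", "tcp metrics"]
    typical = encryption_only + ["red metrics", "per-request retries"]
    print(features_requiring_l7(encryption_only))  # empty list: L4 is enough
    print(features_requiring_l7(typical))          # the two L7-only features
```

An empty result means the feature set can be served by L4 processing alone; any non-empty result means an L7 Envoy has to be in the path.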
Should I Use Ambient Mesh Today? (Not Yet.)
As of its announcement in September 2022, ambient mesh is an experimental proof of concept. By nearly every metric, it performs worse than sidecars, and it has quite a few limitations. Ambient mesh is not ready to be used in production environments (and for our customers—platform teams at large enterprises—application development and test environments count as production environments, too).
However, we expect that state to change relatively quickly as engineers across the community work on the deployment model. It will progress through feature phases like all other Istio features. Watch the Istio features list for ambient mesh to be promoted into an Alpha state some time in 2023. We anticipate it leaving Beta late 2023 or early 2024—and we would not recommend it for production use before then.
Ambient Mesh Assumptions about Service Mesh and Istio
Summarizing the ambient mesh announcement blog posts, the core assumptions motivating the architecture are:
- Assumption: Envoy’s L7 functionality is what makes it challenging to onboard new apps into the mesh.
- Assumption: Envoy’s L7 functionality is where CVEs are found (the vast majority of Envoy CVEs are in L7 code, not the L4 code that handles TLS and connections). Therefore, holding the certs of multiple pods at the node level in a strictly L4 proxy is acceptable, whereas doing L7 at the node level is not.
- Assumption: Sidecars often result in over-allocation of resources.
- Assumption: An extra network hop is cheaper than an Envoy doing L7 computations (the move from two L7 sidecars to one L7 waypoint adds a hop but removes an Envoy performing L7 processing).
- Assumption: Istio’s most valuable feature is encryption in transit, so it is valuable to optimize for making that use case easy.
How Ambient Mesh Assumptions Match Our Experience with Customers in the Field
Our experiences working closely with some of the largest enterprises in the world to enable service mesh adoption don’t quite prove out those motivating ideas:
- L7 functionality: some L7 features of the mesh can make adoption harder for applications, but in our experience more breakages in application onboarding occur due to changes in connection lifetimes or issues with double encryption. These problems will manifest similarly regardless of sidecar or node-level proxy, but are more challenging for application teams to troubleshoot in a node deployment model (where they usually lack permissions to inspect logs of privileged/node level components). For a deeper dive into node-level proxies vs sidecars, see our blog post mentioned above.
- L7 CVEs: looking over Envoy’s CVEs (45 in this tally), we see:
  - 33 (about 73%) are related to L7 processing, mainly parsing or HTTP handling.
  - The remaining 12 are L4 or inherent to Envoy (connection handling, certificate handling, noisy neighbor DoS, buffer overruns, etc.).
  - The average severity of the L7 CVEs is higher than that of the non-L7 CVEs.
An L4-only Envoy does offer a reduced attack surface compared to an L7 Envoy because there is less code (and fewer CVEs) to exploit. It remains to be seen whether that attack surface is small enough to justify holding the identities of every pod on the node. The crux of the ambient mesh security model is how much we can trust the ztunnel component, which is the component the community intends to focus on evolving first. Overall, ambient’s security model is at best a step sideways compared to the sidecar model, and its boundaries are harder to reason about when fitting it into your existing security model.
- Resource utilization: It’s true that sidecars can result in poor resource utilization if pod resource requests are not configured, and if techniques like configuring resource visibility or the Sidecar API Resource are not used. However, our experience with Istio deployments that tightly control resource visibility and limit configuration scope via Sidecar API Resources is that sidecar resource utilization is very low, and we can set much smaller resource requests per sidecar than Istio’s default profile. It is very challenging to maintain the Sidecar API Resources for this type of configuration by hand—which is why Tetrate Service Bridge generates it automatically based on higher-level access constructs.
We’re excited to see how resource utilization improvements manifest with the ambient deployment model—there’s serious potential to deploy a lot fewer Envoys for the same mesh behavior because a standalone waypoint Envoy can typically process significantly more traffic than an individual service instance (and its sidecar) sees.
- Extra network hop vs. sidecar: One of the most interesting possibilities that ambient’s deployment model offers is removing the extra L7 Envoy in the sidecar architecture. Because communication in the mesh is sidecar-to-sidecar, and both client and server apply L7 policy, we have to do L7 processing twice for each request. In ambient mode, that policy will be applied by the server’s waypoint, so L7 processing happens just once per request. However, there’s still a ztunnel on either side of the connection doing L4 processing.
It remains to be seen if this trade—a network hop instead of an Envoy doing L7 processing—is worthwhile in general. Certainly in cloud provider networks in the same availability zone where latency is low and connections are reliable, it probably is worthwhile. However, many of our customers deploy the service mesh on prem and in a variety of physical sites that often don’t look like cloud provider networks.
- mTLS: Istio’s encryption in transit is without a doubt one of its most powerful features. It’s used (in FIPS validated form) for FIPS compliance, PCI compliance, and in a variety of other security-first environments. However, when we look at the capabilities of the mesh, it’s not common that encryption alone is the reason for adoption: typically it’s encryption in conjunction with L7 policy (including traffic control) and observability that motivate investment in the technology. Looking at the table above, it’s clear that those capabilities cannot be achieved with ztunnel alone—they require an L7 Envoy. In fact, most of the mesh usage we see today requires an L7 Envoy. We’re wildly enthusiastic about anything that makes mesh adoption easier, but we’re not yet confident that ambient mesh’s deployment model will deliver significantly on that promise.
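The hop-versus-L7 trade discussed above can be put into a back-of-the-envelope model. The sketch below assumes a sidecar path of two L7 passes and one network hop, and an ambient path of two L4 passes (ztunnels), one L7 pass (waypoint), and two network hops. Every latency figure in it is a made-up placeholder, not a measurement, and the function names are hypothetical.

```python
def sidecar_overhead_ms(l7_ms: float, hop_ms: float) -> float:
    # client sidecar (L7) -> one network hop -> server sidecar (L7)
    return 2 * l7_ms + hop_ms


def ambient_overhead_ms(l4_ms: float, l7_ms: float, hop_ms: float) -> float:
    # client ztunnel (L4) -> hop -> waypoint (L7) -> hop -> server ztunnel (L4)
    return 2 * l4_ms + l7_ms + 2 * hop_ms


def ambient_is_cheaper(l4_ms: float, l7_ms: float, hop_ms: float) -> bool:
    # Algebraically, ambient wins when 2 * l4_ms + hop_ms < l7_ms.
    return ambient_overhead_ms(l4_ms, l7_ms, hop_ms) < sidecar_overhead_ms(l7_ms, hop_ms)


if __name__ == "__main__":
    # Placeholder numbers only: a cheap hop (same availability zone) ...
    print(ambient_is_cheaper(l4_ms=0.05, l7_ms=0.5, hop_ms=0.2))
    # ... versus an expensive hop (for example, between physical sites).
    print(ambient_is_cheaper(l4_ms=0.05, l7_ms=0.5, hop_ms=2.0))
```

Under this simplified model the extra hop pays off only while it costs less than the L7 pass it replaces (minus the two L4 passes), which is why the answer can differ between a low-latency cloud network and an on-prem or multi-site deployment. It ignores throughput, queuing, and connection reliability, all of which matter in practice.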
Parting Thoughts on Ambient Mesh
Ambient mesh is an interesting take on the sidecar-less service mesh model. We’re excited to see how it develops, especially if it helps make mesh adoption easier. There are specific use cases where we expect this approach will yield benefits, but it’s still early days and the jury’s still out on whether or not the tradeoffs will be worth it. Either way, it will likely be some time before ambient mesh should be considered ready for production. Until then, as they say, watch this space.
To get started with service mesh today, Tetrate Istio Distro is the easiest way to install, manage, and upgrade Istio. It provides a vetted upstream distribution of Istio that’s tested and optimized for specific platforms by Tetrate, plus a CLI that facilitates acquiring, installing, and configuring multiple Istio versions. Tetrate Istio Distro also offers FIPS-certified Istio builds for FedRAMP environments.
For enterprises that need a unified and consistent way to secure and manage services and traditional workloads across complex, heterogeneous deployment environments, we offer Tetrate Service Bridge, our flagship edge-to-workload application connectivity platform built on Istio and Envoy.