This is the third in a series of service mesh best practices articles excerpted from Tetrate’s forthcoming book, Istio in Production, by Tetrate founding engineer Zack Butcher.
Istio is like a set of Legos: it has many capabilities that can be assembled just about any way you want. The structure that emerges is based on how you assemble the parts. In the previous installment of this blog series, we described an opinionated runtime topology to build a robust, resilient and reliable infrastructure. In this article, we’ll describe an opinionated set of mesh configurations to help achieve robustness, resiliency, reliability and security at runtime.
Istio supports global default configuration in what it calls the root namespace (by default, istio-system). Configuration published in the root namespace applies to all services by default, but any configuration published in a local namespace will override it. As a result, some configurations should be published in the root namespace and not allowed to be published anywhere else (like the PeerAuthentication policy for enforcing encryption in transit). Other configuration should be authored in each service’s own namespace (like the VirtualService controlling resiliency settings for it).
The most successful mesh adoptions we see hide the mesh itself behind another interface: something like Helm templates, Terraform, or more advanced solutions like Tetrate Service Bridge (TSB). The core idea is to expose only the small subset of mesh functions that application developers should configure, ideally in a language they understand (e.g., TSB can be configured with an annotated OpenAPI spec). To begin with, we typically expose only traffic settings and authorization to application developers. Authentication and telemetry are controlled centrally by their respective teams, or the platform team on their behalf. Many of these best practices—and others—are described in the NIST SP800-204 series, especially SP 800-204A and SP 800-204B. The Istio project site also has a set of best practices, which are worth bookmarking as well.
Service Mesh Naming Conventions
Recommendation: Develop and maintain a consistent naming scheme for Istio resources, preferably based on the service or host they configure.
Recommendation: Maintain consistent names for teams across clusters. A namespace should be owned by a single team.
Istio resources should be named based on the service or host they configure: a ServiceEntry adding api.example.com to the mesh should be named external-api-example-com; the DestinationRule, VirtualService, PeerAuthentication and AuthorizationPolicy resources for the service should all have the same name, too. An internal service Payments in the PCI namespace (hostname payments.pci in application code) should be called payments-pci and all of its mesh configuration names should match as well. These naming schemes aren’t hard and fast rules, but you should establish and stick to a consistent convention within your organization.
Those resources should all be published in the namespace of the service they configure, or in the istio-system namespace for mesh-wide configuration. External services are usually published into istio-system and are managed by a central team (the platform or security teams).
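As an illustration of the naming convention, a ServiceEntry for api.example.com might look like the following. This is a minimal sketch, not a definitive configuration: the port, protocol and resolution values are assumptions for a typical HTTPS API, and the explicit exportTo only matters if you restrict default visibility to the local namespace.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: external-api-example-com   # named after the host it configures
  namespace: istio-system          # external services are managed centrally
spec:
  hosts:
  - api.example.com
  location: MESH_EXTERNAL
  resolution: DNS                  # assumption: the host is resolved via DNS
  ports:
  - number: 443                    # assumption: the API is served over HTTPS
    name: tls
    protocol: TLS
  exportTo:
  - "*"                            # make the entry visible to every namespace
```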
We recommend consistent names for teams across clusters: a namespace should be owned by a single team regardless of cluster tenancy model (see next section).
Service Mesh Global Settings
Configuration visibility. Istio has an idea of configuration visibility: configuration can apply to the entire cluster by default, or it can apply only to the local namespace, or even only to individual services (with opt-in to be visible to the entire cluster, or just specific namespaces). You should default the exportTo field to the local namespace (“.”) for performance and security. You should set this default for Services, VirtualServices, and DestinationRules at install time. See Istio’s global configuration to configure these defaults:
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: controlplane
  namespace: istio-system
spec:
  # profile: default
  # ...
  meshConfig:
    defaultServiceExportTo:
    - "."  # only the namespace the resource is published in
    defaultVirtualServiceExportTo:
    - "."
    # equivalent, just different YAML syntax
    defaultDestinationRuleExportTo: ["."]
Listing 1: Example default global settings.
Sidecar resource configuration for each namespace. Istio’s Sidecar resource (networking.istio.io/v1beta1 API) controls which configuration Istio sends to sidecars in each namespace. For the best performance and lowest overhead, you should manage a Sidecar configuration for each namespace and curate the egress section to include only the hosts the services must communicate with. This will cause Istio to send less configuration to Envoy instances in that namespace, reducing their memory and CPU consumption. Combined with a registry-only outbound traffic policy (see the next item), the Sidecar resource can also help limit the surface area an attacker can pivot to via Envoy, because a host not in the egress section of the Sidecar will be “outbound traffic” to that Envoy instance. This is not a sufficient security policy by itself (see the Security section below), but it adds a layer of defense the attacker must traverse.
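A namespace-wide Sidecar with a curated egress section could be sketched as follows. The namespace name and the cross-namespace dependency are hypothetical; the point is only that hosts are listed explicitly in namespace/hostname form.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: default          # with no workloadSelector, it applies to every workload in the namespace
  namespace: pci         # hypothetical namespace
spec:
  egress:
  - hosts:
    - "./*"                                      # services in the same namespace
    - "istio-system/*"                           # hosts exported from the root namespace
    - "orders/orders.orders.svc.cluster.local"   # hypothetical cross-namespace dependency
```

Any host that is not matched by one of these entries is simply absent from the configuration sent to the Envoy instances in that namespace.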
Author explicit outbound (egress) traffic policy. Istio offers options for configuring how it handles a service in the mesh attempting to communicate with an endpoint unknown to Istio: the Outbound Traffic Policy. Istio can either allow all traffic, or limit traffic to only those services known to the mesh. You should configure Istio at installation to allow connection only to services in the registry. Further, you should model all external services you need to communicate with as ServiceEntries in the mesh (with, for example, DNS resolution for SaaS services, and so on), using DestinationRules to configure TLS for communicating with them. These external services should be managed centrally by the security team, or the platform team on their behalf.
Runtime Traffic Management Configuration
Use consistent, global names for services and use Istio to map them to local instances. You should use consistent, global names for accessing services. You can use Istio to map those global names to local instances. For example, payments.tetrate.internal could be consumed by all internal applications and Istio can be used to map that name to an instance of a service like “payments.default.svc.cluster.local in the us-east-2 Kubernetes cluster”. This global naming scheme allows developers to think about all services like a SaaS, without needing to think through specifics of the runtime topology, and makes it easy to do things like failover, canaries and cross-cluster routing down the line as your mesh usage matures or organizational needs evolve.
Define coarse default resiliency settings in the root configuration. You should define coarse timeout, retry, circuit breaking and outlier detection settings for all services in the mesh. You can achieve this with configuration in the root configuration namespace: a VirtualService for timeouts and retries, and a DestinationRule for circuit breaking and outlier detection. Individual teams should specify their own settings in their local namespace to override the defaults.
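In Istio's API, circuit breaking (connection pool limits) and outlier detection are fields of a DestinationRule's trafficPolicy. A coarse mesh-wide default could be sketched as below. The resource name, the "*.local" host pattern and all numeric values are illustrative assumptions, not tuned recommendations.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: default-resiliency      # hypothetical name
  namespace: istio-system       # root configuration namespace
spec:
  host: "*.local"               # assumption: matches in-cluster services (*.svc.cluster.local)
  exportTo:
  - "*"                         # needed if DestinationRules default to local visibility
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 1024           # illustrative value
      http:
        http1MaxPendingRequests: 1024  # illustrative value
    outlierDetection:
      consecutive5xxErrors: 5          # eject a host after 5 consecutive 5xx responses
      interval: 10s                    # how often hosts are evaluated
      baseEjectionTime: 30s            # minimum ejection duration
      maxEjectionPercent: 50           # never eject more than half the hosts
```

A DestinationRule that a team publishes for its own service takes precedence over this default for that host, so teams that override it should restate every setting they still want.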
Offer simplified “low/medium/high” resiliency settings to app teams. In systems that hide the mesh’s underlying API behind a higher-level interface, it’s valuable to offer application developers a simplified “low/medium/high” knob that configures default circuit breaking and outlier detection settings. These settings have many fields with nuanced meanings that are easy to misconfigure, resulting in poor performance for that application.
Runtime Security Configuration
The following security recommendations are drawn from our work establishing the U.S. security standards for microservices applications published by the National Institute of Standards and Technology (NIST) in the SP 800-204 series. You can read all of NIST’s security recommendations for microservices applications in our comprehensive guide.
Minimum controls. Zero trust at runtime requires at minimum the following five controls:
- Encrypt everything in transit: provide message authenticity and eavesdropping protection (SP 800-204, §MS-SS-4).
- Authenticate service-to-service communication: every application should authenticate the identity of the applications it communicates with (SP 800-204A, §SM-DR16; SP 800-204B, §APE-SR-3).
- Authorize service-to-service access: every application should authorize the applications it communicates with using their runtime identity (SP 800-204B, §SAUZ-SR-1).
- Authenticate end-user identity: every request must be authenticated at each hop in your service call graph (SP 800-204B, §EAUN-SR-1, §EUAZ-SR-3).
- Authorize end-user to resource access: every access to every resource should be authorized, not just once at the front door (SP 800-204B, §EAUZ-SR-3).
Istio provides encryption in transit (we discuss enabling this globally above), as well as authenticatable service identity (SPIFFE) and service-to-service access control (Istio AuthorizationPolicy). Further, it can be configured to authenticate some forms of end-user identity on behalf of applications (JWTs, OIDC tokens), and finally Istio supports pluggable authorization systems (Envoy’s ext_authz) to enforce end-user to resource access.
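For end-user authentication, a RequestAuthentication resource tells the sidecar how to validate JWTs, and a companion AuthorizationPolicy rejects requests that carry no valid token. The sketch below assumes a hypothetical issuer, JWKS URL and workload label; substitute your own identity provider's values.

```yaml
apiVersion: security.istio.io/v1beta1
kind: RequestAuthentication
metadata:
  name: payments-pci
  namespace: pci
spec:
  selector:
    matchLabels:
      app: payments                        # hypothetical workload label
  jwtRules:
  - issuer: "https://issuer.example.com"   # hypothetical issuer
    jwksUri: "https://issuer.example.com/.well-known/jwks.json"
---
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: payments-pci-require-jwt
  namespace: pci
spec:
  selector:
    matchLabels:
      app: payments
  action: DENY
  rules:
  - from:
    - source:
        notRequestPrincipals: ["*"]        # deny requests without an authenticated end-user identity
```

The second resource matters because RequestAuthentication on its own rejects invalid tokens but still lets requests with no token through.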
Install a restrictive default authorization policy. In line with Istio best practices, you should install a default authorization policy that allows no traffic, publishing AuthorizationPolicy objects for each service to manage what they’re allowed to communicate with (SP 800-204B, §SAUZ-SR-1). Two authorization policies that help accomplish that:
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: deny-all-audit
  namespace: istio-system
spec:
  action: AUDIT
  rules:
  - {}  # an empty rule matches every request; a policy with no rules matches nothing
Listing 2: An Istio AuthorizationPolicy that would deny all traffic, but instead audit logs it. You might run such a policy for a few weeks to develop an understanding of the policies you’ll need before you enable enforcement.

apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: deny-all
  namespace: istio-system
spec: {}  # or action: ALLOW
Listing 3: An Istio AuthorizationPolicy that denies all traffic. Alternatively, you can create a policy that ALLOWs but with an empty rule set, which is the same as an empty body.
Require mTLS for service-to-service communication by default. Encryption in transit should be set to strict (i.e., mTLS required to communicate with services) by configuring a PeerAuthentication resource in the root namespace managed by the security or platform team. Services outside the mesh calling applications in the mesh should communicate via the application ingress gateway, which can present simple TLS (or even clear text) to the external service since it is unlikely to have certificates to perform mTLS with the mesh. Services inside the mesh calling out should be configured to use simple TLS or clear text with a DestinationRule for the external service (NIST SP 800-204A, §SM-DR8).
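A mesh-wide strict mTLS policy is a small resource. The sketch below assumes istio-system is your root namespace.

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system   # root namespace, so the policy applies mesh-wide
spec:
  mtls:
    mode: STRICT            # sidecars reject plaintext service-to-service traffic
```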
TLS configuration defaults. Istio ships out of the box with a good TLS setup (TLS minimum version 1.2, with a limited set of cipher suites), but you may need to adjust it for your environment (e.g., to comply with FIPS 140-3 in a FedRAMP environment).
- Envoy supports configuring a minimum TLS version, and the set of supported cipher suites, per service by configuring gateways.
- We recommend enforcing TLS 1.3 as the minimum version if possible (which it is if you’re doing only mTLS Envoy-to-Envoy) and using gateways for external traffic that requires older or less secure TLS configurations.
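The two recommendations above could be expressed roughly as follows. This is a sketch under assumptions: the meshMTLS.minProtocolVersion setting requires a reasonably recent Istio release, and the gateway name, selector label, host and credential name are hypothetical.

```yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: controlplane
  namespace: istio-system
spec:
  meshConfig:
    meshMTLS:
      minProtocolVersion: TLSV1_3            # minimum version for mTLS between sidecars
---
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: legacy-clients                       # hypothetical gateway for external traffic
  namespace: istio-system
spec:
  selector:
    istio: ingressgateway                    # assumption: default ingress gateway label
  servers:
  - port:
      number: 443
      name: https
      protocol: HTTPS
    hosts:
    - "api.example.com"
    tls:
      mode: SIMPLE
      credentialName: api-example-com-cert   # hypothetical TLS secret
      minProtocolVersion: TLSV1_2            # older minimum only where external clients need it
```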
Assign a unique runtime identity to each service to facilitate expressive, fine-grained authorization policy and limit exposure to attacks. Assign a unique runtime identity to each service you’re deploying. In Kubernetes, do not use the default Kubernetes service account in each namespace, but assign a unique service account for each service in each namespace. Authorization policy can only be managed easily at the granularity of identity. When multiple runtime components share the same identity it is very hard to manage an access control policy that expresses your intended access without allowing too-broad access for some components using the shared identity. This results in a larger surface area exposed to attackers who might compromise one component of the system (NIST SP 800-204A §SM-DR11, §SM-DR18).
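As a sketch of what per-service identity enables, the resources below create a dedicated service account for the Payments service and allow only one named caller. The namespace, labels and the checkout caller are hypothetical, and the principal assumes the default cluster.local trust domain.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: payments            # referenced by serviceAccountName in the Payments pod spec
  namespace: pci
---
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: payments-pci
  namespace: pci
spec:
  selector:
    matchLabels:
      app: payments         # hypothetical workload label
  action: ALLOW
  rules:
  - from:
    - source:
        principals:
        - cluster.local/ns/pci/sa/checkout   # hypothetical caller identity
```

If checkout shared the namespace's default service account with other workloads, this rule would admit all of them, which is the too-broad access described above.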
Restrict service-to-service communication to the local namespace. By default, service-to-service communication should be restricted to the local namespace. Unfortunately, this can’t be written as a single AuthorizationPolicy in the root config namespace. Instead, a default AuthorizationPolicy allowing access only in the local namespace can be templatized as a default, and application teams should be allowed to write their own, more specialized (restricted) policies (NIST SP800-204B, §SAUZ-SR-1).
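A templatized per-namespace default might look like this sketch, where pci stands in for the namespace name your template substitutes. Matching on the source namespace relies on mTLS being enabled so that the caller's identity is known.

```yaml
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-same-namespace   # hypothetical name
  namespace: pci               # template parameter: the namespace being configured
spec:
  action: ALLOW
  rules:
  - from:
    - source:
        namespaces: ["pci"]    # template parameter: must match metadata.namespace
```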
Parting Thoughts and What’s Next
We hope these best practices gained from our years of experience helping customers build a successful service mesh practice will help facilitate your deployments. If you haven’t yet, take a look at the other posts in the service mesh best practices series:
- Part 1: How Service Mesh Layers Microservices Security with Traditional Security
- Part 2: Service Mesh Deployment Best Practices for Security and High Availability
For a comprehensive overview of the NIST standards for microservices security, download our free guide.
Next Up: Service Mesh Best Practices for Multi-Tenancy
In the next post in our series on service mesh best practices, we’ll discuss common tenancy decision points we see customers grappling with and focus on how the mesh helps to facilitate those decisions. The topics we’ll cover include Kubernetes cluster ownership, namespace ownership, configuration ownership, and how to use service mesh application gateways to mitigate shared-fate outages.
###
If you’re new to service mesh and Kubernetes security, we have a bunch of free online courses available at Tetrate Academy that will quickly get you up to speed with Istio and Envoy.
If you’re looking for a fast way to get to production with Istio, check out Tetrate Istio Distribution (TID), Tetrate’s hardened, fully upstream Istio distribution, with FIPS-verified builds and support available. It’s a great way to get started with Istio knowing you have a trusted distribution to begin with, an expert team supporting you and also have the option to get to FIPS compliance quickly if you need to.
As you add more apps to the mesh, you’ll need a unified way to manage those deployments and to coordinate the mandates of the different teams involved. That’s where Tetrate Service Bridge comes in. Learn more about how Tetrate Service Bridge makes service mesh more secure, manageable and resilient here, or contact us for a quick demo.