Organize Your Team to Meet SLOs and SLAs in the Cloud Native Era

As an engineering leader, how you organize your team to meet SLOs and SLAs in the era of cloud native architecture is crucial to your company's success. This article offers practical guidance on getting clarity about the current state of your teams and systems so you can make the right investments for the future. We'll also look at why consistency in edge metrics, like those delivered by Envoy Gateway, matters to the success of a monthly service review.
Tetrate offers an enterprise-ready, 100% upstream distribution of Istio, Tetrate Istio Subscription (TIS). TIS is the easiest way to get started with Istio for production use cases. TIS+, a hosted Day 2 operations solution for Istio, adds a global service registry, unified Istio metrics dashboard, and self-service troubleshooting.
Whether you are building B2B SaaS solutions, internal software, or B2C software, your work directly impacts the performance of your systems, which is critical to your business's success.
If you provide a SaaS solution, you understand the weight of ensuring that your software meets your clients' service level agreements (SLAs).
If you provide systems for critical business functions within your organization, you know that how well you meet your service level objectives (SLOs) directly impacts the success of the business.
If you have a B2C solution, you know how important it is to ensure that every user has a good experience with your software.
The Performance Clarity Challenges of Cloud Computing
While the era of cloud architecture has improved scalability and resiliency, tracking service quality levels has become increasingly complex. Distributed, multi-component systems often create more challenges than clarity when it comes to understanding how a system is doing.
As the number of components in your systems grows, it becomes increasingly difficult to get a clear picture of what is impacting your SLOs. It also gets harder to measure performance, pinpoint problems, and hold people accountable for taking action to improve it.
When examining SLAs and SLOs, we need to consider the quality delivered to clients and users. By stepping back from all the small components that make up your system, we can gain clarity: focus on the time between a request arriving at the boundary of your system and the response leaving it.
Set Your Team Up for Success
Establish goals, accountability, and transparency in your organization to clarify whether and where investments are necessary to meet SLAs and SLOs.
Establish Goals
Each team should be responsible for memorializing their SLOs. Not all services are equally critical, so SLOs will vary from service to service.
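As an illustration, here is a minimal sketch of what a memorialized SLO might look like when kept in version control alongside the service. The field names, service names, and targets below are hypothetical examples, not a prescribed format.

```python
from dataclasses import dataclass

@dataclass
class ServiceSLO:
    """A memorialized SLO for one service (hypothetical field names)."""
    service: str                  # service name
    owning_team: str              # team accountable for the SLO
    availability_target: float    # e.g. 0.999 means 99.9% of requests succeed
    latency_p95_ms: int           # 95th-percentile response time target
    measurement_window_days: int  # rolling window the targets apply to

# Not every service is equally critical, so targets differ per service.
checkout_slo = ServiceSLO("checkout-api", "payments", 0.999, 300, 30)
reporting_slo = ServiceSLO("reporting-api", "analytics", 0.99, 1500, 30)
```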
Establish Accountability
Ask each of your teams to appoint a person responsible for their team's service performance.
Organize a Monthly Service Review
Measure
It is no surprise that the first step is to measure. If you’re unaware of the status of service quality, you won’t be able to make decisions rooted in reality.
It can be overwhelming to pick what to measure, so start simple and focus on the most critical items. Three areas are directly linked to the quality of the service you provide to users (a measurement sketch follows this list):
- Outages
  - How many complete outages did we have, for how long, and which components were affected?
- Errors
  - Application error rate trends – what percentage of total traffic results in errors?
- Performance
  - API response performance trends – is it getting slower or faster?
  - Static content delivery performance – is it getting slower or faster?
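As a starting point, here is a minimal sketch of how the error and performance signals might be computed, assuming you can obtain per-request records (status code and duration) from your edge. The record shape is an assumption for illustration; outage data typically comes from your incident tooling rather than from code like this.

```python
from dataclasses import dataclass
from statistics import quantiles

@dataclass
class RequestRecord:
    """One request seen at the edge, e.g. parsed from an access log (assumed shape)."""
    status: int          # HTTP response code
    duration_ms: float   # time from request arrival to response

def error_rate_percent(requests: list[RequestRecord]) -> float:
    """Errors as a percentage of total traffic, treating 5xx responses as errors."""
    if not requests:
        return 0.0
    errors = sum(1 for r in requests if r.status >= 500)
    return 100.0 * errors / len(requests)

def p95_latency_ms(requests: list[RequestRecord]) -> float:
    """95th-percentile response time; compare month over month to see the trend."""
    durations = sorted(r.duration_ms for r in requests)
    if len(durations) < 2:
        return durations[0] if durations else 0.0
    return quantiles(durations, n=20)[-1]  # last of 19 cut points ~ p95
```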
Attribute
For clear accountability, ensure all metrics are appropriately attributed to the owning application and team.
Attribute performance metrics by the following dimensions (see the sketch after this list):
- Owning team
- Application
- Environment
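One way to bake that attribution in is to label every metric at the point of collection. The sketch below assumes the prometheus_client Python library and uses hypothetical metric names; the key idea is simply that team, application, and environment travel with every data point.

```python
from prometheus_client import Counter, Histogram

# Every data point carries the labels needed to attribute it later.
EDGE_REQUESTS = Counter(
    "edge_requests_total",
    "Requests observed at the edge",
    ["team", "application", "environment", "status_class"],
)
EDGE_LATENCY = Histogram(
    "edge_request_duration_seconds",
    "Edge request duration in seconds",
    ["team", "application", "environment"],
)

def record_request(team: str, application: str, environment: str,
                   status: int, duration_s: float) -> None:
    """Attribute one request to its owning team, application, and environment."""
    EDGE_REQUESTS.labels(team, application, environment, f"{status // 100}xx").inc()
    EDGE_LATENCY.labels(team, application, environment).observe(duration_s)
```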
Teams might argue that they depend on underlying systems that aren't able to meet their applications' performance needs. However, accountability is essential here. As a leader, you must empower the person who owns a product's service quality to have the necessary conversations with the owners of the systems they depend on. Remember that the solution often sits on both sides of the fence.
Report
Organize the findings into a digestible monthly report, broken down by team and application.
Ask all service performance owners to add commentary to the report to shed light on the outages, errors, and performance changes.
Ask all service performance owners to add any action points taken in the past month and any planned actions expected to impact performance, positively or negatively.
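Once metrics are attributed, the report itself can be generated mechanically. Below is a rough sketch that renders a plain-text summary grouped by team and application; the row shape and field names are assumptions for illustration.

```python
from collections import defaultdict

def monthly_report(rows: list[dict]) -> str:
    """Render a plain-text monthly summary, broken down by team and application.

    Each row is assumed to look like:
    {"team": ..., "application": ..., "outage_minutes": ...,
     "error_rate_pct": ..., "p95_latency_ms": ..., "owner_commentary": ...}
    """
    by_team: dict[str, list[dict]] = defaultdict(list)
    for row in rows:
        by_team[row["team"]].append(row)

    lines = []
    for team, apps in sorted(by_team.items()):
        lines.append(f"Team: {team}")
        for app in sorted(apps, key=lambda r: r["application"]):
            lines.append(
                f"  {app['application']}: outages {app['outage_minutes']} min, "
                f"errors {app['error_rate_pct']:.2f}%, p95 {app['p95_latency_ms']} ms"
            )
            lines.append(f"    owner notes: {app['owner_commentary']}")
    return "\n".join(lines)
```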
Review
Set up a monthly meeting to review performance with all service performance owners, the owners of underlying systems and infrastructure, and internal business stakeholders. Make sure your operations team is involved in this meeting.
Act
Take action to address issues in service quality: tackle performance problems with concrete engineering work and measure the impact of each change.
Pick Edge Components That Make It Easy to Measure Performance
Consistency in edge metrics will make measuring and reporting much easier. It is important to pick a solution that provides the granularity needed to measure performance and attribute it appropriately.
Envoy Proxy, a mature reverse proxy originally developed at Lyft, allows you to capture rich metrics from requests. It enables you to measure performance, collect metrics data to attribute performance, and use it to report on service performance.
However, without a scalable control plane, Envoy Proxy can be difficult to manage and configure. The easiest way to use Envoy Proxy to handle incoming requests to your system is to run it as a Kubernetes Gateway managed by Envoy Gateway.
While Envoy Gateway enables you to use Envoy Proxy as a Kubernetes Gateway, it can route traffic to services both inside and outside Kubernetes, giving you a consistent technology component at the edge regardless of whether your services are hosted on Kubernetes.
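To tie this back to the monthly review, the sketch below shows one way to pull the status code and duration used earlier out of Envoy's access logs. It assumes you have configured a JSON access-log format whose keys include "response_code" and "duration" (in milliseconds); the actual keys depend entirely on the log format you configure for your gateway.

```python
import json
from typing import Optional

def parse_edge_log_line(line: str) -> Optional[dict]:
    """Extract the status code and duration from one Envoy JSON access-log entry.

    Assumes an access-log format that maps %RESPONSE_CODE% to "response_code"
    and %DURATION% to "duration" (in milliseconds); adjust the keys to match
    whatever JSON format your gateway is actually configured with.
    """
    try:
        entry = json.loads(line)
        return {
            "status": int(entry["response_code"]),
            "duration_ms": float(entry["duration"]),
        }
    except (ValueError, KeyError):
        return None  # skip malformed or non-JSON lines
```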