Organize Your Team to Meet SLOs and SLAs in the Cloud Native Era

As an engineering leader, how you organize your team to meet SLOs and SLAs in the era of cloud native architecture is crucial to your company's success. This article offers practical guidance on getting clarity about the current state of your teams and systems so you can make the right investments for the future. We'll also look at why consistency in edge metrics, like those delivered by Envoy Gateway, matters to the success of a monthly service review.
Tetrate offers an enterprise-ready, 100% upstream distribution of Istio, Tetrate Istio Subscription (TIS). TIS is the easiest way to get started with Istio for production use cases. TIS+, a hosted Day 2 operations solution for Istio, adds a global service registry, unified Istio metrics dashboard, and self-service troubleshooting.
Whether you are building B2B SaaS solutions, internal software, or B2C software, your work directly impacts the performance of your systems, which is critical to your business's success.
If you provide a SaaS solution, you understand the weight of ensuring that your software meets your clients' service level agreements (SLAs).
If you provide systems for critical business functions within your organization, you know that how well you meet your service level objectives (SLOs) directly impacts the success of the business.
If you have a B2C solution, you know how important it is to ensure that every user has a good experience with your software.
The Performance Clarity Challenges of Cloud Computing
While the era of cloud architecture has improved scalability and resiliency, tracking service quality levels has become increasingly complex. Distributed, multi-component systems often create more challenges than clarity when it comes to understanding how a system is doing.
As the number of components in your systems grows, it becomes increasingly difficult to get a clear picture of what is impacting your SLOs. It also gets harder to measure performance, pinpoint problems, and hold people accountable for taking action to improve it.
When examining SLAs and SLOs, we need to consider the quality delivered to clients and users. By stepping back from all the small components that make up your system, we can gain clarity: focus on the time between a request arriving at the boundary of your system and the response leaving it.
Set Your Team Up for Success
Establish goals, accountability, and transparency in your organization to clarify whether and where investments are necessary to meet SLAs and SLOs.
Establish Goals
Each team should be responsible for memorializing their SLOs. Not all services are equally critical, so SLOs will vary from service to service.
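As an illustration, here is a minimal sketch of what a memorialized SLO might look like when kept in version control alongside the service. The field names, service names, and targets below are hypothetical examples, not a prescribed format.

```python
from dataclasses import dataclass

@dataclass
class ServiceSLO:
    """A memorialized SLO for one service (hypothetical field names)."""
    service: str                  # service name
    owning_team: str              # team accountable for the SLO
    availability_target: float    # e.g. 0.999 means 99.9% of requests succeed
    latency_p95_ms: int           # 95th-percentile response time target
    measurement_window_days: int  # rolling window the targets apply to

# Not every service is equally critical, so targets differ per service.
checkout_slo = ServiceSLO("checkout-api", "payments", 0.999, 300, 30)
reporting_slo = ServiceSLO("reporting-api", "analytics", 0.99, 1500, 30)
```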
Establish Accountability
Ask each of your teams to appoint a person responsible for their team's service performance.
Organize a Monthly Service Review
Measure
It is no surprise that the first step is to measure. If you’re unaware of the status of service quality, you won’t be able to make decisions rooted in reality.
It can be overwhelming to pick what to measure, so start simple and focus on the most critical items. Three areas are directly linked to the quality of the service you provide to users (a measurement sketch follows this list):
- Outages
  - How many complete outages did we have, for how long, and which components were affected?
- Errors
  - Application error rate trends – what percentage of total traffic results in errors?
- Performance
  - API response performance trends – is it getting slower or faster?
  - Static content delivery performance – is it getting slower or faster?
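As a starting point, here is a minimal sketch of how the error and performance signals might be computed, assuming you can obtain per-request records (status code and duration) from your edge. The record shape is an assumption for illustration; outage data typically comes from your incident tooling rather than from code like this.

```python
from dataclasses import dataclass
from statistics import quantiles

@dataclass
class RequestRecord:
    """One request seen at the edge, e.g. parsed from an access log (assumed shape)."""
    status: int          # HTTP response code
    duration_ms: float   # time from request arrival to response

def error_rate_percent(requests: list[RequestRecord]) -> float:
    """Errors as a percentage of total traffic, treating 5xx responses as errors."""
    if not requests:
        return 0.0
    errors = sum(1 for r in requests if r.status >= 500)
    return 100.0 * errors / len(requests)

def p95_latency_ms(requests: list[RequestRecord]) -> float:
    """95th-percentile response time; compare month over month to see the trend."""
    durations = sorted(r.duration_ms for r in requests)
    if len(durations) < 2:
        return durations[0] if durations else 0.0
    return quantiles(durations, n=20)[-1]  # last of 19 cut points ~ p95
```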
Attribute
For clear accountability, ensure all metrics are appropriately attributed to the owning application and team.
Attribute performance metrics by the following dimensions (see the sketch after this list):
- Owning team
- Application
- Environment
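One way to bake that attribution in is to label every metric at the point of collection. The sketch below assumes the prometheus_client Python library and uses hypothetical metric names; the key idea is simply that team, application, and environment travel with every data point.

```python
from prometheus_client import Counter, Histogram

# Every data point carries the labels needed to attribute it later.
EDGE_REQUESTS = Counter(
    "edge_requests_total",
    "Requests observed at the edge",
    ["team", "application", "environment", "status_class"],
)
EDGE_LATENCY = Histogram(
    "edge_request_duration_seconds",
    "Edge request duration in seconds",
    ["team", "application", "environment"],
)

def record_request(team: str, application: str, environment: str,
                   status: int, duration_s: float) -> None:
    """Attribute one request to its owning team, application, and environment."""
    EDGE_REQUESTS.labels(team, application, environment, f"{status // 100}xx").inc()
    EDGE_LATENCY.labels(team, application, environment).observe(duration_s)
```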
Teams might argue that they depend on underlying systems that aren't able to meet their applications' performance needs. However, accountability is essential here. As a leader, you must empower the person who owns a product's service quality to have the necessary conversations with the owners of the systems they depend on. Remember that the solution often sits on both sides of the fence.
Report
Organize the findings into a digestible monthly report, broken down by team and application.
Ask all service performance owners to add commentary to the report to shed light on the outages, errors, and performance changes.
Ask all service performance owners to add any action points taken in the past month and any planned actions expected to impact performance, positively or negatively.
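Once metrics are attributed, the report itself can be generated mechanically. Below is a rough sketch that renders a plain-text summary grouped by team and application; the row shape and field names are assumptions for illustration.

```python
from collections import defaultdict

def monthly_report(rows: list[dict]) -> str:
    """Render a plain-text monthly summary, broken down by team and application.

    Each row is assumed to look like:
    {"team": ..., "application": ..., "outage_minutes": ...,
     "error_rate_pct": ..., "p95_latency_ms": ..., "owner_commentary": ...}
    """
    by_team: dict[str, list[dict]] = defaultdict(list)
    for row in rows:
        by_team[row["team"]].append(row)

    lines = []
    for team, apps in sorted(by_team.items()):
        lines.append(f"Team: {team}")
        for app in sorted(apps, key=lambda r: r["application"]):
            lines.append(
                f"  {app['application']}: outages {app['outage_minutes']} min, "
                f"errors {app['error_rate_pct']:.2f}%, p95 {app['p95_latency_ms']} ms"
            )
            lines.append(f"    owner notes: {app['owner_commentary']}")
    return "\n".join(lines)
```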
Review
Set up a monthly meeting to review performance with all service performance owners, the owners of underlying systems and infrastructure, and internal business stakeholders. Make sure your operations team is involved in this meeting.
Act
Take action to address issues in service quality: tackle performance problems with concrete engineering work and measure the impact of each change.
Pick Edge Components That Make It Easy to Measure Performance
Consistency in edge metrics will make measuring and reporting much easier. It is important to pick a solution that provides the granularity needed to measure performance and attribute it appropriately.
Envoy Proxy, a mature reverse proxy originally developed at Lyft, allows you to capture rich metrics from requests. It enables you to measure performance, collect metrics data to attribute performance, and use it to report on service performance.
However, without a scalable control plane, Envoy Proxy can be difficult to manage and configure. The easiest way to use Envoy Proxy to handle incoming requests to your system is to run it as a Kubernetes Gateway managed by Envoy Gateway.
While Envoy Gateway enables you to use Envoy Proxy as a Kubernetes Gateway, it can route traffic to services both inside and outside Kubernetes, giving you a consistent technology component at the edge regardless of whether your services are hosted on Kubernetes.
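To tie this back to the monthly review, the sketch below shows one way to pull the status code and duration used earlier out of Envoy's access logs. It assumes you have configured a JSON access-log format whose keys include "response_code" and "duration" (in milliseconds); the actual keys depend entirely on the log format you configure for your gateway.

```python
import json
from typing import Optional

def parse_edge_log_line(line: str) -> Optional[dict]:
    """Extract the status code and duration from one Envoy JSON access-log entry.

    Assumes an access-log format that maps %RESPONSE_CODE% to "response_code"
    and %DURATION% to "duration" (in milliseconds); adjust the keys to match
    whatever JSON format your gateway is actually configured with.
    """
    try:
        entry = json.loads(line)
        return {
            "status": int(entry["response_code"]),
            "duration_ms": float(entry["duration"]),
        }
    except (ValueError, KeyError):
        return None  # skip malformed or non-JSON lines
```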