This post is part of a series addressing complex software concerns such as built-in security, compliance, IAM, cost-to-serve, and more. The goal is to share expertise on building and maintaining better complex software systems: what has worked for the author across multiple roles, from start-ups at various stages to enterprise environments, and from customer-facing SaaS products to internal infrastructure.
First off, let’s talk about observability.
A team with strong ownership of the software services they run needs to figure out their own approach to observability; one-size-fits-all solutions are not efficient. This post attempts to strip complex-system observability down to the bare bones, so that the reader can make better choices about tooling, processes, and where effort is invested.
Feel free to skip to the end for the CONCLUSION, or, a TLDR.
Observability is nuanced. Different companies give it different degrees of importance and take a variety of approaches to it. How a small company implements observability will differ from how Google does it, even if both follow Google SRE recommendations, and different parts of an organization will have different observability priorities.
THE SYSTEM: A NON-DOMAIN-SPECIFIC MODEL
This section presents a simple complex-systems model of a system, taken completely out of the software context. Stepping outside the domain lets us set aside heavily loaded, topic-specific terms.
In implementing observability processes, there are several subsystems we need to consider: the engineers, the customers (the end users of the software product, whether internal teams or external customers), and the product (the services and processes a team maintains). These subsystems are parts of larger overlapping and interconnected systems.
For the purpose of simplifying this discussion, let’s only look at the 3 stocks of these subsystems:
team’s engineering resources
customer happiness
product development error budget
These stocks will accumulate or deplete over time in an interrelated way even with no intervention, for example (see also the sketch after this list):
Depleting the product error budget will lead to a depletion in customer happiness.
Allocating engineering resources to the restoration of product error budget will lead to a restoration of customer happiness.
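To make the model concrete, here is a minimal Python sketch of the three stocks and the flows between them over one cycle. The class, the coefficients, and the starting numbers are all illustrative assumptions, not a proposal for how to actually quantify these stocks.

```python
from dataclasses import dataclass


@dataclass
class Stocks:
    engineering_resources: float  # e.g. person-days available this cycle (made-up unit)
    customer_happiness: float     # e.g. a 0-100 satisfaction proxy (made-up scale)
    error_budget: float           # remaining fraction of allowed unreliability, 0.0-1.0


def step(stocks: Stocks, effort_on_reliability: float) -> Stocks:
    """Advance the model one cycle: flows move value between the three stocks."""
    # Spending engineering resources replenishes the error budget...
    replenished = min(effort_on_reliability * 0.01, 1.0 - stocks.error_budget)
    # ...and a healthier error budget lifts customer happiness, while a depleted one erodes it.
    happiness_delta = 10.0 * (stocks.error_budget + replenished - 0.5)
    return Stocks(
        engineering_resources=stocks.engineering_resources - effort_on_reliability,
        customer_happiness=max(0.0, min(100.0, stocks.customer_happiness + happiness_delta)),
        error_budget=stocks.error_budget + replenished,
    )


print(step(Stocks(engineering_resources=40.0, customer_happiness=70.0, error_budget=0.3),
           effort_on_reliability=10.0))
```

Even a toy like this makes the trade-off visible: the same engineering resources that refill the error budget are no longer available for anything else.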
A team’s observability stack should enable the team to understand the flows between these three stocks, so that influence can be applied in a precise and effective way. This is the key to observability: avoiding conjecture-led consequences by carrying out evidence-based actions.
Examples of patterns teams should actively aim to avoid:
Allocating engineering resources to the restoration of a product error budget that has no influence on customer happiness.
Over-committing engineering resources to product error budgets such that not enough is left for feature development, or, worse yet, depleting them to the point of being non-renewable (e.g. engineering team burnout).
etc.
Also missing from the model so far is the concept of delays: any change made to the system will not create an effect immediately. There will always be a delay in the effects of our efforts:
Using more engineering resources to replenish the error budget will see the error budget refilled in the future, not right away.
A replenished error budget will see restored customer happiness in the future, not right away.
A prolonged delay means we will not be able to accurately estimate whether the adjustments we’ve made to the flows between stocks are correct and effective. Due to this, there are some critical links in our system that rely heavily on assumptions. We cannot avoid making assumptions, but we can reduce mistakes made by rigorously reexamining our assumptions.
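Delays can be added to the same sketch: effort invested now is queued and only shows up in the error budget several cycles later. The delay length and all the numbers below are made-up assumptions for illustration.

```python
from collections import deque

EFFECT_DELAY_CYCLES = 3                          # assumed lag between effort and visible effect
pending_effects = deque([0.0] * EFFECT_DELAY_CYCLES)
error_budget = 0.3                               # made-up starting value


def invest(effort: float) -> None:
    """Queue the effect of reliability work; it becomes visible only after the delay."""
    pending_effects.append(effort * 0.01)


def tick() -> float:
    """Advance one cycle: only effects queued long enough ago apply now."""
    global error_budget
    error_budget = min(1.0, error_budget + pending_effects.popleft())
    pending_effects.append(0.0)                  # keep the pipeline the same length
    return error_budget


invest(10)
print([round(tick(), 2) for _ in range(5)])      # [0.3, 0.3, 0.3, 0.4, 0.4]: nothing moves until cycle 4
```

The point of the sketch is only that the feedback we use to judge our adjustments arrives late, which is exactly why the assumptions behind those adjustments need to be revisited.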
ON-GOING OBSERVABILITY MAINTENANCE SHOULD BE ITERATIVE
A team with strong ownership of services should have an ongoing process that they use to ask questions and validate assumptions.
This process should be structured; I will talk about it in a later blog post.
Eventual ideal state of observability: SLO-based alerts are the only high priority ones that would wake people up.
This means (a) the team is preserving engineering resources by ensuring the quality of their on-call experience, and (b) the team isn’t missing alerts that are critical to customer happiness and system reliability.
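As a rough illustration of that ideal state, the sketch below routes only SLO-backed alerts to the pager and turns everything else into a ticket for business hours. The Alert shape, the field names, and the example alert names are hypothetical and not tied to any particular alerting tool.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass
class Alert:
    name: str
    slo_backed: bool  # is this alert derived from an SLO / error-budget burn?


def route(alert: Alert) -> Literal["page", "ticket"]:
    """Page a human only for SLO-backed alerts; file everything else as a ticket."""
    return "page" if alert.slo_backed else "ticket"


print(route(Alert("checkout-availability-burn-rate", slo_backed=True)))   # page
print(route(Alert("cpu-above-80-percent", slo_backed=False)))             # ticket
```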
THE DATA WE COLLECT SHOULD BE DRIVEN BY WHAT WE WANT TO OBSERVE, NOT THE OTHER WAY AROUND.
If something a team needs to observe requires the collection of additional data, then effort should be invested in collecting that data.
Oftentimes teams fall into the anti-pattern of building alerts around whatever data is already available. This anti-pattern is part of the natural decay of a system under constrained resources, and being mindful of it is the first step toward amelioration.
ALERTING DRIVEN BY AN SLO-MINDSET
SLOs are a loaded concept rife with interpretations. Here we specifically refer to SLOs as statements we can make about a service that are mission critical.
To be clear, when we speak of SLOs, they are really just an organized way of presenting certain alerts and a manner of wrapping analytics up in a statement. Their most helpful aspect is zeroing in on the high-impact areas that warrant collecting data for.
For example: the mission-critical statements for a fridge are to (1) keep food in the fridge at 2–4 °C, (2) keep the freezer below 0 °C, and (3) be up 100% of the time (a small code sketch of this exercise follows the list below). If a service-owning team can do this exercise for every service they own, then they can identify what they should be alerting on with more precision, which is exactly what they want, because then:
(1) They only respond to pages outside of business hours that are critical and have high impact.
(2) They have data to guide them on how much effort they should be spending on active development vs. maintenance.
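Here is a minimal sketch of the fridge exercise, assuming some made-up measurements: each mission-critical statement becomes a check, and only the violated statements are candidates for paging.

```python
# Thresholds mirror the fridge example above; the measurement values are invented.
FRIDGE_SLOS = {
    "fridge compartment at 2-4 °C": lambda m: 2.0 <= m["fridge_temp_c"] <= 4.0,
    "freezer below 0 °C":           lambda m: m["freezer_temp_c"] < 0.0,
    "powered on 100% of the time":  lambda m: m["powered_on"],
}

measurements = {"fridge_temp_c": 5.1, "freezer_temp_c": -18.0, "powered_on": True}

violations = [name for name, check in FRIDGE_SLOS.items() if not check(measurements)]
print(violations)  # ['fridge compartment at 2-4 °C'] -> the only thing worth paging on
```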
CONCLUSION, or, a TLDR
A team owning one or more subsystems within a complex system should aim to find an approach adapted to their needs and standardize around it. While best practices and guidelines are available, it is important to use them as a starting point and tailor both the purpose and the methodology to fit.
The observability goals of a service-owning team could be defined as such:
Ensure the observability data captured by services enables the timely and reliable detection of issues impacting the system relying on these services.
Empower operational excellence through hygienic and consistent observability tooling and processes.
Or in simpler terms:
Make customers happy — only fix what’s useful
Work efficiently — prioritize working on what’s important
Reduce toil — only respond to necessary pages.
The service-owning team can begin to simplify the view of their system by consolidating it to three stocks:
Customer happiness
Team’s engineering resources
Product development error budget
The service-owning team can achieve these goals by:
Setting up on-going iterative processes for observability maintenance.
Being aware of anti-patterns around setting up alerts around available and low-hanging-fruit data.
Building alerts around mission-critical statements (SLOs) and making data-driven decisions.
Future parts of this series will share templates for setting up these processes.