Why Intuitive Troubleshooting Has Stopped Working for You
www.honeycomb.io

We’re operating distributed microservice ecosystems on top of a deep stack of frameworks, abstractions, and runtimes that are all running on other people’s servers (aka “the cloud”). The days of naming servers after Greek gods and sshing into a box to run tail and top are long gone for most of us.

The move to these modern architectures is for good reason. Engineering organizations are under constant pressure to deliver more value, in less time, at an ever-increasing pace.

Towering monoliths and artisanally handcrafted server configurations simply can’t compete with the scalability and flexibility of small, independently deployable services, managed by a multitude of teams, and running on top of elastically scaling infrastructure.

However, this shift has come at a cost. Our systems moved from the realm of the complicated into the realm of the complex; and with that shift, we have discovered that traditional approaches to understanding and troubleshooting production environments simply will not work in this new world. [..] In the past, we could understand our complicated systems by troubleshooting based on experience and known unknowns: What’s the CPU load, how many successful logins have we had in the last hour, what’s the average latency of each API endpoint?

We primarily relied on pre-configured dashboards that could answer those standard questions. Sometimes we dug a bit deeper, with logs or some additional ad hoc queries, but the primary tools for understanding the behavior of our systems were oriented toward fixed, aggregate analysis.
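
To make this concrete, a dashboard’s “average latency of each API endpoint” panel boils down to a fixed aggregation along the lines of the sketch below (the record shape and field names such as endpoint and duration_ms are hypothetical):

    from collections import defaultdict

    def average_latency_by_endpoint(requests):
        """The classic dashboard question: mean latency per API endpoint.

        Each request is assumed to be a dict with (hypothetical) keys
        'endpoint' and 'duration_ms'. The aggregation answers the one
        pre-configured question and discards every other detail.
        """
        totals = defaultdict(lambda: [0.0, 0])  # endpoint -> [sum, count]
        for req in requests:
            bucket = totals[req["endpoint"]]
            bucket[0] += req["duration_ms"]
            bucket[1] += 1
        return {ep: s / n for ep, (s, n) in totals.items()}

    requests = [
        {"endpoint": "/login", "duration_ms": 42.0},
        {"endpoint": "/login", "duration_ms": 58.0},
        {"endpoint": "/search", "duration_ms": 310.0},
    ]
    print(average_latency_by_endpoint(requests))
    # {'/login': 50.0, '/search': 310.0}

The aggregate answers the question it was built for and nothing else; once the numbers are rolled up, the individual requests behind them are gone.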

Today, tooling that only provides a pre-formed, aggregated view is no longer sufficient. Understanding complex systems requires probing them in exploratory and open-ended ways: formulating a series of ad hoc, very specific questions about system behavior, examining the results across various dimensions, and then formulating new questions, all within a tight feedback loop.
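
A toy sketch of that loop, assuming raw, per-request events that can be filtered and grouped by any field (the event shape and field names below are invented for illustration):

    from collections import Counter

    def group_count(events, key, where=None):
        """Ad hoc question: count events grouped by an arbitrary dimension,
        optionally filtered by a predicate. Any field can be the group key."""
        return Counter(e[key] for e in events if where is None or where(e))

    # Hypothetical wide events, one per request.
    events = [
        {"endpoint": "/checkout", "status": 500, "customer_id": "c-841", "region": "eu-west"},
        {"endpoint": "/checkout", "status": 200, "customer_id": "c-112", "region": "us-east"},
        {"endpoint": "/checkout", "status": 500, "customer_id": "c-841", "region": "eu-west"},
        {"endpoint": "/search",   "status": 200, "customer_id": "c-007", "region": "us-east"},
    ]

    # Q1: where are the errors? -> group failures by endpoint
    print(group_count(events, "endpoint", where=lambda e: e["status"] >= 500))

    # Q2: the checkout errors -- which customers? -> break down by customer_id
    print(group_count(events, "customer_id",
                      where=lambda e: e["endpoint"] == "/checkout" and e["status"] >= 500))

    # Q3: is it one region? -> same failures, new dimension
    print(group_count(events, "region",
                      where=lambda e: e["endpoint"] == "/checkout" and e["status"] >= 500))

Each answer suggests the next question; the only demand on the tooling is that any field can serve as a filter or a grouping dimension, so no question has to be anticipated in advance.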

This need for ad hoc exploration and dissection has led to the rise of a new class of tools: observability. Observability allows us to probe deep into our systems to understand behavior, down to the level of individual requests between services.

It lets you roll up those individual behaviors into aggregate trends across arbitrary dimensions, or break those trends down at any resolution, right down to a single customer ID. Observability tools provide the capabilities necessary to move through multiple turns of an OODA (observe, orient, decide, act) loop extremely rapidly, building understanding as you go.
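
What makes that possible is instrumentation that records one wide, structured event per request, carrying high-cardinality fields such as a trace ID and a customer ID. The sketch below is illustrative only; the field names are invented, and in practice tooling such as OpenTelemetry or Honeycomb’s SDKs manages this context for you:

    import json, time, uuid

    def handle_request(endpoint, customer_id, handler):
        """Wrap a request handler so it emits one wide, structured event.

        High-cardinality fields such as trace_id and customer_id are what
        let a query engine both aggregate across any dimension and drill
        down to a single request. (Field names here are illustrative.)
        """
        event = {
            "trace_id": str(uuid.uuid4()),
            "endpoint": endpoint,
            "customer_id": customer_id,
        }
        start = time.monotonic()
        try:
            handler()
            event["status"] = 200
        except Exception as exc:
            event["status"] = 500
            event["error"] = repr(exc)
        event["duration_ms"] = (time.monotonic() - start) * 1000
        print(json.dumps(event))  # in production: ship to the observability backend

    handle_request("/checkout", "c-841", lambda: time.sleep(0.01))

Because every event keeps its raw fields, the same data supports both the aggregate trend line and the drill-down to one specific request.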
