• 9 min read

Triage: The quickest way to find the needle in the haystack

In terms of observability, we live in fascinating times. OpenTelemetry has democratized the collection of telemetry beyond the inflection point: proprietary agents are a thing of the past, although some parts of our industry will take longer than others to realize it. But every development brings its own challenges.

On telemetry, haystacks, and unorthodox solutions

Making telemetry trivial to collect has brought a second problem: we are collecting lots of telemetry. We set up the collection once for many things that sound useful, and we let the bits flow and pile up forevermore.

However, the more telemetry we have, the harder it can be to extract insights from it. (Not to mention the toll on the budget.)

Think of your telemetry as a haystack: the larger it is, the harder it can be for you to find the needle in it. Reducing the size of the haystack is definitely part of the solution, and yesterday we introduced you to Spam Filters. But we want Dash0 to go beyond: we want to increase the efficacy of your telemetry. This is why Dash0 has features like AI-inferred log severities and Semantic Convention upgrades.

Nevertheless, irrespective of how large the haystack is or how good the hay is, finding a needle in it is never a good experience, especially one you go through daily or under the heavy pressure of production incidents.

To find all the needles you need in your haystacks, you could invest a lot of time. You could also enlist the help of others, who could also invest a lot of time. Or, you could pull out the most powerful magnet you can lay your hands on, and find your needles in a tiny fraction of the time and effort.

Troubleshooting and the human condition

At Dash0, we have a saying: telemetry, without context, is just data. For example, without knowing which system generates errors, you cannot assess how important those errors are, i.e., the impact.

OpenTelemetry, thanks to the semantic conventions, has given us an entire vocabulary to talk about the systems we monitor in ways that observability tools can understand and capitalize on, and at Dash0 we work every day to squeeze every bit of value out of that metadata. We like telemetry metadata so much, in fact, that we built our pricing model with the goal of incentivizing you to send all the metadata you care about.

When you troubleshoot errors, you rely on this metadata to make inferences and decisions. You read the telemetry, understand it, and formulate hypotheses about what the cause might be. That is as good a process as unaided humans can manage, but it is limited by our own bandwidth: we can take in only so much information at once. We have blind spots and preconceptions (“It’s always DNS!”). At a certain scale, what you see is just a haystack. You can formulate hypotheses to troubleshoot only as fast as you can take data in.

So, let Dash0 offer hypotheses for you, fast and at scale, based on the same metadata you rely on.

Dash0 Triage: Let the machine find your needles for you

We taught Dash0 to leverage the wealth of metadata to find commonalities, differences, and patterns in your telemetry. Dash0’s Triage does just this: you draw a box around what interests you, and Dash0 tells you why it is so very interesting.

Let’s see how Triage works in practice using the “Product catalog failure” use case of the OpenTelemetry Demo application. The productcatalogservice consistently fails when the product with id OLJCESPC7Z is looked up. This is immediately visible in Dash0 when grouping on app.product.id (grouping is also a brand-new feature in Dash0), but getting there assumes you already know what to look for:

Dash0’s Grouping is very powerful. And to get you to group on the right things, there’s Triage!
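In essence, grouping on an attribute means bucketing spans by that attribute’s value and summarizing each bucket. A minimal sketch of that idea, using hypothetical span records (the data and helper name are ours, not Dash0’s API):

```python
from collections import defaultdict

# Hypothetical span records: (attributes, is_error). In a real system these
# would come from a tracing backend; app.product.id is the custom attribute
# used by the OpenTelemetry Demo's product catalog service.
spans = [
    ({"app.product.id": "OLJCESPC7Z"}, True),
    ({"app.product.id": "OLJCESPC7Z"}, False),
    ({"app.product.id": "66VCHSJNUP"}, False),
    ({"app.product.id": "66VCHSJNUP"}, False),
]

def error_rate_by_group(spans, key):
    """Group spans by an attribute's value and report each group's error rate."""
    totals, errors = defaultdict(int), defaultdict(int)
    for attrs, is_error in spans:
        value = attrs.get(key, "(unset)")
        totals[value] += 1
        if is_error:
            errors[value] += 1
    return {value: errors[value] / totals[value] for value in totals}

print(error_rate_by_group(spans, "app.product.id"))
# → {'OLJCESPC7Z': 0.5, '66VCHSJNUP': 0.0}
```

Grouped this way, the failing product id jumps out immediately; the catch, as noted above, is that you must already suspect app.product.id is the attribute worth grouping on.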

Instead, let’s begin our journey from the Service list, where we see that the productcatalogservice is under the weather:

Like everything in Dash0, the Service list uses color very intentionally to guide your attention to where it’s needed the most!

From the Service list, one click on the productcatalogservice opens the sidebar, which provides access to a lot of information, including the Telemetry tab with an overview of the log, tracing, and metric data associated with this service. The preview of the Tracing Heat Map shows errors and duration outliers scattered all over the place. The View in Tracing button lets you drill into the tracing data for productcatalogservice.

From the Service sidebar, it is one click to drill down into the Tracing data for a service.

The Tracing Heat Map indeed shows a service with errors all over the place. The productcatalogservice has a significant amount of errors (shown as the redness of the dots). Let’s start a Triage with the plume in front of our eyes; it seems representative enough of the state of the service.

Red a bit everywhere; let’s start from one of the plumes of the Heat Map.

Triage compares the spans in the selection with those outside. We have not selected spans on the lowest bucket of the Heat Map, as they do not have errors, and the result is pretty striking.

First taste of Triage, and the offending product identifier is the first result returned!

And that’s it: the product id OLJCESPC7Z is staring us directly in the face as the most probable correlation!

Tooltips provide additional information about the correlation, including the possibility to filter for or filter out specific attribute values.

Hovering over the OLJCESPC7Z product identifier shows us the correlation, expressed in terms of how often it occurs in the selection compared with the rest. Finding this out by looking at individual spans would have taken a long time, as not every request for the OLJCESPC7Z product identifier results in an error:

About 9.1% of requests with the OLJCESPC7Z product identifier result in errors. It would have been neither easy nor fast to find out this correlation without Triage.
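The comparison Triage surfaces here can be sketched as a frequency gap: for each attribute value, how often does it occur among the selected spans versus the baseline? This is our reconstruction of the idea, not Dash0’s actual algorithm, and the data is hypothetical:

```python
# Minimal sketch: rank (attribute, value) pairs by how much more frequent
# they are inside the selection than in the baseline.

def attribute_correlations(selected, baseline):
    """Return rows (key, value, freq_in_selection, freq_in_baseline),
    sorted by the largest frequency gap first."""
    def frequencies(spans):
        counts = {}
        for attrs in spans:
            for pair in attrs.items():
                counts[pair] = counts.get(pair, 0) + 1
        return {pair: n / len(spans) for pair, n in counts.items()}

    sel, base = frequencies(selected), frequencies(baseline)
    rows = [
        (key, value, sel.get((key, value), 0.0), base.get((key, value), 0.0))
        for key, value in set(sel) | set(base)
    ]
    return sorted(rows, key=lambda r: r[2] - r[3], reverse=True)

# Hypothetical spans: the selection is dominated by one product id.
selected = 9 * [{"app.product.id": "OLJCESPC7Z"}] + 1 * [{"app.product.id": "66VCHSJNUP"}]
baseline = 2 * [{"app.product.id": "OLJCESPC7Z"}] + 8 * [{"app.product.id": "66VCHSJNUP"}]

print(attribute_correlations(selected, baseline)[0])
# → ('app.product.id', 'OLJCESPC7Z', 0.9, 0.2)
```

A value that appears in 90% of the selection but only 20% of the baseline is exactly the kind of standout Triage puts at the top of the list.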

But there’s more to Triage!

Triage is a quintessential Dash0 experience. We care a lot about your user experience in Dash0. We are building Dash0 to help you when you are under the stress of outages. We want your troubleshooting experience to be intuitive and visceral. You must be able to “touch” the data and manipulate it to get to insights.

Triage is not just about what telemetry is “interesting”. Something that stands out, stands out in comparison with something else.¹ Triage offers you several ways of comparing your selection with different subsets of the data on screen.

Compare with everything else

Comparing a selection that stands out against all the rest is a great way to start analyzing new, not-yet-understood problems; this is the mode we used in our walkthrough.

Compare with the other spans in the same timeframe

Comparing the selection with the rest of the data in the same timeframe is best for analyzing outliers, especially in terms of duration.

Compare with everything before

Comparing the selection with telemetry before the selection in the global timeframe is an excellent way to analyze the first occurrence of spikes or error clusters.
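The three comparison modes above differ only in which spans form the baseline. A sketch of the three baselines, with function names and the (timestamp, payload) span shape invented by us for illustration, not taken from Dash0’s API:

```python
# Spans are modeled as (timestamp, payload) tuples; `selection` holds the
# spans the user drew a box around.

def baseline_everything_else(spans, selection):
    """Mode 1: all spans outside the selection — good for new, not-yet-understood problems."""
    return [s for s in spans if s not in selection]

def baseline_same_timeframe(spans, selection):
    """Mode 2: only spans within the selection's own time window — good for duration outliers."""
    t0 = min(t for t, _ in selection)
    t1 = max(t for t, _ in selection)
    return [s for s in spans if t0 <= s[0] <= t1 and s not in selection]

def baseline_before(spans, selection):
    """Mode 3: spans preceding the selection — good for the first occurrence of a spike."""
    t0 = min(t for t, _ in selection)
    return [s for s in spans if s[0] < t0]

spans = [(1, "a"), (2, "b"), (2.5, "c"), (3, "d"), (4, "e")]
selection = [(2, "b"), (3, "d")]

print(baseline_same_timeframe(spans, selection))
# → [(2.5, 'c')]
```

Once the baseline is chosen, the frequency comparison from the walkthrough runs against it unchanged; only the reference population shifts.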

Analyze attribute values

Triage also gives you a quick way to analyze how attributes and their values cluster across all the available telemetry. Maybe you are not looking to troubleshoot issues, but rather to understand the reality of your data. One way to do this in Dash0 so far has been the Filter dialog, which breaks down filter possibilities by values. (Incidentally: building the Filtering experience this way made Triage much simpler to build, as we already had a lot of the mechanics in place.)

Triage is in Beta

This is the first iteration of Dash0 Triage. It is available in open beta for Tracing, which is the signal with the largest amount of metadata on average. At the time of writing, we know of various cosmetic bugs and UX details that need improvement, but we don’t want to wait to learn with and from you and your use cases.

Triage will also come to Logging and Metrics once your feedback confirms that we have nailed the user experience. Indeed, expect the user experience to be tuned and improved over the next few days and weeks.

Try it and let us know!

We hope you will like Triage! Like all Dash0 features, we are our own first adopters: we have been using Triage internally since the moment it was deployed. And while we had big expectations about its usefulness, the actual experience frankly blew them away. Suffice it to say, the words “pry Triage out of my cold, dead hands” were uttered in defiance by our SREs that one time other engineers considered temporarily turning it off to tackle teething performance problems.

Give Triage a spin, and let us know what you think of it. We need your feedback: we want to polish Triage to a shine.

─────────

¹ Which is why, by the way, “Comparisons” was the original internal name for Triage.