How Mezmo Uses a Telemetry Pipeline to Handle Metrics, Part II
11.27.24
In April, we published Part I of this blog series on how we planned to use our own Telemetry Pipeline product here at Mezmo to manage metrics data. Remember that utopia of observability we dreamed about? Well, we’re not quite there yet, but we’ve made progress. We’ve gone from “Wouldn’t it be cool if…” to “Holy crap, it’s actually working!”
We’re now running OpenTelemetry (OTel) on every node in every cluster, and those collectors are sending a significant amount of data. We scrape and collect every metric we have (and some traces, stay tuned), which amounts to 600-700 MB/s of data funneled into the Pipeline.
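To make the shape of that flow concrete, here is a minimal Go sketch of what pointing an OTel metrics exporter at a Pipeline OTLP endpoint could look like. The endpoint address, meter name, and counter are hypothetical stand-ins; our actual setup relies on the OTel agents running on each node rather than application code like this.

```go
package main

import (
	"context"
	"log"
	"time"

	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	ctx := context.Background()

	// Hypothetical Pipeline OTLP/gRPC endpoint; in our deployment the OTel
	// agent on each node forwards data to Pipeline instead of the app itself.
	exporter, err := otlpmetricgrpc.New(ctx,
		otlpmetricgrpc.WithEndpoint("pipeline.example.internal:4317"),
		otlpmetricgrpc.WithInsecure(),
	)
	if err != nil {
		log.Fatalf("creating OTLP metric exporter: %v", err)
	}

	// Export accumulated metrics every 30 seconds.
	provider := sdkmetric.NewMeterProvider(
		sdkmetric.WithReader(
			sdkmetric.NewPeriodicReader(exporter, sdkmetric.WithInterval(30*time.Second)),
		),
	)
	defer provider.Shutdown(ctx)

	// Record a sample counter so there is something to export.
	meter := provider.Meter("example/pipeline-demo")
	requests, err := meter.Int64Counter("demo.requests")
	if err != nil {
		log.Fatalf("creating counter: %v", err)
	}
	requests.Add(ctx, 1)
}
```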
Our Observability Pipeline
Even though we are still running the sysdig-agent and not yet using Pipeline for traces, this is the observability story we’re aiming for: no matter what type of data a pod emits, it sends that data to Pipeline, where we route, analyze, enrich, and alter it as we see fit. The result is much more coherent and has already proven valuable.
Along the way, we had to iron out many issues. Everything worked on the first try in the lower environments, but as usual, the real trouble began once we hit production. The addition of our new pipeline product brought even more metrics and logs, and internal queues started filling up. The ordering guarantees in Pipeline caused all of the data to land in just a few Kafka partitions, blocking the whole flow.
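To illustrate the failure mode (not Pipeline’s actual internals), here is a small, self-contained Go sketch of how a hash partitioner keyed on a low-cardinality ordering key concentrates traffic onto a handful of partitions. The key names and partition count are made up for the example.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// partitionFor mimics a typical hash partitioner: hash the ordering key
// and take it modulo the partition count.
func partitionFor(key string, numPartitions int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32()) % numPartitions
}

func main() {
	const numPartitions = 64

	// Hypothetical low-cardinality ordering keys, one per emitting component.
	// With only a few distinct keys, ordering guarantees funnel all traffic
	// onto a few of the 64 partitions, which is the kind of bottleneck we hit.
	keys := []string{"otel-metrics", "otel-logs", "pipeline-internal"}

	counts := make(map[int]int)
	for i := 0; i < 1_000_000; i++ {
		key := keys[i%len(keys)]
		counts[partitionFor(key, numPartitions)]++
	}

	fmt.Printf("busy partitions: %d of %d\n", len(counts), numPartitions)
	for p, n := range counts {
		fmt.Printf("partition %2d received %d messages\n", p, n)
	}
}
```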
But with trial and error, we ironed out all the issues and eventually made it work.
An unexpected bonus of incorporating OTel is that we can now handle metrics, traces, and logs. This has proven valuable, especially since our data store emits traces. With those traces, we can debug slow search speeds more quickly and with greater confidence in our findings.
One last benefit of using OTel: there are limits on the number of metrics a given sysdig-agent running on a Kubernetes node can send to Sysdig. Because Pipeline sits in the middle, we can reduce and consolidate many metrics into a single, arguably more useful, metric before sending it on to Sysdig, so we no longer have to manage around those limits.
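As a rough sketch of the consolidation idea, the Go snippet below collapses several hypothetical per-pod series into one aggregate by dropping the high-cardinality pod label and summing values; the metric names, labels, and values are illustrative, not our actual processor logic.

```go
package main

import "fmt"

// sample is a simplified metric data point with the labels the per-pod
// series would otherwise carry all the way to the backend.
type sample struct {
	name  string
	pod   string
	value float64
}

// consolidate drops the high-cardinality "pod" label and sums the values,
// turning many per-pod series into one metric per name.
func consolidate(samples []sample) map[string]float64 {
	totals := make(map[string]float64)
	for _, s := range samples {
		totals[s.name] += s.value
	}
	return totals
}

func main() {
	// Hypothetical per-pod series; real names and values differ.
	samples := []sample{
		{"queue_depth", "pipeline-abc", 120},
		{"queue_depth", "pipeline-def", 95},
		{"queue_depth", "pipeline-ghi", 310},
	}

	// One consolidated series is forwarded instead of three.
	for name, total := range consolidate(samples) {
		fmt.Printf("%s (all pods): %.0f\n", name, total)
	}
}
```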
What’s Next?
From here, we need to shift our focus from collecting to storing. We need to simplify and prepare our backends for receiving all or a subset of this data: Sysdig, Global Prometheus, and maybe Jaeger for traces. I’m sure this will uncover many known unknowns and unknown unknowns, but I’m equally sure we will learn a lot along the way!
We’re eyeing that single pane of glass dream. With metrics, logs, and traces all flowing through Pipeline, we’re getting closer to that unified view of our observability data.
Stay tuned for more updates as we continue this journey.