Dogfooding at Mezmo: How we used our Telemetry Pipeline to reduce data volume
8.9.24
Like many other organizations, we at Mezmo deal with a large volume of telemetry data. For a while, our team configured our logs to be sent to a global Mezmo Log Analysis account in our SaaS so we would have a single pane of glass for viewing all of our logs.
Our SRE team wanted to make sure we had hands-on experience with our new pipeline product, so we set a few goals before we started using Telemetry Pipeline:
- Send logs from the global account through Telemetry Pipeline for processing
- Reduce total data storage by at least 20%
- Migrate exclusion rules from Log Analysis to Telemetry Pipeline
You can also read our previous blog on how we used Telemetry Pipeline to handle metrics.
First: What is Mezmo Telemetry Pipeline?
Mezmo’s Telemetry Pipeline makes it easier to understand data, optimize it to control costs, and respond quickly to make data-driven decisions.
Applying data engineering principles, Mezmo helps organizations gain confidence in their telemetry data and extract business value from it by centralizing data from various sources on Mezmo’s open platform. Data profiling helps teams understand their data before it is optimized; out-of-the-box and custom processors then transform the data and route it to the desired destinations for further analysis.
With the Mezmo Telemetry Pipeline, enterprises can rapidly respond to any incident or data changes. This empowers teams to make faster decisions, all while controlling data costs.
The platform gives customers flexibility, regardless of the other tools they use in their observability and security stacks.
The Process
Our team started by having the logs flow through a simple pipeline without any manipulation. We then reviewed each exclusion rule used in Log Analysis and recreated them in the Telemetry Pipeline.
To reduce our storage, we reviewed the data to determine which app was sending the most volume and immediately noticed that coredns accounted for about 50% of the traffic. We decided to use the sample processor to reduce that volume by 90%. By happy accident, we also satisfied a goal we had not originally considered: we translated the coredns logs into Prometheus metrics. This further reduced the logs we were storing while still letting us see how many DNS lookups were happening per environment.
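To make that concrete, here is a minimal sketch, in TypeScript purely for illustration, of the two transformations described above: sampling coredns log lines down to roughly 10% and turning each lookup into a per-environment counter metric. The event shape, field names, and metric name are assumptions for the example; in the actual pipeline these steps are built-in processors configured in the UI rather than hand-written code.

```typescript
// Illustrative sketch only: in the real pipeline these steps are built-in
// processors configured in the UI, not hand-written code. The event shape,
// field names, and metric name below are assumptions for demonstration.

interface LogEvent {
  app: string;      // e.g. "coredns"
  env: string;      // e.g. "production"
  message: string;
}

interface MetricEvent {
  name: string;
  kind: "incremental";
  value: number;
  tags: Record<string, string>;
}

// Keep roughly 1 in 10 coredns lines (about a 90% volume reduction);
// logs from every other app pass through untouched.
function sampleCoredns(event: LogEvent, rate = 10): LogEvent | null {
  if (event.app !== "coredns") return event;
  return Math.random() < 1 / rate ? event : null;
}

// Translate every coredns line into a counter metric so DNS lookup volume
// per environment stays visible even though most raw lines are dropped.
function corednsToMetric(event: LogEvent): MetricEvent {
  return {
    name: "coredns_dns_lookups_total",
    kind: "incremental",
    value: 1,
    tags: { env: event.env },
  };
}
```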
The Outcomes
- All logs flow through a Mezmo telemetry pipeline now
- We exceeded our goal of reducing stored logs by 20%, averaging a 30% reduction
- Migrated exclusion rules; some livetail logs took an additional step of adding a Script Execution processor to modify the logs before sending them on to their destination (see the sketch after this list)
- Identified more easily which services were driving the highest storage costs
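For the livetail logs mentioned above, the extra step was a small script that reshapes each event before it is routed onward. Below is a minimal sketch of that kind of transform, written in TypeScript for readability; the Script Execution processor itself takes JavaScript, and the hook name, field names, and defaults shown here are assumptions rather than our exact script.

```typescript
// Hypothetical example of the kind of script attached to a Script Execution
// processor. The hook name, event fields, and defaults are illustrative;
// check Mezmo's processor docs for the exact signature it expects.

interface LivetailEvent {
  message: string;
  level?: string;
  meta?: Record<string, unknown>;
  [key: string]: unknown;
}

// Normalize livetail events before they reach their destination: make sure a
// log level is present and tag the event so downstream rules can match it.
function processEvent(event: LivetailEvent): LivetailEvent {
  if (!event.level) {
    event.level = "info"; // default a missing level
  }
  event.meta = { ...(event.meta ?? {}), source: "livetail" }; // mark the origin
  return event; // return the modified event to continue down the pipeline
}
```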
Since this project was completed, we have seen as much as a 70% log volume reduction in some areas, and with the help of data profiling, we’re able to make further optimizations, so check it out for yourself.