Monitoring SaltStack Logs with LogDNA

4 MIN READ

SaltStack is an open source configuration management tool that lets you manage your infrastructure as code. Using SaltStack, you can manage tens of thousands of servers remotely using either a declarative language or imperative commands. It's similar to configuration management tools such as Puppet, Chef, and Ansible.

Like most configuration management tools, SaltStack logs the activities taking place between masters and minions. These logs capture important details about those activities, including which actions were taken, whether they were successful, how long they took, and which minions were involved. This data can provide valuable insights into your SaltStack environment and make managing your minions easier.

How SaltStack Works

SaltStack consists of central command servers (called "masters") that send commands to one or more clients (called "minions"). These commands are based on configuration declarations called states. You can apply a state to a minion, and SaltStack will determine which commands to execute on the minion so that it meets this state.
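For example, a quick way to see the master–minion model in action is Salt's built-in test.ping function, which simply asks every targeted minion to respond. The command is run from the master; the minion names in the sample output below are illustrative:

root@saltmaster:~# salt '*' test.ping
web1:
    True
db1:
    True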

SaltStack is designed for high performance, with the documentation stating that a single master can manage thousands of systems. In fact, LogDNA uses SaltStack to provision and manage infrastructure across several cloud platforms.

Getting Your SaltStack Logs to LogDNA

You can send logs directly from SaltStack to LogDNA through the LogDNA SaltStack integration. With this integration, any state changes are logged and forwarded to your LogDNA account. For example, let's say we want to install Vim on each of our minions. We can do so by creating the following Salt state file:

# /srv/salt/vim.sls
vim:
  pkg.installed

Next, we'll run the following command on the Salt master:

root@saltmaster:~# salt '*' state.apply vim
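Salt then reports the result for each minion using its highstate outputter. The exact wording varies by Salt and OS version, but the report looks roughly like the sketch below (the minion name, timestamps, and durations are illustrative):

web1:
----------
          ID: vim
    Function: pkg.installed
      Result: True
     Comment: The following packages were installed/updated: vim
     Started: 12:00:00.000000
    Duration: 5000.0 ms
     Changes:
              ----------
              vim:
                  ----------
                  new:
                      (installed version)
                  old:

Summary for web1
------------
Succeeded: 1 (changed=1)
Failed:    0

Note the Duration field: Salt reports a runtime for each state it applies, and this is the kind of timing data the runtime analysis later in this post relies on.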

Then, we can view the results in the LogDNA web app:

LogDNA automatically extracts the following fields:

  • salt.jid: The unique job ID that generated this event
  • salt.args: Any additional arguments passed to the salt command
  • Source: The name of the minion that the event applies to
  • App: Automatically set to "Salt"

However, Salt events contain additional data such as:

  • The status of the job (SUCCESS or FAILURE)
  • The name of the state being applied (e.g. vim)
  • The job's total runtime
  • Any additional messages from Salt

Although LogDNA doesn't parse these fields automatically, we can use custom parsing rules to extract them. For successful events, we can extract the name of the state and its runtime:

Failed events provide the state and runtime as well, but they also include the reason the event failed. We extract this with a separate rule:

Analyzing Salt Logs in LogDNA

Monitoring for Failed Events

Even the most rigorously tested state files can fail to execute properly. When applying a state automatically or asynchronously, you may not be aware of an execution failure until you inspect the minion that the state ran on. But to understand the cause of the problem and to troubleshoot the state file, you will need information from Salt itself.

Fortunately, each event that gets sent to LogDNA includes a status of either SUCCESS or FAILURE. You can filter your view to only show failed events using the following search:

app:Salt FAILURE

From this view, we can create an alert that sends a notification whenever a new failed event is detected. We can route the alert to the DevOps team, notifying them of the failure and the reason behind it. Each event also includes the minion name in the host field, letting engineers identify the specific minion where the failure occurred.

Tracking Execution Runtime

For each event, Salt reports the total time the event took to run. If you notice that your states are taking an unusually long time to complete, this is a useful way of determining whether the problem is caused by the master, the minion, or the state itself.

Using a custom parsing rule, we'll extract each event's duration to a numeric time field (in milliseconds). This way, we can use comparison operators and charts to analyze our events more thoroughly. For example, we can use the following search to find events that took longer than ten minutes to complete:

app:Salt time>=600000
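This numeric field can also be combined with the other fields LogDNA extracts. For instance, we could narrow the same search to a single minion (the host name here is illustrative):

app:Salt host:web1 time>=600000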

We can also graph this data to find any anomalies or trends. For example, the following graph shows the duration of each event applied to a single minion. The most apparent issue is the first event, which took 3–4 times as long as the four following events:

By clicking on the first peak and selecting "Show Logs", we can find the exact state change that resulted in the slow performance. As it turns out, the change was a significant one that involved installing the Apache web server, creating files, adding several users and groups, and verifying each change. The next four changes were much less involved, so they took far less time (5–6 seconds) to complete.
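For reference, a state on that scale might look something like the sketch below. The package, user, and file names are illustrative and assume a Debian-based minion; this is not the exact state file from the example above:

# /srv/salt/apache2.sls
apache2:
  pkg.installed: []
  service.running:
    - enable: True
    - require:
      - pkg: apache2

www_admin:
  user.present:
    - groups:
      - www-data
    - require:
      - pkg: apache2

/var/www/html/index.html:
  file.managed:
    - source: salt://apache2/index.html
    - require:
      - pkg: apache2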

Reviewing the Impact of State Changes

Although state.apply includes a dry-run option, there's always the risk of a state accidentally being applied to live servers. When this happens, you'll want to know which minions were affected and what the results of the state change were.

For example, let's say we accidentally deployed PHP to each of our minions using a custom state file named php.sls. We added test=True to the end of our command but accidentally typed a semicolon, causing our terminal to run two separate commands:

root@saltmaster:~# salt '*' state.apply php ;test=True
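For comparison, the intended dry run passes test=True as an argument to state.apply on the same command line, with no semicolon:

root@saltmaster:~# salt '*' state.apply php test=True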

We could debug this ourselves by scrolling up through the output. However, we only have four minions in our environment. Imagine if you had 100, 500, or 1,000+ minions. Trying to scroll back through each result would be incredibly labor-intensive.

Instead, we can use LogDNA to see how many minions each state was applied to. By filtering our logs, we can even see the exact minions that were impacted. In the following image, we created a graph of the total log volume from Salt and added a histogram of the state field. This shows us which states were recently applied and the number of minions they applied to. Our environment has just four minions, which is why the maximum event count is also four:

If we want to show a list of minion names instead, we can filter the graph to only count logs from the apache2 state, then add a histogram based on the host field.

Start Monitoring SaltStack Today

With LogDNA, monitoring SaltStack is easy. To get started, follow the instructions for installing the LogDNA Salt deployment integration and apply a state in your Salt cluster. In the LogDNA web app, select "Salt" from the All Apps dropdown, or enter app:Salt in the search box. LogDNA automatically color codes your Salt events for easier readability.

To learn more about LogDNA, contact us, visit our website, or sign up for a 14-day free trial (no credit card required).

 
