SRECon Recap: Product Reliability, Burnout, and More
I recently attended SRECon in San Francisco on March 18-20, a conference for engineers who care deeply about site reliability, systems engineering, and working with complex distributed systems at scale. There were a lot of talks, so I’ll focus on the few that gave me the most insight into how having the right data affects an SRE’s and an organization’s success.
How Google Maps Keeps Their Product Reliable
Micah Lerner and Joe Abrams of Google shared how they keep Google Maps reliable for 1 billion active users. One of their major challenges was that many outages were not detected by alerts, but by users reporting the issues through outlets such as Reddit.
One outage they highlighted involved immersive views, which let a user experience an area as a 3D animation. When a user tried to open an immersive view, it wouldn’t load, yet none of the existing SLOs detected the failure. The team found the problem by comparing past and present metrics for successful opens and noticing a significant drop.
Across their experience and the post-mortems they had done, one common problem was that production alerts didn’t always capture the user experience, so the alerts they had in place were not sufficient. Another pattern was that when system changes rolled out (e.g. configuration changes or new features), the team received incomplete feedback about user impact, or none at all.
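To make the past-versus-present comparison concrete, here is a minimal sketch in Python. It is not Google's implementation; the function name detect_drop, the 50% threshold, and the sample counts are all assumptions made up for illustration. It compares the current count of successful opens against the median of earlier comparable windows and flags a large relative drop.

```python
from statistics import median


def detect_drop(baseline_counts, current_count, max_drop=0.5):
    """Return True if current_count fell more than max_drop (as a fraction)
    below the median of baseline_counts.

    baseline_counts: successful-open counts from earlier comparable windows
    current_count:   successful-open count for the window being checked
    max_drop:        hypothetical threshold; 0.5 means "more than a 50% drop"
    """
    if not baseline_counts:
        raise ValueError("baseline_counts must not be empty")
    baseline = median(baseline_counts)
    if baseline <= 0:
        # No meaningful baseline traffic, so a drop cannot be measured.
        return False
    drop = (baseline - current_count) / baseline
    return drop > max_drop


# Hypothetical hourly counts of successful opens (illustrative numbers only).
past_hours = [1000, 980, 1020, 1010]
print(detect_drop(past_hours, 990))  # normal traffic
print(detect_drop(past_hours, 300))  # sharp fall in successful opens
```

A check like this watches what users actually manage to do (successful opens) rather than server-side error rates, which is why it can catch a failure that existing SLOs miss.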
Meeting the Challenge of Burnout
Dr. Christina Maslach, a psychology professor at the University of California, Berkeley, discussed burnout, something we all see in our lives in many shapes and forms. Dr. Maslach opened by explaining that burnout is typically a management problem; for example, people being asked to do more with less. The default response is to help the employee cope with the ongoing stressors, when management instead needs to fix the job, not the person. A key takeaway is that a healthy job environment takes care of both the workers and the workplace so that workers can thrive and succeed.
Twenty Years in SRE
Niall Murphy of Stanza Systems talked about how, when he started out, the role was not yet called SRE but was instead associated with the NOC (network operations center). Over his 20 years as an SRE, he has seen the industry and the role evolve, but adoption is still not widespread. According to Gartner, by 2027, 75% of enterprises will use site reliability engineering practices across their organizations, up from 10% in 2022. He highlighted the need for more education on SRE principles, clearer career paths for SREs, and, most importantly, more qualitative data models in place of purely quantitative ones.
Telemetry Data Needs to Be Valuable to Be Useful
Mezmo was on the ground as a gold sponsor, learning about the challenges people face with their telemetry data and explaining what a telemetry pipeline is. Many visitors came to learn about telemetry pipelines and how they help users get more value out of their observability, which is where our O’Reilly book, The Fundamentals of Telemetry Pipelines, came in handy.
We find it’s important to understand that a large quantity of telemetry data is useless if it isn’t the right data, regardless of your role in an organization. So what’s one way to get more qualitative data over quantitative, reduce mental toil, and ease burnout? Telemetry pipelines! Request a demo now.
