
AWS re:Invent ‘24: Generative AI Observability, Platform Engineering, and 99.9995% Availability

April Yep

12.20.24

April has several years of experience in the observability space, going back to the days when it was called APM, DevOps, or infrastructure monitoring. April is a Senior Product Marketing Manager at Mezmo and loves cats and tea.

I attended the Amazon Web Services re:Invent conference, AWS's annual user conference that takes over most of Las Vegas for a week. There's a lot to do and take in: customer stories galore, new tech, learning different use cases, and all the walking. But you're here to hear what I learned, so I've broken it down into sections. Enjoy!

Generative AI Is Here to Stay

One of the biggest recurring themes across keynotes, discussions, and sessions was generative AI. While plenty of use cases and products already support generative AI, more serious questions are now being asked: how do you detect and prevent AI hallucinations, ensure ethical usage, build guardrails, and so on? AWS has found that even when AI reaches 90% accuracy, the remaining 10% is still incorrect and could give someone bad information. Automated reasoning, which proves that a system is working as intended, is one area where AI still has room to grow. 

TL;DR: Generative AI still has some work to do but has made tremendous headway. It increasingly needs guardrails, along with ways to observe and monitor the data sets. 

DOP207: Understanding software logistics: The rise of the platform engineer

This was by far one of my favorite sessions: a fireside chat about platform engineering with Jason Valentino, Head of Software Engineering Practices at BNY, and Lee Faus, Global Field CTO at GitLab. 

A wide range of platform engineering topics was discussed: how do you spin projects up and down, what does platform engineering look like at BNY, how do you balance platform needs with developer needs, and so on. My takeaways for platform engineering best practices:

  1. Have processes and guidelines: it's important to define what must be in every project, like security protocols, documentation, etc. 
  2. Measure performance and productivity: BNY has a net promoter score (NPS) for the products it uses, and uses SPACE (satisfaction and well-being, performance, activity, communication and collaboration, efficiency and flow) for developer productivity. 
  3. Elevate platform engineers and map them to engineering teams: this helps with consistency across projects and ensures someone advocates for platform engineering practices in different groups. 
  4. Leverage DORA metrics: deployment frequency, lead time for changes, mean time to recovery, and change failure rate (a rough calculation sketch follows below). 

Jason closed out by acknowledging the difficulty in scaling such practices across large organizations. However, having an advocacy model within different groups/business units has been the most effective approach for him and his team. 
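If you haven't worked with DORA metrics before, here's a minimal sketch of how a few of them could be computed from deployment records. The record format and numbers are made up for illustration; this isn't anything BNY or GitLab showed.

```python
from datetime import datetime

# Hypothetical deployment records: when each deploy shipped, whether it
# caused a failure, and commit-to-deploy lead time in hours.
deploys = [
    {"time": datetime(2024, 12, 2, 10, 0), "failed": False, "lead_time_hours": 20},
    {"time": datetime(2024, 12, 4, 15, 30), "failed": True,  "lead_time_hours": 36},
    {"time": datetime(2024, 12, 9, 9, 15), "failed": False, "lead_time_hours": 12},
]

window_days = 28  # a four-week reporting window

# Deployment frequency: deploys per week over the window.
deploy_frequency = len(deploys) / (window_days / 7)

# Change failure rate: share of deploys that caused a failure.
change_failure_rate = sum(d["failed"] for d in deploys) / len(deploys)

# Lead time for changes: average time from commit to deploy.
avg_lead_time = sum(d["lead_time_hours"] for d in deploys) / len(deploys)

print(f"Deploys/week: {deploy_frequency:.2f}")
print(f"Change failure rate: {change_failure_rate:.0%}")
print(f"Avg lead time: {avg_lead_time:.1f} h")
```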

FSI321: How Stripe achieves five and a half 9s of availability

99.9995% availability equates to just 13 seconds of downtime out of roughly 2.6 million seconds in a month, which is unbelievable when you look at the numbers. Stripe felt the bar needed to be that high because every transaction matters: they understand that 40% of users will abandon a transaction if it fails, and they don't want to risk losing that revenue to a competitor. 
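A quick back-of-the-envelope check of that math, assuming a 30-day month:

```python
# 99.9995% availability over a 30-day month.
seconds_per_month = 30 * 24 * 60 * 60   # 2,592,000, roughly 2.6 million seconds
availability = 0.999995

downtime_budget = seconds_per_month * (1 - availability)
print(f"{downtime_budget:.0f} seconds of allowed downtime")  # ~13 seconds
```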

How do they do it? 

With technology and a mindset. The mindset at Stripe is to expect Murphy's law to strike, so they think about how to reduce the blast radius and run a cell-based architecture. While this is a more complex architecture, it means a single cell handles a transaction end to end, which makes it easier to roll out new deployments gradually and roll back if a failure occurs. They acknowledged that gray failures still happen, like high data latency or noisy neighbors, and they use anomaly detection to identify those incidents and respond appropriately. 
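To make the cell idea concrete, here's a toy sketch of cell-based routing. This is not Stripe's implementation; the account IDs and cell count are invented. The point is that each account is deterministically pinned to one cell, so that cell handles its transactions end to end and a bad deploy can be contained to a subset of cells.

```python
import hashlib

NUM_CELLS = 8  # each cell is an independent, full copy of the stack

def cell_for(account_id: str) -> int:
    """Deterministically map an account to a single cell."""
    digest = hashlib.sha256(account_id.encode()).hexdigest()
    return int(digest, 16) % NUM_CELLS

def route_transaction(account_id: str, amount: str) -> str:
    cell = cell_for(account_id)
    # In a real system this would call the cell's isolated endpoint;
    # here we just report which cell owns the transaction.
    return f"cell-{cell} handles {amount} for {account_id}"

print(route_transaction("acct_123", "$25.00"))
```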

They highlighted three takeaways about how their culture is able to achieve such high availability:

  1. Practice your worst day every day
  2. Never send a human to do a machine’s job
  3. Exercise extreme ownership

COP404: Best Practices for Generative AI Observability

Generative AI has disrupted many practices, and one open question is how you do observability on generative AI. The answer: it's not really that different, just a new relationship with different forms of telemetry data (e.g., error rate can indicate that an AI job is hallucinating). My key takeaway is to take a layered approach to AI observability:

  • Layer 1: Component level metrics (invocation errors, latency, resource utilization)
  • Layer 2: Retrieval Augmented Generation (RAG) agent or chain traces
  • Layer 3: Advanced metrics and analysis (e.g. guardrail interventions or embedding drift)
  • Layer 4: End-user feedback

Many observability tools already cover these in the form of metrics, traces, and logs, so there's no real need to change the tools you have in place, just how you practice observability. 
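As a minimal sketch of Layer 1 (component-level metrics), here's what wrapping a model invocation with latency and error signals might look like. The `call_model` function and the log field names are hypothetical placeholders, not a specific vendor API.

```python
import time
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("genai.observability")

def call_model(prompt: str) -> str:
    """Hypothetical placeholder for an LLM invocation."""
    return f"response to: {prompt}"

def observed_invocation(prompt: str) -> str:
    start = time.monotonic()
    try:
        response = call_model(prompt)
        status = "ok"
        return response
    except Exception:
        status = "error"  # Layer 1: invocation errors
        raise
    finally:
        latency_ms = (time.monotonic() - start) * 1000
        # Layer 1: emit latency and status; a telemetry pipeline or agent
        # would ship these as metrics/logs to your observability tooling.
        log.info("model_invocation latency_ms=%.1f status=%s", latency_ms, status)

print(observed_invocation("Summarize the re:Invent keynote"))
```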

Summing It All Up

While generative AI is the focus these days, the need for observability is only amplified, since AI workloads generate even more telemetry data. Give Mezmo's Telemetry Pipeline a try to see how it can help you get control of, and more insight into, your telemetry data. 
