Petabyte Scale, Gigabyte Costs: Mezmo’s Evolution from ElasticSearch to Quickwit
2.5.25
At Mezmo, we handle an enormous volume of telemetry data for our customers and ourselves, requiring a robust and efficient search and analytics backend. For years, ElasticSearch served us well, but as our infrastructure grew to a multi-cluster, multi-petabyte scale, we started to see the cracks—rising costs, performance bottlenecks, and scalability concerns. We needed a change, one that would make our system more cost-effective while maintaining speed and reliability.
After extensive research, we landed on Quickwit, an open-source, cloud-native search engine for logs. The transition was far from simple—it required significant engineering effort to revamp our infrastructure and help Quickwit adjust to new demands while ensuring a seamless experience for our customers. This blog takes you through our journey: why we moved, how we made the decision, the challenges we faced, and key lessons learned.
Recognizing the Need for Change
ElasticSearch had been the backbone of our search and analytics system, but we saw four areas to improve:
- Cost Optimization – Storing and searching petabytes of data is expensive, and we needed to do it without breaking the bank.
- Performance – While there was no impact on our service, we saw room for improvement.
- Scalability – We saw similar room for improvement in scalability. With Quickwit, we saw an opportunity to separate the storage layer from the compute layer, making it possible to spin up nodes faster.
- Reliability – While we had a good process to ensure a reliable customer experience, internally the scale of our ElasticSearch deployment required a lot of attention and intervention.
Choosing the Right Solution: The Decision Framework
Selecting a new backend wasn’t just about finding a cheaper alternative—we needed a solution that was:
- Cost-efficient – Could reduce storage and compute expenses without sacrificing performance.
- Cloud-native – Designed to leverage modern cloud infrastructure.
- Scalable – Could handle our ever-growing data volume without constant tuning.
- Performance-driven – At least on par with ElasticSearch, and ideally delivering faster queries with lower resource consumption.
- Reliable – Being faster and cheaper means nothing if the customer experience suffers.
After evaluating multiple options, Quickwit stood out for its efficient indexing, cloud-native architecture, and log-specific optimizations.
Architectural and Engineering Challenges
Migrating a mission-critical system comes with challenges. Our biggest hurdles included:
- Data migration at scale – Moving petabytes of data without downtime.
- Query compatibility – Ensuring Quickwit could support our existing search patterns.
- Maintaining performance – Avoiding disruptions to customer experience during migration.
To tackle these, we implemented a phased migration strategy, rigorous testing, and real-time performance monitoring to ensure a smooth transition.
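One way to check query compatibility during testing is to run the same query against both backends and compare the results. The following is a minimal sketch of that idea, not a description of Mezmo's actual tooling. The names `SearchFn`, `Comparison`, and `compare_backends` are hypothetical, and each backend is assumed to be wrapped in a plain function that takes a query string and returns matching document IDs.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Assumption: each backend is wrapped in a callable that takes a query
# string and returns the IDs of matching documents.
SearchFn = Callable[[str], Iterable[str]]


@dataclass
class Comparison:
    """Differences between two backends for a single query."""
    query: str
    only_in_old: List[str]
    only_in_new: List[str]

    @property
    def matches(self) -> bool:
        # The backends agree when neither returned an ID the other missed.
        return not self.only_in_old and not self.only_in_new


def compare_backends(query: str, old_search: SearchFn, new_search: SearchFn) -> Comparison:
    """Run one query against both backends and report the ID differences."""
    old_ids = set(old_search(query))
    new_ids = set(new_search(query))
    return Comparison(
        query=query,
        only_in_old=sorted(old_ids - new_ids),
        only_in_new=sorted(new_ids - old_ids),
    )
```

Comparing sets of IDs ignores result ordering, which keeps the check simple. A stricter harness could also compare ranking or aggregation values.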
Executing the Migration Without Disrupting Customers
To minimize risk, we took an incremental approach:
- Proof of Concept (PoC) – Tested a couple of applications running on Quickwit and compared them against ElasticSearch.
- Parallel Deployment – Ran Quickwit alongside ElasticSearch to validate results and make refinements as needed.
- Gradual Traffic Shift – Slowly redirected queries while monitoring performance, following a break, adjust, and fix approach.
- Full Cutover – Decommissioned ElasticSearch once Quickwit met all requirements.
This measured rollout ensured zero service interruptions and seamless adoption.
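A gradual traffic shift like the one described above can be sketched as a small routing function. This is an illustrative sketch under stated assumptions, not Mezmo's implementation: the function names, the `rollout_percent` setting, and the `pinned_to_old` set of customers held on ElasticSearch are all hypothetical. Customers are hashed into stable buckets so that each one consistently lands on the same backend as the rollout percentage grows.

```python
import hashlib
from typing import Callable, List, Set

# Assumption: each backend is wrapped in a callable taking a query string.
SearchFn = Callable[[str], List[str]]


def bucket_for(customer_id: str) -> int:
    """Map a customer to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def choose_backend(customer_id: str, rollout_percent: int, pinned_to_old: Set[str]) -> str:
    """Pick a backend name for this customer."""
    # Customers switched back manually always stay on the old backend.
    if customer_id in pinned_to_old:
        return "elasticsearch"
    # Buckets below the rollout percentage are served by the new backend.
    if bucket_for(customer_id) < rollout_percent:
        return "quickwit"
    return "elasticsearch"


def search(customer_id: str, query: str, rollout_percent: int,
           pinned_to_old: Set[str], old_search: SearchFn,
           new_search: SearchFn) -> List[str]:
    """Route a query, falling back to the old backend if the new one fails."""
    if choose_backend(customer_id, rollout_percent, pinned_to_old) == "quickwit":
        try:
            return new_search(query)
        except Exception:
            # Parallel deployment keeps the old backend available as a fallback.
            return old_search(query)
    return old_search(query)
```

Raising `rollout_percent` from 0 to 100 moves customers over in stages, and adding a customer ID to `pinned_to_old` sends that customer back to ElasticSearch without affecting anyone else.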
Key Takeaways & Lessons Learned
- Recognize when to evolve – Don’t wait until performance issues impact customers.
- Use a structured decision framework – Prioritize scalability, cost, reliability, and performance.
- Plan for migration challenges – Expect unexpected roadblocks and address them proactively. For example, the markers we relied on to identify production issues in ElasticSearch were different in Quickwit.
- Customer experience is paramount – Maintain stability while making backend improvements. With our parallel deployment model, we could switch a customer back to ElasticSearch if something wasn’t performing right in Quickwit with no interruption to the customer.
- Continuous optimization is key – Post-migration tuning can unlock even more efficiencies. For example, we are still learning the right size for our cluster components and constantly fine-tuning them.
Conclusion
Switching from ElasticSearch to Quickwit transformed our search and analytics infrastructure, reducing data storage and search costs by 90% while improving scalability and performance. While the journey wasn’t easy, the payoff was worth it. For companies struggling with costly, inefficient search architectures, our advice is simple: evaluate your needs, explore alternatives, and don’t be afraid to evolve.