by Jacek Migdal
The tech world is fully embracing large language models (LLMs), making it a good time to reflect on what has worked, and what hasn't, over the past decade of pre-LLM AI in the observability industry. This is the first article in our AI-in-observability mini-series; follow-ups will cover the current state of the art and future predictions.
Business Context
Monitoring and troubleshooting modern software systems is an ideal candidate for automation. The stakes are high, and the business value is undeniable. Customers expect reliable systems, a tall order given their complexity, rapid updates, and numerous external dependencies.
Even the most advanced companies aren’t immune. Amazon, for example, lost an estimated $72–99 million in sales due to hiccups during its July 2018 Prime Day event. Delta Air Lines faced an even costlier incident on August 8, 2016, when system failures disrupted 2,000 flights, resulting in a $177 million loss. And while these headline-grabbing outages are extreme cases, countless smaller incidents go unreported; the average data center failure still carries a hefty $730,000 price tag. Even minor degradation hurts: an extra 100 ms of latency has been shown to cut sales by roughly 1%.
The scale of this problem has created a market opportunity worth tens of billions—leading to "decacorn" outcomes and the rise of legendary companies. Take Splunk, for example: $50 million in VC funding turned into a $26 billion valuation. Or Datadog, which raised $120 million and grew to a $47 billion valuation.
Plenty of other multi-billion-dollar successes exist, including New Relic, Sumo Logic, Elastic, Dynatrace, AppDynamics, and Grafana. With incentives like these, vendors are highly motivated to solve the problem.
In addition, engineers hate being on-call, especially when it means getting woken up at 2 AM for production issues. Fortunately, they’re also forward-thinking problem solvers with the autonomy to fix what frustrates them. And since they love sharing insights through blogs and conferences, good ideas spread fast.
So, how did AI and ML fare in the observability industry over the last decade?
Can ML/AI Find Problems for Us?
In the physical world of atoms, problems are easy to spot. But in the jungle of bits and bytes, many issues slip through unnoticed. They might cause a slow death by a thousand cuts—like customers occasionally failing to complete a purchase—or they could be lurking disasters waiting to unfold. Take the SolarWinds SUNBURST hack, for example. It started in the fall of 2019 and went undetected until December 2020, affecting major organizations, including the U.S. federal government and Microsoft.
Rather than relying on highly specialized professionals to painstakingly craft monitoring rules, why not let AI proactively search for problems? Production systems generate massive amounts of machine data, making it seem like an ideal playground for machine learning. Establishing baselines should be straightforward, and anomaly detection should shine in this scenario.
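To see why the pitch is so seductive, here is a minimal sketch of what baseline-plus-anomaly-detection can look like. The rolling window, three-sigma rule, and request counts are all illustrative, not how any particular vendor implements it:

```python
import numpy as np

def detect_anomalies(values, window=60, z_threshold=3.0):
    """Flag points that deviate from a trailing baseline by more than z_threshold sigmas."""
    values = np.asarray(values, dtype=float)
    anomalies = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]           # trailing window as the "normal" baseline
        mean, std = baseline.mean(), baseline.std()
        if std == 0:
            continue
        z = abs(values[i] - mean) / std
        if z > z_threshold:
            anomalies.append((i, values[i], z))
    return anomalies

# Illustrative usage: per-minute request counts with a sudden spike at the end.
requests_per_minute = [100 + np.random.randn() * 5 for _ in range(120)] + [180]
for idx, value, z in detect_anomalies(requests_per_minute):
    print(f"minute {idx}: {value:.0f} req/min looks anomalous (z={z:.1f})")
```

On tidy demo data this works beautifully. On real production telemetry, a three-sigma rule like this fires constantly, which is exactly the problem described below.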
Anomaly detection has often been pitched as a silver bullet for this challenge, with many teams delivering compelling demos. But does it truly live up to the hype?
Prelert, one of the early anomaly detection companies, was founded in 2008 but ended up as a modest acquisition by Elastic in 2016. By then, nearly every major vendor had integrated anomaly detection into their offerings:
Splunk IT Service Intelligence introduced anomaly detection in 2015 (deck with deets).
Sumo Logic launched its anomaly detection feature in 2013.
Dynatrace debuted its AI-powered voice and chat assistant, Davis, in 2017.
Datadog rolled out anomaly detection in 2016.
Even industry analysts like Gartner started pushing the term "AIOps" in 2016, making it seem like anomaly detection would see rapid adoption.
And yet, in 2021, only 12% of SREs reported using it regularly, just 7.5% found it highly valuable, and nearly 40% never used it at all. Despite vendors and their PR teams painting a rosy picture, multiple surveys confirm this lukewarm reception. While ML-powered anomaly detection looks fantastic in demos and impresses enterprise decision-makers, hands-on practitioners seem far less enthusiastic.
John Allspaw, former CTO of Etsy, even penned an open letter to monitoring, metrics, and alerting companies, calling out misleading marketing claims:
Anomaly detection in software is, and always will be, an unsolved problem. Your company will not solve it. Your software will not solve it. Our people will improvise around it and adapt their work to cope with the fact that we will not always know what and how something is wrong at the exact time we need to know.
While some AI-powered features are genuinely useful, general anomaly detection with AI has fundamental flaws that are visible in the long run. One of the most striking examples is Lacework. Its Polygraph technology promises to analyze cloud activity logs automatically—without any predefined rules. This pitch appealed to enterprise buyers, helping the company raise $1.9 billion over six years and reach a staggering $8.3 billion valuation.
The problem? The technology simply did not work reliably. Customers were left disappointed, leading to churn. Even with top-tier talent on board, the flawed approach could not be salvaged, and in the end, the company was sold for just $200–230 million—a fraction of its peak valuation.
The harsh reality is that real-world production systems are noisy and full of anomalies. While algorithms can certainly detect them, most are unimportant, leading to a flood of false positives. This creates alert fatigue without delivering much actual value. The real issues—the ones that truly matter—are often one-offs and edge cases. Detecting them automatically is incredibly difficult, as understanding them requires deep context that AI struggles to grasp.
The Principle of Least Power: Statistics Such as P99
While some ML-powered monitoring features have their place, good old-fashioned standard statistics remain hard to beat. The golden rule of monitoring revolves around the RED metrics: Rate (requests per second), Errors (error rate), and Duration (latency). Open standards like OpenTelemetry enable consistent data collection, making these metrics widely accessible.
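As a rough illustration (not a recommendation), recording RED metrics with the OpenTelemetry Python API might look like the sketch below. The meter and instrument names are made up, and a configured MeterProvider with an exporter is assumed to exist elsewhere:

```python
import time
from opentelemetry import metrics

# Assumes the opentelemetry-api package plus an SDK MeterProvider/exporter configured elsewhere.
meter = metrics.get_meter("checkout-service")

request_counter = meter.create_counter(
    "http.server.requests", description="Rate: total requests")
error_counter = meter.create_counter(
    "http.server.errors", description="Errors: failed requests")
duration_histogram = meter.create_histogram(
    "http.server.duration", unit="ms", description="Duration: request latency")

def handle_request(handler, route):
    """Wrap a request handler and record RED metrics for it."""
    start = time.monotonic()
    request_counter.add(1, {"http.route": route})
    try:
        return handler()
    except Exception:
        error_counter.add(1, {"http.route": route})
        raise
    finally:
        elapsed_ms = (time.monotonic() - start) * 1000
        duration_histogram.record(elapsed_ms, {"http.route": route})

# Usage: handle_request(lambda: do_checkout(), "/checkout")
```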

Among them, monitoring the 99th percentile (P99) of request duration is particularly valuable. As a rule of thumb, if 99% of requests are completed within an acceptable threshold, the system is considered healthy. Highly technical engineers are obsessed with P99—they even named P99 Conf after it.
Setting alerts on P99 is incredibly useful, as breaches often serve as an early warning sign of deeper issues. When an on-call SRE gets woken up at 2 AM, they’d rather it be for a real problem—not an ML false alarm. Unlike opaque machine learning models, statistical alerts are easy to explain and act on. The alert source provides an immediate starting point for troubleshooting—identifying affected users and pinpointing problematic requests.
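In practice this is usually a query in your metrics backend (in Prometheus, a histogram_quantile(0.99, ...) recording or alerting rule), but a toy Python version shows how little machinery a statistical alert needs. The 300 ms threshold and the sample data below are made up:

```python
import numpy as np

P99_THRESHOLD_MS = 300          # illustrative SLO threshold, not a recommendation
MIN_SAMPLES = 100               # avoid alerting on too little data

def p99_breached(latencies_ms):
    """Return (breached, p99) for the latencies observed in the current window."""
    if len(latencies_ms) < MIN_SAMPLES:
        return False, None
    p99 = float(np.percentile(latencies_ms, 99))
    return p99 > P99_THRESHOLD_MS, p99

# Illustrative usage with a five-minute window of request durations (ms).
window = [120, 140, 95, 210, 330, 180] * 40
breached, p99 = p99_breached(window)
if breached:
    print(f"ALERT: P99 latency {p99:.0f} ms exceeds {P99_THRESHOLD_MS} ms")
```

When this fires, the on-call engineer immediately knows which service, which route, and which threshold, with no model to second-guess.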
The P99 of request duration is a solid starting point, but most companies need a broader set of service-level objectives (SLOs) and a clear handle on their error budgets. The challenge? Maintaining this consistently at scale in production is tough.
Achieving key KPIs—like 99.95%+ uptime—isn’t about occasional strokes of genius. It’s about nailing the fundamentals every single time.
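To make the error budget concrete, the arithmetic behind a 99.95% availability target fits in a few lines (the 30-day window here is an assumption; adjust for your own SLO period):

```python
slo = 0.9995                       # 99.95% availability target
window_minutes = 30 * 24 * 60      # assumed 30-day rolling window

error_budget_minutes = (1 - slo) * window_minutes
print(f"Allowed downtime per 30 days: {error_budget_minutes:.1f} minutes")  # ~21.6

# A single incident that burns 10 of those minutes consumes nearly half the
# budget for the whole window, which is a concrete number teams can plan around.
```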
The biggest winners in the space—companies like Wiz and Datadog—were initially light on AI. Instead, their core philosophy revolves around developer and user experience. Thanks to their agentless AWS approach, they could go from zero to proving value on customer data in less than 30 minutes. Their success is not built on flashy, “killer” features but rather on nailing the details and making their products genuinely delightful—much like Slack. Companies are willing to pay a premium for that experience.
Useful Machine Learning in Observability
Among the many ML features introduced pre-2016, only a couple have truly proven to beat statistics. One of the most impactful is log pattern grouping. Since logs are often highly repetitive—generated from the same statements—grouping similar logs into patterns is extremely useful.
This concept was first brought to the mass market in 2012 by Sumo Logic, under the name LogReduce™. Over time, this approach became a standard feature across nearly all major observability vendors.
Log patterns:
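The underlying idea can be sketched in a few lines: mask the variable parts of each line (numbers, IDs, IP addresses) and group by the resulting template. The regexes below are purely illustrative; production implementations use far more sophisticated tokenization and clustering:

```python
import re
from collections import Counter

# Masks for the variable parts of a log line; order matters (IPs before plain numbers).
MASKS = [
    (re.compile(r"\b\d+\.\d+\.\d+\.\d+\b"), "<IP>"),
    (re.compile(r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b"), "<UUID>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]

def to_pattern(line: str) -> str:
    """Replace variable tokens so structurally identical lines collapse into one template."""
    for regex, token in MASKS:
        line = regex.sub(token, line)
    return line

logs = [
    "user 1042 logged in from 10.0.0.7",
    "user 9931 logged in from 10.0.3.2",
    "payment 55 failed: timeout after 3000 ms",
    "user 77 logged in from 10.0.0.9",
]

patterns = Counter(to_pattern(line) for line in logs)
for pattern, count in patterns.most_common():
    print(f"{count:>3}  {pattern}")
# Thousands of similar lines collapse into a handful of templates, e.g.:
#   3  user <NUM> logged in from <IP>
#   1  payment <NUM> failed: timeout after <NUM> ms
```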
Another game-changing ML feature was automated threshold setting for alerts based on historical data and trends. Instead of manually guessing P99 thresholds, ML can analyze past performance to determine them dynamically.
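A simplified sketch of the idea, assuming you keep a history of daily P99 values; the two-week lookback and 25% headroom are arbitrary choices for illustration:

```python
import numpy as np

def dynamic_threshold(daily_p99_ms, lookback_days=14, headroom=1.25):
    """Derive an alert threshold from recent history instead of a hand-picked constant.

    Takes the worst P99 seen over the lookback window and adds headroom, so the
    threshold tracks drift in traffic and seasonality without manual re-tuning.
    """
    recent = np.asarray(daily_p99_ms[-lookback_days:], dtype=float)
    return float(recent.max() * headroom)

history_ms = [220, 235, 210, 240, 250, 230, 225, 245, 238, 242, 228, 233, 247, 239]
print(f"Alert when P99 exceeds {dynamic_threshold(history_ms):.0f} ms")  # roughly 312 ms
```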
Additionally, predictive alerting has proven useful in certain cases. For example, rather than triggering an alert when disk usage hits an arbitrary threshold, it's often more practical to alert when the disk is projected to be full within the next 48 hours. This proactive approach helps teams prevent issues before they become outages.

Source: Datadog blog
Interestingly, these features have become so ubiquitous that they are often no longer marketed as AI/ML. Instead, they are simply called “log patterns” or “automatic thresholds”—just another standard part of the toolkit.
For example, Dynatrace brands its implementation as predictive anomaly detection, while for Prometheus users it’s just a straightforward linear-regression prediction. The buzzwords were dropped, but the underlying functionality remains the same.
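Prometheus exposes this as the predict_linear() function; the Python below is just an illustrative stand-in for the same idea, with made-up samples and a 48-hour horizon:

```python
import numpy as np

def hours_until_disk_full(timestamps_h, free_gb):
    """Fit a line to recent free-space samples and extrapolate to zero.

    Returns None if free space is not trending down.
    """
    slope, intercept = np.polyfit(timestamps_h, free_gb, 1)  # GB per hour
    if slope >= 0:
        return None
    return -intercept / slope - timestamps_h[-1]

# Illustrative samples: free space measured every 6 hours, shrinking ~2 GB/h.
hours = [0, 6, 12, 18, 24]
free = [100, 88, 76, 63, 52]

eta = hours_until_disk_full(hours, free)
if eta is not None and eta < 48:
    print(f"ALERT: disk projected to fill in ~{eta:.0f} hours")
```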
Learnings
Vendor demos using real customer data are strong indicators of a product’s long-term value. In contrast, AI/ML can put on an impressive show at conferences, but those flashy demos do not always translate into practical capabilities. One of the most overhyped features? Anomaly Detection—often outperformed by solid statistical methods.
Observability products improved tremendously over the past decade, helping many companies gain a competitive edge. While trying new tools is worthwhile, evaluating them remains tricky for B2B enterprises. Running proof-of-concept tests on your own data is a great start, but ultimate success is measured by retention—how naturally people integrate the product into their workflows. Before your next renewal, it’s smart to assess feature adoption.
ML/AI is no different. At its core, it's about creating better user experiences and interfaces. AI-powered buttons and flashy labels are temporary—the best features will feel native, blending seamlessly into workflows. Just as horseless carriages became cars and smartphones became phones, AI will simply become another layer of automation and intelligence.
If you enjoyed this post, consider subscribing to the Quesma newsletter (scroll to the bottom of this page) or follow us on LinkedIn! We’ll continue this series with deeper dives into observability in the LLM era and future predictions. The next post will investigate whether anomaly detection is fundamentally flawed or the technology simply isn’t mature enough yet.