Predictive Log Analysis: Turning logs into actionable insights

Tomasz Olszowy
Mar 20, 2026 · 2 min read
For a long time I didn’t really care much about logs. I mean — obviously they exist, they’re being collected, they show something, but as long as nothing “blows up”, you generally don’t look at them. Only when something stops working does the classic digging begin: line-by-line analysis, over and over, until you find something suspicious. Or maybe you don’t.
I never liked it!
The change came partly by accident, partly from just being fed up with the topic. I had several incidents in a row where the monitoring looked perfectly fine, and yet something was clearly off. Everything green, but people were writing that the app was “acting weird”. And good luck figuring that one out…
At some point I started searching online for a tool that could help with log analysis and came across Datadog Watchdog. I poked around the configuration a bit, but without any big expectations. I was just curious whether it would actually change anything or make the work easier. I treated it kind of like “I’ll give it a try and most likely nothing will come of it anyway”.
But then one situation happened that really shifted my perspective.
It was a Wednesday morning, still before my first coffee — a completely standard day. Normal traffic, no spikes. Not long before I had finished rolling out a currency exchange rate search feature. Seemed like a fairly simple thing.
Dashboards? All good. Zero errors, response times within norms. I’m sipping coffee…
…and suddenly Watchdog fires an alert. A very… inconspicuous one. No crashes, no 500s. Just information that there were more warnings than usual appearing in the logs for one specific type of request.
First reaction — okay… probably nothing serious…
But it kept nagging at me. So I decided to check it anyway.
It turned out the issue was triggered when someone typed a phrase with a typo, and the system tried to do autocomplete suggestions + sort the results at the same time — which caused duplication of the call. Each feature worked correctly on its own. The problem only appeared when both were triggered together.
Seemingly a tiny thing, but the queries started piling up. There were simply suspiciously many of them, and that was clearly visible in the logs.
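To make the failure mode concrete, here is a hypothetical sketch of the bug (my own reconstruction, not the actual code): each feature independently fetches candidates from the backend, so a phrase with a typo, which triggers autocomplete and sorting together, hits the database/index twice.

```python
# Hypothetical sketch of the duplicate-call bug: autocomplete and sorting
# each fetch their own candidates, so the typo path queries the backend twice.
from collections import Counter

calls = Counter()  # counts backend queries per phrase


def fetch_candidates(phrase):
    calls[phrase] += 1           # every call hits the database/index
    return [phrase.upper()]      # stand-in for a real lookup


def autocomplete(phrase):
    return fetch_candidates(phrase)           # call #1


def sorted_results(phrase):
    return sorted(fetch_candidates(phrase))   # call #2 -- the duplication


def search(phrase, has_typo):
    if has_typo:                 # typo path triggers both features at once
        autocomplete(phrase)
        return sorted_results(phrase)
    return sorted_results(phrase)


search("eur", has_typo=False)    # clean phrase: one backend call
search("eru", has_typo=True)     # typo: two backend calls for one request
print(calls)
```

Each path is correct in isolation; only the combination doubles the load, which is exactly why no single error showed up, just a growing pile of queries.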
There was no hard crash or application freeze, thankfully! But it was the classic slowly boiling frog: things were getting gradually "hotter", gradually heavier. Small delays, almost unnoticeable on their own and not screaming "failure", started to accumulate. As traffic increased, the domino effect kicked in, and at some point those "innocent little delays" became genuinely noticeable. It was turning into a real problem, and it appeared at the worst possible moment, completely out of the blue, with no obvious way to tackle it.
Classic case.
Only then did I truly understand what anomaly detection is really about. It’s about catching the moment when something starts deviating from the norm — even when it’s still not dangerous, and nothing obviously points to a problem.
When manually reading logs, you usually look for obvious things — so obvious that there’s no doubt what the issue is: exceptions, timeouts, specific error messages. Unfortunately, these “quiet” patterns are very easy to miss — they just drown in the noise. Especially when you’ve been staring at logs for a long time — fatigue kicks in and everything starts blending together.
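The core idea can be sketched in a few lines. This is my own minimal illustration of "deviating from the norm", not Watchdog's actual algorithm: compare the warning count in the current time bucket against a rolling baseline, and flag it when it drifts more than a few standard deviations away.

```python
# Minimal anomaly-detection sketch (illustrative only, not Watchdog's
# algorithm): flag a time bucket whose warning count deviates from the
# baseline mean by more than k standard deviations.
import statistics


def is_anomalous(history, current, k=3.0):
    """history: warning counts from previous buckets; current: this bucket."""
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history) or 1.0  # guard against zero variance
    return abs(current - mean) > k * stdev


baseline = [12, 9, 11, 10, 13, 8, 11, 12]  # a "normal" stretch of warnings
print(is_anomalous(baseline, 11))  # within the norm
print(is_anomalous(baseline, 47))  # the quiet spike -- no errors, just "more"
```

The point is that 47 warnings is not an error and would never trip a threshold-based alert on failures, yet against the baseline it clearly stands out.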
In this specific case, Datadog Watchdog gave me a ready-made signal. Checking it took… maybe an hour? Maybe a bit more. In short — not long at all.
In the end it turned out there was one unnecessary call in the filtering logic that, for "safety", multiplied queries to the database/index. Nothing dramatic. But every extra call logged a warning. The fix took a moment: just removing that one redundant call.
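In the same hypothetical sketch as above, the fix amounts to fetching once per request and letting both features reuse the result:

```python
# Hypothetical sketch of the fix: one backend call per request, shared by
# both the autocomplete and the sorting path.
from collections import Counter

calls = Counter()  # counts backend queries per phrase


def fetch_candidates(phrase):
    calls[phrase] += 1
    return [phrase.upper()]


def search_fixed(phrase, has_typo):
    candidates = fetch_candidates(phrase)  # fetched exactly once
    if has_typo:
        _suggestions = candidates          # autocomplete reuses the result
    return sorted(candidates)              # sorting reuses it as well


search_fixed("eru", has_typo=True)
print(calls["eru"])  # one call, even on the typo path
```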
But if it had gone unnoticed, it would almost certainly have come back later — at a much less convenient moment — and with very different consequences.
Introducing an AI-based solution delivered real value in this particular case. It doesn’t do anything “magical”, but it dramatically narrows down the search space.
And that already makes a difference.
When you have many services and a constant flood of logs, manually keeping up with everything stops being realistic. It’s not about lack of motivation or not liking the work (even though it’s extremely monotonous) — it’s simply a matter of scale.
At some point you either have a tool that helps catch these anomalies, or you operate reactively. There isn’t really a third option.
Everything else comes down to configuration — and whether you actually respond to those signals instead of ignoring them. Because — at least from my experience — it’s usually those small things that most often foreshadow bigger problems… and hopefully don’t end in disaster.