In the ever-evolving landscape of IT operations, stability and performance are paramount. As part of a performance and stability test for our largest customers, we collected logs from a large Kubernetes application into a log aggregator and then sent them into Log Anomaly Detection in IBM Cloud Pak for AIOps. This tested two things: the out-of-the-box integration with the log aggregator, and the ability to handle a sizeable load of about 1.3 TB of logs per day, which equates to roughly 6,000 logs per second.
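For context, the back-of-the-envelope arithmetic behind those figures is simple; the sketch below (plain Python, with the average log record size implied by the two reported numbers rather than measured) shows how 1.3 TB/day and 6,000 logs/second relate.

```python
# Back-of-the-envelope check of the test load figures.
BYTES_PER_DAY = 1.3e12          # ~1.3 TB of logs per day
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds
LOGS_PER_SECOND = 6_000         # reported ingest rate

bytes_per_second = BYTES_PER_DAY / SECONDS_PER_DAY
avg_log_size = bytes_per_second / LOGS_PER_SECOND

print(f"Throughput: {bytes_per_second / 1e6:.1f} MB/s")        # ~15.0 MB/s
print(f"Implied average log record: {avg_log_size:.0f} bytes")  # ~2,500 bytes
```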
The goal: identify meaningful anomalies, uncover hidden issues, and validate the robustness of Log and Metric Anomaly Detection under real-world conditions.
Key Findings and Observations
Timeout Issues
Occurrences of timeout log messages between microservices were observed, suggesting a need for resource scaling at higher loads. Timeouts happen and can go unnoticed because the service automatically retries, but AIOps detected that they were happening more frequently than normal. Log aggregators can be configured to alert on occurrences of Timeout log messages, but that requires prior knowledge of the logs, what they mean, and alert thresholds that can be difficult to tune. AIOps learns and detects this automatically.
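To make the contrast concrete, here is a minimal sketch of the manual approach described above: a hand-written rule that counts "Timeout" messages per minute and fires when a fixed threshold is crossed. The threshold, the match pattern, and the log format are all assumptions an operator would have to know and keep tuning, which is exactly the step a learned baseline removes.

```python
from collections import Counter
from datetime import datetime

# Hypothetical hand-tuned rule: alert if more than THRESHOLD
# "Timeout" messages are seen in any one-minute window.
THRESHOLD = 20          # must be chosen and re-tuned by hand
PATTERN = "Timeout"     # assumes we already know what to look for

def timeout_alerts(log_lines):
    """log_lines: iterable of (timestamp: datetime, message: str) pairs."""
    per_minute = Counter()
    for ts, message in log_lines:
        if PATTERN in message:
            per_minute[ts.replace(second=0, microsecond=0)] += 1
    return [(minute, count) for minute, count in sorted(per_minute.items())
            if count > THRESHOLD]

# Example with two synthetic lines; a real feed would come from the aggregator.
sample = [(datetime(2024, 1, 1, 12, 0, 5), "Timeout calling service-b, retrying"),
          (datetime(2024, 1, 1, 12, 0, 9), "Request completed in 120ms")]
print(timeout_alerts(sample))   # [] -- below the hand-picked threshold
```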
Verbose Logging Overload
Over 600 log messages per minute were generated by a single microservice in response to a single action. These logs, while technically valid, added noise that complicated readability and troubleshooting and degraded performance. The immediate suggestion was to review the logging configuration and reduce the verbosity.
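As an example of what "reduce the verbosity" can look like in practice, here is a minimal Python sketch (the logger name and levels are illustrative; the microservice in the test is not necessarily written in Python) that raises the level of one chatty logger so routine per-step messages never reach the aggregator.

```python
import logging

logging.basicConfig(level=logging.INFO)
noisy = logging.getLogger("orders.sync")   # hypothetical chatty module

noisy.info("step 1 of 600 ...")            # today: shipped to the aggregator

# Raise the threshold for just this logger so routine per-step messages
# are dropped at the source, while warnings and errors still get through.
noisy.setLevel(logging.WARNING)

noisy.info("step 2 of 600 ...")            # now suppressed
noisy.warning("sync fell behind by 30s")   # still emitted
```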
Database Errors
Errors from the underlying databases highlighted a need for query optimization and transaction tuning. Being aware that they were happening, and when, and seeing them in context in the Topology and correlated with other Alerts made them more actionable.
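Once such errors are visible, a typical next step is to look at how the offending queries execute. A generic sketch using Python's built-in sqlite3 (the product's actual databases are not named here, so this is purely illustrative) shows the kind of check involved: inspect the query plan, then add an index so the query stops scanning the whole table.

```python
import sqlite3

# Illustrative only: a tiny table and a query that scans it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, service TEXT, ts INTEGER)")

def show_plan(sql):
    for row in conn.execute("EXPLAIN QUERY PLAN " + sql):
        print(row)

query = "SELECT * FROM events WHERE service = 'catalog'"
show_plan(query)                                   # reports a full table SCAN

conn.execute("CREATE INDEX idx_events_service ON events(service)")
show_plan(query)                                   # now a SEARCH using the index
```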
OutOfMemory Errors
A microservice was infrequently restarting with OOM exceptions, suggesting memory pressure under high load. The anomaly highlighted the problem, when it started to happen, and the relevant log messages in context. The microservice automatically restarted each time, which made the issue easy to overlook. Of course, standard monitoring could also detect microservice restarts and memory issues, but again, AIOps did this without rules, configuration, or tuning.
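For comparison, this is roughly what the "standard monitoring" route looks like: a minimal sketch assuming the official kubernetes Python client and kubeconfig access to the cluster (the namespace is illustrative), reporting containers whose last termination reason was OOMKilled.

```python
from kubernetes import client, config

# Assumes kubeconfig access to the cluster running the application under test.
config.load_kube_config()
v1 = client.CoreV1Api()

for pod in v1.list_namespaced_pod("aiops").items:      # namespace is illustrative
    for cs in pod.status.container_statuses or []:
        terminated = cs.last_state.terminated
        if terminated and terminated.reason == "OOMKilled":
            print(f"{pod.metadata.name}/{cs.name}: {cs.restart_count} restarts, "
                  f"last OOMKilled at {terminated.finished_at}")
```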
Unnecessary Calls
An anomaly showed that the system made 1,621 calls to a microservice API (and logged them) in just 5 minutes. It was discovered that these calls should only be made if an optional component was installed, which it was not: an easy fix with significant performance benefits.
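The fix itself amounts to checking for the optional component once and skipping the calls when it is absent. A hypothetical sketch of that guard (the function and check are invented for illustration, not taken from the product's code):

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def optional_component_installed() -> bool:
    """Check once whether the optional component is present (hypothetical check)."""
    # e.g. look for its service registration, CRD, or feature flag
    return False

def sync_with_optional_component():
    # Before the fix, this call was made unconditionally -- 1,621 times in
    # 5 minutes against a component that was not even installed.
    if not optional_component_installed():
        return                      # skip the API call and the log line entirely
    call_component_api()            # hypothetical downstream API call

def call_component_api():
    print("calling optional component API ...")
```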
Conclusion
This deep dive into Anomaly Detection on AIOps logs has not only validated the platform’s ability to handle massive volumes but also surfaced actionable insights that can drive improvements in monitored products. By continuously observing and optimizing, customers can ensure their applications and services remain resilient, scalable, and ready for the challenges their clients face.