Uncovering Insights with Log Anomaly Detection in IBM Cloud Pak for AIOps

By Ian Manning posted 2 days ago

  

In the ever-evolving landscape of IT operations, stability and performance are paramount. As part of a performance and stability test for our largest customers, we collected logs from a large Kubernetes application into a log aggregator and then sent them into Log Anomaly Detection in IBM Cloud Pak for AIOps. This exercised two things: first, the out-of-the-box integration to the log aggregator, and second, the ability to handle a sizeable load of about 1.3 TB of logs per day, which equates to roughly 6,000 logs per second.
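
As a rough sanity check on those figures (assuming decimal terabytes and an even spread over a 24-hour day), the implied average record size works out to a couple of kilobytes, which is plausible for structured application logs:

```python
# Back-of-the-envelope check of the test load figures.
# Assumes 1 TB = 10^12 bytes and an even distribution over a 24-hour day.
BYTES_PER_DAY = 1.3e12          # ~1.3 TB of logs per day
LOGS_PER_SECOND = 6_000         # observed ingest rate
SECONDS_PER_DAY = 24 * 60 * 60

bytes_per_second = BYTES_PER_DAY / SECONDS_PER_DAY
avg_log_size = bytes_per_second / LOGS_PER_SECOND

print(f"Ingest rate: {bytes_per_second / 1e6:.1f} MB/s")      # ~15.0 MB/s
print(f"Average log record size: {avg_log_size / 1e3:.1f} KB") # ~2.5 KB
```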

The goal: identify meaningful anomalies, uncover hidden issues, and validate the robustness of Log and Metric Anomaly Detection under real-world conditions.

Key Findings and Observations

Timeout Issues

Occurrences of timeout log messages between microservices were observed, suggesting a need for resource scaling at higher loads. Timeouts happen, and they can go unnoticed because the service automatically retries, but AIOps detected that they were happening more frequently than normal. A log aggregator can be configured to alert on occurrences of timeout log messages, but that requires prior knowledge of the logs, what they mean, and alert configuration that can be difficult to tune. AIOps learns the normal pattern and detects the deviation automatically, as the contrast sketched below illustrates.
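
The sketch below contrasts the two approaches in miniature: a static threshold rule of the kind a log aggregator would need, versus flagging a window because it deviates from a learned baseline. This is an illustrative simplification, not the algorithm Cloud Pak for AIOps actually uses; the line-count windowing and the three-sigma rule are assumptions.

```python
import statistics

def timeout_counts_per_window(log_lines, window_size=60):
    """Count 'timeout' messages in fixed-size windows of log lines.

    Illustrative only: a real pipeline would window by timestamp,
    not by line count.
    """
    counts = []
    for start in range(0, len(log_lines), window_size):
        window = log_lines[start:start + window_size]
        counts.append(sum("timeout" in line.lower() for line in window))
    return counts

def static_rule(count, threshold=10):
    # Log-aggregator style: only fires if someone picked the right threshold
    # in advance, for this specific message, for this specific service.
    return count > threshold

def learned_baseline(history, count, sigmas=3.0):
    # Baseline style: fires when the count deviates from what is normal
    # for this service, with no hand-tuned threshold.
    if len(history) < 2:
        return False
    mean = statistics.mean(history)
    stdev = statistics.stdev(history) or 1.0
    return count > mean + sigmas * stdev
```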

Verbose Logging Overload

Over 600 log messages per minute were generated by a single microservice in response to a single action. These logs, while technically valid, added noise that complicates readability and troubleshooting and degrades performance. The immediate recommendation was to review the logging and reduce it.
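
One quick way to quantify this kind of noise offline is to rank services by their peak per-minute log rate. The sketch below assumes each record carries a service name and a timestamp already truncated to the minute; the field names are hypothetical.

```python
from collections import Counter

def noisiest_services(records, top_n=5):
    """Rank services by their peak log messages per minute.

    `records` is an iterable of dicts with (hypothetical) fields
    'service' and 'minute' (timestamp truncated to the minute).
    """
    per_service_minute = Counter(
        (r["service"], r["minute"]) for r in records
    )
    peak = Counter()
    for (service, _minute), count in per_service_minute.items():
        peak[service] = max(peak[service], count)
    return peak.most_common(top_n)

# A service peaking above ~600 messages per minute, as seen in this test,
# would surface at the top of this list.
```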

Database Errors

Errors from the underlying databases highlighted a need for query optimization and transaction tuning. Being aware that they are happening, and when, and seeing them in context in the Topology and correlated with other Alerts, made them more actionable.

OutOfMemory Errors

A microservice was infrequently restarting with OutOfMemory exceptions, suggesting memory pressure under high load. The anomaly highlighted the problem, when it started to happen, and the log messages in context. The microservice automatically restarted each time, which made the issue easy to overlook. Of course, standard monitoring could also detect microservice restarts and memory issues, but again AIOps does this without rules, configuration, or tuning.
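
For comparison, the manual equivalent with standard tooling looks something like the sketch below, which uses the official Kubernetes Python client to list containers whose last termination reason was OOMKilled. The namespace is a placeholder, and even then this only tells you that restarts happened, not when the pattern started or which log lines surround it.

```python
from kubernetes import client, config

def find_oom_killed_pods(namespace="my-app"):  # placeholder namespace
    """List containers whose last termination reason was OOMKilled."""
    config.load_kube_config()  # or config.load_incluster_config() in-cluster
    v1 = client.CoreV1Api()
    hits = []
    for pod in v1.list_namespaced_pod(namespace).items:
        for status in pod.status.container_statuses or []:
            terminated = (status.last_state.terminated
                          if status.last_state else None)
            if terminated and terminated.reason == "OOMKilled":
                hits.append((pod.metadata.name, status.name,
                             status.restart_count))
    return hits

for pod_name, container, restarts in find_oom_killed_pods():
    print(f"{pod_name}/{container}: {restarts} restarts, last OOMKilled")
```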

Unnecessary Calls

An anomaly showed that the system made 1,621 calls to a microservice API (and logged them) in just 5 minutes. It was discovered that these calls should only be made if an optional component was installed, which it was not: an easy fix with significant performance benefits.
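
The fix itself amounts to a guard of the kind sketched below: check once whether the optional component is installed and skip the calls entirely when it is not. The function and flag names here are hypothetical, not taken from the product.

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def optional_component_installed() -> bool:
    """Placeholder check; in practice this might query an install registry
    or a capability endpoint once and cache the answer."""
    return False  # in this test environment the component was not installed

def call_optional_component_api(payload):
    """Hypothetical client call to the optional component's API."""
    ...

def maybe_call_optional_api(payload):
    # Guard the call so a missing optional component does not generate
    # hundreds of pointless requests (and log lines) every few minutes.
    if not optional_component_installed():
        return None
    return call_optional_component_api(payload)
```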

Conclusion

This deep dive into Anomaly Detection on AIOps logs has not only validated the platform’s ability to handle massive volumes but also surfaced actionable insights that can drive improvements in monitored products. By continuously observing and optimizing, customers can ensure their applications and services remain resilient, scalable, and ready for the challenges their clients face.
