Introduction
Organizations are drawn to AIOps for its promise of AI-driven intelligence and automation that support quick, accurate decisions and help maintain resiliency. AIOps uses artificial intelligence to simplify IT operations management and to accelerate and automate problem resolution in complex modern IT environments.
A recent blog by Sanjay Chandru set the stage in Best practices for taking a hybrid approach to AIOps. We learned that a key capability of AIOps is detection. Accurately detecting issues and anomalies across hybrid cloud infrastructure and applications empowers IBM Z IT ops teams and accelerates the AIOps journey.
In this blog we will focus on intelligent anomaly detection, which helps avoid outages by providing advance notification of unusual behavior before end users or SLAs are affected.
Client challenges
Detecting operational issues has always been a critical tenet of providing any type of IT service. What has changed, largely because of digital transformation, is the demand placed on how services are provided. A cultural shift has driven the expectation that the services we rely on day to day must be available at any time of day, with no exceptions. This presents a new set of challenges for IT operations, which must find new ways to ensure that key business applications run at all times with no degradation in service.
What's now required, and how is this different from what I have today?
Resource monitoring tools remain vital for detecting operational issues before end users do. Historically, it was sufficient to set static thresholds, monitor operational behavior, and generate alerts when those thresholds were breached.
However, today’s workloads are evolving rapidly as enterprises embrace DevOps and continuously deliver new capabilities that may change how a workload behaves. A single static threshold no longer fits every situation. How and when users interact with your applications is also changing.
To address this, we need to understand how a system is expected to behave at any given time. The workload running when stock markets open on Monday morning is likely to be very different from the workload at 3:00 AM on a Sunday. By leveraging the operational data of our environments, we can learn how the system normally behaves at any given time and raise an alert when it deviates from that behavior.
How IBM can help
IBM Z Anomaly Analytics with Watson provides the capability to intelligently detect operational issues by finding anomalies in both log and metric operational data.
For metric data, the data scientists behind the solution have worked closely with the subsystem teams within IBM and with our customers to derive a set of KPIs specific to the various IBM Z subsystems. The solution then uses historical data to generate a baseline of what is normal for each KPI in the given environment. The baseline for each KPI takes into account the time of day and the day of the week, eliminating the need for static thresholds. Once the baseline has been established, real-time data is compared to it to identify metric anomalies. A metric anomaly is identified when one or more KPIs have trended outside of normal operational behavior as defined by the baseline.
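To make the idea of a time-aware baseline concrete, here is a minimal Python sketch. It is not the algorithm used by IBM Z Anomaly Analytics with Watson, which this blog does not describe in detail. It assumes a simple mean and standard deviation per (day of week, hour) slot, and the function names, the synthetic KPI history, and the factor k are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean, pstdev


def build_baseline(history):
    """Group (timestamp, value) samples by (weekday, hour) and
    return {(weekday, hour): (mean, standard deviation)}."""
    buckets = defaultdict(list)
    for ts, value in history:
        buckets[(ts.weekday(), ts.hour)].append(value)
    return {slot: (mean(vals), pstdev(vals)) for slot, vals in buckets.items()}


def is_anomalous(baseline, ts, value, k=3.0, min_std=1e-9):
    """Flag a value that lies more than k standard deviations away from
    the baseline mean for the same weekday and hour."""
    slot = (ts.weekday(), ts.hour)
    if slot not in baseline:
        return False  # no history for this slot; this sketch does not judge it
    mu, sigma = baseline[slot]
    return abs(value - mu) > k * max(sigma, min_std)


def synthetic_history(weeks=4):
    """Hypothetical KPI: about 1000 during weekday business hours and
    about 50 otherwise, with a small deterministic jitter."""
    start = datetime(2024, 1, 1)  # a Monday
    samples = []
    for h in range(24 * 7 * weeks):
        ts = start + timedelta(hours=h)
        busy = ts.weekday() < 5 and 9 <= ts.hour < 17
        samples.append((ts, (1000.0 if busy else 50.0) + (h % 5)))
    return samples


baseline = build_baseline(synthetic_history())
monday_9am = datetime(2024, 1, 29, 9)
sunday_3am = datetime(2024, 1, 28, 3)

# The same reading is normal in one slot and anomalous in the other.
print(is_anomalous(baseline, monday_9am, 1002.0))
print(is_anomalous(baseline, sunday_3am, 1002.0))
```

A static threshold would have to treat a reading of about 1000 the same way at both times. Keying the baseline on weekday and hour lets the same value pass on Monday morning and stand out on Sunday night.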
Similarly, for log data, the embedded machine learning capability creates a baseline of the log messages that are generated at various times of day and days of the week. Once the baseline has been established, real-time messages are compared against the model to find anomalous log messages. Log anomalies are identified for messages that have not been seen before, messages that occur at atypical frequencies, or messages that are out of sequence.
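The three kinds of log anomaly can be illustrated with a small Python sketch. This is a deliberately simplified stand-in, not the product's machine learning model. It assumes the baseline is just the set of known message IDs, the highest count seen per interval, and the set of observed adjacent message pairs. It ignores time of day and only checks for unusually high counts. The message IDs, function names, and the slack factor are hypothetical.

```python
from collections import Counter


def train(windows):
    """Learn a baseline from training intervals, each a list of message IDs."""
    known = set()
    max_count = Counter()
    transitions = set()
    for window in windows:
        counts = Counter(window)
        known.update(counts)
        for msg, n in counts.items():
            max_count[msg] = max(max_count[msg], n)
        # Remember which message was seen directly after which.
        transitions.update(zip(window, window[1:]))
    return known, max_count, transitions


def score(model, window, slack=2.0):
    """Return a list of (anomaly_type, detail) findings for one interval."""
    known, max_count, transitions = model
    findings = []
    for msg, n in Counter(window).items():
        if msg not in known:
            findings.append(("new_message", msg))
        elif n > slack * max_count[msg]:
            findings.append(("atypical_frequency", msg))
    for pair in zip(window, window[1:]):
        if pair[0] in known and pair[1] in known and pair not in transitions:
            findings.append(("out_of_sequence", pair))
    return findings


# Hypothetical message IDs for a job that starts, does work, and ends.
model = train([
    ["START", "WORK", "END"],
    ["START", "WORK", "WORK", "END"],
])

print(score(model, ["START", "WORK", "END"]))
print(score(model, ["START", "FAIL", "END"]))
print(score(model, ["START"] + ["WORK"] * 5 + ["END"]))
print(score(model, ["END", "START", "WORK"]))
```

Each finding carries a type and the offending message or pair, which is the kind of detail an alert would need to pass on to an operator.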
If an anomaly is detected, the system generates an alert that can be consumed by event management, ticketing, or other systems such as IBM Cloud Pak for Watson AIOps, where those events can be correlated with events from the rest of your hybrid cloud.
Client outcome
A major US airline has adopted the anomaly detection capabilities from IBM. The solution was deployed to detect operational anomalies in their key business applications, specifically the pilot and crew scheduling systems that are integral to keeping their planes in the air and passengers moving around the globe.
After a baseline of normal operations was built, the operations team was notified of approximately 15-20 anomalies each month. During a 12-month period, 4 of those anomalies would have led to a broader system outage had they gone undetected, leaving planes on the ground and passengers stranded away from home. Because the anomalies were detected automatically and the team was alerted early, operations was able to identify each issue and take action before any customers were impacted.
What are my next steps?
Depending on where you are in your journey toward adopting AIOps best practices, the following resources can help you gain a deeper understanding:
- To assess your current stage of AIOps maturity and identify action-oriented next steps for adopting more AIOps best practices, inquire about the 15-minute online AIOps Assessment for IBM Z.
- Join the AIOps on IBM Z Community to follow this blog series about best practices for taking a hybrid approach to AIOps.
- Finally, to research the IBM Z products that implement AIOps technologies to improve operational resiliency, visit our product portfolio page.