In the blog "Your journey to AIOps includes IBM Z", IBM described a framework for accelerating AIOps and stressed the importance of a holistic approach that includes IBM Z to improve overall business resiliency. The framework breaks the journey down into four stages: Firefighting, Reactive, Proactive, and Intelligent. It also covers how to accelerate through the stages of AIOps by integrating a broad set of practices and capabilities from three areas: Detect, Decide, and Act (see the figure below). This blog explores best practices to Detect and identify potential issues before they disrupt your business.
To accomplish this goal, IT operations teams need to focus on monitoring the complete infrastructure and end-to-end application performance, generating alerts for incidents, and applying analytics for early detection of anomalies. Let’s dive into each aspect of Detect.
IT organizations that find themselves in firefighting mode focus on service restoration and resolving problems as they happen. In some cases, the problems are discovered by end users, and the business impact of each issue is largely unknown, making it difficult to prioritize which issue to work on first. In this stage, IBM Z is likely also managed as a silo, with little in common with how operations are run in other parts of the organization. Challenges at this level include:
- Your monitoring solution does not cover the full stack of technologies at hand, leaving blind spots. The organization and the tools they use are also siloed, leading to a lack of ownership across subsystems.
- Thresholds are not used, or are set once and never adjusted, leaving many of them useless and producing either too few or too many alerts.
- Operators spend a lot of time looking at monitors, looking for problems, rather than relying on being alerted when they need to pay attention to an issue.
To move past this stage, IT organizations should look to adopt practices covered in the next section.
Organizations moving from firefighting to reactive are investing in practices, skills, and tools that allow them to identify problems in a more structured way and therefore find problems earlier. This investment in process improvement pays off through increased operational efficiency and improved SLAs as a result of improved resiliency. Best practices for reactive include:
- Ensure you have appropriate coverage in your monitoring solutions to avoid gaps. With gaps, an issue in one area could go undetected until it spreads and has a more systemic impact to core business applications. This entails full-stack monitoring including middleware, APIs, JVMs, operating systems, hardware, storage, and networks.
- Improve usage of thresholds and rule-based alerts for your monitoring solution so you no longer need to continuously have operators observing the monitors. Earlier notification of problems often results in smaller impacts on end users and SLAs.
- With the massive volume of workloads running on IBM Z, the number of events can be overwhelming. We recommend that you leverage notifications for non-critical events and alerts for critical events. Operators and SMEs can subscribe to the right level of events. For example, SMEs may subscribe to notifications and leverage those for root cause analysis and to identify new opportunities for automation to avoid thresholds being breached.
- To effectively manage all critical alerts, incident tickets are manually created in an enterprise-wide support system with the necessary information. This helps you to ensure that all critical alerts are addressed.
- To avoid known defects, which can have a major impact on the resilience of your system, you should do preventive maintenance. We recommend that preventive maintenance be installed at least two to four times a year. In addition, we recommend that potentially high-impact fixes, such as HIPER, PE Fix, Security/Integrity, and Pervasive PTFs, be installed more frequently. See this article for more details.
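The split between notifications for non-critical events and alerts for critical events can be sketched as a simple rule evaluation. This is a minimal illustration, not the interface of any particular monitoring product: the `ThresholdRule` class, the `route` function, the metric name, and the 80/95 limits are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Severity(Enum):
    OK = 0
    WARNING = 1
    CRITICAL = 2


@dataclass(frozen=True)
class ThresholdRule:
    """Hypothetical rule: two static limits for one metric."""
    metric: str
    warning: float   # at or above this value, a notification is sent
    critical: float  # at or above this value, an alert is raised

    def evaluate(self, value: float) -> Severity:
        if value >= self.critical:
            return Severity.CRITICAL
        if value >= self.warning:
            return Severity.WARNING
        return Severity.OK


def route(rule: ThresholdRule, value: float) -> Optional[str]:
    """Pick the channel for a sample: 'alert', 'notification', or None."""
    severity = rule.evaluate(value)
    if severity is Severity.CRITICAL:
        return "alert"          # operators act on it; a ticket is opened
    if severity is Severity.WARNING:
        return "notification"   # SMEs may subscribe for root cause analysis
    return None


# Assumed limits, for illustration only.
cpu_rule = ThresholdRule(metric="cpu_utilization_pct", warning=80.0, critical=95.0)
samples = [42.0, 83.5, 97.2]
channels = [route(cpu_rule, v) for v in samples]
print(channels)  # [None, 'notification', 'alert']
```

With rules like this in place, operators no longer need to watch the monitors continuously, and the warning tier gives SMEs a stream of near misses to mine for automation opportunities.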
Organizations moving from reactive to proactive are adopting practices that help them detect incidents earlier, before they have a negative business impact. They are also maturing their best practices to handle the complexity of hybrid applications. Let’s have a look at some of the best practices we find in organizations in this stage of the journey to AIOps.
- Incident ticket creation is automated, providing consistency in the level of information included in the ticket. This helps ensure that all critical alerts are evaluated and addressed quickly.
- New thresholds and rule-based alerts are created on an ongoing basis, so that incidents that monitoring previously missed and that had to be detected manually are caught automatically in the future.
- Intensified monitoring is performed at regular intervals, for example using z/OS health checks, and the automation is set up to report findings through incident management. If unhealthy conditions are detected, appropriate remediations are taken, such as adding automation to reduce the risk of any adverse impact to the business.
- Business applications are monitored end-to-end across your hybrid cloud using Application Performance Management software which tracks a transaction as it goes from mobile through all platforms and subsystems. This reduces time in war rooms, as you can immediately understand where the source of slowdown in an application is so you can contact the right SME for root cause analysis and problem resolution.
- Monitoring solutions across your hybrid cloud infrastructure are now feeding into a single pane of glass. This provides you with a consistent and shared understanding of the state of your entire hybrid cloud infrastructure. Since your infrastructure is only as strong as your weakest link, this helps you to rapidly address issues no matter where they occur.
- Key Performance Indicators (KPIs) like traffic, latency, saturation, and errors are used to monitor the health of systems and applications and to quickly identify issues. This provides clarity for operators and ensures a level of consistency with other platforms, since these KPIs are increasingly becoming an industry standard.
- Any change to a KPI, whether informational, warning, or critical, results in an event that is generated automatically and delivered to a central event management system where statistical analysis is possible.
- Monitoring tools are viewed as a critical part of the business and are never turned off.
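The practice of generating an event whenever a KPI changes state can be sketched as follows. This is an illustration under stated assumptions: "change" is read here as a transition between the informational, warning, and critical states, the `KpiWatcher` and `Event` names are hypothetical, the latency limits are made up, and a plain list stands in for the central event management system.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Event:
    kpi: str
    old_status: str
    new_status: str
    value: float


@dataclass
class KpiWatcher:
    # KPI name -> (warning limit, critical limit); hypothetical values
    limits: Dict[str, Tuple[float, float]]
    last_status: Dict[str, str] = field(default_factory=dict)
    # Stand-in for a central event management system
    event_queue: List[Event] = field(default_factory=list)

    def classify(self, kpi: str, value: float) -> str:
        warning, critical = self.limits[kpi]
        if value >= critical:
            return "critical"
        if value >= warning:
            return "warning"
        return "informational"

    def observe(self, kpi: str, value: float) -> None:
        """Record a sample and emit an event only when the status changes."""
        new = self.classify(kpi, value)
        old = self.last_status.get(kpi, "informational")
        self.last_status[kpi] = new
        if new != old:
            self.event_queue.append(Event(kpi, old, new, value))


watcher = KpiWatcher(limits={"latency_ms": (200.0, 500.0)})
for sample in [120.0, 250.0, 260.0, 700.0, 100.0]:
    watcher.observe("latency_ms", sample)

transitions = [(e.old_status, e.new_status) for e in watcher.event_queue]
print(transitions)
```

Emitting events only on transitions, including the return to informational, keeps the event volume manageable while still giving the central system a complete state history for statistical analysis.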
In the Intelligent stage the focus is on continuous improvement. You also continue to integrate the practices and management environments for Detect, Decide and Act into one integrated solution. While artificial intelligence and Machine Learning may have been present in a previous stage for specific narrow applications, we now find a more pervasive adoption of Machine Learning. You understand what is normal for your systems by establishing a baseline, look for anomalies, find trends, and forecast problems so you can remediate them before they become a service disruption. The combination of all the above provides you with the ability to rapidly respond to more and more issues before they impact your business.
- Dynamic and intelligent thresholds are set automatically by AI agents, with awareness of periodicity and importance of applications, and are used to identify issues and anomalies. This can greatly improve the quality of alerts, including reducing alert noise by avoiding unnecessary alerts being raised.
- Track responsiveness to alerts. AI can be leveraged to identify the best responder for a given problem. Intelligent alert systems (ChatOps) automatically identify the right person(s) based on their job description or previous involvement in similar incidents. This allows you to refine the processes used, including on-call duty, so that responsiveness continuously improves and stays within SLAs.
- Machine Learning is leveraged to understand what normal looks like for your organization. This enables real-time scoring of KPIs and analysis of logs so you can detect anomalies before they disrupt your business. These anomalies have associated alerts allowing operators and SMEs to be alerted to any anomalous behavior of logs or metrics.
- Problem signatures enabling early identification of specific critical issues are identified on an ongoing basis. Machine learning algorithms are trained for new problem signatures to map anomalous behavior to the corresponding problem signature. This enables operators to not only be alerted to an anomaly, but also forecast when a threshold may be breached and understand what the likely root cause is with guidance on how to fix the problem.
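The idea of learning what normal looks like and deriving thresholds that are aware of periodicity can be illustrated with a deliberately simple statistical baseline. This is a sketch, not a description of how any IBM product implements it: a per-hour mean and standard deviation stand in for a trained Machine Learning model, and the function names, the sample data, and the factor `k = 3.0` are assumptions.

```python
from collections import defaultdict
from statistics import mean, stdev
from typing import Dict, List, Tuple

Baseline = Dict[int, Tuple[float, float]]  # hour of day -> (mean, stdev)


def build_baseline(history: List[Tuple[int, float]]) -> Baseline:
    """Learn the normal level and spread of a KPI for each hour of the day."""
    by_hour: Dict[int, List[float]] = defaultdict(list)
    for hour, value in history:
        by_hour[hour].append(value)
    # stdev needs at least two samples per hour
    return {h: (mean(v), stdev(v)) for h, v in by_hour.items() if len(v) >= 2}


def is_anomalous(baseline: Baseline, hour: int, value: float, k: float = 3.0) -> bool:
    """Flag values more than k standard deviations from that hour's mean."""
    if hour not in baseline:
        return False  # no baseline yet, so nothing to compare against
    mu, sigma = baseline[hour]
    return abs(value - mu) > k * sigma


# Made-up transactions-per-second history: quiet at 03:00, busy at 14:00.
history = (
    [(3, v) for v in [100.0, 110.0, 90.0, 105.0, 95.0]]
    + [(14, v) for v in [900.0, 950.0, 1000.0, 920.0, 980.0]]
)
baseline = build_baseline(history)

night_spike = is_anomalous(baseline, hour=3, value=940.0)
day_normal = is_anomalous(baseline, hour=14, value=940.0)
print(night_spike, day_normal)  # True False
```

The same value of 940 is normal at the afternoon peak and anomalous at night, which a single static threshold cannot express. That difference is what reduces alert noise while still catching unusual behavior early.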
The journey to AIOps can have a meaningful and measurable impact on business resiliency. In our AIOps assessments we find that many companies are in the Reactive stage, with some in the Proactive stage. We are also seeing organizations moving aggressively towards Intelligent. The best practices for each area are well defined. The journey still takes some time, but companies making the investments are accelerating their journey.
To learn more about the IBM AIOps assessment and framework, please join us for a 30-minute webinar.