
Modernization of Monitoring and Observability in a Hybrid Cloud Infrastructure

By Ash Mahay posted Tue July 20, 2021 01:51 PM

  


Introduction


Organizations are drawn to the promise of AIOps to maintain resiliency by leveraging AI-driven intelligence and automation for quick and accurate decisions.  AIOps uses artificial intelligence to simplify IT operations management through acceleration and automation of problem resolution in complex modern IT environments.

A recent blog by Sanjay Chandru, "Best practices for taking a hybrid approach to AIOps," sets the stage for this series. A key capability that empowers IBM Z IT ops teams and accelerates your journey to AIOps is accurately detecting issues and anomalies across hybrid cloud infrastructure and applications. This post focuses on how monitoring and observability deliver faster resolution, using full-stack monitoring for early detection of Z incidents.

As IT systems become more dynamic and connected, new monitoring approaches are needed to maintain operational resiliency. Observability is a new approach that augments rule-based monitoring by inferring the internal state of a system from its external outputs.

Observability focuses on being prepared: all applications and infrastructure components are instrumented so that a critical set of KPIs for their health can be monitored. By applying AI / ML to the collected data, analysis of long-term trends can detect potential problems. Teams can then be alerted to perform root cause analysis and decide on a resolution.

Observability does not replace monitoring – rather it enables better resource and application performance monitoring across the hybrid application.

Customer challenges


The growing complexity of new application architectures involving open mainframe services challenges monitors to capture even more metrics and provide better insights into these new workloads. Failure to modernize monitoring and observability exposes customers to potentially avoidable outages.

A basic monitoring challenge is balancing the addition of more metrics and collection points against the undesired side effect of increased overhead on the monitored environment.

Another challenge lies in effectively using the thousands of available metrics with open tooling to analyze them in the context of modern hybrid cloud applications. This is often hindered by a lack of integration through APIs or by complex data format translation processes.

A pervasive challenge is attracting and retaining the skills and expertise needed to address these ever-changing complex architectures. Modern monitoring and observability need to be simple enough for IT staff to understand and to execute even complex tasks with confidence and speed.

 

What's now required and how is this different than what I have today?


IBM Z performance metrics are no longer of interest only to the mainframe teams. There is an increasing demand to augment higher-level operational dashboards with performance statistics from across the entire IT stack. Z data needs to be available either directly through an API or from a centralised data lake that supports open architectures.

Customers need teams to collaborate, within and across team boundaries, in their preferred chat tools, with real-time event feeds and access to key operational metrics so they can triage problems and resolve them faster.

IT operations should be able to perform key tasks such as first-level system triage using runbooks that contain actionable steps to fast-track remediation.

There is a growing need to further leverage AI / ML for anomaly detection on collected data and correlate events to reduce false alarms.
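As a rough illustration of what anomaly detection on collected data can look like at its simplest, the sketch below flags samples that deviate sharply from a trailing baseline. It is a generic rolling z-score check, not the method used by any IBM product; the function name `find_anomalies`, the window size, the threshold, and the sample CPU values are all assumptions made for this example.

```python
from statistics import mean, stdev


def find_anomalies(samples, window=5, threshold=3.0):
    """Return indices of samples that deviate from the mean of the
    preceding `window` samples by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat baseline: treat any change at all as a deviation.
            if samples[i] != mu:
                anomalies.append(i)
            continue
        if abs(samples[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical CPU-busy percentages sampled at a fixed interval.
cpu_busy = [41.0, 43.0, 42.0, 40.0, 44.0, 42.0, 43.0, 95.0, 42.0, 41.0]
print(find_anomalies(cpu_busy))  # prints [7]
```

A fixed threshold like this is exactly the kind of rule that produces false alarms on workloads with daily or weekly cycles, which is why trend-aware models and event correlation are layered on top in practice.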

How IBM can help


IBM Z Monitoring provides IT Operations Management, Operations Analysts, and domain Subject Matter Experts (SMEs) with a single source of truth for system performance so they can quickly isolate problems, with automated corrective actions available through a modern, user-friendly graphical interface. With features like flyover help and expert advice, IBM enhances the user experience as a generational shift comes to the population of domain experts on the z/OS platform.

IBM Z Monitoring supports all IBM Z domain areas and delivers intelligent alerts that help reduce excessive “false positives”. Its simple, easy-to-learn user interface allows the next generation of IBM Z operators and SMEs to quickly become productive and accelerates team learning.

Purpose-built dashboards provide the right context for problems and Service Management Unite Enterprise Edition (SMU) provides unique at-a-glance access to multiple service management domains providing the ability to perform a variety of tasks from a single control plane.

The mainframe is a key platform in your digital enterprise, and integrating IBM Z monitoring through APIs opens the data to a wider community. This includes Watson AIOps, where events are correlated to reduce noise and target problem isolation across the enterprise. IBM has also introduced ChatOps technology to break down silos and improve collaboration across teams to deliver faster incident isolation and resolution.

APIS IT implemented the IBM Z Service Management Suite and increased the availability of services across its different platforms (mainframe and distributed), enabling it to meet service level agreements (SLAs).

 

“We’re under current revised service level agreements and we are achieving 99.91 percent of availability.”


Dražen Zadro, Systems Engineer, APIS IT d.o.o. 
        

 

Customer Value


IBM Z monitoring and observability provide faster problem identification, isolation and resolution with intelligent monitoring and alerting, enabling teams to:

  • Improve collaboration between teams: alert details are sent to your collaboration tool, where the entire channel can see them, for faster problem triage.
  • Reduce noise and incidents: event correlation across the enterprise with Watson AIOps reduces noise by grouping relevant events for targeted problem isolation.
  • Get relevant context: when an incident occurs, the wider teams are provided with additional context, such as domain dependencies and resource consumption.
  • Automate remedial actions: teams can automate problem remediation directly through an event action or by interlocking with system automation.
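
To make the idea of grouping relevant events concrete, here is a minimal sketch of one simple correlation rule: events on the same resource that arrive close together in time are folded into a single group. This is an illustration only and does not describe how Watson AIOps correlates events; the `Event` class, the `correlate` function, the 60-second window, and the resource names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Event:
    timestamp: float  # seconds since some reference point
    resource: str     # name of the affected resource (hypothetical)
    message: str


def correlate(events, window_seconds=60.0):
    """Group events on the same resource when each event arrives within
    `window_seconds` of the previous event in its group."""
    groups = []
    latest_group = {}  # resource name -> most recent group for that resource
    for event in sorted(events, key=lambda e: e.timestamp):
        group = latest_group.get(event.resource)
        if group and event.timestamp - group[-1].timestamp <= window_seconds:
            group.append(event)
        else:
            group = [event]
            groups.append(group)
            latest_group[event.resource] = group
    return groups


events = [
    Event(0.0, "CICSPROD", "transaction response time high"),
    Event(20.0, "CICSPROD", "task queue growing"),
    Event(25.0, "DB2A", "lock contention"),
    Event(45.0, "CICSPROD", "short on storage"),
    Event(400.0, "CICSPROD", "transaction response time high"),
]
groups = correlate(events)
for group in groups:
    print(group[0].resource, len(group))
```

Here five raw events collapse into three groups, so a channel would receive three notifications instead of five. Real correlation engines also use topology and learned relationships between resources, which this time-window rule does not attempt.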

 What are my next steps? 

 

Depending on where you are on your journey to adopting more of these AIOps best practices, the following resources can help you gain a deeper understanding:

  • To assess your current stage of AIOps maturity and identify action-oriented next steps for adopting more AIOps best practices, inquire about the 15-minute online AIOps Assessment for IBM Z.
  • Join the AIOps on IBM Z Community to follow this blog series about best practices for taking a hybrid approach to AIOps.
  • And finally, to learn more about IBM Z solutions that help improve operational resiliency through AIOps technologies, visit our product portfolio page.

 

 

 


