IBM Storage Insights now brings AIOps to the Flash storage support team. It helps Level-2 and Level-3 Flash support engineers diagnose operational issues more accurately and resolve tickets more quickly.
Note that the present set of AIOps features is exposed only to the support team, but it will ultimately help serve customers better by improving the resiliency, performance, and uptime of storage assets.
This blog gives a brief overview of how the present set of AIOps features was built and the support use cases it addresses.
Innovation
- Anomaly Detection: This feature ingests operational and performance metrics from the storage system at a sampling interval of 5 minutes. Close to 200 performance metrics are collected and grouped into 32 KPI groups. With over 6 months of such historical metrics, deep learning models are trained to detect system anomalies. The detected anomalies are then correlated with recent device configuration changes and/or events on the device. This feature therefore helps to reduce the search space for diagnosis and root cause analysis.
- Recommendation for Ticket Resolution: Reinforced with SME knowledge and storage best practices, this feature leverages language models to recommend resolutions to support cases. The model is trained on thousands of historical tickets and their resolutions. To improve accuracy, the training data is augmented with system make/model and software version. When supplied with a new unresolved ticket, the model performs a semantic search across the ticket database, then surfaces, ranks, and recommends resolutions.
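To make the anomaly detection flow concrete, here is a minimal Python sketch of its two steps: flag unusual samples in a 5-minute metric series, then look for configuration changes or events shortly before each anomaly. It is an illustration under stated assumptions, not the production implementation: a rolling z-score stands in for the trained deep learning models, and the function names, the one-hour lookback, the latency series, and the event log are all hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean, stdev

SAMPLE_INTERVAL = timedelta(minutes=5)  # sampling interval stated in the post


def detect_anomalies(samples, window=12, threshold=3.0):
    """Return indices of samples that deviate strongly from the trailing window.

    A rolling z-score is a simple stand-in for the trained deep learning
    models; window and threshold are illustrative assumptions.
    """
    anomalies = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and abs(samples[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


def correlate_with_events(anomaly_times, events, lookback=timedelta(hours=1)):
    """Pair each anomaly with events that happened within `lookback` before it."""
    return {
        t: [name for when, name in events if timedelta(0) <= t - when <= lookback]
        for t in anomaly_times
    }


start = datetime(2024, 1, 1)

# Hypothetical latency series (ms): steady values, then a spike at sample 20.
latency = [1.0 + 0.01 * (i % 3) for i in range(24)]
latency[20] = 5.0
timestamps = [start + i * SAMPLE_INTERVAL for i in range(len(latency))]

# Hypothetical configuration-change and event log for the device.
events = [
    (start + timedelta(minutes=90), "firmware update"),
    (start + timedelta(minutes=10), "volume created"),
]

anomaly_idx = detect_anomalies(latency)
suspects = correlate_with_events([timestamps[i] for i in anomaly_idx], events)
print(suspects)
```

In this toy data only the firmware update falls inside the lookback window of the spike, which is the sense in which correlation narrows the search space for root cause analysis.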
Supported Use Cases
- Diagnose lingering performance issues: When the support team is aware of (or alerted to) a device performance issue, they can leverage AIOps to detect performance anomalies in different time windows. KPI groups that show anomalies can be drilled down into heat map views and correlation views of their constituent performance metrics. This feature has been found to be extremely useful for improving the speed, quality, and confidence of analysis and resolution.
- Diagnose and resolve tickets: When a Flash support case is created, the support team can leverage AIOps to get a ranked list of recommendations for resolving the ticket. When several recommended resolution options for a reported problem have competing relevancy scores, comparing the performance anomaly profiles of the historical tickets helps to confirm and choose the relevant resolution.
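The ranking and tie-breaking described above can be sketched in a few lines of Python. This is only an illustration under assumptions: bag-of-words cosine similarity stands in for the language-model semantic search, an anomaly profile is assumed to be a mapping from KPI group name to anomaly strength, and `recommend`, `tie_margin`, and the sample tickets are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[k] * b.get(k, 0.0) for k in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def embed(text):
    """Bag-of-words counts; a stand-in for a language-model embedding."""
    return Counter(text.lower().split())


def recommend(new_ticket, new_profile, history, tie_margin=0.05):
    """Rank historical resolutions for a new ticket.

    Candidates are ordered by text similarity. Those scoring within
    `tie_margin` of the best are re-ordered by anomaly-profile similarity.
    """
    if not history:
        return []
    query = embed(new_ticket)
    scored = [
        (cosine(query, embed(t["text"])),
         cosine(new_profile, t["profile"]),
         t["resolution"])
        for t in history
    ]
    scored.sort(key=lambda s: s[0], reverse=True)
    best = scored[0][0]
    # The list is sorted, so the contenders form a prefix of it.
    contenders = [s for s in scored if best - s[0] <= tie_margin]
    rest = scored[len(contenders):]
    contenders.sort(key=lambda s: s[1], reverse=True)
    return [s[2] for s in contenders + rest]


# Hypothetical historical tickets with their anomaly profiles.
history = [
    {"text": "high latency on volume",
     "profile": {"port": 1.0},
     "resolution": "rebalance host ports"},
    {"text": "latency high after upgrade",
     "profile": {"latency": 0.9, "cache": 0.7},
     "resolution": "tune cache partition"},
    {"text": "drive failure reported",
     "profile": {"drive": 1.0},
     "resolution": "replace drive"},
]

ranked = recommend(
    "high latency on volume after upgrade",
    {"latency": 1.0, "cache": 0.8},
    history,
)
print(ranked)
```

Here the first two tickets match the new ticket's text equally well, so the anomaly profile decides between them: the ticket whose latency and cache anomalies resemble the new case is ranked first.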
Technology Research & Development Perspective
The present set of AIOps features is the fruit of deep collaboration between IBM Storage Insights, IBM Zurich Research Lab (ZRL), and IBM Technology Lifecycle Service (TLS). From concept to realisation, the features took shape through the following important phases:
- Conceptualization of the features and the use cases they will address.
- Determination of evaluation and success criteria.
- Analysis and evaluation of the state of the art, including:
  - Accuracy
  - Cost/benefit analysis of LLMs and time series transformers
- Data analysis: Along with important considerations such as accessibility, format, entity relations, data quality, and data grain, the following additional aspects were also considered:
  - Compliance with data confidentiality and security requirements.
  - Feasibility, latency, and cost of data acquisition.
- PoC, prototype development, and technology preview, where the features were released to a select set of support users to assess fitness for purpose. This helped improve analysis navigation and add further views, making the features more intuitive and more effective.
- Productization of the features, deployment, and evangelisation.
A big thanks to my colleagues in IBM Zurich Research Lab (ZRL) and IBM Technology Lifecycle Service (TLS), and to the IBM Systems Development, Testing, and UX Design teams for Storage Insights, who made this possible.