Recently I had the pleasure of chatting with IBM zAIOps Lab Services Technical Consultant John Strymecki to learn about his experiences working directly with clients who are deploying IBM Z Anomaly Analytics with Watson into their production environments. It was exciting to hear how his clients have been utilizing the product and the positive results they have experienced, with a special emphasis on the log-based machine learning feature, which John considers “essential software for all clients.” Continue reading this must-read interview for an insider's view on facilitating successful product deployments, preparing client environments, training and support, and the impact of AI/ML solutions.

------------------------------------------------------------
Tim: Hi, John. Thank you for agreeing to this interview. We have had a lot of interest from customers about how to implement AI/ML use cases for their IT operations, and you have extensive insights into deploying the IBM Z Anomaly Analytics with Watson product with your customers. Can you tell me about a recent product deployment and what the client's use cases and expectations were for the product?
John: Yes, I recently completed several production deployments for metric-based machine learning, one of the core features within IBM Z Anomaly Analytics with Watson. One of the client's primary use cases was to generate proactive alerts based on anomalous activity. We configured the problem insights server, the alert mechanism within IZAA, to send alerts to their event collector – in this case Netcool. From Netcool they wanted to correlate those events with events from other monitoring and automation solutions. This way they would be able to flag anomalous activity alongside traditional alerts to more accurately detect operational issues.
Another production use case I worked on involved a client who wanted a better understanding of what normal operations really looked like for their environment. They had several traditional monitoring solutions where subject matter experts had defined static and dynamic thresholds, but even so, they didn’t know what “normal” was for their different subsystems. With IBM Z Anomaly Analytics with Watson, they could graphically see the model trained on their historical data for a particular subsystem and, overlaid on top, their current real-time metrics. So, if and when anomalies do occur, they have a better understanding of how far their current metrics and thresholds deviate from their baseline, or normal operations. This helps them prioritize the triaging of events and alerts.
Figure 1: A sample CICS region metric-based anomaly detection scorecard.
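To make the baseline idea concrete, here is a minimal sketch of how a per-interval baseline could be learned from historical metric samples and compared against real-time readings. This is an illustrative toy, not the product's actual implementation: the function names, the hour-of-day bucketing, and the 3-sigma threshold are all assumptions.

```python
from collections import defaultdict
from statistics import mean, stdev

def build_baseline(history):
    """history: (hour_of_day, value) samples, e.g. from 30 days of metric data.
    Returns {hour: (mean, stdev)} describing 'normal' for each interval."""
    by_hour = defaultdict(list)
    for hour, value in history:
        by_hour[hour].append(value)
    # Need at least two samples per interval to estimate spread.
    return {h: (mean(v), stdev(v)) for h, v in by_hour.items() if len(v) > 1}

def is_anomalous(baseline, hour, value, n_sigma=3.0):
    """Flag a real-time reading that deviates from the learned baseline."""
    mu, sigma = baseline[hour]
    return abs(value - mu) > n_sigma * max(sigma, 1e-9)
```

With a baseline built from history where the 9 a.m. interval normally sits around 58–62, a real-time reading of 61 passes quietly while a reading of 95 is flagged as anomalous for that time of day.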
Tim: Fantastic! So how do you prepare your clients for deploying the solution into production? What steps are involved in the process?
John: There are several steps that need to be completed in the environment before configuration of the product begins, including preparing the other teams who need to be involved in production deployments. There are actually several teams you need to engage with: security; the database team – the Db2 team in our case; and your z/OS subject matter experts, who will be executing a lot of the work in UNIX System Services – doing SMP/E work, reserving ports, and configuring Anaconda and Spark. There are a lot of activities that need to be done prior to the actual setup of the final product, so it's important that all the teams finish that work before we begin the configuration of Watson Machine Learning for z/OS and the IBM Z Anomaly Analytics with Watson product.
Tim: Thank you for that explanation. I’m also aware that you have a step-by-step quick start guide that walks clients through a checklist to ensure the environment is ready for the product to be deployed. But now I’d like to ask what kind of training and support you and your team provide to your clients during or after deployment.
John: Well, during the deployment the client becomes aware of the needs of the product – the security setup, how they are configuring Db2 for z/OS, and so on. They also learn about the desired time interval at which the SMF records are collected and analyzed. They get their training and support by working alongside the different team members; this way they learn how to do the configuration themselves.
After the product is up and running in production, I give them a demo of how the tool works, what to expect, and how it should look. Lastly, I give instructions on housekeeping and best-practice activities.
Tim: That’s great! So how have your clients been using the product, and what kind of results have they seen? You mentioned earlier that you personally find the log-based machine learning feature valuable – “essential software for all clients.” Can you elaborate?
John: Yes! The log-based machine learning feature essentially takes log messages from many different sources and uses 30 days or more of historical data to create a model that represents how syslogs should look at any given time. Then, in real time, the product displays any log messages that are statistically identified as anomalies. For example, the product would create an anomaly alert for a message that had never been seen in the prior 90 days, or for one that was statistically rare within the last 90 days. This brings so much value, because in real time you can get a notification for a message anomaly – resources unavailable, a coupling facility failure, a buffer pool shortage, and other similar types of messages. You can’t really predict these types of messages, but with this product you can get a real-time indication of a problem that has not previously been detected and mitigated by automation.
This is also really helpful for diagnosing problems, because you have a history of anomalous events that you can refer back to when trying to determine the root cause of a problem. Rather than going back and manually looking through all the logs from a particular part of the day, using your best judgement on whether particular messages are of interest or expected, you can now quickly assess the abnormal messages that have been automatically flagged for you and accelerate your investigation. That is very powerful and a huge time saver.
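John's description of flagging never-seen and statistically rare messages can be sketched roughly as follows. This is an illustrative toy, not the product's actual algorithm: the function names, the rarity threshold, and the example message IDs are all assumptions.

```python
from collections import Counter

# Illustrative: fraction of total traffic below which a message counts as "rare".
RARITY_THRESHOLD = 0.0001

def train_log_model(message_ids):
    """Build a frequency model from the historical window
    (e.g. 90 days of syslog message IDs)."""
    counts = Counter(message_ids)
    return counts, sum(counts.values())

def flag_anomaly(model, message_id):
    """Return a reason string if the message is anomalous, else None."""
    counts, total = model
    seen = counts.get(message_id, 0)
    if seen == 0:
        return "never-seen"          # did not occur in the training window
    if seen / total < RARITY_THRESHOLD:
        return "statistically-rare"  # occurred, but far below normal traffic
    return None
```

Against a history dominated by routine messages, a brand-new message ID comes back as `never-seen`, one that appeared only a handful of times in a large window as `statistically-rare`, and everyday traffic is left alone.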
Figure 2: The log anomaly scorecard, which highlights in red the time intervals with highly anomalous log activity.
Tim: That’s really cool! Thank you for sharing. One last question before we end for the day… What is one piece of advice you would give to a potential customer who is looking to deploy more AI solutions for IT Operations?
John: Customers should really take a look at the log-based machine learning capabilities, because that is really going to reduce their problem determination time for both known and unknown problems.
Similarly, for metric-based machine learning – it works best on production systems because they follow a specific pattern of business activity. It will detect and show you, in real time, activity on your metric-based systems – z/OS, Db2, CICS, MQ, IMS – that you had no clue was abnormal. In other words, activity might look normal to the untrained eye – or to any person looking at it – yet CPU may be busier than normal, or channel activity busier than normal, for a given time of day or day of week. This is the best tool for detecting anomalies on your systems.
Tim: John, thank you so much for your time today and for sharing your expertise!

------------------------------------------------------------
Do you have additional questions you’d like to get an Insider’s View on? If so, leave a comment and we’ll address them in the next blog.