Imagine you’re an SRE for a company that has recently seen an uptick in sales, and because the majority of your customers are purchasing through the website, you always want to make sure the backend systems are running well with minimal outages. The IT environment has evolved and expanded over the years to accommodate this growth in sales, and will now be observed by the application performance management tool, Instana. In addition, Cloud Pak for Watson AIOps helps improve the mean time to resolution (MTTR) for any issues found.
To set up AIOps in the environment, an SRE or ITOps team member identifies the conditions they want to be alerted on for review, and any automation that should be executed to alleviate anomalous conditions. The following steps should be taken:
- Verify that Instana is connected to Cloud Pak for Watson AIOps. This involves setting up a data connection between Instana and AIOps, which provides visibility into events, topology, and metrics.
- Create an automation policy that, in this scenario, creates a story* when an alert indicates that memory utilization is outside of normal bounds and other alerts indicate problems with the memory. This helps the SRE quickly see when memory might be trending toward exhaustion and impacting the website. Because the SRE defines the conditions that trigger the alert, the SRE can catch this issue before it leads to an outage.
- For memory exhaustion issues, the SRE wants a quick way to remediate the situation. With AIOps, the SRE can associate a runbook with an alert. This lets the SRE quickly see that a remediation is available for this type of alert and run the automation in automated, semi-automated, or manual mode, depending on the type of runbook created.
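The policy condition described in the steps above can be sketched in a few lines of Python. This is only an illustration of the logic, not how Cloud Pak for Watson AIOps represents policies: the `Alert` class, its fields, the `should_create_story` function, and the `web-backend` resource name are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert record; the field names are assumptions for this sketch."""
    resource: str   # the component the alert was raised on
    kind: str       # e.g. "metric_anomaly" or "event"
    summary: str    # human-readable description


def should_create_story(alerts, resource):
    """Return True when a memory metric anomaly AND at least one other
    memory-related alert exist for the same resource."""
    relevant = [
        a for a in alerts
        if a.resource == resource and "memory" in a.summary.lower()
    ]
    has_anomaly = any(a.kind == "metric_anomaly" for a in relevant)
    has_other = any(a.kind != "metric_anomaly" for a in relevant)
    return has_anomaly and has_other


# A single small deviation does not open a story...
first = [Alert("web-backend", "metric_anomaly",
               "Memory utilization outside normal bounds")]
print(should_create_story(first, "web-backend"))

# ...but the anomaly combined with another memory alert does.
later = first + [Alert("web-backend", "event",
                       "Memory pressure: container near limit")]
print(should_create_story(later, "web-backend"))
```

Requiring two independent signals before a story is created mirrors the scenario: one deviation alone is only an alert, while corroborating alerts suggest an incident worth a story.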
When errors are detected
The website is humming along and customers are continuing to purchase the product. We know that websites can have issues and SREs want to get ahead of these issues by being on top of any alerts that are being sent by the system.
Cloud Pak for Watson AIOps detects a small deviation from the metrics baseline and shows a metric trending outside of its usual bounds. This is how the solution helps detect a potential incident before it actually occurs.
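To make the idea of a "deviation from the metrics baseline" concrete, here is a minimal sketch that flags a value lying more than a few standard deviations from the mean of recent samples. It is a deliberately simple stand-in, not the anomaly detection that Cloud Pak for Watson AIOps actually uses; the function name `outside_baseline`, the threshold `k`, and the sample values are assumptions made for illustration.

```python
from statistics import mean, stdev


def outside_baseline(history, value, k=3.0):
    """Return True if `value` lies more than `k` sample standard deviations
    from the mean of `history` (a simple, assumed baseline model)."""
    if len(history) < 2:
        return False  # not enough data to estimate a baseline
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) > k * sigma


# Hypothetical memory utilization samples, in percent.
memory_pct = [60, 61, 59, 60, 62, 58, 60, 61]

print(outside_baseline(memory_pct, 61))  # within the usual bounds
print(outside_baseline(memory_pct, 70))  # trending outside the usual bounds
```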
Cloud Pak for Watson AIOps triggers an alert which is captured in the Alert viewer.
From the Alert viewer, the SRE can take these actions:
a. view the deviation along with the side panel to see forecasts of the memory utilization
b. clear this instance of the alert
The SRE decides to ignore it for now, because it seems to be a small deviation.
However, a few hours later, the memory metric is still trending out of its normal range. The alert is triggered again, along with other alerts from the system indicating memory problems. The automation policy that was set up earlier is triggered and a story is created. The SRE now has a holistic view of the incident and the full context of the problem.
From the story* view, the SRE has a focused view of the topology that identifies the area of the problem. It shows the probable cause and identifies automations that can be run to remediate the problem, which gives the SRE confidence in taking the next best action. The SRE executes the runbook, which addresses the memory utilization problem. The story is resolved and closed. The website is humming along again!
*Story - a holistic view of an incident enriched with topology, probable cause, recommended runbooks, similar incidents and details of this story