Co-Author : @Isabell Sippli
It has become common practice for developers to own more of the operational aspects of their product’s lifecycle, an approach commonly known as DevOps. This brings several challenges, especially for developers who are responsible for ops only on a rotating or part-time basis.
Some examples include: finding the needle in a haystack, navigating disjointed tools, managing time pressure, and keeping up with an increasing number of services, applications, and their relationships. All of this happens while they are still crafting new code.
Let’s go into more detail, starting with time pressure. In 2016, a major airline had a five-hour outage that cost an estimated $150M. That’s over $8,000 per second. Talk about pressure! Combine this with having separate tools for your logs, metrics, tickets, chat, documentation, and more, plus the growing complexity of modern cloud architectures, where applications consist of hundreds or thousands of microservices. This environment presents a steep hill for developers to climb.
So, challenge accepted…but what can you do? We suggest a threefold solution: taking a holistic view, applying AI, and adding automation.
We need a holistic view to bring together context and insights from all the different areas. When an alert comes in, we want to see the full picture, so that we can validate right away that there is a problem and start to resolve it. When a problem occurs, it typically triggers multiple alerts from each of the affected services. Instead of arriving as a disjointed stream, these alerts should be grouped together to give the full picture. Add to that any anomalies discovered in the logs and metrics, plus a view of how all the services are interconnected, and it becomes easier to get up to speed and focus on fixing the problem.
Now, a holistic view isn’t the same as just a single pane of glass. We don’t just need a bunch of data together on one screen; we need some intelligence applied to it, as if a group of SREs had prepared the information for us.
Think of this holistic view as your personal assistant giving you an abstract of the issue at hand, answering questions around:
- What happened?
- When did it happen?
- Where did it happen?
- What is a probable cause?
To answer these questions, we need to comb through the log messages of thousands of service instances to see if there are any anomalies. We need to monitor hundreds of metrics for each of these service instances to understand the normal patterns of behavior and call out anomalous ones. We need to understand which sets of events typically co-occur so that we can group them together. For this, we need AI.
As much as 90% of an enterprise’s data is unstructured: log messages, change requests, incident tickets. So, when we say that we need AI, we really mean that we need advanced AI that deals not just with structured data such as metrics, but also unlocks the valuable insights in the unstructured data.
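To make "understanding normal patterns and calling out anomalous ones" concrete, here is a minimal sketch in Python. It assumes a simple trailing-window z-score as the baseline model, which is far cruder than what a production AIOps system would use. The function name `find_anomalies`, the window size, the threshold, and the sample data are all hypothetical.

```python
from statistics import mean, stdev

def find_anomalies(values, window=20, threshold=3.0):
    """Return indices whose value deviates from the trailing-window
    baseline by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]   # the "normal" behavior seen so far
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # Perfectly flat baseline: treat any change as a deviation.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies

# Hypothetical latency samples (ms) for one service instance,
# with a single injected spike at index 30.
latency_ms = [100 + (i % 5) for i in range(40)]
latency_ms[30] = 450

print(find_anomalies(latency_ms))  # [30]
```

Running one such check per metric per service instance is exactly the kind of repetitive work that is impractical to do by hand at the scale described above.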
One example of this is that when a problem occurs, it is likely that the problem, or a similar problem, has occurred previously. Instead of re-inventing the wheel or hoping that the same person happens to be on duty, we can use AI to mine all the previous resolutions and recommend not just similar tickets, but what the actual actions were that solved the problem the last time.
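As an illustration of mining past resolutions, the sketch below ranks earlier tickets by textual similarity to a new problem description and returns the resolution that was recorded for the closest match. It assumes plain bag-of-words cosine similarity, a deliberately simple stand-in for real natural language processing. The ticket IDs, fields, and helper names are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def recommend(new_description, past_tickets, top_n=1):
    """Return (ticket id, resolution, score) for the most similar past tickets."""
    query = Counter(tokenize(new_description))
    scored = [(cosine(query, Counter(tokenize(t["description"]))), t)
              for t in past_tickets]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(t["id"], t["resolution"], score)
            for score, t in scored[:top_n] if score > 0]

# Hypothetical ticket history.
past_tickets = [
    {"id": "INC-101",
     "description": "checkout service returns 500 errors after database connection pool exhausted",
     "resolution": "Increased the connection pool size and restarted the checkout pods"},
    {"id": "INC-102",
     "description": "disk usage above 90 percent on logging node",
     "resolution": "Rotated logs and expanded the volume"},
    {"id": "INC-103",
     "description": "login page slow due to expired TLS certificate retries",
     "resolution": "Renewed the certificate"},
]

matches = recommend("checkout errors: database connection pool exhausted again", past_tickets)
for ticket_id, resolution, score in matches:
    print(ticket_id, "->", resolution)
```

The point of returning the resolution field, and not only the ticket ID, is that the on-call developer sees what actually fixed the problem last time.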
Finally, we need automation as the best way to catch the problem before it causes customer impact and to resolve it automatically, allowing the developer to continue to create code.
We’ve hopefully now shown you solution approaches to the identified challenges. Do you now need a team of 10 SREs and data scientists to build it out yourself? Thankfully, you don’t.
With IBM Cloud Pak for Watson AIOps, we offer prepackaged, out-of-the-box solution approaches to the challenges above.
The portfolio offers a holistic view, by:
- receiving alerts from a wide range of sources, both structured and unstructured
- showing insights through a web console and by using chatops
It provides out-of-the-box AI:
- We correlate alerts automatically through various algorithms, based on:
  - historical co-occurrence (temporal correlation)
  - connected services (spatial/topological correlation)
  - shared values for certain attributes
- We find anomalies in your logs, based on our own templating mechanism.
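The three correlation signals listed above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the product's actual algorithm: the service topology, the set of historically co-occurring alert types, the `region` attribute, and the 120-second window are all hypothetical, and the historical pairs are hard-coded where a real system would learn them from past data.

```python
from itertools import combinations

# Hypothetical inputs: which services call each other, and which
# alert types have tended to fire together in the past.
TOPOLOGY = {("web", "checkout"), ("checkout", "payments-db")}
HISTORICAL_PAIRS = {frozenset({"PoolExhausted", "EvictionSpike"})}

def related(x, y, window_s=120):
    """Two alerts are related if they are close in time and share
    at least one of the three correlation signals."""
    if abs(x["ts"] - y["ts"]) > window_s:
        return False
    temporal = frozenset({x["type"], y["type"]}) in HISTORICAL_PAIRS
    pair = (x["service"], y["service"])
    topological = pair in TOPOLOGY or pair[::-1] in TOPOLOGY
    shared = x.get("region") is not None and x.get("region") == y.get("region")
    return temporal or topological or shared

def correlate(alerts, window_s=120):
    """Group alerts into incidents by merging every related pair (union-find)."""
    parent = list(range(len(alerts)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(alerts)), 2):
        if related(alerts[i], alerts[j], window_s):
            parent[find(i)] = find(j)

    groups = {}
    for i, alert in enumerate(alerts):
        groups.setdefault(find(i), []).append(alert["id"])
    return sorted(groups.values())

alerts = [
    {"id": "a1", "ts": 0,    "service": "web",         "type": "HighLatency",   "region": "eu"},
    {"id": "a2", "ts": 30,   "service": "checkout",    "type": "ErrorRate",     "region": "eu"},
    {"id": "a3", "ts": 45,   "service": "payments-db", "type": "PoolExhausted"},
    {"id": "a4", "ts": 50,   "service": "cache",       "type": "EvictionSpike"},
    {"id": "a5", "ts": 60,   "service": "reporting",   "type": "DiskFull",      "region": "us"},
    {"id": "a6", "ts": 4000, "service": "web",         "type": "HighLatency",   "region": "eu"},
]

print(correlate(alerts))
# [['a1', 'a2', 'a3', 'a4'], ['a5'], ['a6']]
```

Here six raw alerts collapse into three incidents: `a1` to `a4` are chained together by topology and historical co-occurrence, while `a5` shares no signal with the others and `a6` falls outside the time window.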
It offers several means of automation, for example:
- Provide instructions to your ops team with manual and semi-automated runbooks, or fully automate them right away. We allow you to connect to a shell, run arbitrary REST APIs, or execute Ansible playbooks through Ansible Tower.
- Automatically correlate alerts, as indicated above, suggest probable cause, and identify corresponding affected applications and services.
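The difference between manual, semi-automated, and fully automated runbooks can be sketched as follows. This is a hypothetical model, not the product's runbook format: the `Step` class, the mode names, and the `approve` callback are assumptions, and the actions are plain Python callables standing in for a shell command, a REST call, or an Ansible playbook run.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Step:
    description: str
    # Hypothetical stand-in for a shell command, REST call, or playbook.
    # None means the step can only be done by a human.
    action: Optional[Callable[[], str]] = None

def run_runbook(steps, mode="manual", approve=lambda step: True):
    """Walk the runbook. mode is 'manual', 'semi', or 'auto'."""
    log = []
    for step in steps:
        if mode == "manual" or step.action is None:
            # Only show the instruction; the operator does the work.
            log.append(f"TODO (operator): {step.description}")
        elif mode == "semi" and not approve(step):
            # Semi-automated: each action needs an operator's approval.
            log.append(f"SKIPPED: {step.description}")
        else:
            log.append(f"DONE: {step.description} -> {step.action()}")
    return log

runbook = [
    Step("Check connection pool usage", lambda: "pool at 100%"),
    Step("Restart checkout pods", lambda: "restarted 3 pods"),
    Step("Confirm with the on-call lead that customers are unaffected"),
]

for line in run_runbook(runbook, mode="auto"):
    print(line)
```

Keeping the same steps across all three modes means a team can start with written instructions, add approval gates once the actions are trusted, and switch to full automation later without rewriting the runbook.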
Find out more about IBM Cloud Pak for Watson AIOps, and explore more content on the Cloud Pak for Watson AIOps hub page on IBM Developer.