In the last blog post, I described the main event categories of an IT software production environment and the corresponding human groups involved with the environment.
In this blog post, I will describe the state of an IT software production environment into which AI has been fully infused, with a focus on the kind of AI infusion that helps IT Operations keep the production environment in the best possible state. We will refer to this state as AIOps nirvana. In future posts, I will describe the gradual and iterative steps that IT software solution owners can take to reach it. Someone may ask whether the expression “AIOps nirvana state” implies that there are other states, and the answer is certainly yes: going from a production environment that is managed in a completely reactive mode to one that is in the AIOps nirvana state is a transformational journey, with intermediate states along the way.
As mentioned above, we will focus here on describing the AIOps nirvana state. As most IT professionals agree, if there is one constant in the IT world, it is change, driven by business requirements, business ambitions, software bugs, enhancement requests, initiatives, stiff competition, and market opportunities, to name just a few. Let us consider Figure 1 again while discussing the nirvana state.
Figure 1: Software Solution Events and Relevant Human Groups
The AIOps nirvana state describes a production environment where:
- Certain failures are proactively avoided
- Other failures are addressed in such a way that service-level agreements (SLAs) are not broken
Looking at Figure 1 with the AIOps nirvana state in mind, we arrive at the following key points:
- The impact of code, configuration, and load changes must be addressed before the changes happen in production
- The impact of human errors and malicious attacks must be addressed before they happen in production
- Some automation must be in place before the expected failures happen in production
- SLAs must be well understood when planning for unexpected failures, so that the right measures can be put in place and those failures do not break the SLAs
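To make the last point a little more concrete, here is a minimal sketch of how an availability SLA translates into a downtime budget. It assumes the SLA is expressed as an availability target over a fixed period; the function names, the 30-day period, and the 99.9% target are illustrative assumptions, not something taken from a specific tool or contract.

```python
# Hypothetical helpers: the names and the 30-day default are assumptions
# made for illustration only.

MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(availability_target: float, period_days: int = 30) -> float:
    """Total downtime (in minutes) that the SLA tolerates over the period."""
    return (1.0 - availability_target) * period_days * MINUTES_PER_DAY


def would_break_sla(
    availability_target: float,
    downtime_so_far_minutes: float,
    expected_outage_minutes: float,
    period_days: int = 30,
) -> bool:
    """True if the expected outage would push total downtime past the budget."""
    budget = allowed_downtime_minutes(availability_target, period_days)
    return downtime_so_far_minutes + expected_outage_minutes > budget


if __name__ == "__main__":
    # A 99.9% target over 30 days leaves roughly 43 minutes of downtime.
    print(round(allowed_downtime_minutes(0.999), 1))
    # With 30 minutes already used, a 20-minute outage would break the SLA.
    print(would_break_sla(0.999, downtime_so_far_minutes=30, expected_outage_minutes=20))
```

Knowing the remaining budget is what lets a team decide whether a failure can be handled by a slower manual procedure or needs an automated response.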
Reaching the AIOps nirvana state is a transformational journey that involves changes in culture, process, tools, and data. Below is a brief description of what each of these means. Future posts will provide more details.
Culture: When a disruptive event happens in production, the traditional assumption that IT Operations bears sole responsibility will not work. A cultural change, such as the move to a DevOps culture, is required for a swifter and more collaborative resolution of that event. Site Reliability Engineering (SRE) principles promote such a cultural transformation, from organizational silos to a more collaborative organization.
Tools: There are typically many tools involved in a production environment. Generally, the tools used in the AIOps nirvana state are considerably more advanced, in order to support the proactive nature of IT Operations. The tools you use will therefore differ depending on where you are in the AIOps journey.
Data: The AIOps nirvana state cannot be reached without the right data. For example, data from all layers of a solution stack must be available to assess the health of the overall solution. Collecting data from the application layer alone is not enough. Likewise, collecting data from all layers, but only from specific components in each, is not enough. There is a lot more that can be said here, but I will leave the details for future posts.
Process: Event detection, documentation, human assignment, human delegation, ticket creation, escalation, resolution, and automation together form a process that involves tools, data, and culture.
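As a rough illustration of the data coverage point above, the sketch below compares an inventory of components per layer against the components that are actually sending data, and reports the gaps. The layer names, component names, and the `coverage_gaps` function are all hypothetical; a real environment would pull both sides from its own inventory and monitoring tools.

```python
from typing import Dict, Set


def coverage_gaps(
    inventory: Dict[str, Set[str]],
    reporting: Dict[str, Set[str]],
) -> Dict[str, Set[str]]:
    """Return, per layer, the components that exist but send no data."""
    gaps: Dict[str, Set[str]] = {}
    for layer, components in inventory.items():
        missing = components - reporting.get(layer, set())
        if missing:
            gaps[layer] = missing
    return gaps


if __name__ == "__main__":
    # Hypothetical solution stack: every name here is an assumption.
    inventory = {
        "application": {"web-ui", "order-service"},
        "middleware": {"message-queue"},
        "infrastructure": {"db-host", "app-host"},
    }
    # Data arrives from the application layer and from one host only.
    reporting = {
        "application": {"web-ui", "order-service"},
        "infrastructure": {"db-host"},
    }
    for layer, missing in sorted(coverage_gaps(inventory, reporting).items()):
        print(layer, sorted(missing))
```

An empty result would mean every known component in every layer is reporting, which is the precondition described above for assessing the health of the overall solution.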