AIOps Journey - Part 1: Production environment key disruptors

By Samir Nasser posted Sun March 07, 2021 11:12 AM

  
In this post, I will describe, at a very high level, the events that lead to performance and resiliency issues in a production IT software solution. In future posts, I will describe the AIOps roadmap that IT software solution providers can gradually follow to minimize and proactively avoid such issues.

Throughout my career, I have had the privilege of working with many large customers on complex performance and resiliency problems in IT software solutions. The customers were deeply interested in the visible performance issues that were directly hurting the business: fighting the burning fire. However, I have always noticed much larger issues that led indirectly to the performance or resiliency issues at hand. For example, a few years ago, I was asked to lead a customer through the root cause analysis of a crash of hundreds of Java Virtual Machines (JVMs) in one of their data centers. All the effort went into finding that root cause. During my engagement, however, I was able to spot a number of larger issues that contributed indirectly to the crash. For instance, different Operations teams were operating in silos even though they were managing layers of the same solution stack: one team took care of the network, another the virtualization layer, another the middleware layer, another the application layer, and so on. No proper collaboration tool was used to pull the various experts together to find the root cause of an issue. Collaboration happened through emails, phone calls, “war rooms”, or isolated direct chats.

Although many kinds of issues can affect a solution, there are ways to minimize these issues and to proactively avoid many of them, which solution stakeholders will certainly welcome. Before describing those ways, it is important to describe the event categories that lead directly or indirectly to performance issues and the human groups that are relevant to the performance of a production software solution. Figure 1 shows these event categories and human groups:


Figure 1: Software Solution Events and Relevant Human Groups

It is worthwhile going over this figure, as it will serve as the foundation of the discussion going forward. I will provide only a brief description here and will elaborate in future posts as necessary. The figure will grow as I provide more details.

Event Categories

Here is the list of event categories that are relevant to a production IT software solution:

Code Changes: This covers any code that fixes defects, provides an enhancement, adds a new feature, or delivers a new version of any layer of the software solution stack.

Configuration Changes: This consists of any configuration change in any layer of the solution stack. The change may appear simple, for example, an increase in a Java heap size or a change in a TCP/IP parameter value, yet it can have a drastic effect on the solution performance.

Load Changes: This includes many types of changes related to the load on the production solution. For example, the number of concurrent users may change, the size or type of the request data sent by a user may change, or the size or type of the response data associated with a given user request may change.

Human Errors: When a member of the Operations staff makes a change or runs a command to inspect something in production, an error may result, and this error may impact the solution negatively. This type of error is not uncommon because it results from ad hoc, manual activity as opposed to automated and well-tested activity.

Malicious Attacks: This is self-explanatory and results from internal or external attacks on the software solution. As is the case with human errors, this category of events is not uncommon.

Resource Issues: These are shown inside the production software stack box in Figure 1 because they can be the result of any of the above event categories. For example, high CPU usage may occur as the load on the production solution increases.

Software Defects: Production software defects are not uncommon. These defects may be in any layer of the production solution stack, not just the so-called business application layer. These defects may be in the middleware layer, the operating system layer, the networking layer, etc.

Hardware Defects: Although hardware defects are less common than software defects, they are listed here because they can be relevant to the performance of the production software solution.
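
To make the list concrete, here is one way the event categories could be represented so that production events can be tagged and grouped by category. This is only a minimal sketch in Python, not part of any AIOps product: the names EventCategory, ProductionEvent, and group_by_category, as well as the layer labels and the sample events, are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class EventCategory(Enum):
    """The eight event categories listed in this post."""
    CODE_CHANGE = "Code Changes"
    CONFIGURATION_CHANGE = "Configuration Changes"
    LOAD_CHANGE = "Load Changes"
    HUMAN_ERROR = "Human Errors"
    MALICIOUS_ATTACK = "Malicious Attacks"
    RESOURCE_ISSUE = "Resource Issues"
    SOFTWARE_DEFECT = "Software Defects"
    HARDWARE_DEFECT = "Hardware Defects"


@dataclass(frozen=True)
class ProductionEvent:
    """A single event observed in the production solution stack."""
    category: EventCategory
    layer: str          # hypothetical label, e.g. "network" or "middleware"
    description: str


def group_by_category(events):
    """Return a dict that maps each category to the events tagged with it."""
    grouped = {}
    for event in events:
        grouped.setdefault(event.category, []).append(event)
    return grouped


# Illustrative events only; they are not taken from a real incident.
sample_events = [
    ProductionEvent(EventCategory.CONFIGURATION_CHANGE, "middleware",
                    "Java heap size increased"),
    ProductionEvent(EventCategory.LOAD_CHANGE, "application",
                    "Number of concurrent users rose"),
    ProductionEvent(EventCategory.RESOURCE_ISSUE, "operating system",
                    "High CPU usage"),
    ProductionEvent(EventCategory.CONFIGURATION_CHANGE, "network",
                    "TCP/IP parameter value changed"),
]

for category, events in group_by_category(sample_events).items():
    print(category.value, "-", len(events), "event(s)")
```

Tagging every event with both a category and a layer keeps the point made under Software Defects visible in the data: an event can originate in any layer of the stack, not just the business application layer.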

Human Groups

A production software solution involves four main human groups:

Development: This includes the business requirements analysis, design, and coding experts that provide code to deliver to production.

Operations: This includes all staff whose objective is to keep the production software solution running smoothly.

End Users: This includes internal and/or external users who use the services of the production software solution.

Malicious Users: This includes internal and/or external users who want to gain access to the production software solution with intent to cause harm. These users may access the production environment to steal regulated information, cause denial of service, crash one or more software or hardware components, etc.
