In part 1, I outlined the primary activities that take place in the post-production phase. The purpose of these activities is essentially to keep the production software solution running so that it delivers the business value it was created for. Needless to say, the higher the business value, the worse the impact of an outage. A few years ago, a production environment consisting of hundreds of Java Virtual Machines (JVMs), hosting many business insurance-related applications for internal business users, crashed twice within a period of about one month. Although the applications directly supported only internal business users rather than external users, each crash cost hundreds of thousands of dollars in lost productivity, to say the least.
In this blog post, I will take a deep dive into the Identify activity. Before the solution is put in production, the following must take place:
- Identify Metrics: every key resource “metric” must be identified across all layers of the solution stack. One example of a metric is the state of a resource, such as whether a Java Virtual Machine (JVM) instance is “started” or “stopped”. Another example is the number of hung threads in a JVM instance. A “metric” can also be the presence or absence of a specific message in a log.
- Identify Thresholds: a threshold must be specified for each metric. For example, the state of a JVM should be “started” at all times. When the state of a JVM is “stopped” and the JVM is not in maintenance mode, that state is undesirable: it is a threshold breach that calls for the attention of an expert.
- Identify Experts: an expert is assigned to investigate when a metric value breaches its threshold. In practice, an expert is assigned to each class of metrics. For example, all metrics of the Java garbage collection resource are assigned to a garbage collection expert, all database performance metrics are assigned to a database performance expert, and so on. A list of people who should be notified when a threshold breach occurs should also be identified. This list may include people such as the solution stakeholder and the Operations manager, in addition to the expert who is assigned to investigate the breach.
- Assign Severity Levels: a severity level should be assigned to each threshold breach. This level should imply a period of time within which the breach must be cleared.
- Identify Automation Actions: where possible, an automatic action that can resolve the threshold breach should be identified.
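To make the five steps above concrete, here is a minimal Python sketch of how a single monitoring rule might tie them together. It is only an illustration under my own assumptions, not the data model of any particular monitoring product: the names `MetricRule`, `Severity`, and `evaluate`, the metric names, the expert and recipient names, and the time-to-clear values are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Callable, List, Optional


class Severity(Enum):
    # Hypothetical levels; each value is an assumed time-to-clear in minutes.
    CRITICAL = 15
    MAJOR = 60
    MINOR = 240


@dataclass
class MetricRule:
    metric: str                                # Identify Metrics
    is_breach: Callable[[Any], bool]           # Identify Thresholds
    expert: str                                # Identify Experts
    severity: Severity                         # Assign Severity Levels
    notify: List[str] = field(default_factory=list)
    auto_action: Optional[Callable[[], str]] = None  # Identify Automation Actions


def evaluate(rule: MetricRule, value: Any, in_maintenance: bool = False) -> Optional[dict]:
    """Return None when the metric is healthy, else a description of the breach."""
    # Assumption: maintenance mode suppresses every breach for the resource.
    if in_maintenance or not rule.is_breach(value):
        return None
    # Run the automatic action only if one was identified for this rule.
    result = rule.auto_action() if rule.auto_action else None
    return {
        "metric": rule.metric,
        "value": value,
        "expert": rule.expert,
        "notified": [rule.expert] + rule.notify,
        "clear_within_minutes": rule.severity.value,
        "auto_action_result": result,
    }


# Example rules for the two JVM metrics mentioned above (all values assumed).
jvm_state = MetricRule(
    metric="jvm.state",
    is_breach=lambda state: state != "started",
    expert="jvm-expert",
    severity=Severity.CRITICAL,
    notify=["solution-stakeholder", "operations-manager"],
    auto_action=lambda: "restart requested",  # placeholder, restarts nothing
)
hung_threads = MetricRule(
    metric="jvm.hung_threads",
    is_breach=lambda count: count > 5,        # assumed threshold
    expert="jvm-expert",
    severity=Severity.MAJOR,
)

if __name__ == "__main__":
    print(evaluate(jvm_state, "stopped"))
    print(evaluate(jvm_state, "stopped", in_maintenance=True))
    print(evaluate(hung_threads, 8))
```

Keeping the threshold as a function rather than a single number lets the same structure cover state metrics, numeric metrics, and log-message metrics alike.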
Although the steps above should take place before the solution is put in production, the reality is that it is challenging to come up with a complete set of metrics that describes the health of a solution. So, the steps above will need to be refined after the solution is put in production. The better the Identify activity is performed before the solution is put in production, the better prepared the Operations team will be to address threshold breaches and, consequently, the healthier the production solution will be.
Have you done this set of activities before you put the solution in production? Or, have you delayed these activities until after the solution is put in production?
In the next blog post, I will continue discussing the Identify activity.