In the last blog post, I described the need to ensure that a code change meets its functional and non-functional requirements before it is deployed to production; otherwise, production support is stuck in reactive mode. I also described the key activities that a number of teams perform as part of that validation. What I did not describe are the specific non-functional requirements that need to be implemented for AIOps maturity. In this post, I start with the code logging requirements. As in the previous post, I use the term application code changes to cover both new applications and changes to existing applications, because both introduce a change in the production environment.
Code Logging
- The application code change should log informational, warning, error, and critical messages. These messages provide essential insights once the change is deployed in production. Without them, root cause analysis can be challenging when an incident is raised… you can’t fix what you can’t see!
- The application code change should log its component and subcomponent identifiers. This associates a log message with the owner component.
- The application code change should associate a log message with a message code. So, in the log, there should be a message code and the corresponding log message.
- Special focus should be placed on logging when the code change calls another component, whether that component resides in the same runtime environment or on the other side of the world. This focus may mean the following:
  - More verbose logging.
  - Special care with any credentials needed for the interaction. Credentials must never be written to the log; otherwise, a security incident may be triggered in production.
  - Special care when handling private information. For example, a user’s social security number must never be logged; otherwise, a security incident may be triggered in production, and that can have very serious ramifications.
- In the case when the code runs into an error or exception, the code may try to interpret that error, but shouldn’t suppress that exception or error. Suppression of such an error or exception will likely leave the production support team in the dark when that error or exception triggers an incident.
- The code change should log troubleshooting tips associated with certain errors when possible. This accelerates incident resolution when an incident is triggered, especially when the developer who developed the code is not available.
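As a minimal sketch of the guidance above, the following uses Python's standard `logging` module to emit a level, component and subcomponent identifiers, a message code, and a troubleshooting tip with every message, and to log an exception without suppressing it. The component names, the `PAY-…` message codes, the catalog, and the helper functions are all hypothetical and exist only for illustration.

```python
import logging
import sys

# Hypothetical catalog: message code -> (message template, troubleshooting tip).
MESSAGES = {
    "PAY-0001": ("Payment accepted for order %s", "none"),
    "PAY-1001": ("Payment gateway call failed for order %s",
                 "Check gateway reachability and the client timeout setting."),
}

# Every line carries level, owner component/subcomponent, code, message, tip.
LOG_FORMAT = ("%(levelname)s [%(component)s/%(subcomponent)s] "
              "%(code)s %(message)s | tip: %(tip)s")


def make_logger(stream=sys.stderr):
    """Build a logger whose handler prints owner, code, message, and tip."""
    logger = logging.getLogger("payments.gateway-client")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(LOG_FORMAT))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def log_event(logger, level, code, *args, exc_info=False):
    """Look up the message code and attach the owner fields to the record."""
    template, tip = MESSAGES[code]
    logger.log(level, template, *args, exc_info=exc_info,
               extra={"component": "payments",          # assumed owner names
                      "subcomponent": "gateway-client",
                      "code": code, "tip": tip})


def charge(logger, order_id, gateway_call):
    """Call another component; log a failure with context, then re-raise."""
    try:
        result = gateway_call(order_id)
    except ConnectionError:
        log_event(logger, logging.ERROR, "PAY-1001", order_id, exc_info=True)
        raise  # interpret and log the error, but never suppress it
    log_event(logger, logging.INFO, "PAY-0001", order_id)
    return result
```

Keeping the message text and the tip in one catalog keyed by message code means the support team can search on the code alone, even when the developer who wrote the change is not available.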
If the application code change is deployed in production without taking the guidance above into consideration, the production support team will be in a reactive mode. Resolving incidents will likely take longer, especially when the log data needed to resolve the incident is not available.
The drive to deploy new code changes can be compelling, but there is an opposing drive to keep changes out of production because of the risks they carry. To minimize these risks while keeping code moving toward production quickly, use the right tools to scan the log data that the code changes produce and to verify that the safeguards outlined in the guidance above are enforced. Doing so puts the organization at the proactive AIOps maturity level. If the code changes fail these tests, deployment to production does not happen. Without such tests, the risks associated with the code changes may have to be addressed in production.
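One way such a pre-deployment check might look is sketched below: a small scanner that reads the log lines a code change produced in a test run and reports lines that lack a level, owner, or message code, or that appear to contain sensitive data. The line format, the regular expressions, and the function names are assumptions made for this illustration, not the behavior of any specific tool, and the sketch assumes one log record per line.

```python
import re

# Assumed line shape: LEVEL [component/subcomponent] CODE-1234 message...
REQUIRED = re.compile(
    r"^(INFO|WARNING|ERROR|CRITICAL) \[[\w-]+/[\w-]+\] [A-Z]+-\d{4} ")

# Illustrative patterns only; a real scan would need a broader rule set.
SENSITIVE = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credential": re.compile(
        r"\b(password|passwd|secret|api[_-]?key|token)\s*[=:]\s*\S+",
        re.IGNORECASE),
}


def scan_log(lines):
    """Return a list of (line number, problem) tuples for the given lines."""
    violations = []
    for number, line in enumerate(lines, start=1):
        if not REQUIRED.match(line):
            violations.append(
                (number, "missing level, component, or message code"))
        for name, pattern in SENSITIVE.items():
            if pattern.search(line):
                violations.append((number, f"possible {name} in log line"))
    return violations


def gate(lines):
    """Return True when the scan finds nothing, so deployment may proceed."""
    return not scan_log(lines)
```

In a pipeline, a step could call `gate` on the captured test-run log and stop the deployment when it returns `False`, printing the output of `scan_log` so the developer can see which lines to fix.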
The logging guidance above is also critical for the log training and log anomaly detection performed by AI tools such as Watson AIOps. The insights derived from log anomaly detection are sent to the production support team. When examining those insights, the support team may want to explore the log data associated with them, so clear and meaningful log data is key to a swift resolution of incidents. On the other hand, sensitive information in the log data can trigger another incident. That is why such sensitive information should be eliminated from the log data before those code changes are deployed to production.
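As an additional safeguard, sensitive values can be masked at the moment a message is written. The sketch below shows one possible approach using a `logging.Filter` from Python's standard library; the patterns and the `RedactionFilter` name are hypothetical, and a filter like this complements removing sensitive data from the code's log statements rather than replacing it.

```python
import logging
import re

# Illustrative patterns only; real deployments would need a vetted rule set.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CREDENTIAL = re.compile(
    r"\b(password|secret|token|api_key)(\s*[=:]\s*)\S+", re.IGNORECASE)


def redact(text):
    """Mask social security numbers and credential values in a string."""
    text = SSN.sub("[REDACTED-SSN]", text)
    return CREDENTIAL.sub(r"\1\2[REDACTED]", text)


class RedactionFilter(logging.Filter):
    """Rewrite each record so the formatted message is redacted."""

    def filter(self, record):
        # Render the message with its arguments first, then redact the result.
        record.msg = redact(record.getMessage())
        record.args = None  # arguments are already merged into msg
        return True  # keep the record; only its text was changed
```

Attaching the filter to a handler with `handler.addFilter(RedactionFilter())` applies it to every record that handler emits.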