
When services must stay running – A case for Automated Observability

By Tim Greenside posted Wed August 28, 2024 05:41 PM

  


Introduction

In every business environment, there are certain applications and systems that are of critical importance to the business.  If the availability of these systems falters, it can lead to dissatisfied customers and loss of revenue.  To ensure that critical systems are always online and always performing, we must monitor them.  Observability is key – the old axiom holds true: “you can’t maintain or improve what you don’t monitor”.

In this article, I will describe one approach that I’ve used successfully to monitor the availability of a service using IBM’s Rapid Network Automation solution. 

About Rapid Network Automation

IBM’s Rapid Network Automation (RNA) solution is a “low code / no code” automation tool, allowing users to design automation workflows with its drag-and-drop interface.  Workflow actions are represented as “Action Blocks”: API endpoint tasks, SSH commands and results, Python or Ansible playbook tasks, as well as logic blocks used by the RNA workflow (if/then, forEach, etc.).  Access to APIs and third-party systems is controlled by creating an “Authentication”, which holds the credentials needed for each API data source.  An Authentication can then be assigned at the block level to guarantee secure access to the required endpoints.

When it comes to service monitoring, there is no “silver bullet”.  Depending on your situation, you may need to come at it in different ways.  That’s what makes Rapid Network Automation workflows so useful and powerful – you can easily create workflows to handle your situation, no matter what is required.

Automated Service Monitoring using SSH

Designing the Workflow

The scenario:

  • I want to monitor service “x” because it sometimes stops responding (process not running).
  • This particular service requires some manual intervention to get it running again (don’t get me started...).
  • There is no monitoring agent installed on the system.  However, I can SSH to it.
  • If I run “ps -ef | grep <process name>”, I can see whether the process is running.
  • If the process state changes, I would like to notify my team via Slack with a single notification.
  • I want to automate this, so I don’t have to manually check the status.

The approach:

  • I will determine status by using SSH to access the system and running “ps -ef | grep <process name> | grep -v grep”
    • If the process is found, then status is “running”
    • If the process is not found, then status is “not running”
  • I want to represent the status of the process persistently, so I can check whether it has changed since the last run and avoid sending duplicate notifications to my team.
    • Use a file to hold the status value
      • 0 = running
      • 1 = not running
    • If the current status matches the status value in the file, then don’t notify via Slack
    • If the current status does not match the status value in the file, then notify via Slack and update the status file with the current status value.
  • I want to check the status of my process every 15 minutes.

Designing the Workflow in Rapid Network Automation

We start our workflow by adding the variables we will need to the workflow's "Start" block:

  • We define two sets of SSH Authentication credentials
    • One for our target system, allowing us to run the “ps -ef” command
    • Another for our RNA system, so we can store the result of the “ps -ef” command persistently
  • We define the process name we are checking, along with the name of the status file where the result will be written (example values are sketched below).
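
For the snippets that follow, assume the Start block defines values along these lines.  The names and values are purely illustrative – only the status file variable, referenced later as $statusFile, comes from the workflow itself.

    # Illustrative Start-block variables (names and values are examples only)
    processName="myservice"                      # process we want to monitor
    statusFile="/var/tmp/myservice_status.txt"   # persistent status file on the RNA host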

Next, we build our workflow.  We are using SSH action blocks to create the persistent status file and to run the UNIX command to check the process status.
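
As a rough sketch (using the example variables above, and noting that in the real workflow each command runs under its own SSH Authentication), the two SSH blocks boil down to commands like these:

    # On the RNA host: create the status file if it does not already exist (0 = running)
    [ -f "$statusFile" ] || echo 0 > "$statusFile"

    # On the target host: check whether the process is running
    ps -ef | grep "$processName" | grep -v grep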

Once we run the command, we need to handle what comes next – if the process is found, we take one path.  If it is not found, we take another.

We determine whether we need to notify the operations team by comparing the current return value with the value stored in $statusFile.  If the current value is different from the stored value, we need to send a notification to the Slack channel.  If it is the same, then we can simply do nothing, since we have already sent a notification about this state.
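
In plain shell terms (ignoring that the process check and the status file live on different hosts and are handled by separate action blocks), the comparison amounts to something like this:

    # Map the check to a status value: grep's exit status is 0 if the process
    # was found and 1 if not, which matches the status file convention above
    ps -ef | grep "$processName" | grep -v grep > /dev/null
    currentStatus=$?

    # Compare with the value stored on the previous run
    previousStatus=$(cat "$statusFile")

    if [ "$currentStatus" -ne "$previousStatus" ]; then
        echo "state changed: notify Slack and update $statusFile"
    else
        echo "state unchanged: nothing to do"
    fi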

Assuming the state has changed, we send the notification via Slack.  Then, we update the status file with the current status value so we don’t send another notification unless the status changes.

To notify via Slack, we use an HTTP Request action block.  You will need to request a URL and bearer token from your Slack administrator for the Slack channel that you want to write a message into.  You will also need to refer to the Slack documentation to define the request body.
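
What the HTTP Request block sends is roughly equivalent to the curl call below.  This assumes the bearer-token style of posting to Slack's chat.postMessage endpoint; the URL, token, channel, and message text shown are placeholders and will depend on what your Slack administrator provides and on the Slack documentation.

    # Roughly what the HTTP Request block posts to Slack (all values are placeholders)
    curl -s -X POST "https://slack.com/api/chat.postMessage" \
      -H "Authorization: Bearer xoxb-your-bot-token" \
      -H "Content-Type: application/json" \
      -d '{
            "channel": "#service-monitoring",
            "text": "CRITICAL: process myservice is not running"
          }'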

Once the notification is sent, we need to update the status file with the current state value so that it is ready for comparison on the next workflow run.  This makes sure that we don't spam our Slack channel with duplicate state messages.  We use an SSH block and simply echo the state value to the status file using the greater-than (>) symbol so that it overwrites the current value.
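
That SSH block boils down to a single command, for example:

    # Overwrite the stored status with the current value (0 = running, 1 = not running)
    echo "$currentStatus" > "$statusFile"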

Scheduling the Workflow

Rapid Network Automation has a job scheduler that we will use to run our automation workflow every 15 minutes.
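
The schedule itself is configured in RNA's scheduler rather than in cron, but for readers who think in cron terms, the same cadence would look like the entry below (the script path is purely illustrative).

    # cron equivalent of "every 15 minutes" (for comparison only; RNA's scheduler is used here)
    */15 * * * * /usr/local/bin/check_service_status.sh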

The Result

So, we have designed our workflow and tested it.  The example Slack messages are shown below.  If the service check result is normal, then a NORMAL message is sent with the output of the “ps -ef” command.  If it is not normal, then a CRITICAL message is sent.


#TechnicalBlog
#Highlights-home