Hello Community!
I was approached recently to discuss methods for validating that the Custom Rule Engine (CRE) was working. At first, my thoughts revolved around what that ask could be: is it about making sure the CRE was generating events? Is it about triggering an offense or a CRE event if it wasn't running? And then, how do we 'prove a negative' in this context? I offered some thoughts around CRE reports, or leveraging recent offenses as a marker... but no, that's not at all what they were looking for. Instead, they wanted to make sure the service was up and running.
Ok- requirements are now set... 'validate services are running.' Now to find which services need to be monitored. For that, it's imperative to understand which services are critical, so I pulled up the list of services to understand each service's purpose and, from there, which ones to monitor.
From the list, I decided it was best to monitor three services in particular:
- ecs-ec-ingress: the primary collection service that hands off events for further processing
- ecs-ec: parsing, coalescing, and categorizing of events
- ecs-ep: event correlation and event storage
So now, the question becomes how to best take this on. To start, it's important to understand where these services live and run. Each of these services runs on every appliance in an implementation that is processing events... in other words, the Console and Managed Hosts in a distributed environment. A quick note: this list does not include monitoring for flow-related collection and conversion. That service is 'qflow' which, while outside the scope of this specific exercise, can also be monitored in a similar way.
With these factors in mind, it came down to considering how to best approach the monitoring of these services, as well as how to alert if any of them fail. Normally, I would use the native custom parsing functionality of QRadar and create a custom rule with the Rule Wizard, but there's a critical factor here... the service necessary for the custom rule to function is the exact service we're looking to monitor, which rules out the conventional UI configuration and notification elements. Alright, we're going to focus on the back-end and keep it simple by leveraging out-of-the-box commands... but this needs to be continuous monitoring, so it's time to script it and schedule it!
Structuring The Monitoring Steps
Checking The Service
In order to check a service from the back-end, I'm used to the 'service' command, which follows the format "service {service name} status." This approach still works, but it redirects to the current "systemctl status {service name}" command, and both yield the same result. This can be seen in the screenshot below.
Now that I can see the service status when it's active, it's time to figure out what it looks like when it isn't. I stopped the ecs-ep service and checked the status again:
Now that I have a decent understanding of what shows up, I need to reduce the output to as small a result as possible for use in the automated check. Some grep and awk magic later, and I'm able to get just the status: either 'active' or 'failed.'
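As a sketch of that extraction, here's the pipeline applied to a sample 'Active:' line (the sample line is illustrative only; real systemctl output varies by version and uptime):

```shell
# A sample 'Active:' line as printed by `systemctl status` (illustrative only):
line="   Active: active (running) since Tue 2024-05-14 10:00:00 PDT; 2h 3min ago"
# Split on ':' to isolate the status field, then take the first word of it:
status=$(echo "$line" | grep Active | awk -F":" '{print $2;}' | awk '{print $1;}')
echo "$status"   # prints: active
```

Note that systemd also offers `systemctl is-active {service name}`, which prints the unit state directly and can simplify this pipeline.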
Outputting The Result
Alright, now that I've got the service status, I need to consider what to do with it. Running this on a minute-by-minute check with cron means a lot of checks, and when everything is running as it should, it's easy enough to skip the notification message, but it still makes sense to track the result. Since we're effectively tracking service uptime metrics, the QRadar audit log seems like a natural fit. Time to send the results to the audit log!
That's fine and good when the service is running, primarily because the CRE is active and we can see the logs as needed within the Log Activity window... but what if it is, in fact, not running? How do we track that? In an environment running QRadar with a configured SMTP server, it's possible to use the 'mail' command to send a text-based email while also feeding the result into the audit log. That's it! This is how I'll tackle this use case.
The Script
Piecing the various items together led to a small bash script in order to run the checks:
#!/bin/bash
#Version: 1.1
#Owner: Cristian Ruvalcaba
#Purpose: Test of service status
#Script Location: /store/scripts/
#Filename: service_checks.sh
statusep=$(systemctl status ecs-ep|grep Active| awk -F":" '{print $2;}'|awk -F" " '{print $1;}'| grep -v awk| grep -v directing) # Check status of ecs-ep service
statusec=$(systemctl status ecs-ec|grep Active| awk -F":" '{print $2;}'|awk -F" " '{print $1;}'| grep -v awk| grep -v directing) # Check status of ecs-ec service
statuseci=$(systemctl status ecs-ec-ingress|grep Active| awk -F":" '{print $2;}'|awk -F" " '{print $1;}'| grep -v awk| grep -v directing) # Check status of ecs-ec-ingress service
timestamp=$(date |awk -F" " '{print $2" "$3" "$4;}') # Take current timestamp
hostip=$(ifconfig|grep inet|grep -v inet6|grep -v 169.254|grep -v 127.0.0.1|awk -F" " '{print $2;}') # Identify current system's IP
recipient="recipient@yourdomain.net" # Recipient list for email notification of failed service check, comma separated
sender="qradar@yourdomain.net" #Define Sender's email
if [[ $statusep = 'active' ]] # Check for up status
then
echo "$timestamp ::ffff:127.0.0.1 servicescheck@$hostip (303) | [Action] [Service Check] [ECS-EP Service is Running]" | tee -a /var/log/audit/audit.log
else # If failed
echo "$timestamp ::ffff:127.0.0.1 servicescheck@$hostip (303) | [Action] [Service Check] [ECS-EP Service is NOT Running]" | tee -a /var/log/audit/audit.log
mail -s "ECS-EP Service is not running" -r $sender $recipient <<< "The ECS-EP service that provides the correlation services is not currently running"
fi
if [[ $statusec = 'active' ]] # Check for up status
then
echo "$timestamp ::ffff:127.0.0.1 servicescheck@$hostip (303) | [Action] [Service Check] [ECS-EC Service is Running]" | tee -a /var/log/audit/audit.log
else # If failed
echo "$timestamp ::ffff:127.0.0.1 servicescheck@$hostip (303) | [Action] [Service Check] [ECS-EC Service is NOT Running]" | tee -a /var/log/audit/audit.log
mail -s "ECS-EC Service is not running" -r $sender $recipient <<< "The ECS-EC service that provides the parsing and normalization services is not currently running"
fi
if [[ $statuseci = 'active' ]] # Check for up status
then
echo "$timestamp ::ffff:127.0.0.1 servicescheck@$hostip (303) | [Action] [Service Check] [ECS-EC-Ingress Service is Running]" | tee -a /var/log/audit/audit.log
else # If failed
echo "$timestamp ::ffff:127.0.0.1 servicescheck@$hostip (303) | [Action] [Service Check] [ECS-EC-Ingress Service is NOT Running]" | tee -a /var/log/audit/audit.log
mail -s "ECS-EC-Ingress Service is not running" -r $sender $recipient <<< "The ECS-EC-Ingress service that provides the log ingestion and collection services is not currently running"
fi
Scheduling it to run at a relatively high frequency, every minute, is possible by adding it to cron:
* * * * * /store/scripts/service_checks.sh
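Assuming the script is saved at /store/scripts/service_checks.sh (the location noted in its header), the deployment on each host might look like the following. This is a sketch: it appends to root's existing crontab rather than editing it interactively.

```shell
script=/store/scripts/service_checks.sh
chmod +x "$script"   # make the script executable
# Append the every-minute schedule to the current crontab:
(crontab -l 2>/dev/null; echo "* * * * * $script") | crontab -
```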
Key Assumptions
There are a few key requirements and assumptions with a process like this:
- This is an on-premise implementation of QRadar, or an implementation hosted on an IaaS provider like AWS, Azure or Google Cloud.
- This is implemented by a QRadar system admin, or an authorized user with appropriate privileges
- The script is executable
- All services needing monitoring are known
- An SMTP server is available and configured for use on each host on which the script runs
- The script will live locally on each Managed Host
- Cron is configured on the Managed Hosts and the Console in order to run the checks
Note: It may be possible to run this on the console alone, but that would require adjusting the approach
With these in place and the monitoring script running, it should be possible to monitor critical services on a near-real-time basis.
Output
Below is a screenshot of the alert email received when a service goes down:
And there you have it, an automated way to ensure services are running as they should in near real time. One thing to note: as long as the service is down, the recipient will continue to receive the email every time the script runs. There are ways to work around that, such as checking how long the service has been down or creating a 'prior state' file and using it as a condition for sending the email. I may consider modeling that process in the future... so you'll just have to keep checking the community forum for more content, just in case!
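For the curious, a minimal sketch of that 'prior state' idea (the file path and the hard-coded status value here are hypothetical stand-ins) records each run's result and only notifies on the transition from 'active' to anything else:

```shell
state_file="${TMPDIR:-/tmp}/ecs-ep.state"   # hypothetical location for the prior-state file
current="failed"                            # stand-in for the result of the status check
previous=$(cat "$state_file" 2>/dev/null)   # empty on the very first run
if [ "$current" != "active" ] && [ "$previous" = "active" ]; then
    echo "service just went down"           # send the single notification email here
fi
echo "$current" > "$state_file"             # record the state for the next run
```

With this in place, repeated runs while the service stays down would skip the email, since the prior state is no longer 'active'.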