Hello Community,
QRadar health monitoring is an area of interest (and concern) for every user. It is particularly difficult in large environments with multiple nodes, and for Managed Service Providers who monitor multiple customers through a single pane. The QRadar Deployment Intelligence app offers a large collection of useful widgets, but it lacks one essential feature: creating and sending alerts whenever a parameter crosses its threshold. Without alerts, response and remediation are delayed. Also, when there are multiple nodes, analysts have to scroll down to check whether any node has an issue.
Our need was to use the logs available from Health Metrics to monitor a set of key parameters and to generate alerts (offense/email) whenever one of them crosses its threshold, which simplifies monitoring. We used the QIDs to create rules and then used those rules in a dashboard to make monitoring easy. Email alerts were also configured to go to the admin team so that they can take the required action immediately.
Sample log:
Apr 21 18:09:59 127.0.0.1 [Thread-61] com.q1labs.hostcontext.health.Agent: [INFO] [NOT:0000006000][172.30.32.170/- -] [-/- -]LEEF:1.0|QRadar|Health Agent| 7.2.4|QRadarHealthMetric| MetricID=DiskUsage DeploymentID=abcd-efgh-1234-abcd-HostName=INDEC001 ComponentType=hostcontext ComponentName=hostcontext devTime=2021/04/21 18:09:58 +0530 devTimeFormat=yyyy/MM/dd HH:mm:ss Z Element=/dev Value=0.45
Event name: Health Metric
Custom Event Properties:
Extracted Field Name | Regular Expression | Field Type   | Log Source Type
Metric ID            | MetricID=(\S+)     | Alphanumeric | Health Metrics
Element              | Element=(\S+)      | Alphanumeric | Health Metrics
Hostname             | HostName=(\S+)     | Alphanumeric | Health Metrics
Value                | Value=(\S+)        | Numeric      | Health Metrics
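To sanity-check the expressions before saving them as custom event properties, they can be tried against the sample log outside QRadar. The following is only a minimal Python sketch: the function name `extract_properties` and the dictionary layout are made up for illustration, and Python's `re` module stands in for QRadar's own regex engine (for a simple pattern like `\S+` the two should behave the same way).

```python
import re

# Patterns copied from the custom event property table in this post.
PATTERNS = {
    "Metric ID": re.compile(r"MetricID=(\S+)"),
    "Element": re.compile(r"Element=(\S+)"),
    "Hostname": re.compile(r"HostName=(\S+)"),
    "Value": re.compile(r"Value=(\S+)"),
}

# LEEF portion of the sample log from this post, copied as-is.
SAMPLE = (
    "LEEF:1.0|QRadar|Health Agent| 7.2.4|QRadarHealthMetric| "
    "MetricID=DiskUsage DeploymentID=abcd-efgh-1234-abcd-HostName=INDEC001 "
    "ComponentType=hostcontext ComponentName=hostcontext "
    "devTime=2021/04/21 18:09:58 +0530 devTimeFormat=yyyy/MM/dd HH:mm:ss Z "
    "Element=/dev Value=0.45"
)


def extract_properties(payload):
    """Hypothetical helper: map each property name to its first capture."""
    result = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(payload)
        result[name] = match.group(1) if match else None
    # "Value" is declared Numeric in the table, so convert it when possible.
    if result["Value"] is not None:
        try:
            result["Value"] = float(result["Value"])
        except ValueError:
            result["Value"] = None
    return result


if __name__ == "__main__":
    print(extract_properties(SAMPLE))
```

For the sample log this yields Metric ID `DiskUsage`, Element `/dev`, Hostname `INDEC001` and Value `0.45`.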
Sample parameters monitored and their rules:
- Disk Utilisation - Trigger alert when QRadar-Disk Usage exceeds 80%
APPLY QRadar-Disk Usage Alert on events which are detected by the LOCAL system
AND when the event QID is one of the following (94000001) Health Metric
AND when the event matches Metric ID (custom) is any of DiskUsage
AND when the event matches Value (custom) is greater than 0.80
Response: Send email alert to admin team
Response Limiter: 1 time per 6 hours per Source IP
- Disk Failure - Trigger alert for any disk related failure
APPLY QRadar-Disk failure alert on events which are detected by the LOCAL system
AND when the event QID is one of the following (38750111) Predictive Disk Failure: Hardware Monitoring has determined that a disk is in predictive failed state, (38750110) Disk Failure: Hardware Monitoring has determined that a disk is in failed state
Response: Send email alert to admin team
Response Limiter: 1 time per 6 hours per Source IP
- CPU Utilisation - Trigger alert when QRadar CPU-Utilisation exceeds 90%
APPLY QRadar-CPU Utilisation Alert on events which are detected by the LOCAL system
AND when the event QID is one of the following (94000001) Health Metric
AND when the event matches Metric ID (custom) is any of UserCpu
AND when the event matches Value (custom) is greater than or equal to 90
Response: Send email alert to admin team
Response Limiter: 1 time per 30 minutes per Source IP
- Memory Utilisation - Trigger alert when QRadar Memory-Utilisation exceeds 90%
APPLY QRadar-Memory Utilisation Alert on events which are detected by the LOCAL system
AND when the event QID is one of the following (94000001) Health Metric
AND when the event matches Metric ID (custom) is any of SystemMemoryUsed
AND when the event matches Value (custom) is greater than or equal to 90
Response: Send email alert to admin team
Response Limiter: 1 time per 30 minutes per Source IP
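The threshold tests and response limiters above can also be modelled offline, for example to replay exported health metric values and see which ones would have raised an alert. This is a rough Python sketch under stated assumptions: `breaches_threshold` and `ResponseLimiter` are hypothetical names, the thresholds are copied from the rules above, and the limiter only approximates QRadar's behaviour as "at most one response per key per window".

```python
import operator

# Comparison and limit per Metric ID, mirroring the rule tests above.
THRESHOLDS = {
    "DiskUsage": (operator.gt, 0.80),         # greater than 0.80
    "UserCpu": (operator.ge, 90.0),           # greater than or equal to 90
    "SystemMemoryUsed": (operator.ge, 90.0),  # greater than or equal to 90
}


def breaches_threshold(metric_id, value):
    """True when the metric is monitored and its value crosses the limit."""
    rule = THRESHOLDS.get(metric_id)
    if rule is None:
        return False
    compare, limit = rule
    return compare(value, limit)


class ResponseLimiter:
    """Approximation: allow one response per key (e.g. source IP) per window."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._last_sent = {}

    def allow(self, key, now):
        """Return True and record the time if a response may be sent now."""
        last = self._last_sent.get(key)
        if last is not None and now - last < self.window_seconds:
            return False
        self._last_sent[key] = now
        return True


if __name__ == "__main__":
    disk_limiter = ResponseLimiter(6 * 3600)  # 1 time per 6 hours
    # (seconds since start, source IP, metric, value): made-up sample data
    samples = [
        (0, "172.30.32.170", "DiskUsage", 0.45),
        (60, "172.30.32.170", "DiskUsage", 0.85),
        (120, "172.30.32.170", "DiskUsage", 0.90),
    ]
    for now, source_ip, metric, value in samples:
        if breaches_threshold(metric, value) and disk_limiter.allow(source_ip, now):
            print(f"alert: {metric}={value} on {source_ip}")
```

Note that the disk value is compared as a fraction (0.80) while the CPU and memory values are compared as percentages (90), exactly as in the rules.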
More than 130 health metrics are generated, and they can be used for a wide variety of purposes.
Dashboard
A Pulse dashboard can be built for these monitored parameters, with one query per rule.
AQL:
SELECT RULENAME(creeventlist) AS Rule, UniqueCount("sourceIP") AS 'Source IP (Unique Count)', COUNT(*) AS 'Count' FROM events WHERE RULENAME(creeventlist) ='QRadar-Disk Usage Alert' ORDER BY "Count" DESC LAST 30 MINUTES
SELECT RULENAME(creeventlist) AS Rule, UniqueCount("sourceIP") AS 'Source IP (Unique Count)', COUNT(*) AS 'Count' FROM events WHERE RULENAME(creeventlist) ='QRadar-CPU Utilisation Alert' ORDER BY "Count" DESC LAST 30 MINUTES
SELECT RULENAME(creeventlist) AS Rule, UniqueCount("sourceIP") AS 'Source IP (Unique Count)', COUNT(*) AS 'Count' FROM events WHERE RULENAME(creeventlist) ='QRadar-Memory Utilisation Alert' ORDER BY "Count" DESC LAST 30 MINUTES
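The three queries differ only in the rule name, so when more rules are added it may be easier to generate them from one template. A small Python sketch, assuming the hypothetical helper name `build_rule_hits_query`. It only builds the query text and does not talk to QRadar, and it rejects rule names containing a single quote rather than guessing at an escaping rule.

```python
# Template mirroring the dashboard queries in this post.
QUERY_TEMPLATE = (
    "SELECT RULENAME(creeventlist) AS Rule, "
    "UniqueCount(\"sourceIP\") AS 'Source IP (Unique Count)', "
    "COUNT(*) AS 'Count' "
    "FROM events WHERE RULENAME(creeventlist) ='{rule}' "
    'ORDER BY "Count" DESC LAST {minutes} MINUTES'
)

# Rule names as used in the dashboard queries in this post.
RULE_NAMES = [
    "QRadar-Disk Usage Alert",
    "QRadar-CPU Utilisation Alert",
    "QRadar-Memory Utilisation Alert",
]


def build_rule_hits_query(rule_name, minutes=30):
    """Hypothetical helper: return the AQL text for one rule's widget."""
    if "'" in rule_name:
        # Escaping is deliberately not attempted in this sketch.
        raise ValueError("rule name must not contain a single quote")
    if not isinstance(minutes, int) or minutes <= 0:
        raise ValueError("minutes must be a positive integer")
    return QUERY_TEMPLATE.format(rule=rule_name, minutes=minutes)


if __name__ == "__main__":
    for name in RULE_NAMES:
        print(build_rule_hits_query(name))
```

Each printed line can then be pasted into a Pulse widget as its AQL statement.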
------------------------------
Nabojyoti Sarkar
------------------------------