IBM QRadar

IT Ops Metrics and QRadar

By Cristian Ruvalcaba posted Wed August 02, 2023 05:00 PM

  

Hello Community!

In the last 20 years, I've seen security technology evolve quite a bit. I've used log management technologies as part of my daily compliance responsibilities, and SIEM technologies for my investigative work as well as enhanced compliance duties and more. I also collaborated with other teams that had their own logging technologies in order to collect relevant information during investigations; those teams used their platforms to collect their application logs as well as to gather IT Operations metrics.

Over the years, technologies began to marry the two concepts: some requiring agents, others simply collecting available logs. Network and system monitoring in NOCs (Network Operations Centers) are a core function of a solid BCP (Business Continuity Plan) and of resiliency.

A few years ago, I met with a customer who had chosen a technology whose primary purpose is log management, with some SIEM-like functionality layered on top. Their team had struggled to make this technology function as a SIEM. They had seen how strong IBM QRadar SIEM is as a SIEM, including its ability to natively incorporate NetFlow, and were eager to bring it into their environment. Having just spent their budget on the other technology, however, they asked that we prove out IT Ops use cases for IBM QRadar SIEM and present them to their IT Operations team, in hopes that both teams could benefit from IBM QRadar SIEM.

While that customer was ultimately unable to move forward with the purchase, it planted a goal in my mind. I pieced together some proposal documents, a benefit analysis, and some samples, and got them ready to present should the opportunity present itself. Once that was ready, I went back to the day-to-day activities, slowly building more and more diverse use cases while also telling the IBM Security QRadar story to customers. Well, this year it finally happened: I got the opportunity to discuss IT Operations and was asked to do another quick proof of concept on the approach. This time, I decided to document it here.

Where to Begin

To start, I wanted to understand the most critical components of IT Operations and of ensuring a healthy system and environment. There are several items that are critical to a running host: disc, memory, CPU, services, OS... and so much more. The goal at this point is to explore what is possible. To that end, I put together a list of metrics I wanted to focus on for the first round and wrote a script to pull those metrics and ensure they found their way into the event pipeline:

  • Host name
  • OS version
  • CPU use
  • Memory use
  • Disc use
  • CPU and memory per service user
  • Gateway latency
  • NTP latency
  • DNS latency

These metrics can be used for measuring some host-specific details, but some can also be used to assess environmental issues. For example, if the gateway latency on certain hosts is small while on others it is very large, there is a possibility of network issues.

NOTE: As part of any custom logging and parsing effort, a log structure of some sort must be defined. In my case, I prefixed entries with "ITO", for "Information Technology Operations", in order to quickly identify and parse the relevant logs.

NOTE: This is an exercise to determine the viability of this approach. Not all intended metrics included in the script work as it stands today. Some combinations of metric gathering and logger use leave empty areas where a value should sit, while others produce nothing when the script runs via cron but yield results when run manually.

Breaking Down The Metrics
Focus on Linux

While these metrics are valuable for servers of any core operating system, I started out with Linux distributions for ease of implementation as well as for native syslog support, which makes sending the logs to a collection point rather simple. While I may show an example or two below regarding the method used for collecting events, I will defer detailed coverage until further into the article, where I provide the script's contents.

Hostname and OS Details

To start, I wanted to make sure I captured the hostname and operating system of a given host. In environments where systems live for short periods of time, or where 'real-time' asset management is needed, collecting these details in a method akin to a heartbeat can help provide that environmental awareness, given a well-adhered-to provisioning process for new hosts. To that end, on Linux distributions, pulling these details and sending them into the system logs can be accomplished in one line with the use of logger.

Logger allows crafted text to be sent into the system logs, for example /var/log/messages, while still allowing inline use of variables and embedded commands. Thus, the following gives us the details we're looking for:

logger "ITO: Host $(hostname) running $(cat /etc/os-release | grep ^NAME= ) $(cat /etc/os-release | grep ^VERSION= ) up for $(uptime | sed -E 's/^[^,]*up *//; s/, *[[:digit:]]* users.*//; s/min/minutes/; s/([[:digit:]]+):0?([[:digit:]]+)/\1 hours, \2 minutes/')"

CPU and Memory

The health of a given system is critical to understand. Some of the key indicators of health for any information system are the usage levels of its physical resources, so it is important to gather CPU and memory metrics for the system as a whole. Some native packages enable the user to gather these metrics; similar to the hostname details from the section above, I leverage logger to feed these numbers into the log files, and in doing so, also to any syslog destination the host is configured to relay logs to.
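
For reference, here is the single line from the full script further below that captures both; the pipe character in the middle keeps CPU and memory in one event, which keeps parsing simple:

logger "ITO: CPU Use: $(top -b -n1| grep "Cpu(s)")|Memory Use: $(top -b -n1| grep "KiB Mem")"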

Disc Partition Use

Any healthy system must maintain a level of available resources in a 'free' and 'available to use' state. To that end, disc usage is just as important, including just how much space is used and across which partitions or mount points.
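
In the full script further below this is done in two steps, writing df output to a temporary file and then feeding each line to logger; a condensed sketch of the same idea as a single pipeline would look like this:

df -h | grep -v Filesystem | awk '{print "ITO: Partition: "$1" Mounted on "$6" Size: "$2" Used: "$3" Available: "$4" Used %: "$5}' | while read line; do logger "$line"; done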

I/O Stats

An additional component of disc health relates directly to the ability to read from and write to disc. Capturing I/O stats for a host can be extremely beneficial, although an additional package may be required for this specific set of metrics.
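
In the script further below, this relies on iostat from the sysstat package; a minimal equivalent, assuming sysstat is installed, is:

# Requires the sysstat package for iostat
iostat -dxm | grep -v Device | grep -v CPU | grep -v '^$' | while read line; do logger "ITO: I/O Stats: $line"; done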

Network Latency

I couldn't tell you how many times over the years I failed authentication while typing the correct username and password combination... and the majority of the time it was because of a time disparity between the application I intended to use and the authenticating server. The first time this happened was very early in my career, and once I figured it out, I wasn't going to let it hit me again. BUT! Time servers are not the only servers we need to be able to reach readily... a race condition could let a malicious actor respond to a DNS query before the legitimate DNS server does. Monitoring latency to these types of servers is important to the overall health of the given IT ecosystem.
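
The script measures this with a short ping burst and pulls the average round-trip time from ping's summary line; a minimal sketch, with $target standing in for an NTP or DNS server address, looks like this:

# Average latency over four pings; the summary line is split on '/' to pull out the 'avg' field
avg_latency=$(ping -c 4 "$target" | awk -F '/' 'END{ print $(NF-2) }')
logger "ITO: AvePing: NTP - $target: $avg_latency ms"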

NIC stats

Network traffic can at times exceed bandwidth limits; at other times, solder points can fail on a NIC or, worse, physical damage to connector pins may lead to concerns with network and NIC connectivity. Monitoring error counts for received and transmitted packets can be useful to ensure a healthy system communication path.

NOTE: While working on this component of the monitoring script, I was unsuccessful in yielding these events in an automated way through the single comprehensive script, but kicking off the script manually did yield them. I may in the future attempt to split this piece of the script off and run it independently. I'll update this entry if that proves successful.
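
The counters themselves come straight from /sys/class/net; a condensed version of the loop used in the full script looks like this:

# Walk the interfaces that are up, skipping loopback, virtual and wireless interfaces
for interface in $(ip link show up | grep 'state UP' | awk -F: '$0 !~ "lo|vir|wl|^[^0-9]"{print $2}' | tr -d ' '); do
  tx_errors=$(cat /sys/class/net/"$interface"/statistics/tx_errors)
  rx_errors=$(cat /sys/class/net/"$interface"/statistics/rx_errors)
  logger "ITO: Interface $interface - TX errors: $tx_errors, RX errors: $rx_errors"
done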

Memory and CPU by user

Want to catch a runaway process or service? Which user is tied to it? Monitoring individual 'users' and their memory and CPU levels can help determine when a certain service or process is hogging too many resources and potentially causing an issue. Breaking these apart visually can enable an analyst to easily determine what to troubleshoot or escalate to a sysadmin or application owner to handle.
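
A condensed version of the per-user loop from the full script, summing the %MEM and %CPU columns of ps for each user, looks like this:

for user in $(ps aux | grep -v COMMAND | awk '{print $1}' | sort | uniq); do
  mem=$(ps aux | egrep "^$user" | awk '{total += $4} END{print total}')
  cpu=$(ps aux | egrep "^$user" | awk '{total += $3} END{print total}')
  logger "ITO: For $user - Memory use: $mem % and CPU use: $cpu %"
done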

Service Status

Up, Down, Touch the Ground... Knowing which services are down and which are up, along with how long services have been up, helps monitor for unexpected service bounces and potential issues with applications, and can raise a huge red flag if a mission- or business-critical service is not running, costing an organization thousands if not hundreds of thousands of dollars a day. The ability to gather a comprehensive list of existing services from a host is critical to an in-depth view of host health; pair this with a defined list of critical services and we're off to the races!
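
The script checks each service passed in on the command line with systemctl; a stripped-down version of that loop is:

# Services to check are passed as command-line arguments
for service in "$@"; do
  if systemctl is-active --quiet "$service"; then
    logger "ITO: Service Up:  $service.service"
  else
    logger "ITO: Service Down:  $service.service"
  fi
done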

Piecing it all together- Gettin' Scripty With It (NaNa NaNa NaNa Na)

The script below employs several approaches to similar tasks: sending text directly to logger; writing to a log file generated here and then running logger on each line; and more. I could say this is intended to prove that the various methods are possible... but in all honesty, these are simply the approaches I took on a given day for a given segment of the script, and I have yet to sit down and make the methods uniform. There is no right or wrong answer in terms of methods or approaches, only an intended output, but I may still one day sit down to make this more uniform and streamlined.

The script is below and is commented throughout:

#!/bin/bash
#Version: 1.1
#Owner: Cristian Ruvalcaba
#Contact: cristian@ibm.com
#Log Model: Linux IT Operations Metrics Collection Script
#Dependencies (Packages): sysstat

## Functions

# Function to calculate average latency using ping command to DNS servers configured on host
function latency_DNS() {
    # Get the ping output
    local ping_output=$(ping -c 4 "$1")

    # Extract the average latency
    local avg_latency=$(echo "$ping_output" | awk -F '/' 'END{ print $(NF-2) }')

    # Output
    logger "ITO: AvePing: DNS - $1: $avg_latency ms"
}

# Function to calculate average latency using ping command to the configured gateway
# May need to adjust script for multihomed devices
function latency_GW() {
    # NOTE: This function seems to not function when script is initiated via cron.
    #       Starting the script manually may yield results for the latency to the 
    #       default gateway.


    # Variables and calculations
    gateway=$(ip route | grep default | awk '{ print $3 }')
    latency_l=$(ping -c 4 $gateway |  awk -F '/' 'END{ print $(NF-2) }')
    
    # Output
    logger "ITO: AvePing: Default GW: $gateway: $latency_l ms"
}

# Function to calculate average latency using ping command to NTP servers
function latency_NTP() {
    # Get the ping output
    local ping_output=$(ping -c 4 "$1")

    # Extract the average latency
    local avg_latency=$(echo "$ping_output" | awk -F '/' 'END{ print $(NF-2) }')

    # Output
    logger "ITO: AvePing: NTP - $1: $avg_latency ms"
}


# Collect relevant variables into temporary files

# I/O stats
io_stats="/tmp/io_stats.txt"
iostat -dxm|grep -v Device|grep -v CPU|grep -v '^$'>$io_stats

# DNS servers
dns_servers=$(nmcli device show | grep IP4.DNS | awk '{print $2}')

# Collecting list of NTP servers used by host
ntp_servers=$(cat /etc/ntp.conf|grep server|awk '{print $2;}')

# Collecting Partition Details and write log entries to temporary file
tempPartitionFile="/tmp/partition.txt"
df -h  | grep -v Filesystem| awk '{print "ITO: Partition: "$1" Mounted on "$6" Size: "$2" Used: "$3" Available: "$4" Used %: "$5}' > $tempPartitionFile

# Metrics collection using logger to send to /var/log/messages

# High level details on host: hostname, OS and uptime
logger "ITO: Host $(hostname) running $(cat /etc/os-release | grep ^NAME= ) $(cat /etc/os-release | grep ^VERSION= ) up for $(uptime | sed -E 's/^[^,]*up *//; s/, *[[:digit:]]* users.*//; s/min/minutes/; s/([[:digit:]]+):0?([[:digit:]]+)/\1 hours, \2 minutes/')"
logger "ITO: CPU Use: $(top -b -n1| grep "Cpu(s)")|Memory Use: $(top -b -n1| grep "KiB Mem")"

# While loops for multiple element variables

#check IO Stats
while read line; 
    do logger "ITO: I/O Stats: $line";                   
done < $io_stats

# Latency Details

# Gateway - Commenting out as this does not function via cron
# latency_GW

# DNS
for dns_server in $dns_servers; do
    latency_DNS "$dns_server"
done

# NTP 
for ntp_server in $ntp_servers; do
   latency_NTP "$ntp_server"
done


#check Network Interface Stats
for interface in $(ip link show up | grep 'state UP' | awk -F: '$0 !~ "lo|vir|wl|^[^0-9]"{print $2}' | tr -d ' '); do
  tx_errors=$(cat /sys/class/net/"$interface"/statistics/tx_errors)
  rx_errors=$(cat /sys/class/net/"$interface"/statistics/rx_errors)
  tx_packets=$(cat /sys/class/net/"$interface"/statistics/tx_packets)
  rx_packets=$(cat /sys/class/net/"$interface"/statistics/rx_packets)

  # Log the error and packet counts to syslog
  logger "ITO: Interface $interface - TX error count: $tx_errors over $tx_packets, RX error count: $rx_errors over $rx_packets"

done


# Log Disk Partition File
cat $tempPartitionFile | while read line
do
    logger "$line"
done

# Collect the data per user
for user in `ps aux | grep -v COMMAND | awk '{print $1}' | sort | uniq`
do
  # Count CPUs
  cpucount=$(lscpu | grep "^CPU(s):"|awk '{print $2}')

  # Grab memory and CPU for user
  logger "ITO: For $user - Memory use: `ps aux | egrep ^$user | awk 'BEGIN{total=0}; {total += $4};END{print total,\"%\"}'` and CPU use across $cpucount cores: `ps aux | egrep ^$user | awk 'BEGIN{total=0}; {total += $3};END{print total,\"%\"}'` equivalent of a single core."

done

# Collect Service Status and Generate Log For Each
# CLI argument required
for service in "$@"; do
  if systemctl is-active --quiet "$service"; then
    logger "ITO: Service Up:  $service.service up for $(systemctl status $service | grep running | awk -F';' '{print $2;}'|awk -F' ago' '{print $1;}')"
  else
    logger "ITO: Service Down:  $service.service"
  fi
done

Scheduling a cron job will get events to the appropriate collection point on a regular cadence; the frequency may depend on an organization's needs. For testing purposes, I have scheduled it to run every minute. See below for the cron entry I have running.

* * * * * /root/it_operations_metrics.sh $(systemctl --type=service|grep loaded| awk -F".service" '{print $1;}'|sed s,\s,,g|tr '\n' ' '|awk -F" LOAD" '{print $1;}')

From the cron entry above, you'll note that this isn't just the script: it includes an embedded call to systemctl. That call produces the list of all loaded services, which is passed in as arguments to the script and consumed by the for loop in the last section of the script above.

Making it all happen in QRadar SIEM

In order to leverage the metrics in these events in QRadar SIEM, it is important to create the appropriate custom extracted properties. The steps I chose to take are below:

  1. Create custom Log Source Type
  2. Create "Event ID" parsing
  3. Add relevant existing fields from the list and apply the appropriate regex, or create additional fields as needed
  4. Create widgets in Pulse to leverage the data and present relevant visualizations

For the purposes of this exercise, materials were created using QRadar 7.5.0 UP3.

1-    Creating the log source type

I decided to create a dedicated log source type for the purposes of this exercise, but it is not required: creating all the appropriate extractions on top of the native Linux OS DSM would work just as well, with some adjustments to the related queries used for searches. NOTE: All materials linked or attached to this article leverage the IT Ops log source type wherever a log source type is referenced.

2-    Create Event ID Parsing

The first item I chose to focus on was the Event ID parsing. I defined the logs to have a fairly consistent structure, with one exception around the host details. That exception was not intentional; adjusting it is a matter of changing one line in the script below: line 70.
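
As an illustration only (the exact expression in the attached materials may differ), the keyword that follows the "ITO:" tag can serve as the Event ID. A quick way to test a candidate pattern from a shell before configuring it in the DSM Editor:

# Hypothetical Event ID pattern, tested against one of the sample payloads further below.
# In the DSM Editor, the equivalent would be a regex such as 'ITO: ([^:]+):' with capture group 1.
echo '<13>Aug  1 19:01:51 jump01.saluca.net root: ITO: Service Down:  vmtoold.service' | grep -oP 'ITO: \K[^:]+'
# prints: Service Down

The host-detail payload is the one exception mentioned above, since no colon follows its keyword.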

3-    Add fields

For parsing, I took the payloads below and created the relevant extractions:

• <13>Aug  1 19:01:51 jump01.saluca.net root: ITO: Service Down:  vmtoold.service

• <13>Aug  1 19:01:51 jump01.saluca.net root: ITO: Service Up:  vgauthd.service up for  2 months 2 days

• <13>Aug  1 19:01:50 jump01.saluca.net root: ITO: For root - Memory use: 6.9 % and CPU use across 1 cores: 0.2 % equivalent of a single core.

• <13>Aug  1 19:01:49 jump01.saluca.net root: ITO: Partition: tmpfs Mounted on /dev/shm Size: 920M Used: 0 Available: 920M Used %: 0%

• <13>Aug  1 19:01:20 jump01.saluca.net root: ITO: AvePing: NTP - 3.centos.pool.ntp.org: 73.692 ms

• <13>Aug  1 19:01:02 jump01.saluca.net root: ITO: I/O Stats: sda       0.00     0.06    0.00    0.36     0.00     0.00    18.40     0.00    1.46    1.38    1.46   0.39   0.01

• <13>Aug  1 19:01:02 jump01.saluca.net root: ITO: CPU Use: %Cpu(s):  5.6 us, 11.1 sy,  0.0 ni, 83.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st|Memory Use: KiB Mem :  1882340 total,    72748 free,   271464 used,  1538128 buff/cache

• <13>Aug  1 19:01:01 jump01.saluca.net root: ITO: Host jump01.saluca.net running NAME="CentOS Linux" VERSION="7 (Core)" up for 63 days,  6 hours, 5 minutes

I’ve attached the screenshots of the custom extractions below. You’ll note that there are two log source types associated with the parsing. This is due to the process I took while creating the extractions and the log source type and sources themselves. 
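
For the custom properties themselves, the same shell-based testing approach can help. The patterns below are illustrative only and are not necessarily the ones in the attached materials; they target a few of the properties referenced in the AQL queries later on, such as ito_ping_host, ito_latency and ito_service_status (in QRadar, each would be defined with a capture group rather than \K):

# Hypothetical capture patterns, tested against two of the sample payloads above
echo 'ITO: AvePing: NTP - 3.centos.pool.ntp.org: 73.692 ms' | grep -oP 'AvePing: \w+ - \K\S+(?=:)'      # ito_ping_host -> 3.centos.pool.ntp.org
echo 'ITO: AvePing: NTP - 3.centos.pool.ntp.org: 73.692 ms' | grep -oP 'AvePing: .*: \K[\d.]+(?= ms)'   # ito_latency -> 73.692
echo 'ITO: Service Up:  vgauthd.service up for  2 months 2 days' | grep -oP 'ITO: Service \K(Up|Down)'  # ito_service_status -> Up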

4-    Create widgets

The dashboards created include an overarching environmental overview of Linux systems in the environment, with a drill-down capability into an individual host to view that host's health metrics. Screenshots can be seen below:

IT Ops Insights:

IT Ops Insights - Host Profile
The screenshot series below showcases a single dashboard with many widgets.
Below, we review a sample of individual widgets and provide their AQL.
Time Series
To review how to craft a time series chart, I chose the Average Latency to NTP Servers widget.
The same methodology would apply to DNS servers, or to other time-series-based items found in these dashboards. For this, I created a widget with the query below:
The query:
select concat(substring(starttime, 0, 9), '0000') as "Time", avg(ito_latency) as "Latency", ito_ping_host as "Target" from events where ito_ping_type == 'NTP' and strlen("Target") >0 group by "Time" order by "Time" asc last 1 hours
Items to note here: I took out the last four digits of starttime and replaced them with zeros in order to have a common 'time', so that I can calculate average latencies in this example. For a time series, it is critical to group by time. I also chose to bypass a null check and instead use a string-length comparison to determine whether the intended NTP servers have an identifier, which avoids blank names. A very important factor here is to make sure that events are ordered by a timestamp of some sort; this ensures that the visualization built in Pulse properly draws the connecting lines across the applicable data points.
From there, it is a matter of creating the visualization. 
In the screenshot below, note that we are splitting the data, using a dynamic series, based on the 'target', which in our example is the NTP servers. This allows each server to have its own plotted line.
Pie or Doughnut Chart
For the Pie Chart, I chose to showcase a host's overall service status across all services. Not all services are critical, or perhaps even intended to run on a hardened system, so seeing less than 100% 'Up' may be by design.
While not all services need to be 'Up' in order to consider a system 'healthy', identifying critical services in general or on a host-by-host basis may be possible, and leveraging Reference Data (maps, sets, etc.) could limit the visualization to only those critical services where a 100% 'Up' status would apply. The approach would be similar, with an added query parameter to refer to that data. In my case, I crafted the query below:
The query:
select "Service Status", count() from (select "Service Name", LAST(ito_service_status) as "Service Status" from events where utf8(payload) like '%ITO:%' and Hostname ilike '{host}' and strlen("Service Name") > 0 GROUP BY "Service Name" order by "Service Status" desc last 5 minutes) group by "Service Status"
Items to note here include the subquery, which collects the details and gathers the most recent status for each service on a given host, while the outer query only takes the status and count.
Then it's on to the pie! The configuration on this visualization is a bit less involved than the dynamic time series above:
Bar Chart
For a bar chart, I chose to leverage the OS and version details within this environment. You'll note below a 'null' indicated as one of the CentOS versions... I'm still working on that piece; it may be a misconfigured script on one of my systems that is not sending all relevant information, and thus my parsing is attempting to capture data from an empty space in the log.
Despite the clearly visible 'null' in my entries, I am still able to gather a contextual understanding of the environment's assets, and I could even home in on those servers in particular through a drill-down, if I created one, to identify the hostnames of the servers sending the incomplete logs. For the time being, I am using the configuration below, including this query:
The query:
select OS as "OS", COUNT() from (SELECT hostname, concat("OS Name",' - ',"OS Version") as "OS" from events where UTF8(payload) ilike '%ITO: Host%' and STRLEN("OS Name") > 0  GROUP BY hostname last 1 hours) GROUP BY OS
Things to note here include another subquery and yet another concatenation. As part of the parameters, note that I do a length comparison on the OS name but not on the version. Had I also done a comparison against the version, the 'null' entry would not show up in the bar chart.
Time to build a bar... chart!
Table
For a table, I chose to showcase the Service and Status table from the host profile and metrics dashboard. The goal with this visualization is to be able to find a service by name and see its latest status.
Similar to what I discussed above, getting the appropriate most recent status required subqueries in several visualizations, but in this particular case that was not necessary, as can be seen in the screenshot below:
The query:
select "Service Name", LAST(ito_service_status) as "Service Status" from events where utf8(payload) like '%ITO:%' and Hostname ilike '{host}' and strlen("Service Name") > 0 GROUP BY "Service Name" order by "service Status" desc last 5 minutes
If you look at the query carefully, you'll notice it's the same subquery used in the pie chart above. Crafting this table was the first step in understanding how to properly gather the data so that only certain data could subsequently be queried from the yielded results.
Now to set the table!
As you can see, QRadar SIEM continues to be an extremely flexible platform that, given the right data, can present almost any use case in a visual format, in a report, or streaming across the analyst's screen, limited only by imagination! Oh, and this applies to offenses too. Imagine a data point where a given partition on a critical server reaches 70% capacity, and that is intended to kick off an effort to assess expanding disc capacity, or perhaps to flip over to another active data store instead... a rule could be written to trigger a script, escalate to a SOAR case (and its relevant workflow tasks), or send an email to the appropriate sysadmin. The possibilities are truly endless.
Materials including the log source type and fields, the dashboards, and the IT Ops Metrics script can all be found here. Thanks to @Jose Bravo for hosting these files for me! For those who don't know Jose, check out his YouTube channel; it has a ton of useful QRadar videos, including art-of-the-possible content and tips and tricks.
I will soon craft a follow-up blog entry to add details around building some visualizations that align with the above, but for QRadar Log Insights and Grafana.

Comments

Mon August 07, 2023 05:08 AM

This is really an amazing write up.

Mon August 07, 2023 03:54 AM

Hello Cristian Ruvalcaba

I found your blog very useful for customer environments that intend to consolidate their log management and leverage the power of IBM QRadar.

Thu August 03, 2023 08:24 AM

Thank you @Cristian Ruvalcaba for posting this content!