IT Ops Metrics and QRadar ... Suite

By Cristian Ruvalcaba posted Tue August 08, 2023 08:00 PM

  

Hello Community! 

This is a follow-up to the previously posted IT Ops Metrics and QRadar article, which designed visualizations in QRadar SIEM. This article uses the same log collection methods described there, including the log format. In fact... I created the following first and then recreated it in QRadar SIEM to make sure the method could be implemented across our Threat Management log collection portfolio. While QRadar SIEM is a fantastic SIEM, some customers go beyond security use cases alone and also have additional licenses that allow some events to bypass correlation while still remaining searchable.

With QRadar Log Insights (QLI), we start with a cloud-native log collection data lake that associates security context with events, along with auto-discovery of ingestion data sources and parsing when a Device Support Module (DSM) exists. With QRadar SIEM, we created a custom log source and several extractions in order to have the necessary foundation for the visualizations to function as designed. It is an involved process and uses Ariel Query Language (AQL) as our query language. AQL is similar to some of your favorite database query languages (e.g. SQL), and subqueries were used in that implementation.

What changes in QLI? Well, a great deal! With QLI, we are introducing our implementation of Kusto Query Language (KQL) for searching through ingested logs. KQL is an iterative, pipelined language that lets you pull column data and apply a heap of comparison and calculation functions, including the ability to extract fields in-query. This context may seem a bit wordy, but I share it to provide concrete differences between QLI and QRadar SIEM. Below, I will show a few of the KQL queries that were created for this effort on QLI.
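To give a feel for that piped, step-at-a-time style, here is a minimal illustrative query. This is my own sketch rather than one of the dashboard queries; it assumes the same 'events' table and 'payload'/'original_time' columns used throughout this article:

events
    | where original_time > ago(1h)
    | where payload contains 'ITO'
    | summarize Event_Count=count() by bin(original_time, 5m)
    | sort by original_time desc

Each pipe hands the previous step's result set to the next step, which is the pattern every query in this article follows.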

Again, please find some sample logs below:

Aug  6 03:15:00 jump01.saluca.net root: ITO: Partition: /dev/sda1 Mounted on /boot Size: 1014M Used: 181M Available: 834M Used %: 18%
Aug  6 03:15:00 jump01.saluca.net root: ITO: For root - Memory use: 6.5 % and CPU use across 1 cores: 0.2 % equivalent of a single core.
Aug  6 03:15:00 jump01.saluca.net root: ITO: Service Up:  crond.service up for  2 months 6 days
Aug  6 03:15:00 jump01.saluca.net root: ITO: Service Down:  dbu.service
Aug  6 03:15:01 jump01.saluca.net root: ITO: Host jump01.saluca.net running NAME="CentOS Linux" VERSION="7 (Core)" up for 67 days, 14 hours, 19 minutes
Aug  6 03:15:02 jump01.saluca.net root: ITO: CPU Use: %Cpu(s):  4.3 us, 30.4 sy,  0.0 ni, 65.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st|Memory Use: KiB Mem :  1882340 total,   151468 free,   263868 used,  1467004 buff/cache
Aug  6 03:15:02 jump01.saluca.net root: ITO: I/O Stats: sda               0.00     0.07    0.00    0.39     0.00     0.00    18.17     0.00    1.43    1.38    1.43   0.40   0.02
Aug  6 03:15:05 jump01.saluca.net root: ITO: AvePing: DNS - 192.168.2.1: 0.525 ms
Aug  6 03:15:11 jump01.saluca.net root: ITO: AvePing: NTP - 0.centos.pool.ntp.org: 73.400 ms
Aug  6 03:15:49 jump01.saluca.net root: ITO: Partition: /dev/mapper/centos-root Mounted on / Size: 14G Used: 2.1G Available: 12G Used %: 16%
Aug  6 03:15:50 jump01.saluca.net root: ITO: For root - Memory use: 6.5 % and CPU use across 1 cores: 0.2 % equivalent of a single core.
Aug  6 03:15:50 jump01.saluca.net root: ITO: Service Up:  auditd.service up for  2 months 6 days
Aug  6 03:15:50 jump01.saluca.net root: ITO: Service Down:  kmod-tatic-node.service

Dashboards in QLI

I've created the same visualizations that you see in the QRadar SIEM article on QLI first... well, on Grafana first, using the QLI-collected data. But let's go ahead and start with QLI.

The "IT Ops" Dashboard

The screenshot below showcases the IBM QRadar Suite's 'Dashboard' visualization tool. Stemming from the QRadar SIEM Pulse application, the QRadar Suite's implementation has two data query approaches: KQL and API. The visualizations below were all created with KQL.

From the screenshot above, we'll focus on three specific visualizations:

1- Linux OS Versions in Monitored Environment

2- Recently Checked in Systems

3- Max Latency to DNS Servers

Linux OS Versions in Monitored Environment

For the above visualization, the following log entry is leveraged:

<13>Aug  7 12:31:01 jump01.saluca.net root: ITO: Host jump01.saluca.net running NAME="CentOS Linux" VERSION="7 (Core)" up for 68 days, 23 hours, 35 minutes

From this entry, we can discern the hostname and the operating system version. Depending on the allowable window for not checking in, multiple log entries for a single host may exist within the time window of our query. Because of this, it is critical to limit the results to a single entry per host. The query below structures the data in the format, and with the parameters, necessary for the visualization above:

events    
    | project payload
    | where original_time > ago(5m)    
    | where payload contains 'ITO: Host' and payload contains 'NAME' and payload contains "VERSION"
    | extend hostName=extract("^\<\d+\>\w+\s+\d+\s+\d+\:\d+\:\d+\s+([a-zA-Z0-9\.\-\_]+)\s+",1,payload, typeof(string)),
    os=extract("NAME=\"([a-zA-Z0-9\.\(\)\s]+)\"",1,payload, typeof(string)),
    version=extract("VERSION=\"([a-zA-Z0-9\.\(\)\s]+)\"",1,payload, typeof(string)),
    osv=strcat(os," - ",version)
    | distinct hostName, osv
    | summarize Host_Count=count() by osv
    | take 10000

Breaking the KQL down:

1- We are searching through the 'events' table only

2- We pull the payload only

3- We pull the last 5 minutes of data

4- In the 'Extend' section, we are extending the effective columns, or data points, to add custom in-query extracted fields: hostname, os, and version. We also concatenate the OS and version into the 'osv' field.

5- We group to distinct hostname and osv pairs for the returned data set

6- We summarize by calculating how many entries of each osv exist

By the time we get to the last step, we've trimmed the data down to a single 'osv' entry per actively reporting host in the given time window, which we can then count. At this point, the hostname is no longer needed, and as such, it is not used for the visualization.
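As an aside, and assuming QLI's KQL implementation supports the standard Kusto dcount() aggregate (I have not verified that here), the distinct and summarize steps (5 and 6) could be collapsed into a single distinct-count step:

    | summarize Host_Count=dcount(hostName) by osv

Either form yields one count per 'osv'; the distinct-then-count pair above just keeps each step explicit.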

The configuration for the visualization itself is very similar to the IBM QRadar SIEM process. I'll include a screenshot below for this visualization, but will not include that for the following visualization examples.

For the query:

For the visualization:

Recently Checked in Systems

For the above visualization, we once again use the same payload as the visualization documented above. In fact, it will be the same query! Kind of... the goal here is to snag the hostname. The output from step 5 of the query above contains that information and I chose to re-use the KQL to save time while building this out.

I used the following query for this specific visualization:


events    
    | project payload
    | where original_time > ago(50m)    
    | where payload contains 'ITO: Host' and payload contains 'NAME' and payload contains "VERSION"
    | extend hostName=extract("^\<\d+\>\w+\s+\d+\s+\d+\:\d+\:\d+\s+([a-zA-Z0-9\.\-\_]+)\s+",1,payload, typeof(string)),
    os=extract("NAME=\"([a-zA-Z0-9\.\(\)\s]+)\"",1,payload, typeof(string)),
    version=extract("VERSION=\"([a-zA-Z0-9\.\(\)\s]+)\"",1,payload, typeof(string)),
    osv=strcat(os," - ",version)
    | distinct hostName, osv

The first 5 steps detailed in the above breakdown comprise the entirety of this specific query. While we only present the hostname in the table, we do have both the hostname and OS Version in the results:

What would I do differently now?

Well, it very much depends. Leaving the query as-is allows me to adjust the table to include both columns and also create drill-downs for both the hostname, as it already exists, and the OS version. Creating a drill-down for the OS version would likely allow me to create an overview dashboard for all hosts in that grouping, including comparing specific service statuses, up vs. down for critical services, and more! I did not create that drill-down... but I did create one for the hostname. Other efficiencies may also exist, including not projecting the payload before creating the extracted fields in the query and simply snagging the extracted field(s) directly.

If I were to adjust the query, I could make it more efficient by directly projecting the hostname extraction and then keeping only distinct values. Given that the hostname is my primary data point here, I did create a drill-down to a host profile dashboard; see the screenshot below. We'll explore more of that dashboard later in this entry.
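Coming back to that query adjustment: as a rough sketch of my own rework (not lifted from the shipped dashboards), the same extractions can be moved straight into the project step so the raw payload is never carried forward:

events
    | where original_time > ago(50m)
    | where payload contains 'ITO: Host' and payload contains 'NAME' and payload contains "VERSION"
    | project hostName=extract("^\<\d+\>\w+\s+\d+\s+\d+\:\d+\:\d+\s+([a-zA-Z0-9\.\-\_]+)\s+",1,payload, typeof(string)),
    osv=strcat(extract("NAME=\"([a-zA-Z0-9\.\(\)\s]+)\"",1,payload, typeof(string))," - ",extract("VERSION=\"([a-zA-Z0-9\.\(\)\s]+)\"",1,payload, typeof(string)))
    | distinct hostName, osv

The regular expressions are the same as in the query above; only the ordering of the steps and the projection change.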

Max Latency to DNS Servers

The logs used for this visualization are similar to those used for the average latency visualization, as well as for the Max and Average NTP latency visualizations, with some slight adjustments to the query for each.

<13>Aug  7 12:28:04 jump01.saluca.net root: ITO: AvePing: DNS - 192.168.2.1: 0.372 ms

From the above log entry, we can discern the relevant data: hostname, the DNS-related marker, DNS host, and latency. For this general visualization, the hostname is not needed, but it could play a role in DNS latency visualizations on the host profile dashboard. So, for our general dashboard here, what is it we want to see? Ideally, the maximum latency for each DNS host. To that end, the following query is used:

events
| project date=bin(original_time,1m), latency=extract("AvePing: DNS - [0-9a-zA-Z\.\-]+: (\d+\.\d+)",1,payload,typeof(double)), DNS_Server=extract("AvePing: DNS - ([0-9a-zA-Z\.\-]+): \d+\.\d+",1,payload,typeof(string))
| where original_time >= ago(6h) and payload contains 'ITO' and isnotempty(DNS_Server)
| extend datehost=strcat(tostring(date)," ",DNS_Server)
| summarize average_latency=avg(latency), max_latency=max(latency) by datehost
| project average_latency, max_latency, dns_host=extract("([0-9a-zA-Z\.\-]+)$",1,datehost,typeof(string)), date_time=todatetime(extract("^([0-9\.\-\:TZ]+) ",1,datehost,typeof(string)))
| sort by date_time desc

Breaking this KQL query down:

1- Again, we search through the 'events' table.

2- Similar to what I discuss in the "What would I do differently now?" section above, I do the extraction immediately here, projecting only the fields I need: the timestamp binned to the minute, the latency, and the hostname (or IP). Binning the timestamp to the minute, to the hour, or to another interval allows for more generalized calculations on the data (see the hourly variant sketched after this breakdown).

3- We ensure that a DNS server is involved in the log, that it is one of the ITO, or IT Operations, logs collected, and we limit results to the preceding 6 hours.

4- Concatenate the date and DNS host fields. This allows the Average and Max calculations within a given time window while still associating them with a given host, just in case multiple DNS servers are used within the environment.

5- This is where the calculations happen, both Average and Maximum, for the time slots (1-minute windows) for given hosts. We also split the hostname and time back out into their own fields.

6- This is where the data is sorted. This is important when using QRadar Pulse or QRadar Suite Dashboards, as it defines how the connecting lines between data points are drawn. While this is documented, it is definitely one of those nuances that I only recently figured out, after finally digging into why some of my time series were not rendering the way I needed or wanted them to.
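For example, here is a variant of the query above (my own sketch; only the bin size and lookback are changed) that would roll the latency figures up into hourly buckets over a full day:

events
| project date=bin(original_time,1h), latency=extract("AvePing: DNS - [0-9a-zA-Z\.\-]+: (\d+\.\d+)",1,payload,typeof(double)), DNS_Server=extract("AvePing: DNS - ([0-9a-zA-Z\.\-]+): \d+\.\d+",1,payload,typeof(string))
| where original_time >= ago(24h) and payload contains 'ITO' and isnotempty(DNS_Server)
| extend datehost=strcat(tostring(date)," ",DNS_Server)
| summarize average_latency=avg(latency), max_latency=max(latency) by datehost
| project average_latency, max_latency, dns_host=extract("([0-9a-zA-Z\.\-]+)$",1,datehost,typeof(string)), date_time=todatetime(extract("^([0-9\.\-\:TZ]+) ",1,datehost,typeof(string)))
| sort by date_time desc

The rest of the pipeline, including the datehost concatenation and the final split, stays exactly as written above.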

NOTE: A separate visualization is used for the Average Latency across DNS servers for QRadar Suite Dashboards.

The "IT Ops - Host" Dashboard

Continuing to leverage the IBM QRadar Suite's Dashboarding, the below screenshots showcase the IT Ops host profiling dashboard, the drill down from the table we covered in the preceding section. 

While I've covered three different visualizations in the first section above, most of the visualizations on this dashboard follow similar principles and can be worked out as needed; alternatively, the JSON files for these dashboards can be imported to explore all the related queries behind the various charts and graphs. As I haven't touched on the pie chart yet in this article, we'll take a look at that next:

For the above pie chart, the following log entries, and others like them, are leveraged:

<13>Aug  7 12:28:49 jump01.saluca.net root: ITO: Service Up:  crond.service up for  2 months 8 days

<13>Aug  7 12:28:49 jump01.saluca.net root: ITO: Service Down:  dbu.service

From these logs, we can identify a few items: hostname, service name, service status, and service uptime. For the pie chart, we will focus on a subset of these data points: hostname, service name, and service status. The uptime can be leveraged for other purposes should the need arise, including assessing the likelihood that a vulnerable service has been updated and restarted, depending on the patch management processes in a given environment, but for now, we'll skip that.

For this specific visualization, we'll be using the following query:


events    
    | project date=bin(original_time,1m), data_source_name, low_level_categories, payload
    | where original_time > ago(8m)    
    | where payload contains '{host}' and payload contains 'ITO'
    | extend hostName=extract("^\<\d+\>\w+\s+\d+\s+\d+\:\d+\:\d+\s+([a-zA-Z0-9\.\-\_]+)\s+",1,payload, typeof(string)), serviceName=extract("([a-z0-9A-Z\@\:\-]+).service",1,payload, typeof(string)), serviceStatus=extract("ITO\: Service (\w+)",1,payload, typeof(string))
    | where isnotempty(serviceStatus)
    | summarize Total=count() by serviceStatus
    | take 10000

Breaking down the KQL:

1- Events table, yup, again!

2- Projecting a set of columns. This is where I started the effort, and I likely left data_source_name and low_level_categories in the query even though they may not be necessary. If you import these dashboards, please feel free to update this KQL to make it more streamlined, and if you have questions about this, please reach out to your Technical Specialist for guidance.

3- Limiting to the most recent 8 minutes. This could be shorter, since we plan on using only the most recent data point for a given service.

4- Yet another filter. In this step, we leverage the parameter field for the hostname: {host}. This is done to enable drill-down functionality from the parent dashboard and to showcase a placeholder in the KQL query.

5- At this step, we extract the service name and status along with our hostname, which by now, we've confirmed is in the payload through the filter in step 4.

6- Validate that a service status exists.

7- Counting via Summarize! How many are up, how many down? 

NOTE: While the 8 minute filter can be adjusted to include less time or more, this can still introduce some nuance around the exact percentage of up vs. down due to adjustments in the time window... to resolve this, another function is possible!

To that end, I would take the Summarize line and change it!

    | summarize arg_max(date, serviceStatus) by serviceName

This approach leverages the timestamp, taking the most recent entry per service and using that for the summarizing, in lieu of the following line:

    | summarize Total=count() by serviceStatus
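Putting that together, the adjusted query might look something like this. This is my own rough assembly rather than one of the shipped dashboard queries; it assumes the pie chart still needs a final count per status, and it drops the extra columns noted in step 2:

events
    | project date=bin(original_time,1m), payload
    | where original_time > ago(8m)
    | where payload contains '{host}' and payload contains 'ITO'
    | extend serviceName=extract("([a-z0-9A-Z\@\:\-]+).service",1,payload, typeof(string)), serviceStatus=extract("ITO\: Service (\w+)",1,payload, typeof(string))
    | where isnotempty(serviceStatus)
    | summarize arg_max(date, serviceStatus) by serviceName
    | summarize Total=count() by serviceStatus

The arg_max step keeps only the most recent status per service, so widening or narrowing the time window no longer skews the up-versus-down percentages.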

Thus far, we've explored how to visualize the data sent to QRadar Log Insights using KQL and Dashboards, even going through KQL assessment and optimization for a few items in this article to showcase that there are several different ways to accomplish the same thing. It's just a matter of breaking down the steps necessary to get the data into the structure our charts need! These dashboards are available for download here.

And with that, we've covered the dashboarding and included some additional exercises for practice should you feel so inclined! Stay tuned for an entry discussing how to implement something similar in Grafana.
