
Analyzing Db2 monitoring data from Data Management Console with Splunk – case study

By Jason Sizto posted Sat August 07, 2021 12:34 AM

  


(20 minutes read)

Why DMC and Splunk?

Imagine that DMC is capturing a treasure trove of monitoring data, and there is a neat way to visualize that data exactly the way you like it.

How about spotting a rogue application using DMC-collected monitoring data and preemptively identifying a problem before it becomes a bigger one?

This case study will show you how to find the “Spike” in aborted activity on a rogue application. 

Finding whatever "spike" matters to you is easy with monitoring data from DMC and visualization from Splunk.  Interested?

Reasoning

DMC is a powerful tool that captures a wide variety of monitoring data.

Its interface allows users to visualize trends in monitoring metrics and provides drill-downs to further investigate problems.  It also provides availability and performance alerts.

With that said, DMC may not be able to cover all data scenarios.

This is where integration with 3rd-party tools that utilize the DMC monitoring data can be of specific help.  It allows users to look at the monitoring data from custom angles for their individual needs.

For this exercise, we chose Splunk as the 3rd-party tool for the integration case study.

 

Approach

At a high level, there are 3 steps needed in the Splunk integration: Data ingestion, Search, and Actions.  We will describe the steps below in more detail.

 

Data ingestion

There are different ways that monitoring data can be ingested into Splunk.  For example, data can be forwarded to Splunk by setting up the Splunk Universal Forwarder.  Data can be retrieved over a database connection by setting up the Splunk DB Connect App.  Data can also be retrieved by calling RESTful APIs with the Splunk REST API Modular Input App.

The Splunk Universal Forwarder can periodically pick up files from a target machine and forward them to the Splunk server for indexing.  This is a handy way to forward log or trace files, as the files are ready for forwarding.

The Splunk DB Connect App can collect data using a SQL interface.  Query result sets are likewise indexed on a periodic schedule.  For the case study described in this article, we use this approach, with connections made to the DMC repository.

The Splunk REST API Modular Input App can collect data by calling RESTful APIs.  The app would call the DMC RESTful API to pull results periodically for indexing.

 

Search

In Splunk, search is used to retrieve events from indexed data.  The main goal of search is to customize the data for analysis or visualization.  A search can be saved and reused as a Report.  Thresholds can be added to the search command to create Alerts.

Actions

Aside from visualizing the data, as mentioned above, Reports, Dashboards and Alerts can also be generated.  The purpose of this doc is to give an idea of how to utilize the data that is collected by DMC.  There can be creative ways to put the collected data to work for analysis, and we will leave that to users' imagination.  The following is an example of turning DMC data into a handy use case.

 

Use case

In this use case, we are using DMC version 3.1.5 on Linux platform and Splunk v8.1.2 free trial version.

We will use Splunk to pick out aborted SQL information from the DMC repository WORKLOAD table, and generate reports, a dashboard, and alerts when the system is experiencing a high SQL abort rate.

Let us assume this company has an SLA set up with their customer requiring query total activity time to be less than 10 seconds, and a Threshold event monitor is set up in Db2 to terminate any execution when ACTIVITYTOTALTIME > 10 seconds.

For each month, if the total number of terminated SQL statements exceeds a certain threshold, a penalty is incurred.  So, aborted SQL is closely monitored in this company.

Let us also assume rogue queries can happen, as newly deployed applications may not be well tuned.  Albeit not often, a low number of poorly performing SQL statements is expected.

The goal of this exercise is to utilize the DMC-collected workload data and use the ACT_ABORTED_TOTAL field to create a report for visualization and trending.  If ACT_ABORTED_TOTAL exceeds a certain threshold, alerts and notifications are generated.

Once the Alert is triggered, the DBA can go directly to the console to identify the offending application, the session user, and the offending SQL statements.

 

Data collection

As mentioned, we are going to use the Splunk DB Connect App to pull data from the DMC repository to the Splunk server for indexing.  The schema of the DMC repository has so far been a black box, but we are enhancing the documentation of the DMC repository.  You can find the definition and usage of the repository tables here: https://www.ibm.com/docs/en/db2-data-mgr-console/3.1.x?topic=metrics-summary

The first step is to install the Splunk DB Connect App.  We will not describe the steps on how to install this app.  Once you have the app installed, here are the steps to create a New Input to index the required data. 

All subsequent steps described in this article will assume the connection is made to the DMC repository database.

 

Create the Identity

An identity is used to store the connection user credentials.  If you do not have an identity created, go to Splunk DB Connect > Configuration > Identities and click on New Identity.  Follow the steps to give an Identity Name, the Username, and the Password, then click the Save button.  For additional options, you can refer to the Splunk documentation.  Once the identity is created, we need to create a Connection.

 

Create the Connection

To create a new connection, go to Splunk DB Connect > Configuration > Connections.  Then, click on New Connection and give a Connection Name; e.g., I named mine DMCREPO.  Then, pick the Identity that you have just created; for Connection Type, choose DB2; and for Timezone, choose the time zone you are in.  Under JDBC URL Settings, fill in the Host, Port and Default Database fields.  Click the Save button.  Once the connection is created, we can create a new input to pull data from the DMC repository for indexing.

 

Create the Input

To create a new input, go to Splunk DB Connect App > Data Lab > Inputs.  Then, click on New Input.  In the Connection drop down, pick the connection that you have just created; in my case, DMCREPO.

In the SQL Editor, you can enter the statement below:

SELECT TIMESTAMP('1970-01-01-00.00.00') + (TIMESTAMP / 1000) SECONDS - 7 HOURS AS PDT_TIMESTAMP, c.DBCONN_ID, c.DBCONN_INT , TIMESTAMP , COLLECTED , STARTBATCHCOLLECTED , LEVEL_TAG , ROW_ID , WORKLOAD_ID , WORKLOAD_NAME , MEMBER , TOTAL_CPU_USEC , ACT_COMPLETED_TOTAL , ROWS_READ , ROWS_MODIFIED , ROWS_RETURNED , LOGICAL_READS , DIRECT_READS , DIRECT_WRITES , TOTAL_SORTS , TOTAL_APP_COMMITS , TOTAL_COMMIT_TIME , TOTAL_APP_ROLLBACKS , TOTAL_SECTION_TIME , TOTAL_ROUTINE_TIME , POOL_READ_TIME , POOL_WRITE_TIME , DIRECT_READ_TIME , DIRECT_WRITE_TIME , TOTAL_SECTION_SORT_TIME , WLM_QUEUE_TIME_TOTAL , LOCK_WAIT_TIME , LOG_BUFFER_WAIT_TIME , LOG_DISK_WAIT_TIME , TOTAL_COMPILE_PROC_TIME , TOTAL_IMPLICIT_COMPILE_PROC_TIME , LOCK_WAITS , LOCK_TIMEOUTS , LOCK_ESCALS , DEADLOCKS , AGENT_WAIT_TIME , LOCK_WAIT_TIME_GLOBAL , TCPIP_WAIT_TIME , IPC_WAIT_TIME , AUDIT_SUBSYSTEM_WAIT_TIME , AUDIT_FILE_WRITE_WAIT_TIME , DIAGLOG_WRITE_WAIT_TIME , EVMON_WAIT_TIME , TOTAL_EXTENDED_LATCH_WAIT_TIME , PREFETCH_WAIT_TIME , COMM_EXIT_WAIT_TIME , IDA_WAIT_TIME , CF_WAIT_TIME , RECLAIM_WAIT_TIME , SPACEMAPPAGE_RECLAIM_WAIT_TIME , TOTAL_ROUTINE_USER_CODE_PROC_TIME , TOTAL_SECTION_PROC_TIME , TOTAL_SECTION_SORT_PROC_TIME , TOTAL_COL_PROC_TIME , TOTAL_COMMIT_PROC_TIME , TOTAL_ROLLBACK_PROC_TIME , TOTAL_RUNSTATS_PROC_TIME , TOTAL_REORG_PROC_TIME , TOTAL_LOAD_PROC_TIME , TOTAL_BACKUP_PROC_TIME , TOTAL_INDEX_BUILD_PROC_TIME , TOTAL_CONNECT_REQUEST_PROC_TIME , TOTAL_CONNECT_AUTHENTICATION_PROC_TIME , CLIENT_IDLE_WAIT_TIME , TOTAL_RQST_TIME , TOTAL_WAIT_TIME , SUM_SQL_EXECUTION_TIME , SUM_NONSQL_PROC_TIME , SUM_IO_TIME , LOCAL_LOCK_WAIT_TIME , OTHER_WAIT_TIME , TOTAL_ACT_TIME , ACT_ABORTED_TOTAL , TOTAL_COMPILATIONS , SORT_OVERFLOWS , PHYSICAL_READS , FED_WAIT_TIME , FED_ROWS_DELETED , FED_ROWS_INSERTED , FED_ROWS_UPDATED , FED_ROWS_READ , FED_WAITS_TOTAL

FROM IBMCONSOLE.WORKLOAD w, IBMCONSOLE.RTMON_MAP_DBCONN c

WHERE w.DBCONN_INT = c.DBCONN_INT

        AND c.DBCONN_ID = 'TPCDS_1G'

        AND w.TIMESTAMP > ?

ORDER BY  w.TIMESTAMP ASC;

 

For Settings (on the right), Input Mode is set to Event.

Input Type is set to Rising.  When it is set to Rising, it will prompt you for the Rising Column.  Set the Rising Column to TIMESTAMP.  Set the Checkpoint Value to 0.


Notes on Input Type:

  • When it is set to Batch, the full result set of the query will be indexed on every run. In this case, you must write a query that only obtains the data since the last collection timestamp for DMC.
  • When it is set to Rising, Splunk will use the column specified in Rising Column to decide what the latest data to index is, and the Checkpoint Value will be used as the parameter marker value for the w.TIMESTAMP > ? predicate in the collection SQL.  The TIMESTAMP column represents the collection time in DMC in Unix (Epoch) time format.  Setting it to 0 means January 1st, 1970, at midnight GMT, the initial Unix Epoch time; doing so will pull all the data from the table.  If you do not want to pull all historical data, you can set it to a timestamp that fits your needs.
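The rising-column mechanism described above can be sketched in a few lines of Python. This is a simplified illustration of the checkpointing behavior, not Splunk internals; the function name and sample data are mine:

```python
# Simplified sketch of how a "Rising" input avoids re-indexing rows.
# The checkpoint starts at 0 (the Unix epoch) and is advanced to the
# highest TIMESTAMP seen, so each poll only fetches newer rows.

def poll_rising(rows, checkpoint):
    """rows: list of (timestamp_ms, payload) tuples, as if returned by
    'SELECT ... WHERE w.TIMESTAMP > ? ORDER BY w.TIMESTAMP ASC'."""
    new_rows = sorted((r for r in rows if r[0] > checkpoint), key=lambda r: r[0])
    if new_rows:
        checkpoint = new_rows[-1][0]  # save the latest timestamp as the new checkpoint
    return new_rows, checkpoint

table = [(1000, "snap1"), (2000, "snap2"), (3000, "snap3")]

# First collection: checkpoint 0 pulls all historical data.
batch, ckpt = poll_rising(table, 0)
print(len(batch), ckpt)  # 3 3000

# Next collection: only rows newer than the checkpoint are indexed.
table.append((4000, "snap4"))
batch, ckpt = poll_rising(table, ckpt)
print([p for _, p in batch])  # ['snap4']
```

This is why the collection SQL carries the `w.TIMESTAMP > ?` predicate: the saved checkpoint is substituted for the parameter marker on every run.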


Timestamp is set to Choose Column, and Column is set to PDT_TIMESTAMP.  Query Timeout is set to 30, the default value.


Notes on Timestamp:

  • When it is set to Current Index Time, it means the timestamp used to index the result set would be the time when the collection query is run.
  • When it is set to Choose Column, it means the timestamp used to index the result set would be the column selected. In this case, the PDT_TIMESTAMP.  Since the collection is initially done by DMC, we want to keep using the DMC collection timestamp as the Splunk index timestamp.

Click Next to setup properties for this input.

Under Basic Information, set Name to a desired name for this collection.  I call mine WORKLOAD_TPCDS_1G.  This input is used to collect workload metrics for the TPCDS_1G database.  Description is optional.  And Application is set to Splunk DB Connect.

Under Parameter Settings, leave Max Rows to Retrieve at the default 0.  Also leave Fetch Size at the default 300.  For Execution Frequency, set it to */5 * * * *.  Execution Frequency is specified in cron syntax; this will run the collection every 5 minutes.  I am setting it to 5 minutes to match the default DMC collection interval of 5 minutes.  It is OK to set a lower frequency, as the index timestamp uses PDT_TIMESTAMP, which is the DMC collection timestamp.

Under Metadata, leave Host as optional, and leave Source as optional.  For Source Type, if there is no existing Source Type setup, you can enter a name to create a new Source Type.  I call mine, dbConnect.  For Index set to main.

Click Finish to save the New Input.

Tips in building the collection SQL

The collection timestamp in DMC for the WORKLOAD table is the column TIMESTAMP, stored in Unix Epoch time format (milliseconds).  For readability, I am converting it to California time with: TIMESTAMP('1970-01-01-00.00.00') + (TIMESTAMP / 1000) SECONDS - 7 HOURS AS PDT_TIMESTAMP.  I am also selecting all the columns from the WORKLOAD table; you can customize the list of columns based on your needs.  I put all the column names in the select list and avoid using "*", as there are issues retrieving column info for the SQL in Splunk when "*" is used.
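The SQL expression above is a standard epoch-milliseconds conversion. As a quick sanity check, the same arithmetic in Python (the fixed 7-hour Pacific Daylight Time offset is hard-coded, as in the article, so it does not handle DST transitions):

```python
from datetime import datetime, timedelta

def epoch_ms_to_pdt(ts_ms):
    """Convert a Unix-epoch timestamp in milliseconds to a naive PDT
    timestamp, mirroring the SQL:
    TIMESTAMP('1970-01-01-00.00.00') + (TIMESTAMP / 1000) SECONDS - 7 HOURS."""
    utc = datetime(1970, 1, 1) + timedelta(seconds=ts_ms // 1000)
    return utc - timedelta(hours=7)  # fixed PDT offset, no DST handling

print(epoch_ms_to_pdt(0))              # 1969-12-31 17:00:00
print(epoch_ms_to_pdt(1628319600000))  # 2021-08-07 00:00:00
```

If you are in a different time zone, adjust the `- 7 HOURS` offset in the collection SQL accordingly.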

Notice that the query also joins with the IBMCONSOLE.RTMON_MAP_DBCONN table.  This table maps the connection name (database) to the connection integer ID.  I am joining the 2 tables because we are only interested in the 'TPCDS_1G' database.  If you do not join with this table to filter for a particular database, then all the databases collected by DMC will be indexed in Splunk.

As mentioned earlier, the TIMESTAMP column is chosen as the Rising Column.  Splunk will automatically save the latest timestamp for the collection, and when the next collection time comes, it will only collect the data with a timestamp greater than the saved timestamp.  This is important, as we do not want to index the same data repeatedly.

Lastly, an ORDER BY w.TIMESTAMP ASC clause is required when using a Rising column.

 

Data indexed

Once the New Input is saved, the first data collection will pull in all the historical data from the repository.  Any subsequent collection will only pull the latest snapshot(s) since the last collection.  Next, we will see how to search for the data.

 

Search

We are not going to go into the details of Splunk's Search Processing Language (SPL) syntax, but only give some basic tips on how we can utilize the search capability to achieve the goals of the use case.

 

Basic introduction

After you have created the Splunk DB Connect input, you can click on the Find Events link, and it will lead you to the Search page for this particular index.

Note: In Splunk, searches, alerts and reports are grouped by the App.  Currently, we are using the Splunk DB Connect App, any searches, alerts or reports that are saved will be saved under the Splunk DB Connect App.

 

Another way is to go directly to the Search & Reporting App.

Since we want to save the Report and Alert later under the Splunk DB Connect App, we will go back to the Splunk DB Connect App and do the Search there.


Then, click Search.

Once you are there, you can directly search the indexes that are created in Splunk.  If you do not know what data is indexed in Splunk, you can click on the Data Summary button, and the data available in Splunk will be grouped into 3 different types: Hosts, Sources, and Sourcetypes.

E.g. if I clicked on Sources, I could select the source that I am interested in and start searching it.  In my case, I would click on the WORKLOAD_TPCDS_1G source and start searching.

Search tips on DMC data

A lot of the metrics collected by DMC come from the Db2 MON_GET_xxx table functions, and the metrics returned from these functions accumulate over time until they are reset, either manually or when Db2 is restarted.

Therefore, to get a clearer picture of the metrics, it is better to look at the delta values between 2 collection intervals.  Comparing delta values is visually easier to understand.

In this use case, our main goal is to find the number of Aborted SQLs for each collection interval.  To achieve this, the following search query can be used to find delta value per collection interval:

e.g.

source=WORKLOAD_TPCDS_1G

| bin span=1m _time

| stats sum(ACT_ABORTED_TOTAL) as ACT_ABORTED_TOTAL by _time

| streamstats current=f last(ACT_ABORTED_TOTAL) as last_ACT_ABORTED_TOTAL

| rename ACT_ABORTED_TOTAL as current_ACT_ABORTED_TOTAL

| eval delta_ACT_ABORTED_TOTAL = abs(last_ACT_ABORTED_TOTAL - current_ACT_ABORTED_TOTAL)

| table _time delta_ACT_ABORTED_TOTAL

| sort -_time

 

This search query will bucketize the data into 1-minute intervals and compute a delta between the current and previous collection timestamps (_time).  Then, it will present the data in a sorted table format.  In the example above, we see there are a few aborted activities over the past few collection intervals.
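The core of the SPL pipeline is the delta computation between consecutive snapshots of a cumulative counter. It can be mirrored in a few lines of Python (a sketch; the function name and sample values are made up):

```python
# Sketch of the SPL delta logic: ACT_ABORTED_TOTAL is a cumulative
# counter, so aborted activities per interval are the differences
# between consecutive collection snapshots.

def deltas_per_interval(samples):
    """samples: list of (time, act_aborted_total) sorted by time.
    Returns (time, delta) pairs, like the streamstats/eval pipeline."""
    out = []
    for (_, prev), (t, cur) in zip(samples, samples[1:]):
        out.append((t, abs(cur - prev)))  # abs() mirrors the SPL's abs()
    return out

samples = [("10:00", 5), ("10:05", 5), ("10:10", 9), ("10:15", 21)]
print(deltas_per_interval(samples))
# [('10:05', 0), ('10:10', 4), ('10:15', 12)]
```

Note that a manual monitor reset or a Db2 restart makes the counter drop; the abs() keeps the delta positive in that case, at the cost of one potentially inflated interval.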

By clicking on the Visualization tab, you can also view the data in a chart.  There are different chart formats you can choose from, e.g. the following graph is the Aborted Activity per interval plotted with Line Chart format.  You can see there are 3 spikes in aborted activities over the past 24 hours.

Report and Dashboard

Once you are happy with the search query, you can easily save the search as a Report.  You can then quickly reference this search and look for trends next time you log in to Splunk.

 

Setup Report

To do so, click on the Save as button, and choose Report.

You can give a Title for the Report; e.g., I named mine TPCDS_1G_ABORTED_SQL.  Content allows you to choose the chart only, or the statistics as well.  And you can decide whether to allow the Time Range Picker when the Report is generated.

Click Save.


Setup Dashboard

Optionally, you can also add this Report to a Dashboard.  If you desire to create a Dashboard, you can click the Add to Dashboard button.


Since this is a new dashboard, I selected New for Dashboard.  For Dashboard Title, I used TPCDS_1G.  Other notable options I picked are Panel Powered by Report and Panel Content set to Line Chart.

Then, click Save and View Dashboard.

The dashboard can be accessed via a URL link that you can save.  E.g. in my setup: http://buddy1.fyre.ibm.com:8000/en-US/app/splunk_app_db_connect/tpcds_1g

 

Access the Report

To access the report, make sure you are under the Splunk DB Connect App, then go to Settings > Searches, reports and alerts.


You can run the Report by clicking on the Run link.

Access the Dashboard

As mentioned earlier, you can access the Dashboard by going to the dashboard link: http://buddy1.fyre.ibm.com:8000/en-US/app/splunk_app_db_connect/tpcds_1g

Optionally, you can add this dashboard to the Splunk Home Dashboard.  To do so, go to the Splunk Home page.  And click on Choose a home dashboard.

e.g.

Then, search for the dashboard name, e.g. in my case, TPCDS_1G and then click Save.

Then the Home page will include the TPCDS_1G dashboard.


We will see how to generate alerts next.

 

Alerts

As part of the use case, we would like to trigger an alert when the aborted SQL rate is over a certain threshold, with an email notification sent when the alert is triggered.

 

Setup Alert

Let us go back to the Splunk DB Connect App and perform the search based on the saved Report.


Then, go to Settings > Searches, reports and alerts.


Click Run on the existing TPCDS_1G_ABORTED_SQL report.  This will launch the search query again.


Then, add the following line to the end of the search query:


The full search query is:

source=WORKLOAD_TPCDS_1G

| bin span=1m _time

| stats sum(ACT_ABORTED_TOTAL) as ACT_ABORTED_TOTAL by _time

| streamstats current=f last(ACT_ABORTED_TOTAL) as last_ACT_ABORTED_TOTAL

| rename ACT_ABORTED_TOTAL as current_ACT_ABORTED_TOTAL

| eval delta_ACT_ABORTED_TOTAL = abs(last_ACT_ABORTED_TOTAL - current_ACT_ABORTED_TOTAL)

| table _time delta_ACT_ABORTED_TOTAL

| sort -_time

| stats max(delta_ACT_ABORTED_TOTAL) AS MaxValue


This aggregates using the max(delta_ACT_ABORTED_TOTAL) value, and we will use this MaxValue to set the threshold.

To create an alert out of this search, click Save As, then Alert.


In Settings, I gave the name TPCDS_1G_ABORTED_ALERT as the Title.

Alert type is set to Scheduled, and choose Run on Cron Schedule.  Cron Expression is set to */15 * * * *.  We want to perform the search every 15 minutes.

For Time Range, select Advanced; Earliest is set to -15m, and Latest is set to now.  Then, click Apply.  The Alert will look at only the last 15 minutes of data.

In Trigger Conditions, for Trigger alert when, choose Custom, and the condition is set to where MaxValue > 10.  This will trigger the alert when there are more than 10 aborted SQLs in a collection interval.  Trigger is set to Once.


We continue by adding trigger actions.  In Trigger Actions, click + Add Actions, then select Add to Triggered Alerts.

Click + Add Actions again, and select Send email.

Enter your email address.  Then, click Save.

If you have not set up the email server, you can do so by going to Settings > Server Settings > Email settings.

According to our use case, we have just set up an alert that runs at 15-minute intervals.  If the total number of aborted activities in any collection interval within the last 15 minutes is greater than 10, an alert will be triggered with an email notification.
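Putting the pieces together, the alert logic amounts to: take the per-interval deltas from the last 15 minutes, aggregate with max, and fire when the maximum exceeds the threshold. A small sketch of that decision (the function name is mine, not Splunk's):

```python
ALERT_THRESHOLD = 10  # matches the "where MaxValue > 10" trigger condition

def should_alert(interval_deltas, threshold=ALERT_THRESHOLD):
    """interval_deltas: aborted-activity deltas for the collection
    intervals in the last 15 minutes, as produced by the delta search.
    Mirrors '| stats max(...) AS MaxValue' plus the custom trigger."""
    if not interval_deltas:
        return False  # no data in the window, nothing to trigger on
    return max(interval_deltas) > threshold

print(should_alert([0, 2, 14]))  # True  - one interval had 14 aborted activities
print(should_alert([3, 0, 7]))   # False - no interval exceeded 10
```

Because the search window (-15m to now) matches the cron schedule (every 15 minutes), each interval is evaluated exactly once.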

 

Triggered Alert

When an alert is triggered, you can view it by going to Settings > Searches, reports, and alerts.  The Alerts count is incremented when the threshold is violated.  If you have email notification set up, you will receive an email regarding the triggered alert.


You can see the date and time when the threshold was violated by clicking on Run for the Report.

In this example, we see the alert was triggered around 10PM, and there were over 10 aborted activities during that time.

Let us go back to the DMC console to identify the violating application, session authorization ID, and violating statements.

 

DMC drill down

Summary page

After logging into DMC, we drill down to the database connection and look at the Summary page.  This page gives us an overview of the database system.  From this page, we see a small spike in CPU usage and statement throughput, both happening around 10PM.  This tells us some additional statements were running during this time.


Next, we want to identify the offending application.

Connection page

To do so, we click on Applications, and then click Connections.


Based on prior knowledge, we know that the SQL abort rate increased around 10PM.  So, we want to zoom in on the hour around 10PM and try to identify the offending application.  To do so, click on Latest, then Custom.  Then set the Start and End times to 9:30PM and 10:30PM.  Click OK.


Click on the column customization icon and move up the Aborted activities / min field.


Sort on the Aborted activities / min field in descending order. 

We can see that the connection with high Aborted activities is coming from db2bp and workload is SYSDEFAULTUSERWORKLOAD.

After that, click on the ellipsis icon (3 dots) for the row and click View details.


In the details page, there are more connection metrics.  We see that the CPU time is high from around 9:45PM to 10:00PM.  We also see that Threshold violations is running at around 2 violations per minute!


Moving to the bottom of the page, we see that the Total aborted activities / min hovers between 2 and 3 aborted activities per minute.  This is the likely culprit application!


Next, we want to identify the offending session user and the statements.

 

Individual Execution page

To do so, we click on Statement > Individual execution and set the custom time range to 9:30PM to 10:30PM.


Next, click on the column customization icon and move up the Workload name, Session authorization ID, and SQL code columns.


Then, click on the SQL code column and sort in ascending order.


We see many statements returning SQL code -4712.  This indicates that these statements were terminated by the Threshold event monitor because they exceeded the ACTIVITYTOTALTIME > 10 seconds threshold!


The DBA saw that the session authorization ID LUKE is running a new application in SYSDEFAULTUSERWORKLOAD, which does not have the right WLM priority set to execute the work efficiently.  The DBA rectified the problem by creating a resource group for LUKE with proper priorities, and things went back to normal again!

 

Note: to capture statements in the Individual Execution page, you have to set up the console activity event monitor for the collection.  Please find more details in this Knowledge Center page: https://www.ibm.com/docs/en/db2-data-mgr-console/3.1.x?topic=monitoring-events

 

Conclusion

Although DMC provides a variety of tools, drill-downs, reports, alerts and jobs, there are scenarios that may require customization.  In this example, we can see the value of using 3rd-party tools to visualize the data DMC collects in its repository.  The DMC repository contains a treasure trove of historical data, and there are good opportunities to mine useful insights from it!

Please leave me comments or suggestions.  You can reach out at jsizto@us.ibm.com / @IbmJsizto / LinkedIn
Special thanks to Irtiza Jafry, ijafry@usc.edu, Graduate Student at USC who contributed to data collection and Splunk installation during his internship project at IBM.