IBM Data Warehousing and Advanced Analytics Specialist
IBM Data Warehousing Center of Excellence
IBM Data Warehousing Center of Excellence
IBM Data Warehousing Center of Excellence
IBM Information Development
Monitoring a data warehouse system is important to help ensure that it performs optimally. This paper describesthe most important DB2 software and operating system metrics for monitoring the performance of the IBM Smart Analytics System or IBM PureData™ System for Operational Analytics or a system having a similar architecture. This paper also presents a general methodology to help find reasons for performance problems. This approach starts with the operating system metrics and drills down to the DB2 metrics that explain the behaviour that is seen at the operating system level and help to identify the causes of performance problems. This approach is illustrated by information about typical performance problems.
Download Best Practices for Performance Monitoring in a Data Warehouse Full Report!
This best practices paper covers real-time monitoring of the IBM Smart Analytics System and IBM PureData System for Operational Analytics. You can apply most of the content to other types of clusters of servers runninga data warehouse system with DB2 software and database partitioning under AIX and Linux operating systems.The focus of this paper is finding the reasons for performance problems. These can be bottlenecks that are in the operating system, are in the DB2 database software, or are related to a single query. The focus is on datawarehouse systems with long-running queries rather than transactional systems with mostly short queries.
A main goal of this paper is to provide a set of key performance indicators (KPIs) or metrics for the operatingsystem and DB2 software, along with a methodology for analyzing performance problems in a distributed DB2 environment. This paper describes scenarios to help you gather the right information depending on the symptoms of the performance problem.
This paper first provides an overview of the approach and what to consider in general when monitoring the performance of a data warehouse. It then describes the most important operating system and DB2 metrics formultiserver data warehouse systems. The last section describes in detail several performance problem scenarios that are related to data warehouse or BI workloads and explains how to use the metrics for analyzing the problems.
Most of the information about KPIs that are described in the paper has sample commands that extract actual values. However, these examples are not intended to provide comprehensive tooling. You can use this best practices paper as a guideline for selecting the metrics to focus on when using monitoring tools such as IBMInfoSphere® Optim™ Performance Manager. You can use this paper to complement the best practices paper Managing data warehouse performance with IBM InfoSphere Optim Performance Manager, published in September 2012, which covers historical and end-to-end monitoring.
Performance monitoring methodology
Monitoring a database system requires an understanding of the various performance measures and how to interpret them relative to overall system usage. The following measures are explained in the next sections:
- Operating system measures:
- CPU or thread utilization
- Main memory usage and swapping
- Network utilization
- Disk usage
- DB2 measures:
- DB2 memory usage
- Locking characteristics
- Number of sort and hash join overflows
- Buffer pool performance
- CPU or data skew
To determine whether a situation is just a normal peak or something more critical, try to get a good overallpicture of the system when it is idle and when the load is average, high, or maximum. When you encounter a performance problem first determine what changed since the last time that the system performed satisfactorily. In particular, consider whether the following factors apply:
- Workload or queries are
- More data is being
- More connections
- The software version was
- The DB2 system was upgraded, which might have included the installation of fix
- Data nodes were
- Data was
- The database design was changed, for example, an MQT or index was
- The RUNSTATS command was issued with different parameters, for example, executed with
If you have no initial hypothesis for a performance problem, first determine whether there are any bottlenecks on the operating system side. Areas of bottlenecks include the CPU, main memory usage and swapping, thenetwork, and I/O (disk usage). Based on the information that you gather, you can then drill down to the corresponding DB2 measures. This approach is described in the section “Some typical performance scenarios.
Main performance measures and how to monitor them
To monitor data warehouse performance, you must look at both operating system and database performancemetrics. Another good source for learning about monitoring is the best practices paper Tuning and MonitoringDatabase System Performance. It explains DB2 monitoring and tuning in OLTP environments.
Operating system performance measures
The interactive tool that is shown in this paper is the nmon utility. If you want to monitor a cluster, you mustopen a window (for example, with PuTTY software) on each server and start the nmon utility there.
The command-line tools that are shown are tailored for specific measurements. Some tools behave slightly differently on Linux operating systems than they do on AIX operating systems.
Monitor all servers of a cluster to detect anomalies or deviations that might impact the overallperformance. To run a command-line tool through a single invocation on all servers of a cluster, use the rah command.
When considering CPU load, differentiate between these types of CPU measurements
- User CPU: The number of CPU cycles that are required for user processes
- System CPU: The number of CPU cycles that are used by the operating system kernel, for example, for paging or process switching
- I/O wait: The percentage of CPU cycles that is spent waiting for data
- Idle time: The number of cycles where there is nothing for the CPU to do
The sum of user and system CPU usage can be close to 100% during peak times. A 4:1 ratio of user CPU usage to system CPU usage is considered normal. If the system CPU usage is increasing for time frames that are longer than the average execution time of your SQL, query check the system and identify the cause of this imbalance.
For the I/O wait measurement, values of up to 25% are normal. Peak values of greater than 25% for longer time frames suggest that disk activity is too high and need investigation.
For multi-core and multithreaded CPUs, in addition to checking the overall CPU usage, check the usage at the core or thread level to determine core-related or thread-related bottlenecks. For example, an overall CPU usage of 8% looks low at first glance. However, there might be aCPU-related bottleneck if one thread is running at almost 100% CPU usage and the other 15 threads are idle
To determine the load on a cluster, look at the length of the run queue and the number of process switches.These provide a good hint of how busy, in terms of parallel running jobs, the system is and can be.
Examine the length of the run queue and the number of process switches to determine whether they are the reason for high system CPU usage.
Monitoring CPU usage
You can use the nmon utility to monitor CPU usage. You must run it on each server in a cluster. If you run it in the c-interactive mode, it graphically shows the CPU usage.
For a system with a large number of CPUs, you can also use the nmon utility in l-interactive mode, which displays the average overall CPU usage over time. It is also useful to view the usage for the most active processes by using the t-option. These processes should be for an active db2sysc command on a DB2 system.
Figure 1 shows the overall CPU usage and the usage for the important processes for an IBM Smart Analytics System 7700 data module with 16 cores. The CPU usage of the processes is indicated per core. The value of 160% for the most active process means that this process can use 80% of two CPUs.
Disk I/O is the slowest way to deal with data. OLAP systems always show a high amount of I/O. For these reasons, it is important to optimize data transfers and to identify and minimize I/O waits. Key measures to monitor for I/O are the amount of data read/written per second, the number of I/O operations per second, and the degree of utilization.
You should know the performance characteristics of your storage system. To determine the maximum possible I/O performance, use the dd command on AIX operating systems and the sg_dd command on Linux operating systems to generate a sample I/O work load containing queries that reads from and writes to raw physical devices.
You can use the iostat command to measure each I/O device, as shown:
iostat interval count
This call initiates iostat <count> times every <interval> seconds. If you specify the -k option on a Linux operating system, the output includes the translations per second and KB that are read per second for each disk or device for the specified interval. For Linux operating systems, the command behaves similarly to thevmstat command because the first execution returns these values for the period since the last system reboot. For AIX operating systems, issuing the iostat command with the -f or -F option displays the I/O activity per file system. With the nmon utility, you can monitor the I/O activity by using the d-interactive mode.
The most important network in an IBM Smart Analytics System is the internal application network – also know as the fast communication manager (FCM) – that is used by DB2 to exchange results between the administration and data nodes or to exchange data between data nodes.
Bottlenecks can appear in the communication between the administration node and the data nodes or when the volume of messages overall is driving the network to its capacity limit. This capacity limit is determined by the bandwidth of the network (usually 1 Gb/second or 10 Gb/second) and the number of switches. For a 10 Gb network with two switches, the capacity is 2 x 10 Gbit/second, which is 2.5 GB/second. Although this is thetheoretical limit, the realistically achievable maximum throughput for the IBM Smart Analytics System 7700 product is 80% - 90% of this value.
The most important measure is the amount of data that is sent and received by the server, which is usuallymeasured in KB/second. Because all servers in a cluster share the internal application network, you must sum the measures for the servers to determine whether the capacity limit has been reached
If the network performance is lower than you expect, look at the number of sent and received packages and the number of transmission errors. These errors should be less than 1% of the number of received and sent packages.
Monitoring network traffic
To monitor the network traffic for one server in a cluster, use the nmon utility. The following sample output was generated by specifying the interactive n option:
In Figure 5, en11 is the network interface of the internal application network of the IBM Smart Analytics System 7700 product. The output shows that 72 KB received and 76 MB sent represent the highest activity level.
Download Best Practices for Performance Monitoring in a Data Warehouse!