
Performance monitoring tools and methodology

Operational monitoring of system performance

Operational monitoring refers to collecting key system performance metrics at periodic intervals over time. This information gives you critical data for refining your initial configuration to better match your requirements, and it also prepares you to address new problems that might appear on their own or following software upgrades, increases in data or user volumes, or new application deployments.

Operational monitoring considerations

A DB2 database (“DB2”) system provides some excellent sources of monitoring data. The primary ones are snapshot monitors and, in DB2 Version 9.5 and later, workload management (WLM) table functions for data aggregation. Both of these focus on summary data, where tools like counters, timers, and histograms maintain running totals of activity in the system. By sampling these monitor elements over time, you can derive the average activity that has taken place between the start and end times, which can be very informative.

There is no reason to limit yourself to just metrics that the DB2 product provides. In fact, data outside of the DB2 software is more than just a nice-to-have. Contextual information is key for performance problem determination. The users, the application, the operating system, the storage subsystem, and the network - all of these can provide valuable information about system performance. Including metrics from outside of the DB2 database software is an important part of producing a complete overall picture of system performance.

The trend in recent releases of the DB2 database product has been to make more and more monitoring data available through SQL interfaces. This makes management of monitoring data with DB2 very straightforward, because you can easily redirect the data from the administration views, for example, right back into DB2 tables. For deeper dives, activity event monitor data can also be written to DB2 tables, providing similar benefits. With the vast majority of our monitoring data so easy to store in DB2, a small investment to store system metrics (such as CPU utilization from vmstat) in DB2 is manageable as well.
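
As a minimal sketch of this pattern, assuming a DB2 level where the MON_GET_WORKLOAD table function is available (9.7 and later), and using an illustrative history table name (DBA.MON_HISTORY) that is not part of the product:

-- Illustrative history table; the name and column choices are examples only
CREATE TABLE DBA.MON_HISTORY (
    CAPTURE_TIME       TIMESTAMP NOT NULL,
    WORKLOAD_NAME      VARCHAR(128),
    TOTAL_APP_COMMITS  BIGINT,
    ROWS_READ          BIGINT,
    ROWS_RETURNED      BIGINT
);

-- Run at each monitoring interval (for example, every 5 to 15 minutes)
INSERT INTO DBA.MON_HISTORY
    SELECT CURRENT TIMESTAMP, WORKLOAD_NAME, TOTAL_APP_COMMITS, ROWS_READ, ROWS_RETURNED
    FROM TABLE(MON_GET_WORKLOAD(NULL, -2)) AS T;

Because the monitor elements are running totals, the difference between two successive captures gives the activity for each interval.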

Types of data to collect for operational monitoring

Several types of data are useful to collect for ongoing operational monitoring.
  • A basic set of DB2 system performance monitoring metrics.
  • DB2 configuration information
    Taking regular copies of database and database manager configuration, DB2 registry variables, and the schema definition helps provide a history of any changes that have been made, and can help to explain changes that arise in monitoring data.
  • Overall system load
    If CPU or I/O utilization is allowed to approach saturation, this can create a system bottleneck that might be difficult to detect using DB2 snapshots alone. As a result, the best practice is to regularly monitor system load with vmstat and iostat (and possibly netstat for network issues) on Linux and UNIX, and perfmon on Windows. You can also use administrative views and table functions, such as ENV_GET_SYSTEM_RESOURCES, to retrieve operating system, CPU, memory, and other information related to the system (see the sketch after this list). Typically you look for changes in what is normal for your system, rather than for specific one-size-fits-all values.
  • Throughput and response time measured at the business logic level
    An application view of performance, measured at the business logic level above the DB2 layer, has the advantage of being most relevant to the end user, plus it typically includes everything that could create a bottleneck, such as presentation logic, application servers, web servers, multiple network layers, and so on. This data can be vital to the process of setting or verifying a service level agreement (SLA).
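
As referenced in the system load item above, a minimal sketch of pulling system resource information through SQL, assuming a DB2 level where ENV_GET_SYSTEM_RESOURCES is available as a table function (on older levels the SYSIBMADM.ENV_SYS_RESOURCES administrative view serves a similar purpose):

-- Operating system, CPU, and memory information for the database server
SELECT * FROM TABLE(SYSPROC.ENV_GET_SYSTEM_RESOURCES()) AS T;

The output can be captured into a history table in the same way as the DB2 monitor elements, alongside any vmstat or iostat data you choose to load.
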
The DB2 system performance monitoring elements and system load data are compact enough that even if they are collected every five to fifteen minutes, the total data volume over time is insignificant in most systems.

Basic set of system performance monitor elements

Eleven system performance metrics provide a good basic set to use in an ongoing operational monitoring effort.

There are hundreds of metrics to choose from, but collecting all of them can be counter-productive due to the sheer volume of data produced. You want metrics that are:

  • Easy to collect - You do not want to use complex or expensive tools for everyday monitoring, and you do not want the act of monitoring to significantly burden the system.
  • Easy to understand - You do not want to look up the meaning of the metric each time you see it.
  • Relevant to your system - Not all metrics provide meaningful information in all environments.
  • Sensitive, but not too sensitive - A change in the metric should indicate a real change in the system; the metric should not fluctuate on its own.
1. The number of transactions executed:
TOTAL_APP_COMMITS
This provides an excellent base level measurement of system activity.
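
A minimal sketch of sampling this element through SQL, assuming the MON_GET_WORKLOAD table function:

-- Running total of commits across all workloads and members; the difference
-- between two samples, divided by the interval length, gives the transaction rate
SELECT SUM(TOTAL_APP_COMMITS) AS TOTAL_APP_COMMITS
FROM TABLE(MON_GET_WORKLOAD(NULL, -2)) AS T;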

2. Buffer pool hit ratios, measured separately for data, index, XML storage object, and temporary data:

Data pages: ((pool_data_lbp_pages_found - pool_async_data_lbp_pages_found) / (pool_data_l_reads + pool_temp_data_l_reads)) × 100

Index pages: ((pool_index_lbp_pages_found - pool_async_index_lbp_pages_found) / (pool_index_l_reads + pool_temp_index_l_reads)) × 100

XML storage object (XDA) pages: ((pool_xda_lbp_pages_found - pool_async_xda_lbp_pages_found) / (pool_xda_l_reads + pool_temp_xda_l_reads)) × 100

Buffer pool hit ratios are one of the most fundamental metrics, and give an important overall measure of how effectively the system is exploiting memory to avoid disk I/O. Hit ratios of 80-85% or better for data and 90-95% or better for indexes are generally considered good for an OLTP environment, and of course these ratios can be calculated for individual buffer pools using data from the buffer pool snapshot.

Note: The formulas shown for hit ratios for data and index pages exclude any read activity by prefetchers.
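
A minimal sketch of the data-page ratio per buffer pool, assuming a DB2 level (10.1 or later) where MON_GET_BUFFERPOOL reports the LBP pages-found and asynchronous (prefetcher) elements used above; the index and XDA ratios follow the same pattern:

-- NULL is returned for buffer pools with no logical reads yet
SELECT BP_NAME,
       CASE WHEN POOL_DATA_L_READS + POOL_TEMP_DATA_L_READS > 0
            THEN DECIMAL(POOL_DATA_LBP_PAGES_FOUND
                         - POOL_ASYNC_DATA_LBP_PAGES_FOUND, 20, 2) * 100
                 / (POOL_DATA_L_READS + POOL_TEMP_DATA_L_READS)
       END AS DATA_HIT_RATIO_PCT
FROM TABLE(MON_GET_BUFFERPOOL(NULL, -2)) AS T;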

3. Buffer pool physical reads and writes per transaction:

(POOL_DATA_P_READS + POOL_INDEX_P_READS + POOL_XDA_P_READS +
POOL_TEMP_DATA_P_READS + POOL_TEMP_INDEX_P_READS)
/ TOTAL_APP_COMMITS
(POOL_DATA_WRITES + POOL_INDEX_WRITES + POOL_XDA_WRITES)
/ TOTAL_APP_COMMITS
These metrics are closely related to buffer pool hit ratios, but serve a slightly different purpose: they also account for writes, and expressing physical I/O per transaction makes the absolute volume of disk activity easier to judge than a ratio alone.
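
A minimal sketch of both calculations, assuming MON_GET_WORKLOAD; NULLIF guards against division by zero on an idle system:

SELECT DECIMAL(SUM(POOL_DATA_P_READS + POOL_INDEX_P_READS + POOL_XDA_P_READS
                   + POOL_TEMP_DATA_P_READS + POOL_TEMP_INDEX_P_READS), 20, 2)
         / NULLIF(SUM(TOTAL_APP_COMMITS), 0) AS PHYS_READS_PER_TX,
       DECIMAL(SUM(POOL_DATA_WRITES + POOL_INDEX_WRITES + POOL_XDA_WRITES), 20, 2)
         / NULLIF(SUM(TOTAL_APP_COMMITS), 0) AS PHYS_WRITES_PER_TX
FROM TABLE(MON_GET_WORKLOAD(NULL, -2)) AS T;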

4. The ratio of database rows read to rows selected:
ROWS_READ / ROWS_RETURNED
This calculation gives an indication of the average number of rows that are read from database tables to find the rows that qualify. Low numbers are an indication of efficiency in locating data, and generally show that indexes are being used effectively. 
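
A minimal sketch, assuming MON_GET_WORKLOAD:

-- Average rows read per row returned to the application
SELECT DECIMAL(SUM(ROWS_READ), 20, 2)
         / NULLIF(SUM(ROWS_RETURNED), 0) AS ROWS_READ_PER_ROW_RETURNED
FROM TABLE(MON_GET_WORKLOAD(NULL, -2)) AS T;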

5. The amount of time spent sorting per transaction:
TOTAL_SORT_TIME / TOTAL_APP_COMMITS
This is an efficient way to handle sort statistics, because any extra time due to spilled sorts automatically gets included here. That said, you might also want to collect TOTAL_SORTS and SORT_OVERFLOWS for ease of analysis, especially if your system has a history of sorting issues.
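
A minimal sketch, assuming MON_GET_WORKLOAD; the sort time element is written here as TOTAL_SECTION_SORT_TIME, which is assumed to be its name in the MON_GET interfaces at your DB2 level:

-- Sort time per commit, in milliseconds
SELECT DECIMAL(SUM(TOTAL_SECTION_SORT_TIME), 20, 2)
         / NULLIF(SUM(TOTAL_APP_COMMITS), 0) AS SORT_TIME_PER_TX
FROM TABLE(MON_GET_WORKLOAD(NULL, -2)) AS T;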

6. The amount of lock wait time accumulated per thousand transactions:
1000 * LOCK_WAIT_TIME / TOTAL_APP_COMMITS
Excessive lock wait time often translates into poor response time, so it is important to monitor. The value is normalized to one thousand transactions because lock wait time on a single transaction is typically quite low. 
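
A minimal sketch, assuming MON_GET_WORKLOAD:

-- LOCK_WAIT_TIME is reported in milliseconds
SELECT DECIMAL(1000 * SUM(LOCK_WAIT_TIME), 20, 2)
         / NULLIF(SUM(TOTAL_APP_COMMITS), 0) AS LOCK_WAIT_PER_1000_TX
FROM TABLE(MON_GET_WORKLOAD(NULL, -2)) AS T;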

7. The number of deadlocks and lock timeouts per thousand transactions:
1000 * (DEADLOCKS + LOCK_TIMEOUTS) / TOTAL_APP_COMMITS
Although deadlocks are comparatively rare in most production systems, lock timeouts can be more common. The application usually has to handle them in a similar way: re-executing the transaction from the beginning. Monitoring the rate at which this happens helps avoid the case where many deadlocks or lock timeouts drive significant extra load on the system without the DBA being aware.
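
A minimal sketch, assuming MON_GET_WORKLOAD:

SELECT DECIMAL(1000 * SUM(DEADLOCKS + LOCK_TIMEOUTS), 20, 2)
         / NULLIF(SUM(TOTAL_APP_COMMITS), 0) AS DLOCK_TIMEOUTS_PER_1000_TX
FROM TABLE(MON_GET_WORKLOAD(NULL, -2)) AS T;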

8. The number of dirty steal triggers per thousand transactions:
1000 * POOL_DRTY_PG_STEAL_CLNS / TOTAL_APP_COMMITS
A “dirty steal” is the least preferred way to trigger buffer pool cleaning. Essentially, the processing of an SQL statement that is in need of a new buffer pool page is interrupted while updates on the victim page are written to disk. If dirty steals are allowed to happen frequently, they can have a significant effect on throughput and response time.
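
A minimal sketch of the numerator, assuming POOL_DRTY_PG_STEAL_CLNS is reported by MON_GET_BUFFERPOOL at your DB2 level (it is also a long-standing buffer pool and database snapshot element); divide by the commit count from metric 1 and multiply by 1000 to normalize:

SELECT SUM(POOL_DRTY_PG_STEAL_CLNS) AS DIRTY_STEAL_CLEANER_TRIGGERS
FROM TABLE(MON_GET_BUFFERPOOL(NULL, -2)) AS T;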

9. The number of package cache inserts per thousand transactions:
1000 * PKG_CACHE_INSERTS / TOTAL_APP_COMMITS
Package cache insertions are part of normal execution of the system; however, in large numbers, they can represent a significant consumer of CPU time. In many well-designed systems, after the system is running at steady state, very few package cache inserts occur, because the system is using or reusing static SQL or previously prepared dynamic SQL statements. In systems with a high volume of ad hoc dynamic SQL statements, SQL compilation and package cache inserts are unavoidable. However, this metric is intended to watch for a third type of situation, one in which applications unintentionally cause package cache churn by not reusing prepared statements, or by not using parameter markers in their frequently executed SQL.
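
A minimal sketch, assuming PKG_CACHE_INSERTS is reported by MON_GET_WORKLOAD at your DB2 level (it is also available as a database snapshot element):

SELECT DECIMAL(1000 * SUM(PKG_CACHE_INSERTS), 20, 2)
         / NULLIF(SUM(TOTAL_APP_COMMITS), 0) AS PKG_CACHE_INSERTS_PER_1000_TX
FROM TABLE(MON_GET_WORKLOAD(NULL, -2)) AS T;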

10. The time an agent waits for log records to be flushed to disk:
LOG_WRITE_TIME / TOTAL_APP_COMMITS
The transaction log has significant potential to be a system bottleneck, whether due to high levels of activity, or to improper configuration, or other causes. By monitoring log activity, you can detect problems both from the DB2 side (meaning an increase in number of log requests driven by the application) and from the system side (often due to a decrease in log subsystem performance caused by hardware or configuration problems). 
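
A minimal sketch, assuming MON_GET_WORKLOAD; the agent wait on log flushes is written here as LOG_DISK_WAIT_TIME, the element assumed to correspond to LOG_WRITE_TIME in the MON_GET interfaces:

-- Log flush wait per commit, in milliseconds
SELECT DECIMAL(SUM(LOG_DISK_WAIT_TIME), 20, 2)
         / NULLIF(SUM(TOTAL_APP_COMMITS), 0) AS LOG_WAIT_PER_TX
FROM TABLE(MON_GET_WORKLOAD(NULL, -2)) AS T;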

11. In partitioned database environments, the number of fast communication manager (FCM) buffers sent and received between partitions:
FCM_SENDS_TOTAL, FCM_RECVS_TOTAL
These give the rate of flow of data between different partitions in the cluster, and in particular, whether the flow is balanced. Significant differences in the numbers of buffers received from different partitions might indicate a skew in the amount of data that has been hashed to each partition.
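
A minimal sketch, assuming FCM_SENDS_TOTAL and FCM_RECVS_TOTAL are reported by MON_GET_WORKLOAD at your DB2 level; grouping by MEMBER makes an imbalance between partitions easier to spot:

SELECT MEMBER,
       SUM(FCM_SENDS_TOTAL) AS FCM_SENDS_TOTAL,
       SUM(FCM_RECVS_TOTAL) AS FCM_RECVS_TOTAL
FROM TABLE(MON_GET_WORKLOAD(NULL, -2)) AS T
GROUP BY MEMBER
ORDER BY MEMBER;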

