Executive summary
Because of business needs for data retention and regulatory compliance, enterprises need to manage increasingly large databases ranging from hundreds of gigabytes to many terabytes, or even petabytes, in size. As data continues to grow at an exponential rate, DBAs and IT professionals in these organizations face daunting challenges when designing and operating such large databases. The data must be well organized to effectively cope with data growth and to meet service requirements. Challenges include how best to achieve the following objectives:
- Designing databases that can accommodate continuous data growth
- Keeping database systems lean and high performing
- Managing data lifecycles more efficiently and less intrusively to keep operational data
highly available
- Reducing the cost and impact of data maintenance operations, such as backup and restore operations, to keep mission-critical data ready when needed, and reorganization
and runstats operations to maximize system performance
- Satisfying near real-time requests for transactional data or the complex analytical
query requirements for large data sets, often including historical data, while reducing
the total cost of ownership (TCO)
DB2 for Linux, UNIX, and Windows offers a rich set of features that help you to meet these challenges and to benefit from winning solutions. This paper describes best practices for managing data growth that you can consider during the stages of database planning, design, implementation, and operation. By leveraging these best practices, DBAs and IT professionals can use DB2 data server’s extensible architecture and the layered data partitioning and organization schemes to take full advantage of proven approaches to managing data growth.
Introduction
This paper, whose contents are based on DB2 for Linux, UNIX, and Windows Version 9.7, is primarily intended for database administrators and solution architects in a data warehouse environment who are looking for physical database design and planning information that is applicable to data lifecycle management and data maintenance scenarios characterized by rapid data growth. The best practices described in this paper are also helpful in mixed-workload environments.
Strategies for managing data growth
There are a few general strategies for managing growth in a DB2 database system. Consider these strategies when planning your database system or application. You can also apply them when your system requirements change unexpectedly and you need to adjust accordingly.
Strategy 1: Control data growth
You might use this strategy when you have data that is constantly produced, but you have different access patterns for the data based on its age. Growth could be controlled by keeping the size of data in tables used for key applications constant, even as the size of the overall database grows, or it could be controlled by keeping the size of the entire database constant. Either way, you will need mechanisms for moving new, “active”, data into the tables and moving less relevant, “inactive”, data out of the tables or database. For example, a data warehouse might retain a single quarter's worth of data in its active data set and move older data to near-line storage for retention for regulatory compliance. By separating data according to access patterns, you can keep applications lean and optimize performance. You can also control costs by allocating hardware resources based on requirements for accessing the data.
Strategy 2: Adapt to growth
You might use this strategy when you have a system that needs to maintain good performance even as the size of its active data set continues to grow. For example, your data warehouse has produced so much business value, that now other departments want to take advantage of your application for their analyses. These departments want to include more columns from the transactional data in the warehouse. Another example is business growth. Increases in customers or sales, for example, could cause the same increase in data for your system. The DB2 software provides mechanisms that add capacity to the system to scale for good performance and to reduce the impact of data maintenance operations on increasingly large tables.
Combined strategy: Control and adapt to growth
Controlling and adapting to data growth can be combined into the same strategy. You might use this approach when you can separate data by access pattern, but the size of the data in the active data set continues to grow due to increases in business or increasing data demands on the application from new use cases.
Database design considerations for data growth
As data volume increases over time, the data might become skewed and fragmented, resulting in decreased performance. These problems can be addressed through good initial planning during the design phase. The DB2 shared-nothing architecture provides unparalleled system extensibility, both in terms of additional processing power and increased storage capacity. Database partitioning, table partitioning, and multidimensional clustering (MDC) are innovative and industry-proven solutions for scalability, manageability, and performance.
The tips in this section are intended to help DBAs take full advantage of these features, either individually or in combination. The discussion covers best practices for the following tasks:
- Designing table spaces with all anticipated database considerations in mind, including query performance, data lifecycle management, archiving, data backup and recovery operations, and other database maintenance tasks.
- Choosing the right database partitioning key so that the data can be distributed evenly across all of the database partitions for efficient parallel processing.
- Achieving better data collocation for enhanced complex query performance.
- Choosing a table partitioning key that facilitates data roll-in or roll-out for efficient data lifecycle management, and that helps complex queries to run faster by using the DB2 optimizer’s data partition elimination capability to scan only relevant data partitions.
- Determining the right MDC dimensions to achieve better query performance and require less data maintenance.
- Choosing the right combination of data organization schemes to maximize benefits.
Data lifecycle management
As a database grows, so does the relative amount of old data that is infrequently accessed yet consumes limited resources. Even with good scaling, it is increasingly important to keep the system lean to meet performance requirements and lower the TCO. It is also important to keep the old data available to meet business needs. This is accomplished by controlling the amount of active data and by defining a data retention strategy to handle the data aging process, for example by archiving historical data. Data lifecycle management refers to the management of data from the time at which the data is generated right to the end of its useful life.
The data lifecycle actually starts before the point of injection into a database system, and does not end when the data is rolled out from the active tables. Preparing, cleansing, and transforming the data before putting it into the database is a critical part of the process. Similarly, the management of data both while and after it is rolled out of the system should be considered part of the data lifecycle.
Today's database applications often require existing data to be available 24x7, so that the applications can run without interruption during both data roll-in and roll-out. The best practices for data lifecycle management are covered in the “Data lifecycle management” section. The best practices for accessing archived data are also outlined in that section.
Data maintenance
As the amount of data in a database continuously grows, database backups seem to take forever, data reorganization takes a long time, and runstats operations do not complete within the defined maintenance window. A highly active database can experience changing data distribution characteristics, including the balance of data across database partitions. These changes can result in fragmentation, reduced clustering ratios, and degraded performance over time. It is critical to regularly monitor the data distribution characteristics and to maintain current statistics to keep system performance from degrading.
Regular maintenance typically includes reorganizing data and indexes for better clustering, space reclamation, or defragmentation; refreshing statistics to help the optimizer improve data access plans; and redistributing the data in partitioned database environments to eliminate skew. Because such operations can be very time consuming and resource-intensive, especially with large amounts of data, careful consideration should be given during the physical database design phase to reduce or eliminate the impact of these maintenance tasks during database growth.
Another important operational aspect of managing data in a growing database is implementing a good data backup and recovery strategy to ensure business continuity after either planned or unexpected outages.
Responding to changes
No planning is perfect! As business needs change and data continues to grow, the original design assumptions for the database might no longer be valid. Although small changes to storage allocation or memory management can easily be accommodated, in some cases you might need to make significant changes. For example, the growing needs of your business might call for horizontal scaling, which you can address by adding a new database partition. Or the distribution key has developed a lot of skew and no
longer seems to be the best choice.
The “Responding to changes” section provides guidelines that help you to reconsider your choices around data organization schemes by assessing current realities and anticipating future trends. The following topics are covered in this section:
- Adding new database partitions to accommodate increasing data volume requirements and satisfy user demands
- Modifying the distribution key in response to skew
- Changing table partitioning granularity to accommodate data growth
Download the full report for more on managing data growth.
Download the report to get started! #Db2