Wednesday, 1 June 2011

Optimizing the Data Warehouse for Decision Support


The data warehouse is the foundation of any data delivery system, and the warehouse data model must address the conflicting goals of system flexibility and data delivery performance. Flexibility is required so that additional attributes, dimensions, and metrics can be added to the warehouse, complementing and leveraging existing data. Quick query response is required in all effective interactive Decision Support Systems, since it not only impacts end-user satisfaction but also directly determines the number of analyses the DSS analyst can perform given limited time.
In considering the data warehouse data model, it is useful to first review the types of tables found in a data warehouse. The three types of tables are:
  • Primary Data Tables.
  • Descriptor Tables.
  • Characteristics Tables.
Primary Data Tables contain both metrics and attributes, and contain the data that end-users are seeking. In Figure 5, the Primary Data Table contains the attributes Product_ID, Store_ID, and the metric Sales. In large data warehouses, the full-text attribute description is not stored in the Primary Data Table, but rather in a Descriptor Table, while numeric element ID codes are stored in the Primary Data Table. Numeric ID codes can be indexed faster, yield smaller indices, and provide faster WHERE clause matching than their text counterparts.
Descriptor Tables often contain only two columns, the attribute ID code and the common-English description of the attribute. There is a one-to-one relationship between the ID and the description. These tables are used to replace the ID codes used in queries with common business terms familiar to the user. In Figure 5, there are two Descriptor Tables which map Product_ID and Store_ID codes to their respective user-understandable business terms. In smaller warehouses, where load performance and storage concerns are lessened, text descriptors may appear in the Primary Data Tables to increase data comprehensibility.
Characteristics Tables contain additional information about an attribute and can be used to segment data in an ad-hoc manner. Each column in a Characteristics Table represents an additional attribute that can be used as a filtering criterion in queries. In Figure 5, there is a single Characteristics Table which contains additional attributes related to the Store attribute. Using this Characteristics Table, Sales could be segmented by store square footage, or any other attribute in the Characteristics Table. The ability to define and join with Characteristics Tables containing many attributes is a key advantage of RDBMS technology over multi-dimensional data storage architectures, since multi-dimensional architectures do not support multi-table joins.
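The three table types can be sketched with a small relational example. The schema and data below are hypothetical (the actual column lists in Figure 5 may differ), using Python's built-in sqlite3 module: a Primary Data Table holding numeric ID codes and the Sales metric, Descriptor Tables mapping IDs to business terms, and a Characteristics Table used to segment Sales by store square footage.

```python
import sqlite3

# Hypothetical schema illustrating the three table types.
con = sqlite3.connect(":memory:")
con.executescript("""
-- Primary Data Table: numeric ID codes plus the Sales metric
CREATE TABLE sales_fact (product_id INTEGER, store_id INTEGER, sales REAL);

-- Descriptor Tables: one-to-one map from ID code to business term
CREATE TABLE product_desc (product_id INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE store_desc   (store_id   INTEGER PRIMARY KEY, store_name   TEXT);

-- Characteristics Table: additional attributes of the Store dimension
CREATE TABLE store_char (store_id INTEGER PRIMARY KEY, square_feet INTEGER);
""")
con.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)",
                [(1, 10, 500.0), (1, 20, 300.0), (2, 10, 150.0)])
con.executemany("INSERT INTO product_desc VALUES (?, ?)",
                [(1, "Widget"), (2, "Gadget")])
con.executemany("INSERT INTO store_desc VALUES (?, ?)",
                [(10, "Downtown"), (20, "Mall")])
con.executemany("INSERT INTO store_char VALUES (?, ?)",
                [(10, 40000), (20, 12000)])

# Segment Sales by a Characteristics attribute (stores over 20,000 sq ft),
# joining the Descriptor Table to report the user-understandable store name.
rows = con.execute("""
    SELECT d.store_name, SUM(f.sales)
    FROM sales_fact f
    JOIN store_desc d ON d.store_id = f.store_id
    JOIN store_char c ON c.store_id = f.store_id
    WHERE c.square_feet > 20000
    GROUP BY d.store_name
""").fetchall()
print(rows)  # [('Downtown', 650.0)]
```

Note that the WHERE clause filters on a Characteristics column while matching is done entirely on compact numeric IDs; the text descriptor appears only in the final result.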

Performance Optimization

DSS query performance, while a function of the performance of every component in the data delivery architecture, is strongly correlated to the physical data model. Intelligent data modeling, through the use of techniques such as denormalization, consolidation, and partitioning, can provide orders of magnitude performance gains compared to the use of normalized data structures.

Denormalization

Denormalization improves performance by either:
1. Reducing the number of joins required during query execution, or
2. Reducing the number of rows to be retrieved from the Primary Data Table.
There are three principal denormalization techniques:
  1. Include Descriptors in Primary Data Tables: Rather than placing ID descriptors in a separate Descriptor Table, descriptors may be placed directly in the Primary Data Table in order to eliminate a join. In Figure 6A, the Store_Descriptor is contained in the Primary Data Table along with the Store_ID.
  2. Include Child Records in Parent Record: By adding columns to the Primary Data Table, child records can be folded into the parent record, allowing all child records to be retrieved with every parent record retrieval. This technique improves performance by eliminating a join and removing child record criteria from the WHERE clause. However, query flexibility is impaired, since child record attributes become columns of the parent record and can no longer be used as filtering criteria. The technique works best when there is a fixed and relatively small number of child records. In Figure 6B, Q1, Q2, Q3, and Q4 Sales are represented as columns in the Primary Data Table.
  3. Include the Most Recent Child Record in the Parent Record: In some systems, the most recent child record is used much more frequently than the others. In these cases, adding only the most recent child record to the parent record can improve performance. In Figure 6C, the most recent child data, last quarter's sales, is added to the parent record.
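The child-records-as-columns technique can be illustrated with a small pivot. This is a hedged sketch with invented table and column names (not Figure 6B's actual layout): four quarterly child rows per store are collapsed into one parent row, so a full year of sales is retrieved without a join.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Normalized form: one child row per store per quarter
con.execute("CREATE TABLE qtr_sales (store_id INTEGER, quarter TEXT, sales REAL)")
con.executemany("INSERT INTO qtr_sales VALUES (?, ?, ?)",
                [(10, 'Q1', 100.0), (10, 'Q2', 110.0),
                 (10, 'Q3', 120.0), (10, 'Q4', 130.0)])

# Denormalized form: the fixed set of child records becomes columns of the
# parent record, eliminating the join and the child-record WHERE criteria
con.execute("""
    CREATE TABLE store_year AS
    SELECT store_id,
           SUM(CASE WHEN quarter = 'Q1' THEN sales END) AS q1_sales,
           SUM(CASE WHEN quarter = 'Q2' THEN sales END) AS q2_sales,
           SUM(CASE WHEN quarter = 'Q3' THEN sales END) AS q3_sales,
           SUM(CASE WHEN quarter = 'Q4' THEN sales END) AS q4_sales
    FROM qtr_sales GROUP BY store_id
""")
row = con.execute("SELECT * FROM store_year").fetchone()
print(row)  # (10, 100.0, 110.0, 120.0, 130.0)
```

The trade-off noted above is visible here: once Q1 sales is a column rather than a row, it can no longer serve as a row-filtering criterion in the same way.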

Summarization

One of the most powerful performance optimization techniques is the use of multiple, summarized Primary Data Tables. Data is typically loaded into the data warehouse from operational systems at an atomic level of detail. Most decision support queries, however, require aggregate-level, consolidated information rather than atomic detail. If processed against the atomic level, these queries require the retrieval of many records to yield aggregate information, and require that the data be mathematically summarized at query runtime, further impeding query performance.
To improve query performance, data can be pre-summarized and stored along frequently accessed consolidation hierarchies. It is impossible to pre-summarize atomic detail if the consolidation hierarchy is not known. Fortunately, these consolidation hierarchies are often identical to the attribute hierarchies within a dimension. In fact, for best query performance, Primary Data Tables should exist at each attribute intersection across all attribute hierarchies.
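Pre-summarization is essentially a GROUP BY executed once at load time instead of at every query. The sketch below assumes hypothetical table names and an item/zip atomic level: the Summary Table is materialized along the consolidation hierarchy, so aggregate queries read pre-computed totals rather than scanning and summing atomic rows.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE atomic_sales (item TEXT, zip TEXT, sales REAL)")
con.executemany("INSERT INTO atomic_sales VALUES (?, ?, ?)",
                [("pen", "02139", 5.0), ("pen", "02139", 3.0),
                 ("pen", "10001", 2.0), ("ink", "02139", 4.0)])

# Materialize the consolidation once, at load time
con.execute("""
    CREATE TABLE item_zip_summary AS
    SELECT item, zip, SUM(sales) AS sales
    FROM atomic_sales GROUP BY item, zip
""")

# Queries at this consolidation level now retrieve one pre-summed row
# instead of aggregating many atomic rows at runtime
rows = con.execute(
    "SELECT sales FROM item_zip_summary WHERE item='pen' AND zip='02139'"
).fetchall()
print(rows)  # [(8.0,)]
```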
At first glance, the addition of Summary Tables may appear to dramatically increase data storage requirements. However, since Summary Tables get progressively smaller as one approaches the most aggregate level of the attribute hierarchy, only a 20-100% increase in data storage is typically required to physically store all attribute hierarchy-defined consolidation paths. This modest increase in disk storage typically yields a two-to-ten-fold increase in query performance over non-consolidated data warehouse designs.
The example in Figure 7 shows two dimensions, Product and Geography, each of which has multiple hierarchical attributes. Data is extracted from the source systems at the lowest level of both hierarchies, which are the item and zip code levels. A Summary Table is then added for each attribute combination of both attribute hierarchies.
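The number of Summary Tables implied by "each attribute combination" is the cross product of the hierarchy levels. A minimal sketch, assuming hypothetical three-level hierarchies for the two dimensions (Figure 7's actual level names may differ):

```python
from itertools import product

# Assumed attribute hierarchies, most detailed level first
product_levels = ["item", "class", "division"]
geography_levels = ["zip", "region", "nation"]

# One Primary Data Table per attribute intersection across both hierarchies
tables = [f"sales_{p}_{g}" for p, g in product(product_levels, geography_levels)]
print(len(tables))  # 9 attribute intersections
print(tables[0])    # sales_item_zip -- the atomic level
```

With three levels per dimension, nine tables cover every consolidation path; since each table above the atomic level is progressively smaller, total storage grows far more slowly than the table count suggests.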

Partitioning

A final technique used to improve query performance is data partitioning. In a partitioned data model, multiple tables are created to store atomic level data. Figure 8 contrasts an atomic non-partitioned table with an atomic set of tables partitioned by month.
Partitioning of large atomic data tables into multiple smaller tables provides the following advantages:
1. Query response time is improved.
2. Incremental data backup and recovery is accelerated.
3. The time required to load into indexed tables is decreased.
Disadvantages of partitioning include:
  1. Joins and unions are required to retrieve related data from multiple tables.
  2. More intelligent query generation is required to determine which tables contain user-requested data.
  3. Additional metadata is required to describe the partitioned warehouse.
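Both the union requirement and the need for more intelligent query generation can be seen in a small sketch. The monthly partition tables and the routing function below are hypothetical, not Figure 8's actual layout: the generator must know which partitions hold the requested months, query only those, and combine the partial results.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# One atomic table per month, in the style of Figure 8 (names illustrative)
for month in ("jan", "feb", "mar"):
    con.execute(f"CREATE TABLE sales_{month} (store_id INTEGER, sales REAL)")
con.execute("INSERT INTO sales_jan VALUES (10, 100.0)")
con.execute("INSERT INTO sales_feb VALUES (10, 200.0)")
con.execute("INSERT INTO sales_mar VALUES (10, 300.0)")

def monthly_sales(store_id, months):
    """Route the query only to the partitions holding the requested data,
    then UNION the partial results -- the query generator's added burden."""
    union = " UNION ALL ".join(
        f"SELECT sales FROM sales_{m} WHERE store_id = ?" for m in months)
    return sum(r[0] for r in con.execute(union, [store_id] * len(months)))

print(monthly_sales(10, ["jan", "feb"]))  # 300.0
```

The advantage is equally visible: a two-month query touches only two small tables, and loading March data never locks or re-indexes January's partition.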

Naming Conventions

All column names referring to the same attribute or metric should be consistent throughout the warehouse. While perhaps obvious, name consistency is extremely important for understanding data. Consistency is absolutely required when using query generators which dynamically determine join columns based on column name.
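Why consistency is "absolutely required" for such query generators can be shown in miniature. This is a sketch of the general idea, not any particular product's algorithm, with hypothetical table layouts: the generator infers join keys purely by intersecting column names, so an inconsistent name silently breaks the join.

```python
def join_columns(table_a_cols, table_b_cols):
    """Infer join keys as the columns shared by both tables, by name."""
    return sorted(set(table_a_cols) & set(table_b_cols))

fact = ["product_id", "store_id", "sales"]
store_desc = ["store_id", "store_name"]
print(join_columns(fact, store_desc))  # ['store_id']

# An inconsistent name (str_id vs. store_id) yields no join key at all
print(join_columns(fact, ["str_id", "store_name"]))  # []
```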

Data Warehousing Summary

In enterprise-wide data warehouses, decision support query performance often needs to be improved to satisfy the requirements of interactive users. Through the use of techniques such as denormalization, summarization, and partitioning, the data warehouse data model can be optimized to deliver dramatically improved performance over highly normalized models. While these techniques are not new, it traditionally has been difficult to create a data delivery architecture which seamlessly supports performance-optimized models since metadata requirements have been ill-defined and DSS engines have had limited query generation capabilities.
