Rational Insight data warehouse model

The IBM® Rational® Insight data warehouse (RIDW) is the component that stores data from various operational systems in a format that facilitates analytical reporting, and provides the engine required for query optimization, multidimensional analysis, and historical trending. Its data structure is different from that of live transactional systems, which generally store information in normalized form. While incoming data passes through several stages during which it is stored temporarily in normalized tables, the final format used is a star schema. This section describes the basic data structure of the RIDW.

Data storage areas

The RIDW contains three data storage areas:

Configuration area

This area stores important configuration information about entities such as product data sources and calendars. The configuration area contains the reference settings of the warehouse. It has four main tables:
Table 1.
Table name Purpose of the table
SYS_CONFIG Contains system configuration information, such as the date dimension calendar type.
ETL_INFO Contains extract, transform, and load (ETL) configuration information, such as ETL schedule (daily, weekly, or monthly)
For compatibility with multiple data sources, RIDW supports both ISO and Gregorian date types. You can select the configuration settings when the warehouse is installed, and before you execute an ETL transaction.

Operational data store

This is the area where information gathered from the data sources during the ETL process is received, consolidated and stored. RIDW supports the following products: IBM Rational ClearQuest® (Enterprise and ALM schemas), IBM Rational ClearCase®, IBM Rational Requisite® Pro, IBM Rational Team Concert, IBM Rational Test Manager, IBM Rational Quality Manager, IBM Rational Team Manager, and Microsoft® Project. The data is then cleaned and processed to ensure consistency between multiple sources. RIDW database design principles seek to ensure the following:
  • Traceability of data to their source operational systems
  • Grouping of elements based on subjects
  • Present a unified normalized area independent of data sources
Control columns
The ETL process adds a few additional columns that contain control information used by the RIDW later. For instance, the following three mandatory columns provide identification information:
  • EXTERNAL_KEY1 stores the natural key if it is an integer
  • EXTERNAL_KEY2 store the natural key if it of type varchar
  • SOURCE_ID stores the GUID value of the data source
End-users do not need to configure the GUID as it is assigned to a data source during the design phase.

Two additional control columns are also added: ISSOFDELETED and REC_DATETIME.

ISSOFDELETED is used to mark a record that has been physically deleted from the original source. Its value should be set to 1 if the record is deleted and to 0 if not deleted.
Note: Records in the normalized area should never be physically deleted since they might be linked to measures in the fact and dimension area.
REC_DATETIME is a timestamp column whose value is set by the database whenever a record is inserted, and is subsequently updated by the ETL should a record be updated.

Surrogate keys

Surrogate keys are the primary keys created by the ETL process. They are required to override differences between the formats of the primary keys from different sources. In every extracted table, RIDW creates its own primary keys and retains the original key in the EXTERNAL_KEY1 or EXTERNAL_KEY2 columns. Surrogate keys are 0-based integers.

Classification tables

These are tables that represents commonly used artifacts such as projects, requests, requirements, tasks, activities, and components. They are recognizable by their name, which contains the word “CLASSIFICATION”. Every classification table has a pre-defined record type that represents a common understanding of the artifact. Converting extracted tables into classification tables is managed by the ETL process.

Dummy record

All classifications and artifact tables contain a dummy row identifiable by a primary key value of -1. Dummy rows support the normalization of data in the event of null values for the referential constraints defined.

Fact and dimension area

The fact and dimension area implements a star schema design, which is a set of facts, dimensions, and historical traceability tables. A fact table contains the measures or ‘facts' of a particular business process. For instance, if project management is a process that you want to measure then a corresponding fact table might contain the number of requirements. Dimension tables contain attributes used to constrain and group data when performing data warehousing queries. If number of requirements is a fact table then you might want to use requirement type as a dimension.

Design conventions

Dimension table names start with “D_” and have surrogate keys to identify records. Fact table names with “F_”, while tables that provide historical insight on measures found in the fact tables start with “H_”. The primary key is usually a combination of surrogate key of the measure and the surrogate key of the artifact that contributed to the measure. Fact table names with “F_C_”, are equivalent to their original trending fact table, with the exception that they only store the most recent snapshot of the data, whereas their counterparts accumulate data to allow trending over time. The data mart also comprises a set of views that provide a window to the operational data store, for the purpose of drill down for more details or for the possibility of providing list reports.

Control columns

All fact tables have a column that references the date dimension D_DATE. For trending tables, the date in this context indicates the date that the snapshot of the data was taken. For the other fact tables, the date reflects the occurrence of an event such as the creation of a record. The snapshot of the data can be tagged to indicate a ‘per day', ‘per week', or ‘per month' basis, depending on the value stored in the SYS_CONFIG table in the configuration area. Other control columns include the REC_DATEIME column, which serves as a record stamp for the insert or update of a record.

NULL indicator record

As in the operational data store, dimension tables contain a NULL indicator row identifiable by a primary key value of -1. This represents the NULL value for tables with constraints.


Feedback