Thursday, 2 June 2011

SCD – Slowly Changing Dimensions

Implementing SCDs is a common concern in data warehouse design. Let me briefly describe the different types of SCDs first, and then we can step into their design.
Dimensions that change over time are called Slowly Changing Dimensions. For instance, a product's price changes over time; people change their names for various reasons; country and state names may change. These are all examples of Slowly Changing Dimensions, since changes happen to them over a period of time.
As everybody knows, SCD is the acronym for 'Slowly Changing Dimension'.
We have 4 types of SCDs in a data warehouse.
Let's say I have a customer dimension with mainly these columns: (Customer Id, Customer First Name, Customer Last Name, Customer Country)

Customer Id | Customer First Name | Customer Last Name | Customer Country
------------|---------------------|--------------------|------------------
1           | Sudheer             | Sharma             | India
Now, suppose this customer moves to the US. In the source, the country has been changed to US, and we need to update the target dimension to reflect this change.
SCD Type 1: The new incoming record (the changed/modified data) replaces the existing old record in the target.
Customer Id | Customer First Name | Customer Last Name | Customer Country
------------|---------------------|--------------------|------------------
1           | Sudheer             | Sharma             | US
The old value (India) is overwritten by the new value (US), so there is no way to recover the old version of the data. The dimension holds only the current version.
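A Type 1 change is just an in-place update. As a minimal sketch, here is what that overwrite looks like against an in-memory SQLite copy of the customer dimension (the table and column names here are illustrative, not a fixed standard):

```python
import sqlite3

# Hypothetical in-memory copy of the Customer dimension (column names assumed).
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE customer_dim (
    customer_id INTEGER PRIMARY KEY,
    first_name  TEXT,
    last_name   TEXT,
    country     TEXT)""")
conn.execute("INSERT INTO customer_dim VALUES (1, 'Sudheer', 'Sharma', 'India')")

def scd_type1_update(conn, customer_id, new_country):
    # Type 1: the incoming change simply overwrites the existing row;
    # the old value ('India') is lost.
    conn.execute("UPDATE customer_dim SET country = ? WHERE customer_id = ?",
                 (new_country, customer_id))

scd_type1_update(conn, 1, 'US')
print(conn.execute("SELECT country FROM customer_dim WHERE customer_id = 1")
      .fetchone()[0])  # prints US
```

Note that a single UPDATE is all Type 1 requires; no extra columns or rows are needed, which is why it is the simplest and most destructive option.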




SCD Type 2: In this case, an additional record is added to the customer dimension. The beauty of this approach is that it maintains two versions: you will find two records, the older version and the current version. In other words, it maintains history. Type 2 can be implemented in the following ways:
  1. Versioning
  2. Effective dates
  3. Setting a current flag value/record indicator
Method 1: Versioning

Customer Id | Customer First Name | Customer Last Name | Customer Country | Effective Year | Version
------------|---------------------|--------------------|------------------|----------------|--------
1           | Sudheer             | Sharma             | India            | 2008           | 0
1           | Sudheer             | Sharma             | US               | 2009           | 1
Method 2: Effective Dates

Customer Id | Customer First Name | Customer Last Name | Customer Country | Effective Start Date | Effective End Date
------------|---------------------|--------------------|------------------|----------------------|-------------------
1           | Sudheer             | Sharma             | India            | 01/01/2008           | 12/31/2008
1           | Sudheer             | Sharma             | US               | 01/01/2009           | till date
Method 3: Effective Dates & Current Record Indicators

Customer Id | Customer First Name | Customer Last Name | Customer Country | Effective Start Date | Effective End Date | Current Record IND
------------|---------------------|--------------------|------------------|----------------------|--------------------|-------------------
1           | Sudheer             | Sharma             | India            | 01/01/2008           | 12/31/2008         | N
1           | Sudheer             | Sharma             | US               | 01/01/2009           | till date          | Y
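A Type 2 change has two steps: expire the current version, then insert the new version. Below is a minimal sketch of Method 3 (effective dates plus a current-record indicator) using SQLite; the column names and the use of NULL to mean "till date" are assumptions for the example, not a fixed convention:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE customer_dim (
    customer_id INTEGER, first_name TEXT, last_name TEXT, country TEXT,
    eff_start TEXT, eff_end TEXT, current_ind TEXT)""")
conn.execute("""INSERT INTO customer_dim
    VALUES (1, 'Sudheer', 'Sharma', 'India', '2008-01-01', NULL, 'Y')""")

def scd_type2_update(conn, customer_id, new_country, change_date):
    # Step 1: expire the current version -- set its end date, flip flag to 'N'.
    conn.execute("""UPDATE customer_dim SET eff_end = ?, current_ind = 'N'
                    WHERE customer_id = ? AND current_ind = 'Y'""",
                 (change_date, customer_id))
    # Step 2: insert the new version as the open-ended current record.
    row = conn.execute("""SELECT first_name, last_name FROM customer_dim
                          WHERE customer_id = ?""", (customer_id,)).fetchone()
    conn.execute("INSERT INTO customer_dim VALUES (?, ?, ?, ?, ?, NULL, 'Y')",
                 (customer_id, row[0], row[1], new_country, change_date))

scd_type2_update(conn, 1, 'US', '2009-01-01')
for r in conn.execute("""SELECT country, current_ind FROM customer_dim
                         ORDER BY eff_start"""):
    print(r)  # ('India', 'N') then ('US', 'Y')
```

History queries then filter on `current_ind = 'Y'` for the latest version, or on a date falling between `eff_start` and `eff_end` for a point-in-time view.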

SCD Type 3: In this approach, only limited information about the previous value of a dimension attribute is written to the database. An 'old' or 'previous' column is created to store the immediately previous value of the attribute.
Product ID | Product Name | Current Year | Current Price | Previous Year | Previous Price
-----------|--------------|--------------|---------------|---------------|---------------
1          | Close-up     | 2008         | 50.00         | 2007          | 45.00

The problem with this approach is that if the product price changes continuously over the years, the complete history is not stored; only the latest change is kept. For example, if the product price changes to 60 in 2009, we can no longer see the 2007 price, since the 'previous' columns are overwritten with the 2008 information.
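A Type 3 change shifts the current values into the 'previous' columns and then overwrites them, all in one update. A minimal SQLite sketch of this shift (column names assumed for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE product_dim (
    product_id INTEGER PRIMARY KEY, product_name TEXT,
    current_year INTEGER, current_price REAL,
    previous_year INTEGER, previous_price REAL)""")
conn.execute("INSERT INTO product_dim VALUES (1, 'Close-up', 2008, 50.00, 2007, 45.00)")

def scd_type3_update(conn, product_id, new_year, new_price):
    # Shift the current values into the 'previous' columns, then overwrite
    # them; anything older than the previous value is lost.
    conn.execute("""UPDATE product_dim
                    SET previous_year  = current_year,
                        previous_price = current_price,
                        current_year   = ?,
                        current_price  = ?
                    WHERE product_id = ?""", (new_year, new_price, product_id))

scd_type3_update(conn, 1, 2009, 60.00)
print(conn.execute("SELECT current_price, previous_price FROM product_dim")
      .fetchone())  # (60.0, 50.0) -- the 2007 price of 45.00 is gone
```

The single UPDATE makes the one-step-of-history limitation concrete: after the 2009 change, the 45.00 value from 2007 has been overwritten and cannot be queried.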
SCD Type 4: In this approach, one table holds the current data and a separate history table keeps the historical data for each dimension.
Customer Dimension

Customer Id | Customer First Name | Customer Last Name | Customer Country
------------|---------------------|--------------------|------------------
1           | Sudheer             | Sharma             | US

Customer History Table

Customer Id | Customer First Name | Customer Last Name | Customer Country | Effective Year
------------|---------------------|--------------------|------------------|---------------
1           | Sudheer             | Sharma             | India            | 2008
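A Type 4 change moves the outgoing row to the history table before overwriting the current-data table. A minimal SQLite sketch, with table and column names assumed for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_dim (customer_id INTEGER PRIMARY KEY,
    first_name TEXT, last_name TEXT, country TEXT);
CREATE TABLE customer_history (customer_id INTEGER,
    first_name TEXT, last_name TEXT, country TEXT, effective_year INTEGER);
INSERT INTO customer_dim VALUES (1, 'Sudheer', 'Sharma', 'India');
""")

def scd_type4_update(conn, customer_id, new_country, effective_year):
    # Step 1: copy the outgoing row into the history table.
    conn.execute("""INSERT INTO customer_history
                    SELECT customer_id, first_name, last_name, country, ?
                    FROM customer_dim WHERE customer_id = ?""",
                 (effective_year, customer_id))
    # Step 2: overwrite the current-data table, which always holds
    # exactly one row per customer.
    conn.execute("UPDATE customer_dim SET country = ? WHERE customer_id = ?",
                 (new_country, customer_id))

scd_type4_update(conn, 1, 'US', 2008)
```

This keeps the dimension table small and fast to join against, while the full change history accumulates in `customer_history`.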



Wednesday, 1 June 2011

Working with Sessions Overview

A session is a set of instructions that tells the Integration Service how and when to move data from sources to targets. A session is a type of task, similar to other tasks available in the Workflow Manager. In the Workflow Manager, you configure a session by creating a Session task. To run a session, you must first create a workflow to contain the Session task.
When you create a Session task, you enter general information such as the session name, session schedule, and the Integration Service to run the session. You can also select options to run pre-session shell commands, send On-Success or On-Failure email, and use FTP to transfer source and target files.
You can configure the session to override parameters established in the mapping, such as source and target location, source and target type, error tracing levels, and transformation attributes. You can also configure the session to collect performance details for the session and store them in the PowerCenter repository. You might view performance details for a session to tune the session.
You can run as many sessions in a workflow as you need. You can run the Session tasks sequentially or concurrently, depending on the requirement.
The Integration Service creates several files and in-memory caches depending on the transformations and options used in the session. For more information about session output files and caches, see "Integration Service Architecture" in the PowerCenter Administrator Guide.


Repository Architecture

The PowerCenter repository resides in a relational database. The repository database tables contain the instructions required to extract, transform, and load data. Repository clients access the repository database tables through the Repository Service. A repository client is any PowerCenter component that connects to the repository.
The Repository Service manages repository metadata transaction requests from repository clients. Each Repository Service manages a single repository. The Repository Service uses object-locking to ensure the consistency of metadata in the repository.
A Repository Service process is a multi-threaded process that fetches, inserts, and updates metadata in the repository database tables. A Repository Service process is an instance of the Repository Service that runs on a particular machine, or node.
The Repository Service accepts client metadata transaction requests from the following PowerCenter components:
PowerCenter Client tools. Use the Designer to create and store mapping metadata in the repository. Use the Workflow Manager to store workflow metadata and connection object information in the repository. Use the Workflow Monitor to retrieve workflow run status information and session logs written by the Integration Service. Use the Repository Manager to organize and secure metadata by creating folders. You can manage the repository from the PowerCenter Administration Console.
pmrep and infacmd. Use pmrep to perform repository metadata administration tasks, such as listing repository objects. Use infacmd to perform service-related functions, such as creating or removing a Repository Service.


Integration Service. When you start the Integration Service, it connects to the repository to schedule workflows. When you run a workflow, the Integration Service retrieves workflow task and mapping metadata from the repository. During the workflow run, the Integration Service writes workflow status information to the repository.


Integration Service Architecture Overview

The Integration Service moves data from sources to targets based on workflow and mapping metadata stored in a repository. When a workflow starts, the Integration Service retrieves mapping, workflow, and session metadata from the repository. It extracts data from the mapping sources and stores the data in memory while it applies the transformation rules configured in the mapping. The Integration Service loads the transformed data into one or more targets.
Figure 10-1 shows the processing path between the Integration Service, repository, source, and target:
Figure 10-1. Integration Service and Data Movement

To move data from sources to targets, the Integration Service uses the following components:
Integration Service process. The Integration Service starts one or more Integration Service processes to run and monitor workflows. When you run a workflow, the Integration Service process starts and locks the workflow, runs the workflow tasks, and starts the process to run sessions. For more information about the Integration Service process, see Integration Service Process.
Load Balancer. The Integration Service uses the Load Balancer to dispatch tasks. The Load Balancer dispatches tasks to achieve optimal performance. It may dispatch tasks to a single node or across the nodes in a grid. For more information about the Load Balancer, see Load Balancer.


Data Transformation Manager (DTM) process. The Integration Service starts a DTM process to run each Session and Command task within a workflow. The DTM process performs session validations, creates threads to initialize the session and to read, write, and transform data, and handles pre- and post-session operations. For more information about the DTM process, see Data Transformation Manager (DTM) Process.

The Integration Service can achieve high performance using symmetric multi-processing systems. It can start and run multiple tasks concurrently. It can also concurrently process partitions within a single session. When you create multiple partitions within a session, the Integration Service creates multiple database connections to a single source and extracts a separate range of data for each connection. It also transforms and loads the data in parallel. For more information about partitioning, see Processing Threads.
You can create an Integration Service on any machine where you installed the PowerCenter Services. You can configure the Integration Service using the Administration Console or the pmcmd command line program. For more information about the Administration Console, see Using the Administration Console. For more information about pmcmd, see the Command Line Reference.