ETL SQL

Learn Extract Transform Load using SQL & PySpark

Teradata – Slowly Changing Dimensions (SCD-1 & SCD-2) with examples

Slowly changing dimensions or SCD are the dimensions that change over time. It called “slowly” because the change is expected to come rarely or rather I shall say not frequently. Let’s understand this with one real life example.

You went to a coffee shop and ordered some coffee and snacks. The cashier asked for your mobile number and name. Assuming that you are the first time customer an entry will be made in CUSTOMER table in the data warehouse. It may look like below with default value for some columns:

Interaction – 1
Customer_IDCustomer_NameCustomer_MobileCustomer_DOBStateEntry_Date
1Mark999-999-9999 01-01-1900 NY24-05-2019

Next time , when you visit the coffee shop , cashier may ask you for your mobile number. Once you tell him, he may identify you as “Mark”. He may ask you about your Date-of-Birth (DOB) to share any special offer on your birthday. And you may decide to share this info or may be not with him. If you share that another info is added to the table in warehouse.

Interaction – 2
Customer_IDCustomer_NameCustomer_MobileCustomer_DOBStateEntry_Date
1Mark999-999-999918-04-1990 NY29-05-2019

Now there can be multiple ways to store this info into data warehouse Table. And the manner in which you will store this info will determine the SCD type you are using. We will talk about popular SCD types used in any data warehouse.

SCD-1: In this type, you will simply overwrite existing information in the table. The existing record will be replaced with the new one.

SCD-2: In this type, you will make new entry into the table with existing record. Some identifier is used to easily identify which is the current record.

SCD-3: In this type, record is over-written in the extra ‘technical’ column which holds the most recent entry.

SCD-4: In this type, history i.e. all the records all maintained in the separate history table and the most current record is only kept in the main table.

There are few more hybrid SCD types available but above mentioned ones are the most common ones. In the subsequent posts, we will discuss each one of the SCD types in detail – how to implement with Teradata SQL, advantages and disadvantage of each.

In SCD-1, we overwrite the existing information in the table with the most recent info received from Source. This is very common SCD type used in the data warehouse and is good only for situations when you don’t want to keep historical data for that entity. Let’s quickly jump to technical example to understand it:

Create Target Table and load it with sample data:

drop table customer;
create table customer
(
c_mobile bigint,
c_name varchar(30),
c_dob date,
c_email varchar(100),
c_insert_dt date,
c_update_dt date
)
primary index(c_mobile);

insert into customer values (111111111,'Jon Groff','1980-03-10',,date,date);
insert into customer values (222222222,'Kenneth Jarrett','1986-08-24',,date,date);
insert into customer values (333333333,'Daniel Jensen','1972-08-22',,date,date);
insert into customer values (444444444,'Jeffrey Kramer','1987-04-15',,date,date);
insert into customer values (555555555,'Pete  Bartels','1989-11-16',,date,date);


The table data will look like

C_MOBILEC_NAMEC_DOBC_EMAILC_INSERT_DTC_UPDATE_DT
111111111Jon Groff1980-03-10NULL2019-05-242019-05-24
222222222Kenneth Jarrett1986-08-24NULL2019-05-242019-05-24
333333333Daniel Jensen1972-08-22NULL2019-05-242019-05-24
444444444Jeffrey Kramer1987-04-15NULL2019-05-242019-05-24
555555555Pete Bartels1989-11-16NULL2019-05-242019-05-24

Now let’s create a staging table which holds most recent info shared from Source.

create table stg_customer
(
c_mobile bigint,
c_name varchar(30),
c_dob date,
c_email varchar(100)
)
primary index(c_mobile);

insert into stg_customer values (111111111,'Jon Groff','1980-03-10','jon.groff@gmail.com');
insert into stg_customer values (222222222,'Kenneth Jarrett','1986-10-24',NULL);
insert into stg_customer values (333333333,'Daniel Craig','1972-08-22',NULL);
insert into stg_customer values (666666666,'John Snow','1982-02-11',NULL);
insert into stg_customer values (777777777,'Arya Stark','1998-05-16',NULL);

The data in staging table looks like

C_MOBILEC_NAMEC_DOBC_EMAIL
111111111Jon Groff1980-03-10jon.groff@gmail.com
222222222Kenneth Jarrett1986-10-24NULL
333333333Daniel Craig1972-08-22NULL
666666666John Snow1982-02-11NULL
777777777Arya Stark1998-05-16NULL

Now to implement SCD-1 in Teradata , the easiest method is to run a MERGE statement on Target table “customer” using Staging table “stg_customer”

MERGE customer using stg_customer on 
customer.c_mobile = stg_customer.c_mobile
WHEN MATCHED THEN UPDATE 
SET
c_name = stg_customer.c_name,
c_dob = stg_customer.c_dob,
c_email = stg_customer.c_email,
c_update_dt = current_date
WHEN NOT MATCHED THEN
INSERT values (
stg_customer.c_mobile,
stg_customer.c_name,
stg_customer.c_dob,
stg_customer.c_email,
current_date,
current_date
);

Now the data in Target table will look like

C_MOBILEC_NAMEC_DOBC_EMAILC_INSERT_DTC_UPDATE_DTCOMMENT
111111111Jon Groff1980-03-10jon.groff@gmail.com2019-05-242019-05-25Email Update
222222222Kenneth Jarrett1986-10-24NULL2019-05-242019-05-25DOB Update
333333333Daniel Craig1972-08-22NULL2019-05-242019-05-25Name Update
444444444Jeffrey Kramer1987-04-15NULL2019-05-242019-05-24No Change
555555555Pete Bartels1989-11-16NULL2019-05-242019-05-24No Change
666666666John Snow1982-02-11NULL2019-05-252019-05-25New Record
777777777Arya Stark1998-05-16NULL2019-05-252019-05-25New Record

*Comment is not the actual column of the table. It is just for understanding purpose I have added above.

You can also run separate UPDATE+INSERT to replicate same functionality.Now let’s talk about advantage and disadvantage of SCD-1.

Advantage: Easy to implement, Table Size stays in control, Unique records in the table, Table can be accessed directly without worrying about recent or old record.

Disadvantage: No history is maintained. May not be adequate for data analysis.

In SCD (Slowly Changing Dimension) type 2, we have to store all the historical information about any Dimension. So unlike SCD-1, we cannot simply overwrite the updated data we are getting from Source. There are multiple ways in which one can implement SCD-2 in Teradata and we will discuss about couple of most popular approach in this post.

Approach 1: Set existing record as INACTIVE and Insert new record as ACTIVE

In this approach , create 2 technical DATE/TIMESTAMP columns. If your job is intra-day I will suggest keeping timestamp else you can keep DATE for daily jobs. Let’s call it INSERT_DT & UPDATE_DT.

Step 1: Update Existing Rows to mark as INACTIVE

Run UPDATE statement on existing TARGET table with records from SOURCE. This is required to identify and close existing records if any new information has come for same record. So you have to match “KEY” column and update UPDATE_DT to (CURRENT_DATE – 1). No other column is modified.

Step 2: Insert new/updated rows as ACTIVE

Whenever we load record [INSERT] into SCD2 table, you will load INSERT_DT with CURRENT_DATE and UPDATE_DT will be some far future date like ‘2999-12-31’. Since you have already closed all the existing records for which new info has come and assuming you have only DELTA records in Source, you can INSERT all records in the Source to Target now.

Note: With this approach whenever you are querying SCD2 table and don’t want history or shall I say require only active records, you have to apply filter condition to fetch active records. Like “WHERE UPDATE_DT=’2999-12-31′

Let’s see this with one example:

Step1: Creating Target & Source tables.

drop table customer;

create table customer
(
c_mobile bigint,
c_name varchar(30),
c_dob date,
c_email varchar(100),
c_insert_dt date,
c_update_dt date
)
primary index(c_mobile);
insert into customer values (111111111,'Jon Groff','1980-03-10',,'2019-06-02','2999-12-31');

insert into customer values (222222222,'Kenneth Jarrett','1986-08-24',,'2019-06-02','2999-12-31');

insert into customer values (333333333,'Daniel Jensen','1972-08-22',,'2019-06-02','2999-12-31');

insert into customer values (444444444,'Jeffrey Kramer','1987-04-15',,'2019-06-02','2999-12-31');

insert into customer values (555555555,'Pete  Bartels','1989-11-16',,'2019-06-02','2999-12-31');
drop table stg_customer;

create table stg_customer
(
c_mobile bigint,
c_name varchar(30),
c_dob date,
c_email varchar(100)
)
primary index(c_mobile);
insert into stg_customer values (111111111,'Jon Groff','1980-03-10','jon.groff@gmail.com');

insert into stg_customer values (222222222,'Kenneth Jarrett','1986-10-24',NULL);

insert into stg_customer values (333333333,'Daniel Craig','1972-08-22',NULL);

insert into stg_customer values (666666666,'John Snow','1982-02-11',NULL);

insert into stg_customer values (777777777,'Arya Stark','1998-05-16',NULL);

Let’s see how data looks like into the table now:

select * from customer; 
c_mobilec_namec_dobc_emailc_insert_dtc_update_dt
111111111Jon Groff1980-03-10NULL2019-06-022999-12-31
222222222Kenneth Jarrett1986-08-24NULL2019-06-022999-12-31
333333333Daniel Jensen1972-08-22NULL2019-06-022999-12-31
444444444Jeffrey Kramer1987-04-15NULL2019-06-022999-12-31
555555555Pete Bartels1989-11-16NULL2019-06-022999-12-31
select * from stg_customer;
c_mobilec_namec_dobc_email
111111111Jon Groff1980-03-10jon.groff@gmail.com
222222222Kenneth Jarrett1986-10-24NULL
333333333Daniel Craig1972-08-22NULL
666666666John Snow1982-02-11NULL
777777777Arya Stark1998-05-16NULL

Now run the UPDATE command to modify UPDATE_DT and run INSERT to load records.

update tb1 from customer tb1, stg_customer tb2 
 set c_update_dt=current_date - 1 
 where tb1.c_mobile = tb2.c_mobile 
and tb1.c_update_dt='2999-12-31';
insert into customer select stg_customer.c_mobile,
stg_customer.c_name,
stg_customer.c_dob,
stg_customer.c_email,
current_date,
'2999-12-31'
from stg_customer;

Now data in Target will look like:

c_mobilec_namec_dobc_emailc_insert_dtc_update_dt
111111111Jon Groff1980-03-10NULL2019-06-022019-06-02
111111111Jon Groff1980-03-10jon.groff@gmail.com2019-06-032999-12-31
222222222Kenneth Jarrett1986-08-24NULL2019-06-022019-06-02
222222222Kenneth Jarrett1986-10-24NULL2019-06-032999-12-31
333333333Daniel Jensen1972-08-22NULL2019-06-022019-06-02
333333333Daniel Craig1972-08-22NULL2019-06-032999-12-31
444444444Jeffrey Kramer1987-04-15NULL2019-06-022999-12-31
555555555Pete Bartels1989-11-16NULL2019-06-022999-12-31
666666666John Snow1982-02-11NULL2019-06-032999-12-31
777777777Arya Stark1998-05-16NULL2019-06-032999-12-31

To fetch only the active records:

select * from customer where c_update_dt='2999-12-31';
c_mobilec_namec_dobc_emailc_insert_dtc_update_dt
111111111Jon Groff1980-03-10jon.groff@gmail.com2019-06-032999-12-31
222222222Kenneth Jarrett1986-10-24NULL2019-06-032999-12-31
333333333Daniel Craig1972-08-22NULL2019-06-032999-12-31
444444444Jeffrey Kramer1987-04-15NULL2019-06-022999-12-31
555555555Pete Bartels1989-11-16NULL2019-06-022999-12-31
666666666John Snow1982-02-11NULL2019-06-032999-12-31
777777777Arya Stark1998-05-16NULL2019-06-032999-12-31

Approach 2: Append all the records and apply Logic while reading data

This will require only 1 technical column, “INSERT_DT”. You can simply INSERT all the records in the Target table. Now on top of this table you can create VIEW which fetches only the most recent entry for “KEY” columns and show it to user. You can apply “qualify row_number() over(partition by KEY_COL1,KEY_COL2 ORDER BY INSERT_DT desc)=1” in the view definition.

Let’s see one example for this:

drop table customer_2;

create table customer_2
(
c_mobile bigint,
c_name varchar(30),
c_dob date,
c_email varchar(100),
c_insert_dt date
)
primary index(c_mobile);
insert into customer_2 values (111111111,'Jon Groff','1980-03-10',,'2019-06-02');

insert into customer_2 values (222222222,'Kenneth Jarrett','1986-08-24',,'2019-06-02');

insert into customer_2 values (333333333,'Daniel Jensen','1972-08-22',,'2019-06-02');

insert into customer_2 values (444444444,'Jeffrey Kramer','1987-04-15',,'2019-06-02');

insert into customer_2 values (555555555,'Pete  Bartels','1989-11-16',,'2019-06-02');
select * from customer_2;
c_mobilec_namec_dobc_emailc_insert_dt
111111111Jon Groff1980-03-10NULL2019-06-02
222222222Kenneth Jarrett1986-08-24NULL2019-06-02
333333333Daniel Jensen1972-08-22NULL2019-06-02
444444444Jeffrey Kramer1987-04-15NULL2019-06-02
555555555Pete Bartels1989-11-16NULL2019-06-02

Now let’s load all the DELTA records from Source to Target:

insert into
	customer_2
select
	stg_customer.c_mobile,
	stg_customer.c_name,
	stg_customer.c_dob,
	stg_customer.c_email,
	current_date
from
	stg_customer;

Create a VIEW on top of Target table to show only active records to user

replace view vw_customer_2 as
select
	*
from
	customer_2 qualify row_number() over (partition by c_mobile
order by
	c_insert_dt desc)= 1

Fetch record from VIEW now

select * from vw_customer_2
c_mobilec_namec_dobc_emailc_insert_dt
111111111Jon Groff1980-03-10jon.groff@gmail.com2019-06-03
222222222Kenneth Jarrett1986-10-24NULL2019-06-03
333333333Daniel Craig1972-08-22NULL2019-06-03
444444444Jeffrey Kramer1987-04-15NULL2019-06-02
555555555Pete Bartels1989-11-16NULL2019-06-02
666666666John Snow1982-02-11NULL2019-06-03
777777777Arya Stark1998-05-16NULL2019-06-03

Note: This approach looks easy to implement however I will not recommend it. As the table size grows very fast and also query response time is higher during retrieval. However this approach is more suitable if you are running on platforms in which UPDATE operation comes with high cost especially on Hadoop like HIVE/SPARK etc.

Benefits of SCD2: Entire History is recorded. So good way to store information if analytics is required on data.

Not so good about SCD2: Table storage increases very fast. Require additional filter conditions to fetch active records.

Let me know if you have any query by leaving a comment.


Leave a Reply

Your email address will not be published. Required fields are marked *