Teradata – Slowly Changing Dimensions (SCD-1 & SCD-2) with examples

Slowly changing dimensions or SCD are the dimensions that change over time. It called “slowly” because the change is expected to come rarely or rather I shall say not frequently. Let’s understand this with one real life example.

You went to a coffee shop and ordered some coffee and snacks. The cashier asked for your mobile number and name. Assuming that you are the first time customer an entry will be made in CUSTOMER table in the data warehouse. It may look like below with default value for some columns:

Interaction – 1
Customer_ID	Customer_Name	Customer_Mobile	Customer_DOB	State	Entry_Date
1	Mark	999-999-9999	01-01-1900	NY	24-05-2019

Next time , when you visit the coffee shop , cashier may ask you for your mobile number. Once you tell him, he may identify you as “Mark”. He may ask you about your Date-of-Birth (DOB) to share any special offer on your birthday. And you may decide to share this info or may be not with him. If you share that another info is added to the table in warehouse.

Interaction – 2
Customer_ID	Customer_Name	Customer_Mobile	Customer_DOB	State	Entry_Date
1	Mark	999-999-9999	18-04-1990	NY	29-05-2019

Now there can be multiple ways to store this info into data warehouse Table. And the manner in which you will store this info will determine the SCD type you are using. We will talk about popular SCD types used in any data warehouse.

SCD-1: In this type, you will simply overwrite existing information in the table. The existing record will be replaced with the new one.

SCD-2: In this type, you will make new entry into the table with existing record. Some identifier is used to easily identify which is the current record.

SCD-3: In this type, record is over-written in the extra ‘technical’ column which holds the most recent entry.

SCD-4: In this type, history i.e. all the records all maintained in the separate history table and the most current record is only kept in the main table.

There are few more hybrid SCD types available but above mentioned ones are the most common ones. In the subsequent posts, we will discuss each one of the SCD types in detail – how to implement with Teradata SQL, advantages and disadvantage of each.

In SCD-1, we overwrite the existing information in the table with the most recent info received from Source. This is very common SCD type used in the data warehouse and is good only for situations when you don’t want to keep historical data for that entity. Let’s quickly jump to technical example to understand it:

Create Target Table and load it with sample data:

drop table customer;
create table customer
(
c_mobile bigint,
c_name varchar(30),
c_dob date,
c_email varchar(100),
c_insert_dt date,
c_update_dt date
)
primary index(c_mobile);

insert into customer values (111111111,'Jon Groff','1980-03-10',,date,date);
insert into customer values (222222222,'Kenneth Jarrett','1986-08-24',,date,date);
insert into customer values (333333333,'Daniel Jensen','1972-08-22',,date,date);
insert into customer values (444444444,'Jeffrey Kramer','1987-04-15',,date,date);
insert into customer values (555555555,'Pete  Bartels','1989-11-16',,date,date);

The table data will look like

C_MOBILE	C_NAME	C_DOB	C_EMAIL	C_INSERT_DT	C_UPDATE_DT
111111111	Jon Groff	1980-03-10	NULL	2019-05-24	2019-05-24
222222222	Kenneth Jarrett	1986-08-24	NULL	2019-05-24	2019-05-24
333333333	Daniel Jensen	1972-08-22	NULL	2019-05-24	2019-05-24
444444444	Jeffrey Kramer	1987-04-15	NULL	2019-05-24	2019-05-24
555555555	Pete Bartels	1989-11-16	NULL	2019-05-24	2019-05-24

Now let’s create a staging table which holds most recent info shared from Source.

create table stg_customer
(
c_mobile bigint,
c_name varchar(30),
c_dob date,
c_email varchar(100)
)
primary index(c_mobile);

insert into stg_customer values (111111111,'Jon Groff','1980-03-10','jon.groff@gmail.com');
insert into stg_customer values (222222222,'Kenneth Jarrett','1986-10-24',NULL);
insert into stg_customer values (333333333,'Daniel Craig','1972-08-22',NULL);
insert into stg_customer values (666666666,'John Snow','1982-02-11',NULL);
insert into stg_customer values (777777777,'Arya Stark','1998-05-16',NULL);

The data in staging table looks like

C_MOBILE	C_NAME	C_DOB	C_EMAIL
111111111	Jon Groff	1980-03-10	jon.groff@gmail.com
222222222	Kenneth Jarrett	1986-10-24	NULL
333333333	Daniel Craig	1972-08-22	NULL
666666666	John Snow	1982-02-11	NULL
777777777	Arya Stark	1998-05-16	NULL

Now to implement SCD-1 in Teradata , the easiest method is to run a MERGE statement on Target table “customer” using Staging table “stg_customer”

MERGE customer using stg_customer on 
customer.c_mobile = stg_customer.c_mobile
WHEN MATCHED THEN UPDATE 
SET
c_name = stg_customer.c_name,
c_dob = stg_customer.c_dob,
c_email = stg_customer.c_email,
c_update_dt = current_date
WHEN NOT MATCHED THEN
INSERT values (
stg_customer.c_mobile,
stg_customer.c_name,
stg_customer.c_dob,
stg_customer.c_email,
current_date,
current_date
);

Now the data in Target table will look like

C_MOBILE	C_NAME	C_DOB	C_EMAIL	C_INSERT_DT	C_UPDATE_DT	COMMENT
111111111	Jon Groff	1980-03-10	jon.groff@gmail.com	2019-05-24	2019-05-25	Email Update
222222222	Kenneth Jarrett	1986-10-24	NULL	2019-05-24	2019-05-25	DOB Update
333333333	Daniel Craig	1972-08-22	NULL	2019-05-24	2019-05-25	Name Update
444444444	Jeffrey Kramer	1987-04-15	NULL	2019-05-24	2019-05-24	No Change
555555555	Pete Bartels	1989-11-16	NULL	2019-05-24	2019-05-24	No Change
666666666	John Snow	1982-02-11	NULL	2019-05-25	2019-05-25	New Record
777777777	Arya Stark	1998-05-16	NULL	2019-05-25	2019-05-25	New Record

*Comment is not the actual column of the table. It is just for understanding purpose I have added above.

You can also run separate UPDATE+INSERT to replicate same functionality.Now let’s talk about advantage and disadvantage of SCD-1.

Advantage: Easy to implement, Table Size stays in control, Unique records in the table, Table can be accessed directly without worrying about recent or old record.

Disadvantage: No history is maintained. May not be adequate for data analysis.

In SCD (Slowly Changing Dimension) type 2, we have to store all the historical information about any Dimension. So unlike SCD-1, we cannot simply overwrite the updated data we are getting from Source. There are multiple ways in which one can implement SCD-2 in Teradata and we will discuss about couple of most popular approach in this post.

Approach 1: Set existing record as INACTIVE and Insert new record as ACTIVE

In this approach , create 2 technical DATE/TIMESTAMP columns. If your job is intra-day I will suggest keeping timestamp else you can keep DATE for daily jobs. Let’s call it INSERT_DT & UPDATE_DT.

Step 1: Update Existing Rows to mark as INACTIVE

Run UPDATE statement on existing TARGET table with records from SOURCE. This is required to identify and close existing records if any new information has come for same record. So you have to match “KEY” column and update UPDATE_DT to (CURRENT_DATE – 1). No other column is modified.

Step 2: Insert new/updated rows as ACTIVE

Whenever we load record [INSERT] into SCD2 table, you will load INSERT_DT with CURRENT_DATE and UPDATE_DT will be some far future date like ‘2999-12-31’. Since you have already closed all the existing records for which new info has come and assuming you have only DELTA records in Source, you can INSERT all records in the Source to Target now.

Note: With this approach whenever you are querying SCD2 table and don’t want history or shall I say require only active records, you have to apply filter condition to fetch active records. Like “WHERE UPDATE_DT=’2999-12-31′“

Let’s see this with one example:

Step1: Creating Target & Source tables.

drop table customer;

create table customer
(
c_mobile bigint,
c_name varchar(30),
c_dob date,
c_email varchar(100),
c_insert_dt date,
c_update_dt date
)
primary index(c_mobile);

insert into customer values (111111111,'Jon Groff','1980-03-10',,'2019-06-02','2999-12-31');

insert into customer values (222222222,'Kenneth Jarrett','1986-08-24',,'2019-06-02','2999-12-31');

insert into customer values (333333333,'Daniel Jensen','1972-08-22',,'2019-06-02','2999-12-31');

insert into customer values (444444444,'Jeffrey Kramer','1987-04-15',,'2019-06-02','2999-12-31');

insert into customer values (555555555,'Pete  Bartels','1989-11-16',,'2019-06-02','2999-12-31');

drop table stg_customer;

create table stg_customer
(
c_mobile bigint,
c_name varchar(30),
c_dob date,
c_email varchar(100)
)
primary index(c_mobile);

insert into stg_customer values (111111111,'Jon Groff','1980-03-10','jon.groff@gmail.com');

insert into stg_customer values (222222222,'Kenneth Jarrett','1986-10-24',NULL);

insert into stg_customer values (333333333,'Daniel Craig','1972-08-22',NULL);

insert into stg_customer values (666666666,'John Snow','1982-02-11',NULL);

insert into stg_customer values (777777777,'Arya Stark','1998-05-16',NULL);

Let’s see how data looks like into the table now:

select * from customer;

c_mobile	c_name	c_dob	c_email	c_insert_dt	c_update_dt
111111111	Jon Groff	1980-03-10	NULL	2019-06-02	2999-12-31
222222222	Kenneth Jarrett	1986-08-24	NULL	2019-06-02	2999-12-31
333333333	Daniel Jensen	1972-08-22	NULL	2019-06-02	2999-12-31
444444444	Jeffrey Kramer	1987-04-15	NULL	2019-06-02	2999-12-31
555555555	Pete Bartels	1989-11-16	NULL	2019-06-02	2999-12-31

select * from stg_customer;

c_mobile	c_name	c_dob	c_email
111111111	Jon Groff	1980-03-10	jon.groff@gmail.com
222222222	Kenneth Jarrett	1986-10-24	NULL
333333333	Daniel Craig	1972-08-22	NULL
666666666	John Snow	1982-02-11	NULL
777777777	Arya Stark	1998-05-16	NULL

Now run the UPDATE command to modify UPDATE_DT and run INSERT to load records.

update tb1 from customer tb1, stg_customer tb2 
 set c_update_dt=current_date - 1 
 where tb1.c_mobile = tb2.c_mobile 
and tb1.c_update_dt='2999-12-31';

insert into customer select stg_customer.c_mobile,
stg_customer.c_name,
stg_customer.c_dob,
stg_customer.c_email,
current_date,
'2999-12-31'
from stg_customer;

Now data in Target will look like:

c_mobile	c_name	c_dob	c_email	c_insert_dt	c_update_dt
111111111	Jon Groff	1980-03-10	NULL	2019-06-02	2019-06-02
111111111	Jon Groff	1980-03-10	jon.groff@gmail.com	2019-06-03	2999-12-31
222222222	Kenneth Jarrett	1986-08-24	NULL	2019-06-02	2019-06-02
222222222	Kenneth Jarrett	1986-10-24	NULL	2019-06-03	2999-12-31
333333333	Daniel Jensen	1972-08-22	NULL	2019-06-02	2019-06-02
333333333	Daniel Craig	1972-08-22	NULL	2019-06-03	2999-12-31
444444444	Jeffrey Kramer	1987-04-15	NULL	2019-06-02	2999-12-31
555555555	Pete Bartels	1989-11-16	NULL	2019-06-02	2999-12-31
666666666	John Snow	1982-02-11	NULL	2019-06-03	2999-12-31
777777777	Arya Stark	1998-05-16	NULL	2019-06-03	2999-12-31

To fetch only the active records:

select * from customer where c_update_dt='2999-12-31';

c_mobile	c_name	c_dob	c_email	c_insert_dt	c_update_dt
111111111	Jon Groff	1980-03-10	jon.groff@gmail.com	2019-06-03	2999-12-31
222222222	Kenneth Jarrett	1986-10-24	NULL	2019-06-03	2999-12-31
333333333	Daniel Craig	1972-08-22	NULL	2019-06-03	2999-12-31
444444444	Jeffrey Kramer	1987-04-15	NULL	2019-06-02	2999-12-31
555555555	Pete Bartels	1989-11-16	NULL	2019-06-02	2999-12-31
666666666	John Snow	1982-02-11	NULL	2019-06-03	2999-12-31
777777777	Arya Stark	1998-05-16	NULL	2019-06-03	2999-12-31

Approach 2: Append all the records and apply Logic while reading data

This will require only 1 technical column, “INSERT_DT”. You can simply INSERT all the records in the Target table. Now on top of this table you can create VIEW which fetches only the most recent entry for “KEY” columns and show it to user. You can apply “qualify row_number() over(partition by KEY_COL1,KEY_COL2 ORDER BY INSERT_DT desc)=1” in the view definition.

Let’s see one example for this:

drop table customer_2;

create table customer_2
(
c_mobile bigint,
c_name varchar(30),
c_dob date,
c_email varchar(100),
c_insert_dt date
)
primary index(c_mobile);

insert into customer_2 values (111111111,'Jon Groff','1980-03-10',,'2019-06-02');

insert into customer_2 values (222222222,'Kenneth Jarrett','1986-08-24',,'2019-06-02');

insert into customer_2 values (333333333,'Daniel Jensen','1972-08-22',,'2019-06-02');

insert into customer_2 values (444444444,'Jeffrey Kramer','1987-04-15',,'2019-06-02');

insert into customer_2 values (555555555,'Pete  Bartels','1989-11-16',,'2019-06-02');

select * from customer_2;

c_mobile	c_name	c_dob	c_email	c_insert_dt
111111111	Jon Groff	1980-03-10	NULL	2019-06-02
222222222	Kenneth Jarrett	1986-08-24	NULL	2019-06-02
333333333	Daniel Jensen	1972-08-22	NULL	2019-06-02
444444444	Jeffrey Kramer	1987-04-15	NULL	2019-06-02
555555555	Pete Bartels	1989-11-16	NULL	2019-06-02

Now let’s load all the DELTA records from Source to Target:

insert into
	customer_2
select
	stg_customer.c_mobile,
	stg_customer.c_name,
	stg_customer.c_dob,
	stg_customer.c_email,
	current_date
from
	stg_customer;

Create a VIEW on top of Target table to show only active records to user

replace view vw_customer_2 as
select
	*
from
	customer_2 qualify row_number() over (partition by c_mobile
order by
	c_insert_dt desc)= 1

Fetch record from VIEW now

select * from vw_customer_2

c_mobile	c_name	c_dob	c_email	c_insert_dt
111111111	Jon Groff	1980-03-10	jon.groff@gmail.com	2019-06-03
222222222	Kenneth Jarrett	1986-10-24	NULL	2019-06-03
333333333	Daniel Craig	1972-08-22	NULL	2019-06-03
444444444	Jeffrey Kramer	1987-04-15	NULL	2019-06-02
555555555	Pete Bartels	1989-11-16	NULL	2019-06-02
666666666	John Snow	1982-02-11	NULL	2019-06-03
777777777	Arya Stark	1998-05-16	NULL	2019-06-03

Note: This approach looks easy to implement however I will not recommend it. As the table size grows very fast and also query response time is higher during retrieval. However this approach is more suitable if you are running on platforms in which UPDATE operation comes with high cost especially on Hadoop like HIVE/SPARK etc.

Benefits of SCD2: Entire History is recorded. So good way to store information if analytics is required on data.

Not so good about SCD2: Table storage increases very fast. Require additional filter conditions to fetch active records.

Let me know if you have any query by leaving a comment.

ETL SQL