SCD-2 in Teradata

In SCD (Slowly Changing Dimension) type 2, we have to store all the historical information about any Dimension. So unlike SCD-1, we cannot simply overwrite the updated data we are getting from Source. There are multiple ways in which one can implement SCD-2 in Teradata and we will discuss about couple of most popular approach in this post.

Approach 1: Set existing record as INACTIVE and Insert new record as ACTIVE

In this approach , create 2 technical DATE/TIMESTAMP columns. If your job is intra-day I will suggest keeping timestamp else you can keep DATE for daily jobs. Let's call it INSERT_DT & UPDATE_DT.

Step 1: Update Existing Rows to mark as INACTIVE

Run UPDATE statement on existing TARGET table with records from SOURCE. This is required to identify and close existing records if any new information has come for same record. So you have to match "KEY" column and update UPDATE_DT to (CURRENT_DATE – 1). No other column is modified.

Step 2: Insert new/updated rows as ACTIVE

Whenever we load record [INSERT] into SCD2 table, you will load INSERT_DT with CURRENT_DATE and UPDATE_DT will be some far future date like '2999-12-31'. Since you have already closed all the existing records for which new info has come and assuming you have only DELTA records in Source, you can INSERT all records in the Source to Target now.

Note: With this approach whenever you are querying SCD2 table and don't want history or shall I say require only active records, you have to apply filter condition to fetch active records. Like "WHERE UPDATE_DT='2999-12-31'"

Let's see this with one example:

Step1: Creating Target & Source tables.

drop table customer;

create table customer
(
c_mobile bigint,
c_name varchar(30),
c_dob date,
c_email varchar(100),
c_insert_dt date,
c_update_dt date
)
primary index(c_mobile);
insert into customer values (111111111,'Jon Groff','1980-03-10',,'2019-06-02','2999-12-31');

insert into customer values (222222222,'Kenneth Jarrett','1986-08-24',,'2019-06-02','2999-12-31');

insert into customer values (333333333,'Daniel Jensen','1972-08-22',,'2019-06-02','2999-12-31');

insert into customer values (444444444,'Jeffrey Kramer','1987-04-15',,'2019-06-02','2999-12-31');

insert into customer values (555555555,'Pete  Bartels','1989-11-16',,'2019-06-02','2999-12-31');
drop table stg_customer;

create table stg_customer
(
c_mobile bigint,
c_name varchar(30),
c_dob date,
c_email varchar(100)
)
primary index(c_mobile);
insert into stg_customer values (111111111,'Jon Groff','1980-03-10','jon.groff@gmail.com');

insert into stg_customer values (222222222,'Kenneth Jarrett','1986-10-24',NULL);

insert into stg_customer values (333333333,'Daniel Craig','1972-08-22',NULL);

insert into stg_customer values (666666666,'John Snow','1982-02-11',NULL);

insert into stg_customer values (777777777,'Arya Stark','1998-05-16',NULL);

Let's see how data looks like into the table now:

select * from customer; 
c_mobilec_namec_dobc_emailc_insert_dtc_update_dt
111111111Jon Groff1980-03-10NULL2019-06-022999-12-31
222222222Kenneth Jarrett1986-08-24NULL2019-06-022999-12-31
333333333Daniel Jensen1972-08-22NULL2019-06-022999-12-31
444444444Jeffrey Kramer1987-04-15NULL2019-06-022999-12-31
555555555Pete Bartels1989-11-16NULL2019-06-022999-12-31
select * from stg_customer;
c_mobilec_namec_dobc_email
111111111Jon Groff1980-03-10jon.groff@gmail.com
222222222Kenneth Jarrett1986-10-24NULL
333333333Daniel Craig1972-08-22NULL
666666666John Snow1982-02-11NULL
777777777Arya Stark1998-05-16NULL

Now run the UPDATE command to modify UPDATE_DT and run INSERT to load records.

update tb1 from customer tb1, stg_customer tb2 
 set c_update_dt=current_date - 1 
 where tb1.c_mobile = tb2.c_mobile 
and tb1.c_update_dt='2999-12-31';
insert into customer select stg_customer.c_mobile,
stg_customer.c_name,
stg_customer.c_dob,
stg_customer.c_email,
current_date,
'2999-12-31'
from stg_customer;

Now data in Target will look like:

c_mobilec_namec_dobc_emailc_insert_dtc_update_dt
111111111Jon Groff1980-03-10NULL2019-06-022019-06-02
111111111Jon Groff1980-03-10jon.groff@gmail.com2019-06-032999-12-31
222222222Kenneth Jarrett1986-08-24NULL2019-06-022019-06-02
222222222Kenneth Jarrett1986-10-24NULL2019-06-032999-12-31
333333333Daniel Jensen1972-08-22NULL2019-06-022019-06-02
333333333Daniel Craig1972-08-22NULL2019-06-032999-12-31
444444444Jeffrey Kramer1987-04-15NULL2019-06-022999-12-31
555555555Pete Bartels1989-11-16NULL2019-06-022999-12-31
666666666John Snow1982-02-11NULL2019-06-032999-12-31
777777777Arya Stark1998-05-16NULL2019-06-032999-12-31

To fetch only the active records:

select * from customer where c_update_dt='2999-12-31';
c_mobilec_namec_dobc_emailc_insert_dtc_update_dt
111111111Jon Groff1980-03-10jon.groff@gmail.com2019-06-032999-12-31
222222222Kenneth Jarrett1986-10-24NULL2019-06-032999-12-31
333333333Daniel Craig1972-08-22NULL2019-06-032999-12-31
444444444Jeffrey Kramer1987-04-15NULL2019-06-022999-12-31
555555555Pete Bartels1989-11-16NULL2019-06-022999-12-31
666666666John Snow1982-02-11NULL2019-06-032999-12-31
777777777Arya Stark1998-05-16NULL2019-06-032999-12-31

Approach 2: Append all the records and apply Logic while reading data

This will require only 1 technical column, "INSERT_DT". You can simply INSERT all the records in the Target table. Now on top of this table you can create VIEW which fetches only the most recent entry for "KEY" columns and show it to user. You can apply "qualify row_number() over(partition by KEY_COL1,KEY_COL2 ORDER BY INSERT_DT desc)=1" in the view definition.

Let's see one example for this:

drop table customer_2;

create table customer_2
(
c_mobile bigint,
c_name varchar(30),
c_dob date,
c_email varchar(100),
c_insert_dt date
)
primary index(c_mobile);
insert into customer_2 values (111111111,'Jon Groff','1980-03-10',,'2019-06-02');

insert into customer_2 values (222222222,'Kenneth Jarrett','1986-08-24',,'2019-06-02');

insert into customer_2 values (333333333,'Daniel Jensen','1972-08-22',,'2019-06-02');

insert into customer_2 values (444444444,'Jeffrey Kramer','1987-04-15',,'2019-06-02');

insert into customer_2 values (555555555,'Pete  Bartels','1989-11-16',,'2019-06-02');
select * from customer_2;
c_mobilec_namec_dobc_emailc_insert_dt
111111111Jon Groff1980-03-10NULL2019-06-02
222222222Kenneth Jarrett1986-08-24NULL2019-06-02
333333333Daniel Jensen1972-08-22NULL2019-06-02
444444444Jeffrey Kramer1987-04-15NULL2019-06-02
555555555Pete Bartels1989-11-16NULL2019-06-02

Now let's load all the DELTA records from Source to Target:

insert into
	customer_2
select
	stg_customer.c_mobile,
	stg_customer.c_name,
	stg_customer.c_dob,
	stg_customer.c_email,
	current_date
from
	stg_customer;

Create a VIEW on top of Target table to show only active records to user

replace view vw_customer_2 as
select
	*
from
	customer_2 qualify row_number() over (partition by c_mobile
order by
	c_insert_dt desc)= 1

Fetch record from VIEW now

select * from vw_customer_2
c_mobilec_namec_dobc_emailc_insert_dt
111111111Jon Groff1980-03-10jon.groff@gmail.com2019-06-03
222222222Kenneth Jarrett1986-10-24NULL2019-06-03
333333333Daniel Craig1972-08-22NULL2019-06-03
444444444Jeffrey Kramer1987-04-15NULL2019-06-02
555555555Pete Bartels1989-11-16NULL2019-06-02
666666666John Snow1982-02-11NULL2019-06-03
777777777Arya Stark1998-05-16NULL2019-06-03

Note: This approach looks easy to implement however I will not recommend it. As the table size grows very fast and also query response time is higher during retrieval. However this approach is more suitable if you are running on platforms in which UPDATE operation comes with high cost especially on Hadoop like HIVE/SPARK etc.

Benefits of SCD2: Entire History is recorded. So good way to store information if analytics is required on data.

Not so good about SCD2: Table storage increases very fast. Require additional filter conditions to fetch active records.

Let me know if you have any query by leaving a comment.

3 thoughts on “SCD-2 in Teradata

  1. The below update in the 1st approach needs correction.

    update tb1 from customer tb1, stg_customer tb2
    set c_update_dt=current_date – 1
    where tb1.c_mobile = tb2.c_mobile

    As per my understanding, we should only consider the active records for updation in SCD2 otherwise next day-delta will mess up the history. so below filter is required.

    and tbl1.c_update_dt='2999-12-31'

Leave a Reply

Your email address will not be published. Required fields are marked *