In SCD (Slowly Changing Dimension) type 2, we have to store all the historical information about any Dimension. So unlike SCD-1, we cannot simply overwrite the updated data we are getting from Source. There are multiple ways in which one can implement SCD-2 in Teradata and we will discuss about couple of most popular approach in this post.
Approach 1: Set existing record as INACTIVE and Insert new record as ACTIVE
In this approach , create 2 technical DATE/TIMESTAMP columns. If your job is intra-day I will suggest keeping timestamp else you can keep DATE for daily jobs. Let's call it INSERT_DT & UPDATE_DT.
Step 1: Update Existing Rows to mark as INACTIVE
Run UPDATE statement on existing TARGET table with records from SOURCE. This is required to identify and close existing records if any new information has come for same record. So you have to match "KEY" column and update UPDATE_DT to (CURRENT_DATE – 1). No other column is modified.
Step 2: Insert new/updated rows as ACTIVE
Whenever we load record [INSERT] into SCD2 table, you will load INSERT_DT with CURRENT_DATE and UPDATE_DT will be some far future date like '2999-12-31'. Since you have already closed all the existing records for which new info has come and assuming you have only DELTA records in Source, you can INSERT all records in the Source to Target now.
Note: With this approach whenever you are querying SCD2 table and don't want history or shall I say require only active records, you have to apply filter condition to fetch active records. Like "WHERE UPDATE_DT='2999-12-31'"
Let's see this with one example:
Step1: Creating Target & Source tables.
drop table customer; create table customer ( c_mobile bigint, c_name varchar(30), c_dob date, c_email varchar(100), c_insert_dt date, c_update_dt date ) primary index(c_mobile);
insert into customer values (111111111,'Jon Groff','1980-03-10',,'2019-06-02','2999-12-31'); insert into customer values (222222222,'Kenneth Jarrett','1986-08-24',,'2019-06-02','2999-12-31'); insert into customer values (333333333,'Daniel Jensen','1972-08-22',,'2019-06-02','2999-12-31'); insert into customer values (444444444,'Jeffrey Kramer','1987-04-15',,'2019-06-02','2999-12-31'); insert into customer values (555555555,'Pete Bartels','1989-11-16',,'2019-06-02','2999-12-31');
drop table stg_customer; create table stg_customer ( c_mobile bigint, c_name varchar(30), c_dob date, c_email varchar(100) ) primary index(c_mobile);
insert into stg_customer values (111111111,'Jon Groff','1980-03-10','jon.groff@gmail.com'); insert into stg_customer values (222222222,'Kenneth Jarrett','1986-10-24',NULL); insert into stg_customer values (333333333,'Daniel Craig','1972-08-22',NULL); insert into stg_customer values (666666666,'John Snow','1982-02-11',NULL); insert into stg_customer values (777777777,'Arya Stark','1998-05-16',NULL);
Let's see how data looks like into the table now:
select * from customer;
c_mobile | c_name | c_dob | c_email | c_insert_dt | c_update_dt |
---|---|---|---|---|---|
111111111 | Jon Groff | 1980-03-10 | NULL | 2019-06-02 | 2999-12-31 |
222222222 | Kenneth Jarrett | 1986-08-24 | NULL | 2019-06-02 | 2999-12-31 |
333333333 | Daniel Jensen | 1972-08-22 | NULL | 2019-06-02 | 2999-12-31 |
444444444 | Jeffrey Kramer | 1987-04-15 | NULL | 2019-06-02 | 2999-12-31 |
555555555 | Pete Bartels | 1989-11-16 | NULL | 2019-06-02 | 2999-12-31 |
select * from stg_customer;
c_mobile | c_name | c_dob | c_email |
---|---|---|---|
111111111 | Jon Groff | 1980-03-10 | jon.groff@gmail.com |
222222222 | Kenneth Jarrett | 1986-10-24 | NULL |
333333333 | Daniel Craig | 1972-08-22 | NULL |
666666666 | John Snow | 1982-02-11 | NULL |
777777777 | Arya Stark | 1998-05-16 | NULL |
Now run the UPDATE command to modify UPDATE_DT and run INSERT to load records.
update tb1 from customer tb1, stg_customer tb2 set c_update_dt=current_date - 1 where tb1.c_mobile = tb2.c_mobile and tb1.c_update_dt='2999-12-31';
insert into customer select stg_customer.c_mobile, stg_customer.c_name, stg_customer.c_dob, stg_customer.c_email, current_date, '2999-12-31' from stg_customer;
Now data in Target will look like:
c_mobile | c_name | c_dob | c_email | c_insert_dt | c_update_dt |
---|---|---|---|---|---|
111111111 | Jon Groff | 1980-03-10 | NULL | 2019-06-02 | 2019-06-02 |
111111111 | Jon Groff | 1980-03-10 | jon.groff@gmail.com | 2019-06-03 | 2999-12-31 |
222222222 | Kenneth Jarrett | 1986-08-24 | NULL | 2019-06-02 | 2019-06-02 |
222222222 | Kenneth Jarrett | 1986-10-24 | NULL | 2019-06-03 | 2999-12-31 |
333333333 | Daniel Jensen | 1972-08-22 | NULL | 2019-06-02 | 2019-06-02 |
333333333 | Daniel Craig | 1972-08-22 | NULL | 2019-06-03 | 2999-12-31 |
444444444 | Jeffrey Kramer | 1987-04-15 | NULL | 2019-06-02 | 2999-12-31 |
555555555 | Pete Bartels | 1989-11-16 | NULL | 2019-06-02 | 2999-12-31 |
666666666 | John Snow | 1982-02-11 | NULL | 2019-06-03 | 2999-12-31 |
777777777 | Arya Stark | 1998-05-16 | NULL | 2019-06-03 | 2999-12-31 |
To fetch only the active records:
select * from customer where c_update_dt='2999-12-31';
c_mobile | c_name | c_dob | c_email | c_insert_dt | c_update_dt |
---|---|---|---|---|---|
111111111 | Jon Groff | 1980-03-10 | jon.groff@gmail.com | 2019-06-03 | 2999-12-31 |
222222222 | Kenneth Jarrett | 1986-10-24 | NULL | 2019-06-03 | 2999-12-31 |
333333333 | Daniel Craig | 1972-08-22 | NULL | 2019-06-03 | 2999-12-31 |
444444444 | Jeffrey Kramer | 1987-04-15 | NULL | 2019-06-02 | 2999-12-31 |
555555555 | Pete Bartels | 1989-11-16 | NULL | 2019-06-02 | 2999-12-31 |
666666666 | John Snow | 1982-02-11 | NULL | 2019-06-03 | 2999-12-31 |
777777777 | Arya Stark | 1998-05-16 | NULL | 2019-06-03 | 2999-12-31 |
Approach 2: Append all the records and apply Logic while reading data
This will require only 1 technical column, "INSERT_DT". You can simply INSERT all the records in the Target table. Now on top of this table you can create VIEW which fetches only the most recent entry for "KEY" columns and show it to user. You can apply "qualify row_number() over(partition by KEY_COL1,KEY_COL2 ORDER BY INSERT_DT desc)=1" in the view definition.
Let's see one example for this:
drop table customer_2; create table customer_2 ( c_mobile bigint, c_name varchar(30), c_dob date, c_email varchar(100), c_insert_dt date ) primary index(c_mobile);
insert into customer_2 values (111111111,'Jon Groff','1980-03-10',,'2019-06-02'); insert into customer_2 values (222222222,'Kenneth Jarrett','1986-08-24',,'2019-06-02'); insert into customer_2 values (333333333,'Daniel Jensen','1972-08-22',,'2019-06-02'); insert into customer_2 values (444444444,'Jeffrey Kramer','1987-04-15',,'2019-06-02'); insert into customer_2 values (555555555,'Pete Bartels','1989-11-16',,'2019-06-02');
select * from customer_2;
c_mobile | c_name | c_dob | c_email | c_insert_dt |
---|---|---|---|---|
111111111 | Jon Groff | 1980-03-10 | NULL | 2019-06-02 |
222222222 | Kenneth Jarrett | 1986-08-24 | NULL | 2019-06-02 |
333333333 | Daniel Jensen | 1972-08-22 | NULL | 2019-06-02 |
444444444 | Jeffrey Kramer | 1987-04-15 | NULL | 2019-06-02 |
555555555 | Pete Bartels | 1989-11-16 | NULL | 2019-06-02 |
Now let's load all the DELTA records from Source to Target:
insert into customer_2 select stg_customer.c_mobile, stg_customer.c_name, stg_customer.c_dob, stg_customer.c_email, current_date from stg_customer;
Create a VIEW on top of Target table to show only active records to user
replace view vw_customer_2 as select * from customer_2 qualify row_number() over (partition by c_mobile order by c_insert_dt desc)= 1
Fetch record from VIEW now
select * from vw_customer_2
c_mobile | c_name | c_dob | c_email | c_insert_dt |
---|---|---|---|---|
111111111 | Jon Groff | 1980-03-10 | jon.groff@gmail.com | 2019-06-03 |
222222222 | Kenneth Jarrett | 1986-10-24 | NULL | 2019-06-03 |
333333333 | Daniel Craig | 1972-08-22 | NULL | 2019-06-03 |
444444444 | Jeffrey Kramer | 1987-04-15 | NULL | 2019-06-02 |
555555555 | Pete Bartels | 1989-11-16 | NULL | 2019-06-02 |
666666666 | John Snow | 1982-02-11 | NULL | 2019-06-03 |
777777777 | Arya Stark | 1998-05-16 | NULL | 2019-06-03 |
Note: This approach looks easy to implement however I will not recommend it. As the table size grows very fast and also query response time is higher during retrieval. However this approach is more suitable if you are running on platforms in which UPDATE operation comes with high cost especially on Hadoop like HIVE/SPARK etc.
Benefits of SCD2: Entire History is recorded. So good way to store information if analytics is required on data.
Not so good about SCD2: Table storage increases very fast. Require additional filter conditions to fetch active records.
Let me know if you have any query by leaving a comment.
3 thoughts on “SCD-2 in Teradata”
Raj could you help us by letting us know how to proceed with data cleaning in teradata studio
The below update in the 1st approach needs correction.
update tb1 from customer tb1, stg_customer tb2
set c_update_dt=current_date – 1
where tb1.c_mobile = tb2.c_mobile
As per my understanding, we should only consider the active records for updation in SCD2 otherwise next day-delta will mess up the history. so below filter is required.
and tbl1.c_update_dt='2999-12-31'
Hi Aniket.
Your understanding is absolutely right.
Thanks for pointing it out and I have fixed the code now.
Best
Raj