Slowly changing dimensions or SCD are the dimensions that change over time. It called “slowly” because the change is expected to come rarely or rather I shall say not frequently. Let’s understand this with one real life example.
You went to a coffee shop and ordered some coffee and snacks. The cashier asked for your mobile number and name. Assuming that you are the first time customer an entry will be made in CUSTOMER table in the data warehouse. It may look like below with default value for some columns:
Interaction – 1 | |||||
Customer_ID | Customer_Name | Customer_Mobile | Customer_DOB | State | Entry_Date |
1 | Mark | 999-999-9999 | 01-01-1900 | NY | 24-05-2019 |
Next time , when you visit the coffee shop , cashier may ask you for your mobile number. Once you tell him, he may identify you as “Mark”. He may ask you about your Date-of-Birth (DOB) to share any special offer on your birthday. And you may decide to share this info or may be not with him. If you share that another info is added to the table in warehouse.
Interaction – 2 | |||||
Customer_ID | Customer_Name | Customer_Mobile | Customer_DOB | State | Entry_Date |
1 | Mark | 999-999-9999 | 18-04-1990 | NY | 29-05-2019 |
Now there can be multiple ways to store this info into data warehouse Table. And the manner in which you will store this info will determine the SCD type you are using. We will talk about popular SCD types used in any data warehouse.
SCD-1: In this type, you will simply overwrite existing information in the table. The existing record will be replaced with the new one.
SCD-2: In this type, you will make new entry into the table with existing record. Some identifier is used to easily identify which is the current record.
SCD-3: In this type, record is over-written in the extra ‘technical’ column which holds the most recent entry.
SCD-4: In this type, history i.e. all the records all maintained in the separate history table and the most current record is only kept in the main table.
There are few more hybrid SCD types available but above mentioned ones are the most common ones. In the subsequent posts, we will discuss each one of the SCD types in detail – how to implement with Teradata SQL, advantages and disadvantage of each.
In SCD-1, we overwrite the existing information in the table with the most recent info received from Source. This is very common SCD type used in the data warehouse and is good only for situations when you don’t want to keep historical data for that entity. Let’s quickly jump to technical example to understand it:
Create Target Table and load it with sample data:
drop table customer; create table customer ( c_mobile bigint, c_name varchar(30), c_dob date, c_email varchar(100), c_insert_dt date, c_update_dt date ) primary index(c_mobile); insert into customer values (111111111,'Jon Groff','1980-03-10',,date,date); insert into customer values (222222222,'Kenneth Jarrett','1986-08-24',,date,date); insert into customer values (333333333,'Daniel Jensen','1972-08-22',,date,date); insert into customer values (444444444,'Jeffrey Kramer','1987-04-15',,date,date); insert into customer values (555555555,'Pete Bartels','1989-11-16',,date,date);
The table data will look like
C_MOBILE | C_NAME | C_DOB | C_EMAIL | C_INSERT_DT | C_UPDATE_DT |
---|---|---|---|---|---|
111111111 | Jon Groff | 1980-03-10 | NULL | 2019-05-24 | 2019-05-24 |
222222222 | Kenneth Jarrett | 1986-08-24 | NULL | 2019-05-24 | 2019-05-24 |
333333333 | Daniel Jensen | 1972-08-22 | NULL | 2019-05-24 | 2019-05-24 |
444444444 | Jeffrey Kramer | 1987-04-15 | NULL | 2019-05-24 | 2019-05-24 |
555555555 | Pete Bartels | 1989-11-16 | NULL | 2019-05-24 | 2019-05-24 |
Now let’s create a staging table which holds most recent info shared from Source.
create table stg_customer ( c_mobile bigint, c_name varchar(30), c_dob date, c_email varchar(100) ) primary index(c_mobile); insert into stg_customer values (111111111,'Jon Groff','1980-03-10','jon.groff@gmail.com'); insert into stg_customer values (222222222,'Kenneth Jarrett','1986-10-24',NULL); insert into stg_customer values (333333333,'Daniel Craig','1972-08-22',NULL); insert into stg_customer values (666666666,'John Snow','1982-02-11',NULL); insert into stg_customer values (777777777,'Arya Stark','1998-05-16',NULL);
The data in staging table looks like
C_MOBILE | C_NAME | C_DOB | C_EMAIL |
---|---|---|---|
111111111 | Jon Groff | 1980-03-10 | jon.groff@gmail.com |
222222222 | Kenneth Jarrett | 1986-10-24 | NULL |
333333333 | Daniel Craig | 1972-08-22 | NULL |
666666666 | John Snow | 1982-02-11 | NULL |
777777777 | Arya Stark | 1998-05-16 | NULL |
Now to implement SCD-1 in Teradata , the easiest method is to run a MERGE statement on Target table “customer” using Staging table “stg_customer”
MERGE customer using stg_customer on customer.c_mobile = stg_customer.c_mobile WHEN MATCHED THEN UPDATE SET c_name = stg_customer.c_name, c_dob = stg_customer.c_dob, c_email = stg_customer.c_email, c_update_dt = current_date WHEN NOT MATCHED THEN INSERT values ( stg_customer.c_mobile, stg_customer.c_name, stg_customer.c_dob, stg_customer.c_email, current_date, current_date );
Now the data in Target table will look like
C_MOBILE | C_NAME | C_DOB | C_EMAIL | C_INSERT_DT | C_UPDATE_DT | COMMENT |
---|---|---|---|---|---|---|
111111111 | Jon Groff | 1980-03-10 | jon.groff@gmail.com | 2019-05-24 | 2019-05-25 | Email Update |
222222222 | Kenneth Jarrett | 1986-10-24 | NULL | 2019-05-24 | 2019-05-25 | DOB Update |
333333333 | Daniel Craig | 1972-08-22 | NULL | 2019-05-24 | 2019-05-25 | Name Update |
444444444 | Jeffrey Kramer | 1987-04-15 | NULL | 2019-05-24 | 2019-05-24 | No Change |
555555555 | Pete Bartels | 1989-11-16 | NULL | 2019-05-24 | 2019-05-24 | No Change |
666666666 | John Snow | 1982-02-11 | NULL | 2019-05-25 | 2019-05-25 | New Record |
777777777 | Arya Stark | 1998-05-16 | NULL | 2019-05-25 | 2019-05-25 | New Record |
*Comment is not the actual column of the table. It is just for understanding purpose I have added above.
You can also run separate UPDATE+INSERT to replicate same functionality.Now let’s talk about advantage and disadvantage of SCD-1.
Advantage: Easy to implement, Table Size stays in control, Unique records in the table, Table can be accessed directly without worrying about recent or old record.
Disadvantage: No history is maintained. May not be adequate for data analysis.
In SCD (Slowly Changing Dimension) type 2, we have to store all the historical information about any Dimension. So unlike SCD-1, we cannot simply overwrite the updated data we are getting from Source. There are multiple ways in which one can implement SCD-2 in Teradata and we will discuss about couple of most popular approach in this post.
Approach 1: Set existing record as INACTIVE and Insert new record as ACTIVE
In this approach , create 2 technical DATE/TIMESTAMP columns. If your job is intra-day I will suggest keeping timestamp else you can keep DATE for daily jobs. Let’s call it INSERT_DT & UPDATE_DT.
Step 1: Update Existing Rows to mark as INACTIVE
Run UPDATE statement on existing TARGET table with records from SOURCE. This is required to identify and close existing records if any new information has come for same record. So you have to match “KEY” column and update UPDATE_DT to (CURRENT_DATE – 1). No other column is modified.
Step 2: Insert new/updated rows as ACTIVE
Whenever we load record [INSERT] into SCD2 table, you will load INSERT_DT with CURRENT_DATE and UPDATE_DT will be some far future date like ‘2999-12-31’. Since you have already closed all the existing records for which new info has come and assuming you have only DELTA records in Source, you can INSERT all records in the Source to Target now.
Note: With this approach whenever you are querying SCD2 table and don’t want history or shall I say require only active records, you have to apply filter condition to fetch active records. Like “WHERE UPDATE_DT=’2999-12-31′“
Let’s see this with one example:
Step1: Creating Target & Source tables.
drop table customer; create table customer ( c_mobile bigint, c_name varchar(30), c_dob date, c_email varchar(100), c_insert_dt date, c_update_dt date ) primary index(c_mobile);
insert into customer values (111111111,'Jon Groff','1980-03-10',,'2019-06-02','2999-12-31'); insert into customer values (222222222,'Kenneth Jarrett','1986-08-24',,'2019-06-02','2999-12-31'); insert into customer values (333333333,'Daniel Jensen','1972-08-22',,'2019-06-02','2999-12-31'); insert into customer values (444444444,'Jeffrey Kramer','1987-04-15',,'2019-06-02','2999-12-31'); insert into customer values (555555555,'Pete Bartels','1989-11-16',,'2019-06-02','2999-12-31');
drop table stg_customer; create table stg_customer ( c_mobile bigint, c_name varchar(30), c_dob date, c_email varchar(100) ) primary index(c_mobile);
insert into stg_customer values (111111111,'Jon Groff','1980-03-10','jon.groff@gmail.com'); insert into stg_customer values (222222222,'Kenneth Jarrett','1986-10-24',NULL); insert into stg_customer values (333333333,'Daniel Craig','1972-08-22',NULL); insert into stg_customer values (666666666,'John Snow','1982-02-11',NULL); insert into stg_customer values (777777777,'Arya Stark','1998-05-16',NULL);
Let’s see how data looks like into the table now:
select * from customer;
c_mobile | c_name | c_dob | c_email | c_insert_dt | c_update_dt |
---|---|---|---|---|---|
111111111 | Jon Groff | 1980-03-10 | NULL | 2019-06-02 | 2999-12-31 |
222222222 | Kenneth Jarrett | 1986-08-24 | NULL | 2019-06-02 | 2999-12-31 |
333333333 | Daniel Jensen | 1972-08-22 | NULL | 2019-06-02 | 2999-12-31 |
444444444 | Jeffrey Kramer | 1987-04-15 | NULL | 2019-06-02 | 2999-12-31 |
555555555 | Pete Bartels | 1989-11-16 | NULL | 2019-06-02 | 2999-12-31 |
select * from stg_customer;
c_mobile | c_name | c_dob | c_email |
---|---|---|---|
111111111 | Jon Groff | 1980-03-10 | jon.groff@gmail.com |
222222222 | Kenneth Jarrett | 1986-10-24 | NULL |
333333333 | Daniel Craig | 1972-08-22 | NULL |
666666666 | John Snow | 1982-02-11 | NULL |
777777777 | Arya Stark | 1998-05-16 | NULL |
Now run the UPDATE command to modify UPDATE_DT and run INSERT to load records.
update tb1 from customer tb1, stg_customer tb2 set c_update_dt=current_date - 1 where tb1.c_mobile = tb2.c_mobile and tb1.c_update_dt='2999-12-31';
insert into customer select stg_customer.c_mobile, stg_customer.c_name, stg_customer.c_dob, stg_customer.c_email, current_date, '2999-12-31' from stg_customer;
Now data in Target will look like:
c_mobile | c_name | c_dob | c_email | c_insert_dt | c_update_dt |
---|---|---|---|---|---|
111111111 | Jon Groff | 1980-03-10 | NULL | 2019-06-02 | 2019-06-02 |
111111111 | Jon Groff | 1980-03-10 | jon.groff@gmail.com | 2019-06-03 | 2999-12-31 |
222222222 | Kenneth Jarrett | 1986-08-24 | NULL | 2019-06-02 | 2019-06-02 |
222222222 | Kenneth Jarrett | 1986-10-24 | NULL | 2019-06-03 | 2999-12-31 |
333333333 | Daniel Jensen | 1972-08-22 | NULL | 2019-06-02 | 2019-06-02 |
333333333 | Daniel Craig | 1972-08-22 | NULL | 2019-06-03 | 2999-12-31 |
444444444 | Jeffrey Kramer | 1987-04-15 | NULL | 2019-06-02 | 2999-12-31 |
555555555 | Pete Bartels | 1989-11-16 | NULL | 2019-06-02 | 2999-12-31 |
666666666 | John Snow | 1982-02-11 | NULL | 2019-06-03 | 2999-12-31 |
777777777 | Arya Stark | 1998-05-16 | NULL | 2019-06-03 | 2999-12-31 |
To fetch only the active records:
select * from customer where c_update_dt='2999-12-31';
c_mobile | c_name | c_dob | c_email | c_insert_dt | c_update_dt |
---|---|---|---|---|---|
111111111 | Jon Groff | 1980-03-10 | jon.groff@gmail.com | 2019-06-03 | 2999-12-31 |
222222222 | Kenneth Jarrett | 1986-10-24 | NULL | 2019-06-03 | 2999-12-31 |
333333333 | Daniel Craig | 1972-08-22 | NULL | 2019-06-03 | 2999-12-31 |
444444444 | Jeffrey Kramer | 1987-04-15 | NULL | 2019-06-02 | 2999-12-31 |
555555555 | Pete Bartels | 1989-11-16 | NULL | 2019-06-02 | 2999-12-31 |
666666666 | John Snow | 1982-02-11 | NULL | 2019-06-03 | 2999-12-31 |
777777777 | Arya Stark | 1998-05-16 | NULL | 2019-06-03 | 2999-12-31 |
Approach 2: Append all the records and apply Logic while reading data
This will require only 1 technical column, “INSERT_DT”. You can simply INSERT all the records in the Target table. Now on top of this table you can create VIEW which fetches only the most recent entry for “KEY” columns and show it to user. You can apply “qualify row_number() over(partition by KEY_COL1,KEY_COL2 ORDER BY INSERT_DT desc)=1” in the view definition.
Let’s see one example for this:
drop table customer_2; create table customer_2 ( c_mobile bigint, c_name varchar(30), c_dob date, c_email varchar(100), c_insert_dt date ) primary index(c_mobile);
insert into customer_2 values (111111111,'Jon Groff','1980-03-10',,'2019-06-02'); insert into customer_2 values (222222222,'Kenneth Jarrett','1986-08-24',,'2019-06-02'); insert into customer_2 values (333333333,'Daniel Jensen','1972-08-22',,'2019-06-02'); insert into customer_2 values (444444444,'Jeffrey Kramer','1987-04-15',,'2019-06-02'); insert into customer_2 values (555555555,'Pete Bartels','1989-11-16',,'2019-06-02');
select * from customer_2;
c_mobile | c_name | c_dob | c_email | c_insert_dt |
---|---|---|---|---|
111111111 | Jon Groff | 1980-03-10 | NULL | 2019-06-02 |
222222222 | Kenneth Jarrett | 1986-08-24 | NULL | 2019-06-02 |
333333333 | Daniel Jensen | 1972-08-22 | NULL | 2019-06-02 |
444444444 | Jeffrey Kramer | 1987-04-15 | NULL | 2019-06-02 |
555555555 | Pete Bartels | 1989-11-16 | NULL | 2019-06-02 |
Now let’s load all the DELTA records from Source to Target:
insert into customer_2 select stg_customer.c_mobile, stg_customer.c_name, stg_customer.c_dob, stg_customer.c_email, current_date from stg_customer;
Create a VIEW on top of Target table to show only active records to user
replace view vw_customer_2 as select * from customer_2 qualify row_number() over (partition by c_mobile order by c_insert_dt desc)= 1
Fetch record from VIEW now
select * from vw_customer_2
c_mobile | c_name | c_dob | c_email | c_insert_dt |
---|---|---|---|---|
111111111 | Jon Groff | 1980-03-10 | jon.groff@gmail.com | 2019-06-03 |
222222222 | Kenneth Jarrett | 1986-10-24 | NULL | 2019-06-03 |
333333333 | Daniel Craig | 1972-08-22 | NULL | 2019-06-03 |
444444444 | Jeffrey Kramer | 1987-04-15 | NULL | 2019-06-02 |
555555555 | Pete Bartels | 1989-11-16 | NULL | 2019-06-02 |
666666666 | John Snow | 1982-02-11 | NULL | 2019-06-03 |
777777777 | Arya Stark | 1998-05-16 | NULL | 2019-06-03 |
Note: This approach looks easy to implement however I will not recommend it. As the table size grows very fast and also query response time is higher during retrieval. However this approach is more suitable if you are running on platforms in which UPDATE operation comes with high cost especially on Hadoop like HIVE/SPARK etc.
Benefits of SCD2: Entire History is recorded. So good way to store information if analytics is required on data.
Not so good about SCD2: Table storage increases very fast. Require additional filter conditions to fetch active records.
Let me know if you have any query by leaving a comment.
Leave a Reply