In this post I will explain how we can create a Data Warehouse from scratch.
This is not a dive deep of DWH in which I will not go through the theoretical concepts of Data Warehouse. I assume you are already aware of that and wish to do some hands-on.
This post is focussed purely on implementation side of the DWH. But that does not mean we will not cover the important concepts applicable to any DWH. I am not going to follow bookish language or steps of building a DWH but what I feel is important and easy-to-follow for a beginner.
In my opinion , there are 5 steps to consider when building a DWH: Business requirement Analyse the input data Dimension modeling Create DB objects Load scripts
Project Overview
We are going to build a data warehouse for a big pizza store franchise and it operates across different cities in the country. As part of the initial POC, the business team has appointed to a store manager who will interact with us.
He will share the require data with us and we have to proceed with 5 step process to build a data warehouse. We will analyse the 4 csv files and come up with existing OLTP data model for those files.
Dataset is inspired from data available in Maven analytics (Pizza Place Sales)
For creating Data Model , my preferred tool nowadays is dbdiagram.io
It allows you to create Data Model programatically too hence it saves lot of time if you have too many data objects in-scope.
I have written a blog post that explains how you can use python to create data model programatically. You may want to read it as well.
The DBML to be used for given input CSV files.
Table pizzas { pizza_id varchar(100) pizza_type_id varchar(100) size varchar(5) price decimal(5,2) } Table pizza_types { pizza_type_id varchar(100) name varchar(100) category varchar(100) ingredients varchar(1000) } Table orders { order_id integer date date time time } Table order_details { order_details_id integer order_id integer pizza_id varchar(100) quantity smallint } Ref: orders.order_id < order_details.order_id Ref: pizzas.pizza_id < order_details.pizza_id Ref: pizzas.pizza_type_id > pizza_types.pizza_type_id
Dimensional Modeling
We will create dimensional model for our datawarehouse project now.
Followings points should be considered while creating Dimensional Model.
- The pizza/pizza types tables can be merged to one single product dimension.
- We can give it more generic name so that we can cater to more items in the future. So in place of pizza we will call it product.
- At later stage , we may split it or keep it merged – snowflake schema design where dimension is further split into smaller dimensions table to make it more atomic.
- For Orders we can merge it to make it one order table.
- Missing Dimension : Date/Time Dimension.
- Additionally we will create new Store Dimension as well to cater to future adhoc queries
- New fact tables – SalesOrder will be created to calculate measures.
- The final list of tables will be :
- dim_products
- dim_orders
- fact_sales
- dim_stores
- dim_date
- dim_time
- 2 Technical columns – dt_insert , rec_source are also added to implement ABC (Audit Balance Control) at start. We may add more technical columns at later stage as we proceed.
DBML code for Dimensional Model for our data warehouse project
Table dim_product { product_sk integer product_id varchar(100) product_type_id varchar(100) product_name varchar(100) product_category varchar(100) product_family varchar(100) product_ingredients varchar(1000) product_size varchar(5) product_price decimal(5,2) rec_source varchar(10) dt_insert date } Table dim_orders { order_sk integer order_id integer order_detail_id integer product_id varchar(100) date date time time rec_source varchar(10) dt_insert date } Table fact_sales_order { sales_order_sk integer order_sk integer product_sk integer store_sk integer date_id bigint time_id bigint product_price decimal(5,2) quantity smallint rec_source varchar(10) dt_insert date } Table dim_store { store_sk integer store_id varchar(100) store_name varchar(100) store_zip varchar(100) store_city varchar(100) store_staff smallint rec_source varchar(10) dt_insert date } Table dim_date { date_id bigint date date year smallint month smallint monthname varchar(20) day smallint dayofyear smallint weekdayname varchar(20) calendarweek smallint quartal varchar(5) weekend varchar(20) is_holiday varchar(3) period varchar(30) } Table dim_time { time_id bigint timeofday varchar(10) quarterhour varchar(20) daytimename varchar(20) daynight varchar(20) } Ref: dim_product.product_sk - fact_sales_order.product_sk Ref: dim_orders.order_sk - fact_sales_order.order_sk Ref: dim_store.store_sk - fact_sales_order.store_sk Ref: dim_date.date_id - fact_sales_order.date_id Ref: dim_time.time_id - fact_sales_order.time_id