Build Data Warehouse from Scratch | Dummy Project with codebase

In this post I will explain how we can create a Data Warehouse from scratch.

This is not a dive deep of DWH in which I will not go through the theoretical concepts of Data Warehouse. I assume you are already aware of that and wish to do some hands-on.

This post is focussed purely on implementation side of the DWH. But that does not mean we will not cover the important concepts applicable to any DWH. I am not going to follow bookish language or steps of building a DWH but what I feel is important and easy-to-follow for a beginner.

In my opinion , there are 5 steps to consider when building a DWH: Business requirement Analyse the input data Dimension modeling Create DB objects Load scripts

Project Overview

We are going to build a data warehouse for a big pizza store franchise and it operates across different cities in the country. As part of the initial POC, the business team has appointed to a store manager who will interact with us.

He will share the require data with us and we have to proceed with 5 step process to build a data warehouse. We will analyse the 4 csv files and come up with existing OLTP data model for those files.

Dataset is inspired from data available in Maven analytics (Pizza Place Sales)

For creating Data Model , my preferred tool nowadays is dbdiagram.io
It allows you to create Data Model programatically too hence it saves lot of time if you have too many data objects in-scope.

I have written a blog post that explains how you can use python to create data model programatically. You may want to read it as well.

The DBML to be used for given input CSV files.

Table pizzas {
 pizza_id varchar(100) 
    pizza_type_id varchar(100)
    size varchar(5) 
    price decimal(5,2)
}


Table pizza_types {
 pizza_type_id varchar(100)
    name varchar(100) 
    category varchar(100)
    ingredients varchar(1000)
}

Table orders {
    order_id integer
    date date
    time time
}

Table order_details {
 order_details_id integer
    order_id integer
    pizza_id varchar(100)
    quantity smallint
}

Ref: orders.order_id < order_details.order_id
Ref: pizzas.pizza_id < order_details.pizza_id
Ref: pizzas.pizza_type_id > pizza_types.pizza_type_id

Dimensional Modeling

We will create dimensional model for our datawarehouse project now.

Followings points should be considered while creating Dimensional Model.

  • The pizza/pizza types tables can be merged to one single product dimension.
  • We can give it more generic name so that we can cater to more items in the future. So in place of pizza we will call it product.
  • At later stage , we may split it or keep it merged – snowflake schema design where dimension is further split into smaller dimensions table to make it more atomic.
  • For Orders we can merge it to make it one order table.
  • Missing Dimension : Date/Time Dimension.
  • Additionally we will create new Store Dimension as well to cater to future adhoc queries
  • New fact tables – SalesOrder will be created to calculate measures.
  • The final list of tables will be :
    • dim_products
    • dim_orders
    • fact_sales
    • dim_stores
    • dim_date
    • dim_time
  • 2 Technical columns – dt_insert , rec_source are also added to implement ABC (Audit Balance Control) at start. We may add more technical columns at later stage as we proceed.

DBML code for Dimensional Model for our data warehouse project

Table dim_product {
  product_sk integer
  product_id varchar(100) 
  product_type_id varchar(100)
  product_name varchar(100)
  product_category varchar(100)
  product_family varchar(100)
  product_ingredients varchar(1000)
  product_size varchar(5) 
  product_price decimal(5,2)
  rec_source varchar(10)
  dt_insert date
}


Table dim_orders {
    order_sk integer
    order_id integer
    order_detail_id integer
    product_id varchar(100)
    date date
    time time
    rec_source varchar(10)
    dt_insert date
}

Table fact_sales_order {
    sales_order_sk integer
    order_sk integer
    product_sk integer
    store_sk integer
    date_id bigint
    time_id bigint
    product_price decimal(5,2)
    quantity smallint
    rec_source varchar(10)
    dt_insert date
}

Table dim_store {
  store_sk integer
  store_id varchar(100) 
  store_name varchar(100)
  store_zip varchar(100)
  store_city varchar(100)
  store_staff smallint
  rec_source varchar(10)
  dt_insert date
}

Table dim_date {
  date_id      bigint
    date         date
    year         smallint
    month        smallint
    monthname    varchar(20)
    day          smallint
    dayofyear    smallint
    weekdayname  varchar(20)
    calendarweek smallint
    quartal      varchar(5)
    weekend      varchar(20)
    is_holiday   varchar(3)
    period       varchar(30)
}

Table dim_time {
    time_id     bigint
    timeofday   varchar(10)
    quarterhour varchar(20)
    daytimename varchar(20)
    daynight    varchar(20)
}

Ref: dim_product.product_sk - fact_sales_order.product_sk
Ref: dim_orders.order_sk - fact_sales_order.order_sk
Ref: dim_store.store_sk - fact_sales_order.store_sk
Ref: dim_date.date_id - fact_sales_order.date_id
Ref: dim_time.time_id - fact_sales_order.time_id