Create data model from SQL | Step-by-Step guide for reverse engineering

In this post I will share a quick method you can use to create a data model from any existing SQL in your data warehouse environment. A data model plays a vital role in building an understanding of your data warehouse. When you are new to a project, the existing data model helps you quickly establish the relationships between key technical tables.

We are talking about a data model like the one you can see below.

Generated Data Model

To demonstrate how you can use the strategy described in this post to build a data model, we will use the queries below.

/*
 Query to find all events and occurrences in one year
*/

SELECT eve.eventname, count(*) as total_occurrences
FROM Event eve
inner join
Date_cal dat
on
eve.dateid = dat.dateid
where dat.caldate between '2022-01-01' and '2022-12-31'
group by eve.eventname
;

/*
Query to find all events on 30th Oct 2022 in Bangalore city
*/

SELECT eve.eventname,
eve.starttime,
ven.venuename,
ven.venuecity,
ven.venuecitypop,
ven.venuestate,
dat.day,
dat.holiday,
dat.caldate
FROM Event eve
inner join 
Venue ven
on 
eve.venueid = ven.venueid
inner join
Date_cal dat
on
eve.dateid = dat.dateid
where ven.venuecity = 'Bangalore'
and dat.caldate = '2022-10-30'
;


/* 
Query to find the total number of tickets sold, price paid, and commission paid for each event and then determine profit for each event.
*/
select
 sales_event.eventname , sales_event.tot_Qtysold, sales_event.tot_Pricepaid - sales_event.tot_commission as tot_profit
from
(
select 
sum(sal.Qtysold) as tot_Qtysold,
sum(sal.Pricepaid) as tot_Pricepaid,
sum(sal.commission) as tot_commission,
eve.eventname
FROM
Sales sal
inner join
Event eve
on
sal.eventid = eve.eventid
group by eve.eventname
)sales_event
order by 2 desc;


/* 
Query to find least popular events in history 
*/

SELECT 
cat.catname,
eve.eventname,
sum(sal.Qtysold) as tot_tickets_sold
FROM Category cat
inner join 
Event eve
on
cat.catid = eve.catid
inner join 
Sales sal
on
sal.eventid = eve.eventid
GROUP BY cat.catname,eve.eventname
ORDER BY 3 asc;

Now assume you work as a Data Engineer and a business user raises data issues in the reports/visualisations built on top of these queries. You are new to the project and are seeing these queries for the first time. If you have a data model at hand, you can easily see how the tables are joined, which columns are keys, and so on. Otherwise you have to work through the queries and learn it the hard way.

SQL into Data Model – The Solution

The solution uses Python packages with some custom code to parse the SQL and generate DBML output. We then paste the generated DBML into a web app to visualise it and save it as a data model.

Step 1: Install the sqlparse package

pip install sqlparse

Step 2: Install the sql-metadata package

pip install sql-metadata

Step 3: Run the Python script below

import sqlparse
from sql_metadata import Parser

strsql = """
/*
 Query to find all events and occurrences in one year
*/

SELECT eve.eventname, count(*) as total_occurrences
FROM Event eve
inner join
Date_cal dat
on
eve.dateid = dat.dateid
where dat.caldate between '2022-01-01' and '2022-12-31'
group by eve.eventname
;

/*
Query to find all events on 30th Oct 2022 in Bangalore city
*/

SELECT eve.eventname,
eve.starttime,
ven.venuename,
ven.venuecity,
ven.venuecitypop,
ven.venuestate,
dat.day,
dat.holiday,
dat.caldate
FROM Event eve
inner join 
Venue ven
on 
eve.venueid = ven.venueid
inner join
Date_cal dat
on
eve.dateid = dat.dateid
where ven.venuecity = 'Bangalore'
and dat.caldate = '2022-10-30'
;


/* 
Query to find the total number of tickets sold, price paid, and commission paid for each event and then determine profit for each event.
*/
select
 sales_event.eventname , sales_event.tot_Qtysold, sales_event.tot_Pricepaid - sales_event.tot_commission as tot_profit
from
(
select 
sum(sal.Qtysold) as tot_Qtysold,
sum(sal.Pricepaid) as tot_Pricepaid,
sum(sal.commission) as tot_commission,
eve.eventname
FROM
Sales sal
inner join
Event eve
on
sal.eventid = eve.eventid
group by eve.eventname
)sales_event
order by 2 desc;


/* 
Query to find least popular events in history 
*/

SELECT 
cat.catname,
eve.eventname,
sum(sal.Qtysold) as tot_tickets_sold
FROM Category cat
inner join 
Event eve
on
cat.catid = eve.catid
inner join 
Sales sal
on
sal.eventid = eve.eventid
GROUP BY cat.catname,eve.eventname
ORDER BY 3 asc;
"""

statements = sqlparse.split(strsql)
all_columns = list()
all_tables = list()
all_joins = ""

for stmt in statements:
    parser = Parser(stmt)
    # Collect every column the statement references, grouped by clause
    for key, value in parser.columns_dict.items():
        if value is not None:
            all_columns.append(value)
    # Collect every table the statement references
    for j in parser.tables:
        all_tables.append(j.split())
    # Join columns come back as a flat list; pair them up as "Ref: left - right|"
    i = 0
    join_cond = ""
    if 'join' in parser.columns_dict:
        for j in parser.columns_dict['join']:
            if i % 2 == 0:
                join_cond = join_cond + "Ref: " + j
            else:
                join_cond = join_cond + " - " + j + "|"
            i = i + 1
        all_joins = all_joins + join_cond


# Deduplicate the collected join conditions
all_joins_list = all_joins.split("|")
all_joins_list = list(set(all_joins_list))
all_joins = '|'.join(all_joins_list)


# Flatten the per-statement lists and deduplicate
all_columns = sum(all_columns, [])
all_columns = list(set(all_columns))

all_tables = sum(all_tables, [])
all_tables = list(set(all_tables))


table_def = ""
for tbl in all_tables:
    # Start each table with a technical "rowid" column, then add every
    # column the queries reference for this table
    table_struct = "Table " + tbl + " {\nrowid bigint"
    for col in all_columns:
        if col.startswith(tbl + '.'):
            table_struct = table_struct + "\n" + col.split('.')[-1] + " string"
    table_def = table_def + table_struct + "\n}\n|"

for block in table_def[:-1].split("|"):
    print(block)

# Print each Ref line, skipping empty entries left by the trailing "|"
for ref in all_joins.split("|"):
    if ref:
        print(ref)

  • Replace the SQL assigned to strsql in the Python code with your own queries as needed.

Step 4: Copy the generated DBML output

Table Date_cal {
rowid bigint
holiday string 
day string 
dateid string 
caldate string 
}

Table Event {
rowid bigint
starttime string 
eventname string 
venueid string 
eventid string 
dateid string 
catid string 
}

Table Category {
rowid bigint
catname string 
catid string 
}

Table Venue {
rowid bigint
venuecitypop string 
venueid string 
venuecity string 
venuestate string 
venuename string 
}

Table Sales {
rowid bigint
commission string 
Pricepaid string 
eventid string 
Qtysold string 
}

Ref: Category.catid - Event.catid
Ref: Sales.eventid - Event.eventid
Ref: Event.dateid - Date_cal.dateid
Ref: Event.venueid - Venue.venueid

Copy the DBML and open dbdiagram.io

Step 5: Paste the DBML output and see the Magic

DBML and corresponding Data Model

Assumptions in the code

Disclaimer: This code is a weekend project and I do not intend to spend more time on it at present. The project is for learning purposes only. It is not recommended for real-life projects, as it lacks many salient features of a data model, such as granularity. Use it, learn it, enjoy it. 🙂

My main goal is to have a standard working package readily available that I can pick up and customise to my needs in the future. At the same time, I hope to give a head start to anyone with a similar need to build a data model from existing SQL.

Some assumptions I have made in the code:

  • One technical column "rowid" has been added to each table; you can add more or rename it. Tables generally have technical columns such as batch IDs, surrogate keys, insert dates, and update dates.
  • One-to-one mapping between tables. The generated DBML assumes a one-to-one mapping between the tables participating in join conditions. We only have the queries to generate DBML from; the code could be extended to query the actual tables and establish the proper cardinality.
  • The code has been tested against a limited set of queries and may not support complex queries or SQL functions.
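On the mapping assumption in particular: DBML itself can express cardinality, so if you know the real relationship you can edit the generated Ref lines by hand. `>` denotes many-to-one, `<` one-to-many, and `-` one-to-one (which is what the code emits). For example, for the tables above:

```dbml
Ref: Sales.eventid > Event.eventid   // many-to-one: many sales rows per event
Ref: Event.venueid > Venue.venueid   // many-to-one: many events per venue
```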

Why build this utility in the first place?

There are many tools available on the market (even free ones) that do a similar task. I found two big challenges in using them.

  • Dependency on constraints: most of the similar tools I tested require PRIMARY KEY and FOREIGN KEY constraints to be defined in the table structure/DDL. In my use case this was not necessary; I wanted to give more weight to the joining columns.
  • Limited flexibility in the generated data model: for my use case I also wanted more flexibility in the generated data model, such as adding or removing custom columns.

The possibilities are endless. For example, you could build a mechanism that connects to your query log or query-history system table and generates a data model for every query executed in the last 24 hours.
