In this post I will share a quick method you can use to create a data model from existing SQL queries in your data warehouse environment. A data model plays a vital role in building an understanding of your data warehouse. When you are new to a project, the existing data model can help you quickly establish the relationships between key technical tables.
We are talking about a data model like the one you can see below.

To demonstrate how you can use the strategy described in this post to build a data model, we will use the queries below.
/* Query to find all events and occurrences in one year */
SELECT eve.eventname, count(*) as total_occurences
FROM Event eve
inner join Date_cal dat on eve.dateid = dat.dateid
where dat.caldate between '2022-01-01' and '2022-12-31'
group by eve.eventname;

/* Query to find all events on 30th Oct 2022 in Bangalore city */
SELECT eve.eventname, eve.starttime, ven.venuename, ven.venuecity,
       ven.venuecitypop, ven.venuestate, dat.day, dat.holiday, dat.caldate
FROM Event eve
inner join Venue ven on eve.venueid = ven.venueid
inner join Date_cal dat on eve.dateid = dat.dateid
where ven.venuecity = 'Bangalore'
and dat.caldate = '2022-10-30';

/* Query to find the total number of tickets sold, price paid, and commission paid
   for each event and then determine profit for each event. */
select sales_event.eventname,
       sales_event.tot_Qtysold,
       sales_event.tot_Pricepaid - sales_event.tot_commission as tot_profit
from (
    select sum(sal.Qtysold) as tot_Qtysold,
           sum(sal.Pricepaid) as tot_Pricepaid,
           sum(sal.commission) as tot_commission,
           eve.eventname
    FROM Sales sal
    inner join Event eve on sal.eventid = eve.eventid
    group by eve.eventname
) sales_event
order by 2 desc;

/* Query to find least popular events in history */
SELECT cat.catname, eve.eventname, sum(sal.Qtysold) as tot_tickets_sold
FROM Category cat
inner join Event eve on cat.catid = eve.catid
inner join Sales sal on sal.eventid = eve.eventid
GROUP BY cat.catname, eve.eventname
ORDER BY 3 asc;
Now assume you work as a Data Engineer and a business user raises data issues in the reports/visualisations built on top of these queries. You are new to the project and are seeing these queries for the first time. If you have a data model at hand, you can easily understand how the tables are joined, which columns are key, and so on. Otherwise you have to go through the queries and learn it the hard way.
SQL into Data Model – The Solution
The solution uses Python packages with some custom code to parse the SQL and generate DBML output. We then paste the generated DBML into a web app to visualise it and save it as a data model.
Step 1: Install the sqlparse package
pip install sqlparse
Step 2: Install the sql-metadata package
pip install sql-metadata
Step 3: Run the below Python script
import sqlparse
from sql_metadata import Parser
strsql = """
/*
Query to find all events and occurrences in one year
*/
SELECT eve.eventname,count(*) as total_occurences
FROM Event eve
inner join
Date_cal dat
on
eve.dateid = dat.dateid
where dat.caldate between '2022-01-01' and '2022-12-31'
group by eve.eventname
;
/*
Query to find all events on 30th Oct 2022 in Bangalore city
*/
SELECT eve.eventname,
eve.starttime,
ven.venuename,
ven.venuecity,
ven.venuecitypop,
ven.venuestate,
dat.day,
dat.holiday,
dat.caldate
FROM Event eve
inner join
Venue ven
on
eve.venueid = ven.venueid
inner join
Date_cal dat
on
eve.dateid = dat.dateid
where ven.venuecity = 'Bangalore'
and dat.caldate = '2022-10-30'
;
/*
Query to find the total number of tickets sold, price paid, and commission paid for each event and then determine profit for each event.
*/
select
sales_event.eventname , sales_event.tot_Qtysold, sales_event.tot_Pricepaid - sales_event.tot_commission as tot_profit
from
(
select
sum(sal.Qtysold) as tot_Qtysold,
sum(sal.Pricepaid) as tot_Pricepaid,
sum(sal.commission) as tot_commission,
eve.eventname
FROM
Sales sal
inner join
Event eve
on
sal.eventid = eve.eventid
group by eve.eventname
)sales_event
order by 2 desc;
/*
Query to find least popular events in history
*/
SELECT
cat.catname,
eve.eventname,
sum(sal.Qtysold) as tot_tickets_sold
FROM Category cat
inner join
Event eve
on
cat.catid = eve.catid
inner join
Sales sal
on
sal.eventid = eve.eventid
GROUP BY cat.catname,eve.eventname
ORDER BY 3 asc;
"""
statements = sqlparse.split(strsql)
all_columns = list()
all_tables = list()
all_joins = ""
for stmt in statements:
    parser = Parser(stmt)
    # Collect every referenced column, grouped by clause (select/where/join/...)
    for key, value in parser.columns_dict.items():
        if value is not None:
            all_columns.append(value)
    for j in parser.tables:
        all_tables.append(j.split())
    # Join columns arrive as a flat list [left, right, left, right, ...];
    # pair them up into DBML "Ref:" lines separated by "|"
    i = 0
    join_cond = ""
    if 'join' in parser.columns_dict:
        for j in parser.columns_dict['join']:
            if i % 2 == 0:
                join_cond = join_cond + "Ref: " + j
            else:
                join_cond = join_cond + " - " + j + "|"
            i = i + 1
    all_joins = all_joins + join_cond
# De-duplicate the collected join references, dropping empty fragments
# left behind by the trailing "|" separators
all_joins_list = [x for x in all_joins.split("|") if x]
all_joins_list = list(set(all_joins_list))
all_joins = '|'.join(all_joins_list)
# Flatten the nested lists and de-duplicate columns and tables
all_columns = sum(all_columns, [])
all_columns = list(set(all_columns))
all_tables = sum(all_tables, [])
all_tables = list(set(all_tables))
table_def = ""
for j in all_tables:
    # Emit a DBML table block: a synthetic rowid column plus every
    # column referenced for this table across the queries
    table_struct = "Table " + j + " {\nrowid bigint"
    for i in all_columns:
        if i.startswith(j + '.'):
            table_struct = table_struct + "\n" + i.split('.')[-1] + " string "
    table_def = table_def + table_struct + "\n}\n|"
for i in table_def[:-1].split("|"):
    print(i)
for i in all_joins.split("|"):
    print(i)
- Change the SQL in the Python code as per your requirement.
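To see how the script turns join columns into DBML references, the even/odd pairing can be illustrated standalone, with a hard-coded column list instead of parser output:

```python
# Join columns come back from the parser as a flat list
# [left1, right1, left2, right2, ...]; consecutive pairs become
# DBML "Ref:" lines. The list below is hard-coded for illustration.
join_cols = ["Event.dateid", "Date_cal.dateid", "Event.venueid", "Venue.venueid"]
refs = [f"Ref: {join_cols[k]} - {join_cols[k + 1]}"
        for k in range(0, len(join_cols), 2)]
print("\n".join(refs))
```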
Step 4: Copy the DBML output generated
Table Date_cal {
rowid bigint
holiday string
day string
dateid string
caldate string
}
Table Event {
rowid bigint
starttime string
eventname string
venueid string
eventid string
dateid string
catid string
}
Table Category {
rowid bigint
catname string
catid string
}
Table Venue {
rowid bigint
venuecitypop string
venueid string
venuecity string
venuestate string
venuename string
}
Table Sales {
rowid bigint
commission string
Pricepaid string
eventid string
Qtysold string
}
Ref: Category.catid - Event.catid
Ref: Sales.eventid - Event.eventid
Ref: Event.dateid - Date_cal.dateid
Ref: Event.venueid - Venue.venueid
Copy the DBML and open dbdiagram.io
Step 5: Paste the DBML output and see the Magic

Assumptions in the code
Disclaimer: This code is a weekend project and I do not intend to spend more time on it presently. The project is for learning purposes only. It is not recommended for real-life projects as it lacks many salient features of a data model, such as granularity. Use it, learn it, enjoy it. 🙂
My main goal is to have a standard working package readily available which I can pick up and customise to my needs in future. At the same time, I wish to give a head start to anyone with a similar need to build a data model from existing SQL.
Some assumptions I have made in the code:
- One technical column "rowid" has been added to each table. However, you can add more or rename it. Tables generally have technical columns like batch id, surrogate keys, insert date, update date, etc.
- One-to-one mapping between tables. The generated DBML assumes a one-to-one mapping between the tables participating in join conditions. We only have the queries to generate DBML from; however, the code could be extended to query the actual tables and establish the proper cardinality between two tables.
- The code is tested against a limited set of queries and may not support complex queries or SQL functions.
Why build this utility in the first place?
There are many tools available in the market (even free ones) which do a similar task. I found two big challenges in using them.
- Limitation due to constraint dependency: Most of the similar tools I tested require PRIMARY KEY and FOREIGN KEY constraints to be defined at the table level in the table structure/DDL. In my use case those keys were not defined; I wanted to give more weight to the joining columns instead.
- Limitation due to flexibility of the generated data model: For my use case I also wanted more flexibility in the generated data model, such as adding or removing custom columns.
The possibilities are endless. You could create a mechanism to connect to the query log or query-history system table in your database and generate a data model for all the queries executed in the last 24 hours.
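As a sketch of that idea, assuming a hypothetical query_history system table with query_text and start_time columns (simulated here with an in-memory SQLite database — substitute your warehouse's actual system table), you could collect the last day's statements and feed the resulting string into the script above as strsql:

```python
import sqlite3

# Simulate a query-history system table. The table and column names
# (query_history, query_text, start_time) are hypothetical stand-ins
# for whatever your database actually provides.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE query_history (query_text TEXT, start_time TEXT)")
conn.execute(
    "INSERT INTO query_history VALUES "
    "('SELECT e.eventname FROM Event e', datetime('now'))"
)

# Pull every statement executed in the last 24 hours and join them
# with ";" separators, ready to be parsed by sqlparse.split()
rows = conn.execute(
    "SELECT query_text FROM query_history "
    "WHERE start_time >= datetime('now', '-1 day')"
).fetchall()
recent_sql = "\n;\n".join(r[0] for r in rows)
print(recent_sql)
```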