In this post I will share a quick method you can use to create a data model from the existing SQL queries in your data warehouse environment. A data model plays a vital role in building an understanding of your data warehouse: when you are new to a project, the data model helps you quickly establish the relationships between key technical tables.
We are talking about a data model like the one you can see below.

To demonstrate how you can use the strategy described in this post to build a data model, we will use the queries below.
/* Query to find all events and occurrences in one year */
SELECT eve.eventname, count(*) as total_occurences
FROM Event eve
inner join Date_cal dat on eve.dateid = dat.dateid
where dat.caldate between '2022-01-01' and '2022-12-31'
group by eve.eventname;

/* Query to find all events on 30th Oct 2022 in Bangalore city */
SELECT eve.eventname, eve.starttime, ven.venuename, ven.venuecity,
       ven.venuecitypop, ven.venuestate, dat.day, dat.holiday, dat.caldate
FROM Event eve
inner join Venue ven on eve.venueid = ven.venueid
inner join Date_cal dat on eve.dateid = dat.dateid
where ven.venuecity = 'Bangalore' and dat.caldate = '2022-10-30';

/* Query to find the total number of tickets sold, price paid, and
   commission paid for each event and then determine profit for each event. */
select sales_event.eventname, sales_event.tot_Qtysold,
       sales_event.tot_Pricepaid - sales_event.tot_commission as tot_profit
from (
    select sum(sal.Qtysold) as tot_Qtysold, sum(sal.Pricepaid) as tot_Pricepaid,
           sum(sal.commission) as tot_commission, eve.eventname
    FROM Sales sal
    inner join Event eve on sal.eventid = eve.eventid
    group by eve.eventname
) sales_event
order by 2 desc;

/* Query to find least popular events in history */
SELECT cat.catname, eve.eventname, sum(sal.Qtysold) as tot_tickets_sold
FROM Category cat
inner join Event eve on cat.catid = eve.catid
inner join Sales sal on sal.eventid = eve.eventid
GROUP BY cat.catname, eve.eventname
ORDER BY 3 asc;
Now assume you work as a Data Engineer and a business user reports data issues in the reports/visualisations built on top of these queries. You are new to the project and are seeing these queries for the first time. If you have a data model at hand, you can easily see how the tables are joined, which columns are keys, and so on. Otherwise you have to work through the queries and learn it the hard way.
SQL into Data Model – The Solution
The solution uses Python packages together with custom code to parse the SQL and generate DBML output. We then paste the generated DBML into a web app to visualise it and save it as a data model.
Step 1: Install the sqlparse package
pip install sqlparse
Step 2: Install the sql-metadata package
pip install sql-metadata
Step 3: Run the below Python code
import sqlparse
from sql_metadata import Parser

strsql = """
/* Query to find all events and occurrences in one year */
SELECT eve.eventname, count(*) as total_occurences
FROM Event eve
inner join Date_cal dat on eve.dateid = dat.dateid
where dat.caldate between '2022-01-01' and '2022-12-31'
group by eve.eventname;

/* Query to find all events on 30th Oct 2022 in Bangalore city */
SELECT eve.eventname, eve.starttime, ven.venuename, ven.venuecity,
       ven.venuecitypop, ven.venuestate, dat.day, dat.holiday, dat.caldate
FROM Event eve
inner join Venue ven on eve.venueid = ven.venueid
inner join Date_cal dat on eve.dateid = dat.dateid
where ven.venuecity = 'Bangalore' and dat.caldate = '2022-10-30';

/* Query to find the total number of tickets sold, price paid, and
   commission paid for each event and then determine profit for each event. */
select sales_event.eventname, sales_event.tot_Qtysold,
       sales_event.tot_Pricepaid - sales_event.tot_commission as tot_profit
from (
    select sum(sal.Qtysold) as tot_Qtysold, sum(sal.Pricepaid) as tot_Pricepaid,
           sum(sal.commission) as tot_commission, eve.eventname
    FROM Sales sal
    inner join Event eve on sal.eventid = eve.eventid
    group by eve.eventname
) sales_event
order by 2 desc;

/* Query to find least popular events in history */
SELECT cat.catname, eve.eventname, sum(sal.Qtysold) as tot_tickets_sold
FROM Category cat
inner join Event eve on cat.catid = eve.catid
inner join Sales sal on sal.eventid = eve.eventid
GROUP BY cat.catname, eve.eventname
ORDER BY 3 asc;
"""

# Split the input script into individual SQL statements
statements = sqlparse.split(strsql)

all_columns = list()
all_tables = list()
all_joins = ""

for stmt in statements:
    parser = Parser(stmt)

    # Collect every column referenced anywhere in the statement
    for key, value in parser.columns_dict.items():
        if value is not None:
            all_columns.append(value)

    # Collect every table referenced in the statement
    for j in parser.tables:
        all_tables.append(j.split())

    # Pair up the join columns (left, right, left, right, ...)
    # into DBML "Ref:" lines separated by "|"
    i = 0
    join_cond = ""
    if 'join' in parser.columns_dict:
        for j in parser.columns_dict['join']:
            if i % 2 == 0:
                join_cond = join_cond + "Ref: " + j
            else:
                join_cond = join_cond + " - " + j + "|"
            i = i + 1
    all_joins = all_joins + join_cond

# De-duplicate the joins, columns, and tables collected across statements
all_joins_list = list(set(all_joins.split("|")))
all_joins_list = [x for x in all_joins_list if x]  # drop empty entries
all_joins = '|'.join(all_joins_list)

all_columns = list(set(sum(all_columns, [])))
all_tables = list(set(sum(all_tables, [])))

# Build one DBML "Table" block per table, adding a technical "rowid"
# column plus every column seen for that table in the queries
table_def = ""
for j in all_tables:
    table_struct = "Table " + j + " {\nrowid bigint"
    for i in all_columns:
        if i.startswith(j + '.'):
            table_struct = table_struct + "\n" + i.split('.')[-1] + " string "
    table_def = table_def + table_struct + "\n}\n|"

# Print the DBML table definitions followed by the "Ref:" lines
for i in table_def[:-1].split("|"):
    print(i)
for i in all_joins.split("|"):
    print(i)
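The trickiest part of the code above is turning the flat `columns_dict['join']` list into DBML `Ref:` lines: sql-metadata returns the join columns in left/right order, so pairing the even and odd indices yields one relationship per pair. A stdlib-only sketch of that pairing (the column list here is illustrative):

```python
# Join columns come back as a flat list: left, right, left, right, ...
join_cols = ["Event.venueid", "Venue.venueid", "Event.dateid", "Date_cal.dateid"]

# Pair consecutive entries into DBML "Ref:" lines
refs = ["Ref: " + left + " - " + right
        for left, right in zip(join_cols[0::2], join_cols[1::2])]

for r in refs:
    print(r)
# Ref: Event.venueid - Venue.venueid
# Ref: Event.dateid - Date_cal.dateid
```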
- Change the SQL as per your requirement in the Python code.
Step 4: Copy the DBML output generated
Table Date_cal {
rowid bigint
holiday string
day string
dateid string
caldate string
}

Table Event {
rowid bigint
starttime string
eventname string
venueid string
eventid string
dateid string
catid string
}

Table Category {
rowid bigint
catname string
catid string
}

Table Venue {
rowid bigint
venuecitypop string
venueid string
venuecity string
venuestate string
venuename string
}

Table Sales {
rowid bigint
commission string
Pricepaid string
eventid string
Qtysold string
}

Ref: Category.catid - Event.catid
Ref: Sales.eventid - Event.eventid
Ref: Event.dateid - Date_cal.dateid
Ref: Event.venueid - Venue.venueid
Copy the DBML and open dbdiagram.io
Step 5: Paste the DBML output and see the Magic

Assumptions in the code
Disclaimer: This code is a weekend project and I do not intend to spend more time on it at present. It is for learning purposes only; it is not recommended for real-life projects, as it lacks many salient features of a data model, such as granularity. Use it, learn it, enjoy it. 🙂
My main goal is to have a standard working package readily available which I can pick up and customise as per my needs in future. At the same time, I hope to give a head start to anyone with a similar need to build a data model from existing SQL.
Some assumptions I have made in the code:
- One technical column, “rowid”, has been added to each table. You can rename it or add more; tables generally have technical columns such as batch IDs, surrogate keys, insert dates, and update dates.
- One-to-one mapping between tables. The generated DBML assumes a one-to-one mapping between the tables participating in join conditions. We only have the queries to work with, but the utility could be extended to query the actual tables and establish the proper cardinality between two tables.
- The code has been tested against a limited set of queries and may not support complex queries or SQL functions.
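On the one-to-one assumption: given access to the actual tables, the cardinality of a join can be inferred by checking whether the join column is unique on each side. A minimal sketch using sqlite3 with made-up data (the table and column names are hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Venue (venueid INTEGER, venuename TEXT);
CREATE TABLE Event (eventid INTEGER, venueid INTEGER);
INSERT INTO Venue VALUES (1, 'Arena'), (2, 'Stadium');
INSERT INTO Event VALUES (10, 1), (11, 1), (12, 2);
""")

def is_unique(table, column):
    # A join side is "one" if the join column never repeats in that table
    total, distinct = con.execute(
        f"SELECT COUNT({column}), COUNT(DISTINCT {column}) FROM {table}"
    ).fetchone()
    return total == distinct

left = "1" if is_unique("Venue", "venueid") else "many"
right = "1" if is_unique("Event", "venueid") else "many"
print(f"Venue.venueid - Event.venueid: {left}-to-{right}")  # 1-to-many
```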
Why build this utility in the first place?
There are many tools available in the market (even free ones) which do a similar task. I found two big challenges in using them.
- Limitation due to constraint dependency: most of the similar tools I have tested require PRIMARY KEY and FOREIGN KEY constraints defined in the table structure/DDL. In my use case these constraints were not necessarily defined, and I wanted to give more weight to the joining columns.
- Limitation due to inflexibility of the generated data model: for my use case I wanted more flexibility in the generated data model, such as adding or removing custom columns.
The possibilities are endless. For example, you can create a mechanism that connects to the query log or query history system table in your database and generates a data model for all queries executed in the last 24 hours.
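As a sketch of that idea, the snippet below pulls recent query text from a mock query-history table and selects the statements you would feed into the parsing step above. The `query_history` table and its columns are made up for illustration; real warehouses expose their own system tables (e.g. Redshift's STL_QUERY or Snowflake's QUERY_HISTORY) with different names:

```python
import sqlite3
from datetime import datetime, timedelta

# Stand-in for a warehouse query-history system table (hypothetical schema)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE query_history (query_text TEXT, end_time TEXT)")

now = datetime(2022, 10, 30, 12, 0, 0)
rows = [
    ("SELECT eve.eventname FROM Event eve", (now - timedelta(hours=3)).isoformat()),
    ("SELECT catname FROM Category", (now - timedelta(days=2)).isoformat()),
]
con.executemany("INSERT INTO query_history VALUES (?, ?)", rows)

# Fetch only the queries executed in the last 24 hours
cutoff = (now - timedelta(hours=24)).isoformat()
recent = [q for (q,) in con.execute(
    "SELECT query_text FROM query_history WHERE end_time >= ?", (cutoff,))]

for q in recent:
    print(q)  # feed each statement into the parsing code shown earlier
```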