Duplicate rows in a table can cause data quality issues. Sometimes duplicate rows are acceptable because the business requirement allows them. However, if duplicate rows were loaded into the table accidentally, you must remove them.
What are duplicate rows in SQL?
When two or more rows have exactly the same value in every column, we call them duplicate rows.
Sometimes a few column values match but not all of them. Treating such rows as duplicates is not correct. Business data can legitimately repeat values in some columns, so rows that differ in at least one column are not duplicate rows.
How do I find out whether duplicate rows exist in SQL?
There are multiple ways to identify duplicate rows in SQL. The most common one is to compare the count of rows in the table with the count of DISTINCT rows in the same table. If the DISTINCT count is less than the total row count, the table has duplicate rows in it.
In this tutorial, we will use the sales_demo table, which has duplicate records in it.
Let's run the two queries below to check the counts and determine whether the table has duplicate records in it.
select count(*) from sales_demo; --272466
select count(*) from (select distinct * from sales_demo) t1; --172456
The two queries return different counts (272466 rows in total but only 172456 distinct rows), so we can confirm that the table has duplicate rows in it.
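If you do not have a Redshift cluster at hand, the same check can be tried locally. The following is only a minimal sketch: it uses Python's built-in sqlite3 module instead of Redshift, and the three-column sales_demo table and its values are made up for illustration.

```python
import sqlite3

# Hypothetical miniature of sales_demo: three columns only, made-up values.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_demo (salesid INTEGER, qtysold INTEGER, pricepaid REAL)")
conn.executemany(
    "INSERT INTO sales_demo VALUES (?, ?, ?)",
    [(1, 2, 50.0), (1, 2, 50.0), (2, 1, 30.0), (3, 4, 80.0), (3, 4, 80.0)],
)

# Total number of rows in the table.
total_rows = conn.execute("SELECT count(*) FROM sales_demo").fetchone()[0]

# Number of rows left after collapsing identical rows.
distinct_rows = conn.execute(
    "SELECT count(*) FROM (SELECT DISTINCT * FROM sales_demo) t1"
).fetchone()[0]

# Fewer distinct rows than total rows means at least one row is duplicated.
has_duplicates = distinct_rows < total_rows
print(total_rows, distinct_rows, has_duplicates)
```

The two SQL statements are the same ones used above. Only the table contents differ.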
How to find DISTINCT rows in SQL
We can fetch distinct rows in SQL by using the DISTINCT clause. We can also GROUP BY all the columns to get the same result.
select DISTINCT salesid, listid, sellerid, buyerid, eventid, dateid, qtysold, pricepaid, commission, saletime
from sales_demo; --172456

select salesid, listid, sellerid, buyerid, eventid, dateid, qtysold, pricepaid, commission, saletime
from sales_demo
group by salesid, listid, sellerid, buyerid, eventid, dateid, qtysold, pricepaid, commission, saletime; --172456
Both queries give the same result. Many people will tell you to use GROUP BY in place of DISTINCT because it performs better, but in most databases today there is hardly any difference in cost or run time. Any difference depends on other factors, such as the total number of rows in the table, the number of duplicate rows and the row length.
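To convince yourself that the two forms return the same rows, you can compare their output directly. This is a small sketch using Python's sqlite3 module rather than Redshift, with a made-up three-column version of sales_demo. It says nothing about performance.

```python
import sqlite3

# Hypothetical miniature of sales_demo with made-up values.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_demo (salesid INTEGER, sellerid INTEGER, qtysold INTEGER)")
conn.executemany(
    "INSERT INTO sales_demo VALUES (?, ?, ?)",
    [(1, 10, 2), (1, 10, 2), (2, 11, 1), (2, 12, 1)],
)

with_distinct = conn.execute(
    "SELECT DISTINCT salesid, sellerid, qtysold FROM sales_demo"
).fetchall()

with_group_by = conn.execute(
    "SELECT salesid, sellerid, qtysold FROM sales_demo "
    "GROUP BY salesid, sellerid, qtysold"
).fetchall()

# Row order is not guaranteed without ORDER BY, so compare sorted lists.
same_rows = sorted(with_distinct) == sorted(with_group_by)
print(len(with_distinct), len(with_group_by), same_rows)
```

Note that the two rows with salesid 2 are both kept, because they differ in sellerid and are therefore not duplicates.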
How to find only unique rows in SQL
Let's see how we can find only the unique records in SQL. Unique here does not mean DISTINCT rows. It means rows that have no duplicate record anywhere in the table. We will GROUP BY all columns and use a HAVING clause to keep only the groups that contain exactly one row.
We will use the query below for the sales_demo table.
select count(*)
from (
  select salesid, listid, sellerid, buyerid, eventid, dateid, qtysold, pricepaid, commission, saletime
  from sales_demo
  group by salesid, listid, sellerid, buyerid, eventid, dateid, qtysold, pricepaid, commission, saletime
  having count(*) = 1
) t1; --72456
This means there are 72456 unique rows in the table. None of these rows has an identical duplicate.
Look at the sample below. It has 6 total rows, 4 distinct rows and 2 unique rows.
** The s# column is added for illustration only. It is not part of the sales_demo table.
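The difference between total, distinct and unique rows can also be reproduced with a few lines of code. The sketch below is not the sample from this tutorial. It is a hypothetical six-row table, built with Python's sqlite3 module, that happens to have the same 6 / 4 / 2 split.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_demo (salesid INTEGER, sellerid INTEGER, qtysold INTEGER)")
conn.executemany(
    "INSERT INTO sales_demo VALUES (?, ?, ?)",
    [
        (1, 10, 2), (1, 10, 2),  # duplicated pair
        (2, 11, 1), (2, 11, 1),  # duplicated pair
        (3, 12, 4),              # appears once
        (4, 13, 3),              # appears once
    ],
)

total_rows = conn.execute("SELECT count(*) FROM sales_demo").fetchone()[0]

distinct_rows = conn.execute(
    "SELECT count(*) FROM (SELECT DISTINCT * FROM sales_demo) t1"
).fetchone()[0]

# Unique rows: groups over all columns that contain exactly one row.
unique_rows = conn.execute(
    """
    SELECT count(*) FROM (
        SELECT salesid, sellerid, qtysold
        FROM sales_demo
        GROUP BY salesid, sellerid, qtysold
        HAVING count(*) = 1
    ) t1
    """
).fetchone()[0]

print(total_rows, distinct_rows, unique_rows)
```

Each duplicated pair counts once towards the distinct rows but not at all towards the unique rows.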
Delete duplicate rows from a table on the basis of one column
DISTINCT removes duplicates by comparing all the columns present in the SELECT clause. If you want to remove duplicate rows on the basis of one column or a few columns, the row_number analytic function is a better option. The query below keeps, for each salesid, only the row with the latest saletime.
select salesid, listid, sellerid, buyerid, eventid, dateid, qtysold, pricepaid, commission, saletime
from (
  select salesid, listid, sellerid, buyerid, eventid, dateid, qtysold, pricepaid, commission, saletime,
         row_number() over (partition by salesid order by saletime desc) as rnk
  from sales_demo
) t1
where rnk = 1;
You can use the above query to get the required result set and load it into a temporary table. Then delete the rows from the actual table and reload it from this temporary table. Refer to the example in the next section for details.
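Here is a small runnable sketch of the row_number approach. It assumes Python's sqlite3 module with SQLite 3.25 or later (needed for window functions) instead of Redshift, and the table contents are made up. saletime is stored as ISO-formatted text so that it sorts chronologically.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_demo (salesid INTEGER, qtysold INTEGER, saletime TEXT)")
conn.executemany(
    "INSERT INTO sales_demo VALUES (?, ?, ?)",
    [
        (1, 2, "2008-01-01 10:00:00"),
        (1, 3, "2008-01-02 09:30:00"),  # latest row for salesid 1
        (2, 1, "2008-01-05 12:00:00"),  # only row for salesid 2
        (3, 4, "2008-02-01 08:00:00"),
        (3, 4, "2008-02-03 18:45:00"),  # latest row for salesid 3
    ],
)

# Number the rows within each salesid, newest first, and keep row number 1.
latest_per_sale = conn.execute(
    """
    SELECT salesid, qtysold, saletime
    FROM (
        SELECT salesid, qtysold, saletime,
               row_number() OVER (PARTITION BY salesid ORDER BY saletime DESC) AS rnk
        FROM sales_demo
    ) t1
    WHERE rnk = 1
    ORDER BY salesid
    """
).fetchall()

for row in latest_per_sale:
    print(row)
```

The ORDER BY inside the window decides which row survives, so choose it deliberately. If two rows tie on saletime, the database picks one of them arbitrarily.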
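Delete duplicate rows in a table in SQL
The most effective way of removing duplicate rows from a table in SQL is to use a temporary table. Create a table with the same structure and load data from the original table into the new table using DISTINCT or GROUP BY on all columns.
Then empty the original table. The example below uses DELETE rather than TRUNCATE because, in Amazon Redshift, TRUNCATE commits the current transaction and cannot be rolled back.
Once the table is empty, reload it from the temporary table.
Put all the SQL statements within a single transaction, between BEGIN and END TRANSACTION, so that a failure part-way through does not leave the original table empty.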
create table sales_demo_distinct (like sales_demo);

begin transaction;
insert into sales_demo_distinct select distinct * from sales_demo;
delete from sales_demo;
insert into sales_demo select * from sales_demo_distinct;
end transaction;

drop table sales_demo_distinct;

select count(*) from sales_demo; --172456
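If you load the temporary table with the row_number query from the previous section instead of DISTINCT, you can add more columns to the partition by clause to control which columns must match for rows to be treated as duplicates.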
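The reload steps in this section can be rehearsed on a throwaway database before touching a real table. The sketch below makes several assumptions: it uses Python's sqlite3 module rather than Redshift, the two-column table is made up, and because SQLite has no CREATE TABLE ... (LIKE ...), an empty CREATE TABLE AS SELECT stands in for it.

```python
import sqlite3

# isolation_level=None puts sqlite3 in autocommit mode, so the
# BEGIN / COMMIT / ROLLBACK statements below are fully explicit.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE sales_demo (salesid INTEGER, qtysold INTEGER)")
conn.executemany(
    "INSERT INTO sales_demo VALUES (?, ?)",
    [(1, 2), (1, 2), (2, 1), (3, 4), (3, 4), (3, 4)],
)

# Stand-in for Redshift's "create table ... (like sales_demo)":
# same columns, no rows.
conn.execute("CREATE TABLE sales_demo_distinct AS SELECT * FROM sales_demo WHERE 1 = 0")

conn.execute("BEGIN")
try:
    conn.execute("INSERT INTO sales_demo_distinct SELECT DISTINCT * FROM sales_demo")
    conn.execute("DELETE FROM sales_demo")
    conn.execute("INSERT INTO sales_demo SELECT * FROM sales_demo_distinct")
    conn.execute("COMMIT")
except sqlite3.Error:
    # Any failure restores the original rows instead of leaving the table empty.
    conn.execute("ROLLBACK")
    raise

conn.execute("DROP TABLE sales_demo_distinct")

remaining = conn.execute(
    "SELECT salesid, qtysold FROM sales_demo ORDER BY salesid"
).fetchall()
print(remaining)
```

Wrapping the three data-changing statements in one transaction is the important part. The rest is scaffolding for the demo.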
Why do tables have duplicate rows?
We have now cleansed the sales_demo table and removed the duplicate rows from it. However, if we do not fix the loading logic for this table in the data pipeline, it will have duplicate rows again in the future.
Here are the top 3 reasons why a table can end up with duplicate rows.
- Source sent duplicate rows
  - This is the most common reason for duplicate rows in a table.
  - To handle this, create a staging table that is loaded with the data from the source first.
  - Then apply DISTINCT while reading from this staging table in your data pipeline.
  - You can truncate the staging table once the data has been read from it and is no longer required.
- Improper join conditions while loading the table
  - When we join 2 or more tables, the join conditions determine the output.
  - If the join conditions are not accurate or complete, they may produce a one-to-many mapping and the result set may have duplicate rows in it.
- Re-run of failed jobs
  - Sometimes the data pipeline fails and you re-run the job after fixing the issue.
  - If any partial data was loaded into the table before the failure, it must be cleansed before restarting the job.
  - Sometimes the same job is triggered twice and loads the same data again. The ETL logic should handle this scenario so that data which has already been processed is not loaded again.
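One possible way to cover the first and third reasons together is to read from the staging table with DISTINCT and skip rows that already exist in the target. The sketch below is only an illustration of that idea, not a prescribed pipeline design. It uses Python's sqlite3 module, and the sales_staging table, the load_from_staging function and all values are hypothetical. It also assumes that the compared columns contain no NULLs, because NULL never compares as equal.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_demo (salesid INTEGER, qtysold INTEGER)")
# Hypothetical staging table that receives the raw rows from the source.
conn.execute("CREATE TABLE sales_staging (salesid INTEGER, qtysold INTEGER)")

# One row was already loaded by an earlier run.
conn.execute("INSERT INTO sales_demo VALUES (1, 2)")
# The source resends that row and also sends (2, 1) twice.
conn.executemany(
    "INSERT INTO sales_staging VALUES (?, ?)",
    [(1, 2), (2, 1), (2, 1), (3, 4)],
)
conn.commit()


def load_from_staging(connection):
    """Insert distinct staging rows that are not yet present in sales_demo."""
    connection.execute(
        """
        INSERT INTO sales_demo (salesid, qtysold)
        SELECT DISTINCT s.salesid, s.qtysold
        FROM sales_staging s
        WHERE NOT EXISTS (
            SELECT 1
            FROM sales_demo d
            WHERE d.salesid = s.salesid
              AND d.qtysold = s.qtysold
        )
        """
    )
    connection.commit()


load_from_staging(conn)
load_from_staging(conn)  # simulate the same job being triggered twice

final_rows = conn.execute(
    "SELECT salesid, qtysold FROM sales_demo ORDER BY salesid"
).fetchall()
print(final_rows)
```

Because the second call finds every staging row already present, it inserts nothing, so running the job twice does not create duplicates.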
Note: I have used Amazon Redshift SQL. The SQL syntax may differ for your database.