Redshift Select Distinct

Redshift SELECT Distinct is used to fetch unique rows from the table. DISTINCT is applied to all the columns mentioned in the SELECT clause and not just the first column.

The example fetches DISTINCT values for one column in the table

select distinct catgroup
from category;
catgroup
Sports
Shows
Concerts

redshift select distinct multiple columns

You can apply DISTINCT on multiple columns to fetch unique rows considering all the columns and not just the first column in SELECT list.

select distinct catid,catgroup
from category;
catidcatgroup
1Sports
2Sports
3Sports
4Sports
5Sports
6Shows
7Shows
8Shows
9Concerts
10Concerts
11Concerts

redshift select distinct on is not supported

Redshift does not support DISTINCT ON clause. In other databases you can specify which row to pick if duplicate exists using DISTINCT ON clause. In redshift, select distinct on is not supported.

select distinct on (catgroup)
from category
order by catgroup,catid;

--ERROR: SELECT DISTINCT ON is not supported 

redshift select distinct count

In Redshift you can calculate count of distinct values for any column. Count Distinct is supported for only one column in Redshift. If you will list more than one column in DISTINCT in COUNT function then redshift will throw error.

select  count(distinct catid) from category; --11

select  count(distinct catgroup) from category; --3

select  count(distinct catgroup,catid) from category; 

--ERROR: function count(character varying, smallint) does not exist Hint: No function matches the given name and argument types. You may need to add explicit type casts.

redshift select distinct vs group by

Is DISTINCT slower than GROUP BY ? There is no factual numbers which can support than DISTINCT is slower than GROUP BY in Redshift. If you will compare explain plan for GROUP BY and DISTINCT then the result will be absolutely same in most of the cases. In some databases, explain plan may differ and Group By may be the more preferred one.

But with new databases like Redshift the SQL execution plan is created and most optimised one is picked. In Redshift it does not make much difference.

One rule which I follow is if I am aware than there are less records and more unique values I may be ok with DISTINCT clause else I will prefer writing Group By clause.

Leave a Reply

Your email address will not be published. Required fields are marked *