ETL SQL

Learn Extract Transform Load using SQL & PySpark

Redshift create table & copy data from S3

In this post, we will see a very simple example in which we will create a Redshift table with a basic structure and then check which additional properties Redshift adds to it by default. We will quickly discuss those properties here, and in subsequent posts we will see how they impact the overall query performance of these tables. I will be referring to TPC-H tables and queries in the Redshift-related posts.
Let’s create a sample table now.

CREATE TABLE IF NOT EXISTS h_part(
p_partkey BIGINT ,
p_name VARCHAR(55) ,
p_mfgr VARCHAR(25) ,
p_brand VARCHAR(10) ,
p_type VARCHAR(25) ,
p_size DECIMAL(38,0) ,
p_container VARCHAR(10) ,
p_retailprice DECIMAL(38,18) ,
p_comment VARCHAR(23)
);

So I am creating the “h_part” table with a few columns, and I have specified only the datatype for each column. These are the minimum table-level properties you must specify in order to create any table in Redshift. Now let’s check the table definition that Redshift actually stored.

CREATE TABLE IF NOT EXISTS public.h_part
(
" p_partkey BIGINT ENCODE az64"
" ,p_name VARCHAR(55) ENCODE lzo"
" ,p_mfgr VARCHAR(25) ENCODE lzo"
" ,p_brand VARCHAR(10) ENCODE lzo"
" ,p_type VARCHAR(25) ENCODE lzo"
" ,p_size NUMERIC(38,0) ENCODE az64"
" ,p_container VARCHAR(10) ENCODE lzo"
" ,p_retailprice NUMERIC(38,18) ENCODE az64"
" ,p_comment VARCHAR(23) ENCODE lzo"
)
DISTSTYLE AUTO
;

Two important points to notice here:

1) ENCODE: An appropriate encoding (compression technique) is added to each column. Since Redshift is a columnar database, it can apply a compression algorithm suited to each column's datatype rather than a uniform compression for the entire table.

2) DISTSTYLE: A distribution style of “AUTO” is added to the table, which behaves like “ALL” while the table is small and switches to “EVEN” as the table grows. You can also set these properties explicitly yourself, as sketched below.
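
For reference, here is a minimal sketch of the same DDL with the encoding and distribution style spelled out explicitly. The table name h_part_explicit and the choice of DISTSTYLE EVEN are purely illustrative, not a recommendation:

-- Same columns as h_part, but with ENCODE and DISTSTYLE stated explicitly
CREATE TABLE IF NOT EXISTS h_part_explicit(
p_partkey BIGINT ENCODE az64,
p_name VARCHAR(55) ENCODE lzo,
p_mfgr VARCHAR(25) ENCODE lzo,
p_brand VARCHAR(10) ENCODE lzo,
p_type VARCHAR(25) ENCODE lzo,
p_size DECIMAL(38,0) ENCODE az64,
p_container VARCHAR(10) ENCODE lzo,
p_retailprice DECIMAL(38,18) ENCODE az64,
p_comment VARCHAR(23) ENCODE lzo
)
DISTSTYLE EVEN;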

We will not change anything for now. We will keep the table structure as-is and proceed to data loading in the next post. We will talk more about these properties in detail in later posts.

4 ways to create table in Redshift

In the previous post, we saw that you can create a table in Redshift by specifying 3 minimum details: Table Name, Column Name & Column Datatype. Redshift automatically adds encoding & distribution style to the table if nothing is specified explicitly. In this post, we will see 4 ways in which you can create a table in Redshift.

1) CREATE Table by specifying DDL in Redshift

This is the most common way of creating a table in Redshift: by supplying the DDL. Here the user can specify Table Name, Column Name, Column Datatype, Distribution Style, Distribution Key & Sort Key. We will also mark a few columns as NOT NULL in the table structure; otherwise columns default to NULL.

CREATE TABLE IF NOT EXISTS h_part(
p_partkey BIGINT NOT NULL,
p_name VARCHAR(40) ,
p_brand VARCHAR(10) NOT NULL,
p_retailprice DECIMAL(10,2) ,
p_active date NOT NULL,
p_inactive date NOT NULL
)
diststyle key
distkey (p_partkey)
compound sortkey (p_active,p_brand);

Let’s look at the table components once it is created.

select
	cast(tablename as varchar(30)) as tbl,
	cast("column" as varchar(30)) as col,
	"type",
	"encoding",
	"distkey",
	"sortkey",
	"notnull"
from
	pg_table_def
where
	tablename in ('h_part')
order by
	1;

tbl    | col           | type                   | encoding | distkey | sortkey | notnull
h_part | p_partkey     | bigint                 | az64     | true    | 0       | true
h_part | p_name        | character varying(40)  | lzo      | false   | 0       | false
h_part | p_brand       | character varying(10)  | none     | false   | 2       | true
h_part | p_retailprice | numeric(10,2)          | az64     | false   | 0       | false
h_part | p_active      | date                   | none     | false   | 1       | true
h_part | p_inactive    | date                   | az64     | false   | 0       | true

So we see that the default encoding of AZ64 is applied to the BIGINT, NUMERIC & DATE columns, while LZO compression is applied to the VARCHAR columns. Sort key columns are not encoded and are kept as RAW. Why? Because encoding sort key columns may add overhead during query processing. More details on this in a future post. We can also see the expected distkey, sortkey & NOT NULL columns in the output.
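
If you want to see which encodings Redshift itself would recommend for your columns, you can run ANALYZE COMPRESSION against the table. This is just a sketch; the table should already contain a representative sample of data for the suggestions to be meaningful:

-- Ask Redshift to recommend a compression encoding for each column of h_part
ANALYZE COMPRESSION h_part;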

2) CREATE TABLE LIKE in Redshift

We can create a new table from an existing table in Redshift by using the LIKE keyword. It is a one-line command which copies most of the properties of the source table into the new target table. Let’s look at the example below:

create table h_part_like (LIKE h_part);

Let’s see the table components after it is created.

tbl         | col           | type                   | encoding | distkey | sortkey | notnull
h_part_like | p_partkey     | bigint                 | az64     | true    | 0       | true
h_part_like | p_name        | character varying(40)  | lzo      | false   | 0       | false
h_part_like | p_brand       | character varying(10)  | none     | false   | 2       | true
h_part_like | p_retailprice | numeric(10,2)          | az64     | false   | 0       | false
h_part_like | p_active      | date                   | none     | false   | 1       | true
h_part_like | p_inactive    | date                   | az64     | false   | 0       | true

The output looks exactly the same as for the table created via DDL. So if you want most of the table properties carried over to the new table, then LIKE is the best choice.

Note: CREATE TABLE LIKE creates an empty table. If you want to create a backup of a table along with its data, then either run an INSERT statement once the table is created (see the sketch below) or create the table using one of the other methods shared below.
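
A minimal sketch of the LIKE + INSERT approach; the table name h_part_backup is just illustrative:

-- Create an empty copy that keeps distkey, sortkey, encoding and NOT NULL
create table h_part_backup (LIKE h_part);

-- Then copy the data over
insert into h_part_backup select * from h_part;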

3) CREATE TABLE AS (CTAS) in Redshift

CTAS is a common method available in most RDBMS, including Redshift, to create a new table from an existing table. With this method you also copy the data from the source to the target table. Let’s look at the example below:

create table h_part_ctas as (select * from h_part);

Let’s check the table components below:

tbl         | col           | type                   | encoding | distkey | sortkey | notnull
h_part_ctas | p_partkey     | bigint                 | az64     | true    | 0       | false
h_part_ctas | p_active      | date                   | none     | false   | 1       | false
h_part_ctas | p_inactive    | date                   | az64     | false   | 0       | false
h_part_ctas | p_name        | character varying(40)  | lzo      | false   | 0       | false
h_part_ctas | p_brand       | character varying(10)  | none     | false   | 2       | false
h_part_ctas | p_retailprice | numeric(10,2)          | az64     | false   | 0       | false

The table structure with CTAS looks identical, but column properties like NOT NULL are lost. So if you wish to keep column-level constraints, you may want to use LIKE + INSERT instead. CTAS does, however, let you set the distribution and sort keys explicitly, as sketched below.
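
A minimal sketch of CTAS with explicit keys; the table name h_part_ctas2 is just illustrative, and you would still need to re-add NOT NULL-style constraints separately:

-- CTAS with explicit distribution key and compound sort key
CREATE TABLE h_part_ctas2
DISTKEY (p_partkey)
SORTKEY (p_active, p_brand)
AS
SELECT * FROM h_part;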

4) SELECT INTO new Table in Redshift

You can also create a new table using the SELECT INTO statement, which is the same as in many other RDBMS. With SELECT INTO you create a new table with the same table structure and with the data. Let’s see the example below:

select * INTO h_part_select from h_part;

tbl           | col           | type                   | encoding | distkey | sortkey | notnull
h_part_select | p_partkey     | bigint                 | az64     | true    | 0       | false
h_part_select | p_active      | date                   | none     | false   | 1       | false
h_part_select | p_inactive    | date                   | az64     | false   | 0       | false
h_part_select | p_name        | character varying(40)  | lzo      | false   | 0       | false
h_part_select | p_brand       | character varying(10)  | none     | false   | 2       | false
h_part_select | p_retailprice | numeric(10,2)          | az64     | false   | 0       | false

This is the same output as we saw with CTAS. Column-level properties like NOT NULL are lost. We do get a table with the same distkey & sortkey, along with the data, using this approach.

You can pick any of the 4 ways of creating a new table in Redshift depending on your requirement. For fresh tables it is generally the first approach, creating tables via DDL, and for ETL you may want to use the other approaches like LIKE (without data) and CTAS (with data).

Redshift Copy Command – Load S3 data into table

In the previous post, we created a few tables in Redshift, and in this post we will see how to load data present in S3 into these tables. To load data into Redshift, the preferred method is the COPY command, and we will use it in this post. The COPY command loads data in parallel, leveraging the MPP architecture of Redshift. The default file delimiter is “|” (pipe), and I have used the same for the data file in this example. So let’s see the COPY command in action below:

COPY H_PART
FROM
's3://<bucketname>/<bucketfolder>'
 credentials 'aws_iam_role=arn:aws:iam::<account-id>:role/<rolewiths3access>';

The following parameters are required in the above statement:
1) bucketname : S3 Bucket Name
2) bucketfolder : S3 Bucket Folder in which file exists
3) account-id: Account ID used for creating role
4) rolewiths3access : Name of an IAM role that has S3 access

Replace the above 4 parameters in your COPY statement and run the command from any SQL editor. A sketch with a few commonly used options is shown below.
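
For reference, a minimal sketch of a COPY statement with a few commonly used options. The bucket, folder and role values are placeholders you must substitute, and the options shown assume a pipe-delimited text file with a header row in a specific region:

COPY h_part
FROM 's3://<bucketname>/<bucketfolder>'
IAM_ROLE 'arn:aws:iam::<account-id>:role/<rolewiths3access>'
DELIMITER '|'        -- field separator used in the data file
IGNOREHEADER 1       -- skip one header row, if the file has one
REGION 'us-east-1';  -- region of the S3 bucket, if it differs from the cluster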

You can also provide an Access Key & Secret Key for authorisation in the COPY command in place of the role.

copy h_customer
FROM
's3://<bucketname>/<bucketfolder>'
access_key_id '<access-key>' secret_access_key '<secret-key>';

However, the preferred method is to use an IAM role for copying data into a Redshift table.
After running the command, I was able to see data in my Redshift tables.

select count(*) from h_customer; --100000
select count(*) from h_lineitem; --100000
select count(*) from h_nation; --25
select count(*) from h_order; --100000
select count(*) from h_part; --100000
select count(*) from h_partsupp; --100000
select count(*) from h_supplier; --100000

I have covered a very basic way of loading a data file from S3 into a Redshift table in this post. In later posts, we will see more options which are available to load data into Redshift tables, like manifest files, fixed-width or delimited files, compressed files etc.

Should you use Parquet files with Redshift COPY?

Amazon Redshift supports loading columnar file formats like PARQUET and ORC. This extends compatibility and makes it easier to move data between different environments for your data lake or data warehouse. Parquet is a very popular file format in the Hadoop ecosystem and is the first preference of many Big Data practitioners.

So you can load Parquet files into Amazon Redshift, but does that mean it should be your first preference? In this post, we will talk about why you may prefer Parquet files over CSV or other readable formats, and also cover a few scenarios in which you may want to avoid Parquet files.

Copy command to load Parquet file from S3 into a Redshift table

copy TABLENAME 
from 
's3://<s3bucket>/<s3folder>/attendence.parquet' 
iam_role 'arn:aws:iam::<actid>:role/<rolenm>' 
format as parquet;

“FORMAT AS PARQUET” informs Redshift that it is a Parquet file. You don’t have to supply any other information like delimiter, header etc.

Advantages of using PARQUET files in Redshift Copy

  1. Saves Space: Parquet is a highly compressed format by default, so it saves space on S3.
  2. Saves I/O: Since the file size is reduced, the I/O & network bandwidth required to transfer the file from S3 to Redshift is reduced too.
  3. Saves Time: A smaller file takes less time to transfer from S3 and to load into the Redshift table.
  4. Default Parallelism: When you load a Parquet file, Redshift splits the single file into 128 MB parts. Depending on the slices you have in your Redshift cluster, these 128 MB parts are processed in parallel during the copy. Remember that a single CSV file is loaded by 1 slice only, hence it is suggested to split a CSV file into a multiple of the total slices in the cluster before loading (see the query after this list for counting slices).
  5. Preserves Charset: A Parquet file generated with a specific charset preserves it properly, so the chances of facing charset-related issues or junk characters while loading a Parquet file are lower.
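
A quick way to check how many slices your cluster has is the STV_SLICES system view; a sketch:

-- Count the slices available in the cluster;
-- split large CSV files into a multiple of this number.
select count(*) as slice_count
from stv_slices;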

Challenges with Parquet files in Redshift Copy

  1. No Format Options available: Presently, almost none of the options available in COPY, like MAXERROR, IGNOREHEADER etc., work with Parquet.
  2. Inbuilt Schema Info: Parquet comes with inbuilt column metadata. So if the Redshift table's column datatype differs from the file's, the load will fail; for example, you cannot load an INTEGER column in the file into a VARCHAR column in the Redshift table.
  3. Difficult to Debug: If you run into a COPY error related to data type, data size or data value, then in the absence of a utility to preview the Parquet file, debugging becomes a big challenge.

Parquet VS CSV – case study

In this example, I have created two identical tables and loaded one with a CSV file and the other with a Parquet file. Below is the observation:

Number of records in File: 68,104,695 (68 Mn+)
Size of Data Files: CSV – 1.3 GB | Parquet – 864 MB
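
If you also want to compare the size of the two tables after loading, a sketch using the SVV_TABLE_INFO system view (size is reported in 1 MB blocks):

-- Compare on-disk size and row count of the two loaded tables
select "table", size as size_mb, tbl_rows
from svv_table_info
where "table" in ('temp_attendance_txt', 'temp_attendance_parquet');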

Copy Command to Load Data File into Table

copy temp_attendance_txt 
from 
's3://<s3bucket>/<s3folder>/attendence.txt' 
iam_role 'arn:aws:iam::<actid>:role/<rolenm>' 
delimiter '|';

select pg_last_copy_count(); 
select pg_last_copy_id(); 

copy temp_attendance_parquet 
from 
's3://<s3bucket>/<s3folder>/attendence.parquet' 
iam_role 'arn:aws:iam::<actid>:role/<rolenm>' 
format as parquet;

select pg_last_copy_count();
select pg_last_copy_id();

pg_last_copy_count: Returns the number of records loaded by the last COPY statement run in the current session.

pg_last_copy_id: Returns the query ID of the last COPY statement run in the current session.

These two pieces of information are essential to confirm whether COPY loaded the same number of records into both Redshift tables. With the query ID you can also check different STL/SVL tables/views to get more insight into the COPY statements, for example the load-error details sketched below.
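
For example, if a COPY fails, a sketch of checking STL_LOAD_ERRORS for the most recent errors:

-- Inspect recent COPY load errors (file name, line number, column, error reason)
select starttime, filename, line_number, colname, err_reason
from stl_load_errors
order by starttime desc
limit 10;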

Check Timing of COPY statements

select datediff(s,starttime,endtime) as duration,
* 
from 
stl_query 
where query in (QueryID1,QueryID2);

In this case, PARQUET took 16 seconds whereas CSV took 48 seconds. That is a 3x difference, which is massive if you consider running thousands of loads every day.

Slices participating in COPY

select * 
from 
STL_LOAD_COMMITS
where query in (QueryID1,QueryID2)
order by query,slice;

In this case, I can see the Parquet copy has 7 slices participating in the load, whereas for CSV it is a single slice which takes care of loading the file into the Redshift table. The Parquet file size is 864 MB, so 864/128 = ~7 file parts, one per slice.

In this post, I have shared my experience with Parquet so far. I won’t say that you must or must not use Parquet, as it totally depends on your use case.

Hope the information shared in this post will help you in handling parquet files efficiently in Redshift.
