Amazon Redshift supports loading columnar file formats like PARQUET, ORC. This extends compatibility and possibility of moving data easily from different environments for your data lake or data warehouse. Parquet primarily is a very popular file format on Hadoop and is first preference of Big Data experts.
Now you can load parquet files in Amazon Redshift but does that mean it should be your first preference ? In this post, we will talk about why you should prefer parquet files over csv or other readable formats. And will also cover few scenarios in which you should avoid parquet files.
Copy command to load Parquet file from S3 into a Redshift table
copy TABLENAME from 's3://<s3bucket>/<s3folder>/attendence.parquet' iam_role 'arn:aws:iam::<actid>:role/<rolenm>' format as parquet;
"FORMAT AS PARQUET" informs redshift that it is parquet file. You don't have to supply any other information like delimiter, header etc.
Advantages of using PARQUET files in Redshift Copy
- Saves Space: Parquet by default is highly compressed format so it saves space on S3.
- Saves I/O: Since file size is reduced I/O & network bandwidth required to transfer file from S3 to Redshift is reduced too.
- Saves Time: Smaller size of file takes lesser time to transfer from S3 into Redshift and also to load into Redshift Table.
- Default Parallelism:When you load a parquet file then Redshift splits a single parquet file into 128 MB file parts. Depending on the slices you have in your redshift cluster, the 128 MB file parts shall be processed in parallel during copy. Remember 1 CSV is loaded by 1 Slice only hence it is suggested to split CSV file into multiples of total slices in cluster before loading.
- Preserves Charset: Parquet file when generated with specific charset preserves it properly. So chances of facing charset related issues or junk characters while loading a parquet file are less.
Challenges with Parquet files in Redshift Copy
- No Format Options available: Presently, almost none of the options available in copy works with parquet like MAXERROR, IGNOREHEADER etc.
- Inbuilt Schema info: Parquet comes with inbuilt info on file-columns metadata. So if you want to modify some column dataype in Redshift Table then it will fail like you cannot load INTEGER column in a file into a VARCHAR column into a Redshift table.
- Difficult to debug: If you run into some copy error related with data type,data size or data value then in absence of any utility which can help you in previewing parquet file debugging becomes a big challenge.
Parquet VS CSV – case study
In this example, I have created two identical tables and loaded one with csv file while other with parquet file. Below is the observation:
Number of records in File: 68,104,695 (68 Mn+)
Size of Data Files: CSV – 1.3 GB | Parquet – 864 MB
Copy Command to Load Data File into Table
copy temp_attendance_txt from 's3://<s3bucket>/<s3folder>/attendence.txt' iam_role 'arn:aws:iam::<actid>:role/<rolenm>' delimiter '|'; select pg_last_copy_count(); select pg_last_copy_id(); copy temp_attendance_parquet from 's3://<s3bucket>/<s3folder>/attendence.parquet' iam_role 'arn:aws:iam::<actid>:role/<rolenm>' format as parquet; select pg_last_copy_count(); select pg_last_copy_id();
pg_last_copy_count: Tells the number of records loaded as part of COPY statement.
pg_last_copy_id: Tells the QueryID of the last COPY statement.
The above 2 information is essential to confirm if COPY loaded same number of records into Redshift table. Also with the QUERY ID you can check in different STL/SVL tables/views to get more insight into COPY statements.
Check Timing of COPY statements
select datediff(s,starttime,endtime) as duration, * from stl_query where query in (QueryID1,QueryID2);
In this case, PARQUET took 16 seconds where as CSV took 48 seconds. I see difference of 3 times which is massive if you consider running thousands of loads every day.
Slices participating in COPY
select * from STL_LOAD_COMMITS where query in (QueryID1,QueryID2) order by query,slice;
In this case, I can see parquet copy has 7 slices participating in the load. Where as in CSV it is single slice which takes care of loading file into Redshift table. Parquet file size is 864MB so 864/128 = ~7 slices.
In this post, I have shared my experience with Parquet so far. I won't say that you must use Parquet or must not as it totally depends on your use-case.
Hope the information shared in this post will help you in handling parquet files efficiently in Redshift.