Teradata data distribution & Indexes (Primary/Secondary)

Teradata uses a HASHING algorithm on PRIMARY INDEX column/s to distribute rows among various AMP’s. The process of rows distribution is the main reason behind the massive parallel core architecture of TERADATA.

Teradata HASHROW, HASHBUCKET, HASHAMP

TERADATA uses indexes to determine the distribution of rows. Teradata uses a hashing algorithm which processes the index and gives the HASH VALUE. Depending on the HASH VALUE , it refers to HASH MAP to decide the HASH BUCKET and hence the HASH AMP. That particular AMP will store that record. Similarly, there are other AMP’s also receiving their share of records. So each record is stored in specific AMP depending on the HASH VALUE.

This is the reason it is suggested that the columns with more unique values and used in joins and filters are preferable index columns. So whenever set of records are received , index columns are processed and are stored in respective AMP’s. Since the work is distributed between AMP’s, thereby TERADATA is so swift. So we can say TERADATA is as fast as the slowest AMP.

To understand HASH Functions and ROWS Distribution in Teradata refer to below query which try to process a record with index column value as ‘NITIN’:

SELECT
 HASHROW ('NITIN')   AS "ROW HASH"
,HASHBUCKET (HASHROW ('NITIN')) AS "HASH BUCKET"
,HASHAMP (HASHBUCKET (HASHROW ('NITIN'))) AS "HASH AMP";

DATA Skewness

TERADATA is unique because of the parallel architecture. TERADATA rows distribution is important to understand as it will help us in realising better index columns. One must also check for AMP skewness. In case of improper index column selection , uneven distribution of rows may occur; which will eventually lead to more work for some specific AMP while other AMP’s would be free. Hence increasing turnaround time. AMP skewness can be check by using simple HASH functions available in TERADATA.

Teradata Data distribution directly impacts performance of queries executed on any table. So selecting a good Primary Index is of utmost importance in any Teradata environment. Let’s look at the example below to understand this.

Teradata Data Distribution Example

Table employee has emp_id column which is unique with value starts from 1 and finishes at 50. Other column emp_gender which can be ‘Male’ (M),’Female’ (F),’Not Available’ (NA). There is total 50 rows in the table. Now let’s see if we pick employee_id column as primary index then in a 5-AMP Teradata system how the data distribution shall look like.

In the above diagram we have compared a situation when we create primary index on emp_id column which is unique and then used emp_gender column which have only 3 unique values.

Teradata Data Distribution Observation

Even Distribution vs Skewed Distribution

When we select emp_id column as Primary Index then we see EVEN distribution of rows among AMPs. Each amp is storing same number of rows i.e. 10 in this example. However when we select emp_gender as Primary Index then the distribution is highly skewed. If one value occurs more than the other values then that particular AMP has to do more work than others. In this example, AMP3 is handling 25 rows while AMP1 has 15 and AMP5 has only 10 rows.

Reduced Parallelism

Teradata is all about Parallelism which means it divides the task among amps and executes the process in parallel. If less AMPs are participating in the process then with reduced parallelism the Teradata system goes under-utilised. In this example, with emp_id as primary index all 5 AMPs are involved in processing and are equally utilised. However when we select emp_gender only 3 AMPs got some data while AMP2 & AMP4 did not get any data for employee table. So when we will query this table only 3 AMP will work in place of 5. Hence AMP2 & AMP4 are under-utilised.

Performance Impact

It is said that the Teradata process is as fast as the slowest AMP. So when one AMP is overloaded then that particular AMP will take more time to complete the task. The other AMPs will finish their task but overall process will wait for slowest AMP or most overloaded AMP to complete its task. In this case, when we select emp_id as primary index then all 5 AMPs got equal number of rows to process so they shall finish the task at almost same time. When we select emp_gender as primary index then AMP3 has to process more rows than AMP1 & AMP5 whereas AMP2 & AMP4 does not have any data to process. This directly impacts the overall performance of the job.

Identifying Good Primary Index column in Teradata

So now we know that a lot of the problems and performance issues we face in Teradata is due to bad Primary Index column. Thankfully you can identify a good candidate for Primary Index using below steps:

1) Identify columns used in the Filter & Join conditions

First step is to identify the columns which you will be using or are used in the filter or join conditions. This step is very important because to fully utilise the primary index benefit you must use it in the queries too else you may end up doing FULL Table Scan which must be avoided when ever possible.

2) Identify the Cardinality of identified columns

Once you have list of identified columns check the cardinality of columns and pick the top 3 high cardinality columns. High Cardinality means columns with most number of unique values. Like emp_id is unique in our example as it has 50 unique values for given 50 rows so it has high cardinality.

3) Check if all AMPs are utilised

It is very important to check that if you pick any column as primary index whether all columns shall be storing some portion of table data or not. Also the rows each AMP will get shall be in the similar range to avoid skewness.
Run the below query to confirm if all the AMPs in the system are storing a portion of table when identified column is used as Primary Index

SELECT HASHAMP(HASHBUCKET(HASHROW(emp_id))), COUNT(*)
FROM employee 
GROUP BY 1
ORDER BY 1;
                                  
SELECT HASHAMP(HASHBUCKET(HASHROW(emp_gender))), COUNT(*)
FROM employee 
GROUP BY 1
ORDER BY 1;

4) Candidate Key or Single Column Primary Index

In some cases, a single column Primary Index may not be able to fulfil all the requirements mentioned above. In that case, repeat the process from Step-1 by considering 2 columns as primary index together as composite key. In some cases, I have seen even 4 columns together used as Primary Index. However be careful because if the primary index has 4 columns and your query uses only 3 of it then you may not be leveraging Primary Index benefits for that query.

How Teradata distribute rows ? It is very important to understand how Teradata actually distribute rows among AMPs. We all know that PRIMARY INDEX is used for data distribution. All the AMPs in the Teradata system maintains a portion of each table and this forms the basis of ‘parallel architecture and share nothing‘ architecture of Teradata. We will see how exactly it happens and what all hash functions are used to determine the respective AMP which will hold the table.

There are three main functions:

Teradata hashrow

HASHROW: The function returns 4 BYTE output for each input value. So the primary index value is passed through HASHROW function to generate output which is further passed to HASHBUCKET. Example: If MARK is my primary index value then HASHROW will return output something like below:

Teradata hashbucket

HASHBUCKET: The function returns the bucket number in HASHMAP which will hold the table row. The input to HASHBUCKET is 4 BYTE HASHROW output and it returns the BUCKET#.

Teradata hashamp

HASHAMP: The function returns the AMP number which will hold the table row in its vdisk. This AMP will be responsible for the portion of table rows which it stores in it vdisk. It takes HASHBUCKET# as the input and returns the AMP number.

All the three Teradata HASH functions work together to determine which AMP in Teradata will store what portion of table data.

If you have worked on other RDBMS then you must be aware of Primary KEY concept. In Teradata ,Primary Key does not play as important role as Primary INDEX. In this post, we will see difference between Primary Key and Primary Index.

One must not get confused between Primary Key and Primary Index in Teradata. Primary KEY is more of a logical thing however Primary INDEX is more of physical thing. In Teradata, Primary INDEX is used for finding best access path for data retrieval and data insertion and Primary KEY is used for finding each rows uniquely just like in other RDBMS.

5 differences between PRIMARY KEY and PRIMARY INDEX

#	PRIMARY KEY	PRIMARY INDEX
1	PRIMARY KEY cannot be NULL	PRIMARY INDEX can be NULL
2	PRIMARY KEY is not mandatory in Teradata	PRIMARY INDEX is mandatory in Teradata with exception of NOPI tables
3	PRIMARY KEY does not help in data distribution.	PRIMARY INDEX helps in data distribution
4	PRIMARY KEY should be unique.	PRIMARY INDEX can be UNIQUE (Unique Primary Index) or NON UNIQUE (Primary Index).
5	PRIMARY KEY is logical implementation.	PRIMARY INDEX is physical implementation.

Primary Key vs Primary Index in Teradata

Now we will see few scenarios to see how these two are handled in Teradata

I have not defined PRIMARY INDEX or PRIMARY KEY on table what will happen now

In this case, Teradata will check if any column is defined as UNIQUE, then it will make that column as UNIQUE PRIMARY INDEX else first column will be created as PRIMARY INDEX.

I have not defined PRIMARY INDEX however a column is defined as PRIMARY KEY

In this case, Teradata will make the PRIMARY KEY column as UNIQUE PRIMARY INDEX of the table.

I have defined both PRIMARY KEY and PRIMARY INDEX on different column

In this case, Teradata will make PRIMARY KEY column as UNIQUE SECONDARY INDEX i.e. UNIQUE INDEX on the table.

So one must understand the importance of PRIMARY INDEX in Teradata. Generally, PRIMARY KEY concept is taken care by UNIQUE PRIMARY INDEX in Teradata environment.

what is the difference between primary key and unique key

Similar to primary key , you can define column as UNIQUE which will enforce unique value for the column. The difference is you can have multiple columns in the table defined as UNIQUE however a table can have only one primary key.

What is Secondary Index in Teradata ?

Secondary Index in Teradata provides an alternate path to retrieve the data. It is used only for data retrieval and it has nothing to do with data distribution. For data distribution, Teradata use PRIMARY INDEX. So why do we need Secondary Index , when Primary Index is available for Data Storage as well as Data Retrieval ? There may be some situations when the query may not be using Primary Index column for data retrieval. In such cases, data retrieval will be very slow. In such situations we can create Secondary Index on such columns which are not part of PRIMARY Index however are used very often in JOIN conditions or other conditions for data retrieval. Like PRIMARY INDEX, we have two types of SECONDARY Index too:

Unique Secondary Index

CREATE UNIQUE INDEX [COLUMN_NAME] ON TABLENAME;

Non-Unique Secondary Index

CREATE INDEX [COLUMN_NAME] ON TABLENAME;

Creating SECONDARY INDEX may help in performance optimisation however it also comes at some cost in terms of resources. Whenever a SECONDARY index is created on table , a subtable is created on all the AMPs which hold following information:

SECONDARY INDEX VALUE || SECONDARY INDEX ROW_ID || PRIMARY INDEX ROW_ID

So whenever we query using column defined as SECONDARY INDEX, all AMP’s are asked to check for their sub-table if they hold that value. If yes, then AMPs retrieve the corresponding PRIMARY INDEX row_id from their subtable. Then the AMP holding the PRIMARY INDEX row_id is asked to retrieve respective records. Hence, Data Retrieval via Secondary Index is always 2 AMP or more AMP operation. For NUSI [Non Unique Secondary Index] subtable is created in the same AMP holding the PRIMARY row_id. However for USI[Unique Secondary Index], subtables hold the information about rows of different AMPs. Secondary Index avoids FULL TABLE scan. However one should collect STATS on Secondary Index columns in order to allow Optimizer to use Secondary Index and not Full Table Scan.

Advantages of Secondary Index

Avoids FULL Table Scan by providing alternate data retrieval path.
Enhances performances.
Can be dropped and created anytime.
A table may have multiple Secondary Index defined where as only one Primary Index is permissible.

Disadvantages of Secondary Index

Needs extra storage space for SUBTABLE.
Needs extra I/O to maintain SUBTABLE.
Collect STATS is required in order to avoid FULL TABLE SCAN.

Drop Secondary Index

DROP INDEX [COLUMN_NAME] ON TABLENAME;

ETL SQL