Teradata uses a HASHING algorithm on PRIMARY INDEX column/s to distribute rows among various AMP's. The process of rows distribution is the main reason behind the massive parallel core architecture of TERADATA.
Teradata HASHROW, HASHBUCKET, HASHAMP
TERADATA uses indexes to determine the distribution of rows. Teradata uses a hashing algorithm which processes the index and gives the HASH VALUE. Depending on the HASH VALUE , it refers to HASH MAP to decide the HASH BUCKET and hence the HASH AMP. That particular AMP will store that record. Similarly, there are other AMP's also receiving their share of records. So each record is stored in specific AMP depending on the HASH VALUE.
This is the reason it is suggested that the columns with more unique values and used in joins and filters are preferable index columns. So whenever set of records are received , index columns are processed and are stored in respective AMP's. Since the work is distributed between AMP's, thereby TERADATA is so swift. So we can say TERADATA is as fast as the slowest AMP.
To understand HASH Functions and ROWS Distribution in Teradata refer to below query which try to process a record with index column value as 'NITIN':
SELECT HASHROW ('NITIN') AS "ROW HASH" ,HASHBUCKET (HASHROW ('NITIN')) AS "HASH BUCKET" ,HASHAMP (HASHBUCKET (HASHROW ('NITIN'))) AS "HASH AMP";
TERADATA is unique because of the parallel architecture. TERADATA rows distribution is important to understand as it will help us in realising better index columns. One must also check for AMP skewness. In case of improper index column selection , uneven distribution of rows may occur; which will eventually lead to more work for some specific AMP while other AMP's would be free. Hence increasing turnaround time. AMP skewness can be check by using simple HASH functions available in TERADATA.
Teradata Data distribution directly impacts performance of queries executed on any table. So selecting a good Primary Index is of utmost importance in any Teradata environment. Let's look at the example below to understand this.
Teradata Data Distribution Example
Table employee has emp_id column which is unique with value starts from 1 and finishes at 50. Other column emp_gender which can be 'Male' (M),'Female' (F),'Not Available' (NA). There is total 50 rows in the table. Now let's see if we pick employee_id column as primary index then in a 5-AMP Teradata system how the data distribution shall look like.
In the above diagram we have compared a situation when we create primary index on emp_id column which is unique and then used emp_gender column which have only 3 unique values.
Teradata Data Distribution Observation
Even Distribution vs Skewed Distribution
When we select emp_id column as Primary Index then we see EVEN distribution of rows among AMPs. Each amp is storing same number of rows i.e. 10 in this example. However when we select emp_gender as Primary Index then the distribution is highly skewed. If one value occurs more than the other values then that particular AMP has to do more work than others. In this example, AMP3 is handling 25 rows while AMP1 has 15 and AMP5 has only 10 rows.
Teradata is all about Parallelism which means it divides the task among amps and executes the process in parallel. If less AMPs are participating in the process then with reduced parallelism the Teradata system goes under-utilised. In this example, with emp_id as primary index all 5 AMPs are involved in processing and are equally utilised. However when we select emp_gender only 3 AMPs got some data while AMP2 & AMP4 did not get any data for employee table. So when we will query this table only 3 AMP will work in place of 5. Hence AMP2 & AMP4 are under-utilised.
It is said that the Teradata process is as fast as the slowest AMP. So when one AMP is overloaded then that particular AMP will take more time to complete the task. The other AMPs will finish their task but overall process will wait for slowest AMP or most overloaded AMP to complete its task. In this case, when we select emp_id as primary index then all 5 AMPs got equal number of rows to process so they shall finish the task at almost same time. When we select emp_gender as primary index then AMP3 has to process more rows than AMP1 & AMP5 whereas AMP2 & AMP4 does not have any data to process. This directly impacts the overall performance of the job.
Identifying Good Primary Index column in Teradata
So now we know that a lot of the problems and performance issues we face in Teradata is due to bad Primary Index column. Thankfully you can identify a good candidate for Primary Index using below steps:
1) Identify columns used in the Filter & Join conditions
First step is to identify the columns which you will be using or are used in the filter or join conditions. This step is very important because to fully utilise the primary index benefit you must use it in the queries too else you may end up doing FULL Table Scan which must be avoided when ever possible.
2) Identify the Cardinality of identified columns
Once you have list of identified columns check the cardinality of columns and pick the top 3 high cardinality columns. High Cardinality means columns with most number of unique values. Like emp_id is unique in our example as it has 50 unique values for given 50 rows so it has high cardinality.
3) Check if all AMPs are utilised
It is very important to check that if you pick any column as primary index whether all columns shall be storing some portion of table data or not. Also the rows each AMP will get shall be in the similar range to avoid skewness.
Run the below query to confirm if all the AMPs in the system are storing a portion of table when identified column is used as Primary Index
SELECT HASHAMP(HASHBUCKET(HASHROW(emp_id))), COUNT(*) FROM employee GROUP BY 1 ORDER BY 1; SELECT HASHAMP(HASHBUCKET(HASHROW(emp_gender))), COUNT(*) FROM employee GROUP BY 1 ORDER BY 1;
4) Candidate Key or Single Column Primary Index
In some cases, a single column Primary Index may not be able to fulfil all the requirements mentioned above. In that case, repeat the process from Step-1 by considering 2 columns as primary index together as composite key. In some cases, I have seen even 4 columns together used as Primary Index. However be careful because if the primary index has 4 columns and your query uses only 3 of it then you may not be leveraging Primary Index benefits for that query.