ETL SQL

Learn Extract Transform Load using SQL & PySpark

Teradata – the complete guide

What is Teradata ?

Teradata is defined as the Relational Database Management Solution which supports Massively Parallel Processing. Teradata is meant to provide active datawarehouse , data marts and data analytics environment to customer. Teradata is also considered as the leading data warehouse solution in the market.

First thing you must understand is what is Teradata and why and where is it used. Also why is it so popular in the Datawarehouse and Analytics market since last decade or so. To start with I want you to remember only 2 main points to answer the question What is Teradata ?

  • RDBMS (Relational DataBase Management System)
    • Stores DATA in Tabular format
    • Table has ROWS & COLUMNS
    • All rows has same format
    • Each column has specific datatype
    • Objects are related generally
  • MPP (Massively Parallel Processing)
    • Parallel units of processors
    • Independent work
    • Concurrent Tasks

Whenever you are asked what is Teradata ? You should reply it is RDBMS & is MPP. In addition to the above mentioned points kindly understand that Teradata is primarily used for Data warehousing solutions – Active Data warehouse, Data Marts and Data Analytics. Also remember that Teradata follows “shared nothing architecture“.

The parallelism in Teradata is built in the core architecture hence even a small query or a very complex query both can take advantage of parallelism in Teradata.

Let’s see the Teradata Architecture now which will answer few of our questions related with parallelism in Teradata.

Teradata Architecture – Basic

Teradata consists of 4 main components :

Parsing Engine

BYNET

AMPs

VDISKs

The below diagram displays a very basic Teradata Architecture.

Teradata Architecture Diagram

TERADATA is incredibly fast because of the parallel architecture. It is very important to understand the Teradata Architecture and it is very often asked in the interviews as well.

Parsing Engine

The first component is PE (Parsing Engine). It has 4 sub-components in it :

Session Control

It checks for user authorisation before processing any SQL queries. So we can say that Session Control checks that the username and password you are using is accurate or not. A PE can support upto 120 sessions.

Parser

It checks for the SQL syntax and user rights to access various database objects referred in the SQL query submitted by user.

Optimizer

It create a plan or execution steps to follow in order to perform actions on database objects as per SQL query submitted by user. The optimiser is parallel aware it means it understands the Teradata system and data demography too. Optimiser always creates and pick the plan which is least expensive in terms of TIME only. The plan which can execute the query fastest is pick irrespective of the fact that it may consume more resources like ampcpu, iocount etc.

Dispatcher

It passes the execution steps to BYNET. Dispatcher also plays an important role by combining all the responses received and send it to user.

BYNET

Next component is BYNET. It is used for communication between Parsing Engine (PE) and AMP’s (Access Module Processor). There are two BYNET’s available in any TERADATA environment: BYNET-0 & BYNET-1. Two BYNET’s allows for continuous sending and receiving messages between PE’s and AMP’s. Traffic is automatically distributed between 2 BYNETs. BYNET consists of both software and hardware components in MPP system.

AMPs

Next to BYNET, we have AMP’s. AMP’s can be considered as the worker in TERADATA Architecture. Each AMP has its own dedicated VDISK (Virtual DISK) to which it queries and process the steps planned by Optimizer. AMP’s work only on their own VDISK and do not have access to other AMP’s VDISK. Once the AMP perform the steps, it send back response to PE via BYNET where all the responses from various AMP’s is collected and sent back to user.

VDISKs

VDISKs or virtual disks are logical partitions created in physical disks. Each AMP gets a share of disk to store and retrieve data. Each AMP is responsible for data stored in the associated VDISK only. Hence the data sharing does not happen between AMPs as such.

This is a very basic description of Teradata Architecture.

If you understand this much , it is sufficient for beginner level. However if you wish to understand more about hardware and software components in Teradata then following section covers it in detail. I will consider this as advanced level as many people working in Teradata for years may not have this information related with Hardware or Software components in Teradata.

Teradata Architecture – Advance

In addition to the teradata architecture components listed above I would like to share few more hardware & software components. Also I would like to share few concepts too which will help you in understanding the necessity of software or hardware that exists in Teradata.

Nodes

Teradata is installed on one or more nodes and each Node has multiple components in it which are hardware and softwares coming together to provide high speed compute environment. You can assume (just for easy understanding) that NODE is a computer in itself which has memory , storage and processor of its own.

Clique

When you group multiple nodes together “physically” using wires then it forms a clique. In a multi-node Teradata environment you can have one or more cliques depending on the physical cabling done to attach multiple nodes together. Cliques provide durability as if any node goes down , other nodes with-in the same clique quickly takes up the work of failed node and keep the system up and running.

Hot Standby Node

One node is a clique is not configured to process any tasks but it is used as back-up in case of failures. So whenever any node fails, Hot Standby Node replaces it and system literally faces any impact. When the failed node comes back, it typically becomes the new Hot Standby Node.

Hot or Cold Data

If the data is accessed frequently then it is HOT data and it is stored in a faster disk for quick access. Any data that is infrequently accessed then it is labelled as COLD and is stored in slower disk. For understanding purpose, you can assume you have 2 disks – one is HDD (Hard Disk Drive) and other is SSD (Solid State Drive). HDD is economical and can store more volume of data where as SSD is comparatively expensive and is of less volume however gives quicker access to stored data. So any data you need very often you will prefer to store it in SSD and the data which you don’t require very often you will push it to HDD.

Teradata Virtual Storage

Similarly , in Teradata, a feature was added since v13 that tags the data as cold or hot. It is called Teradata Virtual Storage or TVS. It takes care of identifying hot and cold data and shuffling data from one category to another.

Other than Memory & Disks, there are few ethernet cards & cables that constitute the hardware components of Teradata system.

Operating System

Teradata supports Windows & Linux versions of operating system. UNIX is no more supported.

Parallel Data Extensions

Parallel Data Extensions or PDE is the software layer added on top of Operating system to provide parallel environment. For understanding you can assume it like HADOOP is added on top of Linux Operating system to provide distributed environment. It takes care of resources and how to use it in optimised manner for parallelism.
Whenever you login into Teradata , to check the status of Teradata we run below command:

pdestate -a

This commands basically tell us whether PDE is in start or stop state. This is the first check we do if Teradata is not responding.

Trusted Parallel Application

Trusted Parallel Application or TPA is another software component that is added on top of PDE to take care of virtual processors in Teradata that includes AMPs & Parsing Engine.
Whenever we have to start Teradata services , we run below command:

/etc/init.d/tpa start

If you have been running these commands by any chance in Teradata now you know what are these software components and why do we need it.

There are few more components like Channel Driver and Teradata Gateway which are primarily used for communication between Parsing Engine and client.

Teradata Architecture – Exception

As I said at the start of this post, that Teradata is MPP however there is one exception to it. If Teradata is installed on a single node then it behaves as SMP (Symmetric MultiProcessing). SMP is when you have single memory pool and multiple cores share that memory to process the tasks. In MPP , you have multiple nodes where each node has its memory pool and cores use respective memory for computation.

Also in case of SMP Teradata , BYNET hardware is missing and the software components emulate the BYNET.

It is completely fine if you don’t remember all of the software and hardware components I have listed above. Just remember that other than the basic architecture Teradata is a combination of complex software running on highly optimized hardware to give you an incredible data warehousing platform. As I said, this is advanced level so even if you remember that few more hardware & software components exists in Teradata it should be fine for now.