Connect to Teradata from Spark

I was recently working on a project that involved migrating data from Teradata to Hadoop. Most of the migration was done using Sqoop. However, there was also a requirement to do a quick lookup into a Teradata table and do some processing in Spark. I wanted to use Spark for this because it involved comparing data in the Teradata table with a Hive table. So I connected to Teradata via JDBC and created a DataFrame from the Teradata table, then created a DataFrame from the Hive table and compared the two. Let's see an example of connecting Teradata to Spark directly via a JDBC connection.

Step 1: Download the JAR files

We will need three JAR files to create a connection to Teradata from Spark:

  1. terajdbc4.jar
  2. tdgssconfig.jar
  3. teradata-connector-1.4.4.jar

You can download these JAR files from the official Teradata website. Once downloaded, keep the files in some local directory like /tmp/…

Step 2: Start spark-shell and pass all three JAR files

spark-shell --jars /tmp/raj/terajdbc4.jar,/tmp/raj/tdgssconfig.jar,/tmp/raj/teradata-connector-1.4.4.jar
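Once the shell is up, you can optionally verify that the driver JAR was picked up before going any further. This is just a plain JVM class-loading check; the class name is the Teradata JDBC driver class used later in the connection properties:

scala> Class.forName("com.teradata.jdbc.TeraDriver")

If this throws a ClassNotFoundException, the JARs were not added to the classpath correctly.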

Step 3: Use Spark JDBC to load a DataFrame

Create a DataFrame that will hold the Teradata output. We will use the load method and pass the required properties like url, user, password, and driver to establish the connection. Note that the parameters in the Teradata JDBC URL are comma-separated, without spaces.

val df_prez_ALL = sqlContext.load("jdbc", Map(
  "url" -> "jdbc:teradata://192.168.200.200/DBS_PORT=1025,TMODE=TERA,user=admin,password=admin",
  "dbtable" -> "db_lab.td_usa_prez",
  "driver" -> "com.teradata.jdbc.TeraDriver"))

That's it. The required data has now been fetched from Teradata into a Spark DataFrame. We can check the DataFrame output now:

scala> df_prez_ALL.show(5)
 +--------------------+----------+-------------------+--------------------+----------+----------+
 |           pres_name|  pres_dob|            pres_bp|             pres_bs|   pres_in|  pres_out|
 +--------------------+----------+-------------------+--------------------+----------+----------+
 |     Harry S. Truman|1884-05-08|              Lamar|            Missouri|1945-04-12|1953-01-20|
 |    Grover Cleveland|1837-03-18|           Caldwell|          New Jersey|1885-03-04|1889-03-04|
 |     Abraham Lincoln|1809-02-12|     Sinking spring|            Kentucky|1861-03-04|1865-04-15|
 |    Grover Cleveland|1837-03-18|           Caldwell|          New Jersey|1893-03-04|1897-03-04|
 |     Calvin Coolidge|1872-07-04|           Plymouth|             Vermont|1923-08-02|1929-03-04|
 +--------------------+----------+-------------------+--------------------+----------+----------+
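One note: sqlContext.load has been deprecated since Spark 1.4. If you prefer the newer DataFrameReader API, an equivalent call would look like the sketch below (same connection details as above; the variable name df_prez_ALL2 is mine, just to keep it distinct):

// Equivalent load using the DataFrameReader API instead of the
// deprecated sqlContext.load
val df_prez_ALL2 = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:teradata://192.168.200.200/DBS_PORT=1025,TMODE=TERA,user=admin,password=admin")
  .option("dbtable", "db_lab.td_usa_prez")
  .option("driver", "com.teradata.jdbc.TeraDriver")
  .load()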

If you want selective output, or want to filter records rather than fetching the complete table, then pass a custom SQL query in parentheses and give it an alias name in the dbtable option. Spark pushes this query down to Teradata, so only the matching rows are transferred. See the example below:

val df_prez_NY = sqlContext.load("jdbc", Map(
  "url" -> "jdbc:teradata://192.168.200.200/DBS_PORT=1025,TMODE=TERA,user=admin,password=admin",
  "dbtable" -> "(select * from db_lab.td_usa_prez where pres_bs='New York') pres_1",
  "driver" -> "com.teradata.jdbc.TeraDriver"))
scala> df_prez_NY.show(10)
 +--------------------+----------+----------+--------+----------+----------+
 |           pres_name|  pres_dob|   pres_bp| pres_bs|   pres_in|  pres_out|
 +--------------------+----------+----------+--------+----------+----------+
 |    Martin Van Buren|1782-12-05|Kinderhook|New York|1837-03-04|1841-03-04|
 |Franklin D. Roose...|1882-01-30| Hyde Park|New York|1933-03-04|1945-04-12|
 |  Theodore Roosevelt|1858-10-27| Manhattan|New York|1901-09-14|1909-03-04|
 |        Donald Trump|1946-06-14|    Queens|New York|2017-01-20|      null|
 |    Millard Fillmore|1800-01-07|Summerhill|New York|1850-07-09|1853-03-04|
 +--------------------+----------+----------+--------+----------+----------+
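Coming back to my original use case of comparing the Teradata table with a Hive table: once both sides are DataFrames, the comparison itself is straightforward. Below is a minimal sketch, assuming a hypothetical Hive table db_lab.hive_usa_prez with the same columns (the Hive table name is my own placeholder, not part of the setup above):

// Load the Hive table as a DataFrame (requires Hive support, which
// spark-shell on HDP provides through sqlContext by default)
val df_hive = sqlContext.table("db_lab.hive_usa_prez")

// Rows present in Teradata but missing or different in Hive
val df_diff = df_prez_ALL.except(df_hive)
df_diff.show()

Depending on what "comparison" means for your data, a join on the key columns may be more appropriate than except, but this is enough to spot mismatched rows quickly.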

I have tried this on Spark 1.6.0 on HDP 2.4. Hope this helps. Let me know if you have any questions or comments.
