I was recently working on a project that involved migrating data from Teradata to Hadoop. Most of the migration was done using Sqoop. However, one requirement was to do a quick lookup into a Teradata table and then do some processing in Spark. I wanted to use Spark for this because it involved comparing data in a Teradata table with a Hive table. So I connected to Teradata via JDBC and created a DataFrame from the Teradata table, then created a DataFrame from the Hive table and did the comparison (a rough sketch of that comparison appears at the end of this post). Let's walk through an example of connecting Teradata to Spark directly via a JDBC connection.
Step 1: Download the JAR files
We will need three JAR files to create a connection to Teradata from Spark:
- terajdbc4.jar
- tdgssconfig.jar
- teradata-connector-1.4.4.jar
You can download these JAR files from the official Teradata website. Once downloaded, keep the files in a local directory like /tmp/…
Step 2: Launch spark-shell and pass all three JAR files
spark-shell --jars /tmp/raj/terajdbc4.jar,/tmp/raj/tdgssconfig.jar,/tmp/raj/teradata-connector-1.4.4.jar
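Once the shell is up, an optional sanity check (my own habit, not a required step) is to load the driver class by name; if the JARs were not picked up on the classpath, this will throw a ClassNotFoundException:

scala> Class.forName("com.teradata.jdbc.TeraDriver")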
Step 3: Use Spark JDBC to load a DataFrame
Create a DataFrame that will hold the Teradata output. We will use the load method and pass the required properties like URL, username, password, and driver to establish the connection.
val df_prez_ALL = sqlContext.load("jdbc", Map("url" -> "jdbc:teradata://192.168.200.200/DBS_PORT=1025, TMODE=TERA, user=admin, password=admin","dbtable" -> "db_lab.td_usa_prez","driver" -> "com.teradata.jdbc.TeraDriver"));
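As a side note, sqlContext.load has been deprecated since Spark 1.4. If you prefer, the same connection can be written with the DataFrameReader API; this is just an alternative form of the call above, with identical connection properties:

val df_prez_ALL = sqlContext.read.format("jdbc")
  .option("url", "jdbc:teradata://192.168.200.200/DBS_PORT=1025, TMODE=TERA, user=admin, password=admin")
  .option("dbtable", "db_lab.td_usa_prez")
  .option("driver", "com.teradata.jdbc.TeraDriver")
  .load()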
That's it. Now you have fetched the required data from Teradata into a Spark DataFrame. Let's check the DataFrame output:
scala> df_prez_ALL.show(5)
+--------------------+----------+-------------------+--------------------+----------+----------+
|           pres_name|  pres_dob|            pres_bp|             pres_bs|   pres_in|  pres_out|
+--------------------+----------+-------------------+--------------------+----------+----------+
|     Harry S. Truman|1884-05-08|              Lamar|            Missouri|1945-04-12|1953-01-20|
|    Grover Cleveland|1837-03-18|           Caldwell|          New Jersey|1885-03-04|1889-03-04|
|     Abraham Lincoln|1809-02-12|     Sinking spring|            Kentucky|1861-03-04|1865-04-15|
|    Grover Cleveland|1837-03-18|           Caldwell|          New Jersey|1893-03-04|1897-03-04|
|     Calvin Coolidge|1872-07-04|           Plymouth|             Vermont|1923-08-02|1929-03-04|
+--------------------+----------+-------------------+--------------------+----------+----------+
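Since the goal here is to do some processing in Spark, one handy follow-up is to register the DataFrame as a temporary table so it can be queried with SQL inside Spark. The table name td_prez below is just an illustrative choice:

df_prez_ALL.registerTempTable("td_prez")
sqlContext.sql("select pres_name, pres_in from td_prez where pres_bs = 'Missouri'").show()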
If you want selective output, or want to filter records rather than fetching the complete table, then pass a custom SQL query in parentheses and give it an alias in the dbtable option. See the example below:
val df_prez_NY = sqlContext.load("jdbc", Map("url" -> "jdbc:teradata://192.168.200.200/DBS_PORT=1025, TMODE=TERA, user=admin, password=admin","dbtable" -> "(select * from db_lab.td_usa_prez where pres_bs='New York') pres_1","driver" -> "com.teradata.jdbc.TeraDriver"));
scala> df_prez_NY.show(10)
+--------------------+----------+----------+--------+----------+----------+
|           pres_name|  pres_dob|   pres_bp| pres_bs|   pres_in|  pres_out|
+--------------------+----------+----------+--------+----------+----------+
|    Martin Van Buren|1782-12-05|Kinderhook|New York|1837-03-04|1841-03-04|
|Franklin D. Roose...|1882-01-30| Hyde Park|New York|1933-03-04|1945-04-12|
|  Theodore Roosevelt|1858-10-27| Manhattan|New York|1901-09-14|1909-03-04|
|        Donald Trump|1946-06-14|    Queens|New York|2017-01-20|      null|
|    Millard Fillmore|1800-01-07|Summerhill|New York|1850-07-09|1853-03-04|
+--------------------+----------+----------+--------+----------+----------+
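Coming back to the comparison with Hive mentioned at the start: once the Teradata data is in a DataFrame, the Hive side can be loaded the same way and compared with standard DataFrame operations. Below is a rough sketch, assuming a Hive table db_hive.usa_prez with the same schema (the table name is hypothetical); except returns the rows present in the Teradata extract but missing from Hive:

// sqlContext in spark-shell on HDP is a HiveContext, so Hive tables are visible
val df_hive = sqlContext.table("db_hive.usa_prez")
// rows present in the Teradata extract but not in the Hive table
val df_only_in_td = df_prez_ALL.except(df_hive)
df_only_in_td.show()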
I have tried this on Spark 1.6.0 on HDP 2.4. Hope this helps. Let me know if you have any questions or comments.