Connect to Teradata from Spark

I was working on a project recently which involved data migration from Teradata to Hadoop. Most of the data migration was done using sqoop. However there was a requirement to do a quick lookup into Teradata table and do some processing in Spark. For this, I wanted to use Spark as it involves comparing data in Teradata table with HIVE table. So I connected Teradata via JDBC and created a dataframe from Teradata table. Next I created a dataframe from Hive table and did comparison. Let’s see an example below for connecting Teradata to Spark directly via JDBC connection.

Step 1: We will need 3 jar files for creating connection to Teradata from Spark.First, terajdbc4.jar. Second, tdgssconfig.jar. Third, teradata-connector-1.4.4.jar. You can download these JAR files from teradata official website. Once downloaded keep the files in some local directory like /tmp/…

Step 2: Initiate Spark session by invoking spark-shell and pass all 3 Jar files path
spark-shell --jars /tmp/raj/terajdbc4.jar,/tmp/raj/tdgssconfig.jar,/tmp/raj/teradata-connector-1.4.4.jar

Step 3: Create a dataframe which will hold Teradata Output. We will use load method and will pass require properties like url, username, password, driver etc to establish connection.
val df_prez_ALL = sqlContext.load(“jdbc”, Map(“url” -> “jdbc:teradata://, TMODE=TERA, user=admin, password=admin”,”dbtable” -> “db_lab.td_usa_prez”,”driver” -> “com.teradata.jdbc.TeraDriver”));

That’s it. Now you have fetch required data from Teradata to Spark Dataframe. We can check dataframe output now:

| pres_name| pres_dob| pres_bp| pres_bs| pres_in| pres_out|
| Harry S. Truman|1884-05-08| Lamar| Missouri|1945-04-12|1953-01-20|
| Grover Cleveland|1837-03-18| Caldwell| New Jersey|1885-03-04|1889-03-04|
| Abraham Lincoln|1809-02-12| Sinking spring| Kentucky|1861-03-04|1865-04-15|
| Grover Cleveland|1837-03-18| Caldwell| New Jersey|1893-03-04|1897-03-04|
| Calvin Coolidge|1872-07-04| Plymouth| Vermont|1923-08-02|1929-03-04|
| Andrew Jackson|1767-03-15| Waxhaws Region|South/North Carolina|1829-03-04|1837-03-04|
| George W. Bush|1946-07-06| New Haven| Connecticut|2001-01-20|2009-01-20|
|Franklin D. Roose…|1882-01-30| Hyde Park| New York|1933-03-04|1945-04-12|
| Martin Van Buren|1782-12-05| Kinderhook| New York|1837-03-04|1841-03-04|
| Zachary Taylor|1784-11-24| Barboursville| Virginia|1849-03-04|1850-07-09|
| James A. Garfield|1831-11-19| Moreland Hills| Ohio|1881-03-04|1881-09-19|
| Jimmy Carter|1924-10-01| Plains| Georgia|1977-01-20|1981-01-20|
| Theodore Roosevelt|1858-10-27| Manhattan| New York|1901-09-14|1909-03-04|
| James Buchanan|1791-04-23| Cove Gap| Pennsylvania|1857-03-04|1861-03-04|
| John Tyler|1790-03-29|Charles City County| Virginia|1841-04-04|1845-03-04|
| William McKinley|1843-01-29| Niles| Ohio|1897-03-04|1901-09-14|
| Richard M. Nixon|1913-01-09| Yorba Linda| California|1969-01-20|1974-08-09|
| Thomas Jefferson|1743-04-13| Shadwell| Virginia|1801-03-04|1809-03-04|
| Lyndon B. Johnson|1908-08-27| Stonewall| Texas|1963-11-22|1969-01-20|
| George Washington|1732-02-22|Westmoreland County| Virginia|1789-04-30|1797-03-04|
| Herbert Hoover|1874-08-10| West Branch| Iowa|1929-03-04|1933-03-04|
| James Madison|1751-03-16| Port Conway| Virginia|1809-03-04|1817-03-04|
|William Henry Har…|1773-02-09|Charles City County| Virginia|1841-03-04|1841-04-04|
| Warren G. Harding|1865-11-02| Blooming Grove| Ohio|1921-03-04|1923-08-02|
| George H. W. Bush|1924-06-12| Milton| Massachusetts|1989-01-20|1993-01-20|
| Chester A. Arthur|1829-10-05| Fairfield| Vermont|1881-09-19|1885-03-04|
| Benjamin Harrison|1833-08-20| North Bend| Ohio|1889-03-04|1893-03-04|
| John Adams|1735-10-30| Braintree| Massachusetts|1797-03-04|1801-03-04|
| Woodrow Wilson|1856-12-28| Staunton| Virginia|1913-03-04|1921-03-04|
| Donald Trump|1946-06-14| Queens| New York|2017-01-20| null|
| Bill Clinton|1946-08-19| Hope| Arkansas|1993-01-20|2001-01-20|
| James Monroe|1758-04-28| Monroe Hall| Virginia|1817-03-04|1825-03-04|
| William Howard Taft|1857-09-15| Cincinnati| Ohio|1909-03-04|1913-03-04|
|Dwight D. Eisenhower|1890-10-14| Denison| Texas|1953-01-20|1961-01-20|
| John F. Kennedy|1917-05-29| Brookline| Massachusetts|1961-01-20|1963-11-22|
| James K. Polk|1795-11-02| Pineville| North Carolina|1845-03-04|1849-03-04|
| Rutherford B. Hayes|1822-10-04| Delaware| Ohio|1877-03-04|1881-03-04|
| Ronald Reagan|1911-02-06| Tampico| Illinois|1981-01-20|1989-01-20|
| Gerald R. Ford|1913-07-14| Omaha| Nebraska|1974-08-09|1977-01-20|
| John Quincy Adams|1767-07-11| Braintree| Massachusetts|1825-03-04|1829-03-04|
| Franklin Pierce|1804-11-23| Hillsborough| New Hampshire|1853-03-04|1857-03-04|
| Ulysses S. Grant|1822-04-27| Point Pleasant| Ohio|1869-03-04|1877-03-04|
| Millard Fillmore|1800-01-07| Summerhill| New York|1850-07-09|1853-03-04|
| Andrew Johnson|1808-12-29| Raleigh| North Carolina|1865-04-15|1869-03-04|
| Barack Obama|1961-08-04| Honolulu| Hawaii|2009-01-20|2017-01-20|

If you want selective output or want to filter records rather than fetching complete table then pass custom SQL in parenthesis and give it a alias name in dbtable option. See example below:

val df_prez_NY = sqlContext.load(“jdbc”, Map(“url” -> “jdbc:teradata://, TMODE=TERA, user=admin, password=admin”,”dbtable” -> “(select * from db_lab.td_usa_prez where pres_bs=’New York’) pres_1“,”driver” -> “com.teradata.jdbc.TeraDriver”));

| pres_name| pres_dob| pres_bp| pres_bs| pres_in| pres_out|
| Martin Van Buren|1782-12-05|Kinderhook|New York|1837-03-04|1841-03-04|
|Franklin D. Roose…|1882-01-30| Hyde Park|New York|1933-03-04|1945-04-12|
| Theodore Roosevelt|1858-10-27| Manhattan|New York|1901-09-14|1909-03-04|
| Donald Trump|1946-06-14| Queens|New York|2017-01-20| null|
| Millard Fillmore|1800-01-07|Summerhill|New York|1850-07-09|1853-03-04|

I have tried this on Spark 1.6.0 on HDP 2.4. Hope this helps. Let me know if any query/comment.

You may also like...

Leave a Reply

Your email address will not be published. Required fields are marked *