We are exploring various aspects of Apache Spark integration with DB2 and DB2 Connect drivers. We are documenting our experimentation for the benefit of others.
Currently Spark shell is available in Scala and Python. This article covers accessing DB2 data via Scala shell using DB2 supplied JDBC driver (IBM Data Server Driver for JDBC and SQLJ). Below are the step-by-step instructions, which will work for both DB2 z/OS and distributed.
1) Confirm that you have Java installed by running java -version from the Windows command line. Java Development Kit V1.7 or 1.8 is recommended.
2) Install Spark on local machine by downloading Spark here
3) We chose pre-built binaries (see Figure 1) instead of source code download to avoid building Spark in early experimentation phase.
4) Unzip the installation file to a local directory (e.g., C:/spark).
5) Start the Windows command prompt.
6) Navigate to the directory that has bin folder of Spark installation from step 4 (c:/spark/bin).
7) Download the DB2 JDBC driver jar (db2jcc.jar or db2jcc4.jar)
into C:\ or the location you desire.
8) Set spark_classpath to the location of the DB2 driver by running SET SPARK_CLASSPATH=c:\db2jcc.jar
9) Run the spark-shell.cmd script found in the bin folder to start Spark shell using Scala.
10) Upon a successful installation, you will see an output (see Figure 2) followed by a Scala prompt (see Figure 3).
11) In Figure 3, two important objects are already created for you:
- SparkContext: Any Spark application needs a SparkContext, which tells Spark how to access a cluster. In the shell mode, a SparkContext is already created for you in a variable called sc.
- SqlContext: This is needed to construct DataFrames (equivalent to relational tables) from database data and serves as the entry point for working with structured data.
12) Once Spark is running, you can issue queries to DB2 on all platforms through the DB2 JDBC driver. Tables from DB2 database can be loaded as a DataFrame using the following options on load:
- URL: The JDBC URL to connect (either z/OS or distributed database)
- dbtable: The JDBC table that should be read. Note that anything that is valid in a `FROM` clause of a SQL query can be used.
- driver: The class name of the JDBC driver needed to connect to this URL
13) From Scala command line, issue:
val employeeDF = sqlContext.load("jdbc", Map("url" -> "jdbc:db2://localhost:50000/sample:currentSchema=pallavipr;user=pallavipr;password=XXXXXX;","driver" -> "com.ibm.db2.jcc.DB2Driver","dbtable" -> "pallavipr.employee"))
14) Output containing the table metadata is shown in Figure 4.
15) To see the contents of the EMPLOYEE table, issue employeeDF.show() from Scala com-mand line, which shows the contents of the DataFrame as captured in Screenshot 5. Show() returns first 20 records from the table by default (out of ~40 rows that exist).
16) You can further narrow the search results above by using filter criteria. I.e., if you want to see only columns employee id, firstname, lastname and job title out of all existing columns, you will issue for results (see Figure 6): employ-eeDF.select("empno","firstnme","lastname",”job”).show()
17) Now if you want to filter out only those rows that have job title DESIGNER (see Figure 7), issue the following from Scala shell:
This article describes how data from DB2 z/OS data and DB2 distributed can be loaded into Spark easily and accessed via Scala shell. Load of DB2 data into Spark opens up many use cases, such as Spark being used to aggregate data across DB2 and other data stores. Follow-ups to this article will illustrate how DB2 data can be loaded in Spark using standalaone Scala or Java programs and can be aggregated with different sources of data such as JSON.
Pallavi Priyadarshini is the architect and product manager of DB2 Connect, IBM. Her current focus is set-ting strategic roadmap for DB2 Connect in cloud and mobile. Pallavi has more than 14 years experience in developing high performance and highly available data-centric solutions for enterprise customers. She has developed core functionalities in DB2 server engine in IBM Silicon Valley Labs. She has also developed Web based and security applications as part of two Silicon Valley startups. Pallavi has authored many pa-tents and blogs on current technology topics and is a regular speaker in global conferences. Pallavi has completed her master’s in Computer Science from San Jose State University, California, and bachelor’s in Computer Science from Nanyang Technological University, Singapore.
Parameswara R. Tatini has more than 10 years of experience in software product development. As a tech-nical lead at IBM, Param has developed core functionalities in the DB2 JDBC Drivers, DB2 Server Admin-istration tools and zOS Connect. He has experience in developing solutions for Geographical Information Systems and Spatial Analysis. Param has authored patents in databases systems. He is passionate about cloud, Bluemix, Spark and emerging technologies. He is the organizer for the 1,000-member IBM Bluemix Meetups Bangalore chapter. He has a master’s in Software Systems and bachelor’s in Information Systems from Birla Institute of Technology and Science in Pilani, India.