Originally posted by: Steve Haertel
We previously introduced Data Connectors in IBM Spectrum Conductor as a way to conveniently connect your Spark applications with your data sources. While a number of Data Connectors come out of the box, you may want to know how to make your own! This blog describes how you can do that, using MapR XD (formerly MapR-FS) as an example.
If you want to connect Spark to MapR XD (version 6.1.0 in this example), create a Data Connector once per Spark version (2.2.x versus 2.3.x) and let IBM Spectrum Conductor automatically handle the movement of the .jar and .xml files.
Here's how you get started on each management host in your cluster:
- Log in to your management host as the CLUSTERADMIN user and go to the $EGO_CONFDIR/../../conductorspark/conf/dataconnectors/ directory. (We want to be the CLUSTERADMIN user so that the files we create match the ownership and permissions of the existing data connectors.)
- In the types directory, create a new type and give it metadata similar to that of the existing types. The name that you choose here is used in subsequent steps.
File:
MapR_fs.yml
File content:
type: MapR_fs
displayname: MapR file system
maxactive: -1
- Create a new directory in the dataconnectors directory with the name that you used in step 2, followed by a dash ('-') and a version number, for example: MapR_fs-6.1.0. This directory lives beside the other similarly formatted directories.
- Inside the new Data Connector directory:
- Create a directory called lib and copy the JAR files required for your Spark version to this directory.
Spark 2.3.x - Requires JAR files from MapR 6.1.0
Note: Some of these JAR files are provided only by the MapR client, while others come from other MapR packages such as mapr-spark, mapr-hbase, and mapr-kafka.
/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/hdfs/hadoop-hdfs-2.7.0-mapr-1808.jar
/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/common/lib/hadoop-auth-2.7.0-mapr-1808.jar
/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/common/hadoop-common-2.7.0-mapr-1808.jar
/opt/mapr/lib/json-1.8.jar
/opt/mapr/lib/maprdb-6.1.0-mapr.jar
/opt/mapr/lib/maprfs-6.1.0-mapr.jar
/opt/mapr/lib/mapr-hbase-6.1.0-mapr.jar
/opt/mapr/lib/mapr-ojai-driver-6.1.0-mapr.jar
/opt/mapr/lib/ojai-3.0-mapr-1808.jar
/opt/mapr/lib/ojai-mapreduce-3.0-mapr-1808.jar
/opt/mapr/lib/ojai-scala-3.0-mapr-1808.jar
/opt/mapr/lib/zookeeper-3.4.11-mapr-1808.jar
/opt/mapr/spark/spark-2.3.2/jars/hive-exec-1.2.0-mapr-spark-MEP-6.0.0-1901.jar
/opt/mapr/spark/spark-2.3.2/jars/hive-metastore-1.2.0-mapr-spark-MEP-6.0.0-1901.jar
/opt/mapr/spark/spark-2.3.2/jars/maprdb-spark-2.3.2.0-mapr-1901.jar
/opt/mapr/hbase/hbase-1.1.8/lib/hbase-annotations-1.1.8-mapr-1901.jar
/opt/mapr/hbase/hbase-1.1.8/lib/hbase-client-1.1.8-mapr-1901.jar
/opt/mapr/hbase/hbase-1.1.8/lib/hbase-common-1.1.8-mapr-1901.jar
/opt/mapr/hbase/hbase-1.1.8/lib/hbase-hadoop2-compat-1.1.8-mapr-1901.jar
/opt/mapr/hbase/hbase-1.1.8/lib/hbase-hadoop-compat-1.1.8-mapr-1901.jar
/opt/mapr/hbase/hbase-1.1.8/lib/hbase-it-1.1.8-mapr-1901.jar
/opt/mapr/hbase/hbase-1.1.8/lib/hbase-prefix-tree-1.1.8-mapr-1901.jar
/opt/mapr/hbase/hbase-1.1.8/lib/hbase-procedure-1.1.8-mapr-1901.jar
/opt/mapr/hbase/hbase-1.1.8/lib/hbase-protocol-1.1.8-mapr-1901.jar
/opt/mapr/hbase/hbase-1.1.8/lib/hbase-resource-bundle-1.1.8-mapr-1901.jar
/opt/mapr/hbase/hbase-1.1.8/lib/hbase-rest-1.1.8-mapr-1901.jar
/opt/mapr/hbase/hbase-1.1.8/lib/hbase-server-1.1.8-mapr-1901.jar
/opt/mapr/hbase/hbase-1.1.8/lib/hbase-spark-1.1.8-mapr-1901.jar
/opt/mapr/hbase/hbase-1.1.8/lib/hbase-thrift-1.1.8-mapr-1901.jar
Additional JAR files required if using Kafka functionality with MapR:
/opt/mapr/kafka/kafka-1.1.1/libs/connect-api-1.1.1-mapr-1901.jar
/opt/mapr/kafka/kafka-1.1.1/libs/connect-file-1.1.1-mapr-1901.jar
/opt/mapr/kafka/kafka-1.1.1/libs/connect-json-1.1.1-mapr-1901.jar
/opt/mapr/kafka/kafka-1.1.1/libs/connect-runtime-1.1.1-mapr-1901.jar
/opt/mapr/kafka/kafka-1.1.1/libs/connect-transforms-1.1.1-mapr-1901.jar
/opt/mapr/kafka/kafka-1.1.1/libs/kafka-clients-1.1.1-mapr-1901.jar
/opt/mapr/kafka/kafka-1.1.1/libs/kafka-log4j-appender-1.1.1-mapr-1901.jar
/opt/mapr/kafka/kafka-1.1.1/libs/kafka-streams-1.1.1-mapr-1901.jar
/opt/mapr/kafka/kafka-1.1.1/libs/kafka-tools-1.1.1-mapr-1901.jar
/opt/mapr/kafka/kafka-1.1.1/libs/mapr-streams-6.1.0-mapr.jar
/opt/mapr/spark/spark-2.3.2/jars/spark-streaming-kafka-0-9_2.11-2.3.2.0-mapr-1901.jar
/opt/mapr/spark/spark-2.3.2/jars/spark-streaming-kafka-0-10_2.11-2.3.2.0-mapr-1901.jar
/opt/mapr/spark/spark-2.3.2/jars/spark-streaming_2.11-2.3.2.0-mapr-1901.jar
/opt/mapr/spark/spark-2.3.2/jars/spark-sql-kafka-0-10_2.11-2.3.2.0-mapr-1901.jar
Note: Depending on the requirements of your applications, you may also want to download and use the following JAR file (for example, to produce kafka streams): https://repository.mapr.com/nexus/content/groups/mapr-public/org/apache/spark/spark-streaming-kafka-producer_2.11/2.3.2.0-mapr-1901/spark-streaming-kafka-producer_2.11-2.3.2.0-mapr-1901.jar
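If you plan to consume MapR Event Store (MapR Streams) topics from Spark, the following minimal PySpark sketch shows where these Kafka JAR files come into play. It is illustrative only: the /sample-stream:sensor-data path is a hypothetical stream and topic, and it assumes the Kafka JAR files listed above are on your application's classpath.
# Minimal sketch: read a hypothetical MapR Event Store stream through the Spark Kafka source.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("maprStreamsRead").getOrCreate()

df = (spark.readStream
      .format("kafka")
      # bootstrap.servers is required by the Spark Kafka source API; with MapR Event Store
      # the target is generally resolved from the stream path, so the value itself is not used.
      .option("kafka.bootstrap.servers", "unused:9092")
      .option("subscribe", "/sample-stream:sensor-data")  # hypothetical /stream:topic path
      .load())

# Print the message payloads to the console as they arrive
query = (df.selectExpr("CAST(value AS STRING) AS value")
         .writeStream
         .format("console")
         .start())
query.awaitTermination()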
- Create a directory called sbin. Inside that directory, create a Python script called healthcheck.py. This script determines whether the file system is up or down; for example, it runs when you click the status check button in the Data Connectors tab on a Spark Instance Group's details page.
File:
healthcheck.py
File content:
#!/usr/bin/python
import sys, os, stat
from pyspark import SparkContext

# Create log file
logfilename = 'statusCheck.log'
if len(sys.argv) > 1:
    logfilename = sys.argv[1]
logfile = open(logfilename, 'w')

# Set file permissions
current_perm = stat.S_IMODE(os.lstat(logfilename).st_mode)
os.chmod(logfilename, current_perm | stat.S_IROTH)
sys.stdout = sys.stderr = logfile

try:
    sc = SparkContext()
    fs = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem.get(sc._gateway.jvm.org.apache.hadoop.conf.Configuration())
except Exception as e:
    print("Exception")
    exit(1)

if fs.exists(sc._gateway.jvm.org.apache.hadoop.fs.Path('/')):
    print("Online")
    sc.stop()
    exit(0)
else:
    print("Offline")
    sc.stop()
    exit(1)
- Create a file called metadata.yml in your data connector directory, as a sibling of the lib and sbin directories. This file dynamically adds properties to the configuration and gives the user some options. Put the following content in metadata.yml, making sure that the type and version fields match the values you used in previous steps:
Note: Setting a property's 'required' value to 'true' lets the user define or edit that property during Spark Instance Group registration. To prevent users from modifying a value, set 'required' to 'hidden' instead.
Also, be careful when copying/pasting this example content. YAML files require proper indentation with spaces.
These values are based on the default MapR XD installation paths. Feel free to use your own default values and property descriptions.
File:
metadata.yml
File content:
type: MapR_fs
version: 6.1.0
supportedsparkversions:
  - 2.3.0
  - 2.3.1
timestamp: 0
configurations:
  - type: hadoop-coresite
    filename: mapr-site.xml
    properties:
      - name: HiveMetastoreUris
        propertykeys:
          - hive.metastore.uris
        displayname: Hive Metastore URIs
        required: true
        inputtype: string
        allowedpattern: ^(http(s)?:\/\/.)?(www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}(\.[a-z]{2,6})?\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)$
        defaultvalue: thrift://desiredMapRfsServerDefault:9083
      - name: JDBCConnectionURL
        propertykeys:
          - javax.jdo.option.ConnectionURL
        displayname: JDBC Connection
        description: JDBC connect string for a JDBC metastore
        required: true
        inputtype: string
        defaultvalue: jdbc:derby:;databaseName=/opt/mapr/hive/hive-2.1/metastore_db;create=true
      - name: SASLEnabled
        propertykeys:
          - hive.metastore.sasl.enabled
        displayname: Hive SASL Enabled
        required: true
        inputtype: string
        defaultvalue: true
      - name: SASLqop
        propertykeys:
          - hive.server2.thrift.sasl.qop
        displayname: Hive SASL qop
        required: true
        inputtype: string
        defaultvalue: auth-conf
      - name: ExecSetugi
        propertykeys:
          - hive.metastore.execute.setugi
        displayname: Hive Execute setugi
        required: true
        inputtype: string
        defaultvalue: false
      - name: WebUiUsePam
        propertykeys:
          - hive.server2.webui.use.pam
        displayname: Hive Webui Use Pam
        required: true
        inputtype: string
        defaultvalue: true
      - name: WebUiUseSSL
        propertykeys:
          - hive.server2.webui.use.ssl
        displayname: Hive Webui Use SSL
        required: true
        inputtype: string
        defaultvalue: true
      - name: WebUiKeystorePath
        propertykeys:
          - hive.server2.webui.keystore.path
        displayname: Hive Webui Keystore Path
        required: true
        inputtype: string
        defaultvalue: /opt/mapr/conf/ssl_keystore
      - name: WebUiKeystorePwd
        propertykeys:
          - hive.server2.webui.keystore.password
        displayname: Hive Webui Keystore Password
        required: true
        inputtype: string
        defaultvalue: mapr123
      - name: Authentication
        propertykeys:
          - hive.server2.authentication
        displayname: Hive Authentication
        required: true
        inputtype: string
        defaultvalue: MAPRSASL
      - name: HbaseRootDir
        propertykeys:
          - hbase.rootdir
        displayname: Hbase Root Dir
        required: true
        inputtype: string
        defaultvalue: maprfs:///hbase
      - name: HbaseClusterDist
        propertykeys:
          - hbase.cluster.distributed
        displayname: Hbase Cluster Distributed
        required: true
        inputtype: string
        defaultvalue: true
      - name: HbaseZKQuorum
        propertykeys:
          - hbase.zookeeper.quorum
        displayname: Hbase Zookeeper Quorum
        required: true
        inputtype: string
        defaultvalue: desiredZookeeperQuorumHost
      - name: HbaseZKClientPort
        propertykeys:
          - hbase.zookeeper.property.clientPort
        displayname: Hbase Zookeeper Client Port
        required: true
        inputtype: string
        defaultvalue: 5181
      - name: DfsSupportAppend
        propertykeys:
          - dfs.support.append
        displayname: DFS Support Append
        required: true
        inputtype: string
        defaultvalue: true
      - name: HbaseFSUtilMapRFSImpl
        propertykeys:
          - hbase.fsutil.maprfs.impl
        displayname: Hbase FS Util MapR FS Impl
        required: true
        inputtype: string
        defaultvalue: org.apache.hadoop.hbase.util.FSMapRUtils
      - name: HbaseRegionServerHandlerCount
        propertykeys:
          - hbase.regionserver.handler.count
        displayname: Hbase Region Server Handler Count
        required: true
        inputtype: string
        defaultvalue: 30
      - name: MapRFSThreads
        propertykeys:
          - fs.mapr.threads
        displayname: MapR FS threads
        description: Allows file/db client to use this many threads
        required: true
        inputtype: string
        defaultvalue: 64
      - name: HbaseMapRDefaultDB
        propertykeys:
          - mapr.hbase.default.db
        displayname: MapR Hbase default db
        required: true
        inputtype: string
        defaultvalue: maprdb
      - name: HbaseSecCliProtocolACL
        propertykeys:
          - security.client.protocol.acl
        displayname: Client/Admin Protocol impl ACL
        required: hidden
        inputtype: string
        defaultvalue: "*"
      - name: HbaseSecAdminProtocolACL
        propertykeys:
          - security.admin.protocol.acl
        displayname: HMasterInterface protocol impl ACL
        required: hidden
        inputtype: string
        defaultvalue: "*"
      - name: HbaseSecMstrRgnProtocolACL
        propertykeys:
          - security.masterregion.protocol.acl
        displayname: HMasterRegionInterface protocol impl ACL
        required: hidden
        inputtype: string
        defaultvalue: "*"
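Because YAML indentation mistakes are easy to make, you may want to run a quick local check before restarting anything. This optional snippet is not part of the data connector itself and assumes that PyYAML is installed on the host; it simply parses metadata.yml and prints a short summary:
# Optional sanity check: parse metadata.yml to catch YAML indentation errors early.
import yaml

with open("metadata.yml") as f:
    meta = yaml.safe_load(f)

print("type:", meta["type"], "version:", meta["version"])
print("properties defined:", len(meta["configurations"][0]["properties"]))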
- Restart your ascd service, and check your $EGO_TOP/ascd/logs/ascd.hostname log file (default location) to make sure that the ascd service parsed the data connector files successfully at startup.
- Register a Spark Instance Group:
- Add your new Data Connector as an entry in the “Data Connectors” tab on the Spark instance group registration wizard.
Note: With the current example metadata.yml, you cannot use the “default FS” feature with this data connector. However, MapR XD does support referencing files with hdfs:///[file] or maprfs:///[file]; refer to the MapR documentation for details.
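For example, once your Spark Instance Group is registered and started with this data connector, a simple smoke test like the following should be able to read from MapR XD (the path /user/demo/sample.txt is hypothetical; substitute a file that exists on your file system):
# Illustrative smoke test: read a (hypothetical) text file from MapR XD.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("maprfsSmokeTest").getOrCreate()
df = spark.read.text("maprfs:///user/demo/sample.txt")
print("line count:", df.count())
spark.stop()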
- If you are using IBM Spectrum Conductor 2.3.0, you can optionally add the following additional parameters in the Spark configuration wizard to enable Hive support:
Note: If you are using an earlier version of IBM Spectrum Conductor, or if you do not want to apply these configuration parameters to all Spark application submissions, you can set them at Spark application submission time by using the --conf flag.
Parameter: spark.sql.hive.metastore.sharedPrefixes
Parameter Value: com.mysql.jdbc,org.postgresql,com.microsoft.sqlserver,oracle.jdbc,com.mapr.fs.shim.LibraryLoader,com.mapr.security.JNISecurity,com.mapr.fs.jni,com.mapr.fs.shim
Parameter: spark.sql.catalogImplementation
Parameter Value: hive
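If you go the per-application route described in the note above, you can set the same two parameters when the SparkSession is built, which is equivalent to passing them with --conf at submission time. This is only a sketch; adapt it to your own application:
# Sketch: set the Hive-related parameters per application instead of in the Spark Instance Group.
from pyspark.sql import SparkSession

shared_prefixes = ",".join([
    "com.mysql.jdbc", "org.postgresql", "com.microsoft.sqlserver", "oracle.jdbc",
    "com.mapr.fs.shim.LibraryLoader", "com.mapr.security.JNISecurity",
    "com.mapr.fs.jni", "com.mapr.fs.shim",
])

spark = (SparkSession.builder
         .appName("hiveOnMaprXD")
         .config("spark.sql.hive.metastore.sharedPrefixes", shared_prefixes)
         .enableHiveSupport()   # sets spark.sql.catalogImplementation to hive
         .getOrCreate())

spark.sql("SHOW DATABASES").show()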
And that's it! If you want to know more about data connectors, check out our online IBM Knowledge Center.
We'd love to hear from you. If you've got comments or questions, reach out to us on our Slack channel!
#SpectrumComputingGroup