How to connect IBM Spectrum Conductor™ Spark packages to MapR file system

By Archive User posted Fri May 10, 2019 02:28 PM

  

Originally posted by: Steve Haertel


We previously introduced Data Connectors in IBM Spectrum Conductor as a way to conveniently connect your Spark applications with your data sources. While we have a list of Data Connectors that come out of the box, you may want to know how to make your own! This blog describes how you can do that, using MapR XD (formerly MapR-FS) as an example.

If you want to connect Spark to MapR XD (version 6.1.0 in this example), create one Data Connector per Spark version (2.2.x versus 2.3.x) and let IBM Spectrum Conductor automatically handle the movement of the .jar and .xml files.

Here's how you get started on each management host in your cluster:

  1. Log in to your management host as the CLUSTERADMIN and go to the $EGO_CONFDIR/../../conductorspark/conf/dataconnectors/ directory. (We want to be the CLUSTERADMIN user so that the files that we create match the ownership and permissions of the existing data connectors.)
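
For example, assuming the cluster administrator operating system account is egoadmin (yours may differ) and the Conductor environment, which sets EGO_CONFDIR, is sourced in that account's shell profile, step 1 might look like this:

# Switch to the cluster administrator account and go to the data connector
# configuration directory (egoadmin is an example account name).
su - egoadmin
cd $EGO_CONFDIR/../../conductorspark/conf/dataconnectors/
ls types/    # existing data connector type definitions live here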
     
  2. In the /types directory, create a new type, and give it metadata similar to existing types. The name is important for subsequent steps.
File:
MapR_fs.yml
 
File content:
type: MapR_fs
displayname: MapR file system
maxactive: -1
  3. Create a new directory in the dataconnectors directory with the name you used in step 2, followed by a dash ('-'), then by a version number, for example: MapR_fs-6.1.0. This directory will live beside other similarly formatted directories.
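
For example, run from the dataconnectors directory, the following creates the versioned directory together with the lib and sbin subdirectories that the next step populates (the name and version must match your type definition):

# Create the versioned Data Connector directory plus its lib and sbin subdirectories.
mkdir -p MapR_fs-6.1.0/lib MapR_fs-6.1.0/sbin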
     
  4. Inside the new Data Connector directory:
    1. Create a directory called lib and copy the JAR files required for your Spark version into it (see the example copy command after the JAR lists below).

 

Spark 2.3.x - Requires JAR files from MapR 6.1.0

Note: Some of these JAR files are provided by the MapR client itself, while others are provided by additional MapR packages such as mapr-spark, mapr-hbase, and mapr-kafka.

 

/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/hdfs/hadoop-hdfs-2.7.0-mapr-1808.jar

/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/common/lib/hadoop-auth-2.7.0-mapr-1808.jar

/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/common/hadoop-common-2.7.0-mapr-1808.jar

/opt/mapr/lib/json-1.8.jar

/opt/mapr/lib/maprdb-6.1.0-mapr.jar

/opt/mapr/lib/maprfs-6.1.0-mapr.jar

/opt/mapr/lib/mapr-hbase-6.1.0-mapr.jar

/opt/mapr/lib/mapr-ojai-driver-6.1.0-mapr.jar

/opt/mapr/lib/ojai-3.0-mapr-1808.jar

/opt/mapr/lib/ojai-mapreduce-3.0-mapr-1808.jar

/opt/mapr/lib/ojai-scala-3.0-mapr-1808.jar

/opt/mapr/lib/zookeeper-3.4.11-mapr-1808.jar

/opt/mapr/spark/spark-2.3.2/jars/hive-exec-1.2.0-mapr-spark-MEP-6.0.0-1901.jar

/opt/mapr/spark/spark-2.3.2/jars/hive-metastore-1.2.0-mapr-spark-MEP-6.0.0-1901.jar

/opt/mapr/spark/spark-2.3.2/jars/maprdb-spark-2.3.2.0-mapr-1901.jar

/opt/mapr/hbase/hbase-1.1.8/lib/hbase-annotations-1.1.8-mapr-1901.jar

/opt/mapr/hbase/hbase-1.1.8/lib/hbase-client-1.1.8-mapr-1901.jar

/opt/mapr/hbase/hbase-1.1.8/lib/hbase-common-1.1.8-mapr-1901.jar

/opt/mapr/hbase/hbase-1.1.8/lib/hbase-hadoop2-compat-1.1.8-mapr-1901.jar

/opt/mapr/hbase/hbase-1.1.8/lib/hbase-hadoop-compat-1.1.8-mapr-1901.jar

/opt/mapr/hbase/hbase-1.1.8/lib/hbase-it-1.1.8-mapr-1901.jar

/opt/mapr/hbase/hbase-1.1.8/lib/hbase-prefix-tree-1.1.8-mapr-1901.jar

/opt/mapr/hbase/hbase-1.1.8/lib/hbase-procedure-1.1.8-mapr-1901.jar

/opt/mapr/hbase/hbase-1.1.8/lib/hbase-protocol-1.1.8-mapr-1901.jar

/opt/mapr/hbase/hbase-1.1.8/lib/hbase-resource-bundle-1.1.8-mapr-1901.jar

/opt/mapr/hbase/hbase-1.1.8/lib/hbase-rest-1.1.8-mapr-1901.jar

/opt/mapr/hbase/hbase-1.1.8/lib/hbase-server-1.1.8-mapr-1901.jar

/opt/mapr/hbase/hbase-1.1.8/lib/hbase-spark-1.1.8-mapr-1901.jar

/opt/mapr/hbase/hbase-1.1.8/lib/hbase-thrift-1.1.8-mapr-1901.jar
 

Additional JAR files required if using Kafka functionality with MapR:
/opt/mapr/kafka/kafka-1.1.1/libs/connect-api-1.1.1-mapr-1901.jar
/opt/mapr/kafka/kafka-1.1.1/libs/connect-file-1.1.1-mapr-1901.jar
/opt/mapr/kafka/kafka-1.1.1/libs/connect-json-1.1.1-mapr-1901.jar
/opt/mapr/kafka/kafka-1.1.1/libs/connect-runtime-1.1.1-mapr-1901.jar
/opt/mapr/kafka/kafka-1.1.1/libs/connect-transforms-1.1.1-mapr-1901.jar
/opt/mapr/kafka/kafka-1.1.1/libs/kafka-clients-1.1.1-mapr-1901.jar
/opt/mapr/kafka/kafka-1.1.1/libs/kafka-log4j-appender-1.1.1-mapr-1901.jar
/opt/mapr/kafka/kafka-1.1.1/libs/kafka-streams-1.1.1-mapr-1901.jar
/opt/mapr/kafka/kafka-1.1.1/libs/kafka-tools-1.1.1-mapr-1901.jar
/opt/mapr/kafka/kafka-1.1.1/libs/mapr-streams-6.1.0-mapr.jar
/opt/mapr/spark/spark-2.3.2/jars/spark-streaming-kafka-0-9_2.11-2.3.2.0-mapr-1901.jar
/opt/mapr/spark/spark-2.3.2/jars/spark-streaming-kafka-0-10_2.11-2.3.2.0-mapr-1901.jar
/opt/mapr/spark/spark-2.3.2/jars/spark-streaming_2.11-2.3.2.0-mapr-1901.jar
/opt/mapr/spark/spark-2.3.2/jars/spark-sql-kafka-0-10_2.11-2.3.2.0-mapr-1901.jar

Note: Depending on the requirements of your applications, you may also want to download and use the following JAR file (for example, to produce Kafka streams): https://repository.mapr.com/nexus/content/groups/mapr-public/org/apache/spark/spark-streaming-kafka-producer_2.11/2.3.2.0-mapr-1901/spark-streaming-kafka-producer_2.11-2.3.2.0-mapr-1901.jar
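
As an illustration of copying the JAR files into the new lib directory, the commands might look like the following on a host where the MapR client packages are installed; only a few of the JARs from the lists above are shown, so extend the list to cover every file required for your MapR and Spark versions:

# Copy the required MapR JAR files into the Data Connector's lib directory
# (only a subset of the JARs listed above is shown here).
cd $EGO_CONFDIR/../../conductorspark/conf/dataconnectors/MapR_fs-6.1.0
cp /opt/mapr/lib/maprfs-6.1.0-mapr.jar \
   /opt/mapr/lib/maprdb-6.1.0-mapr.jar \
   /opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/common/hadoop-common-2.7.0-mapr-1808.jar \
   lib/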

    2. Create a directory called sbin. Inside that directory, create a Python script called healthcheck.py. This script is used to determine whether the file system is up or down; for example, it runs when you click the corresponding button on the Data Connectors tab in the details of a Spark Instance Group.
File:
healthcheck.py
 
File content:
#!/usr/bin/python
 
import sys, os, stat
from pyspark import SparkContext
 
#Create log file
logfilename = 'statusCheck.log'
if len(sys.argv) > 1 :
  logfilename = sys.argv[1]
logfile = open(logfilename, 'w')
 
# Set file permissions
current_perm = stat.S_IMODE(os.lstat(logfilename).st_mode)
os.chmod(logfilename, current_perm | stat.S_IROTH)
 
sys.stdout = sys.stderr = logfile
 
# Create a SparkContext and get the Hadoop FileSystem through the JVM gateway;
# if either step fails, report the file system as down.
try:
  sc = SparkContext()
  fs = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem.get(sc._gateway.jvm.org.apache.hadoop.conf.Configuration())
except Exception as e:
  print("Exception: %s" % e)
  exit(1)
 
# The file system is considered healthy if its root path is reachable.
if fs.exists(sc._gateway.jvm.org.apache.hadoop.fs.Path('/')) :
  print("Online")
  sc.stop()
  exit(0)
else :
  print("Offline")
  sc.stop()
  exit(1)
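
Although IBM Spectrum Conductor runs this script for you, you can also test it manually with spark-submit, assuming a Spark installation whose classpath already includes the MapR JAR files; the log file path here is only an example:

spark-submit healthcheck.py /tmp/statusCheck.log
echo $?    # 0 means the file system reported Online; 1 means Offline or an error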
    3. Create a file called metadata.yml in your Data Connector directory, as a sibling to the lib and sbin directories. This special file dynamically adds properties and gives the user some options during Spark Instance Group registration. Put the following content in metadata.yml, making sure that your type and version fields match what you have used in previous steps:

Note: By default, setting a property's required value to true lets the property be defined or edited by the user during Spark Instance Group registration. To prevent the values from being modified by users, use the value 'hidden' instead.

Also, be careful when copying/pasting this example content. YAML files require proper indentation with spaces.

These values are based on the default MapR XD installation paths. Feel free to use your own default values and property descriptions.

File:
metadata.yml
 
File content:
type: MapR_fs
version: 6.1.0
supportedsparkversions:
- 2.3.0
- 2.3.1
timestamp: 0
configurations:
- type: hadoop-coresite
  filename: mapr-site.xml
  properties:
  - name: HiveMetastoreUris
    propertykeys:
    - hive.metastore.uris
    displayname: Hive Metastore URIs
    required: true
    inputtype: string
    allowedpattern: ^(http(s)?:\/\/.)?(www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}(\.[a-z]{2,6})?\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)$
    defaultvalue: thrift://desiredMapRfsServerDefault:9083
  - name: JDBCConnectionURL
    propertykeys:
    - javax.jdo.option.ConnectionURL
    displayname: JDBC Connection
    description: JDBC connect string for a JDBC metastore
    required: true
    inputtype: string
    defaultvalue: jdbc:derby:;databaseName=/opt/mapr/hive/hive-2.1/metastore_db;create=true
  - name: SASLEnabled
    propertykeys:
    - hive.metastore.sasl.enabled
    displayname: Hive SASL Enabled
    required: true
    inputtype: string
    defaultvalue: true
  - name: SASLqop
    propertykeys:
    - hive.server2.thrift.sasl.qop
    displayname: Hive SASL qop
    required: true
    inputtype: string
    defaultvalue: auth-conf
  - name: ExecSetugi
    propertykeys:
    - hive.metastore.execute.setugi
    displayname: Hive Execute setugi
    required: true
    inputtype: string
    defaultvalue: false
  - name: WebUiUsePam
    propertykeys:
    - hive.server2.webui.use.pam
    displayname: Hive Webui Use Pam
    required: true
    inputtype: string
    defaultvalue: true
  - name: WebUiUseSSL
    propertykeys:
    - hive.server2.webui.use.ssl
    displayname: Hive Webui Use SSL
    required: true
    inputtype: string
    defaultvalue: true
  - name: WebUiKeystorePath
    propertykeys:
    - hive.server2.webui.keystore.path
    displayname: Hive Webui Keystore Path
    required: true
    inputtype: string
    defaultvalue: /opt/mapr/conf/ssl_keystore
  - name: WebUiKeystorePwd
    propertykeys:
    - hive.server2.webui.keystore.password
    displayname: Hive Webui Keystore Password
    required: true
    inputtype: string
    defaultvalue: mapr123
  - name: Authentication
    propertykeys:
    - hive.server2.authentication
    displayname: Hive Authentication
    required: true
    inputtype: string
    defaultvalue: MAPRSASL
  - name: HbaseRootDir
    propertykeys:
    - hbase.rootdir
    displayname: Hbase Root Dir
    required: true
    inputtype: string
    defaultvalue: maprfs:///hbase
  - name: HbaseClusterDist
    propertykeys:
    - hbase.cluster.distributed
    displayname: Hbase Cluster Distributed
    required: true
    inputtype: string
    defaultvalue: true
  - name: HbaseZKQuorum
    propertykeys:
    - hbase.zookeeper.quorum
    displayname: Hbase Zookeeper Quorum
    required: true
    inputtype: string
    defaultvalue: desiredZookeeperQuorumHost
  - name: HbaseZKClientPort
    propertykeys:
    - hbase.zookeeper.property.clientPort
    displayname: Hbase Zookeeper Client Port
    required: true
    inputtype: string
    defaultvalue: 5181
  - name: DfsSupportAppend
    propertykeys:
    - dfs.support.append
    displayname: DFS Support Append
    required: true
    inputtype: string
    defaultvalue: true
  - name: HbaseFSUtilMapRFSImpl
    propertykeys:
    - hbase.fsutil.maprfs.impl
    displayname: Hbase FS Util MapR FS Impl
    required: true
    inputtype: string
    defaultvalue: org.apache.hadoop.hbase.util.FSMapRUtils
  - name: HbaseRegionServerHandlerCount
    propertykeys:
    - hbase.regionserver.handler.count
    displayname: Hbase Region Server Handler Count
    required: true
    inputtype: string
    defaultvalue: 30
  - name: MapRFSThreads
    propertykeys:
    - fs.mapr.threads
    displayname: MapR FS threads
    description: Allows file/db client to use this many threads
    required: true
    inputtype: string
    defaultvalue: 64
  - name: HbaseMapRDefaultDB
    propertykeys:
    - mapr.hbase.default.db
    displayname: MapR Hbase default db
    required: true
    inputtype: string
    defaultvalue: maprdb
  - name: HbaseSecCliProtocolACL
    propertykeys:
    - security.client.protocol.acl
    displayname: Client/Admin Protocol impl ACL
    required: hidden
    inputtype: string
    defaultvalue: "*"
  - name: HbaseSecAdminProtocolACL
    propertykeys:
    - security.admin.protocol.acl
    displayname: HMasterInterface protocol impl ACL
    required: hidden
    inputtype: string
    defaultvalue: "*"
  - name: HbaseSecMstrRgnProtocolACL
    propertykeys:
    - security.masterregion.protocol.acl
    displayname: HMasterRegionInterface protocol impl ACL
    required: hidden
    inputtype: string
    defaultvalue: "*"
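
Before restarting the ascd service in the next step, you can optionally confirm that the file is valid YAML, assuming Python with the PyYAML module is available on the management host:

# Quick syntax check of the new metadata file (fails with a traceback if the YAML is malformed).
python -c "import yaml; yaml.safe_load(open('metadata.yml')); print('metadata.yml parses OK')"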

 

  5. Restart your ascd service, and check your $EGO_TOP/ascd/logs/ascd.hostname log file (default location) to make sure that the ascd service parsed the data connector files successfully at startup.
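
One possible way to restart ascd and watch its log from the command line is shown below; it assumes the default cluster Admin credentials, and the exact log file name ends with your management host name:

# Restart the ascd EGO service and follow its log to confirm the data connector was parsed.
egosh user logon -u Admin -x Admin
egosh service stop ascd
egosh service start ascd
tail -f $EGO_TOP/ascd/logs/ascd.$(hostname)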
     
  6. Register a Spark Instance Group:
     
    1. Add your new Data Connector as an entry in the “Data Connectors” tab on the Spark Instance Group registration wizard.

Note: With the current example metadata.yml, you will not be able to use the “default FS” feature with this data connector. However, MapR XD does support referencing files with hdfs:///[file] or maprfs:///[file]. Refer to MapR documentation for details.
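
If you want to confirm how such paths resolve outside of Spark, one optional check is to list them with the Hadoop CLI from a host where the MapR client is installed and configured:

# List the MapR XD root using an explicit maprfs:// URI.
hadoop fs -ls maprfs:///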

    2. If using IBM Spectrum Conductor 2.3.0, you can optionally add the Additional Parameters described below in the Spark Configuration wizard for Hive support:

Note: If you are using an earlier version of IBM Spectrum Conductor, or if you do not wish to always use these configuration parameters for all Spark application submissions, you can set them at Spark application submission time using the --conf flag (see the example after the parameter list below).

Parameter: spark.sql.hive.metastore.sharedPrefixes
Parameter Value: com.mysql.jdbc,org.postgresql,com.microsoft.sqlserver,oracle.jdbc,com.mapr.fs.shim.LibraryLoader,com.mapr.security.JNISecurity,com.mapr.fs.jni,com.mapr.fs.shim
 
Parameter: spark.sql.catalogImplementation
Parameter Value: hive
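
If you set these per submission instead (as described in the note above), the submission might look like the following, where your_application.py is a placeholder for your own Spark application:

# Set the Hive-related parameters for a single application submission with --conf.
spark-submit \
  --conf spark.sql.catalogImplementation=hive \
  --conf spark.sql.hive.metastore.sharedPrefixes=com.mysql.jdbc,org.postgresql,com.microsoft.sqlserver,oracle.jdbc,com.mapr.fs.shim.LibraryLoader,com.mapr.security.JNISecurity,com.mapr.fs.jni,com.mapr.fs.shim \
  your_application.py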

 

And that's it! If you want to know more about data connectors, check out our online IBM Knowledge Center.

 

We'd love to hear from you. If you've got comments or questions, reach out to us on our Slack channel!

 


#SpectrumComputingGroup