Originally posted by: Steve Haertel
We previously introduced Data Connectors in IBM Spectrum Conductor as a way to conveniently connect your Spark applications with your data sources. While a number of Data Connectors come out of the box, you may want to know how to make your own! This blog describes how you can do that, using MapR XD (formerly MapR-FS) as an example.
If you want to connect Spark to MapR XD (version 6.1.0 in this example), create a Data Connector once per Spark version (2.2.x versus 2.3.x) and let IBM Spectrum Conductor automatically handle the movement of the .jar and .xml files.
Here's how you get started on each management host in your cluster:
- Log in to your management host as the CLUSTERADMIN user and go to the $EGO_CONFDIR/../../conductorspark/conf/dataconnectors/ directory. (We want to be the CLUSTERADMIN user so that the files we create match the ownership and permissions of the existing data connectors.)
- In the types directory, create a new type and give it metadata similar to that of the existing types. The name that you choose here is used in subsequent steps.
File:
MapR_fs.yml
File content:
type: MapR_fs
displayname: MapR file system
maxactive: -1
- Create a new directory in the dataconnectors directory with the name that you used in step 2, followed by a dash ('-') and a version number, for example: MapR_fs-6.1.0. This directory lives beside the other similarly formatted directories.
- Inside the new Data Connector directory:
- Create a directory called lib and copy the JAR files required for your Spark version to this directory.
Spark 2.3.x - Requires JAR files from MapR 6.1.0
Note: Some of these JAR files are provided only by the MapR client, while others come from other MapR packages such as mapr-spark, mapr-hbase, and mapr-kafka.
/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/hdfs/hadoop-hdfs-2.7.0-mapr-1808.jar
/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/common/lib/hadoop-auth-2.7.0-mapr-1808.jar
/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/common/hadoop-common-2.7.0-mapr-1808.jar
/opt/mapr/lib/json-1.8.jar
/opt/mapr/lib/maprdb-6.1.0-mapr.jar
/opt/mapr/lib/maprfs-6.1.0-mapr.jar
/opt/mapr/lib/mapr-hbase-6.1.0-mapr.jar
/opt/mapr/lib/mapr-ojai-driver-6.1.0-mapr.jar
/opt/mapr/lib/ojai-3.0-mapr-1808.jar
/opt/mapr/lib/ojai-mapreduce-3.0-mapr-1808.jar
/opt/mapr/lib/ojai-scala-3.0-mapr-1808.jar
/opt/mapr/lib/zookeeper-3.4.11-mapr-1808.jar
/opt/mapr/spark/spark-2.3.2/jars/hive-exec-1.2.0-mapr-spark-MEP-6.0.0-1901.jar
/opt/mapr/spark/spark-2.3.2/jars/hive-metastore-1.2.0-mapr-spark-MEP-6.0.0-1901.jar
/opt/mapr/spark/spark-2.3.2/jars/maprdb-spark-2.3.2.0-mapr-1901.jar
/opt/mapr/hbase/hbase-1.1.8/lib/hbase-annotations-1.1.8-mapr-1901.jar
/opt/mapr/hbase/hbase-1.1.8/lib/hbase-client-1.1.8-mapr-1901.jar
/opt/mapr/hbase/hbase-1.1.8/lib/hbase-common-1.1.8-mapr-1901.jar
/opt/mapr/hbase/hbase-1.1.8/lib/hbase-hadoop2-compat-1.1.8-mapr-1901.jar
/opt/mapr/hbase/hbase-1.1.8/lib/hbase-hadoop-compat-1.1.8-mapr-1901.jar
/opt/mapr/hbase/hbase-1.1.8/lib/hbase-it-1.1.8-mapr-1901.jar
/opt/mapr/hbase/hbase-1.1.8/lib/hbase-prefix-tree-1.1.8-mapr-1901.jar
/opt/mapr/hbase/hbase-1.1.8/lib/hbase-procedure-1.1.8-mapr-1901.jar
/opt/mapr/hbase/hbase-1.1.8/lib/hbase-protocol-1.1.8-mapr-1901.jar
/opt/mapr/hbase/hbase-1.1.8/lib/hbase-resource-bundle-1.1.8-mapr-1901.jar
/opt/mapr/hbase/hbase-1.1.8/lib/hbase-rest-1.1.8-mapr-1901.jar
/opt/mapr/hbase/hbase-1.1.8/lib/hbase-server-1.1.8-mapr-1901.jar
/opt/mapr/hbase/hbase-1.1.8/lib/hbase-spark-1.1.8-mapr-1901.jar
/opt/mapr/hbase/hbase-1.1.8/lib/hbase-thrift-1.1.8-mapr-1901.jar
Additional JAR files required if using Kafka functionality with MapR:
/opt/mapr/kafka/kafka-1.1.1/libs/connect-api-1.1.1-mapr-1901.jar
/opt/mapr/kafka/kafka-1.1.1/libs/connect-file-1.1.1-mapr-1901.jar
/opt/mapr/kafka/kafka-1.1.1/libs/connect-json-1.1.1-mapr-1901.jar
/opt/mapr/kafka/kafka-1.1.1/libs/connect-runtime-1.1.1-mapr-1901.jar
/opt/mapr/kafka/kafka-1.1.1/libs/connect-transforms-1.1.1-mapr-1901.jar
/opt/mapr/kafka/kafka-1.1.1/libs/kafka-clients-1.1.1-mapr-1901.jar
/opt/mapr/kafka/kafka-1.1.1/libs/kafka-log4j-appender-1.1.1-mapr-1901.jar
/opt/mapr/kafka/kafka-1.1.1/libs/kafka-streams-1.1.1-mapr-1901.jar
/opt/mapr/kafka/kafka-1.1.1/libs/kafka-tools-1.1.1-mapr-1901.jar
/opt/mapr/kafka/kafka-1.1.1/libs/mapr-streams-6.1.0-mapr.jar
/opt/mapr/spark/spark-2.3.2/jars/spark-streaming-kafka-0-9_2.11-2.3.2.0-mapr-1901.jar
/opt/mapr/spark/spark-2.3.2/jars/spark-streaming-kafka-0-10_2.11-2.3.2.0-mapr-1901.jar
/opt/mapr/spark/spark-2.3.2/jars/spark-streaming_2.11-2.3.2.0-mapr-1901.jar
/opt/mapr/spark/spark-2.3.2/jars/spark-sql-kafka-0-10_2.11-2.3.2.0-mapr-1901.jar
Note: Depending on the requirements of your applications, you may also want to download and use the following JAR file (for example, to produce kafka streams): https://repository.mapr.com/nexus/content/groups/mapr-public/org/apache/spark/spark-streaming-kafka-producer_2.11/2.3.2.0-mapr-1901/spark-streaming-kafka-producer_2.11-2.3.2.0-mapr-1901.jar
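If you plan to consume MapR Event Store (MapR Streams) topics from Spark, the following minimal PySpark sketch shows where these Kafka JAR files come into play. It is illustrative only: the /sample-stream:sensor-data path is a hypothetical stream and topic, and it assumes the Kafka JAR files listed above are on your application's classpath.
# Minimal sketch: read a hypothetical MapR Event Store stream through the Spark Kafka source.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("maprStreamsRead").getOrCreate()

df = (spark.readStream
      .format("kafka")
      # bootstrap.servers is required by the Spark Kafka source API; with MapR Event Store
      # the target is generally resolved from the stream path, so the value itself is not used.
      .option("kafka.bootstrap.servers", "unused:9092")
      .option("subscribe", "/sample-stream:sensor-data")  # hypothetical /stream:topic path
      .load())

# Print the message payloads to the console as they arrive
query = (df.selectExpr("CAST(value AS STRING) AS value")
         .writeStream
         .format("console")
         .start())
query.awaitTermination()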
- Create a directory called sbin. Inside that directory, create a Python script called healthcheck.py. This script determines whether the file system is up or down; for example, it runs when you click the status check button in the Data Connectors tab on a Spark Instance Group's details page.
File:
healthcheck.py
File content:
#!/usr/bin/python
import sys, os, stat
from pyspark import SparkContext

# Create log file
logfilename = 'statusCheck.log'
if len(sys.argv) > 1:
    logfilename = sys.argv[1]
logfile = open(logfilename, 'w')

# Set file permissions
current_perm = stat.S_IMODE(os.lstat(logfilename).st_mode)
os.chmod(logfilename, current_perm | stat.S_IROTH)
sys.stdout = sys.stderr = logfile

try:
    sc = SparkContext()
    fs = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem.get(sc._gateway.jvm.org.apache.hadoop.conf.Configuration())
except Exception as e:
    print("Exception")
    exit(1)

if fs.exists(sc._gateway.jvm.org.apache.hadoop.fs.Path('/')):
    print("Online")
    sc.stop()
    exit(0)
else:
    print("Offline")
    sc.stop()
    exit(1)
- Create a file called metadata.yml in your data connector directory, as a sibling of the lib and sbin directories. This file dynamically adds properties to the configuration and gives the user some options. Put the following content in metadata.yml, making sure that the type and version fields match the values you used in previous steps:
Note: Setting a property's 'required' value to 'true' lets the user define or edit that property during Spark Instance Group registration. To prevent users from modifying a value, set 'required' to 'hidden' instead.
Also, be careful when copying/pasting this example content. YAML files require proper indentation with spaces.
These values are based on the default MapR XD installation paths. Feel free to use your own default values and property descriptions.
File:
metadata.yml
File content:
type: MapR_fs
version: 6.1.0
supportedsparkversions:
  - 2.3.0
  - 2.3.1
timestamp: 0
configurations:
  - type: hadoop-coresite
    filename: mapr-site.xml
    properties:
      - name: HiveMetastoreUris
        propertykeys:
          - hive.metastore.uris
        displayname: Hive Metastore URIs
        required: true
        inputtype: string
        allowedpattern: ^(http(s)?:\/\/.)?(www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}(\.[a-z]{2,6})?\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)$
        defaultvalue: thrift://desiredMapRfsServerDefault:9083
      - name: JDBCConnectionURL
        propertykeys:
          - javax.jdo.option.ConnectionURL
        displayname: JDBC Connection
        description: JDBC connect string for a JDBC metastore
        required: true
        inputtype: string
        defaultvalue: jdbc:derby:;databaseName=/opt/mapr/hive/hive-2.1/metastore_db;create=true
      - name: SASLEnabled
        propertykeys:
          - hive.metastore.sasl.enabled
        displayname: Hive SASL Enabled
        required: true
        inputtype: string
        defaultvalue: true
      - name: SASLqop
        propertykeys:
          - hive.server2.thrift.sasl.qop
        displayname: Hive SASL qop
        required: true
        inputtype: string
        defaultvalue: auth-conf
      - name: ExecSetugi
        propertykeys:
          - hive.metastore.execute.setugi
        displayname: Hive Execute setugi
        required: true
        inputtype: string
        defaultvalue: false
      - name: WebUiUsePam
        propertykeys:
          - hive.server2.webui.use.pam
        displayname: Hive Webui Use Pam
        required: true
        inputtype: string
        defaultvalue: true
      - name: WebUiUseSSL
        propertykeys:
          - hive.server2.webui.use.ssl
        displayname: Hive Webui Use SSL
        required: true
        inputtype: string
        defaultvalue: true
      - name: WebUiKeystorePath
        propertykeys:
          - hive.server2.webui.keystore.path
        displayname: Hive Webui Keystore Path
        required: true
        inputtype: string
        defaultvalue: /opt/mapr/conf/ssl_keystore
      - name: WebUiKeystorePwd
        propertykeys:
          - hive.server2.webui.keystore.password
        displayname: Hive Webui Keystore Password
        required: true
        inputtype: string
        defaultvalue: mapr123
      - name: Authentication
        propertykeys:
          - hive.server2.authentication
        displayname: Hive Authentication
        required: true
        inputtype: string
        defaultvalue: MAPRSASL
      - name: HbaseRootDir
        propertykeys:
          - hbase.rootdir
        displayname: Hbase Root Dir
        required: true
        inputtype: string
        defaultvalue: maprfs:///hbase
      - name: HbaseClusterDist
        propertykeys:
          - hbase.cluster.distributed
        displayname: Hbase Cluster Distributed
        required: true
        inputtype: string
        defaultvalue: true
      - name: HbaseZKQuorum
        propertykeys:
          - hbase.zookeeper.quorum
        displayname: Hbase Zookeeper Quorum
        required: true
        inputtype: string
        defaultvalue: desiredZookeeperQuorumHost
      - name: HbaseZKClientPort
        propertykeys:
          - hbase.zookeeper.property.clientPort
        displayname: Hbase Zookeeper Client Port
        required: true
        inputtype: string
        defaultvalue: 5181
      - name: DfsSupportAppend
        propertykeys:
          - dfs.support.append
        displayname: DFS Support Append
        required: true
        inputtype: string
        defaultvalue: true
      - name: HbaseFSUtilMapRFSImpl
        propertykeys:
          - hbase.fsutil.maprfs.impl
        displayname: Hbase FS Util MapR FS Impl
        required: true
        inputtype: string
        defaultvalue: org.apache.hadoop.hbase.util.FSMapRUtils
      - name: HbaseRegionServerHandlerCount
        propertykeys:
          - hbase.regionserver.handler.count
        displayname: Hbase Region Server Handler Count
        required: true
        inputtype: string
        defaultvalue: 30
      - name: MapRFSThreads
        propertykeys:
          - fs.mapr.threads
        displayname: MapR FS threads
        description: Allows file/db client to use this many threads
        required: true
        inputtype: string
        defaultvalue: 64
      - name: HbaseMapRDefaultDB
        propertykeys:
          - mapr.hbase.default.db
        displayname: MapR Hbase default db
        required: true
        inputtype: string
        defaultvalue: maprdb
      - name: HbaseSecCliProtocolACL
        propertykeys:
          - security.client.protocol.acl
        displayname: Client/Admin Protocol impl ACL
        required: hidden
        inputtype: string
        defaultvalue: "*"
      - name: HbaseSecAdminProtocolACL
        propertykeys:
          - security.admin.protocol.acl
        displayname: HMasterInterface protocol impl ACL
        required: hidden
        inputtype: string
        defaultvalue: "*"
      - name: HbaseSecMstrRgnProtocolACL
        propertykeys:
          - security.masterregion.protocol.acl
        displayname: HMasterRegionInterface protocol impl ACL
        required: hidden
        inputtype: string
        defaultvalue: "*"
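Because YAML indentation mistakes are easy to make, you may want to run a quick local check before restarting anything. This optional snippet is not part of the data connector itself and assumes that PyYAML is installed on the host; it simply parses metadata.yml and prints a short summary:
# Optional sanity check: parse metadata.yml to catch YAML indentation errors early.
import yaml

with open("metadata.yml") as f:
    meta = yaml.safe_load(f)

print("type:", meta["type"], "version:", meta["version"])
print("properties defined:", len(meta["configurations"][0]["properties"]))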
- Restart your ascd service, and check your $EGO_TOP/ascd/logs/ascd.hostname log file (default location) to make sure that the ascd service parsed the data connector files successfully at startup.
- Register a Spark Instance Group:
- Add your new Data Connector as an entry in the “Data Connectors” tab on the Spark instance group registration wizard.
Note: With the current example metadata.yml, you cannot use the “default FS” feature with this data connector. However, MapR XD does support referencing files with hdfs:///[file] or maprfs:///[file]; refer to the MapR documentation for details.
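For example, once your Spark Instance Group is registered and started with this data connector, a simple smoke test like the following should be able to read from MapR XD (the path /user/demo/sample.txt is hypothetical; substitute a file that exists on your file system):
# Illustrative smoke test: read a (hypothetical) text file from MapR XD.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("maprfsSmokeTest").getOrCreate()
df = spark.read.text("maprfs:///user/demo/sample.txt")
print("line count:", df.count())
spark.stop()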
- If you are using IBM Spectrum Conductor 2.3.0, you can optionally add the following additional parameters in the Spark configuration wizard to enable Hive support:
Note: If you are using an earlier version of IBM Spectrum Conductor, or if you do not want to apply these configuration parameters to all Spark application submissions, you can set them at Spark application submission time by using the --conf flag.
Parameter: spark.sql.hive.metastore.sharedPrefixes
Parameter Value: com.mysql.jdbc,org.postgresql,com.microsoft.sqlserver,oracle.jdbc,com.mapr.fs.shim.LibraryLoader,com.mapr.security.JNISecurity,com.mapr.fs.jni,com.mapr.fs.shim
Parameter: spark.sql.catalogImplementation
Parameter Value: hive
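If you go the per-application route described in the note above, you can set the same two parameters when the SparkSession is built, which is equivalent to passing them with --conf at submission time. This is only a sketch; adapt it to your own application:
# Sketch: set the Hive-related parameters per application instead of in the Spark Instance Group.
from pyspark.sql import SparkSession

shared_prefixes = ",".join([
    "com.mysql.jdbc", "org.postgresql", "com.microsoft.sqlserver", "oracle.jdbc",
    "com.mapr.fs.shim.LibraryLoader", "com.mapr.security.JNISecurity",
    "com.mapr.fs.jni", "com.mapr.fs.shim",
])

spark = (SparkSession.builder
         .appName("hiveOnMaprXD")
         .config("spark.sql.hive.metastore.sharedPrefixes", shared_prefixes)
         .enableHiveSupport()   # sets spark.sql.catalogImplementation to hive
         .getOrCreate())

spark.sql("SHOW DATABASES").show()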
And that's it! If you want to know more about data connectors, check out our online IBM Knowledge Center.
We'd love to hear from you. If you've got comments or questions, reach out to us on our Slack channel!
#SpectrumComputingGroup