High Performance Computing

High Performance Computing Group

Connect with HPC subject matter experts and discuss how hybrid cloud HPC Solutions from IBM meet today's business needs.


Integrating H2O Sparkling Water with IBM Spectrum Conductor with Spark

By Archive User posted Fri January 12, 2018 10:22 AM


Originally posted by: Kuan.F.


In IBM Spectrum Conductor with Spark, you can easily integrate 3rd party tools and frameworks as a Spark instance group notebook, or as an independent application framework that shares access among various Spark instance groups.

Notebooks are better for isolating 3rd party tools to a specific user, while independent applications provide better management of 3rd party tools, especially in cases where there are multiple processes or services involved.

H2O is a popular framework for machine learning/deep learning. Sparkling Water is a product from the same company, built on top of the H2O core platform, that allows users to combine H2O machine learning algorithms with Spark capabilities. This means that you can easily transform data between Spark RDDs and the H2OFrame, and leverage an H2O cluster as a backend for machine learning/deep learning computation inside a Spark application.
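To make this concrete, here is a sketch of what that flow looks like from the Python side, assuming the pysparkling package that ships with Sparkling Water 2.2 is on your Python path and a SparkSession named `spark` already exists; the data file and column names are placeholders:

```python
# Sketch only: requires a running Spark cluster with Sparkling Water 2.2.
from pysparkling import H2OContext
from h2o.estimators import H2ODeepLearningEstimator

# Starts (or attaches to) the H2O cluster according to the configured backend.
hc = H2OContext.getOrCreate(spark)

df = spark.read.csv("data.csv", header=True, inferSchema=True)

# Spark DataFrame -> H2OFrame: the data now lives in the H2O cluster.
frame = hc.as_h2o_frame(df)

# Training runs on the H2O cluster, not on Spark executors' JVM heaps.
model = H2ODeepLearningEstimator(epochs=5)
model.train(y="label", training_frame=frame)

# H2OFrame -> Spark DataFrame, back into your Spark pipeline.
predictions = hc.as_spark_frame(model.predict(frame))
```

The method names (`as_h2o_frame`, `as_spark_frame`) follow the pysparkling API of that era; check the Sparkling Water documentation for your exact version.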

This blog introduces the H2O Sparkling Water backends, and demonstrates how to integrate H2O Sparkling Water into your IBM Spectrum Conductor with Spark cluster as a notebook, or as an independent application framework with the corresponding backend. In this blog, we use Sparkling Water 2.2, which works with Apache Spark 2.2.0 and 2.2.1.

Sparkling Water backends

There are two types of backends that Sparkling Water provides for Spark applications to work with H2O: internal and external.

With the internal backend mode, H2O leverages Spark executors to launch H2O processes on hosts, constructing a temporary H2O cluster within the Spark cluster for the lifetime of the H2OContext. After the temporary H2O cluster is set up, your Spark application can offload machine learning/deep learning computation to it by invoking H2O APIs. This is the default mode, and the easiest way to get started with H2O Sparkling Water. However, because H2O clusters do not support high availability or dynamic scaling, this temporary cluster is fragile: if Spark reclaims an executor, the whole H2O cluster goes down. You therefore need to disable resource preemption and extend the Executor Idle Time for your Spark instance group so that executor processes are not reclaimed while idle.
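As a configuration sketch, the internal backend is selected through a Sparkling Water Spark property; the property name below follows Sparkling Water 2.2 conventions and should be verified against your version, and the Conductor-side settings are adjusted in the Spark instance group UI rather than here:

```
# spark-defaults fragment (illustrative) for the internal backend
spark.ext.h2o.backend.cluster.mode   internal

# Reminder: in IBM Spectrum Conductor with Spark, also disable resource
# preemption and extend the Executor Idle Time in the Spark instance
# group configuration, so executors hosting H2O are not reclaimed.
```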

The Sparkling Water internal backend mode helps you to construct your H2O cluster on demand, and is a good approach for short jobs, but because the H2O core platform does not support high availability or dynamic scaling, the long running workload you are passing to H2O (for example, deep learning model training) might need something stronger. If you want a more stable H2O cluster that sits externally to your Spark application, the Sparkling Water external backend provides reliable computation capabilities.

The external backend is not provided by default; you need to follow the instructions provided by H2O to download the extension jar file. After you download the extension jar file and set up a cluster, you must invoke an H2O API in your Spark application to connect to the external H2O cluster before using H2OFrame on it.
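Pointing a Spark application at an external H2O cluster is again a matter of Spark properties; the cluster name and address below are placeholders, and the property names follow the Sparkling Water 2.2 external backend documentation, so confirm them for your release:

```
# spark-defaults fragment (illustrative) for the external backend
spark.ext.h2o.backend.cluster.mode    external
spark.ext.h2o.cloud.name              my-external-h2o   # name the H2O cluster was started with
spark.ext.h2o.cloud.representative    10.0.0.1:54321    # host:port of one external H2O node
```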

Both the internal and external backends have their pros and cons. You need to choose which one to use based on your application and user scenario, and sometimes you might need them to co-exist in your cluster. Fortunately, IBM Spectrum Conductor with Spark provides many integration features, making it easy to build the scenario that you want with H2O Sparkling Water and to launch as many instances as you need.

System requirements

In order to try IBM Spectrum Conductor with Spark with H2O Sparkling Water, ensure that the following prerequisites are met:

  • An IBM Spectrum Conductor with Spark cluster is installed and running
  • The Docker engine is installed and running – In this blog, we use a Docker image to package the required executables and dependencies, which eases the integration. Bare-metal packages also work.

Building your Docker images

In order to run H2O with Sparkling Water, you need a Docker image with Sparkling Water installed. If you want to use the external backend, you also need to include the extension jar file in the image.

Many H2O and Sparkling Water Docker images exist in the public domain. In fact, the H2O Sparkling Water package provides scripts and a Dockerfile to build a Docker image. That image bundles H2O Sparkling Water together with a Spark distribution. IBM Spectrum Conductor with Spark provides its own Spark distribution, so you do not need the Spark distribution in that image. To avoid confusion, you can build your own image based on public materials. We provide a sample Dockerfile here based on Ubuntu.

In the sample Dockerfile, we have provided the following:

  • The “curl”, “wget”, “net-tools”, “unzip” and “openjdk-8-jre” commands, which are installed on top of the Ubuntu image.
  • H2O Sparkling Water, which is extracted under /home/sparkling-water-version when you build the Dockerfile. The extended jar for external backend is also downloaded after the regular Sparkling Water package is extracted.

    Note: The location where you extract Sparkling Water is important to the management scripts that are required by IBM Spectrum Conductor with Spark. If you are using a Docker image with Sparkling Water that is in a different location, take note of that location and update the management script files accordingly.
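A minimal Dockerfile along these lines might look like the following; the base image tag, version argument, and download URL placeholder are assumptions, not the exact contents of our sample Dockerfile:

```
# Illustrative sketch -- check h2o.ai for the real Sparkling Water 2.2 artifacts.
FROM ubuntu:16.04

RUN apt-get update && apt-get install -y \
    curl wget net-tools unzip openjdk-8-jre

# Extract under /home/<sparkling-water-version>, the location the sample
# management scripts expect; update the scripts if you change this path.
ARG SW_URL=<sparkling-water-download-url>
RUN wget -q "$SW_URL" -O /tmp/sw.zip \
 && unzip /tmp/sw.zip -d /home \
 && rm /tmp/sw.zip

# For the external backend, also download the extension jar into the image.
```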

To build and distribute the Docker image, complete the following steps:

  1. Build the Dockerfile by running docker build -t <tag> .
  2. After the Docker image is built, ensure that it is accessible from all the hosts in your cluster that will run it. There are several ways to do this:
  • Upload the image to a public Docker registry
  • Upload the image to your own, private Docker registry
  • Save the image as a file, manually distribute it to all hosts, and upload it to an image repository
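For the last option, the usual docker save/load round trip works; the file name and host are placeholders for your environment:

```
# Export the built image to a tar archive
docker save -o sparkling-water.tar <tag>

# Copy it to each cluster host and load it into the local image store
scp sparkling-water.tar user@host2:/tmp/
ssh user@host2 docker load -i /tmp/sparkling-water.tar
```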

Integrating Sparkling Water as a notebook in a Spark instance group

In order to use Sparkling Water as a notebook for your Spark instance group, you first need to add the notebook management scripts into your IBM Spectrum Conductor with Spark cluster. You can get sample script files here, but be advised that the default scripts work with the default location mentioned in the previous section (/home/sparkling-water-version). If you are using a Docker image with a different Sparkling Water location, update the scripts accordingly.


After the notebook management package is added, you can create a Spark instance group with the management package, assign the notebook to users, and launch instances.


Integrating Sparkling Water as an independent application framework

In order to start a Sparkling Water cluster as an independent application framework inside your IBM Spectrum Conductor with Spark cluster, you need to register it as an application instance. A yaml file is required for registration. You can find a sample yaml file here. This sample yaml file allows you to input the cluster name and working port for your external cluster. It also embeds scripts to construct the H2O cluster in a Docker command. Just like in the notebook integration scenario, you might need to update the Sparkling Water installation location for your image if you are not using the image that is built by our Dockerfile.

When everything is ready, you can follow the wizard to complete the application instance registration.


When the application is started, use the link in the application output section to go directly to Flow (H2O’s web UI).

Go try it!

Now that you know the two ways of integrating H2O Sparkling Water with IBM Spectrum Conductor with Spark, you can download the evaluation version here! If you have any questions, post them in our forum, or join us on Slack!


#SpectrumComputingGroup