High Performance Computing Group
Use IBM PowerAI in a distributed environment with IBM Spectrum Conductor with Spark

By Archive User posted Wed May 24, 2017 03:56 PM

Originally posted by: Helena.


IBM PowerAI is a powerful deep learning and machine learning platform that can run in a Dockerized distributed environment using IBM Spectrum Conductor with Spark. Running PowerAI with IBM Spectrum Conductor with Spark on your IBM Power Systems machines helps you manage resources more effectively, simplifies driver dependencies through the Docker engine, and saves time when running or rerunning containerized PowerAI workloads.


Additionally, it makes it easy to:

  • Run NVIDIA GPU based workloads that mix several types of deep learning frameworks (TensorFlow, Caffe, Torch), each with different resource requirements
  • Maintain a multitenant environment with multiple levels of code, such as a test environment and a production environment
  • Run across heterogeneous toolkit environments independently of the driver

By utilizing PowerAI with IBM Spectrum Conductor with Spark you can:

  • Run machine learning and deep learning workloads in a Dockerized context while decoupling the driver from the toolkit environments. This facilitates host-level updates, such as driver updates, without impacting the tenant images.
  • Mount the appropriate driver from the host into the image at run time, which allows for a seamless user experience.
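To illustrate the driver decoupling described above, a container can be started with the host's NVIDIA driver libraries mounted read-only at the path the container expects. This is a minimal sketch, not a command from this deployment: the driver library path /usr/lib/nvidia-375, the device node names, and the cuda-base image tag are assumptions that vary by host and driver version.

```shell
# Sketch: mount the host's NVIDIA driver into a CUDA container at run time.
# /usr/lib/nvidia-375 and the /dev/nvidia* device nodes are illustrative;
# adjust them to match your host's driver installation.
docker run --rm -it \
  --device=/dev/nvidiactl \
  --device=/dev/nvidia-uvm \
  --device=/dev/nvidia0 \
  -v /usr/lib/nvidia-375:/usr/local/nvidia:ro \
  cuda-base /bin/bash
```

Because the driver lives on the host and is only mounted in, the same image can follow the host through driver upgrades without being rebuilt.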

Most importantly, the advantage over other solutions is that you have a single pane of glass to manage all types of workloads, not just GPU workloads. This helps eliminate silos and maximize your compute assets. You also get workload scheduling capabilities that accommodate heterogeneous workloads (for example, both training and inference), plus support for multiple Spark, machine learning, and deep learning frameworks and notebook versions.

 

Example

The following example demonstrates how, using a CUDA base Docker image, a TensorFlow service can be made available in a distributed IBM Spectrum Conductor with Spark environment. In this example, each host has IBM Spectrum Conductor with Spark 2.2, Docker, and the CUDA 8.0 driver and libraries installed.

1. Create a Docker image based on your version of CUDA by using the docker build -t cuda-base . command, where the cuda-base tag names an image built from a Dockerfile similar to the following:

FROM ppc64le/ubuntu:16.04
MAINTAINER EF
LABEL com.nvidia.volumes.needed="nvidia_driver"
ENV CUDA_VERSION 8.0.61
LABEL com.nvidia.cuda.version="${CUDA_VERSION}"
ENV CUDA_PKG_VERSION 8-0=$CUDA_VERSION-1
ENV DEBIAN_FRONTEND noninteractive
ENV CUDA_REPO_URL http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/ppc64el/cuda-repo-ubuntu1604_8.0.61-1_ppc64el.deb
ENV NVML_REPO_URL http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/ppc64el/nvidia-machine-learning-repo-ubuntu1604_1.0.0-1_ppc64el.deb

WORKDIR /tmp
RUN apt-get update && apt-get -y install zip unzip openssh-server ssh infiniband-diags perftest libibverbs-dev libmlx4-dev libmlx5-dev sudo iptables curl wget vim python && apt-get clean
RUN curl -O ${CUDA_REPO_URL} && dpkg --install *.deb && rm -rf *.deb
RUN curl -O ${NVML_REPO_URL} && dpkg --install *.deb && rm -rf *.deb
RUN apt-get update && apt-get install -y --no-install-recommends \
        cuda-nvrtc-$CUDA_PKG_VERSION \
        cuda-nvgraph-$CUDA_PKG_VERSION \
        cuda-cusolver-$CUDA_PKG_VERSION \
        cuda-cublas-$CUDA_PKG_VERSION \
        cuda-cufft-$CUDA_PKG_VERSION \
        cuda-curand-$CUDA_PKG_VERSION \
        cuda-cusparse-$CUDA_PKG_VERSION \
        cuda-npp-$CUDA_PKG_VERSION \
        cuda-cudart-$CUDA_PKG_VERSION && \
    ln -s cuda-8.0 /usr/local/cuda && \
    rm -rf /var/lib/apt/lists/*
RUN echo "/usr/local/cuda/lib64" >> /etc/ld.so.conf.d/cuda.conf && \
    ldconfig
RUN echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf && \
    echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf
ENV PATH /usr/local/nvidia/bin:/usr/local/cuda/bin:${PATH}
ENV LD_LIBRARY_PATH /usr/local/nvidia/lib:/usr/local/nvidia/lib64
RUN apt-get update && apt-get install -y --no-install-recommends \
        cuda-samples-$CUDA_PKG_VERSION && \
    rm -rf /var/lib/apt/lists/*
ENV CUDNN_VERSION 5.1.10
LABEL com.nvidia.cudnn.version="${CUDNN_VERSION}"
RUN apt-get update && apt-get install -y --no-install-recommends \
        libcudnn5=$CUDNN_VERSION-1+cuda8.0 && \
    rm -rf /var/lib/apt/lists/*
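With this Dockerfile saved in an otherwise empty directory, the base image can be built and given a quick sanity check. This is a sketch: the grep pattern simply assumes the CUDA runtime library was registered with ldconfig, as the Dockerfile above arranges.

```shell
# Build the CUDA base image from the Dockerfile above (run in its directory)
docker build -t cuda-base .

# Sanity check: the CUDA runtime library should be visible to the loader
docker run --rm cuda-base ldconfig -p | grep libcudart
```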

 

2. Push your Docker image to your private registry.
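Pushing typically means tagging the image with your registry's address first. In this sketch, registry.example.com:5000 is a hypothetical placeholder for your private registry:

```shell
# Tag and push the base image; registry.example.com:5000 stands in for
# your private registry's hostname and port.
docker tag cuda-base registry.example.com:5000/cuda-base
docker push registry.example.com:5000/cuda-base
```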

3. Create a TensorFlow-specific image that will be used to provision TensorFlow and build deviceQuery from the CUDA samples:

FROM cuda-base
MAINTAINER EF
RUN curl -O https://public.dhe.ibm.com/software/server/POWER/Linux/mldl/ubuntu/mldl-repo-network_3.4.0_ppc64el.deb && \
    dpkg -i mldl-repo-*.deb && apt-get update && apt-get -y install tensorflow
WORKDIR /usr/local/cuda/samples/1_Utilities/deviceQuery
RUN make
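This image can be built the same way as the base image. Running the freshly compiled deviceQuery binary is a useful check, but it only reports GPUs when the container is started with access to the host's devices and driver, so the second command below is illustrative; the tensorflow-cuda tag is an assumed name, since the post does not name this image.

```shell
# Build the TensorFlow image on top of cuda-base (tag name is an assumption)
docker build -t tensorflow-cuda .

# Illustrative only: deviceQuery finds GPUs only when the host's
# /dev/nvidia* devices and driver libraries are made available to the container
docker run --rm tensorflow-cuda ./deviceQuery
```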

4. Publish the TensorFlow image to your private registry.

5. Using IBM Spectrum Conductor with Spark, create and register a TensorFlow service profile.

[screenshot]

Example of the TensorFlow service profile that was registered:

[screenshot]

6. Once the TensorFlow service is registered, start the service.

[screenshot]

7. Create an application instance for TensorFlow. In this example, the application instance is named tf.

[screenshot]

8. To get started with TensorFlow, run the base TensorFlow test.

a. Locate the container name.

# docker ps  

b. Open a shell in the container.

# docker exec -it container_name /bin/bash

c. Source your environment.

# source /opt/DL/tensorflow/bin/tensorflow-activate

d. Run the base TensorFlow test.

# tensorflow-test
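Steps a through d can also be combined into a single command run from the host. This sketch assumes the container's name contains the application instance name tf, which may not hold in every deployment, so verify the name with docker ps first.

```shell
# Find the TensorFlow container and run the base test inside it in one step.
# Assumes the container name contains "tf"; confirm with `docker ps` first.
CONTAINER=$(docker ps --filter "name=tf" --format '{{.Names}}' | head -n 1)
docker exec -it "$CONTAINER" /bin/bash -c \
  'source /opt/DL/tensorflow/bin/tensorflow-activate && tensorflow-test'
```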

 

For more information on TensorFlow or other PowerAI software, see IBM PowerAI or contact the IBM sales team.


#SpectrumComputingGroup