Running TensorFlow benchmark with Horovod across IBM Power servers in containers in an LSF cluster

By John Welch posted Fri December 20, 2019 07:07 PM

  

This blog builds on my previous article on Compiling Open MPI with IBM Spectrum LSF in a Docker container image and extends the concept to include TensorFlow plus Horovod; it is written specifically for the IBM Power server platform. Since 2016, LSF 10.1 has provided deep container integration, which makes it easier to build and maintain environments running containerized workloads. Additionally, LSF integrates with most MPI implementations, including Open MPI, via an adaptable and scalable distributed application framework. LSF job submissions are extremely flexible, supporting affinity and topology directives and other resource requirements that are integrated with Open MPI. This blog explores building a custom NVIDIA Docker container that allows running the TensorFlow benchmark with Horovod across multiple servers and multiple GPUs.
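For example, a single bsub submission can combine job slot, span, affinity, and GPU requirements in one request. The snippet below is only an illustration of the syntax; the resource values and the application name my_mpi_app are placeholders, not commands from this article:

$ bsub -n 4 -R "span[ptile=2] affinity[core(1)]" \
       -gpu "num=2:mode=exclusive_process" \
       mpirun ./my_mpi_app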

Prerequisites

This blog assumes you have installed IBM Spectrum LSF on the Power Little Endian platform (linux3.10-glibc2.17-ppc64le) and that Docker, NVIDIA Docker, and CUDA are installed and running on the nodes in your cluster. To start, you will need the following:

Component                   Version       Edition
NVIDIA GPUs                 CUDA-Enabled
Red Hat Linux Server        7.6           Enterprise
IBM Spectrum LSF            10.1.0.8+     Standard Edition or Suite
Docker                                    Community or Enterprise Edition
Docker Engine               1.13.1+
NVIDIA Container Runtime    https://github.com/NVIDIA/nvidia-container-runtime
CUDA                        10.1

Verify your Docker Engine version with this command:
$ docker version | grep Version
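If the NVIDIA container runtime is installed, it should appear in Docker's runtime list, and the GPUs should be visible to the driver. For example (the output will vary with your driver and Docker versions):

$ docker info | grep -i runtime
$ nvidia-smi -L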

Build a new TensorFlow and Horovod Docker container with Open MPI compiled with LSF

Log in as a user with the ability to run docker commands. The steps below assume your working directory (pwd) remains the same throughout.

Prepare minimal LSF files for Open MPI compile

The goal is to prepare the minimal set of files from your LSF environment necessary to compile Open MPI with LSF inside a Docker container. Copy the script below and paste it into a file called mktmplsf.sh. The script generates a directory called "lsf" containing the LSF libraries, include files, and a configuration file. The files in the "lsf" directory are used in the next step.

mktmplsf.sh
#!/bin/sh
if [ -z $LSF_LIBDIR ] ; then
   echo "Source your LSF profile (profile.lsf or cshrc.lsf) and run this script again"
   exit 1
fi
TMPDIR=lsf
LIB_LIST="libbat.a libbat.so liblsf.a liblsf.so"

# Build directory structure
mkdir $TMPDIR
mkdir $TMPDIR/10.1
mkdir $TMPDIR/10.1/include
mkdir $TMPDIR/10.1/include/lsf
mkdir $TMPDIR/10.1/lib
mkdir $TMPDIR/conf

# Copy files
for LIB in $LIB_LIST
do
   cp -p $LSF_LIBDIR/$LIB $TMPDIR/10.1/lib
done
cp -p $LSF_LIBDIR/../../include/lsf/ls[bf]*.h $TMPDIR/10.1/include/lsf
echo "LSF_INCLUDEDIR=/tmp/lsf/10.1/include" > $TMPDIR/conf/lsf.conf
Here are the steps to run the script and see the directories and files created:
$ chmod +x mktmplsf.sh
$ ./mktmplsf.sh
$ ls -R lsf
lsf:
10.1 conf

lsf/10.1:
include lib

lsf/10.1/include:
lsf

lsf/10.1/include/lsf:
lsbatch.h lsf.h

lsf/10.1/lib:
libbat.a libbat.so liblsf.a liblsf.so

lsf/conf:
lsf.conf
$
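Note that the script requires your LSF environment to be sourced. If it exits with the "Source your LSF profile" message, source the profile first and rerun it; the path below is an example and depends on where LSF is installed:

$ . /usr/share/lsf/conf/profile.lsf
$ echo $LSF_LIBDIR
$ ./mktmplsf.sh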

Create a Dockerfile

Copy the text below and paste it into a file called Dockerfile.
# The base of this file originated from https://github.com/horovod/horovod/blob/master/Dockerfile.gpu and
# was then modified from x86_64 to support the ppc64le platform
FROM docker.io/ibmcom/tensorflow-ppc64le:1.14.0-gpu-py3

# TensorFlow version is tightly coupled to CUDA and cuDNN so it should be selected carefully
ENV TENSORFLOW_VERSION=1.14.0
ENV PYTORCH_VERSION=1.2.0
ENV TORCHVISION_VERSION=0.4.0
ENV CUDNN_VERSION=7.6.0.64-1+cuda10.0
ENV NCCL_VERSION=2.4.7-1+cuda10.0
ENV MXNET_VERSION=1.5.0

# Python 2.7 or 3.6 is supported by Ubuntu Bionic out of the box
ARG python=3.6
ENV PYTHON_VERSION=${python}

# Set default shell to /bin/bash
SHELL ["/bin/bash", "-cu"]

RUN apt-get update && apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends \
        build-essential \
        cmake \
        g++-4.8 \
        git \
        curl \
        vim \
        wget \
        ca-certificates \
        libjpeg-dev \
        libpng-dev \
        python${PYTHON_VERSION} \
        python${PYTHON_VERSION}-dev \
        librdmacm1 \
        libibverbs1 \
        ibverbs-providers \
        libcudnn7=${CUDNN_VERSION} \
        libnccl2=${NCCL_VERSION} \
        libnccl-dev=${NCCL_VERSION} \
        libffi-dev \
        libssl-dev

RUN if [[ "${PYTHON_VERSION}" == "3.6" ]]; then \
        apt-get install -y python${PYTHON_VERSION}-distutils; \
    fi
RUN ln -s /usr/bin/python${PYTHON_VERSION} /usr/bin/python

RUN curl -O https://bootstrap.pypa.io/get-pip.py && \
    python get-pip.py && \
    rm get-pip.py

# Copy in minimal LSF components
ADD lsf /tmp/lsf

# Set up the temporary LSF paths
ENV LSF_ENVDIR /tmp/lsf/conf
ENV LSF_LIBDIR /tmp/lsf/10.1/lib

# Compile Open MPI with LSF
RUN mkdir /tmp/openmpi && \
    cd /tmp/openmpi && \
    wget https://www.open-mpi.org/software/ompi/v4.0/downloads/openmpi-4.0.1.tar.gz && \
    tar zxf openmpi-4.0.1.tar.gz && \
    cd openmpi-4.0.1 && \
    ./configure --prefix=/usr/local/mpi --enable-orterun-prefix-by-default --disable-getpwuid --with-lsf --with-cuda && \
    make -j $(nproc) all && \
    make install && \
    ldconfig && \
    rm -rf /tmp/openmpi

# Install Horovod, temporarily using CUDA stubs
RUN ldconfig /usr/local/cuda/targets/ppc64le-linux/lib/stubs && \
    HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_WITH_TENSORFLOW=1 \
    pip install --no-cache-dir horovod && \
    ldconfig

# Install OpenSSH for MPI to communicate between containers
RUN apt-get install -y --no-install-recommends openssh-client openssh-server && \
    mkdir -p /var/run/sshd

# Allow OpenSSH to talk to containers without asking for confirmation
RUN cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new && \
    echo "    StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new && \
    mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config

# Download examples
RUN apt-get install -y --no-install-recommends subversion && \
    svn checkout https://github.com/horovod/horovod/trunk/examples && \
    rm -rf /examples/.svn

# Note, when running an LSF job in a container, your LSF_LIBDIR will be set according to your LSF installation.
# The cleanup below removes /tmp/lsf, so we need to keep the .so files elsewhere.
RUN cp /tmp/lsf/10.1/lib/libbat.so /tmp/lsf/10.1/lib/liblsf.so /usr/lib
ENV LD_LIBRARY_PATH $LD_LIBRARY_PATH:/usr/local/mpi/lib:/usr/local/mpi/lib/openmpi
RUN ldconfig

# Install mpi_hello_world as a test app
RUN mkdir /tmp/hello-world
WORKDIR /tmp/hello-world
ENV PATH /usr/local/mpi/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
RUN git clone https://github.com/wesleykendall/mpitutorial && \
    cd mpitutorial/tutorials/mpi-hello-world/code && \
    make && \
    cp /tmp/hello-world/mpitutorial/tutorials/mpi-hello-world/code/mpi_hello_world /usr/local/bin

# Install the TensorFlow benchmark
WORKDIR /usr/local
RUN git clone https://github.com/tensorflow/benchmarks

# Get the gpu_bind.sh and tf_cnn_benchmark_post_total.sh scripts
RUN mkdir /usr/local/scripts
WORKDIR /usr/local/scripts
RUN wget https://raw.githubusercontent.com/IBMSpectrumComputing/lsf-integrations/master/Spectrum%20LSF%20Application%20Center/Misc_MLDL_Examples/scripts/gpu_bind.sh && chmod a+x gpu_bind.sh
RUN wget https://raw.githubusercontent.com/IBMSpectrumComputing/lsf-integrations/master/Spectrum%20LSF%20Application%20Center/Misc_MLDL_Examples/scripts/tf_cnn_benchmark_post_total.sh && chmod a+x tf_cnn_benchmark_post_total.sh

# Clean up
RUN rm -rf /tmp/lsf
RUN rm -rf /tmp/hello-world

WORKDIR "/examples"

Build a new Docker container

Use the command below to build the new container image. It takes several minutes to perform all the steps; the resulting image is called "docker.io/ibmcom/tensorflow-ppc64le:1.14.0-gpu-py3-horovod". Note that both the Dockerfile and the lsf directory must be in your current working directory.

# ls Dockerfile
Dockerfile
# ls lsf
10.1 conf
# docker build -t docker.io/ibmcom/tensorflow-ppc64le:1.14.0-gpu-py3-horovod .
Sending build context to Docker daemon 32.35 GB
Step 1/37 : FROM docker.io/ibmcom/tensorflow-ppc64le:1.14.0-gpu-py3
---> e0487c8dd429



Removing intermediate container 3075a6602fbb
Step 37/37 : WORKDIR "/examples"
---> 81ce7a428ea9
Removing intermediate container 4daa694cabde
Successfully built 81ce7a428ea9
#
Now run the docker images command; if the docker build completed successfully, your new container image will be listed.
# docker images | grep horovod
docker.io/ibmcom/tensorflow-ppc64le 1.14.0-gpu-py3-horovod 81ce7a428ea9 3 minutes ago 3.95 GB
#
You can repeat the docker build process above on every NVIDIA Docker-enabled compute node in your LSF cluster, or distribute the image another way, such as publishing it to your internal Docker registry or using the docker save and docker load commands.
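For example, to copy the image to another compute node without a registry, you could save it to a tar file and load it on the target host (the host name ac922b is illustrative):

$ docker save -o tensorflow-horovod.tar docker.io/ibmcom/tensorflow-ppc64le:1.14.0-gpu-py3-horovod
$ scp tensorflow-horovod.tar ac922b:/tmp
$ ssh ac922b "docker load -i /tmp/tensorflow-horovod.tar"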

Setting up LSF with Docker

1) Prepare IBM Spectrum LSF to run jobs in Docker containers by following these steps: LSF docker integration instruction.

2) Configure an LSF Docker application profile for the new Docker container image by adding the following lines (changing LSF_TOP to your LSF top directory location) to the end of the lsb.applications file, and then run badmin reconfig or badmin mbdrestart on the LSF master:

Begin Application
NAME = tensorflow_horovod
DESCRIPTION = Example TensorFlow Horovod application
CONTAINER = docker[image(ibmcom/tensorflow-ppc64le:1.14.0-gpu-py3-horovod) \
   options(--rm --net=host --ipc=host \
     -v /etc/passwd:/etc/passwd \
     -v /etc/group:/etc/group \
   ) starter(root) ]
EXEC_DRIVER = context[user(lsfadmin)] \
   starter[LSF_TOP/10.1/linux3.10-glibc2.17-ppc64le/etc/docker-starter.py] \
   controller[LSF_TOP/10.1/linux3.10-glibc2.17-ppc64le/etc/docker-control.py] \
   monitor[LSF_TOP/10.1/linux3.10-glibc2.17-ppc64le/etc/docker-monitor.py]
End Application
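After saving lsb.applications, reconfigure LSF and verify that the new application profile is visible; for example:

$ badmin reconfig
$ bapp -l tensorflow_horovod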

Testing the new container with LSF

$ bsub -app tensorflow_horovod -Is /bin/bash
Job <19773> is submitted to default queue <interactive>.
<<Waiting for dispatch ...>>
<<Starting on ac922c>>

[TensorFlow container ASCII art banner]

You are running this container as user with ID 1150 and group 491,
which should map to the ID and group for your user on the Docker host.

Great!

tf-docker ~ > exit
exit
$

Testing the new container with MPI Hello World

Make sure MPI is working as expected before attempting to run the TensorFlow benchmark across nodes.

Example of running MPI hello world on a single node with 1 message

$ bsub -app tensorflow_horovod -I mpirun /usr/local/bin/mpi_hello_world
Job <19774> is submitted to default queue <interactive>.
<<Waiting for dispatch ...>>
<<Starting on ac922c>>
Hello world from processor ac922c, rank 0 out of 1 processors
$

Example of running MPI hello world on a single node with 2 job slots or 2 messages.

$ bsub -app tensorflow_horovod -I -n 2 -R "span[hosts=1]" mpirun /usr/local/bin/mpi_hello_world
Job <19775> is submitted to default queue <interactive>.
<<Waiting for dispatch ...>>
<<Starting on ac922c>>
Hello world from processor ac922c, rank 0 out of 2 processors
Hello world from processor ac922c, rank 1 out of 2 processors
$

Example of running MPI hello world on 2 nodes with 1 message per node.

$ bsub -app tensorflow_horovod -I -n 2 -R "span[ptile=1]" mpirun /usr/local/bin/mpi_hello_world
Job <19776> is submitted to default queue <interactive>.
<<Waiting for dispatch ...>>
<<Starting on ac922c>>
Hello world from processor ac922c, rank 0 out of 2 processors
Hello world from processor ac922b, rank 1 out of 2 processors
$

Testing the new container with requests for GPUs

Example job requesting 1 GPU and showing nvidia-smi output

$ bsub -app tensorflow_horovod -gpu "num=1:mode=exclusive_process" -I nvidia-smi
Job <19777> is submitted to default queue <interactive>.
<<Waiting for dispatch ...>>
<<Starting on ac922b>>
Thu Dec 19 05:21:46 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39       Driver Version: 418.39       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000004:04:00.0 Off |                    0 |
| N/A   36C    P0    37W / 300W |      0MiB / 16130MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
$

Example job requesting 2 GPUs and showing nvidia-smi output

$ bsub -app tensorflow_horovod -gpu "num=2:mode=exclusive_process" -I nvidia-smi
Job <19778> is submitted to default queue <interactive>.
<<Waiting for dispatch ...>>
<<Starting on ac922b>>
Thu Dec 19 05:26:00 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39       Driver Version: 418.39       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000004:04:00.0 Off |                    0 |
| N/A   36C    P0    37W / 300W |      0MiB / 16130MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000004:05:00.0 Off |                    0 |
| N/A   39C    P0    37W / 300W |      0MiB / 16130MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
$
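You can also cross-check GPU availability and allocation from the LSF side; depending on your LSF 10.1 fix pack, commands such as the following are available (the job ID is from the example above):

$ bhosts -gpu
$ bjobs -l 19778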

Testing the new container with the TensorFlow benchmark on a single compute node

Example TensorFlow benchmark with 1 GPU on a single compute node

$ bsub -app tensorflow_horovod -gpu "num=1:mode=exclusive_process" -I -e stderr%J.txt python /usr/local/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --num_gpus=1 --model resnet50 --batch_size 128 --num_batches=10
Job <19779> is submitted to default queue
<<Waiting for dispatch ...>>
<<Starting on ac922c>>
TensorFlow:  1.14
Model:       resnet50
Dataset:     imagenet (synthetic)
Mode:        training
SingleSess:  False
Batch size:  128 global
             128 per device
Num batches: 10
Num epochs:  0.00
Devices:     ['/gpu:0']
NUMA bind:   False
Data format: NCHW
Optimizer:   sgd
Variables:   parameter_server
==========
Generating training model
Initializing graph
Running warm up
Done warm up
Step    Img/sec total_loss
1       images/sec: 416.0 +/- 0.0 (jitter = 0.0)        7.972
10      images/sec: 413.8 +/- 0.6 (jitter = 1.4)        7.856
----------------------------------------------------------------
total images/sec: 413.61
----------------------------------------------------------------
$

Example TensorFlow benchmark with 4 GPUs on a single compute node

$ bsub -app tensorflow_horovod -gpu "num=4:mode=exclusive_process" -I -e stderr%J.txt python /usr/local/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --num_gpus=4 --model resnet50 --batch_size 128 --num_batches=10
Job <19780> is submitted to default queue
<<Waiting for dispatch ...>>
<<Starting on ac922c>>
TensorFlow:  1.14
Model:       resnet50
Dataset:     imagenet (synthetic)
Mode:        training
SingleSess:  False
Batch size:  512 global
             128 per device
Num batches: 10
Num epochs:  0.00
Devices:     ['/gpu:0', '/gpu:1', '/gpu:2', '/gpu:3']
NUMA bind:   False
Data format: NCHW
Optimizer:   sgd
Variables:   parameter_server
==========
Generating training model
Initializing graph
Running warm up
Done warm up
Step    Img/sec total_loss
1       images/sec: 1568.3 +/- 0.0 (jitter = 0.0)       7.870
10      images/sec: 1568.6 +/- 3.1 (jitter = 7.7)       7.861
----------------------------------------------------------------
total images/sec: 1562.40
----------------------------------------------------------------
$

A few notes on the above examples:

1) The above examples were tested with NVIDIA Tesla V100 GPUs with 16 GB of memory. You will likely need to decrease the batch_size parameter value if your GPUs have less memory, and you may be able to increase it if they have more.

2) The number of batches is intentionally small in the above examples for testing. Increase the num_batches value to have the benchmark run for a longer period of time (see the example after these notes).

3) If you have problems running the above jobs, check the standard error file, which is stderr<JOBID>.txt.
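For example, a longer single-GPU run that is lighter on GPU memory might look like the command below; the batch_size and num_batches values are illustrative and should be tuned for your GPUs:

$ bsub -app tensorflow_horovod -gpu "num=1:mode=exclusive_process" -I -e stderr%J.txt python /usr/local/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --num_gpus=1 --model resnet50 --batch_size 64 --num_batches=500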

Testing the new container with the TensorFlow benchmark and Horovod

A few notes on the examples below:

1) For the benchmark, use your fastest network, which should be 10Gb Ethernet or faster, or potentially InfiniBand. The examples use a 40Gb Ethernet network. For the btl_tcp_if_include and HOROVOD_GLOO_IFACE parameter values, replace my network interface, "enP48p1s0f0", with the fastest network interface available on your compute nodes (see the example after these notes for how to list your interfaces).

2) The mpirun command has several debugging options enabled.

3) If you have problems with the jobs below, check the standard error file, which is stderr<JOBID>.txt.

4) Depending on the number of GPUs per node on your servers, adjust the first example (on a single compute node) based on the table below:

GPUs per node    -n    ptile    -gpu num
1                 1      1         1
2                 2      2         2
4                 4      4         4

5) Depending on the number of GPUs per node on your servers, adjust the second example (on two compute nodes) based on the table below:

GPUs per node    -n    ptile    -gpu num
1                 2      1         1
2                 4      2         2
4                 8      4         4
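As mentioned in note 1, you can list the network interfaces and link speeds on a compute node to pick the value for btl_tcp_if_include and HOROVOD_GLOO_IFACE; the interface name enP48p1s0f0 below is specific to my servers:

$ ip -br addr show
$ ethtool enP48p1s0f0 | grep Speed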

Example TensorFlow benchmark using Horovod with 4 GPUs on a single compute node

$ bsub -app tensorflow_horovod -n 4 -R "span[ptile=4]" -gpu "num=4:mode=exclusive_process" -I -e stderr%J.txt mpirun -mca btl_tcp_if_include enP48p1s0f0 -x HOROVOD_GLOO_IFACE=enP48p1s0f0 -mca btl ^openib -mca pml ob1 -x NCCL_IB_DISABLE=1 -mca plm_base_verbose 10 -x NCCL_DEBUG=WARN /usr/local/scripts/gpu_bind.sh python /usr/local/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --variable_update=horovod --num_gpus=1 --model resnet50 --batch_size 128 --num_batches=10
Job <19781> is submitted to default queue
<<Waiting for dispatch ...>>
<<Starting on ac922c>>
TensorFlow:  1.14
Model:       resnet50
Dataset:     imagenet (synthetic)
Mode:        training
SingleSess:  False
Batch size:  128 global
             128 per device
Num batches: 10
Num epochs:  0.00
Devices:     ['horovod/gpu:0']
NUMA bind:   False
Data format: NCHW
Optimizer:   sgd
Variables:   horovod
==========
Generating training model
TensorFlow:  1.14
Model:       resnet50
Dataset:     imagenet (synthetic)
Mode:        training
SingleSess:  False
Batch size:  128 global
             128 per device
Num batches: 10
Num epochs:  0.00
Devices:     ['horovod/gpu:0']
NUMA bind:   False
Data format: NCHW
Optimizer:   sgd
Variables:   horovod
==========
Generating training model
TensorFlow:  1.14
Model:       resnet50
Dataset:     imagenet (synthetic)
Mode:        training
SingleSess:  False
Batch size:  128 global
             128 per device
Num batches: 10
Num epochs:  0.00
Devices:     ['horovod/gpu:0']
NUMA bind:   False
Data format: NCHW
Optimizer:   sgd
Variables:   horovod
==========
Generating training model
TensorFlow:  1.14
Model:       resnet50
Dataset:     imagenet (synthetic)
Mode:        training
SingleSess:  False
Batch size:  128 global
             128 per device
Num batches: 10
Num epochs:  0.00
Devices:     ['horovod/gpu:0']
NUMA bind:   False
Data format: NCHW
Optimizer:   sgd
Variables:   horovod
==========
Generating training model
Initializing graph
Initializing graph
Initializing graph
Initializing graph
Running warm up
Running warm up
Running warm up
Running warm up
NCCL version 2.4.7+cuda10.0
NCCL version 2.4.7+cuda10.0
NCCL version 2.4.7+cuda10.0
NCCL version 2.4.7+cuda10.0
Done warm up
Step    Img/sec total_loss
Done warm up
Step    Img/sec total_loss
Done warm up
Step    Img/sec total_loss
1       images/sec: 411.2 +/- 0.0 (jitter = 0.0)        7.972
1       images/sec: 409.4 +/- 0.0 (jitter = 0.0)        7.972
Done warm up
Step    Img/sec total_loss
1       images/sec: 408.0 +/- 0.0 (jitter = 0.0)        7.972
1       images/sec: 405.4 +/- 0.0 (jitter = 0.0)        7.972
10      images/sec: 409.3 +/- 0.5 (jitter = 1.6)        7.856
----------------------------------------------------------------
total images/sec: 409.12
----------------------------------------------------------------
10      images/sec: 408.4 +/- 0.3 (jitter = 1.1)        7.856
----------------------------------------------------------------
total images/sec: 408.25
----------------------------------------------------------------
10      images/sec: 407.8 +/- 0.3 (jitter = 0.8)        7.856
----------------------------------------------------------------
total images/sec: 407.55
----------------------------------------------------------------
10      images/sec: 406.2 +/- 0.6 (jitter = 2.7)        7.856
----------------------------------------------------------------
total images/sec: 406.03
----------------------------------------------------------------
$

Example TensorFlow benchmark using Horovod with 4 GPUs (2 GPUs per node) on two compute nodes

$ bsub -app tensorflow_horovod -n 4 -R "span[ptile=2]" -gpu "num=2:mode=exclusive_process" -I -e stderr%J.txt mpirun -mca btl_tcp_if_include enP48p1s0f0 -x HOROVOD_GLOO_IFACE=enP48p1s0f0 -mca btl ^openib -mca pml ob1 -x NCCL_IB_DISABLE=1 -mca plm_base_verbose 10 -x NCCL_DEBUG=WARN /usr/local/scripts/gpu_bind.sh python /usr/local/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --variable_update=horovod --num_gpus=1 --model resnet50 --batch_size 128 --num_batches=10
Job <19782> is submitted to default queue
<<Waiting for dispatch ...>>
<<Starting on ac922b>>
TensorFlow:  1.14
Model:       resnet50
Dataset:     imagenet (synthetic)
Mode:        training
SingleSess:  False
Batch size:  128 global
             128 per device
Num batches: 10
Num epochs:  0.00
Devices:     ['horovod/gpu:0']
NUMA bind:   False
Data format: NCHW
Optimizer:   sgd
Variables:   horovod
==========
Generating training model
TensorFlow:  1.14
Model:       resnet50
Dataset:     imagenet (synthetic)
Mode:        training
SingleSess:  False
Batch size:  128 global
             128 per device
Num batches: 10
Num epochs:  0.00
Devices:     ['horovod/gpu:0']
NUMA bind:   False
Data format: NCHW
Optimizer:   sgd
Variables:   horovod
==========
Generating training model
TensorFlow:  1.14
Model:       resnet50
Dataset:     imagenet (synthetic)
Mode:        training
SingleSess:  False
Batch size:  128 global
             128 per device
Num batches: 10
Num epochs:  0.00
Devices:     ['horovod/gpu:0']
NUMA bind:   False
Data format: NCHW
Optimizer:   sgd
Variables:   horovod
==========
Generating training model
TensorFlow:  1.14
Model:       resnet50
Dataset:     imagenet (synthetic)
Mode:        training
SingleSess:  False
Batch size:  128 global
             128 per device
Num batches: 10
Num epochs:  0.00
Devices:     ['horovod/gpu:0']
NUMA bind:   False
Data format: NCHW
Optimizer:   sgd
Variables:   horovod
==========
Generating training model
Initializing graph
Initializing graph
Initializing graph
Initializing graph
Running warm up
Running warm up
Running warm up
Running warm up
NCCL version 2.4.7+cuda10.0
NCCL version 2.4.7+cuda10.0
NCCL version 2.4.7+cuda10.0
NCCL version 2.4.7+cuda10.0
Done warm up
Step    Img/sec total_loss
Done warm up
Step    Img/sec total_loss
Done warm up
Step    Img/sec total_loss
1       images/sec: 410.5 +/- 0.0 (jitter = 0.0)        7.972
1       images/sec: 405.9 +/- 0.0 (jitter = 0.0)        7.972
Done warm up
Step    Img/sec total_loss
1       images/sec: 407.6 +/- 0.0 (jitter = 0.0)        7.972
1       images/sec: 407.6 +/- 0.0 (jitter = 0.0)        7.972
10      images/sec: 410.4 +/- 0.6 (jitter = 1.7)        7.856
----------------------------------------------------------------
total images/sec: 410.27
----------------------------------------------------------------
10      images/sec: 406.6 +/- 0.6 (jitter = 1.6)        7.856
----------------------------------------------------------------
total images/sec: 406.45
----------------------------------------------------------------
10      images/sec: 407.8 +/- 0.6 (jitter = 1.5)        7.856
----------------------------------------------------------------
total images/sec: 407.56
----------------------------------------------------------------
10      images/sec: 407.7 +/- 0.4 (jitter = 1.0)        7.856
----------------------------------------------------------------
total images/sec: 407.55
----------------------------------------------------------------
$

Conclusion

You now have a new container image that is ready to run TensorFlow with Horovod across multiple nodes in an LSF cluster. Please leave comments or feedback on the above information, and let me know if you would like the article to include x86_64 equivalents.
#SpectrumComputingGroup