Running TensorFlow benchmark with Horovod across IBM Power servers in containers in an LSF cluster

By John Welch posted Fri December 20, 2019 07:07 PM

  

This blog builds on my previous article on Compiling Open MPI with IBM Spectrum LSF in a Docker container image and extends the concept to include TensorFlow plus Horovod; it is written specifically for the IBM Power server platform. Since 2016, LSF 10.1 has provided deep container integration, which makes it easier to build and maintain environments running containerized workloads. Additionally, LSF integrates with most MPI implementations, including Open MPI, via an adaptable and scalable distributed application framework. LSF job submissions are extremely flexible, supporting affinity and topology directives and other resource requirements that are integrated with Open MPI. This blog explores building a custom NVIDIA Docker container that allows running the TensorFlow benchmark with Horovod across multiple servers and multiple GPUs.
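For example, a single bsub submission can combine job slot, span, affinity, and GPU requirements in one request. The snippet below is only an illustration of the syntax; the resource values and the application name my_mpi_app are placeholders, not commands from this article:

$ bsub -n 4 -R "span[ptile=2] affinity[core(1)]" \
       -gpu "num=2:mode=exclusive_process" \
       mpirun ./my_mpi_app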

Prerequisites

This blog assumes you have installed IBM Spectrum LSF on the Power Little Endian platform (linux3.10-glibc2.17-ppc64le) and that Docker, NVIDIA Docker, and CUDA are installed and running on the nodes in your cluster. To start, you will need the following:

Component                   Version       Edition
NVIDIA GPUs                 CUDA-Enabled
Red Hat Linux Server        7.6           Enterprise
IBM Spectrum LSF            10.1.0.8+     Standard Edition or Suite
Docker                                    Community or Enterprise Edition
Docker Engine               1.13.1+
NVIDIA Container Runtime    https://github.com/NVIDIA/nvidia-container-runtime
CUDA                        10.1

Verify your Docker Engine version with this command:
$ docker version | grep Version
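If the NVIDIA container runtime is installed, it should appear in Docker's runtime list, and the GPUs should be visible to the driver. For example (the output will vary with your driver and Docker versions):

$ docker info | grep -i runtime
$ nvidia-smi -L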

Build a new TensorFlow and Horovod Docker container with Open MPI compiled with LSF

Log in as a user with the ability to run docker commands. The steps below assume your working directory (pwd) remains the same throughout.

Prepare minimal LSF files for Open MPI compile

The goal is to prepare the minimal set of files from your LSF environment necessary to compile Open MPI with LSF inside a Docker container. Copy the script below and paste it into a file called mktmplsf.sh. The script generates a directory called "lsf" containing the LSF libraries, include files, and a configuration file. The files in the "lsf" directory are used in the next step.

mktmplsf.sh
#!/bin/sh
if [ -z $LSF_LIBDIR ] ; then
   echo "Source your LSF profile (profile.lsf or cshrc.lsf) and run this script again"
   exit 1
fi
TMPDIR=lsf
LIB_LIST="libbat.a libbat.so liblsf.a liblsf.so"

# Build directory structure
mkdir $TMPDIR
mkdir $TMPDIR/10.1
mkdir $TMPDIR/10.1/include
mkdir $TMPDIR/10.1/include/lsf
mkdir $TMPDIR/10.1/lib
mkdir $TMPDIR/conf

# Copy files
for LIB in $LIB_LIST
do
   cp -p $LSF_LIBDIR/$LIB $TMPDIR/10.1/lib
done
cp -p $LSF_LIBDIR/../../include/lsf/ls[bf]*.h $TMPDIR/10.1/include/lsf
echo "LSF_INCLUDEDIR=/tmp/lsf/10.1/include" > $TMPDIR/conf/lsf.conf
Here are the steps to run the script and see the directories and files created:
$ chmod +x mktmplsf.sh
$ ./mktmplsf.sh
$ ls -R lsf
lsf:
10.1 conf

lsf/10.1:
include lib

lsf/10.1/include:
lsf

lsf/10.1/include/lsf:
lsbatch.h lsf.h

lsf/10.1/lib:
libbat.a libbat.so liblsf.a liblsf.so

lsf/conf:
lsf.conf
$
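Note that the script requires your LSF environment to be sourced. If it exits with the "Source your LSF profile" message, source the profile first and rerun it; the path below is an example and depends on where LSF is installed:

$ . /usr/share/lsf/conf/profile.lsf
$ echo $LSF_LIBDIR
$ ./mktmplsf.sh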

Create a Dockerfile

Copy the text below and paste it into a file called Dockerfile.
# The base of this file originated from https://github.com/horovod/horovod/blob/master/Dockerfile.gpu and
# was then modified from x86_64 to support the ppc64le platform
FROM docker.io/ibmcom/tensorflow-ppc64le:1.14.0-gpu-py3

# TensorFlow version is tightly coupled to CUDA and cuDNN so it should be selected carefully
ENV TENSORFLOW_VERSION=1.14.0
ENV PYTORCH_VERSION=1.2.0
ENV TORCHVISION_VERSION=0.4.0
ENV CUDNN_VERSION=7.6.0.64-1+cuda10.0
ENV NCCL_VERSION=2.4.7-1+cuda10.0
ENV MXNET_VERSION=1.5.0

# Python 2.7 or 3.6 is supported by Ubuntu Bionic out of the box
ARG python=3.6
ENV PYTHON_VERSION=${python}

# Set default shell to /bin/bash
SHELL ["/bin/bash", "-cu"]

RUN apt-get update && apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends \
        build-essential \
        cmake \
        g++-4.8 \
        git \
        curl \
        vim \
        wget \
        ca-certificates \
        libjpeg-dev \
        libpng-dev \
        python${PYTHON_VERSION} \
        python${PYTHON_VERSION}-dev \
        librdmacm1 \
        libibverbs1 \
        ibverbs-providers \
        libcudnn7=${CUDNN_VERSION} \
        libnccl2=${NCCL_VERSION} \
        libnccl-dev=${NCCL_VERSION} \
        libffi-dev \
        libssl-dev

RUN if [[ "${PYTHON_VERSION}" == "3.6" ]]; then \
        apt-get install -y python${PYTHON_VERSION}-distutils; \
    fi
RUN ln -s /usr/bin/python${PYTHON_VERSION} /usr/bin/python

RUN curl -O https://bootstrap.pypa.io/get-pip.py && \
    python get-pip.py && \
    rm get-pip.py

# Copy in minimal LSF components
ADD lsf /tmp/lsf

# Set up the temporary LSF paths
ENV LSF_ENVDIR /tmp/lsf/conf
ENV LSF_LIBDIR /tmp/lsf/10.1/lib

# Compile Open MPI with LSF
RUN mkdir /tmp/openmpi && \
    cd /tmp/openmpi && \
    wget https://www.open-mpi.org/software/ompi/v4.0/downloads/openmpi-4.0.1.tar.gz && \
    tar zxf openmpi-4.0.1.tar.gz && \
    cd openmpi-4.0.1 && \
    ./configure --prefix=/usr/local/mpi --enable-orterun-prefix-by-default --disable-getpwuid --with-lsf --with-cuda && \
    make -j $(nproc) all && \
    make install && \
    ldconfig && \
    rm -rf /tmp/openmpi

# Install Horovod, temporarily using CUDA stubs
RUN ldconfig /usr/local/cuda/targets/ppc64le-linux/lib/stubs && \
    HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_WITH_TENSORFLOW=1 \
    pip install --no-cache-dir horovod && \
    ldconfig

# Install OpenSSH for MPI to communicate between containers
RUN apt-get install -y --no-install-recommends openssh-client openssh-server && \
    mkdir -p /var/run/sshd

# Allow OpenSSH to talk to containers without asking for confirmation
RUN cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new && \
    echo "    StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new && \
    mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config

# Download examples
RUN apt-get install -y --no-install-recommends subversion && \
    svn checkout https://github.com/horovod/horovod/trunk/examples && \
    rm -rf /examples/.svn

# Note, when running an LSF job in a container, your LSF_LIBDIR will be set according to your LSF installation.
# The cleanup below removes /tmp/lsf, so we need to keep the .so files elsewhere.
RUN cp /tmp/lsf/10.1/lib/libbat.so /tmp/lsf/10.1/lib/liblsf.so /usr/lib
ENV LD_LIBRARY_PATH $LD_LIBRARY_PATH:/usr/local/mpi/lib:/usr/local/mpi/lib/openmpi
RUN ldconfig

# Install mpi_hello_world as a test app
RUN mkdir /tmp/hello-world
WORKDIR /tmp/hello-world
ENV PATH /usr/local/mpi/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
RUN git clone https://github.com/wesleykendall/mpitutorial && \
    cd mpitutorial/tutorials/mpi-hello-world/code && \
    make && \
    cp /tmp/hello-world/mpitutorial/tutorials/mpi-hello-world/code/mpi_hello_world /usr/local/bin

# Install the TensorFlow benchmark
WORKDIR /usr/local
RUN git clone https://github.com/tensorflow/benchmarks

# Get the gpu_bind.sh and tf_cnn_benchmark_post_total.sh scripts
RUN mkdir /usr/local/scripts
WORKDIR /usr/local/scripts
RUN wget https://raw.githubusercontent.com/IBMSpectrumComputing/lsf-integrations/master/Spectrum%20LSF%20Application%20Center/Misc_MLDL_Examples/scripts/gpu_bind.sh && chmod a+x gpu_bind.sh
RUN wget https://raw.githubusercontent.com/IBMSpectrumComputing/lsf-integrations/master/Spectrum%20LSF%20Application%20Center/Misc_MLDL_Examples/scripts/tf_cnn_benchmark_post_total.sh && chmod a+x tf_cnn_benchmark_post_total.sh

# Clean up
RUN rm -rf /tmp/lsf
RUN rm -rf /tmp/hello-world

WORKDIR "/examples"

Build a new Docker container

Use the command below to build the new container image. It takes several minutes to perform all the steps; the resulting image is called "docker.io/ibmcom/tensorflow-ppc64le:1.14.0-gpu-py3-horovod". Note that both the Dockerfile and the lsf directory must be in your current working directory.

# ls Dockerfile
Dockerfile
# ls lsf
10.1 conf
# docker build -t docker.io/ibmcom/tensorflow-ppc64le:1.14.0-gpu-py3-horovod .
Sending build context to Docker daemon 32.35 GB
Step 1/37 : FROM docker.io/ibmcom/tensorflow-ppc64le:1.14.0-gpu-py3
---> e0487c8dd429



Removing intermediate container 3075a6602fbb
Step 37/37 : WORKDIR "/examples"
---> 81ce7a428ea9
Removing intermediate container 4daa694cabde
Successfully built 81ce7a428ea9
#
Now run the docker images command; if the docker build completed successfully, your new container image will be listed.
# docker images | grep horovod
docker.io/ibmcom/tensorflow-ppc64le 1.14.0-gpu-py3-horovod 81ce7a428ea9 3 minutes ago 3.95 GB
#
You can repeat the docker build process above on every NVIDIA Docker-enabled compute node in your LSF cluster, or distribute the image another way, such as publishing it to your internal Docker registry or using the docker save and docker load commands.
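For example, to copy the image to another compute node without a registry, you could save it to a tar file and load it on the target host (the host name ac922b is illustrative):

$ docker save -o tensorflow-horovod.tar docker.io/ibmcom/tensorflow-ppc64le:1.14.0-gpu-py3-horovod
$ scp tensorflow-horovod.tar ac922b:/tmp
$ ssh ac922b "docker load -i /tmp/tensorflow-horovod.tar"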

Setting up LSF with Docker

1) Prepare IBM Spectrum LSF to run jobs in Docker containers by following these steps: LSF docker integration instruction.

2) Configure an LSF Docker application profile for the new Docker container image by adding the following lines (changing LSF_TOP to your LSF top directory location) to the end of the lsb.applications file, and then run badmin reconfig or badmin mbdrestart on the LSF master:

Begin Application
NAME = tensorflow_horovod
DESCRIPTION = Example TensorFlow Horovod application
CONTAINER = docker[image(ibmcom/tensorflow-ppc64le:1.14.0-gpu-py3-horovod) \
   options(--rm --net=host --ipc=host \
     -v /etc/passwd:/etc/passwd \
     -v /etc/group:/etc/group \
   ) starter(root) ]
EXEC_DRIVER = context[user(lsfadmin)] \
   starter[LSF_TOP/10.1/linux3.10-glibc2.17-ppc64le/etc/docker-starter.py] \
   controller[LSF_TOP/10.1/linux3.10-glibc2.17-ppc64le/etc/docker-control.py] \
   monitor[LSF_TOP/10.1/linux3.10-glibc2.17-ppc64le/etc/docker-monitor.py]
End Application
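After saving lsb.applications, reconfigure LSF and verify that the new application profile is visible; for example:

$ badmin reconfig
$ bapp -l tensorflow_horovod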

Testing the new container with LSF

$ bsub -app tensorflow_horovod -Is /bin/bash
Job <19773> is submitted to default queue <interactive>.
<<Waiting for dispatch ...>>
<<Starting on ac922c>>

[TensorFlow container ASCII art banner]

You are running this container as user with ID 1150 and group 491,
which should map to the ID and group for your user on the Docker host.

Great!

tf-docker ~ > exit
exit
$

Testing the new container with MPI Hello World

Make sure MPI is working as expected before attempting to run the TensorFlow benchmark across nodes.

Example of running MPI hello world on a single node with 1 message

$ bsub -app tensorflow_horovod -I mpirun /usr/local/bin/mpi_hello_world
Job <19774> is submitted to default queue <interactive>.
<<Waiting for dispatch ...>>
<<Starting on ac922c>>
Hello world from processor ac922c, rank 0 out of 1 processors
$

Example of running MPI hello world on a single node with 2 job slots or 2 messages.

$ bsub -app tensorflow_horovod -I -n 2 -R "span[hosts=1]" mpirun /usr/local/bin/mpi_hello_world
Job <19775> is submitted to default queue <interactive>.
<<Waiting for dispatch ...>>
<<Starting on ac922c>>
Hello world from processor ac922c, rank 0 out of 2 processors
Hello world from processor ac922c, rank 1 out of 2 processors
$

Example of running MPI hello world on 2 nodes with 1 message per node.

$ bsub -app tensorflow_horovod -I -n 2 -R "span[ptile=1]" mpirun /usr/local/bin/mpi_hello_world
Job <19776> is submitted to default queue <interactive>.
<<Waiting for dispatch ...>>
<<Starting on ac922c>>
Hello world from processor ac922c, rank 0 out of 2 processors
Hello world from processor ac922b, rank 1 out of 2 processors
$

Testing the new container with requests for GPUs

Example job requesting 1 GPU and showing nvidia-smi output

$ bsub -app tensorflow_horovod -gpu "num=1:mode=exclusive_process" -I nvidia-smi
Job <19777> is submitted to default queue <interactive>.
<<Waiting for dispatch ...>>
<<Starting on ac922b>>
Thu Dec 19 05:21:46 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39       Driver Version: 418.39       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000004:04:00.0 Off |                    0 |
| N/A   36C    P0    37W / 300W |      0MiB / 16130MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
$

Example job requesting 2 GPUs and showing nvidia-smi output

$ bsub -app tensorflow_horovod -gpu "num=2:mode=exclusive_process" -I nvidia-smi
Job <19778> is submitted to default queue <interactive>.
<<Waiting for dispatch ...>>
<<Starting on ac922b>>
Thu Dec 19 05:26:00 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39       Driver Version: 418.39       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000004:04:00.0 Off |                    0 |
| N/A   36C    P0    37W / 300W |      0MiB / 16130MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000004:05:00.0 Off |                    0 |
| N/A   39C    P0    37W / 300W |      0MiB / 16130MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
$
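You can also cross-check GPU availability and allocation from the LSF side; depending on your LSF 10.1 fix pack, commands such as the following are available (the job ID is from the example above):

$ bhosts -gpu
$ bjobs -l 19778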

Testing the new container with the TensorFlow benchmark on a single compute node

Example TensorFlow benchmark with 1 GPU on a single compute node

$ bsub -app tensorflow_horovod -gpu "num=1:mode=exclusive_process" -I -e stderr%J.txt python /usr/local/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --num_gpus=1 --model resnet50 --batch_size 128 --num_batches=10
Job <19779> is submitted to default queue
<<Waiting for dispatch ...>>
<<Starting on ac922c>>
TensorFlow:  1.14
Model:       resnet50
Dataset:     imagenet (synthetic)
Mode:        training
SingleSess:  False
Batch size:  128 global
             128 per device
Num batches: 10
Num epochs:  0.00
Devices:     ['/gpu:0']
NUMA bind:   False
Data format: NCHW
Optimizer:   sgd
Variables:   parameter_server
==========
Generating training model
Initializing graph
Running warm up
Done warm up
Step    Img/sec total_loss
1       images/sec: 416.0 +/- 0.0 (jitter = 0.0)        7.972
10      images/sec: 413.8 +/- 0.6 (jitter = 1.4)        7.856
----------------------------------------------------------------
total images/sec: 413.61
----------------------------------------------------------------
$

Example TensorFlow benchmark with 4 GPUs on a single compute node

$ bsub -app tensorflow_horovod -gpu "num=4:mode=exclusive_process" -I -e stderr%J.txt python /usr/local/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --num_gpus=4 --model resnet50 --batch_size 128 --num_batches=10
Job <19780> is submitted to default queue
<<Waiting for dispatch ...>>
<<Starting on ac922c>>
TensorFlow:  1.14
Model:       resnet50
Dataset:     imagenet (synthetic)
Mode:        training
SingleSess:  False
Batch size:  512 global
             128 per device
Num batches: 10
Num epochs:  0.00
Devices:     ['/gpu:0', '/gpu:1', '/gpu:2', '/gpu:3']
NUMA bind:   False
Data format: NCHW
Optimizer:   sgd
Variables:   parameter_server
==========
Generating training model
Initializing graph
Running warm up
Done warm up
Step    Img/sec total_loss
1       images/sec: 1568.3 +/- 0.0 (jitter = 0.0)       7.870
10      images/sec: 1568.6 +/- 3.1 (jitter = 7.7)       7.861
----------------------------------------------------------------
total images/sec: 1562.40
----------------------------------------------------------------
$

A few notes on the above examples:

1) The above examples were tested with NVIDIA Tesla V100 GPUs with 16 GB of memory. You will likely need to decrease the batch_size parameter value if your GPUs have less memory, and you may be able to increase it if they have more.

2) The number of batches is intentionally small in the above examples for testing. Increase the num_batches value to have the benchmark run for a longer period of time (see the example after these notes).

3) If you have problems running the above jobs, check the standard error file, which is stderr<JOBID>.txt.
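For example, a longer single-GPU run that is lighter on GPU memory might look like the command below; the batch_size and num_batches values are illustrative and should be tuned for your GPUs:

$ bsub -app tensorflow_horovod -gpu "num=1:mode=exclusive_process" -I -e stderr%J.txt python /usr/local/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --num_gpus=1 --model resnet50 --batch_size 64 --num_batches=500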

Testing the new container with the TensorFlow benchmark and Horovod

A few notes on the examples below:

1) For the benchmark, use your fastest network, which should be 10Gb Ethernet or faster, or potentially InfiniBand. The examples use a 40Gb Ethernet network. For the btl_tcp_if_include and HOROVOD_GLOO_IFACE parameter values, replace my network interface, "enP48p1s0f0", with the fastest network interface available on your compute nodes (see the example after these notes for how to list your interfaces).

2) The mpirun command has several debugging options enabled.

3) If you have problems with the jobs below, check the standard error file, which is stderr<JOBID>.txt.

4) Depending on the number of GPUs per node on your servers, adjust the first example (on a single compute node) based on the table below:

GPUs per node    -n    ptile    -gpu num
1                 1      1         1
2                 2      2         2
4                 4      4         4

5) Depending on the number of GPUs per node on your servers, adjust the second example (on two compute nodes) based on the table below:

GPUs per node    -n    ptile    -gpu num
1                 2      1         1
2                 4      2         2
4                 8      4         4
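As mentioned in note 1, you can list the network interfaces and link speeds on a compute node to pick the value for btl_tcp_if_include and HOROVOD_GLOO_IFACE; the interface name enP48p1s0f0 below is specific to my servers:

$ ip -br addr show
$ ethtool enP48p1s0f0 | grep Speed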

Example TensorFlow benchmark using Horovod with 4 GPUs on a single compute node

$ bsub -app tensorflow_horovod -n 4 -R "span[ptile=4]" -gpu "num=4:mode=exclusive_process" -I -e stderr%J.txt mpirun -mca btl_tcp_if_include enP48p1s0f0 -x HOROVOD_GLOO_IFACE=enP48p1s0f0 -mca btl ^openib -mca pml ob1 -x NCCL_IB_DISABLE=1 -mca plm_base_verbose 10 -x NCCL_DEBUG=WARN /usr/local/scripts/gpu_bind.sh python /usr/local/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --variable_update=horovod --num_gpus=1 --model resnet50 --batch_size 128 --num_batches=10
Job <19781> is submitted to default queue
<<Waiting for dispatch ...>>
<<Starting on ac922c>>
TensorFlow:  1.14
Model:       resnet50
Dataset:     imagenet (synthetic)
Mode:        training
SingleSess:  False
Batch size:  128 global
             128 per device
Num batches: 10
Num epochs:  0.00
Devices:     ['horovod/gpu:0']
NUMA bind:   False
Data format: NCHW
Optimizer:   sgd
Variables:   horovod
==========
Generating training model
TensorFlow:  1.14
Model:       resnet50
Dataset:     imagenet (synthetic)
Mode:        training
SingleSess:  False
Batch size:  128 global
             128 per device
Num batches: 10
Num epochs:  0.00
Devices:     ['horovod/gpu:0']
NUMA bind:   False
Data format: NCHW
Optimizer:   sgd
Variables:   horovod
==========
Generating training model
TensorFlow:  1.14
Model:       resnet50
Dataset:     imagenet (synthetic)
Mode:        training
SingleSess:  False
Batch size:  128 global
             128 per device
Num batches: 10
Num epochs:  0.00
Devices:     ['horovod/gpu:0']
NUMA bind:   False
Data format: NCHW
Optimizer:   sgd
Variables:   horovod
==========
Generating training model
TensorFlow:  1.14
Model:       resnet50
Dataset:     imagenet (synthetic)
Mode:        training
SingleSess:  False
Batch size:  128 global
             128 per device
Num batches: 10
Num epochs:  0.00
Devices:     ['horovod/gpu:0']
NUMA bind:   False
Data format: NCHW
Optimizer:   sgd
Variables:   horovod
==========
Generating training model
Initializing graph
Initializing graph
Initializing graph
Initializing graph
Running warm up
Running warm up
Running warm up
Running warm up
NCCL version 2.4.7+cuda10.0
NCCL version 2.4.7+cuda10.0
NCCL version 2.4.7+cuda10.0
NCCL version 2.4.7+cuda10.0
Done warm up
Step    Img/sec total_loss
Done warm up
Step    Img/sec total_loss
Done warm up
Step    Img/sec total_loss
1       images/sec: 411.2 +/- 0.0 (jitter = 0.0)        7.972
1       images/sec: 409.4 +/- 0.0 (jitter = 0.0)        7.972
Done warm up
Step    Img/sec total_loss
1       images/sec: 408.0 +/- 0.0 (jitter = 0.0)        7.972
1       images/sec: 405.4 +/- 0.0 (jitter = 0.0)        7.972
10      images/sec: 409.3 +/- 0.5 (jitter = 1.6)        7.856
----------------------------------------------------------------
total images/sec: 409.12
----------------------------------------------------------------
10      images/sec: 408.4 +/- 0.3 (jitter = 1.1)        7.856
----------------------------------------------------------------
total images/sec: 408.25
----------------------------------------------------------------
10      images/sec: 407.8 +/- 0.3 (jitter = 0.8)        7.856
----------------------------------------------------------------
total images/sec: 407.55
----------------------------------------------------------------
10      images/sec: 406.2 +/- 0.6 (jitter = 2.7)        7.856
----------------------------------------------------------------
total images/sec: 406.03
----------------------------------------------------------------
$

Example TensorFlow benchmark using Horovod with 4 GPUs (2 GPUs per node) on two compute nodes

$ bsub -app tensorflow_horovod -n 4 -R "span[ptile=2]" -gpu "num=2:mode=exclusive_process" -I -e stderr%J.txt mpirun -mca btl_tcp_if_include enP48p1s0f0 -x HOROVOD_GLOO_IFACE=enP48p1s0f0 -mca btl ^openib -mca pml ob1 -x NCCL_IB_DISABLE=1 -mca plm_base_verbose 10 -x NCCL_DEBUG=WARN /usr/local/scripts/gpu_bind.sh python /usr/local/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --variable_update=horovod --num_gpus=1 --model resnet50 --batch_size 128 --num_batches=10
Job <19782> is submitted to default queue
<<Waiting for dispatch ...>>
<<Starting on ac922b>>
TensorFlow:  1.14
Model:       resnet50
Dataset:     imagenet (synthetic)
Mode:        training
SingleSess:  False
Batch size:  128 global
             128 per device
Num batches: 10
Num epochs:  0.00
Devices:     ['horovod/gpu:0']
NUMA bind:   False
Data format: NCHW
Optimizer:   sgd
Variables:   horovod
==========
Generating training model
TensorFlow:  1.14
Model:       resnet50
Dataset:     imagenet (synthetic)
Mode:        training
SingleSess:  False
Batch size:  128 global
             128 per device
Num batches: 10
Num epochs:  0.00
Devices:     ['horovod/gpu:0']
NUMA bind:   False
Data format: NCHW
Optimizer:   sgd
Variables:   horovod
==========
Generating training model
TensorFlow:  1.14
Model:       resnet50
Dataset:     imagenet (synthetic)
Mode:        training
SingleSess:  False
Batch size:  128 global
             128 per device
Num batches: 10
Num epochs:  0.00
Devices:     ['horovod/gpu:0']
NUMA bind:   False
Data format: NCHW
Optimizer:   sgd
Variables:   horovod
==========
Generating training model
TensorFlow:  1.14
Model:       resnet50
Dataset:     imagenet (synthetic)
Mode:        training
SingleSess:  False
Batch size:  128 global
             128 per device
Num batches: 10
Num epochs:  0.00
Devices:     ['horovod/gpu:0']
NUMA bind:   False
Data format: NCHW
Optimizer:   sgd
Variables:   horovod
==========
Generating training model
Initializing graph
Initializing graph
Initializing graph
Initializing graph
Running warm up
Running warm up
Running warm up
Running warm up
NCCL version 2.4.7+cuda10.0
NCCL version 2.4.7+cuda10.0
NCCL version 2.4.7+cuda10.0
NCCL version 2.4.7+cuda10.0
Done warm up
Step    Img/sec total_loss
Done warm up
Step    Img/sec total_loss
Done warm up
Step    Img/sec total_loss
1       images/sec: 410.5 +/- 0.0 (jitter = 0.0)        7.972
1       images/sec: 405.9 +/- 0.0 (jitter = 0.0)        7.972
Done warm up
Step    Img/sec total_loss
1       images/sec: 407.6 +/- 0.0 (jitter = 0.0)        7.972
1       images/sec: 407.6 +/- 0.0 (jitter = 0.0)        7.972
10      images/sec: 410.4 +/- 0.6 (jitter = 1.7)        7.856
----------------------------------------------------------------
total images/sec: 410.27
----------------------------------------------------------------
10      images/sec: 406.6 +/- 0.6 (jitter = 1.6)        7.856
----------------------------------------------------------------
total images/sec: 406.45
----------------------------------------------------------------
10      images/sec: 407.8 +/- 0.6 (jitter = 1.5)        7.856
----------------------------------------------------------------
total images/sec: 407.56
----------------------------------------------------------------
10      images/sec: 407.7 +/- 0.4 (jitter = 1.0)        7.856
----------------------------------------------------------------
total images/sec: 407.55
----------------------------------------------------------------
$

Conclusion

You now have a new container image that is ready to run TensorFlow with Horovod across multiple nodes in an LSF cluster. Please leave comments or feedback on the above information, and let me know if you would like the article to include x86_64 equivalents.
#SpectrumComputingGroup