Now available, Dependency Validation Tool for IBM Spectrum Conductor Deep Learning Impact 1.1

By Archive User posted Thu February 08, 2018 02:13 PM

  

Originally posted by: KJ9F_Mikhail_Genkin


A new utility for IBM Spectrum Conductor Deep Learning Impact 1.1 is now available: the Dependency Validation Tool (DVT). DVT is an open source tool that you can download and run before using IBM Spectrum Conductor Deep Learning Impact, to ensure that your cluster is ready for deep learning workloads. To obtain a free trial version of IBM Spectrum Conductor Deep Learning Impact, see Free Trial.

 

IBM Spectrum Conductor Deep Learning Impact 1.1 leverages open source deep learning frameworks, which are installed separately on the cluster, to import models and datasets and to run training jobs. However, to fully leverage the capabilities of TensorFlow and Caffe, cluster machines must also be equipped with properly configured GPUs. DVT checks not only that you have deep learning frameworks installed, but also that you have the right versions installed and that your cluster is GPU enabled.

 

In this blog, I will focus on how you can ensure that your deep learning frameworks are installed correctly, and how you can install and run DVT so that you can confidently start using IBM Spectrum Conductor Deep Learning Impact 1.1. As this blog only addresses the installation of supported deep learning frameworks, you can learn about other IBM Spectrum Conductor Deep Learning Impact prerequisites (such as supported operating systems and hardware requirements) in IBM Knowledge Center at https://www.ibm.com/support/knowledgecenter/SSWQ2D_1.1.0/in/installation-requirements.html.

 

IBM Spectrum Conductor Deep Learning Impact 1.1 supports both POWER8 and Linux x86 platforms. On the POWER8 platform you can use the following open source frameworks, available with IBM PowerAI R5: TensorFlow 1.4, IBM Caffe 1.0, and Caffe 1.0. On Linux x86 you can use the following open source deep learning frameworks: TensorFlow 1.1 and Caffe 1.0. Note that you must install at least one framework, and make sure that it is the right version and that it is properly configured. I also encourage you to carefully review the support statement in IBM Knowledge Center at https://www.ibm.com/support/knowledgecenter/SSWQ2D_1.1.0/gs/supported-dl-frameworks.html.
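As a rough illustration of the version requirements above, the following Python sketch encodes the supported versions listed in this post and compares an installed version string against them. The platform and framework keys are hypothetical labels chosen for this example, and matching on major.minor only (so that "1.4.0" counts as "1.4") is an assumption, not a rule taken from the support statement.

```python
# Supported framework versions as listed in this post for
# IBM Spectrum Conductor Deep Learning Impact 1.1.
# The dictionary keys are hypothetical labels chosen for this sketch.
SUPPORTED = {
    "power8": {"tensorflow": "1.4", "ibm-caffe": "1.0", "caffe": "1.0"},
    "x86": {"tensorflow": "1.1", "caffe": "1.0"},
}


def is_supported(platform, framework, installed_version):
    """Return True if installed_version matches the listed major.minor release."""
    expected = SUPPORTED.get(platform, {}).get(framework)
    if expected is None:
        # Unknown platform or framework: treat as unsupported.
        return False
    # Assumption: only major.minor is compared, so "1.4.0" matches "1.4".
    major_minor = ".".join(installed_version.split(".")[:2])
    return major_minor == expected


if __name__ == "__main__":
    print(is_supported("power8", "tensorflow", "1.4.0"))
    print(is_supported("x86", "tensorflow", "1.4.0"))
```

The support statement in IBM Knowledge Center remains the authoritative source; a table like this is only a convenience for scripting your own checks.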

 

Before I proceed further into discussing how these frameworks must be installed, let’s talk a bit about the cluster. To fully leverage the capabilities of TensorFlow and Caffe, cluster machines must be equipped with properly configured GPUs. IBM Spectrum Conductor Deep Learning Impact can efficiently schedule analytic tasks on CPUs, while scheduling deep learning tasks on GPUs. In this blog, I will refer to compute hosts that are equipped with GPUs, and used to schedule deep learning tasks, as workers. I will refer to hosts that are not equipped with GPUs, and used only as management hosts, as masters.

 

When installing frameworks, you must make sure to install the same supported version of a deep learning framework on all workers. Deep learning frameworks do not need to be installed on the masters. When installing IBM PowerAI R5 deep learning frameworks on POWER8 workers, it is important to install each framework’s Python dependencies using the script at the following path: /opt/DL/<framework-name>/bin/install_dependencies. For the sequence of commands required to prepare your IBM PowerAI R5 environment for deep learning, see IBM Knowledge Center at http://www.ibm.com/support/knowledgecenter/SSWQ2D_1.1.0/in/installation-prepare-power-master.html.
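The path pattern above can be expanded per framework with a few lines of Python. This is a minimal sketch: the framework directory name "tensorflow" used in the example is an assumption, so check the actual directory names under /opt/DL on your workers, and follow the IBM Knowledge Center page for the exact command sequence.

```python
import posixpath

# Root directory for deep learning frameworks, as given in this post.
DL_ROOT = "/opt/DL"


def install_dependencies_path(framework_name):
    """Build the path to a framework's install_dependencies script."""
    if not framework_name or "/" in framework_name:
        raise ValueError("framework_name must be a single directory name")
    return posixpath.join(DL_ROOT, framework_name, "bin", "install_dependencies")


if __name__ == "__main__":
    # "tensorflow" is a hypothetical directory name used for illustration.
    print(install_dependencies_path("tensorflow"))
```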

 

When installing deep learning frameworks on Linux x86 workers, you must download TensorFlow and Caffe from their respective open source websites and install them following the instructions given on those sites, keeping in mind again that the frameworks must be installed on each worker in the cluster. It is important to install the open source frameworks under the /opt/DL directory and configure them to work with the available GPUs. Note that TensorFlow and Caffe depend on several packages that must be installed to enable them to use GPUs: NVIDIA CUDA 8.0, NVIDIA CUDA Deep Neural Network library (cuDNN) 6.0, and NVIDIA Collective Communications Library (NCCL) 1.0.
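One way to spot-check the CUDA requirement on a worker is to read the release number reported by `nvcc --version`. The sketch below is an illustration only and is not part of DVT; it assumes `nvcc` is on the PATH and that its output contains a "release X.Y" token. It does not cover cuDNN or NCCL.

```python
import re
import subprocess

# CUDA release required by this post for Linux x86 workers.
REQUIRED_CUDA = "8.0"


def parse_cuda_release(nvcc_output):
    """Extract the 'X.Y' release number from nvcc --version output, or None."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def installed_cuda_release():
    """Run nvcc --version and return the release string, or None if unavailable."""
    try:
        result = subprocess.run(
            ["nvcc", "--version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
    except OSError:
        # nvcc is not installed or not on the PATH.
        return None
    return parse_cuda_release(result.stdout)


if __name__ == "__main__":
    release = installed_cuda_release()
    print("CUDA release:", release, "| matches required:", release == REQUIRED_CUDA)
```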

 

After installing deep learning frameworks on all workers, you can install and run DVT to make sure that your cluster is ready. Note that you can run DVT either before or after installing IBM Spectrum Conductor with Spark and IBM Spectrum Conductor Deep Learning Impact, but you should run it before attempting to use IBM Spectrum Conductor Deep Learning Impact.

 

Before installing and running DVT, you need to designate one machine as the host from which you will run the tool. Usually this is one of your masters. For the purposes of this blog I will refer to this machine as the validation host. You do not need to copy DVT to all of the masters and workers that form your cluster, but you do need to make sure that passwordless SSH is enabled between your validation host and all masters and workers that you would like DVT to check.
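Before running DVT, you may want to confirm the passwordless SSH requirement yourself. The following Python sketch, run from the validation host, is one way to do so and is not part of DVT. It relies on the OpenSSH client options `BatchMode=yes` (fail instead of prompting for a password) and `ConnectTimeout`; the host names you pass in are your own.

```python
import subprocess
import sys


def ssh_check_command(host, timeout_seconds=5):
    """Build an ssh command that fails rather than prompting for a password."""
    return [
        "ssh",
        "-o", "BatchMode=yes",
        "-o", "ConnectTimeout={}".format(timeout_seconds),
        host,
        "true",  # trivial remote command; only the exit status matters
    ]


def has_passwordless_ssh(host):
    """Return True if ssh to host succeeds without any interactive prompt."""
    try:
        result = subprocess.run(
            ssh_check_command(host),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        # The ssh client itself is missing on this machine.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # Usage: python check_ssh.py host1 host2 ...
    for host in sys.argv[1:]:
        status = "ok" if has_passwordless_ssh(host) else "FAILED"
        print("{}: {}".format(host, status))
```

A host that reports FAILED might be unreachable, or might still require a password or a host key confirmation; running `ssh` to it manually usually shows which.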

 

To install DVT, you need to get the tool onto your validation host. To do this, you can either clone the DVT Git repository on your validation host, or download the .gz file from http://git.ng.bluemix.net/ibmcws-deep-learning-samples/ibmcws-deep-learning-dependency-checker-tool/ and unzip it in any directory on the validation host. For the purposes of this blog I will refer to the directory where you cloned or unzipped DVT as DVT Home.

 

To run DVT, see the README file available with the tool, here: https://git.ng.bluemix.net/ibmcws-deep-learning-samples/ibmcws-deep-learning-dependency-checker-tool/blob/master/README.md. Running DVT generates a report summary that tells you whether:

  • Connectivity or access permission issues were encountered on any of the cluster machines
  • There are enabled GPUs available on all workers
  • TensorFlow is installed on all workers, is at the right version on all workers, and can execute a sample application on all workers
  • Caffe is installed on all workers, is at the right version on all workers, and responds to API calls on all workers
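To give a feel for what a GPU availability check can look like, here is a small Python sketch that counts the GPUs listed by `nvidia-smi -L` on a worker. It is an illustration only and does not reflect how DVT itself performs its checks; it assumes the NVIDIA driver utilities are installed and that each GPU appears on a line starting with "GPU ".

```python
import subprocess


def count_gpus(nvidia_smi_list_output):
    """Count GPUs in the output of 'nvidia-smi -L' (one 'GPU N: ...' line each)."""
    return sum(
        1
        for line in nvidia_smi_list_output.splitlines()
        if line.startswith("GPU ")
    )


def local_gpu_count():
    """Run 'nvidia-smi -L' locally and return the GPU count (0 if unavailable)."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"],
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            universal_newlines=True,
        )
    except OSError:
        # nvidia-smi is not installed on this machine.
        return 0
    return count_gpus(result.stdout)


if __name__ == "__main__":
    print("GPUs visible on this host:", local_gpu_count())
```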

 

The DVT report summary is generated in a subdirectory under your DVT Home. The report directory name includes a date and timestamp. This way you can run DVT, examine the report, make configuration changes to your systems, and rerun DVT if needed to make sure that problems indicated in the initial run have been resolved.

 

The DVT report subdirectory contains two types of files. The *.rep files are report files containing information about the masters and workers on your system. The *.log files are log files that capture any errors that occurred while the tool was running. When DVT finishes executing, it displays the report_summary.rep file in the console window. Below is an example of the summary report that gets issued:

[Image: example DVT summary report]
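Because each run writes its results to a new timestamped subdirectory under DVT Home, it can be handy to locate the most recent report after a rerun. The Python sketch below picks the newest subdirectory that contains report_summary.rep. It assumes that the summary file sits directly inside the report subdirectory, and it uses the directory modification time rather than parsing the directory name, since the exact name format is not described here.

```python
import os

# Name of the summary file that DVT displays when it finishes.
SUMMARY_FILE = "report_summary.rep"


def latest_report_dir(dvt_home):
    """Return the most recently modified report directory under dvt_home, or None."""
    candidates = []
    for name in os.listdir(dvt_home):
        path = os.path.join(dvt_home, name)
        # Assumption: a report directory holds report_summary.rep at its top level.
        if os.path.isfile(os.path.join(path, SUMMARY_FILE)):
            candidates.append(path)
    if not candidates:
        return None
    return max(candidates, key=os.path.getmtime)


if __name__ == "__main__":
    # Uses the current directory as DVT Home for illustration.
    print(latest_report_dir("."))
```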

 

The report summary also contains a listing of suspect workers. Suspect workers are workers on which DVT found some problems with GPU, TensorFlow, or Caffe installations. If the suspect workers list is not empty, the next step would be to review the detailed report and log file for that worker. These files will be found in the report directory, and will contain the hostname of the worker in the file name. Below is an example of the detailed report for one of the workers:

[Image: example detailed report for one worker]
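Since the detailed report and log files carry the worker hostname in their file names, a short helper can collect them for a suspect worker. This Python sketch is a convenience and not part of DVT; it matches on a simple substring, so a hostname such as "worker1" would also match files for "worker10". The exact file naming scheme is not described here, so adjust the matching to what you see in your report directory.

```python
import os


def files_for_worker(report_dir, hostname):
    """List .rep and .log files in report_dir whose names contain hostname."""
    matches = []
    for name in sorted(os.listdir(report_dir)):
        # Substring match: short hostnames may match longer ones as well.
        if hostname in name and name.endswith((".rep", ".log")):
            matches.append(os.path.join(report_dir, name))
    return matches


if __name__ == "__main__":
    import sys

    # Usage: python worker_files.py <report_dir> <hostname>
    if len(sys.argv) == 3:
        for path in files_for_worker(sys.argv[1], sys.argv[2]):
            print(path)
```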

 

Note that running DVT and getting a clean report does not guarantee that IBM Spectrum Conductor Deep Learning Impact will be able to leverage your cluster error-free. It does, however, help rule out the most common TensorFlow and Caffe configuration errors. I highly recommend that you use DVT as an integral part of your cluster installation and configuration process.

 

 


For more information about IBM Spectrum Conductor Deep Learning Impact, see:

www.ibm.biz/DeepLearningImpact

 

 

