High Performance Computing Group

Connect with HPC subject matter experts and discuss how hybrid cloud HPC Solutions from IBM meet today's business needs.

IBM Spectrum LSF support for deep learning distributed frameworks

    Posted Mon November 18, 2019 04:09 PM

    Background

    Deep learning (also known as deep structured learning, hierarchical learning or deep machine learning) is a branch of machine learning based on a set of algorithms that attempt to model high level abstractions in data by using a deep graph with multiple processing layers, composed of multiple linear and non-linear transformations. Deep learning algorithms are based on distributed representations. The underlying assumption behind distributed representations is that observed data are generated by the interactions of factors organized in layers. Deep learning adds the assumption that these layers of factors correspond to levels of abstraction or composition. Varying numbers of layers and layer sizes can be used to provide different amounts of abstraction.

    TensorFlow is an open source software library for numerical computation using data flow graphs (deep learning). Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them. The flexible architecture allows you to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API.

    Caffe is a deep learning framework made with expression, speed, and modularity in mind. It is developed by the Berkeley Vision and Learning Center (BVLC) and by community contributors.

    IBM Spectrum LSF is a workload management platform and job scheduler for distributed environments. It can be used to run batch jobs on networked UNIX, Linux, and Windows systems on many different hardware architectures.
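
    For example, a batch job can be submitted to IBM Spectrum LSF with a single bsub command. A minimal sketch, assuming a hypothetical script train_model.sh and a queue named "normal":

        # Request 4 slots in an example queue and write output to a job-ID-stamped file.
        bsub -n 4 -q normal -o output.%J ./train_model.sh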

    Challenges

    Current distributed deep learning frameworks have limited resource management capabilities. How do you manage increasingly complex resources and launch complex workloads in a deep learning environment? And if you already have an IBM Spectrum LSF or Apache Spark environment, how do you integrate IBM Spectrum LSF with the distributed deep learning framework you use today?

    IBM Spectrum LSF, the well-known workload management solution for distributed environments, can help you manage your deep learning workloads and environment.

    Our Experience

    We selected three popular distributed deep learning frameworks for this integration.

    1) TensorFlow

    Starting with v0.8, TensorFlow natively supports cross-node distribution through gRPC and protocol buffers. IBM Spectrum LSF selects the compute nodes, GPUs, and other resources, then launches parameter server (ps) jobs and worker jobs on the selected nodes and GPUs. It starts the ps task on the first execution host and uses the IBM Spectrum LSF command blaunch to launch worker tasks on the other selected hosts, as illustrated below.

    [Figure: distributed TensorFlow under IBM Spectrum LSF (https://developer.ibm.com/storage/wp-content/uploads/sites/91/2017/01/tensorflow-768x585.png)]
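
    As a rough sketch of this pattern, the job script below assumes a user-provided trainer.py that accepts the classic --job_name, --task_index, --ps_hosts, and --worker_hosts flags (the tf.train.ClusterSpec style); the host count, port, and resource requests are illustrative assumptions, not the exact contents of the tensorflow.jf example script.

        #!/bin/sh
        #BSUB -n 3                    # one slot per task: 1 parameter server + 2 workers
        #BSUB -R "span[ptile=1]"      # place one task per host
        #BSUB -gpu "num=1"            # one GPU per host (LSF 10.1+ GPU syntax)
        #BSUB -o tensorflow.%J.out

        # LSB_HOSTS holds the allocated host names; the first one is the first
        # execution host, where this script runs and where the ps task is started.
        hosts=$(echo $LSB_HOSTS | tr ' ' '\n' | awk '!seen[$0]++')
        ps_host=$(echo "$hosts" | head -1)
        workers=$(echo "$hosts" | tail -n +2)

        PS_SPEC="${ps_host}:2222"
        WORKER_SPEC=$(echo "$workers" | sed 's/$/:2222/' | paste -sd, -)

        # Start the parameter server task on the first execution host (in the
        # background; ps tasks in this classic pattern block until killed).
        python trainer.py --job_name=ps --task_index=0 \
            --ps_hosts=$PS_SPEC --worker_hosts=$WORKER_SPEC &
        ps_pid=$!

        # Use blaunch to start one worker task on each remaining host.
        i=0
        for h in $workers; do
            blaunch $h python trainer.py --job_name=worker --task_index=$i \
                --ps_hosts=$PS_SPEC --worker_hosts=$WORKER_SPEC &
            worker_pids="$worker_pids $!"
            i=$((i + 1))
        done

        # Wait for the workers to finish, then stop the parameter server.
        wait $worker_pids
        kill $ps_pid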

    2) Caffe Parallel

    Caffe parallel is a faster framework for deep learning. It is forked from BVLC/caffe (master branch: https://github.com/BVLC/caffe). For more details, visit here. The main feature of caffe-parallel is data parallelism via MPI.

    Caffe parallel depends on MPI, and MPI workloads are one of the most common applications of IBM Spectrum LSF resource and job management. Running Caffe parallel is just like running any other MPI job under IBM Spectrum LSF control.

    IBM Spectrum LSF selects execution hosts and GPUs, and uses mpirun to launch the Caffe parallel application on each execution host, as shown below.

    [Figure: Caffe parallel under IBM Spectrum LSF (https://developer.ibm.com/storage/wp-content/uploads/sites/91/2017/01/caffe_parallel-768x580.png)]
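
    A minimal job script for this case can be as simple as the sketch below; the rank count, GPU request, and solver path are illustrative assumptions, not the exact contents of the caffeparallel.jf example script.

        #!/bin/sh
        #BSUB -n 4                    # four MPI ranks
        #BSUB -R "span[ptile=1]"      # one rank per host
        #BSUB -gpu "num=1"            # one GPU per host (LSF 10.1+ GPU syntax)
        #BSUB -o caffe_parallel.%J.out

        # With the LSF MPI integration, mpirun picks up the hosts that LSF allocated
        # to this job, so no explicit machine file is needed here.
        mpirun ./build/tools/caffe train --solver=examples/mnist/lenet_solver.prototxt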

    3) Caffe on Spark

    CaffeOnSpark brings deep learning to Hadoop and Spark clusters. By combining salient features from the Caffe deep learning framework and big-data frameworks Apache Spark and Apache Hadoop, CaffeOnSpark enables distributed deep learning on a cluster of GPU and CPU servers.

    As a distributed extension of Caffe, CaffeOnSpark supports neural network model training, testing, and feature extraction. Caffe users can now perform distributed learning using their existing LMDB data files with minor adjustments to their network configuration.

    CaffeOnSpark is a Spark package for deep learning. It is complementary to non-deep learning libraries MLlib and Spark SQL. CaffeOnSpark's Scala API provides Spark applications with an easy mechanism to invoke deep learning over distributed datasets.

    IBM Spectrum LSF selects hosts and GPUs for CaffeOnSpark, and starts the Spark cluster before the Caffe job starts. After the job finishes, IBM Spectrum LSF stops the Spark cluster on the selected hosts. IBM Spectrum LSF treats the whole Spark cluster as a normal job, as illustrated below.

    [Figure: CaffeOnSpark under IBM Spectrum LSF (https://developer.ibm.com/storage/wp-content/uploads/sites/91/2017/01/caffeonspark-768x576.png)]
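
    The flow can be sketched in a job script like the one below: start a standalone Spark cluster on the allocated hosts, submit the CaffeOnSpark application, then stop the cluster. SPARK_HOME, the jar name, the main class, and the solver and model paths are assumptions for illustration, not the exact contents of the caffeonspark.jf example script.

        #!/bin/sh
        #BSUB -n 3
        #BSUB -R "span[ptile=1]"      # one Spark worker per host
        #BSUB -gpu "num=1"
        #BSUB -o caffeonspark.%J.out

        # Allocated hosts, in order; the first execution host runs the Spark master.
        hosts=$(echo $LSB_HOSTS | tr ' ' '\n' | awk '!seen[$0]++')
        master_host=$(echo "$hosts" | head -1)
        MASTER_URL="spark://${master_host}:7077"

        # Start the Spark master here and one Spark worker on every allocated host.
        $SPARK_HOME/sbin/start-master.sh
        for h in $hosts; do
            blaunch $h $SPARK_HOME/sbin/start-slave.sh $MASTER_URL
        done

        # Submit the CaffeOnSpark training job to the cluster that was just started.
        $SPARK_HOME/bin/spark-submit --master $MASTER_URL \
            --class com.yahoo.ml.caffe.CaffeOnSpark \
            caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
            -train -conf lenet_memory_solver.prototxt -model lenet.model

        # Tear the Spark cluster down before the LSF job exits.
        for h in $hosts; do
            blaunch $h $SPARK_HOME/sbin/stop-slave.sh
        done
        $SPARK_HOME/sbin/stop-master.sh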

    Resources

    Example scripts tensorflow.jf, caffeparallel.jf, and caffeonspark.jf can be found here.

    Conclusion

    TensorFlow, Caffe-parallel and CaffeOnSpark can run as jobs under IBM Spectrum LSF control. IBM Spectrum LSF can manage the host and GPU resources that deep learning applications require, and collect resource usage for deep learning applications.



    ------------------------------
    GEORGE GAO
    ------------------------------

    #SpectrumComputingGroup