How to choose a system architecture for large scale machine learning model training - Series 1: A reliable and scalable architecture

By Harris Yang posted Mon January 18, 2021 09:31 PM

  


Author: Catherine Wu (wuyan@cn.ibm.com)

Table of Contents

1. Background
1.1 What is a large scale machine learning model?
2. A reliable and scalable architecture - what to consider
3. Measurement


1. Background
It's widely accepted by business leaders that AI and ML have great potential to transform their businesses by reducing costs, managing risks, and accelerating growth. However, the majority of AI initiatives haven't delivered the expected returns. Data science teams are struggling to move AI from ad-hoc experiments into production to realize the full value of these technologies.
One of the most pressing challenges data scientists face is choosing the best-fit system architecture to train large scale machine learning models. In this paper, we present two typical scenarios distilled from a large number of real customer cases, and provide recommended system architectures based on our experiments.
1.1 What is a large scale machine learning model?
A large scale machine learning model could be a model trained on big data (100 GB+ or billions of records), a big model (with 10,000+ features and parameters), many models (10,000+), or a combination of any two of the above.
During the training phase of developing AI algorithms, scalability is all about how much data can be analyzed and the speed at which it can be analyzed. This performance can be improved with distributed algorithms and distributed processing.
In the deployment phase of AI projects, scalability has more to do with the number of concurrent users or applications that can hit the model at once.
In this whitepaper, we focus on the former - the training phase of large scale machine learning. In Section 2, we explain which factors to consider when choosing a system architecture. With those factors in mind, Section 3 proposes measurements, or evaluation criteria, for deciding which ML/DL framework best fits your needs. In Section 4, we provide two typical scenarios and a decision tree to navigate the solution space. In the last section, recommended solutions for specific use cases are presented.
2. A reliable and scalable architecture - what to consider 
In order to design a reliable and scalable system architecture that runs training jobs efficiently and effectively, you first need to consider the following aspects.
    1) Are you building machine learning models, deep learning models, or both?
    2) Which ML/DL framework do you adopt in your projects? (Short, illustrative code sketches for several of these frameworks are provided after this questionnaire.)
    • Scikit-learn - Scikit-learn is a Python library for machine learning that provides a set of simple and efficient tools for data mining and data analysis. It does not support distributed training. The framework is built on top of several popular Python packages, namely NumPy, SciPy, and matplotlib.
    • Spark MLlib - The key benefit of MLlib is that it allows data scientists to focus on their data problems and models instead of solving the complexities surrounding distributed data (such as infrastructure, configurations, and so on).
      • Scalability: the same ML code runs on your laptop and on a big cluster seamlessly without breaking down, so businesses can keep the same workflows as their user base and data sets grow.
      • Spark MLlib's in-memory data processing comes at a price: it can consume large amounts of memory and create memory bottlenecks.
      • Caching is not built into Spark MLlib by default; the caching mechanism needs to be set up manually.
    • XGBoost - XGBoost is an implementation of gradient boosted decision trees designed for speed and performance.
      • Parallelization of tree construction using all of your CPU cores during training.
      • Distributed Computing for training very large models using a cluster of machines.
      • Out-of-Core Computing for very large data sets that don’t fit into memory.
      • Cache Optimization of data structures and algorithms to make best use of hardware.
    • SPSS - IBM SPSS is a software suite that uses predictive intelligence so businesses can get the best results from their data. It is a user-friendly and flexible solution that allows users of different skill levels to apply predictive analysis to both small and large projects.
    • TensorFlow - TensorFlow is good for advanced projects, such as creating multilayer neural networks. It’s used in voice/image recognition and text-based apps. It supports distributed training.
    • PyTorch - PyTorch is mainly used to train deep learning models quickly and effectively, so it’s the framework of choice for a large number of researchers. It supports distributed training.
    • Caffe and Caffe2 - Caffe is a framework implemented in C++ that has a useful Python interface. In 2018, Caffe2 was merged into PyTorch as part of the PyTorch 1.0 release, joining the strengths of the two engines.
      • It offers pre-trained models for building demo apps.
      • It’s fast, scalable, and lightweight.
    • Keras - Keras is a minimalistic Python-based library that can run on top of TensorFlow.
      • It has built-in support for training on multiple GPUs.
      • It can be turned into TensorFlow Estimators and trained on clusters of GPUs on Google Cloud.
      • It can be run on Spark MLlib.
    • Others
    3) How many models do you need to train concurrently?
    • Less than 10
    • 10 - 100
    • 100 - 1000
    • 1000 - 10000
    • 10000+
    4) How many features are in each model? 
    • 0-100
    • 100-1000
    • 1000-5000
    • 5000+
    5) Are you using a massive amount of data to train one model? Is it necessary to use all of the data to achieve better performance?
    • 0-20 GB
    • 20-100 GB
    • 100-500 GB
    • 500 GB+
    6) How many concurrent users access your data science platform? At its core, how many concurrent runtime environments are needed?
    • 0-20
    • 20-50
    • 50-100
    • 100+
    7) What's the expected training frequency and training completion time?
    • Frequency 
      • Daily
      • Weekly
      • Bi-Weekly
      • Monthly
      • Ad-Hoc
    • Training completion time
      • Less than one hour 
      • 2-4 hours
      • 4-12 hours
      • Days
      • Weeks
    8) Where is your training data stored?
    • Database
    • Local Files
    • Object Storage
    • Hadoop 
    • Edge device
    • Others
    9) What's your overall budget for the platform?
    • Hardware
    • Software 
    10) What's the expected format for exporting models (such as PMML)? Do you need remote deployment on edge nodes?
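
To make the framework choices in question 2 more concrete, the following sketches show what a minimal training job looks like in several of the frameworks. They are illustrative only: the datasets are synthetic, and the paths, column names, and hyperparameters are placeholders rather than recommendations.

A minimal scikit-learn sketch. Training runs on a single node; n_jobs can use all local CPU cores, but the data and the model must fit in that node's memory.

    # Minimal single-node scikit-learn sketch (synthetic data, illustrative only)
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    X, y = make_classification(n_samples=10000, n_features=50, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # n_jobs=-1 parallelizes across local CPU cores, but not across machines
    model = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=42)
    model.fit(X_train, y_train)
    print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
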
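A rough Spark MLlib sketch. The same pipeline code runs on a laptop or on a cluster; only the Spark master and deployment settings change. The input path, column names, and output path are assumptions for illustration, and note that caching has to be requested explicitly.

    # Rough Spark MLlib sketch; paths and column names are placeholders
    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("distributed-training").getOrCreate()
    df = spark.read.parquet("training_data.parquet")    # hypothetical input
    df = df.cache()                                     # caching is not automatic

    assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
    lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=50)
    model = Pipeline(stages=[assembler, lr]).fit(df)

    model.write().overwrite().save("lr_model")          # hypothetical output path
    spark.stop()
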
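A hedged XGBoost sketch. This is single-node training that parallelizes tree construction across all CPU cores; distributed training over a cluster would instead use an integration such as xgboost.spark or xgboost.dask, which is not shown here.

    # Single-node XGBoost sketch with parallel tree construction (synthetic data)
    import numpy as np
    import xgboost as xgb

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100000, 100))
    y = (X[:, 0] + rng.normal(size=100000) > 0).astype(int)

    dtrain = xgb.DMatrix(X, label=y)
    params = {
        "objective": "binary:logistic",
        "tree_method": "hist",   # cache-friendly histogram algorithm
        "nthread": -1,           # use all available CPU cores
        "max_depth": 6,
        "eta": 0.1,
    }
    booster = xgb.train(params, dtrain, num_boost_round=100)
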
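A TensorFlow/Keras sketch of multi-GPU training with tf.distribute.MirroredStrategy. If only one GPU (or only a CPU) is visible, the same code still runs on a single device; the model architecture and training data below are placeholders.

    # TensorFlow/Keras multi-GPU sketch using MirroredStrategy (placeholder model and data)
    import numpy as np
    import tensorflow as tf

    strategy = tf.distribute.MirroredStrategy()   # replicates the model across local GPUs
    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.Input(shape=(50,)),
            tf.keras.layers.Dense(128, activation="relu"),
            tf.keras.layers.Dense(1, activation="sigmoid"),
        ])
        model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

    x_train = np.random.rand(1024, 50).astype("float32")
    y_train = np.random.randint(0, 2, size=(1024, 1)).astype("float32")
    model.fit(x_train, y_train, batch_size=64, epochs=2)
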
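A basic PyTorch training loop, shown single-process for brevity. Distributed training would typically wrap the model in torch.nn.parallel.DistributedDataParallel and launch one process per GPU; that setup is omitted here.

    # Basic single-process PyTorch training loop (synthetic data, illustrative only)
    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(50, 64), nn.ReLU(), nn.Linear(64, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.BCEWithLogitsLoss()

    X = torch.randn(1024, 50)                   # synthetic placeholder data
    y = torch.randint(0, 2, (1024, 1)).float()

    for epoch in range(5):
        optimizer.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        optimizer.step()
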
3. Measurement
  • The degree of matching functions - Does the platform provide the capabilities the business needs? For example, collaboration, visual modeling, distributed training, virtual training, AutoAI, and more.
  • The level of openness - Is the platform open, or does it lock you in to a vendor?
  • The difficulty level of integration - How easily can the platform integrate with your existing IT systems?
  • The difficulty level of extension - How easily can the platform be extended to add customized capabilities?
  • Time on training - How long does it take to run a typical model training workload? What's the bottom line that satisfies the business need?
  • Storage required - How much storage is required to process the data and run the training workload? What type of storage?
  • Budget needed - How much will the solution cost in terms of hardware and software licenses?
  • People and skill set - How many data scientists and machine learning engineers are needed, and what skill levels are required? The cost of human resources is often neglected.

#CloudPakforDataGroup