Cloud Pak for Data


How to choose a system architecture for large scale machine learning model training - Series 3: Train multiple models concurrently - Multi-process training in single notebook

By GUANG MING ZHANG posted Wed January 20, 2021 12:20 AM

  
This is Series 3 of the series: How to choose a system architecture for large scale machine learning model training.

This solution applies when you need to train several models of the same algorithm, each on different data, where a single model takes a relatively long time to train (e.g., more than 10 minutes) on a small amount of data, and the total training time exceeds the specified threshold.
Entry Criteria:
  1. Total model training time > The specified threshold
  2. Total model training data volume < 1/3 node memory
As mentioned in Section 2, the specified threshold is a training duration that is acceptable to the user, such as one hour or one day. Choose a threshold that fits your situation, and use multi-process training when the total training duration would exceed it.
As mentioned in Section 2, the total amount of training data should be less than 1/3 of the node memory, mainly because of the amount of data a single notebook can train on.
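The two entry criteria can be expressed as a simple check. The following is only a minimal sketch: the function name, its parameters, and the example figures are hypothetical and do not come from the worker framework.

```python
def should_use_multiprocess(total_train_seconds, threshold_seconds,
                            data_bytes, node_memory_bytes):
    """Return True when both entry criteria from this article are met.

    All names here are illustrative assumptions, not part of the framework.
    """
    # Criterion 1: total model training time > the specified threshold
    exceeds_threshold = total_train_seconds > threshold_seconds
    # Criterion 2: total training data volume < 1/3 of node memory
    fits_in_memory = data_bytes < node_memory_bytes / 3
    return exceeds_threshold and fits_in_memory
```

For example, with a one-hour threshold, an estimated two hours of total training, 10 GiB of data, and a 64 GiB node, both criteria hold and the function returns True.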
Solution:
The solution is divided into the following steps.
  1. Slice the data, with each slice used to train a model  - See Series 2 for details
  2. Use the notebook to build model training algorithms - See Series 2 for details
  3. Migrate model training codes to the worker notebook - See Series 2 for details
  4. Set up a multi-CPU notebook environment runtime
  5. Configure multi-process mode and run the entire 'worker' notebook for training
1 Set up a multi-CPU notebook environment runtime
For this solution, configure a notebook environment runtime with multiple CPUs so that the notebook can run training in parallel.
In the diagram above, we added two new environments: one with 6 vCPUs and the other with 3 vCPUs.
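Once the notebook is running in the new environment, it can be useful to confirm how many CPUs the Python process can see. The snippet below is a small sketch using only the standard library; the helper name visible_cpus is hypothetical. Note that in a containerized runtime these calls may report the CPUs of the host rather than the vCPU limit of the environment, so treat the result as an upper bound and prefer the vCPU number you selected for the runtime.

```python
import os


def visible_cpus():
    """Return the number of CPUs this process may run on (best effort)."""
    # os.sched_getaffinity is only available on some platforms (e.g. Linux).
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    # Fall back to the total CPU count; os.cpu_count() may return None.
    return os.cpu_count() or 1
```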
2 Configure the parameters and start training
The worker framework supports two modes of operation.
  1. single-process multi-loop
  2. multi-process multi-loop
In this solution, we use the second mode. The mode is selected by the configuration parameter is_multiple_processors, which must be set to True for this solution.
In addition, specify the data slice and the number of CPUs to use for training. Note: the number of CPUs should not exceed the number of vCPUs defined by the notebook runtime environment.
These two parameters are overridden if you use the 'controller' notebook to call the 'worker' notebook, as in the next solution. In that case, they are redefined in the 'controller' notebook and passed to the 'worker' notebook.
You can then run the entire notebook to start multi-process training.
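The configuration described in this section can be sketched roughly as follows. This is not the actual worker framework code: only is_multiple_processors is named in this article, while data_slices, cpu_num, train_one, effective_workers, and train_all are hypothetical names used for illustration, and the standard multiprocessing.Pool stands in for whatever mechanism the framework uses.

```python
import multiprocessing as mp
import os

# Hypothetical configuration cell; only is_multiple_processors comes from the article.
is_multiple_processors = True
data_slices = ["slice_0", "slice_1", "slice_2", "slice_3"]  # assumed slice identifiers
cpu_num = 3  # should not exceed the vCPUs of the notebook runtime


def train_one(slice_id):
    """Placeholder for the real per-slice training code in the worker notebook."""
    return slice_id, f"model_for_{slice_id}"


def effective_workers(requested, available=None):
    """Clamp the requested process count to the CPUs that are available."""
    available = available or os.cpu_count() or 1
    return max(1, min(requested, available))


def train_all(slices, multi, requested_cpus):
    """Train one model per slice, either in a loop or in a process pool."""
    if not multi:
        # Mode 1: single-process multi-loop
        return dict(train_one(s) for s in slices)
    # Mode 2: multi-process multi-loop
    with mp.Pool(processes=effective_workers(requested_cpus)) as pool:
        return dict(pool.map(train_one, slices))
```

Calling train_all(data_slices, is_multiple_processors, cpu_num) would then train one model per slice across up to cpu_num processes. One design caveat: a function defined in a notebook cell can be sent to pool workers on Linux, where processes are forked by default, but on platforms that spawn new processes it may need to live in an importable module instead.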
