How to choose a system architecture for large scale machine learning model training - Series 4: Train multiple models concurrently - Multiple notebooks training

By GUANG MING ZHANG posted Wed January 20, 2021 01:07 AM

This is Series 4 of the series How to choose a system architecture for large scale machine learning model training.

This solution applies when the user needs to train many models of the same algorithm on different data, a single model takes a relatively long time to train (e.g., more than 10 minutes), and the total data volume or total training time is more than a single notebook can handle.
Entry Criteria:
  1. Total model training time > The specified threshold 
  2. Total model training data volume > 1/3 node memory
As above, the specified threshold is a training duration the user considers acceptable, such as one hour or one day.
The second criterion, total training data volume exceeding 1/3 of node memory, reflects how much data a single notebook can train on. If the data volume exceeds that limit, we should consider training with the multiple notebooks solution. Both criteria can be checked with a few lines of arithmetic, as in the sketch below.
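For illustration only, here is a minimal check of the two entry criteria; every number below is a made-up example, not a recommendation.

```python
# Minimal sketch of the entry criteria; all values are illustrative.
single_model_minutes = 12        # measured training time of one model
n_models = 500                   # number of models to train
threshold_minutes = 60           # acceptable total duration, e.g. one hour

total_minutes = single_model_minutes * n_models
total_data_gb = 40               # total volume of training data
node_memory_gb = 64              # memory of one node

if total_minutes > threshold_minutes or total_data_gb > node_memory_gb / 3:
    print("Entry criteria met: consider the multiple-notebooks solution.")
```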
Solution:
  1. High Level Design

In the diagram above, at the top is a controller that designs and schedules the whole pipeline. The data is sliced, e.g., into Data source 1, Data source 2, and so on, and each slice corresponds to one training task. The scheduler creates m workers, each with its own training queue that can hold multiple tasks. The controller puts the training tasks into these queues. Each worker takes a task from its queue, completes the training, then takes the next task, until every task in its queue has been trained.
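In the real solution the workers are notebook jobs, but the queue pattern itself can be sketched locally with Python's multiprocessing module. Everything below (the task names, the number of workers, the train_model placeholder) is illustrative, not the article's actual code.

```python
# Local sketch of the controller/worker queue pattern described above.
from multiprocessing import Process, Queue
from queue import Empty

def train_model(task):
    # Placeholder for the real training code: fit one model on one slice.
    print(f"training a model on {task}")

def worker(q):
    # Drain this worker's own queue, training one model per task.
    while True:
        try:
            task = q.get_nowait()
        except Empty:
            break
        train_model(task)

if __name__ == "__main__":
    m = 3                                               # number of workers
    tasks = [f"Data source {i}" for i in range(1, 13)]  # one task per slice

    # One training queue per worker; the controller fills them round-robin.
    queues = [Queue() for _ in range(m)]
    for i, task in enumerate(tasks):
        queues[i % m].put(task)

    procs = [Process(target=worker, args=(q,)) for q in queues]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```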
2. Low Level Design

In the diagram above, the controller is implemented as a notebook and takes the role of creating and running the pipeline. Each worker is implemented as a notebook-based job. The controller creates a number of training tasks based on the sliced data and assigns the corresponding tasks to each worker. Inside each worker, we can choose either a multi-process or a single-process approach to train the models.
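As a sketch of the worker side, the snippet below assumes the controller hands each worker its slice list and process count through environment variables named DATA_SPLIT and CPUS; those names, and the train_one placeholder, are assumptions for illustration only.

```python
# Sketch of one worker: choose multi-process or single-process training.
import os
from multiprocessing import Pool

def train_one(slice_path):
    # Placeholder: load one data slice and fit one model on it.
    print(f"training a model on {slice_path}")

slices = [s for s in os.environ.get("DATA_SPLIT", "").split(",") if s]
cpus = int(os.environ.get("CPUS", "1"))

if cpus > 1:
    # Multi-process mode: train up to 'cpus' models concurrently.
    with Pool(processes=cpus) as pool:
        pool.map(train_one, slices)
else:
    # Single-process mode: train the models one after another.
    for s in slices:
        train_one(s)
```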
The solution is divided into the following steps.
  1. Slice the data, with each slice used to train one model - See Series 2 for details (a minimal slicing sketch follows this list)
  2. Use a notebook to build the model training algorithm - See Series 2 for details
  3. Migrate the model training algorithm to the 'worker' notebook - See Series 2 for details
  4. Set up a multi-CPU notebook environment runtime - See Series 3 for details
  5. Configure multi-process mode - See Series 3 for details
  6. Create the worker training job
  7. Configure training parameters inside the controller and start the training jobs.
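Step 1 is covered in Series 2; as a rough illustration of what the slicing might look like, the sketch below assumes the data carries a key column (here store_id) that identifies which model each row belongs to. The file and column names are hypothetical.

```python
# Illustrative slicing for step 1: one data slice per model.
# 'sales.csv' and 'store_id' are hypothetical; see Series 2 for the
# actual slicing approach used in this series.
import pandas as pd

df = pd.read_csv("sales.csv")
for key, part in df.groupby("store_id"):
    # Each slice becomes one training task (Data source 1, 2, ...).
    part.to_csv(f"data_source_{key}.csv", index=False)
```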
Create the worker training job
1. In the Jupyter Notebook editor, create a new job by clicking the 'New job' button in the upper right corner, as shown below, which opens the 'Create a job' page.
2. On the 'Create a job' page, enter 'worker' in the Job name field.


3. In the Configure tab, keep the current settings.
4. Proceed to the last tab and click the 'Create' button to create the job.

5. In the Jobs tab, you'll find the new job 'worker'.

Configure controller parameters and run training
1. Open the controller notebook and set the configuration parameters.
The meaning of each configuration parameter is as follows (an illustrative configuration cell follows the list).
  • workers: the number of training workers to start
  • cpus: the number of processes enabled in each worker
  • job_name: the name of the job created above
  • data_split: the data slices assigned to each worker
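For illustration, a configuration cell might look like the following; the values and file names are examples, and the list-of-lists shape of data_split is an assumption.

```python
# Illustrative controller configuration; values and file names are examples.
workers = 4                      # number of training workers (job runs)
cpus = 4                         # processes enabled inside each worker
job_name = "worker"              # the job created in the previous section
data_split = [                   # data slices assigned to each worker
    ["data_source_1.csv", "data_source_2.csv"],
    ["data_source_3.csv", "data_source_4.csv"],
    ["data_source_5.csv", "data_source_6.csv"],
    ["data_source_7.csv", "data_source_8.csv"],
]
```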
2. Run the controller code to start training.
In this program, the get_job_id method first retrieves the internal job id. The start_job method then starts a job run. Finally, a loop starts multiple workers. A sketch of this logic follows.
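The article's code itself is not reproduced here, so the following is only a reconstruction of the logic it describes. The get_job_id and start_job names come from the text above; their bodies, the Watson Data API v2 Jobs endpoints they call, the response fields they read, and the environment-variable hand-off to the workers are all assumptions.

```python
# Reconstruction of the controller logic described above. Method names
# come from the article; endpoints, response fields, and the
# environment-variable hand-off are assumptions.
import os
import requests

HOST = os.environ["RUNTIME_ENV_APSX_URL"]    # CPD base URL (assumed available)
TOKEN = os.environ["USER_ACCESS_TOKEN"]      # injected into CPD notebook runtimes
PROJECT = os.environ["PROJECT_ID"]
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

def get_job_id(job_name):
    # Look up the internal id of the job created in the previous section.
    resp = requests.get(f"{HOST}/v2/jobs", headers=HEADERS,
                        params={"project_id": PROJECT}, verify=False)
    for job in resp.json().get("results", []):
        if job["metadata"]["name"] == job_name:
            return job["metadata"]["asset_id"]

def start_job(job_id, env_vars):
    # Start one run of the job, passing this worker's configuration along.
    body = {"job_run": {"configuration": {"env_variables": env_vars}}}
    requests.post(f"{HOST}/v2/jobs/{job_id}/runs", headers=HEADERS,
                  params={"project_id": PROJECT}, json=body, verify=False)

job_id = get_job_id(job_name)
for i in range(workers):
    # One job run per worker, each with its own slice of the data.
    start_job(job_id, [f"CPUS={cpus}",
                       f"DATA_SPLIT={','.join(data_split[i])}"])
```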
