How to choose a system architecture for large scale machine learning model training - Series 3: Train multiple models concurrently - Multi-process training in single notebook

By GUANG MING ZHANG posted Wed January 20, 2021 12:20 AM

This is Series 3 in the series: How to choose a system architecture for large scale machine learning model training.

This solution applies when you need to train many models of the same algorithm, each on different data, where a single model takes a relatively long time to train (for example, more than 10 minutes) on a small amount of data, and the total training time exceeds the specified threshold.

Entry Criteria:

Total model training time > The specified threshold
Total model training data volume < 1/3 node memory

As mentioned in Series 2, the specified threshold is a training duration that is acceptable to the user, such as one hour or one day. Choose a threshold appropriate to your situation, and use multi-process training when the total training duration exceeds it.

As mentioned in Series 2, the total volume of training data should be less than 1/3 of node memory. This limit reflects how much data a single notebook can train on.
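The two entry criteria can be expressed as a small check. The following is a minimal sketch, not part of the worker framework: the function name and its arguments are hypothetical, and you would supply your own measurements for training time, data volume, and node memory.

```python
def should_use_multiprocess(total_training_seconds, threshold_seconds,
                            total_data_bytes, node_memory_bytes):
    """Return True when both entry criteria from this article hold.

    All names here are illustrative assumptions, not framework APIs.
    """
    # Criterion 1: total model training time > the specified threshold
    exceeds_threshold = total_training_seconds > threshold_seconds
    # Criterion 2: total training data volume < 1/3 of node memory
    fits_in_memory = total_data_bytes < node_memory_bytes / 3
    return exceeds_threshold and fits_in_memory
```

For example, with a one-hour threshold, two hours of estimated total training time, 2 GiB of data, and 16 GiB of node memory, the check returns True.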

Solution:

The solution is divided into the following steps.

Slice the data, with each slice used to train a model - See Series 2 for details
Use the notebook to build model training algorithms - See Series 2 for details
Migrate the model training code to the worker notebook - See Series 2 for details
Set up a multi-CPU notebook runtime environment
Configure multi-process mode and run the entire 'worker' notebook for training

1 Set up a multi-CPU notebook runtime environment

For this solution, configure a notebook runtime environment with multiple CPUs so that the notebook can run training in parallel.

In the diagram above, we added two new environments: one with 6 vCPUs and the other with 3 vCPUs.

2 Configure the parameters and start training

The worker framework supports two modes of operation.

single-process multi-loop
multi-process multi-loop

In this solution, we use the second mode. The configuration parameter is_multiple_processors selects between the two modes; for this solution, set is_multiple_processors to True.
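To illustrate how a flag like is_multiple_processors could switch between the two modes, here is a minimal sketch using Python's standard multiprocessing module. It is not the worker framework's actual code: train_one_model, run_worker, and n_processes are hypothetical names, the training function is a dummy stand-in, and the sketch assumes a Linux runtime where the "fork" start method is available.

```python
import multiprocessing as mp


def train_one_model(slice_id):
    """Hypothetical stand-in for training one model on one data slice."""
    # Replace this dummy computation with real training code.
    score = sum(i * i for i in range(1000)) % 97
    return slice_id, score


def run_worker(slice_ids, is_multiple_processors, n_processes=2):
    """Train one model per data slice, sequentially or in parallel."""
    if not is_multiple_processors:
        # single-process multi-loop: one process loops over all slices
        return [train_one_model(s) for s in slice_ids]
    # multi-process multi-loop: a pool of processes shares the slices.
    # Assumption: "fork" start method (Linux), so functions defined in
    # notebook cells are visible to the child processes.
    ctx = mp.get_context("fork")
    with ctx.Pool(processes=n_processes) as pool:
        # map() returns results in the same order as slice_ids
        return pool.map(train_one_model, slice_ids)
```

Calling run_worker([0, 1, 2, 3], is_multiple_processors=True, n_processes=2) trains the four slices across two processes; passing False runs the same loop in a single process.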

In addition, we should specify the data slice and how many CPUs are used for the training. Note: the number of CPUs should not exceed the vCPU number defined by the notebook runtime environment.
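One way to guard against requesting more CPUs than the runtime provides is to cap the value before starting training. This is a sketch under assumptions: resolve_process_count and its arguments are hypothetical names, not framework parameters. Note that os.cpu_count() may report the host's CPU count rather than the vCPU limit of a containerized runtime, so passing the environment's vCPU number explicitly is the safer choice.

```python
import os


def resolve_process_count(requested_cpus, runtime_vcpus=None):
    """Cap the requested CPU count at the runtime's vCPU number.

    runtime_vcpus: the vCPU number of the notebook runtime environment.
    If omitted, fall back to os.cpu_count(), which may overstate the
    limit inside a container (assumption: caller accepts that risk).
    """
    if requested_cpus < 1:
        raise ValueError("requested_cpus must be at least 1")
    if runtime_vcpus is None:
        runtime_vcpus = os.cpu_count() or 1
    return min(requested_cpus, runtime_vcpus)
```

With the 3 vCPU environment, a request for 8 CPUs would be reduced to 3, while a request for 4 CPUs on the 6 vCPU environment is left unchanged.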

If you use the 'controller' notebook to call the 'worker' notebook, as in the next solution, these two parameters are overridden: they are redefined in the 'controller' notebook and passed to the 'worker' notebook.

You can then run the entire notebook to start multi-process training.

Permalink

https://community.ibm.com/community/user/blogs/guang-ming-zhang1/2021/01/20/how-to-choose-a-system-architecture-for-large-scal