This is Part 4 of the series "How to choose a system architecture for large scale machine learning model training".
This solution applies when the user needs to train multiple models of the same algorithm, each on different data, where a single model takes a relatively long time to train (for example, more than 10 minutes) and the total data volume or total training time is more than one notebook can handle.
Entry Criteria:
- Total model training time > the specified threshold
- Total model training data volume > 1/3 node memory
As above, the specified threshold is a training duration the user considers acceptable, such as one hour or one day.
The limit of 1/3 of node memory reflects roughly how much data a single notebook can train on. If the total training data exceeds this limit, consider training with the multiple notebooks solution.
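The entry criteria above can be expressed as a small check. The following Python sketch is illustrative only: the names `Workload`, `EntryCheck`, and `check_entry_criteria` are hypothetical, and it assumes that exceeding either criterion is enough to consider this solution, which is one reading of the description above.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a training job (names are illustrative)."""
    total_training_seconds: float  # estimated total training time for all models
    total_data_bytes: int          # total volume of training data


@dataclass
class EntryCheck:
    exceeds_time: bool    # total training time > the specified threshold
    exceeds_memory: bool  # total data volume > 1/3 of node memory

    @property
    def consider_multiple_notebooks(self) -> bool:
        # Assumption: either criterion alone is enough to consider the
        # multiple notebooks solution.
        return self.exceeds_time or self.exceeds_memory


def check_entry_criteria(workload: Workload,
                         threshold_seconds: float,
                         node_memory_bytes: int) -> EntryCheck:
    """Compare a workload against the two entry criteria."""
    return EntryCheck(
        exceeds_time=workload.total_training_seconds > threshold_seconds,
        exceeds_memory=workload.total_data_bytes > node_memory_bytes / 3,
    )


GIB = 1024 ** 3

# Example: 2 hours of total training and 10 GiB of data, checked against
# a 1-hour threshold on a node with 64 GiB of memory.
result = check_entry_criteria(
    Workload(total_training_seconds=2 * 3600, total_data_bytes=10 * GIB),
    threshold_seconds=3600,
    node_memory_bytes=64 * GIB,
)
print(result.exceeds_time, result.exceeds_memory,
      result.consider_multiple_notebooks)
```

In this example the time threshold is exceeded while the data still fits within 1/3 of node memory. If both criteria should be required together, replace `or` with `and` in `consider_multiple_notebooks`.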
Solution:
- High Level Design