How to choose a system architecture for large scale machine learning model training - Series 5: Apache Spark service in Cloud Pak for Data 3.5

By Harris Yang posted Thu January 14, 2021 05:39 AM

  

How to choose a system architecture for large scale machine learning model training


Series 5: Apache Spark service in Cloud Pak for Data 3.5


Table of Contents
1. Train ML/DL models from Big Data
2. Decision tree
3. Apache Spark service in Cloud Pak for Data
3.1 Setup Spark service in IBM Cloud Pak for Data
3.2 Define Spark environment for Jupyter Notebooks
3.3 Develop models in Jupyter Notebooks


1. Train ML/DL models from Big Data

Today, data science teams and data scientists play an essential role in most enterprises because of unimaginable data complexity and the explosive growth of data volume. The prevalence of data drives both that complexity and volume, and it also creates more opportunities for big data analytics and insights.
  • More than 16 million text messages are sent every minute.
  • More than 100 million spam emails are sent every minute.
  • More than one million Tinder swipes are produced every minute.
Data science teams are exploring different advanced ML and DL algorithms to tackle the data complexity challenge, but what about processing huge data volumes and developing models on big data? Huge data volume has become one of the big challenges and a common scenario for enterprise data science teams. For example, the data scientists in the risk management department of a large development bank need to develop a risk model against over 500 GB of transaction records generated by the upfront business unit.
Many parallelization technologies are now available in IBM Cloud Pak for Data for data science teams to process big data volumes and develop models, but we still need to dive deep into the properties of the training data set from different facets before picking the right solution. These facets come from the checklist in the previous section of this article, which helps us collect all the necessary information about the training data set.
Taking the risk model in the large development bank as an example again, the following training data set properties are laid out.
  1. The risk model is developed with ML algorithms such as logistic regression, random forest, and decision tree, without any DL.
  2. The model framework can be Scikit-Learn, Spark MLlib, or SPSS Modeler.
  3. Currently there are only one or two models to be trained.
  4. The risk model is trained against a limited number of features from the transaction records, and each feature has its own business context and meaning.
  5. The total training data set is more than 500 GB of structured data stored in an RDBMS.
  6. The department's data science team is responsible for developing the models, and fewer than 10 data scientists and business analysts are involved.
  7. Since the risk model has a very strict audit and management process, the team needs to retrain the models every couple of months.
  8. The training data is stored in an RDBMS in the enterprise.
  9. The purchasing budget is the responsibility of the enterprise IT department.
  10. The risk model needs to be saved in the PMML format and deployed to a central model deployment platform in the enterprise.
Note: The central model deployment platform is provided by a local software vendor.
After collecting all the above training data set properties and generating the data profile, we can follow the decision tree in the next section, trace down to the right distributed architecture, and pick the right technical solution for the large-scale training data set in IBM Cloud Pak for Data.

2. Decision tree
The following diagram is a summary of our decision tree. The decision tree is made up of several branches, first divided by concurrent training, then by either training time or training data size. You can follow the tree to find the right solution among the six solutions provided at the leaf nodes.
decision-tree.png
In this blog we will introduce the solution of training models with the Apache Spark service in Cloud Pak for Data 3.5. That is the branch leading to the Local Spark leaf node in the decision tree diagram above.

3. Apache Spark Service in Cloud Pak for Data
IBM Cloud Pak for Data provides containerized Spark service to run a variety of workloads:
  • Watson Studio Notebooks that call Apache Spark APIs
  • Spark applications that run Spark SQL
  • Data transformation jobs
  • Data science jobs
  • Machine learning jobs
Each time you submit a job, a dedicated Spark cluster is created for the job, and you can specify the size of the Spark driver, the size of the executors, and the number of executors. This enables you to achieve predictable and consistent performance.
When a job completes, the cluster is automatically cleaned up so that the resources are available for other jobs. The Spark service also includes interfaces that enable you to analyze the performance of Spark applications and debug problems.
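To make the sizing knobs concrete, below is a minimal sketch of how driver size, executor size, and executor count map to standard Spark configuration properties when building a Spark session from a notebook. In Cloud Pak for Data these values are normally set through the environment definition or the job payload rather than in code, and the application name and sizes shown here are purely illustrative.

```python
from pyspark.sql import SparkSession

# Illustrative only: driver size, executor size, and executor count expressed
# as standard Spark configuration properties. In Cloud Pak for Data these are
# usually set in the Spark environment definition or job submission instead.
spark = (
    SparkSession.builder
    .appName("risk-model-training")           # hypothetical application name
    .config("spark.driver.memory", "4g")      # size of the Spark driver
    .config("spark.executor.memory", "8g")    # size of each executor
    .config("spark.executor.instances", "4")  # number of executors
    .getOrCreate()
)
print(spark.version)
```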
Entry criteria:
  • The target models are typical Machine Learning models supported by Spark.
  • The size of the training data set, measured by the larger of the raw data size and the data size after feature engineering, falls into the range of 20 GB to 100 GB.

3.1 Setup Spark service in IBM Cloud Pak for Data
Spark environments are not available by default. The administrator must install the Analytics Engine Powered by Apache Spark service on the IBM Cloud Pak for Data platform. To determine whether the service is installed, open the Services catalog and check whether the service is enabled.
cpd3_5-spark-service.png
The IT administrator can refer to the IBM Cloud Pak for Data Knowledge Center to install and set up the Apache Spark service: https://www.ibm.com/support/knowledgecenter/en/SSQNUZ_3.5.0/svc-spark/spark-install.html#svc-install__connected-section

After the Spark service is enabled for IBM Cloud Pak for Data, data scientists can see the predefined Spark running environments in an analytics project.
cpd3_5-spark-runtime.png
3.2 Define Spark environment for Jupyter Notebooks
Data scientists can develop machine learning or deep learning models, or model flows, with Spark runtimes by associating the notebook with a Spark service or environment. With Spark environments, data scientists can configure the size of the Spark driver as well as the size and number of the executors. The default Spark environment definitions allow data scientists to get started with Spark quickly. If the default Spark environment definitions can't meet the resource requirements, data scientists can define customized Spark environment definitions with a specified configuration of the Spark driver and executors and scale up the number of executors.
Below are the steps for creating customized Spark environment definitions:
1. Go to the working analytics project and switch to the Environments tab.
2. Click New environment definition.
3. On the New environment page, provide the name of the environment.
4. Select Spark as the environment type and select the configuration for the Spark driver and executors.
5. Scale up the number of executors.
6. Select the software version.
7. Click Create to generate the new environment.
With the customized Spark environment definition, data scientists can associate it with Jupyter Notebooks for developing models on big training data sets.
cpd3_5-myspark-runtime.png
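Once a notebook is associated with the customized environment, a quick sanity check can confirm which driver and executor settings were actually applied. This is a minimal sketch that only assumes a SparkSession is available in the notebook runtime.

```python
from pyspark.sql import SparkSession

# Reuse the SparkSession provided by the associated Spark environment.
spark = SparkSession.builder.getOrCreate()

# Print the driver and executor settings that the environment applied.
for key, value in spark.sparkContext.getConf().getAll():
    if "driver" in key or "executor" in key:
        print(key, "=", value)
```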
3.3 Develop models in Jupyter Notebooks
With either the default Spark environment definitions or customized Spark environment definitions, data scientists can create Jupyter Notebooks associated with a Spark environment to develop models.
cpd3_5-myspark-notebook.png
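As a minimal sketch of what such a notebook could look like for the risk model described in Section 1, the snippet below trains a logistic regression with Spark MLlib. The file path, feature columns, and label column are hypothetical placeholders, not part of the product documentation.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.getOrCreate()

# Load the training data, e.g. transaction records exported from the RDBMS
# into project storage (hypothetical path and schema).
df = spark.read.csv("transactions.csv", header=True, inferSchema=True)

feature_cols = ["amount", "balance", "tenure_days"]  # hypothetical features
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
lr = LogisticRegression(labelCol="is_risky", featuresCol="features")  # hypothetical label

pipeline = Pipeline(stages=[assembler, lr])
train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train_df)

# Quick check on the hold-out split.
model.transform(test_df).select("is_risky", "prediction", "probability").show(5)
```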
After finishing model training in the Jupyter Notebook, data scientists can stop the associated kernel so that the Spark service is terminated and the resources are freed up into the resource pool of IBM Cloud Pak for Data.
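Requirement 10 in Section 1 asks for the model to be saved in PMML format. Before the kernel is stopped, that could be done along the following lines; this is a hedged sketch that assumes the pyspark2pmml package and its JPMML-SparkML dependency have been added to the Spark environment (they are not part of the default runtime) and reuses train_df and model from the previous sketch.

```python
# Hedged sketch: export the trained Spark ML pipeline to PMML so it can be
# handed to the central model deployment platform. Assumes pyspark2pmml and
# the JPMML-SparkML jar are available in the Spark environment; train_df and
# model come from the training sketch above.
from pyspark2pmml import PMMLBuilder

PMMLBuilder(spark.sparkContext, train_df, model).buildFile("risk_model.pmml")
```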



#CloudPakforDataGroup