The following diagram summarizes our decision tree. The tree branches first on concurrent training, then on either training time or training data size. You can follow the tree to find the right solution among the six solutions provided at the leaf nodes.
In this blog we introduce the solution of training models against the Apache Spark service in Cloud Pak for Data 3.5, which corresponds to the branch leading to the Local Spark leaf node in the decision tree diagram above.
3. Apache Spark Service in Cloud Pak for Data
IBM Cloud Pak for Data provides a containerized Spark service to run a variety of workloads:
- Watson Studio Notebooks that call Apache Spark APIs
- Spark applications that run Spark SQL
- Data transformation jobs
- Data science jobs
- Machine learning jobs
Each time you submit a job, a dedicated Spark cluster is created for it. You can specify the size of the Spark driver, the size of the executors, and the number of executors for the job, which enables predictable and consistent performance.
When a job completes, the cluster is automatically cleaned up so that its resources become available for other jobs. The Spark service also includes interfaces that enable you to analyze the performance of Spark applications and debug problems.
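As a point of reference, these sizing knobs correspond to standard Spark configuration properties. The sketch below is illustrative only and assumes a plain PySpark session; in Cloud Pak for Data these values are normally set through the environment definition or job submission UI rather than in code, and the application name and sizes shown are hypothetical.

```python
# Illustrative sketch only: the standard Spark properties behind driver size,
# executor size, and executor count. All values here are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("sized-training-job")            # hypothetical job name
    .config("spark.driver.memory", "4g")      # size of the Spark driver
    .config("spark.executor.memory", "8g")    # memory per executor
    .config("spark.executor.cores", "2")      # cores per executor
    .config("spark.executor.instances", "4")  # number of executors
    .getOrCreate()
)
```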
Entry criteria:
- The target models are typical machine learning models supported by Spark (see the sketch after these criteria).
- The size of the training data set, measured as the larger of the raw data size and the data size after feature engineering, falls into the range of 20 to 100 GB.
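For a concrete sense of what "typical machine learning models supported by Spark" covers, a few representative estimators from Spark MLlib's pyspark.ml package are shown below; the list is illustrative, not exhaustive.

```python
# Representative Spark MLlib estimators (illustrative, not exhaustive).
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier
from pyspark.ml.regression import LinearRegression, GBTRegressor
from pyspark.ml.clustering import KMeans
from pyspark.ml.recommendation import ALS
```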
3.1 Set up the Spark service in IBM Cloud Pak for Data
Spark environments are not available by default. The administrator must install the Analytics Engine Powered by Apache Spark service on the IBM Cloud Pak for Data platform. To determine whether the service is installed, open the Services catalog and check whether the service is enabled.

The IT administrator can refer to the IBM Cloud Pak for Data Knowledge Center to install and set up the Apache Spark service: https://www.ibm.com/support/knowledgecenter/en/SSQNUZ_3.5.0/svc-spark/spark-install.html#svc-install__connected-section
After the Spark service is enabled on IBM Cloud Pak for Data, data scientists can see the predefined Spark environments in an analytics project.

3.2 Define a Spark environment for Jupyter Notebooks
Data scientists can develop machine learning models, deep learning models, or model flows with Spark runtimes by associating a Notebook with a Spark service or environment. With Spark environments, data scientists can configure the size of the Spark driver as well as the size and number of the executors. The default Spark environment definitions allow data scientists to get started with Spark quickly. If the default definitions don't meet the resource requirements, data scientists can create customized Spark environment definitions that specify the Spark driver and executor configuration and scale up the number of executors.
Below are the steps for creating a customized Spark environment definition:
1. Go to the working analytics project and switch to the Environments tab.
2. Click New environment definition.
4. On the New environment page, provide a name for the environment.
5. Select Spark as the environment type and choose the configuration for the Spark driver and executors.
5. Scale up the number of executors.
6. Select the software version.
7. Click Create to generate the new environment.
Data scientists can then associate the customized Spark environment definition with Jupyter Notebooks to develop models on big training data sets.
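Once a Notebook is attached to the customized environment, a quick sanity check can confirm that the configuration took effect. The snippet below is a minimal sketch that assumes the pre-created SparkSession is available as `spark`, as is common in Spark-backed notebooks.

```python
# Minimal sketch: print the driver/executor settings of the active session,
# assuming the notebook's pre-created SparkSession is bound to `spark`.
for key, value in sorted(spark.sparkContext.getConf().getAll()):
    if key.startswith(("spark.driver", "spark.executor")):
        print(key, "=", value)
```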

3.3 Develop models in Jupyter Notebooks
With either the default or a customized Spark environment definition, data scientists can create Jupyter Notebooks associated with a Spark environment to develop models.
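As an example of what such a notebook might contain, here is a minimal end-to-end training sketch using Spark MLlib. The data path and the column names (f1, f2, f3, label) are hypothetical placeholders, not part of the product.

```python
# Minimal model-training sketch for a Spark-backed notebook.
# The data path and column names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.getOrCreate()  # reuses the notebook's session if one exists

df = spark.read.parquet("/project_data/data_asset/training_data.parquet")  # hypothetical path
train, test = df.randomSplit([0.8, 0.2], seed=42)

assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = Pipeline(stages=[assembler, lr]).fit(train)

evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
print("holdout accuracy:", evaluator.evaluate(model.transform(test)))
```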

After model training in the Jupyter Notebook finishes, data scientists can stop the associated kernel so that the Spark cluster is terminated and its resources are returned to the resource pool of IBM Cloud Pak for Data.
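Equivalently, stopping the SparkSession at the end of the notebook releases the cluster without waiting for the kernel to be stopped; this is a one-line sketch assuming the session is bound to `spark`.

```python
# Release the dedicated Spark cluster back to the resource pool.
spark.stop()
```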