IBM Open Data Analytics for z/OS (IzODA) is an IBM product that provides a data analytics ecosystem previously unavailable on z/OS. This enables analytics applications to be developed within a popular interface while also being directly connected to z/OS and its enterprise data. The Jupyter notebook environment is one such example. This piece of software provides a web interface that data scientists can use to gain insight into the many facets of their businesses through analytical queries against any data source from any system.
With IzODA, z/OS sysprogs are now able to install and configure a Jupyter environment using Anaconda’s package management capabilities, and then integrate that environment with either z/OS Spark’s JVM-based Scala framework or Anaconda’s Python-based analytics stack to retrieve any of the z/OS data sources supported by the Optimized Data Layer. We previously blogged our own introduction to IBM Open Data Analytics for z/OS, so if you’d like a product overview or more information, refer to IzODA Overview.
We have implemented our own IzODA-based Jupyter Notebook environment and hope to offer some hints and tips learned from our experiences so far to help you get the most out of it. The IzODA Jupyter ecosystem provides two components for working with IzODA on z/OS. JupyterHub runs on a Linux server and hosts a multi-user Jupyter Notebook web interface for users to interact with. The Jupyter Kernel Gateway (JKG) runs on z/OS and allows JupyterHub to communicate with z/OS. This connection carries the executable code that data scientists write in Jupyter notebooks over to z/OS, where it is executed; the output is then sent back to JupyterHub and displayed to the user. If you'd like more information on the entire IzODA infrastructure and where JupyterHub and Jupyter Kernel Gateway fit in, please visit the IzODA GitHub Page. In this post, we'll focus on Jupyter Kernel Gateway since most of its configuration happens on the z/OS side. If you have not installed Jupyter Kernel Gateway yet, we’ve created a step-by-step video guide to help you get started, which you can find here.
One of the first things you'll find is that Jupyter Kernel Gateway installs into a user filesystem directory. To control its operations, such as starting or stopping it, you may have to manually run USS shell scripts, and these bash scripts may also require some environment configuration. This combination of shell-based, user-centric tasks may seem problematic in a shop that wants to use centralized configuration and started tasks to control and automate the startup and shutdown of the workloads running on its systems. A simple approach would be to automate the JKG scripts under trusted user IDs with shell-based utilities such as cron or init.d. In our case, we felt the need to integrate JKG’s startup and shutdown with started tasks controlled by our System Automation policy, so we created a BPXBATCH-based started task that, in turn, calls our own JKG controlling bash script. Here is a sample of our JCL started task, which very simply calls our bash script:
//JKG PROC ACTION='START'
//*PEND
//******************************************************************
// SET SPATH='/scripts'
// SET SCRIPT='jkgw.sh'
//******************************************************************
//* Program Usage: /S JKG
//* S JKG - Start MLz Jupyter Kernel Gateway
//* S JKG,ACTION=STOP - Stop MLz Jupyter Kernel Gateway
//*
//******************************************************************
//JKG EXEC PGM=BPXBATCH,REGION=0M,TIME=NOLIMIT,
// PARM='SH nohup &SPATH/&SCRIPT &ACTION'
//******************************************************************
//STDOUT DD SYSOUT=*
//STDERR DD SYSOUT=*
//******************************************************************
Our bash script, named jkgw.sh, sets some of the required environment variables such as _BPX_JOBNAME to make our JKG task easily identifiable and then starts Jupyter Kernel Gateway in the background using the command:
nohup python $PYTHONHOME/bin/jupyter-kernelgateway > $LOGNAME 2>&1 &
This will send the JKG daemon’s output to the file defined by the $LOGNAME environment variable. To shut down Jupyter Kernel Gateway, we execute the following bash command to retrieve the PID of the running JKG instance:
PID=$(COLUMNS=500 ps -ojobname,ruser,pid,ppid,stime,tty,vsz,state=State -oargs -e | grep jupyter-kernelgateway | grep -v grep | awk ' { print $3 } ')
Once we have the PID, we can simply kill the process.
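For reference, here is a minimal sketch of what a controlling script like our jkgw.sh might look like. The log file path and the JKGW job name are assumptions for illustration; substitute the values used in your own environment.

#!/bin/sh
# jkgw.sh - sketch of a JKG start/stop wrapper; the started task passes START or STOP
ACTION=${1:-START}

export _BPX_JOBNAME=JKGW                # assumed job name; makes the JKG task easy to identify
export LOGNAME=/var/jkg/jkgw.log        # assumed log file path for the daemon output

case $ACTION in
  START)
    # Start Jupyter Kernel Gateway in the background and capture its output
    nohup python $PYTHONHOME/bin/jupyter-kernelgateway > $LOGNAME 2>&1 &
    ;;
  STOP)
    # Retrieve the PID of the running JKG instance and kill it
    # (our full shutdown step also cleans up notebook kernels, as described below)
    PID=$(COLUMNS=500 ps -ojobname,ruser,pid,ppid,stime,tty,vsz,state=State -oargs -e | grep jupyter-kernelgateway | grep -v grep | awk ' { print $3 } ')
    [ -n "$PID" ] && kill $PID
    ;;
esac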
Another important tip to mention at this point: when a user creates a Jupyter notebook within the web interface and starts executing Spark queries, the notebook connects to z/OS Apache Spark to create a new Spark application for that notebook. This z/OS Spark application, along with the Spark executor task(s) created to execute the application's queries, is not terminated when Jupyter Kernel Gateway is shut down. As a result, the Spark executors continue to run, consuming the resources configured for Spark even when they are not doing any computation. The same thing occurs if a Jupyter notebook is kept running and never shut down from the web interface. This can potentially lead to many Jupyter notebooks consuming Apache Spark resources they are not using, simply because old notebooks are kept running. These long-running tasks may also delay the shutdown of OMVS during system shutdown. To reduce the impact of this behavior, we added a step to our shutdown script that looks for all Spark applications running on behalf of notebook kernels and kills them as well. In the kernel.json files for the different kernels we have configured, we added the _BPX_JOBNAME variable to identify the kernel tasks by job name with the line "_BPX_JOBNAME": "NOTEBOOK", which allows us to keep track of all notebooks. We then use this to find the PIDs of all running notebooks with the following command:
notebooks=$(COLUMNS=500 ps -ojobname,ruser,pid,ppid,stime,tty,vsz,state=State -oargs -e | grep NOTEBOOK | grep -v grep | awk ' { print $3 } ')
We then iterate over the output and kill all of the processes. This will shut down the Jupyter notebooks and their corresponding Apache Spark applications.
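A minimal sketch of the loop that follows the command above might look like this (illustrative only):

# Kill each notebook kernel found above; this also ends its corresponding Spark application
for pid in $notebooks; do
  kill $pid
done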
Speaking of Apache Spark, by default Jupyter notebooks will run Spark in local mode. This means that each notebook creates its own master and worker tasks, essentially bringing up its own Spark environment. In our environment, we prefer to run Spark in standalone cluster mode, meaning that whenever a request for a new Spark application comes in, it runs on a master and workers that are already initialized. We prefer this because it reduces overhead, lets us track our Spark applications better, and allows us to customize our settings rather than use defaults. To enable this, we had to add two statements to each kernel definition. When Jupyter Kernel Gateway is installed, a directory for each kernel type is created. As an example, we currently have kernels configured for apache_toree_pyspark, apache_toree_scala, and python3. Each of these is represented by a directory that holds a kernel.json file, which we customized to suit our environment. To have the kernels use our Spark configurations and run in cluster mode, we added the following two lines to the "env" section of each kernel.json:
"SPARK_CONF_DIR": "/spark/configuration/directory",
"SPARK_OPTS": "--master=spark://SPARK_IP_ADDRESS:SPARK_PORT",
Substitute "/spark/configuration/directory" with the directory where your spark-defaults.conf and spark-env.sh files are located, and change SPARK_IP_ADDRESS and SPARK_PORT to point to the instance of z/OS Apache Spark that is already running in cluster mode. Once these are included, you should see Spark applications show up in the Spark Web UI as you create Jupyter notebooks and run analytical queries.
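To show where these settings live, here is a trimmed-down sketch of what one of our kernel.json files might look like after the edits. The display_name, language, and argv values are illustrative placeholders, not our exact configuration:

{
  "display_name": "Python 3 with Spark",
  "language": "python",
  "argv": ["python", "-m", "ipykernel_launcher", "-f", "{connection_file}"],
  "env": {
    "_BPX_JOBNAME": "NOTEBOOK",
    "SPARK_CONF_DIR": "/spark/configuration/directory",
    "SPARK_OPTS": "--master=spark://SPARK_IP_ADDRESS:SPARK_PORT"
  }
}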
Code within Jupyter notebooks is often used to establish connections with data sources. These connections usually require passwords, and keeping them as plain text in the notebooks is never a recommended practice, for numerous reasons. Data scientists may want to share their notebooks or use them in presentations for live demos, and displaying passwords poses a security risk. To prevent this, we use environment variables that are set directly in the same kernel.json files in which we put SPARK_CONF_DIR and SPARK_OPTS. We simply add our password variables and values within the "env" section like so:
"MDS_PASS": "password123”,
By doing this, users creating Jupyter notebooks no longer have to type out the password right in their notebooks. In reality, they don’t even need to know the passwords, and when a password changes, it only needs to be changed once, within the kernel.json files. How you use the environment variables within a notebook varies depending on the kernel. For example, each Python Jupyter notebook has to have "import os" at the top, and then the environment variables can be accessed using os.environ.get("VARIABLE_NAME"), which in our example is MDS_PASS. In Scala Jupyter notebooks, there is no need for an import statement; simply enter System.getenv("VARIABLE_NAME").
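For example, a Python notebook cell that picks up the password might look like the following sketch, where MDS_PASS is the example variable from above:

import os

# Read the password exported through the kernel.json "env" section;
# the notebook itself never contains the password text.
mds_password = os.environ.get("MDS_PASS")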
You’ve now configured JKG and your Jupyter notebooks are working as planned. Once your data scientists have written their analytical queries right in the notebooks, you can use a tool named nbconvert to automate running the notebooks they have created. Nbconvert is a piece of software you can get from the IzODA Anaconda channel that will run the notebooks for you through a simple command. Nbconvert runs on z/OS, so if you’ve created notebooks on the JupyterHub web interface, you’ll need to export them as notebooks (.ipynb) to a USS directory on the z/OS system where JKG runs. Once that’s done, tag the .ipynb files as ASCII using the chtag -tc ISO8859-1 FILE.ipynb command and then run the following command to execute a notebook:
jupyter nbconvert --to notebook --execute FILE.ipynb --allow-errors --output outputNotebook.ipynb
This will execute your FILE.ipynb Jupyter Notebook and save the input and output in a new notebook called outputNotebook.ipynb.
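If you have a whole directory of exported notebooks, a small script along the lines of the sketch below could tag and execute all of them, which makes it easy to drive from cron or a started task. The directory paths and output naming are assumptions for illustration:

#!/bin/sh
# Sketch: tag and execute every exported notebook in a directory with nbconvert
NOTEBOOK_DIR=/u/notebooks            # assumed location of the exported .ipynb files
OUTPUT_DIR=/u/notebooks/output       # assumed location for the executed copies

for nb in $NOTEBOOK_DIR/*.ipynb; do
  chtag -tc ISO8859-1 "$nb"          # tag the notebook as ASCII
  jupyter nbconvert --to notebook --execute "$nb" --allow-errors \
      --output-dir "$OUTPUT_DIR" --output "$(basename "$nb" .ipynb)_output.ipynb"
done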
Once configured, Jupyter notebooks provide an easy-to-use, robust analytical environment for data scientists to more easily make sense of your enterprise data. We hope the tips and tricks discussed above help you integrate the Jupyter notebook ecosystem within your existing z/OS enterprise environment. This blog post is a work in progress so we’ll update it if we find anything new worth sharing. If you have any questions, comments or suggestions about anything discussed, please feel free to post a comment below.
Author: zPET Team