
Data Privacy Passports: Converting Jupyter Notebooks

z/OS Platform Evaluation and Test (zPET) runs customer-like workloads in a Parallel Sysplex environment to perform the final verification of new IBM Z hardware and software. For more information about zPET, please check out our community.

IBM Data Privacy Passports is an exciting new offering that brings security to eligible data itself. In zPET, we successfully integrated the product into our environment. This blog is part of a series that captures our journey.

Table of contents
  1. zPET's Initial Setup and Replication Strategy
  2. Creating the Infrastructure for IBM Data Privacy Passports
  3. Identifying the Data Privacy Passports Flow using REST APIs and Hive Commands
  4. Creating a Java Driver to Automatically Protect Data
  5. Converting Existing Python Jupyter Notebooks to Use IBM Data Privacy Passports

Overview

Through the central control policy of IBM Data Privacy Passports, you can control how eligible data is accessed, shared, and used. With this new technology, the way that consumers connect to and query the data might change, depending on their job role. One role that zPET saw impacted was the data scientist. zPET created a policy that protected and enforced various fields, and data scientists needed to update existing Jupyter notebooks to connect to Data Privacy Passports instead of directly to the data source. While this article covers reading data through Data Privacy Passports, it is important to note that if the necessary columns are not protected or enforced, querying the data source directly will yield better performance.

Requirements

This example uses a Python Jupyter notebook to connect to Data Privacy Passports. The following packages were added to the existing notebook so that it connects to the Data Privacy Passports controller instead of directly to the data source (a sample install command follows the note below):
impyla==0.15.0
thrift==0.11.0
thriftpy==0.3.9
thrift-sasl==0.2.1

Note:
  • The above packages and versions were used in this notebook because we found that newer versions of some of these packages introduced problems.
  • The Jupyter notebook uses an x86 kernel since not all of the required packages are available on z/OS at this time (April 2020). zPET has requested these packages be made available on z/OS, and we are hopeful that they will soon be available.
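
As an illustration, these pinned versions can be installed from within the notebook itself. This is a minimal sketch, assuming pip is available to the notebook's x86 Python kernel:
import sys

# Install the pinned package versions into this notebook kernel's environment
!{sys.executable} -m pip install impyla==0.15.0 thrift==0.11.0 thriftpy==0.3.9 thrift-sasl==0.2.1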

Converting a Notebook

zPET had existing Jupyter notebooks that accessed data sources over JDBC using both Mainframe Data Service (MDS) and Data Virtualization Manager for z/OS (DVM). That approach required only Spark to be imported:
import os

from pyspark import SparkContext
from pyspark.sql import SQLContext

However, connecting to Data Privacy Passports with Python requires impyla:
import os

from impala.dbapi import connect
from impala.util import as_pandas
from pyspark import SparkContext
from pyspark.sql import SQLContext

With the new packages imported, we can use impyla to connect to Data Privacy Passports, read data, and convert that data into a Spark DataFrame. Plugging this new flow into our existing notebooks allows us to leave the rest of each notebook untouched.

The original notebooks used a SQLContext to read from a data source over JDBC, supplying various options for the connection. One zPET example used DVM to query the STORE_ID column from a table called STOREDIM. This was done with something like:
store_dim = spark.read \
      .format("jdbc") \
      .option("driver", "com.rs.jdbc.dv.DvDriver") \
      .option("url", “jdbc:rs:dv://{}:{};DBTY=DVS; UID={};PWD={};CompressionType=UNCOMPRESSED;MXBU=4MB".format(HOSTNAME, PORT, USER, PASSWORD) \
      .option("dbtable", "STOREDIM") \
      .option("fetchSize", "10000") \
      .load() \
      .select("STORE_ID") \
      .withColumnRenamed("STORE_ID", "SDSTORE_ID")

Converting this to an impyla call can be done in just a few steps:
query = 'SELECT STORE_ID FROM DVM.STOREDIM'

# Connect to Data Privacy Passports
conn = connect(host=HOSTNAME,
               port=PORT,
               database='default',
               user=USER,
               password=PASSWORD,
               use_ssl=SECURE_CONN,
               auth_mechanism='PLAIN')
cur = conn.cursor()
cur.execute(query)

# Convert the cursor object to a Pandas DataFrame, and then that to a Spark DataFrame
# (sqlContext is the SQLContext assumed to be created earlier in the notebook)
pandas_df = as_pandas(cur)
store_dim = sqlContext.createDataFrame(pandas_df)
store_dim = store_dim.withColumnRenamed("STORE_ID", "SDSTORE_ID")

# Close the connection
cur.close()
conn.close()

In the new code, the connection is made to the Data Privacy Passports controller instead of to DVM. Using the cursor object, we execute a SQL query: instead of supplying a table as an option() and selecting columns with select(), we query directly with a SQL statement. Spark also supports reading with SQL, but our previous notebook did not use that capability.
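
For comparison, a SQL-based Spark JDBC read might have looked like the following. This is a hypothetical sketch we did not use, assuming Spark 2.4 or later, where the query option is available:
# Alternative flow: push the SQL statement down through Spark's JDBC reader
store_dim = spark.read \
      .format("jdbc") \
      .option("driver", "com.rs.jdbc.dv.DvDriver") \
      .option("url", "jdbc:rs:dv://{}:{};DBTY=DVS;UID={};PWD={};CompressionType=UNCOMPRESSED;MXBU=4MB".format(HOSTNAME, PORT, USER, PASSWORD)) \
      .option("query", "SELECT STORE_ID FROM STOREDIM") \
      .load() \
      .withColumnRenamed("STORE_ID", "SDSTORE_ID")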

It was easy to create a Pandas DataFrame from the cursor object, and for some notebooks, working directly with Pandas may be sufficient. However, our particular notebook capitalized on benefits provided by Spark, so we took the extra step of converting the Pandas DataFrame to a Spark DataFrame. After closing the cursor and connection, we end with our data in a Spark DataFrame, just as the previous notebook did.
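
For a notebook that does not need Spark, a minimal sketch of that simpler flow might look like this (hypothetical; our notebooks kept the Spark conversion):
# Work with the query results directly in Pandas instead of converting to Spark
pandas_df = as_pandas(cur)
pandas_df = pandas_df.rename(columns={"STORE_ID": "SDSTORE_ID"})
print(pandas_df.head())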

Note: The Data Privacy Passports data administrator created a policy that included DVM as a dbview called "DVM." In that dbview, the URL, driver, and data source credentials were specified.

Conclusion

The zPET team found the process of converting Jupyter notebooks to use Data Privacy Passports to be relatively straightforward. The data scientist needed to work with the data administrator to get a dbview (alias) created and to install new Python packages, but those were the only changes needed outside of the notebook. One issue we did encounter was that some of our tables exceeded the size recommended for protection with Data Privacy Passports. Trying to access these tables resulted in failures in the notebook, often with misleading error messages. Because the data scientists did not need protected or enforced data from these tables, we were able to query the data source directly.

References

IBM Data Privacy Passports - https://www.ibm.com/us-en/marketplace/data-privacy-passports
impyla - https://github.com/cloudera/impyla

Authors

Trent Balta (Trent.Balta@ibm.com)
Michael Cohoon (mtcohoon@us.ibm.com)
Torin Reilly (treilly@us.ibm.com)