
Using IBM Open Enterprise Python for z/OS and ZOAU to Work With Datasets

  

On IBM z/OS, data is commonly stored in and operated on within datasets. With the release of IBM Open Enterprise Python for z/OS and IBM Z Open Automation Utilities, it's easy to work with these datasets from your Python codebase.

IBM Z Open Automation Utilities is a set of tools designed to help bridge the gap between a traditional z/OS system and IBM z/OS UNIX System Services by providing equivalents to many Unix commands that can operate on z/OS-specific file types, such as dls (dataset ls), dtouch (dataset touch) and dcp (dataset cp). In addition to these tools, there are also Python language bindings for working with these z/OS-specific file types in Python, which will be the primary focus of this example.

Requirements:

  1. IBM Open Enterprise Python for z/OS
  2. IBM Z Open Automation Utilities 1.0.3

Setup:

  1. Install IBM Open Enterprise Python for z/OS (either SMP/E or PAX)
  2. Install IBM Z Open Automation Utilities (either SMP/E or PAX)
  3. Enable the ZOAU python bindings by exporting the following environment variables:
    export ZOAU_ROOT=<path to zoau install>
    export PATH=${ZOAU_ROOT}/bin:$PATH
    export PYTHONPATH=${ZOAU_ROOT}/lib/
    export LIBPATH=${ZOAU_ROOT}/lib
  4. Run the following to confirm successful installation and configuration:

    Note: ZOAU_ROOT is typically /usr/lpp/IBM/zoautil if installed through SMP/E

    # confirm python is functioning correctly
    $ python3 --version
    Python 3.8.5
    
    # confirm zoau is accessible
    $ zoaversion 
    2020/06/19 14:12:11 CUT V1.0.3
    
    # confirm that zoau is importable to python
    $ python3 -c "import zoautil_py;print(zoautil_py.__name__)"
    zoautil_py.python38
    

The Scenario:

Imagine you have two datasets you would like to perform some analysis on. In the past this might have been a job for COBOL or PL/I, but with ZOA Utilities and Python we can do it in fewer lines of code than ever before.

The Data layout:


The input:


Dataset1(username) = "id"

Dataset2(id) = "<first name>,<last name>,<age>,<height>,<weight>"

 
Two PDSE datasets: the first uses usernames as member names, with each member containing a single numeric id; the second uses those ids as member names, with each member containing a single comma separated list of data. However, the <weight> value may be unknown (represented by -1).
We wish to write a utility that combines this data, estimates the unknown values and outputs the results in a single dataset.


The desired output: 

Output(username) = "<first name>,<last name>,<age>,<height>,<weight>"


One PDSE dataset, with usernames as the member names and a comma separated list of data in each member.

In this example the two datasets are called TEST.USERNAME and USER.DATA, for the first and second datasets respectively.

 

$ mls TEST.USERNAME #mls acts like ls for members of a dataset
HENRYH
JANICEW
PHILIP
USER123
WILLIAM

# sample member of the first dataset
$ cat "//'TEST.USERNAME(HENRYH)'"
ID0843

$ mls USER.DATA 
ID0843
ID0483
ID0161
ID0388
ID0250

# Sample member of the second dataset
$ cat "//'USER.DATA(ID0843)'"
HENRY,HARRISON,27,125,162

# Sample member of the second dataset (note the missing weight)
$ cat "//'USER.DATA(ID0483)'"
JANICE,ANDERSON,60,148,-1

The Code:

Now that we understand the problem, let's start writing the code. Start by importing the required modules. We'll need sys, which we will use to read command line arguments, and NumPy, which we'll import using import numpy as np. This does two things: first, it imports NumPy for us to use; second, it lets us use the identifier np in place of numpy for cleaner code. Using the same trick we'll do import zoautil_py.Datasets as Datasets. This allows us to manipulate and work with datasets via ZOA Utilities (Z Open Automation Utilities).

import sys
import numpy as np
#zoau lets us interact with z/OS
import zoautil_py.Datasets as Datasets

 

Next, we'll write our main entry point. In Python this is done with an if statement checking whether the built-in symbol __name__ is "__main__". We also use sys.argv to access the program's arguments (the first argument is dataset 1, the second is dataset 2 and, optionally, the third is the name of the output dataset). Then we call the 3 functions we will create later: parse_dataset, fill_in_missing and write_output.

if __name__ == "__main__":

    if(len(sys.argv) < 3):
        print("Usage: {} <Input Dataset 1> <Input Dataset 2> [Output Dataset]".format(sys.argv[0]))
        sys.exit() 

    inputds1 = sys.argv[1]
    inputds2 = sys.argv[2]

    #assigns the 3rd argument if it exists, otherwise creates random name
    if len(sys.argv) == 4:
        output = sys.argv[3]
    else: 
        output = Datasets.temp_name() #creates a temporary dataset name
        Datasets.create(output)
        print("No output specified. Using: {}".format(output))

    records = parse_dataset(inputds1,inputds2)
    complete_records = fill_in_missing(records)
    write_output(output, complete_records)


Before we start filling in the 3 functions, let's define a class that we will use to hold each dataset member in memory. In Python, __init__ is our constructor (self is implicitly passed in, similar to this in C++). The first argument is username, the name of the member that contains the data we wish to use; the second is the comma separated line of data from our second dataset. Notice that age, height and weight are converted to integers (this will be important when we estimate missing parameters). Also note the __str__ function. This is known as a "magic" method, and we will take a closer look at it later.

#python in-memory storage of a record.
class Record:

    def __init__(self, username, line=""):
        
        self.username = username

        # unpack the comma separated fields
        name,surname,age,height,weight = line.split(',')
        
        self.name    = name        
        self.surname = surname
        self.age     = int(age)
        self.height  = int(height)
        self.weight  = int(weight)


    def __str__(self):

        return f"{self.name},{self.surname},{self.age},{self.height},{self.weight}\n"
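
Since Record uses only the standard library, its behavior can be tried off-host. Below is a minimal sketch re-declaring a trimmed copy of the class; the sample values are illustrative:

```python
# Trimmed re-declaration of the Record class, just to exercise
# the __str__ "magic" method without any z/OS dependencies.
class Record:

    def __init__(self, username, line=""):
        self.username = username
        # unpack the comma separated fields
        name, surname, age, height, weight = line.split(',')
        self.name    = name
        self.surname = surname
        self.age     = int(age)      # numeric fields become ints
        self.height  = int(height)
        self.weight  = int(weight)

    def __str__(self):
        return f"{self.name},{self.surname},{self.age},{self.height},{self.weight}\n"

r = Record("HENRYH", "HENRY,HARRISON,27,125,162")
# str() invokes __str__ under the hood
print(str(r), end="")  # prints HENRY,HARRISON,27,125,162
```

Printing the object, or passing it to str(), invokes __str__ automatically.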


Now we'll start writing the code that interacts with the datasets, in a function called parse_dataset that reads in both datasets and fills in the associated Record objects for us to manipulate. To do this we use the method Datasets.list_members, which takes a string representing the dataset whose members to list; in this case we pass the name of our first dataset (ds1_name).

def parse_dataset(ds1_name, ds2_name):

    #table of usernames to IDs 
    id_map = {}
    usernames = Datasets.list_members(ds1_name + "(*)")

    for username in usernames.split('\n'):
        # Using Python f-strings
        dataset_name = f"{ds1_name}({username})"
        id_map[Datasets.read(dataset_name).strip()] = username 

    #list all members of the dataset using wildcard (*)
    Names = Datasets.list_members(ds2_name + "(*)")
    records = []

    for member_name in Names.split('\n'):
        dataset_name = f"{ds2_name}({member_name})"
        records.append(Record(id_map[member_name], Datasets.read(dataset_name)))

    return records

We also create a "dictionary" called id_map. Dictionaries in Python are key-value pairs, similar to a map in C++; this one will be used to keep track of usernames (the member names of our first dataset) and the numeric ids stored in each member. We then use the method Datasets.list_members from ZOAU to get a newline separated string of member names (i.e. of the form "name1\nname2\nname3").

To iterate over the member names, Python strings have a built-in method split that returns a list of strings split on the separator provided, in this case "\n". Inside the loop we make use of "f-strings": f-strings in Python let you build a string from expressions that are substituted at runtime, in our case the variables ds1_name and username. We also use Datasets.read to read the data (here, our numeric id) from the specified member.

For the second dataset we follow an analogous procedure, except instead of building our id_map, we create an instance of the Record class from before and append it to a list.
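On a system without ZOAU, the id_map pattern can still be sketched by standing in for Datasets.list_members and Datasets.read with plain Python values; the usernames string, the ids and the member_contents dict below are made-up stand-ins, not ZOAU APIs:

```python
# Stand-ins for the ZOAU calls (illustrative values only):
# Datasets.list_members returns a newline separated string of
# member names, and Datasets.read returns a member's contents.
usernames = "HENRYH\nJANICEW"                                     # mimics list_members
member_contents = {"HENRYH": "ID0843\n", "JANICEW": "ID0483\n"}   # mimics read

id_map = {}
for username in usernames.split('\n'):
    # f-string builds the member name exactly as in parse_dataset
    dataset_name = f"TEST.USERNAME({username})"
    # .strip() drops the trailing newline from the member contents
    id_map[member_contents[username].strip()] = username

print(id_map)  # maps numeric id -> username
```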

Now it's time to move on to fill_in_missing. Once we've read the contents of our datasets into memory, we can perform some analytics. For this example we will use least squares linear regression to estimate some missing data. 

def fill_in_missing(records):

    x_train = []
    y_train = []

    for record in records:
        # -1 is a place holder value meaning the value was never recorded
        if(record.weight != -1):
            x_train.append([record.height, record.age])
            y_train.append(record.weight)

    X = np.array(x_train)
    Y = np.array(y_train)

    # this looks complicated but it’s really just linear regression
    # see: https://en.wikipedia.org/wiki/Ordinary_least_squares
    theta = np.dot(np.dot(np.linalg.inv(np.dot(X.T,X)), X.T), Y)

    for record in records:
        if(record.weight == -1):
            v = np.array([record.height, record.age])
            # for consistency we convert the value to an integer
            record.weight = int(np.dot(theta,v))

    return records


We start by iterating over our list of records, storing into a list all of our "training examples", i.e. data without missing parameters (reminder: a missing value is represented as -1). We pass these lists into np.array to convert them into NumPy arrays, then apply the formula:

    theta = (X^T X)^(-1) X^T Y

This may look complicated, but it is just the closed form solution to least squares linear regression. Once we have our estimator theta, we can iterate over the records and fill in the missing values.
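The closed form expression can be sanity-checked off-host with synthetic data where the true coefficients are known; the heights, ages and coefficients below are made up for illustration:

```python
import numpy as np

# Synthetic (height, age) rows whose weights are generated from
# known coefficients, so the recovered theta can be checked.
X = np.array([[150, 30], [160, 40], [170, 25], [180, 50]], dtype=float)
true_theta = np.array([0.5, 1.0])
Y = X @ true_theta  # exact, noise-free targets

# theta = (X^T X)^(-1) X^T Y, the same expression used in fill_in_missing
theta = np.dot(np.dot(np.linalg.inv(np.dot(X.T, X)), X.T), Y)

print(theta)  # recovers [0.5, 1.0] up to floating point error
```

With noise-free data the estimator recovers the generating coefficients exactly (up to floating point error); with real data it returns the least squares fit instead.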

After we've read in our data and performed some work on it, we can output the data.

def write_output(output_ds, records):

    for record in records:
        member_name = f"{output_ds}({record.username})"
        Datasets.write(member_name, str(record))

 

The process is simple. We loop over our list of records; using the f-string idiom we construct the member name, and with Datasets.write we write out our data. This is where the __str__ method we defined earlier becomes relevant: it defines how our Record class is converted to a string, so str(record) is equivalent to record.__str__().

Finally, we can combine all 3 of our functions together at the end of __main__ and our program is ready to use.

    records = parse_dataset(inputds1,inputds2)
    complete_records = fill_in_missing(records)
    write_output(output, complete_records)


Usage Example:

Running the program with the same datasets shown in "the data layout", we get the following:

# without a provided output, a random name is chosen
$ python3 main.py TEST.USERNAME USER.DATA
No output specified. using: MVSTMP.P3952203.T0705980.C0000000

$ mls MVSTMP.P3952203.T0705980.C0000000
HENRYH
JANICEW
PHILIP
USER123
WILLIAM

$ cat "//'MVSTMP.P3952203.T0705980.C0000000(HENRYH)'"
HENRY,HARRISON,27,125,162

# Note: previously this user had a missing parameter. 

$ cat "//'MVSTMP.P3952203.T0705980.C0000000(JANICEW)'"
JANICE,ANDERSON,60,148,198



Full Code Listing:

#!/usr/bin/env python3

import sys
#zoau lets us interact with z/OS
import zoautil_py.Datasets as Datasets
import numpy as np

#python in-memory storage of a record.
class Record:

    def __init__(self, username, line=""):
        
        self.username = username

        # unpack the comma separated fields
        name,surname,age,height,weight = line.split(',')
        
        self.name    = name        
        self.surname = surname 
        self.age     = int(age)
        self.height  = int(height)
        self.weight  = int(weight)

    def __str__(self):

        return f"{self.name},{self.surname},{self.age},{self.height},{self.weight}\n"

def fill_in_missing(records):

    x_train = []
    y_train = []

    for record in records:
        # -1 is a place holder value meaning the value was never recorded
        if(record.weight != -1):
            x_train.append([record.height, record.age])
            y_train.append(record.weight)

    X = np.array(x_train)
    Y = np.array(y_train)

    # this looks complicated but it's really just linear regression
    # see: https://en.wikipedia.org/wiki/Ordinary_least_squares
    theta = np.dot(np.dot(np.linalg.inv(np.dot(X.T,X)), X.T), Y)

    for record in records:
        if(record.weight == -1):
            v = np.array([record.height, record.age])
            # for consistency we convert the value to an integer
            record.weight = int(np.dot(theta,v))

    return records

def parse_dataset(ds1_name, ds2_name):

    #table of usernames to IDs 
    id_map = {}
    usernames = Datasets.list_members(ds1_name + "(*)")

    for username in usernames.split('\n'):
        # Using Python f-strings
        dataset_name = f"{ds1_name}({username})"
        id_map[Datasets.read(dataset_name).strip()] = username 


    #list all members of the dataset using wildcard (*)
    Names = Datasets.list_members(ds2_name + "(*)")
    records = []

    for member_name in Names.split('\n'):
        dataset_name = f"{ds2_name}({member_name})"
        records.append(Record(id_map[member_name], Datasets.read(dataset_name)))

    return records

def write_output(output_ds, records):

    for record in records:
        member_name = f"{output_ds}({record.username})"
        Datasets.write(member_name, str(record))

if __name__ == "__main__":

    if(len(sys.argv) < 3):
        print("Usage: {} <Input Dataset 1> <Input Dataset 2> [Output Dataset]".format(sys.argv[0]))
        sys.exit()

    inputds1 = sys.argv[1]
    inputds2 = sys.argv[2]

    #assigns the 3rd argument if it exists, otherwise creates random name
    if len(sys.argv) == 4:
        output = sys.argv[3]
    else: 
        output = Datasets.temp_name()
        Datasets.create(output)
        print("No output specified. Using: {}".format(output))

    records = parse_dataset(inputds1,inputds2)
    complete_records = fill_in_missing(records)
    write_output(output, complete_records)
