zPET - IBM Z and z/OS Platform Evaluation and Test

Real time SMF records scoring through WML for z/OS

  

In a previous blog, we gave an overview of Watson Machine Learning for z/OS (WML for z/OS, or WMLz) and mentioned that we had implemented some SMF real time scoring use cases. In this blog, we introduce our experiences with SMF real time scoring through WMLz.

SMF records are an important data source on z/OS. They provide system and job-related information that can be used for analyzing resource usage, billing users and maintaining system security. Besides the risk of moving sensitive data like SMF outside the scope of your z/OS security administrator’s control, offloading SMF data from your z/OS systems for machine learning analysis elsewhere can introduce long delays due to the data movement, which hurts the timeliness of the results. Leveraging the data gravity advantage of WMLz to deploy machine learning modeling and scoring of real time SMF records right on z/OS can deliver more timely analytic results for businesses that depend on SMF data.

SMF data is a natural choice for our environment because of the volume of records we produce due to the operational approach we take for our “customer-like” testing activities. In our parallel sysplex, a tremendous number of SMF records are generated every second. In this blog, we showcase a WMLz use case that focuses on a CF structure activity model, which is used to score CF performance metrics and provide warnings when we deviate from that model. The source data is CF activity data from SMF type 74 subtype 4 records. From the machine learning perspective, this use case employs supervised learning, specifically regression. The core idea is to collect historical data, filter out the obviously abnormal records, treat the remaining data as normal, and verify that it approximately follows a normal distribution. Then we use the upper bound of the 95% normal distribution interval as the threshold to build the baseline model. In reviewing the historical SMF data we collected along the way, we found that the CF activity records showed CF Synchronous Request Service times were periodically distributed. For the sake of simplicity, we initially decided to use only some of the TIME features as our model features.

We employed the WMLz IDE’s Jupyter Notebook interface to develop our own machine learning process through the following procedural steps: 1. Data collection, 2. Model training, 3. Saving and deploying the model, 4. Scoring new CF activity against the model.

1. Data collection

In our sysplex, SMF Type 74 records are written to a DASD-only log stream, and we use batch jobs to dump SMF records from their log streams every day for each system.  One GDG dataset is produced for every LPAR for one day's worth of SMF records.  We used the Python dsdbc library to access a local MDS (Mainframe Data Service) server to read the SMF dump datasets, and then extracted the required columns and saved the records to csv files.

 

Below, we provide sample code to show how we accomplished these steps:

 

#import dsdbc library and other required libraries

import dsdbc

import pandas as pd

 

#Create connection to MDS server AZKA

conn = dsdbc.connect(SSID="AZKA")

cursor = conn.cursor()

 

#Rename the columns we need

col3={'SMF74DTE':'DATE','SMF74TME':'TIME','CHILD_KEY':'KEY','R744FNAM':'CF_NAME','R744SNAM':'STRC_NAME','R744STYP':'TYPE','R744SFLG':'E_FLAG','R744SSRC':'SYNC_RATE','R744SSTM':'SYNC_AVGSEVR','R744SARC':'ASYNC_RATE','R744SATM':'ASYNC_AVGSEVR','R744SSTA':'ASYNC_CHG','R744SQTM':'ASYNC_DEL'}

 

#The core function to extract the required columns from the SMF dataset and save them to csv files

def writetocsv(dataset):

    sql1="SELECT SMF_TIME,SMF_SID,SMF74DTE,SMF74TME,CHILD_KEY,R744FNAM FROM SMF_07404__"+dataset+" PARENT JOIN SMF_07404_R744FLCF__"+dataset+" FLCF ON PARENT.CHILD_KEY=FLCF.PARENT_KEY"

    sql2="SELECT CHILD_KEY,R744SNAM,R744STYP,R744SFLG,R744SSRC,R744SSTM,R744SARC,R744SATM,R744SSTA,R744SQTM FROM SMF_07404__"+dataset+" PARENT JOIN SMF_07404_R744SREQ__"+dataset+" SREQ ON PARENT.CHILD_KEY=SREQ.PARENT_KEY"

    data1=pd.read_sql(sql1,conn)

    data2=pd.read_sql(sql2,conn)

    data3=pd.merge(data1,data2,on='CHILD_KEY')

    data3.rename(columns=col3, inplace = True)

    sorteddata=data3.sort_values(by='SMF_TIME')

    sorteddata.to_csv("/mlz/appdev/data/RMF744/RMF744_"+dataset.split('_')[2]+"_"+dataset[-8:]+".csv")

 

We set up a timer through System Automation for z/OS to automatically submit the data collection job every day.  After several months, thousands of csv files had been generated, and these became our source data.  Part of one csv file is displayed below, in which TYPE represents the structure type (1 for an unserialized list structure, 2 for a serialized list structure, 3 for a lock structure, and 4 for a cache structure), SYNC_RATE represents the SYNC request rate, and SYNC_AVGSEVR represents the average SYNC request service time.


 

2. Model training

 

We combined all the csv files into one pandas data frame and performed data pre-processing, which eliminated the entries with a null SYNC_RATE value. Then we summed the records of all LPARs to get the overall value for the sysplex at each timestamp, and divided SYNC_AVGSEVR by SYNC_RATE to calculate SYNCTIME, which represents the average SYNC service time of each SYNC request. We split each day's records into slots of 3 hours each.  In every slot, we calculated a threshold value as the baseline value, which would be used as the training label.
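
To make this step concrete, here is a minimal sketch of the pre-processing, assuming the csv files from step 1 are collected under a directory such as /mlz/appdev/data/RMF744; the grouping keys and slotting logic follow the description above, but the exact details may differ from our production notebook:

#Minimal pre-processing sketch (file location and column names follow the data collection step above)
import glob
import pandas as pd

#Combine all csv files produced by the daily data collection job into one data frame
files = glob.glob("/mlz/appdev/data/RMF744/RMF744_*.csv")
raw = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)

#Eliminate the entries with a null SYNC_RATE value
raw = raw.dropna(subset=['SYNC_RATE'])

#Sum the records of all LPARs to get the sysplex-wide value at each timestamp for each structure
plex = raw.groupby(['SMF_TIME', 'STRC_NAME'], as_index=False)[['SYNC_RATE', 'SYNC_AVGSEVR']].sum()

#SYNCTIME is the average SYNC service time of each SYNC request
plex['SYNCTIME'] = plex['SYNC_AVGSEVR'] / plex['SYNC_RATE']

#Split each day's records into 3-hour slots
plex['SMF_TIME'] = pd.to_datetime(plex['SMF_TIME'])
plex['3HSLOTTIME'] = plex['SMF_TIME'].dt.floor('3H')   #start time of the 3-hour slot
plex['3HSLOT'] = plex['SMF_TIME'].dt.hour // 3         #slot number 0-7 within the day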

The processed source data then looks like the following:

        
i. Filtering the abnormal records
As there are more than 100 structures in our sysplex, we chose one structure to build its SYNC service time baseline model.  We plotted the SYNC service time distribution chart and determined the filter value for the structure. In the example below, we use a value of 30 to eliminate the spikes that are obviously abnormal records. The remaining records should represent the normal distribution of SYNC service time for this structure.
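
Continuing from the pre-processed data frame in the sketch above, the filtering itself is just a threshold on SYNCTIME; the structure name and the cutoff of 30 below are examples read off our distribution chart, not fixed values:

#Keep the records of one example structure and drop the obvious spikes
strcdata = plex[plex['STRC_NAME'] == 'DSNDBWG_GBP25']
normaldata = strcdata[strcdata['SYNCTIME'] < 30]   #30 is the filter value chosen from the chart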

 

ii. Calculating the 95% normal distribution interval

We were configured with 30-minute intervals for our SMF Type 74 records, so each 3-hour slot contained 6 records, and we found that those 6 records approximately matched a normal distribution. We used the upper bound of the 95% normal distribution interval (roughly the mean plus 1.96 times the standard deviation) as the baseline value for the slot, which means most SYNC service times should not exceed this value.  The sample code below shows how we calculated the baseline value from the 95% normal distribution interval:

 

#Import the libraries needed for the threshold calculation

import numpy as np

from scipy import stats

#Filter the records of structure strc

filterdata=sync_train_data_dbwggbp[sync_train_data_dbwggbp['STRC_NAME']==strc]

#Calculate the mean and std of the records in each 3-hour interval

for time in threehourslist:

    filterstring=(filterdata['3HSLOTTIME'] == time)

    if filterdata[filterstring].shape[0]>0:

        mean=np.mean(filterdata[filterstring].SYNCTIME)

        std=np.std(filterdata[filterstring].SYNCTIME)

        slotvalue=filterdata[filterstring]['3HSLOT'].values[0]

        conf_interval = stats.norm.interval(0.95, loc=mean, scale=std)

#conf_interval[1] is the calculated 95% threshold of SYNCTIME

 

After calculation, the final training source data looked like the screenshot below:

       

iii. Model Training

Finally, we used the training source data to build the feature set and labels with the sample code below. As we mentioned at the beginning, because the workloads in our sysplex run periodically, the SYNC service time should be periodically distributed. For the sake of simplicity, we initially used only several TIME features as the model features. We constructed time-related features called dayofweek, dayofyear and dayofmonth.  The combined final features are ['dayofweek','dayofyear','dayofmonth','3HSLOT'].

 

#the function to build TIME features

def create_features(df,label):

    #Creates time series features from datetime index

    df['date'] = df.index

    df['dayofweek'] = df['date'].apply(lambda x:x.dayofweek)

    df['dayofyear'] = df['date'].apply(lambda x:x.dayofyear)

    df['dayofmonth'] = df['date'].apply(lambda x:x.day)

    X = df[['dayofweek','dayofyear','dayofmonth','3HSLOT']]

    y = df[label]

    return X, y   

x_train, y_train = create_features(DBWGGBP_BaseLine3H,'THRESHOLD')

 

Once the x_train and y_train data were ready, we used the xgboost and scikit-learn libraries to train the model.  As with any general machine learning training process, we first implemented a parameter optimization step that used a cross validation method; once the optimized parameters were determined, the model and pipeline could be generated.

 

#import the xgboost library and the required sklearn libraries

from sklearn.grid_search import GridSearchCV

from sklearn.pipeline import Pipeline

import xgboost as xgb

 

#Cross validation to select best parameters

……

model=xgb.XGBRegressor(**other_params)

opt=GridSearchCV(model,cv_params,scoring='r2',cv=5)

opt.fit(x_train,y_train)

……

 

#generate the model and pipeline for final save

model = xgb.XGBRegressor(learning_rate=0.05, n_estimators=100, max_depth=4, min_child_weight=5, subsample=1, colsample_bytree=0.8,reg_alpha=0.5, reg_lambda=0.03)

pipeline = Pipeline([('xgb',model)])

tentModelxgb = pipeline.fit(x_train, y_train)
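
In the snippet above, the elided cv_params and other_params are the parameter dictionaries passed to GridSearchCV and XGBRegressor; a hypothetical example of their shape (the values here are illustrative, not the ranges we actually searched) is:

#Hypothetical parameter dictionaries for the cross validation step
cv_params = {'max_depth': [3, 4, 5], 'min_child_weight': [1, 3, 5]}        #grid to search over
other_params = {'learning_rate': 0.05, 'n_estimators': 100,
                'subsample': 1, 'colsample_bytree': 0.8}                   #fixed parameters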

 

iv. Model evaluation

We used the model to generate predicted baseline SYNC service time values for the training data and compared them with the real label values. The chart below shows that they matched the distribution of the real records very well.
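
For reference, here is a minimal sketch of how such a comparison can be produced from the tentModelxgb pipeline and the x_train/y_train data above; the metric and plotting choices are illustrative, not taken from our notebook:

import matplotlib.pyplot as plt
from sklearn.metrics import r2_score

#Predict the baseline SYNC service time for the training data
y_pred = tentModelxgb.predict(x_train)

#Quantify how closely the predictions track the real label values
print("R2 score on training data:", r2_score(y_train, y_pred))

#Plot the predicted baseline values against the real labels
plt.figure(figsize=(12, 4))
plt.plot(y_train.values, label='real threshold')
plt.plot(y_pred, label='predicted threshold')
plt.legend()
plt.show()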


3. Saving and deploying the Model:

 

After we completed the training process and the model and pipeline were generated, the model needed to be saved and deployed in WML for z/OS for later scoring use.

i. Saving the model

In the Jupyter Notebook IDE, we clicked "Insert project context" on the toolbar, and the following code snippet was generated and added to the cell to create a project context containing projectName, notebookName, authToken and repositoryIp.

 

import dsx_core_utils

from dsx_core_utils import ProjectContext

# pc context contains projectName, notebookName, authToken, repositoryIp

pc = ProjectContext.ProjectContext('CF_Activity_Baseline_Model_Plex1', 'CF_Time_Series_Model_DBWGGBP_Hui', 'Bearer eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9.eyJ1c2VybmFtZSI6Imh1aXdhbmciLCJyb2xlIjoibWxhZG0sZGV2dXNlcixhcHB1c2VyLHN5c2FkbSxpbnN0YWxsYWRtIiwidWlkIjoiMTAxMiIsImlhdCI6MTU5MzY1NDg2MywiZXhwIjoxNTkzNzAxNjYzfQ.xpFrp-jAO5MWrhJWWGqd0y2aXuqSxpNb7FxHEWh1pEBRkrJybNjWYGmJ5w8adNg7zgcrQ-wX9y3Q1wF3vi7iGkoJrNQoLyed7HRpEGs2AljN-de_AUgSfyzWGJbjk7CTlhvOvNisx2wq8Ql1c7zutVNRYnBjaIUypTzmHeahYP_n6yJx6aQyrAXDaZH36rqzCSd9xOdm2LGMg7ZQ-2HWXrMahY_DX_6y3nJTvOPDY83kpGt4pxTKoeSL4oF2CmCjsbQ0IxABohSkeVZqbHHnSYbT0QUCVFkFr01KS996DAITA9jKA-WDZO3-oXxT1IkczhVVhdSMfjlfHGVzf4VHOA','mlz.xxx.ibm.com')

# The projectName is the current project name

projectName = pc.projectName

# The notebookName is the current notebook name

notebookName = pc.nbName

# The authToken is the token generated by user management, which can access the backend service run

authToken = pc.authToken

# The metaService is the backend service with https endpoint

metaService = 'https://' + pc.repositoryIp

 

Then we executed the cell in Jupyter notebook and used the following sample code to save the model with a system scope.

 

# save the model

from repository_v3.mlrepository import MetaNames

from repository_v3.mlrepository import MetaProps

from repository_v3.mlrepositoryclient import MLRepositoryClient

from repository_v3.mlrepositoryartifact import MLRepositoryArtifact

 

metaservicePath = "https://mlz.xxx.ibm.com"

client = MLRepositoryClient(metaservicePath)

client.authorize_with_token(authToken)

props1 = MetaProps(

        {MetaNames.AUTHOR_NAME:"Hui Wang",

         MetaNames.AUTHOR_EMAIL:"cdlwhui@cn.ibm.com",

         MetaNames.MODEL_META_PROJECT_ID: projectName,

         MetaNames.MODEL_META_ORIGIN_TYPE: "notebook",

         MetaNames.SCOPE: "system",

         MetaNames.MODEL_META_ORIGIN_ID: notebookName})

input_artifact = MLRepositoryArtifact(pipeline, name="Plex1_CF_DBWG_GBP25_3H_V3", 

      meta_props=props1, training_data=x_train,training_target=y_train)

client.models.save(artifact=input_artifact)

print("model saved successfully")


ii. Deploying the model

After the model was saved, it could be seen on the Models panel of WMLz’s Model Management Dashboard.  You can click the model name to open the model information page and view the detailed model information.

 

We then needed to deploy it by clicking ACTIONS - Deploy. The deployed model can be seen on the Deployments panel:


Also, you can click the deployment name to see the Scoring Endpoint URL. This is the final model scoring RESTful API address.

 

4. Scoring real time CF activity against the model

 

After the model was deployed, we built an application that scores SMF 74 records in real time against it. For more information on how to fetch SMF records in real time, refer to our previous blog here. Once the application receives one entry of real time SMF records, the records are sent to the WML for z/OS scoring service endpoint URL through an HTTP POST request, and the response is the scoring result. The scoring process can be built using a Python script like the following sample code:

 

#import the libraries required to call the REST APIs

import requests

import json

#The authorization url used to get the token is your WMLz web UI address + "/auth/generateToken"

authurl='https://mlz.xxx.ibm.com/auth/generateToken'

authdata={

        "username": "user",

        "password": "password"

        }

authheaders = {'Content-Type': 'application/json'} 

#Get the token used for authorization

def getauthtoken():

    authresponse = requests.post(url=authurl, headers=authheaders, data=json.dumps(authdata), verify=False)

    authtoken=authresponse.json()['token']

    return authtoken

authtoken=getauthtoken()

 

#Scoring url.

scoringurl='https://mlzxx.xxx.ibm.com:14731/iml/v2/scoring/online/3d45b43d-8663-406f-bad5-2503faf234b9'

scoringheaders = {'Content-Type': 'application/json','authorization':authtoken}

 

#Core function to send the real time SMF data to the scoring service and get the response, which is the predicted baseline value

def getscore(interval,mode,strcname):

    #create_features (not shown here) builds the TIME features of the current interval in the format expected by the scoring request

    feature=create_features(interval)

    scoringresponse = requests.post(url=scoringurl, headers=scoringheaders, data=json.dumps(feature), verify=False)

    return scoringresponse.json()[0]['prediction']

 

We submit the Python script through a BPXBATCH job, and its output looks like the messages shown below. The job calculates each structure’s SYNC request service time for the current interval from the SMF records and sends the generated time features for that interval to the scoring server. It then prints the response from the scoring server, which is the baseline value. If the SYNC service time is less than the scored baseline value, the job reports “normal”; otherwise, it sends a warning message to remind users to check whether there is any abnormality related to that CF structure.
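
Sketched below is the general shape of that comparison logic, using the getscore function above; the helper function name, its parameters, and the warning text are illustrative rather than copied from our script. The actual job output follows.

#Illustrative comparison of the observed SYNC service time against the scored baseline
def check_structure(interval, mode, strcname, sync_rate, sync_avgsevr, timestamp):
    #Average SYNC service time of the current interval, calculated from the SMF 74 fields
    synctime = sync_avgsevr / sync_rate

    #Ask the deployed model for the baseline value of this time slot
    baseline = getscore(interval, mode, strcname)
    print("The base line value for %s is: %s" % (strcname, baseline))

    #Report normal when below the baseline, otherwise warn the user
    if synctime < baseline:
        print("Structure %s sync service time: %f at %s is normal" % (strcname, synctime, timestamp))
    else:
        print("WARNING: Structure %s sync service time: %f at %s exceeds the base line, please check the CF structure" % (strcname, synctime, timestamp))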

 

DSNDBWG_GBP8K1: SYNC_RATE=216141.0 SYNC_AVGSEVR=4518652.0                                               

The base line value for DSNDBWG_GBP8K1 is: 25.079919815063477                                     

SYNTIME:20.906038188                                                            

Structure DSNDBWG_GBP8K1 sync service time: 20.906038 at 2020-09-03 03:30:06.420000 is normal                      

DSNDBWG_GBP25: SYNC_RATE=54793.0 SYNC_AVGSEVR=1653326.0                                                 

The base line value for DSNDBWG_GBP25 is: 24.426820755004883                                      

SYNTIME:30.1740368295                                                         

Structure DSNDBWG_GBP25 sync service time: 30.174037 at 2020-09-03 03:30:06.420000 is normal                       

DSNDBWG_LOCK1: SYNC_RATE=16710736.0 SYNC_AVGSEVR=63032274.0                                             

The base line value for DSNDBWG_LOCK1 is: 9.984295845031738                                       

SYNTIME:3.77196276693                                                                     

Structure DSNDBWG_LOCK1 sync service time: 3.771963 at 2020-09-03 03:30:06.430000 is normal             

DSNDBWG_SCA: SYNC_RATE=234175.0 SYNC_AVGSEVR=2695210.0                                            

The base line value for DSNDBWG_SCA is: 14.984884262084961                                 

SYNTIME:11.5093840077                                                    

Structure DSNDBWG_SCA sync service time: 11.509384 at 2020-09-03 03:30:06.430000 is normal       

DSNDBTG_LOCK1: SYNC_RATE=7498.0 SYNC_AVGSEVR=38490.0                                              

The base line value for DSNDBTG_LOCK1 is: 5.590181827545166                                

SYNTIME:5.13336889837                                                               

Structure DSNDBTG_LOCK1 sync service time: 5.133369 at 2020-09-03 03:30:06.420000 is normal      

DSNDBTG_SCA: SYNC_RATE=1146.0 SYNC_AVGSEVR=45997.0                                               

The base line value for DSNDBTG_SCA is: 43.525272369384766                                 

SYNTIME:40.1369982548                  

Structure DSNDBTG_SCA sync service time: 40.136998 at 2020-09-03 03:30:06.420000 is normal                    

 

The use case is still limited by insufficient data and insufficient features. Currently we have been able to collect only several months of data, and since CF activity is not related only to TIME, we intend to continue to collect more data, build a model for all CF structures, and improve and optimize the models for some specific structures with other features such as workload status indicators.

 

 

Authors:

Hui Wang(cdlwhui@cn.ibm.com)

Zhao Yu Wang(wangzyu@cn.ibm.com)

Jing Wen Chen(bjchenjw@cn.ibm.com)

Yu Mei Dai(dyubj@cn.ibm.com)