In a previous blog, we gave an overview of Watson Machine Learning for z/OS (WML for z/OS or WMLz) and mentioned that we had implemented some SMF real-time scoring use cases. In this blog, we'll share our experiences with SMF real-time scoring through WMLz.
SMF records are an important data source on z/OS. They provide system- and job-related information that can be used for analyzing resource usage, billing users, and maintaining system security. Besides the risk introduced by moving sensitive data like SMF outside the scope of your z/OS security administrator's control, offloading SMF data from your z/OS systems for machine learning analysis can introduce long delays due to the data movement, which hurts the timeliness of the results. Leveraging the data gravity advantage of WMLz to deploy machine learning modeling and scoring of real-time SMF records right on z/OS can deliver more timely analytic results for businesses that depend on SMF data.
SMF data is a natural choice for our environment because of the volume of records we produce through the operational approach we take for our “customer-like” testing activities. In our Parallel Sysplex, a tremendous number of SMF records are generated every second. In this blog, we showcase a WMLz use case built around a CF structure activity model, which is used to score CF performance metrics and raise warnings when we deviate from the modeled baseline. The source data is CF activity data from SMF type 74 subtype 4 records. From the machine learning perspective, this use case employs supervised learning, specifically regression. The core idea is to collect historical data, filter out the obviously abnormal records, treat the remaining data as normal, and verify that it approximately follows a normal distribution. We then use the upper bound of the 95% interval of that normal distribution as the baseline threshold model. In reviewing the historical SMF data we had collected along the way, we found that the CF activity records showed CF synchronous request service times were periodically distributed. For the sake of simplicity, we initially decided to use only some TIME features as our model features.
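The 95% interval idea above can be sketched in a few lines (a minimal illustration with made-up service times, not our actual SMF data):

```python
import numpy as np
from scipy import stats

# Hypothetical SYNC service times (microseconds) for one 3-hour slot
times = np.array([12.1, 13.4, 11.8, 12.9, 13.0, 12.5])

mean, std = np.mean(times), np.std(times)
# The upper bound of the 95% interval of the fitted normal distribution
# serves as the baseline threshold: most "normal" service times should
# fall below it.
lower, upper = stats.norm.interval(0.95, loc=mean, scale=std)
print(round(upper, 2))
```

The threshold is simply the mean plus roughly 1.96 standard deviations, which is why obviously abnormal records must be filtered out first so they do not inflate the baseline.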
We employed the WMLz IDE's Jupyter Notebook interface to develop our own machine learning process through the following steps: 1. Data collection, 2. Model training, 3. Saving and deploying the model, 4. Scoring new CF activity against the model.
1. Data collection
In our sysplex, SMF type 74 records are written to a DASD-only log stream, and we use batch jobs to dump the SMF records from their log streams every day for each system. One GDG dataset is produced for each LPAR for one day's worth of SMF records. We used the Python dsdbc library to access a local MDS (Mainframe Data Service) server to read the SMF dump datasets, then extracted the required columns and saved the records to CSV files.
Below, we provide sample code to show how we accomplished these steps:
# Import the dsdbc library and other required libraries
import dsdbc
import pandas as pd

# Create a connection to MDS server AZKA
conn = dsdbc.connect(SSID="AZKA")
cursor = conn.cursor()

# Map the SMF field names to friendlier column names
col3={'SMF74DTE':'DATE','SMF74TME':'TIME','CHILD_KEY':'KEY','R744FNAM':'CF_NAME','R744SNAM':'STRC_NAME','R744STYP':'TYPE','R744SFLG':'E_FLAG','R744SSRC':'SYNC_RATE','R744SSTM':'SYNC_AVGSEVR','R744SARC':'ASYNC_RATE','R744SATM':'ASYNC_AVGSEVR','R744SSTA':'ASYNC_CHG','R744SQTM':'ASYNC_DEL'}

# Core function: extract the required columns from the SMF dataset and save them to CSV files
def writetocsv(dataset):
    sql1="SELECT SMF_TIME,SMF_SID,SMF74DTE,SMF74TME,CHILD_KEY,R744FNAM FROM SMF_07404__"+dataset+" PARENT JOIN SMF_07404_R744FLCF__"+dataset+" FLCF ON PARENT.CHILD_KEY=FLCF.PARENT_KEY"
    sql2="SELECT CHILD_KEY,R744SNAM,R744STYP,R744SFLG,R744SSRC,R744SSTM,R744SARC,R744SATM,R744SSTA,R744SQTM FROM SMF_07404__"+dataset+" PARENT JOIN SMF_07404_R744SREQ__"+dataset+" SREQ ON PARENT.CHILD_KEY=SREQ.PARENT_KEY"
    data1=pd.read_sql(sql1,conn)
    data2=pd.read_sql(sql2,conn)
    data3=pd.merge(data1,data2,on='CHILD_KEY')
    data3.rename(columns=col3, inplace=True)
    sorteddata=data3.sort_values(by='SMF_TIME')
    sorteddata.to_csv("/mlz/appdev/data/RMF744/RMF744_"+dataset.split('_')[2]+"_"+dataset[-8:]+".csv")
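Under our naming assumptions, the daily collection loop can be sketched as follows. The dataset names here are hypothetical; only the token positions matter to writetocsv's filename logic (the third '_'-separated token is the system name and the last 8 characters are the date):

```python
# Hypothetical dump dataset names, matching the split logic in writetocsv
datasets = ["SMF_DUMP_SYSA_20200903", "SMF_DUMP_SYSB_20200903"]

for ds in datasets:
    # Derive the system name and date exactly as writetocsv does
    system, date = ds.split('_')[2], ds[-8:]
    print("/mlz/appdev/data/RMF744/RMF744_%s_%s.csv" % (system, date))
    # writetocsv(ds)  # uncomment when connected to the MDS server
```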
We set up a timer through System Automation for z/OS to automatically submit the data collection job every day. After several months, thousands of CSV files had been generated, and these became our source data. Part of one CSV file is displayed below, in which TYPE represents the structure type (1 for an unserialized list structure, 2 for a serialized list structure, 3 for a lock structure, and 4 for a cache structure), SYNC_RATE represents the SYNC request rate, and SYNC_AVGSEVR represents the average SYNC request service time.
2. Model training
We combined all the CSV files into one pandas data frame and pre-processed the data, eliminating entries with a null SYNC_RATE value. We then summed the records of all LPARs to get the overall value for the sysplex at each timestamp, and divided the SYNC_AVGSEVR value by the SYNC_RATE value to calculate SYNCTIME, the average SYNC service time of each SYNC request. We split every day's records into slots of 3 hours each. For every slot, we calculated a threshold value as the baseline value, which would be used as the training label.
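The pre-processing steps above can be sketched with pandas. The column names follow our CSV layout, but the sample rows are made up for illustration:

```python
import pandas as pd

# Hypothetical raw rows: two LPARs at the first timestamp, one valid and
# one null-SYNC_RATE row at the second
raw = pd.DataFrame({
    'SMF_TIME':  ['2020-09-03 00:15:00'] * 2 + ['2020-09-03 03:45:00'] * 2,
    'STRC_NAME': ['DSNDBWG_GBP25'] * 4,
    'SYNC_RATE':    [100.0, 50.0, 200.0, None],
    'SYNC_AVGSEVR': [1200.0, 800.0, 4800.0, 300.0],
})
raw['SMF_TIME'] = pd.to_datetime(raw['SMF_TIME'])

# Drop entries with a null SYNC_RATE, then sum across LPARs per timestamp
clean = raw.dropna(subset=['SYNC_RATE'])
plex = clean.groupby(['SMF_TIME', 'STRC_NAME'], as_index=False)[
    ['SYNC_RATE', 'SYNC_AVGSEVR']].sum()

# Average SYNC service time per request, and the 3-hour slot index (0-7)
plex['SYNCTIME'] = plex['SYNC_AVGSEVR'] / plex['SYNC_RATE']
plex['3HSLOT'] = plex['SMF_TIME'].dt.hour // 3
print(plex[['SMF_TIME', 'SYNCTIME', '3HSLOT']])
```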
The processed source data then looks like the following:
i. Filtering the abnormal records
As there are more than 100 structures in our sysplex, we chose one structure to build its SYNC service time baseline model. We plotted the SYNC service time distribution chart and determined the filter value for the structure. In the example below, we use a value of 30 to eliminate the spiky outliers, which are obviously abnormal records. The remaining records should represent the normal distribution of SYNC service time for this structure.
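The filtering step is a simple cutoff on SYNCTIME. A minimal sketch with made-up values (the cutoff of 30 matches our example; the data here is hypothetical):

```python
import pandas as pd

# Hypothetical SYNCTIME values for one structure; 95.0 is an obvious spike
df = pd.DataFrame({'SYNCTIME': [12.0, 14.5, 95.0, 13.2, 11.9, 30.0]})

FILTER_VALUE = 30  # cutoff chosen from the distribution chart
normal = df[df['SYNCTIME'] < FILTER_VALUE]
print(len(normal))  # records at or above the cutoff are dropped
```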
ii. Calculating the 95% normal distribution interval
We were configured with 30-minute intervals for our SMF type 74 records, so each 3-hour slot contained 6 records, and we found those 6 records approximately followed a normal distribution. We used the upper bound of the 95% distribution interval as the baseline value for the slot, meaning most SYNC service times should not exceed this value. The sample code below shows how we calculated the baseline value from the 95% normal distribution interval:
import numpy as np
from scipy import stats

# Filter the records of structure strc
filterdata=sync_train_data_dbwggbp[sync_train_data_dbwggbp['STRC_NAME']==strc]

# Calculate the mean and std of the records in each 3-hour interval
for time in threehourslist:
    filterstring=(filterdata['3HSLOTTIME'] == time)
    if filterdata[filterstring].shape[0]>0:
        mean=np.mean(filterdata[filterstring].SYNCTIME)
        std=np.std(filterdata[filterstring].SYNCTIME)
        slotvalue=filterdata[filterstring]['3HSLOT'].values[0]
        conf_interval = stats.norm.interval(0.95, loc=mean, scale=std)
        # conf_interval[1] is the calculated 95% threshold of SYNCTIME
After calculation, the final training source data looked like the screenshot below:
iii. Model Training
Finally, we used the training source data to build, train, and label the dataset with the sample code below. As we mentioned at the beginning, since the workloads in our sysplex run periodically, the SYNC service time should be periodically distributed. For simplicity, we initially used only several TIME features as the model features. We constructed time-related features called dayofweek, dayofyear and dayofmonth. The combined final features are ['dayofweek','dayofyear','dayofmonth','3HSLOT'].
# Function to build the TIME features
def create_features(df,label):
    # Creates time series features from the datetime index
    df['date'] = df.index
    df['dayofweek'] = df['date'].apply(lambda x:x.dayofweek)
    df['dayofyear'] = df['date'].apply(lambda x:x.dayofyear)
    df['dayofmonth'] = df['date'].apply(lambda x:x.day)
    X = df[['dayofweek','dayofyear','dayofmonth','3HSLOT']]
    y = df[label]
    return X, y

x_train, y_train = create_features(DBWGGBP_BaseLine3H,'THRESHOLD')
After the x_train and y_train data were ready, we used the xgboost and scikit-learn libraries to train the model. As with any general machine learning model training process, we first ran a parameter optimization step using cross validation; once the optimized parameters were selected, the model and pipeline could be generated.
# Import the xgboost library and the required scikit-learn modules
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
import xgboost as xgb

# Cross validation to select the best parameters
……
model=xgb.XGBRegressor(**other_params)
opt=GridSearchCV(model,cv_params,scoring='r2',cv=5)
opt.fit(x_train,y_train)
……

# Generate the model and pipeline for the final save
model = xgb.XGBRegressor(learning_rate=0.05, n_estimators=100, max_depth=4, min_child_weight=5, subsample=1, colsample_bytree=0.8, reg_alpha=0.5, reg_lambda=0.03)
pipeline = Pipeline([('xgb',model)])
tentModelxgb = pipeline.fit(x_train, y_train)
iv. Model evaluation
We used the model to generate predicted baseline SYNC service time values for the training data and compared them with the real label values. The chart below shows that they matched the distribution of the real records very well.
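This kind of evaluation can be sketched as follows. The data is made up, and scikit-learn's GradientBoostingRegressor stands in for our XGBRegressor pipeline so the example is self-contained:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score

# Hypothetical integer time features (stand-ins for dayofweek etc.) and a
# label with a simple periodic dependence plus noise
rng = np.random.RandomState(0)
X = rng.randint(0, 7, size=(200, 4)).astype(float)
y = 10.0 + X[:, 0] + 0.5 * X[:, 3] + rng.normal(0, 0.1, 200)

# Fit, then compare predictions with the real label values
model = GradientBoostingRegressor(random_state=0).fit(X, y)
pred = model.predict(X)
print("training R^2: %.3f" % r2_score(y, pred))
```

Plotting pred against y over time gives the kind of distribution comparison shown in our chart.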
3. Saving and deploying the model
After we completed the training process and the model and pipeline were generated, the model needed to be saved and deployed in WML for z/OS for future scoring use.
i. Saving the model
In the Jupyter Notebook IDE, we clicked "Insert project context" on the header bar, and the following code snippet was generated and added to the cell to create a project context containing projectName, notebookName, authToken, and repositoryIp.
import dsx_core_utils
from dsx_core_utils import ProjectContext
# pc context contains projectName, notebookName, authToken, repositoryIp
pc = ProjectContext.ProjectContext('CF_Activity_Baseline_Model_Plex1', 'CF_Time_Series_Model_DBWGGBP_Hui', 'Bearer eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9.eyJ1c2VybmFtZSI6Imh1aXdhbmciLCJyb2xlIjoibWxhZG0sZGV2dXNlcixhcHB1c2VyLHN5c2FkbSxpbnN0YWxsYWRtIiwidWlkIjoiMTAxMiIsImlhdCI6MTU5MzY1NDg2MywiZXhwIjoxNTkzNzAxNjYzfQ.xpFrp-jAO5MWrhJWWGqd0y2aXuqSxpNb7FxHEWh1pEBRkrJybNjWYGmJ5w8adNg7zgcrQ-wX9y3Q1wF3vi7iGkoJrNQoLyed7HRpEGs2AljN-de_AUgSfyzWGJbjk7CTlhvOvNisx2wq8Ql1c7zutVNRYnBjaIUypTzmHeahYP_n6yJx6aQyrAXDaZH36rqzCSd9xOdm2LGMg7ZQ-2HWXrMahY_DX_6y3nJTvOPDY83kpGt4pxTKoeSL4oF2CmCjsbQ0IxABohSkeVZqbHHnSYbT0QUCVFkFr01KS996DAITA9jKA-WDZO3-oXxT1IkczhVVhdSMfjlfHGVzf4VHOA','mlz.xxx.ibm.com')
# The projectName is the current project name
projectName = pc.projectName
# The notebookName is the current notebook name
notebookName = pc.nbName
# The authToken is the token generated by user management, which can access the backend service run
authToken = pc.authToken
# The metaService is the backend service with https endpoint
metaService = 'https://' + pc.repositoryIp
Then we executed the cell in Jupyter notebook and used the following sample code to save the model with a system scope.
# save the model
from repository_v3.mlrepository import MetaNames
from repository_v3.mlrepository import MetaProps
from repository_v3.mlrepositoryclient import MLRepositoryClient
from repository_v3.mlrepositoryartifact import MLRepositoryArtifact
metaservicePath = "https://mlz.xxx.ibm.com"
client = MLRepositoryClient(metaservicePath)
client.authorize_with_token(authToken)
props1 = MetaProps(
    {MetaNames.AUTHOR_NAME: "Hui Wang",
     MetaNames.AUTHOR_EMAIL: "cdlwhui@cn.ibm.com",
     MetaNames.MODEL_META_PROJECT_ID: projectName,
     MetaNames.MODEL_META_ORIGIN_TYPE: "notebook",
     MetaNames.SCOPE: "system",
     MetaNames.MODEL_META_ORIGIN_ID: notebookName})
input_artifact = MLRepositoryArtifact(pipeline, name="Plex1_CF_DBWG_GBP25_3H_V3",
                                      meta_props=props1, training_data=x_train, training_target=y_train)
client.models.save(artifact=input_artifact)
print("model saved successfully")
ii. Deploying the model
After the model was saved, it could be seen on the Models panel of WMLz's Model Management Dashboard. You can click the model name to open the model information page and view the model's details.
We then needed to deploy it by clicking ACTIONS - Deploy. The deployed model can be seen on the Deployments panel:
Also, you can click the deployment name to see the Scoring Endpoint URL. This is the RESTful API address used for model scoring.
4. Scoring real time CF activity against the model
After the model was deployed, we built an application to score SMF 74 records against it in real time. For more information on how to fetch SMF records in real time, refer to our previous blog here. Each time the application receives an entry of real-time SMF records, it sends the records to the WML for z/OS scoring service endpoint URL through an HTTP POST, and the response is the scoring result. The scoring process can be built with a Python script like the following sample code:
import requests
import json

# The authorization URL used to get the token is your WMLz web UI address + "/auth/generateToken"
authurl='https://mlz.xxx.ibm.com/auth/generateToken'
authdata={
    "username": "user",
    "password": "password"
}
authheaders = {'Content-Type': 'application/json'}

# Get the token used for authorization
def getauthtoken():
    authresponse = requests.post(url=authurl, headers=authheaders, data=json.dumps(authdata), verify=False)
    authtoken=authresponse.json()['token']
    return authtoken

authtoken=getauthtoken()

# Scoring URL
scoringurl='https://mlzxx.xxx.ibm.com:14731/iml/v2/scoring/online/3d45b43d-8663-406f-bad5-2503faf234b9'
scoringheaders = {'Content-Type': 'application/json','authorization':authtoken}

# Core function: send the real-time SMF data to the scoring service and get
# the response, which is the predicted baseline value
def getscore(interval,mode,strcname):
    feature=create_features(interval)
    scoringresponse = requests.post(url=scoringurl, headers=scoringheaders, data=json.dumps(feature), verify=False)
    return scoringresponse.json()[0]['prediction']
We submit the Python script through a BPXBATCH job, and its output looks like the messages below. The job calculates the structure's SYNC request service time for the current interval from the SMF records and sends the generated time features of that time to the scoring server. It then prints the response from the scoring server, which is the baseline value. If the SYNC service time is less than the scored baseline value, the job reports "normal"; otherwise, it sends a warning message to remind users to check whether there is any abnormality related to that CF structure.
DSNDBWG_GBP8K1: SYNC_RATE=216141.0 SYNC_AVGSEVR=4518652.0
The base line value for DSNDBWG_GBP8K1 is: 25.079919815063477
SYNTIME:20.906038188
Structure DSNDBWG_GBP8K1 sync service time: 20.906038 at 2020-09-03 03:30:06.420000 is normal
DSNDBWG_GBP25: SYNC_RATE=54793.0 SYNC_AVGSEVR=1653326.0
The base line value for DSNDBWG_GBP25 is: 24.426820755004883
SYNTIME:30.1740368295
Structure DSNDBWG_GBP25 sync service time: 30.174037 at 2020-09-03 03:30:06.420000 is normal
DSNDBWG_LOCK1: SYNC_RATE=16710736.0 SYNC_AVGSEVR=63032274.0
The base line value for DSNDBWG_LOCK1 is: 9.984295845031738
SYNTIME:3.77196276693
Structure DSNDBWG_LOCK1 sync service time: 3.771963 at 2020-09-03 03:30:06.430000 is normal
DSNDBWG_SCA: SYNC_RATE=234175.0 SYNC_AVGSEVR=2695210.0
The base line value for DSNDBWG_SCA is: 14.984884262084961
SYNTIME:11.5093840077
Structure DSNDBWG_SCA sync service time: 11.509384 at 2020-09-03 03:30:06.430000 is normal
DSNDBTG_LOCK1: SYNC_RATE=7498.0 SYNC_AVGSEVR=38490.0
The base line value for DSNDBTG_LOCK1 is: 5.590181827545166
SYNTIME:5.13336889837
Structure DSNDBTG_LOCK1 sync service time: 5.133369 at 2020-09-03 03:30:06.420000 is normal
DSNDBTG_SCA: SYNC_RATE=1146.0 SYNC_AVGSEVR=45997.0
The base line value for DSNDBTG_SCA is: 43.525272369384766
SYNTIME:40.1369982548
Structure DSNDBTG_SCA sync service time: 40.136998 at 2020-09-03 03:30:06.420000 is normal
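The normal/warning decision described above can be sketched as a simple comparison. The message text here is illustrative, and the SYNCTIME and baseline values would come from the SMF records and the scoring service response, respectively:

```python
# Compare the measured SYNC service time against the scored baseline
def check_structure(name, synctime, baseline):
    if synctime < baseline:
        return "Structure %s sync service time: %f is normal" % (name, synctime)
    return "WARNING: structure %s sync service time %f exceeds the baseline %f" % (
        name, synctime, baseline)

# Hypothetical values in the style of the job output above
print(check_structure("DSNDBWG_GBP8K1", 20.906038, 25.079920))
print(check_structure("DSNDBWG_GBP25", 30.174037, 24.426821))
```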
The use case still suffers from insufficient data and insufficient features. Currently we have been able to collect only several months of data, and since CF activity is not related to TIME alone, we intend to continue collecting more data, build models for all CF structures, and improve and optimize the models for some specific structures with additional features, such as workload status indicators.
Authors:
Hui Wang (cdlwhui@cn.ibm.com)
Zhao Yu Wang (wangzyu@cn.ibm.com)
Jing Wen Chen (bjchenjw@cn.ibm.com)
Yu Mei Dai (dyubj@cn.ibm.com)