SPSS Modeler Extension Nodes – Embedding R and Python Code in Modeler

By Sidney Phoon posted Fri June 01, 2018 06:59 PM

  
SPSS Modeler 18.1 introduced five new nodes that allow you to embed Python and R code in a Modeler stream. With these nodes you can extend SPSS Modeler with open source, performing tasks you can't easily accomplish with out-of-the-box Modeler nodes. The five nodes are:
  • Extension Import Node – run R or Python scripts to import data
  • Extension Transform Node – take data from an upstream node and apply transformations to it using R or Python scripting
  • Extension Model Node – run R or Python scripts to build and score models
  • Extension Output Node – run R or Python scripts to display text and graphical outputs on screen or write them to a file
  • Extension Export Node – run R or Python scripts to export data

IBM SPSS Modeler Extension Nodes – Python


Python Execution Environment

  • Data will be presented in the form of a Spark DataFrame

  • The IBM SPSS Modeler installation includes a Spark distribution (for example, IBM SPSS Modeler 18.1.1 includes Spark 2.1.0)

  • IBM SPSS Modeler includes a Python distribution

  • If you plan to execute Python/Spark scripts against IBM SPSS Analytic Server, you must have a connection to Analytic Server, and Analytic Server must have access to a compatible installation of Apache Spark





Programming Framework - Python
In general, to pass data from an extension node to a node downstream, you must define the schema (data model) of the output and set the data in a Spark DataFrame.

The Python code doesn't run only when the stream executes; it runs any time the node's output data model is requested, which includes when the Apply or OK button is clicked in the node dialog, when a tab is switched, and when the node is connected. Hence, before the node executes, while the data is not yet available to pass downstream, the output data model must already be defined. See the sample code below.

The second half of the sample code builds the output DataFrame on execution of the node. The calls on the asContext object are the SPSS Modeler specific parts of the script.

Python Programming Framework
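
A minimal sketch of this framework, assuming a simple pass-through of the input data, looks like this:

import spss.pyspark.runtime
from pyspark.sql.types import *

asContext = spss.pyspark.runtime.getContext()
inputSchema = asContext.getSparkInputSchema()

# 1. Define the output data model before any data is available;
#    here the input schema is passed through unchanged
outputSchema = inputSchema
asContext.setSparkOutputSchema(outputSchema)

# 2. Build the output DataFrame only when the node actually executes
if not asContext.isComputeDataModelOnly():
    inputData = asContext.getSparkInputData()
    outputData = inputData   # apply transformations here
    asContext.setSparkOutputData(outputData)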

Extension Transform Node – Python
With the Extension Transform node, you can take data from a node upstream and apply transformations to the data.

In this example, Extension_Transform_Example_Python.str, we will apply the regexp_extract() function to extract the error code from a string.

import spss.pyspark.runtime
from pyspark.sql.types import *
from pyspark.sql.functions import regexp_extract

pattern = r'Program Ref\. \/[A-Z0-9_]+\/([A-Za-z0-9_]+)\/'
asContext = spss.pyspark.runtime.getContext()
inputSchema = asContext.getSparkInputSchema()

# Get the existing schema and add a new field to create the output schema
outputSchema = StructType(inputSchema.fields + [StructField('Code', StringType(), True)])
asContext.setSparkOutputSchema(outputSchema)

if not asContext.isComputeDataModelOnly():
    # Get the existing dataframe and add a new column to it
    inputData = asContext.getSparkInputData()
    outputData = inputData.withColumn('Code', regexp_extract("Event Description", pattern, 1))
    asContext.setSparkOutputData(outputData)






Extension Import Node – Python
With the Extension Import node, you can execute Python scripts to import data.

In this example, Extension_Import_Example_Python.str, we will import a CSV file from a GitHub repository. Since this is a source node, you must construct the output schema from scratch. The CSV file has two fields.

import wget
import spss.pyspark.runtime
from pyspark.sql.types import *

asContext = spss.pyspark.runtime.getContext()
sqlContext = asContext.getSparkSQLContext()

# Since this is a source node, construct the output schema from scratch
fieldList = [
    StructField('ID', IntegerType(), True),
    StructField('CHURN', StringType(), True)
]
outputSchema = StructType(fieldList)
asContext.setSparkOutputSchema(outputSchema)

if not asContext.isComputeDataModelOnly():
    # download the file from GitHub
    url_churn = 'https://raw.githubusercontent.com/SidneyPhoon/Data/master/churn.csv'
    wget.download(url_churn, "/Data")
    # read the file
    churn_df = sqlContext.read.format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')\
        .option('header', 'true').option("inferSchema", "true").load("/Data/churn.csv")
    # return the output DataFrame as the result
    asContext.setSparkOutputData(churn_df)






Extension Model Node – Python
Build and score models with the Extension Model node.

In this example, Extension_Model_Example_Python.str, we will build a Spark ML RandomForestClassifier model to predict Mortgage Default.

In the extension node, review the Python model building syntax; the code builds the model and then saves it:

modelpath = ascontext.createTemporaryFolder()
model.save(modelpath)
ascontext.setModelContentFromPath("model",modelpath)
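
The pipeline construction that produces the model object above is not reproduced here; the following is a hedged sketch of how such a Spark ML pipeline might be built (the field names MortgageDefault, Income and AppliedAmount are illustrative, not the actual fields in the stream):

from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import StringIndexer, VectorAssembler

# index the string target into a numeric label column (field names are illustrative)
indexer = StringIndexer(inputCol="MortgageDefault", outputCol="label")
# assemble the predictor fields into a single feature vector
assembler = VectorAssembler(inputCols=["Income", "AppliedAmount"], outputCol="features")
rf = RandomForestClassifier(labelCol="label", featuresCol="features")
pipeline = Pipeline(stages=[indexer, assembler, rf])
model = pipeline.fit(indf)   # indf = ascontext.getSparkInputData()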


 
Review the Python model scoring syntax; the code builds the output schema to include the new fields to be passed downstream, loads the model, and scores the data in the input DataFrame:

import spss.pyspark.runtime
from pyspark.ml import PipelineModel

ascontext = spss.pyspark.runtime.getContext()
indf = ascontext.getSparkInputData()
model_path = ascontext.getModelContentToPath("model")

# load the saved pipeline model and compute the scores
model = PipelineModel.load(model_path)
r1 = model.transform(indf)
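
The excerpt stops at scoring; a minimal sketch of the remaining Modeler-specific steps, assuming the downstream nodes only need Spark ML's default prediction column (the 'prediction' field name is an assumption), might be:

from pyspark.sql.types import StructType, StructField, DoubleType

# declare the new field in the output data model
# (in the full script this is done before the isComputeDataModelOnly() check)
inputSchema = ascontext.getSparkInputSchema()
outputSchema = StructType(inputSchema.fields + [StructField('prediction', DoubleType(), True)])
ascontext.setSparkOutputSchema(outputSchema)

# keep the original columns plus the prediction and pass them downstream
outputData = r1.select([f.name for f in inputSchema.fields] + ['prediction'])
ascontext.setSparkOutputData(outputData)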




 
Extension Output Node – Python
With the Extension Output node, you can display text and graphical outputs on screen or output to file.
In this example, Extension_Model_Example_Python.str, the Python syntax converts the Spark DataFrame into a pandas DataFrame, applies describe() to the pandas DataFrame, and prints the results. Since this is a terminal node, there is no need to set an output schema or output DataFrame.

import spss.pyspark.runtime

ascontext = spss.pyspark.runtime.getContext()
indf = ascontext.getSparkInputData()
# display summary statistics
print(indf.toPandas().describe())



 
IBM SPSS Modeler Extension Nodes – R

R Execution Environment

  • You must have installed SPSS Modeler Essentials for R and the supported version of R before running R code in the extension nodes (see ModelerRInstall.pdf)


  • Each R extension node runs in its own independent global R environment. Therefore, if two separate R extension nodes use functions from the same R library, the library must be loaded in both R scripts



Programming Framework - R
There are three important reserved R objects:

  • modelerDataModel – defines the data model (schema) of the R data frame

  • modelerData – the R data frame to be passed downstream to the next node

  • modelerModel – the R model built in the Extension Model node



In general, to pass data from an extension node to a node downstream, you must assign the schema of the data to the reserved word modelerDataModel, and assign the data, as an R data frame, to the reserved word modelerData. The assignments to these reserved words are the SPSS Modeler specific parts of the scripts.
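
A minimal sketch of this pattern, assuming a hypothetical numeric field named amount in the incoming data, from which we derive a new field:

# hypothetical derived field (assumes the incoming data has a numeric field named amount)
newfield <- modelerData$amount * 2
# append the new column to the data frame passed downstream
modelerData <- cbind(modelerData, newfield)
# describe the new field with the six metadata attributes Modeler expects
var1 <- c(fieldName="newfield", fieldLabel="", fieldStorage="real",
          fieldMeasure="", fieldFormat="", fieldRole="")
# extend the existing data model with the new field's metadata
modelerDataModel <- data.frame(modelerDataModel, var1)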


Extension Import Node – R
With the Extension Import node, you can execute R scripts to import data. In this example, Extension_Import_Example_R.str, we will import a CSV file from a GitHub repository.

URL <- "https://raw.githubusercontent.com/SidneyPhoon/Data/master/churn.csv"
destfile <- "/Data/churn.csv"
download.file(URL, destfile)
churn_df <- read.csv(file="/Data/churn.csv", header=TRUE, sep=",")
# set the imported data frame as the data to pass downstream
modelerData <- churn_df
# define the metadata of the two fields
ID <- c(fieldName="ID", fieldLabel="ID", fieldStorage="integer", fieldMeasure="", fieldFormat="", fieldRole="")
CHURN <- c(fieldName="CHURN", fieldLabel="CHURN", fieldStorage="string", fieldMeasure="", fieldFormat="", fieldRole="")
# since this is a source node, construct the data model from scratch
modelerDataModel <- data.frame(ID, CHURN)


 
Extension Transform Node – R
With the Extension Transform node, you can take data from a node upstream and apply transformations to the data. In this example, Extension_Transform_Example_R.str, we will retrieve weather data from the Weather Underground website and pass it downstream.

Review the R syntax in the extension node: it builds the output data frame and assigns it to the reserved word modelerData, and it creates the new fields and the corresponding modelerDataModel. A sketch of the pattern follows.
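
The stream's R code is not reproduced here; as an illustrative sketch only (the tempC field and the conversion are hypothetical, not taken from the actual weather data), a transform that adds a derived field follows the framework described above:

# hypothetical derived field: convert a Celsius temperature column to Fahrenheit
tempF <- modelerData$tempC * 9/5 + 32
modelerData <- cbind(modelerData, tempF)
# register the new field in the data model
varTempF <- c(fieldName="tempF", fieldLabel="", fieldStorage="real",
              fieldMeasure="", fieldFormat="", fieldRole="")
modelerDataModel <- data.frame(modelerDataModel, varTempF)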

Extension Model Node – R
Build and score models with the Extension Model node.

In this example, Extension_Model_Example_R.str, we will build a Logistic Regression model to predict Loan Default. The model is assigned to the reserved word modelerModel.

In the extension node, review the R model building syntax:

modelerModel <- glm(default ~ employ + address + debtinc + creddebt + alldebt, data=modelerData, family=binomial())


 
Review the R model scoring syntax; it scores the data and extends the output data model to include the new field, defaultpropensity:



result <- predict(modelerModel, type="response")   # predicted probabilities
modelerData <- cbind(modelerData, result)
# describe the new field and append it to the data model
var1 <- c(fieldName="defaultpropensity", fieldLabel="", fieldStorage="real", fieldMeasure="", fieldFormat="", fieldRole="")
modelerDataModel <- data.frame(modelerDataModel, var1)


 
Extension Output Node – R
With the Extension Output node, you can display text and graphical outputs on screen or output to file.

In this example, Extension_Model_Example_R.str, we aggregate the debt-to-income ratio (debtinc) and plot the mean value by default group. Since this is a terminal node, there is no need to set an output data model or output data frame. A sketch of the idea follows.
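
The stream's plotting code is not shown; a minimal sketch of the described output, assuming the default and debtinc fields from the loan data used in the model example, might be:

# compute the mean debt-to-income ratio per default group
means <- aggregate(debtinc ~ default, data=modelerData, FUN=mean)
# plot the aggregated values by default group
barplot(means$debtinc, names.arg=means$default,
        xlab="Default Group", ylab="Mean debt-to-income ratio (debtinc)")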


Troubleshooting Tips
Write and debug Python/R code in an IDE such as Jupyter Notebook or R Studio. Paste the working code into the extension nodes and set the necessary output data model, output dataframe and model variables.

If you are running the Modeler stream on Modeler Server, you must install SPSS Modeler Essentials for R and the supported version of R on Modeler Server before running R code in the extension nodes.

#datascience
#extensionnodes
#predictiveanalytics
#Programmability
#python
#R
#SPSS
#SPSSModeler
#WatsonStudio

Comments

Mon October 01, 2018 10:02 AM

Hi - have you tried posting in the forum? https://developer.ibm.com/answers/topics/modeler.html?smartspace=predictive-analytics
Or, if you have support, open a ticket here: http://ibm.com/mysupport

Sat September 29, 2018 06:29 AM

I have installed R 3.3.3 as suggested by the manual, but I still get the "Failed to run R" message when I try to run a stream with an extension node. Could you please help me?

Thu June 14, 2018 12:14 PM

The last thing Modeler (or the Watson Experience GUI) needs in order to become the most capable tool on the market is the ability to export an entire stream to R or Python (or PySpark) code with a single click: both the stream with the trained model and, as in Azure ML, a stream with the ability to relearn. With such a possibility of obtaining a full stream in one or both of these languages (at least Python), Modeler would be, and would stay, the best tool on the market.