In SPSS Modeler, it can be useful to interact with a stream during its execution. For example, you may want to select a particular subset of data to analyse by scanning the raw data and presenting a set of possible categories before doing further processing. In this article, we will use SPSS Modeler scripting to:
- generate a stream to scan a named data set for categories in a particular field
- present the available categories to the user who can then select which category they are interested in
- generate a histogram of a numeric field for the category chosen by the user
Along the way we will cover:
- how to generate and configure nodes in a stream
- some basic Java Swing UI (user interface) controls
- some of the issues that can occur when interacting with a user during execution
Stream Structure
Our script will generate a stream that consists of a data source node (in this case a CSV import node) and two sub-branches off it:
- One that uses an aggregate node to identify the categories in a named field
- Another that uses a select node to select records where the category field has the named category before passing those records to a histogram node
Script Structure
To keep things flexible, we will define a primary function that takes the location of the CSV file, the name of the categorical field we want to select from and the name of the continuous field we want to plot in the histogram:
def genHistogram(stream, csvFile, categoryColumn, continuousColumn):
    # whatever we need to do...
This means we could call it as a stream script with:
genHistogram(modeler.script.stream(), '$CLEO_DEMOS/DRUG1n', 'Drug', 'Age')
where 'Drug' and 'Age' are two of the fields in the DRUG1n demonstration data set.
The general structure of the primary function is:
def genHistogram(stream, csvFile, categoryColumn, continuousColumn):
    # clear any existing nodes from the stream
    # build the stream using the information passed in the parameters
    # execute the first sub-branch to get the list of possible categories
    # present the user with a list of categories so they can pick one
    # configure the select expression based on the chosen category
    # execute the second sub-branch to create the histogram
Clearing Existing Nodes From The Stream
Streams include a clear() function which we will use to remove any existing nodes from the stream. This isn't strictly necessary but keeps things tidy if we re-execute the script multiple times.
def genHistogram(stream, csvFile, categoryColumn, continuousColumn):
    stream.clear() # reset the stream
    # build the stream using the information passed in the parameters
    # execute the first sub-branch to get the list of possible categories
    # present the user with a list of categories so they can pick one
    # configure the select expression based on the chosen category
    # execute the second sub-branch to create the histogram
Building The Stream
The next step is to build our stream based on the information we already know (the path to the CSV file, the categorical column we're picking the category from and the continuous column we want to plot the histogram for).
The CSV file node is created first. Since we are limiting this to a simple CSV using the defaults (comma as the field separator, first line contains the field names etc.) we only need to set the file path.
In our first sub-branch, we create an aggregate node configured to use the categorical field as its only key field. We then pass the results to a table node which will create a table output. The table output will contain a row set where each row contains two columns - the category in the key field and the number of records that category appears in (we don't need the count but we will leave it enabled for simplicity).
In our second sub-branch, we create a select node to select the records with the chosen category. Since we don't yet know the category, we can't configure the expression so that will remain blank. Finally we add the histogram node with the continuous field we're going to plot.
The function which does this is shown below:
def buildStream(stream, csvFile, categoryColumn, continuousColumn):
    # Create and configure the CSV import node
    csvNode = stream.createAt("varfile", csvFile, 92, 92)
    csvNode.setPropertyValue("full_filename", csvFile)
    # Create the branch that will determine the available categories
    # Create and configure the aggregate node
    aggNode = stream.createAt("aggregate", "Agg", 184, 92)
    aggNode.setPropertyValue("keys", categoryColumn)
    # Create the table node
    tableNode = stream.createAt("table", "Table", 276, 92)
    # Connect them together
    stream.linkPath([csvNode, aggNode, tableNode])
    # Now create the second branch that will select the specified category
    # and produce the required graph.
    # Just create the select node - we can't configure the select expression yet
    selectNode = stream.createAt("select", "Select", 184, 184)
    # Create and configure the histogram node
    histogramNode = stream.createAt("histogram", "Histogram", 276, 184)
    histogramNode.setPropertyValue("field", continuousColumn)
    # Connect them together
    stream.linkPath([csvNode, selectNode, histogramNode])
    # Return the nodes that the rest of the script will need access to
    return [tableNode, selectNode, histogramNode]
It returns a list of the nodes the rest of the script will need access to:
- the table node in the first branch which will create the list of categories
- the select node in the second branch which will be configured once the user has selected a category
- the histogram node which will create the histogram
We can update our primary function to build the stream:
def genHistogram(stream, csvFile, categoryColumn, continuousColumn):
    stream.clear() # reset the stream
    nodes = buildStream(stream, csvFile, categoryColumn, continuousColumn)
    # execute the first sub-branch to get the list of possible categories
    # present the user with a list of categories so they can pick one
    # configure the select expression based on the chosen category
    # execute the second sub-branch to create the histogram
Executing The First Sub-branch To Get The Categories
We now need to execute the table node and get a list of categories. We will create a single function that:
- executes the table node
- scans the row set in the table output and extracts the categories
- closes the table output (which we no longer need)
- sorts and returns the categories as a list
The function which does that is:
def getCategories(tableNode):
    # Supply a list to capture the objects generated by executing the stream
    result = []
    tableNode.run(result)
    # Extract the row set from the table output and get the values from the first column
    rowset = result[0].getRowSet()
    rowcount = rowset.getRowCount()
    row = 0
    values = []
    while row < rowcount:
        values.append(str(rowset.getValueAt(row, 0)))
        row += 1
    # Close/delete the table output
    result[0].close()
    # Sort the values before returning them
    values.sort()
    return values
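The row-scanning step can be tried outside Modeler with a stand-in object. The sketch below assumes only the two calls used above, getRowCount() and getValueAt(row, column). FakeRowSet is a hypothetical substitute for the real row set, extractColumn is an illustrative helper name rather than part of the Modeler API, and the counts are made-up values.

```python
class FakeRowSet(object):
    """Hypothetical stand-in for the row set held by a table output."""
    def __init__(self, rows):
        self._rows = rows

    def getRowCount(self):
        return len(self._rows)

    def getValueAt(self, row, column):
        return self._rows[row][column]


def extractColumn(rowset, column=0):
    # Read one column from every row, convert to strings and sort
    values = [str(rowset.getValueAt(row, column))
              for row in range(rowset.getRowCount())]
    values.sort()
    return values


# Each row is (category, record count); the counts are illustrative only
demo = FakeRowSet([("drugY", 3), ("drugA", 1), ("drugX", 2)])
categories = extractColumn(demo)
```

The list comprehension does the same job as the while loop in getCategories. Either form should work in Jython, so the choice is a matter of taste.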
Our updated primary function now looks like:
def genHistogram(stream, csvFile, categoryColumn, continuousColumn):
    stream.clear() # reset the stream
    nodes = buildStream(stream, csvFile, categoryColumn, continuousColumn)
    categories = getCategories(nodes[0])
    # present the user with a list of categories so they can pick one
    # configure the select expression based on the chosen category
    # execute the second sub-branch to create the histogram
Presenting The User With The Categories
In order to present the user with the categories, we are going to generate a simple UI (user interface). Since Modeler scripting uses Jython, a Java implementation of the Python language, we will use Java's Swing UI library. Swing is a complex and powerful library (which SPSS Modeler also uses for its UI) and there are dozens of books and online tutorials devoted to it, including an official tutorial. However, we will try to keep things simple.
The UI will consist of a window containing a list of possible categories which the user can select and an OK button to confirm that selection.

To use Swing UI components we first need to import the ones we need into our script:
from java.awt import BorderLayout, Dimension
from javax.swing import JFrame, JButton, JPanel, JList, JScrollPane
A basic function to create the UI layout would be:
def makeUI(categories):
    frame = JFrame("Select a category", size = (300, 200))
    uilist = JList(categories)
    scroll = JScrollPane(uilist)
    scroll.setPreferredSize(Dimension(300, 200))
    button = JButton('OK')
    panel = JPanel()
    # A BorderLayout provides a simple way of laying out components in a panel
    panel.setLayout(BorderLayout())
    panel.add(scroll, BorderLayout.CENTER)
    panel.add(button, BorderLayout.SOUTH)
    frame.getContentPane().add(panel)
    frame.pack()
    return frame
That looks fine but if we were to make the frame visible, clicking the OK button wouldn't do anything because we haven't added any behaviour to the button. To make something happen we have to attach a callback or "listener" function that gets called when certain things happen (like the button being clicked).
Before we do that though, it's worth stepping back and thinking exactly what our callback needs to do:
- check whether the user has actually selected anything before proceeding
- get the selected category from the list
- configure the select node for that category
- close the selector window
- execute the histogram node
In order to do those operations, the callback will need access to:
- the list component in the UI (to get the selected category)
- the select node that will be configured
- the name of the categorical field so an expression like "field == category" can be created and added to the select node
- the histogram node
The problem is that the callback function itself only gets passed a single object, the event that caused the callback to be called. The next issue is how to provide the callback with the values it needs.
One approach would be to store what's needed in global variables but that can quickly get unwieldy. The approach we're going to take is to create a "closure" which is basically a function with certain values baked into it.
A Note On Closures
A closure can be created by defining a function within a function:
def buildMultiplier(x):
    def mult(y):
        return x * y
    return mult
The buildMultiplier function is effectively a function factory - it is passed a number and returns a closure on the nested mult function that multiplies by that number. For example:
mult10 = buildMultiplier(10)
print mult10(4) # prints 40
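Closures can capture objects as well as numbers, which is what makes them useful for event handlers. As a minimal sketch, the hypothetical buildHandler below captures a label and a list when it is built, while the returned function only receives the event when it is called. The names and the string "events" are illustrative and not taken from Swing or Modeler.

```python
def buildHandler(label, log):
    # 'label' and 'log' are captured when the handler is built
    def handler(event):
        # Only 'event' is supplied at call time
        log.append(label + ": " + str(event))
    return handler


log = []
onOK = buildHandler("OK clicked", log)
onOK("event-1")
onOK("event-2")
```

After the two calls, log holds one entry per event, each prefixed with the label that was baked in when the handler was created.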
What we will do is create a "callback" factory that captures the information we need.
We could do something like:
def buildOKHandler(selectNode, categoryColumn, histogramNode, uilist, frame):
    def configAndRun(event):
        # whatever the handler has to do
    return configAndRun
However, this seems a little clumsy - the callback has to know about the stream components and the UI components. It would make the functionality more re-usable to separate the UI-related stuff from the stream-related stuff. One way this could be done is by capturing the UI-related functionality in a single object and allowing the rest of the script to worry about the stream, nodes, fields etc. To do that, we will create a new UI component class and then create instances of that class.
Creating A Class For Our UI
There are a number of ways we could define a class for our UI. For example, we could define a class that simply created a JFrame and its associated UI elements. However, we are going to take the approach of extending the JFrame class we used in our makeUI function.
The syntax for extending a class can look a little clunky but getting to grips with it opens up a lot of opportunities for doing the same thing in other situations:
from java.awt import BorderLayout, Dimension
from javax.swing import JFrame, JButton, JPanel, JList, JScrollPane

class SelectorWindow(JFrame):
    def __init__(self):
        JFrame.__init__(self, "Select a category", size = (300, 200))
        self.uilist = JList()
        scroll = JScrollPane(self.uilist)
        scroll.setPreferredSize(Dimension(300, 200))
        button = JButton('OK')
        panel = JPanel()
        # A BorderLayout provides a simple way of laying out components in a panel
        panel.setLayout(BorderLayout())
        panel.add(scroll, BorderLayout.CENTER)
        panel.add(button, BorderLayout.SOUTH)
        self.getContentPane().add(panel)
        self.pack()
You can see that it is similar to the makeUI function we defined earlier but with a few tweaks here and there. The main ones are:
class SelectorWindow(JFrame):
- this says we are defining a class called SelectorWindow which extends JFrame i.e. we have all the functionality of JFrame along with anything else we choose to add
def __init__(self):
- this is the function that is called when we create a new instance of SelectorWindow
JFrame.__init__(self, "Select a category", size = (300, 200))
- this calls the standard JFrame constructor with a title. It also sets the initial size of the JFrame to width 300 and height 200
As well as creating and laying out the UI components, we also want our class to:
- handle setting the list in the UI with the possible categories and making the UI visible
- handle the details of the "OK" button callback - checking that a value has been selected, getting the selected value from the list and closing the window, leaving our main callback code to focus on configuring and executing the second branch of the stream using the category chosen by the user
To do this, we will define a function in our class that takes the list of categories to be displayed and a function to be called once the OK button has been clicked and we know which value has been selected.
class SelectorWindow(JFrame):
    # other class-related code...
    def choose(self, values, cb):
        # set the list data using the values
        # store the callback (cb) so we can call it when the user clicks OK
        # make the window visible
To call the supplied callback cb, we will define a separate callback function within the SelectorWindow class that will take care of checking whether an item has been selected, extracting that item from the list and closing the window before calling the supplied callback with the selected value. This means it should be possible to change the selection mechanism used by the UI (e.g. to use a combo-box rather than a list) without changing anything about the callback that gets passed to the choose function. Our modified class now looks like:
class SelectorWindow(JFrame):
    __callback = None

    # Class-specific callback
    def okCallback(self, event):
        # Check a callback has been set and the selection is not empty
        if self.__callback != None and not(self.uilist.isSelectionEmpty()):
            self.setVisible(False)
            self.__callback(self.uilist.getSelectedValue())

    def __init__(self):
        JFrame.__init__(self, "Select a category", size = (300, 200))
        self.uilist = JList()
        scroll = JScrollPane(self.uilist)
        scroll.setPreferredSize(Dimension(300, 200))
        # Ensure the class-specific callback is always called when the OK button is clicked
        button = JButton('OK', actionPerformed = self.okCallback)
        panel = JPanel()
        # A BorderLayout provides a simple way of laying out components in a panel
        panel.setLayout(BorderLayout())
        panel.add(scroll, BorderLayout.CENTER)
        panel.add(button, BorderLayout.SOUTH)
        self.getContentPane().add(panel)
        self.pack()

    def choose(self, values, cb):
        self.uilist.setListData(values)
        self.__callback = cb
        # Centre the window on the screen
        self.setLocationRelativeTo(None)
        self.setVisible(True)
So the sequence is now:
- create an instance of SelectorWindow
- create our callback
- call the choose function on our SelectorWindow instance with the categories and our callback
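The contract between the selector and its caller can be exercised without any Swing components. The sketch below is a hypothetical, non-graphical stand-in (FakeSelector and its confirm method are invented for illustration) that follows the same rules: choose stores the values and the callback, and the callback is only invoked with a value once something has actually been selected.

```python
class FakeSelector(object):
    """Hypothetical, non-graphical stand-in for SelectorWindow."""
    def __init__(self):
        self._values = []
        self._callback = None
        self.visible = False

    def choose(self, values, cb):
        self._values = list(values)
        self._callback = cb
        self.visible = True

    def confirm(self, index=None):
        # Simulates the OK button: ignore the click if nothing is selected
        if self._callback is None or index is None:
            return
        self.visible = False
        self._callback(self._values[index])


chosen = []
selector = FakeSelector()
selector.choose(["drugA", "drugB"], chosen.append)
selector.confirm()      # no selection, so the callback is not called
selector.confirm(1)     # "drugB" is passed to the callback
```

Because the caller only ever supplies a list of values and a one-argument callback, the same calling code would work whether the selection comes from a list, a combo-box or a test double like this one.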
This leaves us with the issue of creating our closure that references the select node, histogram node and category field.
Configuring The Second Sub-branch Based On The Chosen Category
Earlier we defined buildOKHandler to create our callback as a closure. Since our SelectorWindow class is responsible for taking care of extracting the value from the list and closing the window, we can simplify it as:
def buildOKHandler(selectNode, categoryColumn, histogramNode):
    def configAndRun(selectedValue):
        # whatever the handler has to do to configure and run the second branch
    return configAndRun
In other words we no longer need to capture the list and window from the UI (since the SelectorWindow handles that side), and our SelectorWindow will pass the selected value directly to our callback function.
So what does our callback function have to do?
- configure the select node
- execute the histogram
Configuring the select node is relatively simple. The expression will look like:
categoryColumn == selectedValue
There is an implicit assumption in this example that the category column contains strings. Although that might not always be true, it simplifies the example code if we make that assumption - if you need to support non-string categoricals then it is possible to do so but that is left as an exercise for the reader.
To make sure we can handle field names that may contain non-alphanumeric characters, we will enclose the field name in single quotes (') and the selected value in double quotes ("):
'categoryColumn' == "selectedValue"
Making the necessary changes to our configAndRun function gives us:
def buildOKHandler(selectNode, categoryColumn, histogramNode):
    def configAndRun(selectedValue):
        expr = '\'' + categoryColumn + '\' == "' + selectedValue + '"'
        selectNode.setPropertyValue("condition", expr)
        # Then run the histogram
    return configAndRun
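For readers who want to try the non-string exercise mentioned earlier, one possible starting point is to move the expression building into its own helper. The sketch below is not part of the original script. buildCondition is a hypothetical name, and it rests on two assumptions that should be checked against the CLEM documentation for your Modeler version: that numeric values can be compared without quotes, and that backslash escapes are honoured inside string literals. Note also that getCategories converts every value to a string, so it would need to return the raw values for the numeric branch to be reached.

```python
def buildCondition(fieldName, value):
    # Field names go in single quotes so unusual characters are accepted
    field = "'" + fieldName + "'"
    if isinstance(value, (int, float)):
        # Assumption: numeric categories are compared without quotes
        return field + " == " + str(value)
    # Assumption: backslash escapes are honoured inside string literals
    text = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return field + ' == "' + text + '"'


stringExpr = buildCondition("Drug", "drugY")
# "Store ID" is a made-up field name used to show the numeric case
numericExpr = buildCondition("Store ID", 12)
```

For a plain string category this produces the same expression as the concatenation used in configAndRun.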
Executing The Second Sub-branch To Create The Histogram
Obviously we've already shown how to execute the table node to get the categories so we can simply follow that pattern by calling the node's run method, right?
def buildOKHandler(selectNode, categoryColumn, histogramNode):
    def configAndRun(selectedValue):
        expr = '\'' + categoryColumn + '\' == "' + selectedValue + '"'
        selectNode.setPropertyValue("condition", expr)
        # Then run the histogram
        results = []
        histogramNode.run(results) # !! DO NOT RUN THIS
    return configAndRun
Unfortunately, if you run this, you will lock up the Modeler UI, meaning the only thing you can do is kill the process and lose whatever work you were doing. So what's the problem?
The issue is the way the Swing UI toolkit we are using for displaying options to the user works. If you recall, the callback we are building here is being run in response to a UI event (i.e. the user clicking the "OK" button). While that UI event is being handled, Swing cannot do anything else until the event handling code has completed and returned control back to Swing. Why does that cause a problem? Executing a stream causes the UI to be updated in various ways but since the Swing UI is still busy handling the "OK" button click event, those updates can't happen. This means the handler execution can't complete until the UI is updated but the UI can't be updated until the handler execution is complete and this causes the Modeler UI to lock up. Obviously this is a problem but the solution is relatively simple.
Best practice in many UI environments (not just Swing) is to minimise the time spent in callbacks since this makes the UI responsive again as quickly as possible. Any long running tasks are typically run in a separate "thread". A thread is a little like a process in that it allows multiple tasks to be run in parallel (or at least give the impression that is what is happening). Using threads, like Swing, is a complex topic and beyond the scope of this article to discuss in detail. However, Jython and Java have support for threads built-in, so we will use some of those features to allow us to run the stream without locking the UI.
Executing A Node In A Separate Thread
We are going to use the Java threading facilities to run a node in a separate thread. We could use Python/Jython features but this provides an opportunity for a second example of extending Java classes in Jython.
Java's threading support has two classes we are interested in:
- Thread, which defines functions for controlling and monitoring a thread
- Runnable, which defines what code the thread is actually going to run [1]
[1] Technically, Runnable is what's called an interface in Java. An interface is like an empty class but where the functions expected to be supported have already been declared, i.e. the interface defines what has to be supported while a class that implements the interface defines how that support is implemented.
We are going to define a class called NodeRunner that extends (or, strictly speaking, implements) the Java Runnable interface and its single required method, run(). Each instance takes the node to be executed and the list where any results will be stored.
We will also write a convenience function called runLater that takes a node and a result list, creates a NodeRunner instance and then passes that instance to a new Thread object. The thread will then be started, allowing runLater to return immediately, i.e. before the node execution has completed.
# We need to import Runnable and Thread from Java
from java.lang import Runnable, Thread

class NodeRunner(Runnable):
    def __init__(self, n, r):
        self.node = n
        self.results = r

    def run(self):
        # Don't need to do this but it can be useful
        print "Executing", self.node.getLabel()
        self.node.run(self.results)

def runLater(node, results):
    Thread(NodeRunner(node, results)).start()
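For comparison, here is a rough sketch of the Python-only route mentioned at the start of this section, using the standard threading module instead of the Java classes. runInBackground and slowTask are hypothetical names. Unlike runLater, this variant returns the thread object so the example can wait for it to finish.

```python
import threading


def runInBackground(task, results):
    # Call task(results) on a new thread and return straight away
    worker = threading.Thread(target=task, args=(results,))
    worker.start()
    return worker


def slowTask(results):
    # Stand-in for node.run(results)
    results.append("histogram")


results = []
worker = runInBackground(slowTask, results)
worker.join()   # only needed here so the example can inspect the results
```

In a stream script the task would presumably be the node's own run method, for example runInBackground(histogramNode.run, results), and the callback would not call join() since waiting would block the UI event again.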
We can now modify our callback (remember that?) to call runLater instead of executing the node directly:
def buildOKHandler(selectNode, categoryColumn, histogramNode):
    def configAndRun(selectedValue):
        expr = '\'' + categoryColumn + '\' == "' + selectedValue + '"'
        selectNode.setPropertyValue("condition", expr)
        # Then run the histogram
        results = []
        runLater(histogramNode, results)
    return configAndRun
("Just a moment!", I hear you say, "How come I can run the first branch directly without having to use runLater, even though I ran the script or stream by clicking on a button in the Modeler UI?" The answer is that Modeler does the same thing as we've done in this script, i.e. it begins the execution in a separate thread rather than directly in the button callback.)
Tying It All Together
We can now go back to our primary function and add the steps to create the callback, create a SelectorWindow and invoke its choose function:
def genHistogram(stream, csvFile, categoryColumn, continuousColumn):
    stream.clear() # reset the stream
    nodes = buildStream(stream, csvFile, categoryColumn, continuousColumn)
    # Run the first branch to get the categories
    categories = getCategories(nodes[0])
    # Build our callback function
    handler = buildOKHandler(nodes[1], categoryColumn, nodes[2])
    # Create the SelectorWindow and invoke its choose() function
    selector = SelectorWindow()
    selector.choose(categories, handler)
Summary
This has been a long discussion with various side-tracks so if you've made it this far then well done. Even if you don't need all these features, hopefully there are bits and pieces you can pull from these examples to help you make the most of the scripting functionality available in SPSS Modeler.
The full script is shown below or can be downloaded from GitHub:
from java.lang import Runnable, Thread
from java.awt import BorderLayout, Dimension
from javax.swing import JFrame, JButton, JPanel, JList, JScrollPane

class SelectorWindow(JFrame):
    __callback = None

    # Class-specific callback
    def okCallback(self, event):
        # Check a callback has been set and the selection is not empty
        if self.__callback != None and not(self.uilist.isSelectionEmpty()):
            self.setVisible(False)
            self.__callback(self.uilist.getSelectedValue())

    def __init__(self):
        JFrame.__init__(self, "Select a category", size = (300, 200))
        self.uilist = JList()
        scroll = JScrollPane(self.uilist)
        scroll.setPreferredSize(Dimension(300, 200))
        # Ensure the class-specific callback is always called when the OK button is clicked
        button = JButton("OK", actionPerformed = self.okCallback)
        panel = JPanel()
        # A BorderLayout provides a simple way of laying out components in a panel
        panel.setLayout(BorderLayout())
        panel.add(scroll, BorderLayout.CENTER)
        panel.add(button, BorderLayout.SOUTH)
        self.getContentPane().add(panel)
        self.pack()

    def choose(self, values, cb):
        self.uilist.setListData(values)
        self.__callback = cb
        # Centre the window on the screen
        self.setLocationRelativeTo(None)
        self.setVisible(True)

class NodeRunner(Runnable):
    def __init__(self, n, r):
        self.node = n
        self.results = r

    def run(self):
        print "Executing", self.node.getLabel()
        self.node.run(self.results)

def runLater(node, results):
    Thread(NodeRunner(node, results)).start()

def buildStream(stream, csvFile, categoryColumn, continuousColumn):
    # Create and configure the CSV import node
    csvNode = stream.createAt("varfile", csvFile, 92, 92)
    csvNode.setPropertyValue("full_filename", csvFile)
    # Create the branch that will determine the available categories
    # Create and configure the aggregate node
    aggNode = stream.createAt("aggregate", "Agg", 184, 92)
    aggNode.setPropertyValue("keys", categoryColumn)
    # Create the table node
    tableNode = stream.createAt("table", "Table", 276, 92)
    # Connect them together
    stream.linkPath([csvNode, aggNode, tableNode])
    # Now create the second branch that will select the specified category
    # and produce the required graph.
    # Just create the select node - we can't configure the select expression yet
    selectNode = stream.createAt("select", "Select", 184, 184)
    # Create and configure the histogram node
    histogramNode = stream.createAt("histogram", "Histogram", 276, 184)
    histogramNode.setPropertyValue("field", continuousColumn)
    # Connect them together
    stream.linkPath([csvNode, selectNode, histogramNode])
    # Return the nodes that the rest of the script will need access to
    return [tableNode, selectNode, histogramNode]

def getCategories(tableNode):
    # Supply a list to capture the objects generated by executing the stream
    result = []
    tableNode.run(result)
    # Extract the row set from the table output and get the values from the first column
    rowset = result[0].getRowSet()
    rowcount = rowset.getRowCount()
    row = 0
    values = []
    while row < rowcount:
        values.append(str(rowset.getValueAt(row, 0)))
        row += 1
    # Close/delete the table output
    result[0].close()
    # Sort the values before returning them
    values.sort()
    return values

def buildOKHandler(selectNode, categoryColumn, histogramNode):
    def configAndRun(selectedValue):
        expr = '\'' + categoryColumn + '\' == "' + selectedValue + '"'
        selectNode.setPropertyValue('condition', expr)
        results = []
        runLater(histogramNode, results)
    return configAndRun

def genHistogram(stream, csvFile, categoryColumn, continuousColumn):
    stream.clear() # reset the stream
    nodes = buildStream(stream, csvFile, categoryColumn, continuousColumn)
    # Run the first branch to get the categories
    categories = getCategories(nodes[0])
    # Build our callback function
    handler = buildOKHandler(nodes[1], categoryColumn, nodes[2])
    # Create the SelectorWindow and invoke its choose() function
    selector = SelectorWindow()
    selector.choose(categories, handler)

genHistogram(modeler.script.stream(), "$CLEO_DEMOS/DRUG1n", "Drug", "Age")