Python Wordcount Extension Breakdown

By Archive User posted Wed September 09, 2015 01:35 PM

Two weeks ago I posted a blog entry introducing a new extension for IBM SPSS Statistics that counts the frequency of words in a given column (variable). This post is a follow-up that provides a technical breakdown of the Python running behind the scenes. This is the interface for the extension:

This extension was developed by Jon Peck, a Senior Software Engineer at IBM. In this post I will focus on the interaction between Python and SPSS Statistics. My assumption is that you are already familiar with Python syntax. This post is formatted as a Jupyter Notebook to help explain each section of code.

The majority of this extension is written in Python 2. However, two lines of SPSS Statistics syntax are needed so the software knows when to start and stop executing the script: 'begin program.' at the start of the script and 'end program.' at the end.

After we begin the program, we import three SPSS modules as well as 're', which is the Python module for regular expressions.

In [ ]:

begin program.
import re, spss, spssdata, spssaux  # Import regular expressions and SPSS modules

Once the libraries are imported, we declare a set of words that our function will not count. A set of six words is shown below; it could be expanded to cover any other common words you would like to skip when counting frequencies. We will see later the efficiency gained by using a Python set for the excluded words. Try experimenting with this set by adding other common words in your data that you do not want to count.

In [ ]:

# words to ignore when counting, all in lower case
stopwords = set(['and', 'or', 'a', 'the', 'but', 'else'])
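
As a small illustration of extending the set, the sketch below adds a few extra words. The name extra_stopwords and the words in it are hypothetical choices for this example, not part of the original extension.

```python
# Stop words from the extension, all in lower case
stopwords = set(['and', 'or', 'a', 'the', 'but', 'else'])

# Hypothetical additions; pick words that are common in your own data
extra_stopwords = ['of', 'to', 'in']
stopwords.update(extra_stopwords)

# Membership tests on a set take constant time on average,
# regardless of how many words the set holds
is_skipped = 'of' in stopwords
is_counted = 'statistics' not in stopwords
```

Because the counting code compares lower-case words against this set, any additions should be written in lower case as well.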

Next, we have the meat of the script. This is the function that will perform the word count on our data. This is the full function, followed by a breakdown of what the code means.

In [ ]:

def freqs(words, mincount):
    """accumulate word occurrence frequencies and return dictionary of counts
    
    words is the name of the variable to summarize
    mincount is the minimum number of occurrences for inclusion"""
    
    curs = spssdata.Spssdata(words, omitmissing=True)
    wordcounts = {}   # initialize a dictionary for counts
    # Loop over all cases
    for case in curs:
        words = re.findall(r"\b[a-zA-Z']+\b",case[0])
        # lowercase before the stop-word test so "The" and "THE" are skipped too
        words = [item.lower() for item in words if item.lower() not in stopwords]
        # Loop over words in case
        for w in words:
            wordcounts[w] = wordcounts.get(w, 0) + 1
    curs.CClose()
    # word list is returned sorted in descending order of counts
    wordlist =  sorted(wordcounts.items(), key=lambda x: x[1], reverse=True)
    return [item for item in wordlist if item[1] >= mincount]

We define the function 'freqs' as taking two arguments:

words - this is the Statistics variable name selected in the dialog box where the text of interest is located

mincount - this is the minimum number of occurrences required for a word to be included on the table and is also an input from the dialog box

The next line of interest declares a cursor that links the script to the active dataset. From the documentation, "Spssdata manages the active SPSS dataset retrievals at a higher level than in the spss module". The 'omitmissing' argument for Spssdata is set to True because missing values are delivered to Python as None, and regular expressions require strings. If this argument were not set to True, a missing value would cause the script to fail.

In [ ]:

    curs = spssdata.Spssdata(words,  omitmissing=True)
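
To see why missing values matter, the sketch below uses a plain Python list as a stand-in for the cursor. This is an assumption made for illustration only; it does not use the spssdata API, and the sample rows are hypothetical. It shows that passing None to re.findall raises a TypeError, and that dropping None values first avoids the failure.

```python
import re

WORD_PATTERN = r"\b[a-zA-Z']+\b"

# Hypothetical rows standing in for cases read from the dataset;
# None plays the role of a missing value
rows = ["red fish", None, "blue fish"]

# A regular expression cannot search None
try:
    re.findall(WORD_PATTERN, None)
    raised_type_error = False
except TypeError:
    raised_type_error = True

# Dropping missing rows first lets every remaining row be searched safely
non_missing = [row for row in rows if row is not None]
found = [re.findall(WORD_PATTERN, row) for row in non_missing]
```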

Now that we have a connection to the dataset, we can iterate over each row (case) to do our check for words.

There are different ways to accomplish this, but in this example we use regular expressions to build the list of words. If you have not used regular expressions in Python, check out the documentation for the re module. Notice that first we find all the words in the given cell (case[0]), then loop through that list, converting all words to lower case and removing any words in the 'stopwords' set. All words are converted to lower case so that all casings of a word are counted together. Using a set for the words we want to skip allows each word to be checked in constant time on average, so performance will not suffer as we exclude more words and 'stopwords' grows.

In [ ]:

    for case in curs:
        words = re.findall(r"\b[a-zA-Z']+\b",case[0])
        # lowercase before the stop-word test so "The" and "THE" are skipped too
        words = [item.lower() for item in words if item.lower() not in stopwords]
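
The tokenizing step can be tried outside SPSS Statistics on an ordinary string. The sketch below uses a hypothetical cell value and lowercases each word before the stop-word test, so capitalized forms such as "The" are excluded along with "the".

```python
import re

stopwords = set(['and', 'or', 'a', 'the', 'but', 'else'])

# Hypothetical cell value
text = "The cat's toy and THE dog"

# Runs of letters and apostrophes bounded by word boundaries
raw_words = re.findall(r"\b[a-zA-Z']+\b", text)

# Lowercase first, then test against the lower-case stop words
kept = [w.lower() for w in raw_words if w.lower() not in stopwords]
```

Here raw_words holds six items, including "cat's" with its apostrophe intact, and kept holds the three words that survive the filter: "cat's", "toy", and "dog".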

We now have our list of words for a given case, but we need to start counting the frequency for each of these words.

The for loop below iterates over each word in the list created in the outer for loop. This loop utilizes a Python dictionary with keys set to be a word in the list and the value being the number of times it has appeared. The get() method works great for this dictionary, where the default value is set to 0 in case the key (current word) is not already in the dictionary. We add 1 to that value to increment the count.

Next, the connection to the dataset is closed using the CClose() method.

In [ ]:

        # Loop over words in case
        for w in words:
            wordcounts[w] = wordcounts.get(w, 0) + 1
    curs.CClose()
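
The counting pattern can be checked on its own with a hypothetical word list. The sketch also compares the result with collections.Counter from the standard library, which is an alternative mentioned here for comparison and is not what the extension uses.

```python
from collections import Counter

# Hypothetical words gathered from several cases
words = ["fish", "red", "fish", "blue", "fish"]

wordcounts = {}
for w in words:
    # get() returns 0 the first time a word is seen, so no key check is needed
    wordcounts[w] = wordcounts.get(w, 0) + 1

# Counter produces the same tallies in a single call
counter_counts = dict(Counter(words))
```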

Now most of the hard work is done. We just need to sort the key-value pairs in the 'wordcounts' dictionary and check that each key (word) meets the minimum count requirement. The key argument of the sorted function is set to lambda x: x[1], so the sort is based on the second element (the count) of each tuple in the list returned from wordcounts.items().

Next, we loop through the sorted list of tuples and keep only the items whose value (item[1]) is greater than or equal to the minimum count.

In [ ]:

    # word list is returned sorted in descending order of counts
    wordlist =  sorted(wordcounts.items(), key=lambda x: x[1], reverse=True)
    return [item for item in wordlist if item[1] >= mincount]
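
The sort-and-filter step can be sketched with a hypothetical dictionary of counts. Words with equal counts have no guaranteed relative order in the original code, so the second sort below shows one optional way to break ties alphabetically; that variation is an assumption for illustration and is not part of the extension.

```python
# Hypothetical counts
wordcounts = {"fish": 3, "red": 1, "blue": 2, "one": 2}
mincount = 2

# Same approach as the extension: sort by count, highest first, then filter
wordlist = sorted(wordcounts.items(), key=lambda x: x[1], reverse=True)
result = [item for item in wordlist if item[1] >= mincount]

# Optional variation: descending count, then alphabetical order for ties
tiebreak = sorted(wordcounts.items(), key=lambda x: (-x[1], x[0]))
tiebreak_result = [item for item in tiebreak if item[1] >= mincount]
```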

Now the word counting function is complete. The next step is to bring the variables from the dialog box into our code. When a variable is created in a custom dialog, we can reference it in a script by using the schema '%%IdentifierName%%'. When we created the dialog for this extension we called our target variable countedVar and the minimum count limit minCount.

We declare new variables for these two values in the script and pass them into the freqs() function.

In [ ]:

# This connects the dialog controls with the code
countedVar = %%countedVar%%
minCount = %%minCount%%

counts = freqs(countedVar, minCount)

The final bit of code creates the output. The title for this output will be "Wordcounts". A BasePivotTable is created in the next line with an argument for the title (displayed in the output) and the templateName. The templateName specifies the OMS (Output Management System) table subtype. This argument is required for every pivot table, but in this example we are not working with the OMS. The SetDefaultFormatSpec method determines the format of the table; Count is used here because it is appropriate for this case. For a table of formatting options, please see page 34 of the Python Reference Guide for IBM SPSS Statistics.

The next few lines populate the data in the table we created. First, we label the row dimension ("Words"), then populate the labels with the keys at index 0 from the output from the freqs() function. Next, we label the next column "Counts" and iterate over the values at index 1 to populate data in this column. The Python Reference Guide also has detailed information on the SimplePivotTable method.

Finally, we end the procedure and end the program with SPSS syntax.

In [ ]:

# display a pivot table of results
spss.StartProcedure("Wordcounts")
table = spss.BasePivotTable("Word Counts for Variable: %s" % countedVar,
"WORDCOUNT")
table.SetDefaultFormatSpec(spss.FormatSpec.Count)
table.SimplePivotTable(rowdim = "Words",
    rowlabels = [item[0] for item in counts],
    collabels = ["Counts"],
    cells = [item[1] for item in counts])
spss.EndProcedure()
end program.
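
The way the row labels and cell values are pulled out of the result can be checked without SPSS Statistics. The sketch below uses a hypothetical list in the same shape that freqs() returns, a list of (word, count) tuples, and does not call any spss functions.

```python
# Hypothetical result in the shape returned by freqs(): (word, count) tuples
counts = [("fish", 3), ("blue", 2), ("red", 2)]

# Index 0 of each tuple supplies the row label, index 1 the cell value
rowlabels = [item[0] for item in counts]
cells = [item[1] for item in counts]

# The two lists line up position by position, one table row per word
pairs = list(zip(rowlabels, cells))
```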

Now the script is complete!

With one click we can install this extension and it will be ready for testing. Feel free to edit the dialog to modify it to suit your needs. A great way to start would be by adding additional stop words.

This post was created with the help of IBM's Data Scientist Workbench, a tool currently available as a free preview. I found it very easy to get up and running, and it let me export my file as both an HTML page and a Jupyter Notebook.

#Programmability
#python
#SPSS
#SPSSStatistics
