
Analyzing Text Data in Surveys using the SPSS Statistics STATS TEXTANALYSIS Extension Command

By Jon Peck posted Thu April 29, 2021 09:45 AM

  

Analyzing Survey Text

Jon Peck

jkpeck@gmail.com

Updated 10-September, 2021

This note describes using the STATS TEXTANALYSIS extension command.  The command is available on the Extension Hub.

It is common for surveys to include open ended questions.  Such questions come in two main types.  The first is an “other” category in a question that also has a list of specified choices.  Examples are job title, medical conditions, race, political affiliation, and brand of car owned.  The second is an attitude or opinion question or a narrative.  Examples include opinion about a company, political candidate, or issue.  There might be a narrative of a customer service problem or an explanation of why a rating was assigned in a previous question.

Questions with unstructured text answers can contain valuable information, but they are difficult to analyze, so the survey users may be reduced to just reading through the answers.  Text might be coded into structured form, but coding is time consuming and expensive, and it can be very subjective.  The coding may be inconsistent from respondent to respondent, especially if there are multiple coders.

Nevertheless, there are ways to extract the information in such questions, and combining this with the more structured information can improve the analysis.  Respondents may well answer open ended questions differently from structured equivalents, so it can be important to extract this information even if it has a structured counterpart.  Also, a pilot study may use open ended questions, and the information in the responses used to construct a more structured survey.

The STATS TEXTANALYSIS extension command for SPSS Statistics provides tools for extracting the information in unstructured responses that can then be combined with the structured data.  The command provides tools for preprocessing the data and setting the text extraction parameters, for calculating frequencies and sentiment scores, and for constructing dummy variables based on the content.  For explanations of the operational details, use the dialog  box or syntax help. 

Analysis

We will assume that the preliminary steps have already been taken and look at the statistical analysis first and then return to discuss the preliminary issues.

We will use data from the American National Elections Study 2008 for these examples (https://electionstudies.org/data-center/2008-time-series-study/).  The study has a number of dimensions that we will ignore for this exposition.

Frequencies

The most basic analysis is to count word and phrase frequencies in a question.   Finding the most frequent of these gives an overview of the responses.  It can also help in setting up coding instructions or in devising recodes where Other categories should be mapped into the structured responses.  The frequency tabulations intelligently identify words, ignoring letter case.  First, we tabulate the most frequent words in a small artificial dataset with an open-ended question, and then we look at likes for the Democratic candidate in the presidential election.  Frequencies produces three tables – individual words, bigrams (two words together), and trigrams (three words together).  Of the roughly 2,300 cases, almost half have text in this question, but text questions are often much sparser.

If the dataset is weighted, the weights are used in the Frequency tables, but they are rounded to integers.  This means in particular that weights less than .5 will cause the case to be omitted from the table.  Although the standard Statistics weight is a replication weight, fractional weights are sometimes used.  If you need more precision in the counts, you can multiply the weight by, say, 10 so that rounding distorts the results less.  Bear in mind that weights are implemented by replicating the words in the data set, so very large weights may cause problems.
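
As a rough illustration of why small fractional weights matter, consider the hypothetical sketch below; the command's exact rounding rule may differ in detail.

    # Hypothetical illustration: a weight below .5 rounds to 0, so the case
    # contributes nothing; scaling all weights by 10 keeps it in the counts.
    weights = [0.4, 1.6, 2.0]
    print([round(w) for w in weights])       # [0, 2, 2] - first case dropped
    print([round(w * 10) for w in weights])  # [4, 16, 20] - all cases kept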

The definition of n-grams is more complicated than it might seem at first sight.  In this tool, the process is as follows.

  • For each case, find the sentences in the variable text
  • For each sentence, find the words it contains and convert to lower case
  • Stem, if specified, and remove stopwords
  • Count occurrences of words, adjacent pairs, and adjacent triples in each case
  • Sum across the sentences in the case

Duplicates in the sequence are discarded, so “a”, “b”, “b” or “a”, “b”, “a” would not be counted as trigrams.  By processing each sentence separately, an n-gram is prevented from spanning sentences.
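
The sketch below illustrates these steps using nltk, the library the extension relies on; it is not the extension's own code, and it assumes the punkt tokenizer data, the English stopword list, and the Snowball stemmer are available.

    # A minimal sketch of the per-case n-gram counting steps described above,
    # using nltk directly rather than the extension command itself.
    import nltk
    from collections import Counter
    from nltk.corpus import stopwords
    from nltk.stem import SnowballStemmer
    from nltk.util import ngrams

    STOPS = set(stopwords.words('english'))
    STEMMER = SnowballStemmer('english')

    def case_ngrams(text, stem=True):
        """Count words, bigrams, and trigrams for one case, sentence by sentence."""
        words_c, bigrams_c, trigrams_c = Counter(), Counter(), Counter()
        for sentence in nltk.sent_tokenize(text):
            words = [w.lower() for w in nltk.word_tokenize(sentence) if w.isalpha()]
            if stem:
                words = [STEMMER.stem(w) for w in words]
            words = [w for w in words if w not in STOPS]
            words_c.update(words)
            # n-grams are built within a sentence, so they never span sentences,
            # and n-grams containing a repeated word are discarded
            bigrams_c.update(g for g in ngrams(words, 2) if len(set(g)) == 2)
            trigrams_c.update(g for g in ngrams(words, 3) if len(set(g)) == 3)
        return words_c, bigrams_c, trigrams_c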

 

Here is a very simple dataset with the frequency output (footnotes omitted). The data are weighted by w.
[Table: frequency output for the simple example]

For the real dataset mentioned above, this is the output.  The dataset has fractional weights, and the footnote counting cases with text is unweighted.

[Table: Frequencies]

The bigram and trigram cases for this question are more interesting.  This would be harder to see without this tool.

[Tables: Bigrams and Trigrams]

The footnotes on the tables report that stopwords are ignored, and there is no stemming.  Stopwords are frequently occurring but mostly nonmeaningful words that should generally be ignored.  Stopwords are, of course, language specific.  The tool supports many languages. The English stopwords are these:

didn, won, then, here, are, being, been, other, both, this, should've, hasn, mustn't, whom, shan't, wasn, so, t, haven, your, did, not, shan, aren't, and, while, them, needn, you'll, no, same, she, to, few, doesn, itself, ma, wasn't, wouldn, will, with, above, ve, yourselves, it, themselves, until, out, now, all, him, own, he, because, can, their, up, down, again, than, shouldn, below, didn't, our, how, isn't, his, by, on, i, couldn, me, hadn, ourselves, they, does, during, just, hadn't, isn, that'll, once, yours, why, into, that, won't, her, you, more, aren, further, most, or, through, you'd, haven't, those, ain, very, re, hasn't, you've, had, has, do, after, don't, each, about, doesn't, m, y, having, too, am, is, there, any, don, what, herself, weren't, wouldn't, when, as, weren, we, were, myself, a, mustn, in, mightn't, it's, have, himself, should, before, off, some, against, over, couldn't, shouldn't, s, be, my, between, mightn, theirs, these, o, who, which, hers, of, where, ours, but, for, you're, was, only, doing, needn't, nor, its, if, from, ll, she's, yourself, d, the, an, under, at, such.

Frequencies always ignore them.

Stemming means removing morphological affixes from words.   That is, reducing a word to its root or stem, which might not even be a word.  This, too, is language specific.   The tool supports many languages.  The results can be surprising.  For example, meetings stems not to meeting but all the way to meet.  Especially for word searching, it is important to know the stems.  Therefore, the tool provides for creating a variable that shows the stemmed words for each case.
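
For example, here is a quick way to see stems using nltk's Snowball stemmer (the extension's exact stemmer choice may differ, but the behavior is similar):

    # Stemming can go further than expected: "meetings" reduces to "meet".
    from nltk.stem import SnowballStemmer
    stemmer = SnowballStemmer('english')
    print([stemmer.stem(w) for w in ["meetings", "running", "argued"]])
    # ['meet', 'run', 'argu']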

Sentiment

Frequencies are useful for both Other situations and opinions and narratives.  Sentiment analysis applies mainly to the latter.  Sentiment analysis is used to determine the degree to which text is negative, neutral, positive, or a composite of those.  The analyzer used here is described in
Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.

It is best suited for small amounts of text, such as posts on Twitter and similar social media.

The tool calculates one to four measures of sentiment (VADER scores) for a variable for each case, using a text lexicon that assigns a score to each word along with intensifiers such as greatly, barely, and other such words.  It even accounts for some idioms such as cut the mustard, and it takes negations such as not, doesn’t, and ain’t into account.

The negative, neutral, and positive measures sum to 1.  The compound measure is a heuristic that is not just the average of the other measures.

This table gives an idea of how these scores perform. 
[Table: Sentiments]

The data have been sorted by the overall score, text_comp.  Unknown words are considered neutral.  Negative compound scores indicate negative sentiment, and positive scores indicate positive sentiment.  Notice that not only did “very” increase the rating of “good”, but “very, very” increased it even more.  “Not good” was recognized as negative.
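
Because the extension uses the VADER lexicon distributed with nltk, scores like these can also be reproduced outside Statistics.  This is a minimal sketch, assuming the vader_lexicon resource has been downloaded:

    # Each result contains 'neg', 'neu', and 'pos' scores (summing to 1)
    # plus a 'compound' score in [-1, 1]; boosters like "very" raise the magnitude.
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    sia = SentimentIntensityAnalyzer()
    for text in ["good", "very good", "very, very good", "not good"]:
        print(text, sia.polarity_scores(text))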

Sentiments use a lexicon which can be modified.  The tool can create a dataset of the lexicon scores and words, but it does not show the intensifiers or negations.  You can add or change words and scores based on that lexicon.  However, the analysis is based on English.  It has been suggested that machine translation of the text before calculating sentiment may be a viable option, but your mileage may vary.

Spelling Correction

To level the playing field (an idiom that Sentiment does not understand), the tool includes a spelling corrector.  Corrections are based on Levenshtein distance, the minimum number of single-character edits required to turn a word into a recognized word.  The choice among candidates is then weighted by word frequency.  You can specify that recognized names not be corrected; the checker has over 7500 names built in.  You can also exclude stopwords from correction.  The “corrected” text is written to a new variable.

The checker is not interactive and may well be wrong, so experimentation may be necessary in determining whether to use it or not.  You can improve performance by creating a dataset with only the variables for which you want to correct spelling and then merging the result back with the main dataset.

Some foreign languages are supported, but you can add your own spelling dictionary.  If correcting data with a specialized vocabulary, which might be jargon, organization terms, abbreviations, or other nonstandard words, results will improve if you supply a supplemental dictionary.
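
The pyspellchecker package that the extension installs can also be tried directly.  This sketch shows the frequency-weighted, edit-distance-based suggestions described above; the supplemental dictionary file named in the comment is hypothetical.

    # A minimal sketch of spelling correction with pyspellchecker: candidates
    # within a small edit (Levenshtein) distance are ranked by word frequency.
    from spellchecker import SpellChecker

    spell = SpellChecker(language='en')
    # 'extra_terms.txt' is a hypothetical supplemental dictionary of jargon,
    # one term per line, analogous to supplying a supplemental dictionary.
    # spell.word_frequency.load_text_file('extra_terms.txt')
    for word in ["recieve", "goverment", "adress"]:
        print(word, "->", spell.correction(word))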

Searching

Identifying specific words in text variables, perhaps chosen based on the frequency analysis, can be useful in further analysis of the cases, including clustering and segmentation.  The tool includes a method for searching variables for occurrences of words and n-grams, with or without stemming.  It creates a dummy variable indicating the presence of any or all of a set of specified words or items, or a pattern variable.  With pattern, the result is a string variable containing a sequence of 1 and 0 values, one value per search word or n-gram, in the order listed.  For cases with no text in a variable, the output variable has the system-missing value if it is numeric (any or all) and is therefore excluded from statistical procedures.  For pattern, the result is blank, and blank is declared as a missing value.

To specify a bigram or trigram, enter a dash between the items in the search list, e.g., good-time would be a bigram.

Here is a frequency table for a pattern search variable with three words.
[Table: Search Result Frequencies]

It shows that 1126 cases had text but no occurrences of any of the search words; 73 cases had only the last word, 13 cases had only the second word, 3 cases had words two and three, and so on.

A custom variable attribute named search is created that records the search criterion.   From pattern, you can create a set of dummy variables conveniently using the SPSSINC CREATE DUMMIES extension command.  The pattern or dummy variables might be used in TWOSTEP CLUSTER, REGRESSION, or many other procedures.
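
As an illustration of what a pattern variable encodes, here is a sketch with hypothetical search words; it mimics the behavior described above rather than reproducing the command's own implementation.

    # A minimal sketch of a "pattern" result: one 1/0 flag per search term,
    # in the order listed, with stemming applied to both text and terms.
    import nltk
    from nltk.stem import SnowballStemmer

    STEMMER = SnowballStemmer('english')
    SEARCH_TERMS = ["economy", "health", "war"]   # hypothetical search list

    def pattern(text, terms=SEARCH_TERMS):
        if not text or not text.strip():
            return ""                             # blank pattern when there is no text
        stems = {STEMMER.stem(w.lower()) for w in nltk.word_tokenize(text) if w.isalpha()}
        return "".join("1" if STEMMER.stem(t) in stems else "0" for t in terms)

    print(pattern("The economy and the war worry me"))   # "101"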

The Entity Search Tab

The Search tab looks for explicitly listed words, n-grams, or their synonyms.  The Entity Search tab looks for types of items, such as persons, without an explicit list of texts, using the grammar of the text instead.  Check the Search box to perform an entity search.

 

Entity Type Select the type of entity to look for.  Choose ALL to look for all recognized entities as in the following list (adapted from nltk documentation), or choose a single type.  Types marked with * may not be currently available.

 

  • ORGANIZATION: corporations, agencies, and other groups of people defined by an established organizational structure.  Examples: Georgia-Pacific Corp., WHO
  • PERSON: humans; a person may be a single individual or a group.  Examples: John Irving, President Obama
  • LOCATION: geographical entities such as geographical areas and landmasses, bodies of water, and geological formations.  Examples: London, England, Mount Everest
  • DATE*: Examples: June, 2008-06-29
  • TIME*: Examples: two fifty a m, 1:30 p.m.
  • MONEY*: Examples: 175 million Canadian Dollars, GBP 10.40
  • FACILITY: buildings and other permanent man-made structures and real estate improvements.  Examples: Washington Monument, Stonehenge
  • GPE: Geo-Political Entity, a geographical region defined by political and/or social groups.  A GPE entity subsumes and does not distinguish between a nation, its region, its government, or its people.  Examples: South East Asia, Midlothian


Entities will not always be accurately detected due to the vagaries of speech and the vocabulary limitations.  Since the entity classification uses the grammar of the text, entity search seems not to work well with simple lists.  An entity may include other types of entities.  Filtering happens on the main item.

 

Width of Output Variables The procedure creates new or overwritten variables containing the entity information.  The output text lists all the entities found separated by / and preceded by <type>, where type is the first three letters of the types listed above with duplicates removed.  For example,

<ORG> Medicare/<ORG> Medicaid

The width is measured in bytes as elsewhere in SPSS Statistics.  For plain western text, this will be the same as characters, but accented characters and other scripts may require two or three bytes.  However, entity search is essentially limited to English.
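
For readers curious how output of this form can arise, here is a rough sketch using nltk's named-entity chunker; the extension's internals may differ.  It assumes the punkt, averaged_perceptron_tagger, maxent_ne_chunker, and words resources are installed (choosing All Packages in the nltk downloader covers them).

    # A minimal sketch of entity extraction with nltk, formatted roughly like
    # the "<typ> entity/<typ> entity" string described above.
    import nltk

    def entities(text):
        found = []
        for sentence in nltk.sent_tokenize(text):
            tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))
            for node in tree:
                if hasattr(node, 'label'):        # entity subtrees carry a type label
                    name = " ".join(tok for tok, tag in node.leaves())
                    found.append("<%s> %s" % (node.label()[:3], name))
        return "/".join(found)

    # The exact labels found depend on the chunker model.
    print(entities("President Obama visited the Washington Monument."))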

 

Variable Suffix for Outcome is used as described for other functions in this procedure.

The Utilities Tab

Create Dataset from Sentiment Lexicon Check this box to create a dataset containing the words and sentiment scores used in the scoring dictionary. There are also booster words such as very and somewhat. Enter a dataset name not already in use for the output dataset.

Create Stemmed Variables Check this box to create variables containing the result of stemming the text. Stemming often produces results that go further than might be expected, so it can be useful to see how it works. Output variable names are constructed as described on the Variables tab.

Load Supplemental Sentiment Scores File Check the Load box and specify a file of words to add words and scores to the sentiment lexicon. If you generate a lexicon file first, you can see how the built-in words are scored as a guide to assigning scores to additional words. You can also change the scores for existing words this way.

Installation using the STATS PACKAGE INSTALL extension command for Statistics version 27 or later

STATS PACKAGE INSTALL was released in August 2021 and is the preferred way to install this extension.  Depending on your system setup, you might need to do the following steps in Administrator mode.  That is more likely with version 27.

  • Go to Extensions > Extension Hub and install the following packages (some might already be installed)
    • STATS TEXTANALYSIS
    • SPSSINC TRANS
    • STATS PACKAGE INSTALL
  • Open the PACKAGE INSTALL dialog in Statistics via Extensions > Install Python and R modules and specify the following Python packages
    • nltk
    • pyspellchecker
  • Run this code in a syntax window
    begin program python3.
    import nltk
    nltk.download()
    end program.
    This will bring up a window listing the packages available for nltk. Click on All Packages and choose at least names, stopwords, wordnet, and vader_lexicon.
  • Optionally, go to spelling dictionary and extract the words.txt file from words.zip.  Specify that location when you run the procedure.

Installation with SPSS Statistics Version 27

This procedure requires several additional items. After installing it, do the following. Depending on your system setup, you might need to do these steps in Administrator mode.

  • Make sure that you have a registered Python 3 distribution matching the Statistics version you are using. For Statistics version 27, that would be Python 3.8. If you don't have this, go to Python Software Foundation and install from there. Don't install this over the distribution installed with Statistics. After installing it, go to Edit > Options > Files in Statistics and set this location for Python 3.
  • Open a command window, cd to the location of the Python installation, and install nltk and pyspellchecker from the PyPI site:
    pip install nltk
    pip install pyspellchecker
  • Start Python from that location and run this code.
    import nltk
    nltk.download()
    This will display a table of items you can add to your installation. Select at least names,  stopwords, and vader_lexicon.
  • Optionally, go to spelling dictionary as mentioned above and extract the words.txt file from words.zip.  Specify that location when you run the procedure.
  • Install the SPSSINC TRANS extension command via the Statistics Extensions > Extension Hub menu.

Installation with SPSS Statistics Version 28 and Later

This procedure requires several additional items. After installing it, do the following. Depending on your system setup, you might need to do these steps in Administrator mode.  It is no longer necessary to install a separate Python distribution.

  • Start Statistics. You might need to run it as Administrator depending on your security settings.
  • Open a syntax window and run the following commands
    host command='"spssloc\statisticspython3.bat" -m pip install nltk'.
    host command='"spssloc\statisticspython3.bat" -m pip install pyspellchecker'.
    • Replace spssloc with the full path of the location where SPSS Statistics is installed on your system.
    • Be sure to use the single quote character (') for the outer quotes. The double quotes (") are not necessary unless the SPSS location contains any blanks.
  • Run this code
    begin program python3.
    import nltk
    nltk.download()
    end program.
    This will bring up a window listing the packages available for nltk. Click on All Packages and choose at least names, stopwords, and vader_lexicon.
  • Optionally, go to spelling dictionary and extract the words.txt file from words.zip.  Specify that location when you run the procedure.
  • Install the SPSSINC TRANS extension command via the Statistics Extensions > Extension Hub menu.

Acknowledgements

STATS TEXTANALYSIS relies on open source tools for language processing and spelling correction.

Language processing:

  • nltk (Natural Language Toolkit): tokenization, stemming, stopword lists, named-entity recognition, and the VADER sentiment lexicon

Spelling:

  • Author: Tyler Barrus
  • Peter Norvig's blog post on setting up a simple spell checking algorithm
  • P Lison and J Tiedemann, 2016, OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016)

 

 

 

 

