Is using Python for parsing a value from data in SPSS the right approach?


1. Is using Python for parsing a value from data in SPSS the right approach?

Jan C
Posted Sat November 13, 2021 12:49 PM

Hi,
I'm using a custom Python script with a regular expression to parse a numeric value from a string variable in an SPSS data set. See my related question for details, but it boils down to running a Python script similar to the one in "Using regular expressions in SPSS":

SPSSINC TRANS RESULT=Zip TYPE=5 /FORMULA mymodule.SearchZip(Address).
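
For readers who have not seen the linked example, a helper like `mymodule.SearchZip` might look roughly like the sketch below. The actual pattern is not shown in this thread, so the five-digit pattern is an assumption (chosen to match the `TYPE=5` string width), as is the choice to return an empty string when nothing matches.

```python
import re

# Hypothetical pattern: a standalone run of exactly five digits.
# The real pattern used in mymodule is not shown in this thread.
ZIP_PATTERN = re.compile(r"(?<!\d)\d{5}(?!\d)")


def SearchZip(address):
    """Return the first five-digit group in address, or "" if none is found."""
    if address is None:
        return ""
    match = ZIP_PATTERN.search(address)
    return match.group(0) if match else ""
```

Compiling the pattern once at module level means the regular expression is not rebuilt for every case.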

While this works correctly, I noticed that on larger data sets (150K+ entries) it takes a substantial amount of time.

When I run the Python script without SPSS on a .csv file, it is considerably faster. I can imagine that there is an overhead in calling an external Python function for each SPSS row, as opposed to calling a Python function from within a Python script.

My situation is that I'm an experienced developer but don't know much about SPSS, and from time to time I help a doctor who knows SPSS and statistics but nothing about programming. So I usually write him some small helper Python scripts, which are then called from SPSS.

Now my question is: is SPSS also meant to preprocess data before analysing them? Or is that something that should generally be done before and outside of SPSS?
For instance:
- Export the data from SPSS to csv
- Run the python script on the csv
- import the cleaned csv back to SPSS
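
The middle step of that csv route could be sketched with the standard library alone. This is only an illustration under assumptions: the column names `Address` and `Zip` come from the command above, while the five-digit pattern and the function name `add_zip_column` are hypothetical.

```python
import csv
import re

# Hypothetical pattern: a standalone run of exactly five digits.
ZIP_PATTERN = re.compile(r"(?<!\d)\d{5}(?!\d)")


def add_zip_column(src, dst, source_col="Address", target_col="Zip"):
    """Copy csv rows from src to dst, appending a column parsed from source_col.

    src and dst are open text file objects. Returns the number of data rows written.
    """
    reader = csv.DictReader(src)
    fieldnames = list(reader.fieldnames or []) + [target_col]
    writer = csv.DictWriter(dst, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    count = 0
    for row in reader:
        # Treat a missing or empty source value as "no match".
        match = ZIP_PATTERN.search(row.get(source_col) or "")
        row[target_col] = match.group(0) if match else ""
        writer.writerow(row)
        count += 1
    return count
```

It would be called with two file objects opened with `newline=""`, one for the exported csv and one for the cleaned output. Rows are streamed one at a time, so memory use stays flat even for 150K+ entries.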

cheers
Jan


#SPSSStatistics
2. RE: Is using Python for parsing a value from data in SPSS the right approach?

IBM Champion

Jon Peck
Posted Sat November 13, 2021 01:32 PM

Case processing through the Dataset class is slow, especially when writing values back. SPSSINC TRANS uses that interface because of its flexibility. The spssdata module provides faster case handling but cannot do some of the things that extension provides, and creating new variables with it is more complicated and more limited than with the Dataset class.

The overhead of writing back with the Dataset class depends strongly on how many variables are in the active data file. With very wide files, it can be better to create a new dataset, write to it, and merge the result with the main file using MATCH FILES, which is very fast.

The trouble with going the csv route is that you lose all the variable metadata. So instead, you would first use SAVE in Statistics with a KEEP subcommand to create a data file with just the necessary input variables, then open it and make it the active dataset. Run SPSSINC TRANS or whatever you need on that and, finally, write the new variable(s) back to the main dataset with MATCH FILES. That will actually be considerably faster if the active file is wide.

There is a new set of case-passing APIs, just introduced in the 28.0.1 release, that was designed with performance in mind. I haven't had a chance to benchmark it yet, but I expect it to be much, much faster. However, it requires rewriting existing code. I will probably convert SPSSINC TRANS to use it, but I don't know when I will get to that.


3. RE: Is using Python for parsing a value from data in SPSS the right approach?

Jan C
Posted Sat November 13, 2021 05:06 PM

Hi Jon
thanks for your answer.

I'm currently looking into the Python pandas library, which is probably what I could use to clean and preprocess the data by reading from and writing to the .sav file directly.


4. RE: Is using Python for parsing a value from data in SPSS the right approach?

IBM Champion

Jon Peck
Posted Sat November 13, 2021 05:27 PM

Or write out a csv file at the end to read into Statistics.

I would be interested to hear how that goes.

--
Jon K Peck
jkpeck@gmail.com
