SPSS Statistics

  • 1.  Is using Python for parsing a value from data in SPSS the right approach?

    Posted Sat November 13, 2021 12:49 PM
    Hi,
    I'm using a custom Python script with a regular expression to parse a numeric value from a string variable in an SPSS data set. See my related question for details, but it boils down to running a Python script similar to the one in "Using regular expressions in SPSS":

    SPSSINC TRANS RESULT=Zip TYPE=5 /FORMULA mymodule.SearchZip(Address).
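
For context, a helper like the one named in the formula could look roughly like the sketch below. The actual module is not shown in this thread, so the pattern (a standalone five-digit code, chosen to match TYPE=5) and the function body are assumptions for illustration only.

```python
import re

# Assumption: the value to extract is a standalone five-digit code.
# The real pattern used in mymodule is not shown in this thread.
_ZIP_PATTERN = re.compile(r"\b(\d{5})\b")


def SearchZip(address):
    """Return the first five-digit code found in address, or '' if none."""
    if address is None:
        return ""
    match = _ZIP_PATTERN.search(address)
    return match.group(1) if match else ""
```

Compiling the pattern once at module level avoids recompiling it for every case, although most of the run time discussed here comes from passing cases between SPSS and Python rather than from the regular expression itself.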

    While this works correctly, I noticed that on larger data sets (150K+ entries) it takes a substantial amount of time.

    When I run the Python script without SPSS on a .csv file, it is considerably faster. I imagine there is an overhead in calling an external Python function for each SPSS row, as opposed to calling a Python function from within a Python script.


    My situation is that I'm a fluent developer but don't know much about SPSS, and from time to time I help a doctor who knows SPSS and statistics but nothing about programming. So I usually write him small helper Python scripts, which are then called from SPSS.


    Now my question is: is SPSS also meant for preprocessing data before analysing it? Or is that something that should generally be done beforehand, outside of SPSS?
    For instance:
    - Export the data from SPSS to csv
    - Run the Python script on the csv
    - Import the cleaned csv back into SPSS
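
The middle step of that round trip could be sketched as follows, using only the standard library. The column names `Address` and `Zip`, the five-digit pattern, and the function name `add_zip_column` are assumptions for illustration; the export from and import back into SPSS are not covered here.

```python
import csv
import re

# Assumption: the value to extract is a standalone five-digit code.
ZIP_PATTERN = re.compile(r"\b(\d{5})\b")


def add_zip_column(in_path, out_path, source_col="Address", target_col="Zip"):
    """Copy a csv file, adding target_col parsed from source_col.

    Returns the number of data rows written.
    """
    with open(in_path, newline="", encoding="utf-8") as src, \
            open(out_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        fieldnames = list(reader.fieldnames or [])
        if target_col not in fieldnames:
            fieldnames.append(target_col)
        # extrasaction="ignore" skips stray values from over-long rows.
        writer = csv.DictWriter(dst, fieldnames=fieldnames,
                                extrasaction="ignore")
        writer.writeheader()
        count = 0
        for row in reader:
            match = ZIP_PATTERN.search(row.get(source_col) or "")
            row[target_col] = match.group(1) if match else ""
            writer.writerow(row)
            count += 1
    return count
```

Rows are streamed one at a time, so memory use stays flat even for files well beyond 150K entries.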

    cheers
    Jan






    ------------------------------
    Jan
    ------------------------------

    #SPSSStatistics


  • 2.  RE: Is using Python for parsing a value from data in SPSS the right approach?

    IBM Champion
    Posted Sat November 13, 2021 01:32 PM
    Case processing through the Dataset class is slow, especially writing values back.  SPSSINC TRANS uses that interface because of its flexibility.  The spssdata module provides faster case handling but cannot do some of the things provided in that extension.  Creating new variables is more complicated and limited than what you can do with the Dataset class.

    The overhead in writing back with the Dataset class depends strongly on how many variables are in the active data file, so with very wide files, it can be better to create a new dataset, write to it, and merge the result with the main file using MATCH FILES, which is very fast.  The trouble with going the csv route is that you lose all the variable metadata.  So instead, you would first use SAVE in Statistics with a KEEP subcommand to create a data file with just the necessary input variables; then open it and make it the active dataset.  Run SPSSINC TRANS or whatever you need on that, and, finally, write the new variable(s) back to the main dataset with MATCH FILES.  That will actually be considerably faster if the active file is wide.
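
The SAVE / KEEP and MATCH FILES workflow described above might be sketched as the syntax assembled below. This is an illustration only: the dataset names `main` and `narrow`, the temporary file path, the key variable `id`, and the helper name `build_narrow_file_syntax` are all assumptions, and the BY subcommand assumes the main file is already sorted by that key.

```python
def build_narrow_file_syntax(temp_path, key_var, input_vars, transform_command):
    """Return SPSS syntax that runs a transformation on a narrow copy.

    temp_path:         where the narrow .sav file is written (hypothetical)
    key_var:           variable used to match cases back; the main file
                       is assumed to be sorted by it
    input_vars:        variables the transformation needs as input
    transform_command: a complete command, e.g. an SPSSINC TRANS call
    """
    keep_list = " ".join([key_var] + list(input_vars))
    lines = [
        "DATASET NAME main.",                        # name the wide file
        f"SAVE OUTFILE='{temp_path}' /KEEP={keep_list}.",
        f"GET FILE='{temp_path}'.",                  # narrow copy is now active
        "DATASET NAME narrow.",
        transform_command,                           # runs on the narrow copy
        "DATASET ACTIVATE main.",
        f"MATCH FILES /FILE=* /FILE=narrow /BY {key_var}.",
        "EXECUTE.",
        "DATASET CLOSE narrow.",
    ]
    return "\n".join(lines)


syntax = build_narrow_file_syntax(
    "narrow_tmp.sav",
    "id",
    ["Address"],
    "SPSSINC TRANS RESULT=Zip TYPE=5 /FORMULA mymodule.SearchZip(Address).",
)
print(syntax)
```

The resulting text can be pasted into a syntax window, or passed to `spss.Submit` from a Python program running inside Statistics.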

    There is a new set of case-passing APIs, just introduced in the 28.0.1 release, that was designed with performance in mind.  I haven't had a chance to benchmark it yet, but I expect it to be much, much faster.  However, it requires rewriting existing code.  I will probably convert SPSSINC TRANS to use it, but I don't know when I will get to that.

    ------------------------------
    Jon Peck
    ------------------------------



  • 3.  RE: Is using Python for parsing a value from data in SPSS the right approach?

    Posted Sat November 13, 2021 05:06 PM
    Hi Jon,
    thanks for your answer.

    I'm currently looking into the Python pandas library, which is probably what I could use to clean and preprocess the data by reading from and writing to the .sav file directly.


    ------------------------------
    Jan
    ------------------------------



  • 4.  RE: Is using Python for parsing a value from data in SPSS the right approach?

    IBM Champion
    Posted Sat November 13, 2021 05:27 PM
    Or write out a csv file at the end to read into Statistics.

    I would be interested to hear how that goes.
