Hi,
I'm using a custom Python script with a regular expression to parse a numeric value from a string variable in an SPSS data set.
See my related question, "Using regular expressions in SPSS", for details, but it boils down to calling a Python function with a command similar to:
SPSSINC TRANS RESULT=Zip TYPE=5 /FORMULA mymodule.SearchZip(Address).
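For context, a function like `mymodule.SearchZip` could look roughly like the sketch below. This is not my exact code: the five-digit pattern and the empty-string fallback are assumptions for illustration. The pattern is compiled once at import time so that the per-row call only does the search.

```python
import re

# Assumption: a zip code is a standalone run of exactly five digits.
# Compiled once at module import, not on every row.
_ZIP_RE = re.compile(r"\b(\d{5})\b")


def SearchZip(address):
    """Return the first five-digit group found in address, or '' if there is none."""
    if address is None:
        return ""
    match = _ZIP_RE.search(address)
    return match.group(1) if match else ""
```

With `TYPE=5` the result variable is a string of width 5, so returning a five-character string (or an empty one) fits that declaration.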
While this works correctly, I noticed that on larger data sets (150K+ entries) it takes a substantial amount of time.
When I run the Python script without SPSS on a .csv file, it is considerably faster. I can imagine that there is an overhead in calling an external Python function for each SPSS row, as opposed to calling a Python function from within a Python script.
My situation is that I'm an experienced developer but don't know much about SPSS. From time to time I help a doctor who knows SPSS and statistics but does not know anything about programming, so I usually write him small helper Python scripts which are then called from SPSS.
Now my question is: is SPSS also meant for preprocessing data before analysing it? Or is that something that should generally be done beforehand, outside of SPSS?
For instance:
- Export the data from SPSS to CSV
- Run the Python script on the CSV
- Import the cleaned CSV back into SPSS
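The middle step of that workflow could be sketched as follows, using only the standard library. The column names `Address` and `Zip`, the five-digit pattern, the UTF-8 encoding, and the function name `add_zip_column` are all assumptions made for illustration, not something fixed by SPSS or by my actual data.

```python
import csv
import re

# Assumption: a zip code is a standalone run of exactly five digits.
ZIP_RE = re.compile(r"\b(\d{5})\b")


def add_zip_column(in_path, out_path, source="Address", target="Zip"):
    """Copy a CSV file, appending a column with the zip parsed from `source`.

    Returns the number of data rows written.
    """
    rows_written = 0
    with open(in_path, newline="", encoding="utf-8") as src, \
         open(out_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        fieldnames = list(reader.fieldnames or []) + [target]
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        for row in reader:
            # Missing or empty addresses yield an empty zip value.
            match = ZIP_RE.search(row.get(source) or "")
            row[target] = match.group(1) if match else ""
            writer.writerow(row)
            rows_written += 1
    return rows_written
```

On the SPSS side, the export and import steps would presumably be done with `SAVE TRANSLATE` and `GET DATA`, so the whole round trip can still be started from a syntax file.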
cheers
Jan
#SPSSStatistics