SPSS Statistics

  • 1.  delete thousands of rows from very large dataset

    Posted Tue January 10, 2023 04:13 PM

    I am using v28 of SPSS. I have 296 data files that I am trying to clean before merging into one dataset. Each file has about 300,000 rows of data, but I only need the rows that have values in at least one of 6 different variables (about 20% of the rows). I am trying to determine if any members of this community have ever had to *delete* (not simply select out) that many rows of data in this many data files. If so, menu steps or sample code would be much appreciated. Thanks!



    ------------------------------
    becky allred
    ------------------------------

    #SPSSStatistics


  • 2.  RE: delete thousands of rows from very large dataset

    Posted Tue January 10, 2023 05:13 PM
    This sounds like a typical SELECT IF command.  Something like
    SELECT IF NMISSING(x1 TO x6) <= 5.
    where x1 TO x6 stands for the six variables; replace it with an explicit variable list if they are not consecutive in the file.
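
    As a fuller sketch of that idea, the lines below keep any row with a value in at least one of the six variables and then save the result, which forces the data pass. The variable names are the ones that appear later in this thread; both file paths are hypothetical placeholders.

    ```spss
    * Hypothetical input path; substitute one of the 296 files.
    GET FILE='C:\data\raw\file001.sav'.
    * NVALID counts the non-missing values among the listed variables.
    SELECT IF NVALID(GazepointX, GazepointY, GazepointleftX, GazepointleftY,
        GazepointrightX, GazepointrightY) > 0.
    * SAVE reads the data, so the unselected rows are dropped at this point.
    SAVE OUTFILE='C:\data\clean\file001.sav'.
    ```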






  • 3.  RE: delete thousands of rows from very large dataset

    Posted Tue January 10, 2023 06:29 PM
    Thank you for responding, Jon.

    I had assumed this wouldn't work because, when I tried it, the rows that were not selected appeared to remain in the data file. They were not deleted. I really need to delete the rows so that the files are smaller and the 296 files can then be merged.

    I've considered a split file command, but that apparently only works with categorical variables. I may be overlooking something, though. I welcome more ideas!





  • 4.  RE: delete thousands of rows from very large dataset

    Posted Tue January 10, 2023 07:10 PM
    The rows will be deleted if you use SELECT IF, but this won't happen until the next data pass, which could be triggered by a SAVE command or a procedure.

    Don't confuse SELECT IF with FILTER. The latter just, well, filters: the unselected cases stay in the file, and procedures simply don't see them.
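
    To make the distinction concrete, here is a minimal sketch. The flag variable keep and the output path are hypothetical; the gaze variable names are the ones used later in this thread.

    ```spss
    * Flag rows that have a value in at least one gaze variable (1 = yes, 0 = no).
    COMPUTE keep = NVALID(GazepointX, GazepointY, GazepointleftX, GazepointleftY,
        GazepointrightX, GazepointrightY) > 0.

    * FILTER only hides the flagged-out rows from procedures; they stay in the file.
    FILTER BY keep.
    DESCRIPTIVES VARIABLES=GazepointX.
    FILTER OFF.

    * SELECT IF drops them for good once the data are next read, here by SAVE.
    SELECT IF keep = 1.
    SAVE OUTFILE='C:\data\clean\file001.sav' /DROP=keep.
    ```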






  • 5.  RE: delete thousands of rows from very large dataset

    Posted Tue January 10, 2023 09:32 PM
    Thank you sooo much. You've saved me months of work.

    This was the final syntax I used:

    RECODE GazepointX GazepointY GazepointleftX GazepointleftY GazepointrightX GazepointrightY
        (SYSMIS=9999).
    EXECUTE.
    * I suppose I could have avoided this by using "." instead of 9999 below.

    SELECT IF (GazepointX) ~= 9999.
    SORT CASES BY Eyetrackertimestamp(A).

    SELECT IF (GazepointY) ~= 9999.
    SORT CASES BY Eyetrackertimestamp(A).

    SELECT IF (GazepointleftX) ~= 9999.
    SORT CASES BY Eyetrackertimestamp(A).

    SELECT IF (GazepointleftY) ~= 9999.
    SORT CASES BY Eyetrackertimestamp(A).

    SELECT IF (GazepointrightX) ~= 9999.
    SORT CASES BY Eyetrackertimestamp(A).

    SELECT IF (GazepointrightY) ~= 9999.
    SORT CASES BY Eyetrackertimestamp(A).





  • 6.  RE: delete thousands of rows from very large dataset

    Posted Tue January 10, 2023 09:47 PM
    I don't understand the logic here.
    - The EXECUTE command is of no use.  It just forces a data pass that would have taken place anyway, which wastes time.
    - Why the SORT commands?  They don't affect the SELECTs, and since the sort order is not affected by the other code, there is no reason to keep sorting by the same variable, anyway.
    - There is no need for the RECODE.  You could just use ~ MISSING(GazepointX) etc.
    - All those SELECT commands could be combined into one single SELECT IF.  If you didn't have those SORTs in between, it would actually combine all those into just one data pass.
    - The logic seems to be to select cases that are complete on all the variables, so you could just do
    SELECT IF NMISSING(GazepointX, GazepointY, ...) EQ 0.
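
    Put together, the suggestions above might look like the sketch below. The six variable names and the sort key are taken from the syntax posted earlier in the thread; the output path is a hypothetical placeholder.

    ```spss
    * One SELECT IF keeps only rows that are complete on all six variables.
    SELECT IF NMISSING(GazepointX, GazepointY, GazepointleftX, GazepointleftY,
        GazepointrightX, GazepointrightY) EQ 0.
    * A single sort is enough; it also triggers the data pass that drops the rows.
    SORT CASES BY Eyetrackertimestamp(A).
    SAVE OUTFILE='C:\data\clean\file001.sav'.
    ```

    If the goal is instead to keep rows with a value in at least one of the six variables, as in the original question, the condition would be NMISSING(...) LT 6 over the same list.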






  • 7.  RE: delete thousands of rows from very large dataset

    Posted Wed January 11, 2023 07:13 AM
    I apologize. It's probably not logic so much as pragmatics. It worked and I thought I'd share that. However, I agree the code could be cleaner.
    • EXECUTE is probably just leftover from the menu-driven steps I took to generate the correct variable names without typos.
    • The SORT commands are necessary for my next step, but may not apply to others. (I just included them as a convenient way to make the SELECT IF commands actually delete the unwanted rows since the selection/deletion doesn't take effect until certain commands are executed.)
    • I used RECODE because NMISSING didn't work for my variables. I'm not sure why.
    I'm recording these edits in case they are helpful to others. I really appreciate you helping me solve this problem, even if I did it a bit clumsily.

    Best wishes,