SPSS Statistics

  • 1.  #SPSS Statistics - quantifying over/underrepresentation of sample data vs population data

    Posted Thu September 07, 2023 09:23 AM

    How to measure under/overrepresentation of results in SPSS?

    I run an employee survey. I want to assess:

    1. to what extent the results reflect the population demographics, overall (is there a single indicator?)
    2. which groups are significantly overrepresented or underrepresented in the results

    I know the population demographics (= how many employees are in each subgroup), and I also have the responses (=sample) by subgroup. 

    So, I can easily calculate what the response rate (percentage) is by subgroup and overall. 

    I can also run a Chi-Square analysis within each demographic category to check whether certain profiles have significantly higher response rates than others. For example: do certain functions respond more than other functions, do people with low seniority respond less frequently than people with longer tenure, etc.

    The question:

    But how can I quantify to what extent certain groups (functions, seniority levels, …) are overrepresented or underrepresented in the survey results as a whole?

    Imagine visually two pie charts. One with the population distribution over functions, seniority levels, mgr/employee, location… all in one pie chart. And another one with the distribution of the same demographics in the sample (the responses). I guess, I need to compare the proportions of the population pie with the proportions of the sample pie. 
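    The "two pie charts" comparison can be sketched for a single demographic variable as follows. This is a minimal illustration in Python rather than SPSS syntax; the group names and counts are hypothetical, and the function name `representation` is made up for this example. It computes a representation ratio per group (sample share divided by population share), a Pearson residual per group, and a chi-square goodness-of-fit statistic as one overall indicator.

```python
# Hypothetical head counts for one demographic variable (not data from this thread).
population = {"Sales": 400, "Engineering": 500, "Support": 100}
sample = {"Sales": 90, "Engineering": 80, "Support": 30}


def representation(population, sample):
    """Return {group: (ratio, residual)} and the chi-square goodness-of-fit statistic."""
    pop_total = sum(population.values())
    n = sum(sample.values())
    per_group = {}
    chi_square = 0.0
    for group, pop_count in population.items():
        # Respondent count we would expect if the sample mirrored the population.
        expected = n * pop_count / pop_total
        observed = sample.get(group, 0)
        ratio = observed / expected  # > 1: overrepresented, < 1: underrepresented
        residual = (observed - expected) / expected ** 0.5  # Pearson residual
        chi_square += residual ** 2
        per_group[group] = (ratio, residual)
    return per_group, chi_square


per_group, chi_square = representation(population, sample)
for group, (ratio, residual) in per_group.items():
    print(f"{group}: ratio={ratio:.2f}, residual={residual:+.2f}")
print(f"chi-square = {chi_square:.2f} on {len(population) - 1} degrees of freedom")
```

    A ratio above 1 marks an overrepresented group and a ratio below 1 an underrepresented one. Residuals beyond roughly plus or minus 2 are a common rule of thumb for flagging groups that deviate noticeably, and the chi-square statistic summarizes the mismatch for the variable as a whole.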

    I have difficulty putting that into practice in SPSS. How to organize my datasheet? Do I need to use an intermediate step through OMS? What analysis to choose? Any ideas?

    A related question: is something like "post-stratification" possible in SPSS? In other words, if groups are underrepresented or overrepresented, can I scale their results up or down so that they reflect the population more accurately?
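    For reference, the basic idea of post-stratification on one variable can be sketched like this. It is a minimal Python illustration with hypothetical counts and hypothetical group means, not an SPSS procedure: each respondent in a group gets the weight population share divided by sample share, so that the weighted sample reproduces the population proportions.

```python
# Hypothetical counts and mean engagement scores per group (illustration only).
population = {"Sales": 400, "Engineering": 500, "Support": 100}
sample = {"Sales": 90, "Engineering": 80, "Support": 30}
group_means = {"Sales": 4.0, "Engineering": 3.0, "Support": 3.5}


def poststratification_weights(population, sample):
    """Weight per respondent in each group: population share / sample share."""
    pop_total = sum(population.values())
    n = sum(sample.values())
    return {
        group: (population[group] / pop_total) / (sample[group] / n)
        for group in population
    }


weights = poststratification_weights(population, sample)
n = sum(sample.values())

# Unweighted mean: each group counts in proportion to its respondents.
unweighted_mean = sum(sample[g] * group_means[g] for g in sample) / n

# Weighted mean: each group counts in proportion to its population share.
weighted_total = sum(weights[g] * sample[g] for g in sample)
weighted_mean = sum(weights[g] * sample[g] * group_means[g] for g in sample) / weighted_total

print(weights)
print(round(unweighted_mean, 3), round(weighted_mean, 3))
```

    Weights above 1 scale up underrepresented groups and weights below 1 scale down overrepresented ones. The weights sum to the original sample size, so the weighted base stays the same.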



    ------------------------------
    Jos Blykers
    IBM HR BD
    ------------------------------


  • 2.  RE: #SPSS Statistics - quantifying over/underrepresentation of sample data vs population data

    IBM Champion
    Posted Thu September 07, 2023 09:54 AM
    One simple way to get at this would be to use the SPSSINC RAKE extension command (Data > Rake Weights), entering the counts or proportions for each variable of interest, up to 10 variables at a time.  Then look at the extremely large or extremely small weights to see the properties of the under- or overrepresented cases.  This will consider all the prediction groups together, but if you wanted to look at the representation of just a single variable, weight by that one alone and follow the same procedure.  Very large weights indicate underrepresented values and very small ones overrepresented values.

    Note that after you run RAKE, the newly generated weights will be on, so you would want to turn weights off if looking at the output weight distribution.

    Another helpful view would be to ask RAKE to produce a heatmap of the weights, for up to four variables, which would show graphically how the weights are distributed across the categories of the raking variables.
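    To make the idea behind raking concrete, here is a minimal Python sketch of iterative proportional fitting on two variables. It is an illustration of the general technique under stated assumptions, not the implementation used by SPSSINC RAKE; the function name `rake`, the category labels, and all counts are hypothetical. Weights start at 1 and are rescaled one variable at a time until the weighted margins match the population targets.

```python
def rake(cell_counts, row_targets, col_targets, iterations=100):
    """Return a weight per (row, col) cell so weighted margins match the targets.

    cell_counts maps (row, col) to the number of respondents in that cell.
    Targets may be population counts or proportions; they are rescaled to the
    sample size. Assumes every target category has at least one respondent.
    """
    n = sum(cell_counts.values())
    goals = [
        {key: n * t / sum(row_targets.values()) for key, t in row_targets.items()},
        {key: n * t / sum(col_targets.values()) for key, t in col_targets.items()},
    ]
    weights = {cell: 1.0 for cell in cell_counts}
    for _ in range(iterations):
        for axis, goal in enumerate(goals):
            for category, target in goal.items():
                cells = [c for c in cell_counts if c[axis] == category]
                current = sum(weights[c] * cell_counts[c] for c in cells)
                for c in cells:
                    weights[c] *= target / current  # rescale this margin to its target
    return weights


# Hypothetical respondents by function and role.
cell_counts = {
    ("Sales", "Manager"): 10,
    ("Sales", "Contributor"): 80,
    ("Engineering", "Manager"): 20,
    ("Engineering", "Contributor"): 90,
}
# Hypothetical population counts for each variable separately.
weights = rake(
    cell_counts,
    row_targets={"Sales": 400, "Engineering": 600},
    col_targets={"Manager": 100, "Contributor": 900},
)
for cell, weight in weights.items():
    print(cell, round(weight, 3))
```

    Cells with weights well above 1 belong to combinations that are underrepresented among respondents, and cells with weights well below 1 to overrepresented ones, which is the reading suggested above.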





  • 3.  RE: #SPSS Statistics - quantifying over/underrepresentation of sample data vs population data

    Posted Fri September 08, 2023 11:56 AM

    Thank you very much, Jon. That has helped me a lot. I discovered along the way that you are actually an author of a paper on the subject!

    Let me put down a few of my learnings for the benefit of whoever is looking at doing the same. And, Jon, correct me if I state errors below.

    • Variable values are specified in their numeric SPSS coding (e.g. 10012). The syntax would read, for example: SPSSINC RAKE DIM1 = var3 10001 9 10002 71 (sample has 9 managers and 71 individual contributors in this case)
    • It is not necessary to separately calculate the proportions (percentages) in the population [which I had done at first], putting in the numbers for each value is sufficient. SPSS will calculate based on the total population figure.  See example above.
    • The "Sample Balance" which is obtained indicates how well the weighted sample matches the population proportions, 100 being a perfect match. Additional question: how can I find out how far the unweighted sample was off?  It would be nice to know that the weighting moved the needle from 60 to 92, for example.
    • True, the heatmap and raked weights table give an indication of how strongly certain variables had to be overweighted or underweighted to produce the new balanced sample. But is there a single number that is comparable to the balanced sample? Something like an unbalanced sample?
    • I have not yet figured out exactly how the PANELDOWNVAR=varname PANELACROSS=varname work precisely for plotting the heatmap, but that is a matter of trying out.
    • There are two ways to turn weighting off again. The brute-force one is to delete (=clear) the weighting variable from the dataset. The other is to go to Data – Weight Cases [not Rake Weights], select Do not weight cases, and click OK.
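    On the question of a single number for how far the unweighted sample is off: one candidate is the index of dissimilarity, half the sum of absolute differences between sample shares and population shares. The Python sketch below is only an illustration with hypothetical counts. It is not the formula behind RAKE's "Sample Balance", and the 0 to 100 "match score" is a convenience scale made up for this example.

```python
# Hypothetical counts for one demographic variable (illustration only).
population = {"Sales": 400, "Engineering": 500, "Support": 100}
sample = {"Sales": 90, "Engineering": 80, "Support": 30}


def dissimilarity(population, sample):
    """Half the sum of absolute share differences: 0 = identical, 1 = no overlap."""
    pop_total = sum(population.values())
    n = sum(sample.values())
    return 0.5 * sum(
        abs(sample.get(group, 0) / n - population[group] / pop_total)
        for group in population
    )


index = dissimilarity(population, sample)
match_score = 100 * (1 - index)  # made-up convenience scale, 100 = perfect match
print(round(index, 3), round(match_score, 1))
```

    The index can be read as the share of respondents who would have to be in a different group for the sample distribution to match the population.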

    Thanks again and open to further remarks.



    ------------------------------
    Jos Blykers
    ------------------------------



  • 4.  RE: #SPSS Statistics - quantifying over/underrepresentation of sample data vs population data

    IBM Champion
    Posted Fri September 08, 2023 01:00 PM
    I'm glad this was helpful.  Your questions made me think of the relationship between the RAKE procedure and a brand new extension command that I just finished, which does SMOTE sampling to make a sample that is unbalanced with respect to some target group proportions more balanced.  There are several ways to do this (15 are supported in this extension).  It takes only one outcome (group) variable and a list of input variables, and it generates new cases or removes cases based on the joint distribution of the input variables to match the requested proportions.  The resulting sample is, of course, artificial, especially if oversampling is used, but the generated cases are consistent with the inputs.  So in a way it is an alternative to raking, and it would be interesting to compare results with the two approaches.

    SMOTE stands for Synthetic Minority Oversampling Technique.  The extension also covers synthetic undersampling and combinations of both.  Together these are referred to as resampling.  These methods improve the balance of the dataset with respect to the target or dependent variable.
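    The core SMOTE idea can be sketched in a few lines of Python: pick a case from the minority group, pick one of its nearest minority neighbours, and create a new case at a random point on the line between them. This is a minimal illustration of the basic technique only, assuming numeric input variables; it is not the code behind STATS IMBALANCED, and the function name `smote` and the data points are hypothetical.

```python
import random


def smote(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating between minority cases.

    minority is a list of numeric tuples with at least two cases.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base_index = rng.randrange(len(minority))
        base = minority[base_index]
        others = [p for i, p in enumerate(minority) if i != base_index]
        # k nearest minority neighbours of the base case (squared Euclidean distance).
        neighbours = sorted(
            others,
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Hypothetical minority-group cases with two numeric inputs (e.g. tenure, age).
minority = [(1.0, 25.0), (2.0, 30.0), (1.5, 28.0), (3.0, 35.0)]
synthetic = smote(minority, n_new=5)
for point in synthetic:
    print(tuple(round(value, 2) for value in point))
```

    Because every synthetic case lies between two real minority cases, it stays consistent with the observed inputs, but it remains artificial data.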

    STATS IMBALANCED is not yet available on the Extension Hub, but it can be downloaded from here
    and installed via Extensions > Install local extension bundle.

    The balance measure in RAKE shows how much variation in the weight is required to meet the target proportions, so it really measures the mismatch of the proportions in the input sample.  Assuming that the requested proportions can be met, which isn't always possible, then the balance statistic would be the measure you need.
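    One common way to turn "how much variation in the weights" into a single number is Kish's effective sample size, (sum of weights) squared divided by the sum of squared weights, expressed as a percentage of the actual sample size. The Python sketch below is offered as an assumption about the kind of statistic involved; whether RAKE's balance measure uses exactly this formula is not confirmed here. The weights and group sizes are hypothetical.

```python
def weighting_efficiency(case_weights):
    """Kish effective sample size as a percentage of the actual sample size."""
    n = len(case_weights)
    total = sum(case_weights)
    total_squares = sum(w * w for w in case_weights)
    effective_n = total * total / total_squares
    return 100 * effective_n / n


# Hypothetical case weights: 90 respondents at 8/9, 80 at 1.25, 30 at 2/3.
case_weights = [8 / 9] * 90 + [1.25] * 80 + [2 / 3] * 30
efficiency = weighting_efficiency(case_weights)
print(round(efficiency, 1))

# Equal weights mean no adjustment was needed, so the efficiency is 100.
print(round(weighting_efficiency([1.0] * 200), 1))
```

    The more the weights have to vary to hit the targets, the lower this percentage, so a low value points to a sample that was far from the population proportions before weighting.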

    Note also that the CTABLES procedure supports effective base weighting, which fits with the weights generated by RAKE.

    One usage point on RAKE: using the dialog box can be awkward and error-prone if the controls have a lot of categories, so you can use syntax instead to specify the categories and proportions or counts.  I once had a client who had 1500 categories (stores) and really needed that feature.






  • 5.  RE: #SPSS Statistics - quantifying over/underrepresentation of sample data vs population data

    Posted 5 hours ago

    Hi Jon, @Jon Peck

    Almost a year later, I am only now starting to play with STATS IMBALANCED (and comparing it to Rake Weights).

    It looks like I need to get the basics right first (and then experiment with methods).

    So, this SMOTE-inspired rebalancing takes only one outcome (group) variable and a list of input variables and generates new cases or removes cases based on the joint distribution of the input variables to match the requested proportions.

    1. What is meant by outcome variable? I assume it is not what we refer to as the outcome variable in a regression, since we would discourage using inference on synthetic models, wouldn't we?

    My use case is employee surveys, with outcome variables (take: employee engagement), other variables as contributing factors, and demographic variables. I need to cope with overrepresentation of certain demographics in the sample (the respondents), for example recent joiners or certain functions, and correct this based on known proportions in the population.

    I am using an ordinal variable as the outcome and nominal variables (the demographic variables) as independent variables. The icons suggest this is okay.

    2. Which variables would I put where, and where do I define the known proportions? (I assumed this is done in the List of class sizes, in the Resampling Strategy.)

    3. Reading the help section, I had expected an additional dataset to be created, for which I provided a name, but that does not seem to happen; at least it doesn't show.

    4. Is there a command similar to WEIGHT OFF (in Rake Weights) when running STATS IMBALANCED?

    Happy to read your suggestions to these "how to" questions, and understanding that break/fix is always a priority.

    Jos

    Running Windows – SPSS 29.02.0



    ------------------------------
    Jos Blykers
    ------------------------------