SPSS Statistics

  • 1.  Data quality check for a multiple response question

    Posted Sun May 30, 2021 06:31 PM
    Edited by System Admin Fri January 20, 2023 04:14 PM
    Hey,

    I have a multiple response question which should be answered if a "yes" response was selected on a previous question. However, I have noticed data entry errors.

    Let's say that Q1 asks - do you have a medical condition? (coded as 1-yes, 2-no, 3-don't know)

    Q2 should be answered if 'yes' was selected for Q1 (each Q2 item is coded as 0-no, 1-yes)
    1. back problems
    2. high blood pressure
    3. diabetes 
    4. other

    Some scenarios I have noticed...
    1. Yes was selected for Q1; however, no answer was provided for Q2
    2. No was selected for Q1; however, an answer(s) was provided for Q2
    3. Don't know was selected for Q1; however, an answer(s) was provided for Q2
    4. Q1 was left blank; however, an answer(s) was provided for Q2

    I am dealing with a big data set, so I would like to know how I can perform a quality check for the scenarios above. The data come from different states, so I would like to find the percentage of cases affected by these scenarios (if any) in each state. Based on that, I can set a threshold for disqualifying a state from the analysis.

    Any guidance is appreciated.

    Thanks. 

    #SPSSStatistics


  • 2.  RE: Data quality check for a multiple response question

    Posted Sun May 30, 2021 07:09 PM
    Such conditions can be defined as standard COMPUTE commands, but if you have the Data Validation option (now included in Base), you can define single and cross-variable conditions and get a set of reports on violations.

    Here is a compute that checks this particular condition, using the Q1 coding given above (1 = yes).
    compute invalid = (Q1 eq 1 and not any(1, Q2_1 to Q2_4)) or
    ((missing(Q1) or Q1 ne 1) and any(1, Q2_1 to Q2_4)).
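
    To get the share of affected cases per state, the single check above could be split into one flag per scenario and then summarized by state. This is only a minimal sketch: it assumes the Q2 items are named Q2_1 to Q2_4 and sit next to each other in the file, and that the state variable is called state. Neither name comes from the actual data. It also treats "no answer for Q2" as "no item coded 1", whether the items are 0 or missing.

    ```spss
    * Assumed names: Q1, Q2_1 to Q2_4 (adjacent in the file), state.
    * Count how many Q2 items are coded 1 (yes) for each case.
    COUNT n_yes = Q2_1 TO Q2_4 (1).

    * One 0/1 flag per scenario, initialized to 0 so every case is counted.
    COMPUTE s1 = 0.
    COMPUTE s2 = 0.
    COMPUTE s3 = 0.
    COMPUTE s4 = 0.
    IF (Q1 EQ 1 AND n_yes EQ 0) s1 = 1.
    IF (Q1 EQ 2 AND n_yes GT 0) s2 = 1.
    IF (Q1 EQ 3 AND n_yes GT 0) s3 = 1.
    IF (MISSING(Q1) AND n_yes GT 0) s4 = 1.
    EXECUTE.

    * The mean of a 0/1 flag is the proportion of flagged cases in each state.
    MEANS TABLES=s1 s2 s3 s4 BY state
      /CELLS=MEAN COUNT.
    ```

    Multiplying each mean by 100 gives the percentage of cases in a state that fall into that scenario, which can then be compared against whatever threshold is chosen.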






  • 3.  RE: Data quality check for a multiple response question

    Posted Mon May 31, 2021 09:08 PM
    Edited by System Admin Fri January 20, 2023 04:39 PM
    Hey,

    This is great!

    Thank you so much for your guidance:)



  • 4.  RE: Data quality check for a multiple response question

    Posted Thu June 03, 2021 10:04 PM
    Edited by System Admin Fri January 20, 2023 04:09 PM
    Hi again,

    Is it acceptable to use a multiple response question as factors in a multinomial logistic regression model?

    For instance, I am using medical conditions - it could be the case that the person selected 1 or multiple options.
    Each response represents a variable in SPSS - coded as apply (1) or does not apply (0). 

    How would logistic regression treat instances in which more than 1 condition was selected? Does that matter for the model? Maybe I should only use cases that have only one condition selected?

    Thanks in advance!





  • 5.  RE: Data quality check for a multiple response question

    Posted Fri June 04, 2021 01:34 AM
    Edited by System Admin Fri January 20, 2023 04:19 PM
    Hi, Rose.

    Multiple response sets, as such, can be used only in Custom Tables or the MULT RESPONSE command.

    Judging from your first post, it seems you have four component variables: back problems, high blood pressure, diabetes, or other. I assume that each is (or will be) coded 0-1. Given that, you could use them as independent variables in a logistic model in one of at least two ways.

    First, if you are interested in the number of medical conditions as causative (without respect to which particular ones any patient or respondent has), then you could sum the responses and use the count variable in the model. Second, if you are concerned with how the presence/absence of each particular illness affects the dependent measure, then entering each variable as a factor would be preferable (that also raises the issue of interactions between those factors).

    Of course, the use of those variables carries the usual assumptions - see this multinomial logistic example if you want more information.
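
    The two approaches described above could be set up roughly as follows. This is a sketch only: dv, back, hbp, diabetes, and other are placeholder variable names, not names from the actual file, and the PRINT keywords are just one reasonable choice of output.

    ```spss
    * Assumed names: dv (outcome), back, hbp, diabetes, other (0/1 items).

    * Option 1: number of reported conditions as a single covariate.
    COUNT n_cond = back hbp diabetes other (1).
    NOMREG dv WITH n_cond
      /PRINT=PARAMETER SUMMARY LRT.

    * Option 2: each condition entered as its own factor (main effects only).
    NOMREG dv BY back hbp diabetes other
      /PRINT=PARAMETER SUMMARY LRT.
    ```

    In option 2, a respondent who selected several conditions simply has a 1 on several factors, so cases with multiple selections do not need to be dropped.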

    ------------------------------
    Rick Marcantonio
    Quality Assurance
    IBM
    ------------------------------



  • 6.  RE: Data quality check for a multiple response question

    Posted Sat June 05, 2021 08:28 AM
    Edited by System Admin Fri January 20, 2023 04:34 PM
    Morning!

    You are correct - the data are coded as 0 and 1. My main interest is in the presence/absence of each particular condition and its effect on the dependent variable, as you mentioned. I had never thought about the possible effect of the number of selected conditions on the DV - that is something worth exploring as well.

    When you mention interactions between the factors, is that related to multicollinearity?

    Thanks a lot for your response and time.




  • 7.  RE: Data quality check for a multiple response question

    Posted Mon June 07, 2021 10:15 AM
    Edited by System Admin Fri January 20, 2023 04:16 PM
    When I said interactions between factors I meant exactly that.

    I don't know what the dependent variable (DV) is but I assume it's ordinal in at least 3 categories. So for example, if you think it's reasonable that subjects who report both high blood pressure (HBP) and diabetes show a multiplicative effect on the DV rather than an additive one (i.e., it's far worse to have both than it is to have either one by itself), then you should include the interaction term between those two variables. The same for the other factors. Of course, that adds more terms and complexity to your model but it's a more accurate reflection of the situation.
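
    As an illustration, an interaction between HBP and diabetes could be added through the MODEL subcommand of NOMREG. The variable names below (dv, hbp, diabetes, back, other) are placeholders, and which interactions to include is a substantive decision this sketch does not make for you.

    ```spss
    * Assumed names: dv, hbp, diabetes, back, other.
    * Main effects plus the HBP-by-diabetes interaction.
    NOMREG dv BY hbp diabetes back other
      /MODEL=hbp diabetes back other hbp*diabetes
      /PRINT=PARAMETER SUMMARY LRT.
    ```

    The likelihood ratio test for the hbp*diabetes term then indicates whether the interaction adds anything beyond the two main effects.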

    And yes, the factors could be correlated. It might be in your sample, say, that the majority of patients with HBP also selected "Other" (so, they have 1 or more co-morbidities). Same for diabetes or back pain. 
     
    As a preliminary step to check that, try running a PROXIMITIES command, something like this:

    PROXIMITIES <your factor variable names here>
      /VIEW=VARIABLE /MEASURE=PHI (1,0).

    Interpret the phi coefficients in the same way you would Pearson correlation coefficients; the higher the correlation, the more similar the variables. If the correlations are high, then I would re-think using those factors as joint predictors in the regression model and come up with a simpler but coarser measure. I've seen occasions where researchers ended up simplifying down to about the coarsest measure of all: A binary variable coded "1" if the patient had any of these illnesses and 0 otherwise, or the SUM measure I mentioned before. Not ideal by any means but that's what happens sometimes.
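
    If it does come to that, the two coarse measures mentioned above could be built along these lines. This is a sketch with placeholder variable names (back, hbp, diabetes, other); n_cond and any_cond are hypothetical names for the new variables.

    ```spss
    * Assumed names: back, hbp, diabetes, other (0/1 items).
    * Sum measure: number of conditions coded 1.
    COUNT n_cond = back hbp diabetes other (1).
    * Binary measure: 1 if any condition was selected, 0 otherwise.
    COMPUTE any_cond = (n_cond GT 0).
    EXECUTE.
    ```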

    ------------------------------
    Rick Marcantonio
    Quality Assurance
    IBM
    ------------------------------



  • 8.  RE: Data quality check for a multiple response question

    Posted Mon June 07, 2021 06:47 PM
    Hey,

    Thanks for all that information; it is highly appreciated.

    My DV is a nominal variable with more than 5 categories. 

    In my examples, I provided fictitious data. However, I am dealing with medical/mental conditions (only 5 in total) and I don't have multicollinearity issues among them. I was surprised that there is no multicollinearity among my factors. However, the absence of multicollinearity is one of the assumptions of multinomial logistic regression, so that's good in this case :)

    May I ask how I can check for outliers in multinomial logistic regression? It seems that MLR does not offer the diagnostic features that binary logistic does, so I am wondering whether it is permissible to use the binary logistic features to check the assumptions of multinomial logistic.

    Thanks again.





    ------------------------------
    Rose
    ------------------------------



  • 9.  RE: Data quality check for a multiple response question

    Posted Mon June 07, 2021 08:10 PM
    Yes, you can use logistic regression for that.
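
    One possible reading of this, offered as an assumption rather than a prescribed procedure, is to fit a separate binary logistic model for each outcome category against a reference category and inspect the casewise listing. The sketch below assumes dv is coded 1, 2, 3, ... with category 1 as the reference, and uses placeholder predictor names.

    ```spss
    * Assumed: dv is coded 1, 2, 3, ... and category 1 is the reference.
    * Compare category 2 against the reference, then repeat for the others.
    TEMPORARY.
    SELECT IF (ANY(dv, 1, 2)).
    LOGISTIC REGRESSION VARIABLES dv
      /METHOD=ENTER back hbp diabetes other
      /CASEWISE OUTLIER(2).
    ```

    Because TEMPORARY limits the selection to the next procedure, the full data set is available again afterwards for the next category comparison.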

    ------------------------------
    Rick Marcantonio
    Quality Assurance
    IBM
    ------------------------------



  • 10.  RE: Data quality check for a multiple response question

    Posted Tue June 08, 2021 03:51 PM
    Thanks again!