Data quality check for a multiple response question

6. RE: Data quality check for a multiple response question

Rose

Posted Sat June 05, 2021 08:28 AM
Edited by System Admin Fri January 20, 2023 04:34 PM

Morning!

You are correct - the data are coded as 0 and 1. My main interest is in the presence/absence of each particular condition, to evaluate its effect on the dependent variable - as you mentioned. I never thought about the possible effect of the number of selected conditions on the DV - that is something worth exploring as well.

When you mention interactions between the factors, is that related to multicollinearity?

Thanks a lot for your response and time.

7. RE: Data quality check for a multiple response question

Rick Marcantonio

Posted Mon June 07, 2021 10:15 AM
Edited by System Admin Fri January 20, 2023 04:16 PM

When I said interactions between factors I meant exactly that.

I don't know what the dependent variable (DV) is but I assume it's ordinal in at least 3 categories. So for example, if you think it's reasonable that subjects who report both high blood pressure (HBP) and diabetes show a multiplicative effect on the DV rather than an additive one (i.e., it's far worse to have both than it is to have either one by itself), then you should include the interaction term between those two variables. The same for the other factors. Of course, that adds more terms and complexity to your model but it's a more accurate reflection of the situation.
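To make the additive-versus-interaction distinction concrete, here is a minimal Python sketch (not SPSS syntax). The coefficient values and function names are invented for illustration and do not come from any fitted model. It only shows that, on the log-odds scale, the interaction coefficient is an extra shift that applies when both conditions are present.

```python
import math

def log_odds(hbp, diabetes, b0=-2.0, b_hbp=0.5, b_diab=0.7, b_inter=0.0):
    """Linear predictor for one outcome category versus the reference category.

    All coefficients are hypothetical. hbp and diabetes are 0/1 indicators.
    """
    return b0 + b_hbp * hbp + b_diab * diabetes + b_inter * hbp * diabetes

def odds_ratio_both_vs_neither(b_inter):
    """Odds ratio comparing a patient with both conditions to one with neither."""
    both = log_odds(1, 1, b_inter=b_inter)
    neither = log_odds(0, 0, b_inter=b_inter)
    return math.exp(both - neither)

# Main effects only: the two effects simply add on the log-odds scale.
main_effects_only = odds_ratio_both_vs_neither(0.0)
# With a positive interaction term, having both is worse than the sum suggests.
with_interaction = odds_ratio_both_vs_neither(0.9)

print(round(main_effects_only, 2), round(with_interaction, 2))
```

If the estimated interaction coefficient is near zero, the simpler main-effects model describes the data about as well.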

And yes, the factors could be correlated. It might be in your sample, say, that the majority of patients with HBP also selected "Other" (so, they have 1 or more co-morbidities). Same for diabetes or back pain.

As a preliminary step to check that, try running a PROXIMITIES command, something like this:

PROXIMITIES <your factor variable names here>
  /VIEW=VARIABLE /MEASURE=PHI (1,0).

Interpret the phi coefficients in the same way you would Pearson correlation coefficients; the higher the correlation, the more similar the variables. If the correlations are high, then I would re-think using those factors as joint predictors in the regression model and come up with a simpler but coarser measure. I've seen occasions where researchers ended up simplifying down to about the coarsest measure of all: A binary variable coded "1" if the patient had any of these illnesses and 0 otherwise, or the SUM measure I mentioned before. Not ideal by any means but that's what happens sometimes.
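For readers who want to see what a phi coefficient measures, here is a minimal Python sketch (not SPSS syntax) that computes it from the 2x2 cross-classification of two 0/1 variables. The variable names and the eight data values are made up for illustration.

```python
import math

def phi(x, y):
    """Phi coefficient for two equal-length sequences of 0/1 codes."""
    pairs = list(zip(x, y))
    n11 = sum(1 for a, b in pairs if a == 1 and b == 1)
    n10 = sum(1 for a, b in pairs if a == 1 and b == 0)
    n01 = sum(1 for a, b in pairs if a == 0 and b == 1)
    n00 = sum(1 for a, b in pairs if a == 0 and b == 0)
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        # One of the variables is constant, so phi is undefined.
        return float("nan")
    return (n11 * n00 - n10 * n01) / denom

# Fictitious 0/1 codes for eight respondents.
hbp = [1, 1, 1, 0, 0, 0, 1, 0]
diabetes = [1, 1, 0, 0, 0, 0, 1, 0]

print(round(phi(hbp, diabetes), 3))
```

For two 0/1 variables, phi equals the Pearson correlation of the codes, which is why it can be read the same way.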

------------------------------
Rick Marcantonio
Quality Assurance
IBM
------------------------------

8. RE: Data quality check for a multiple response question

Rose

Posted Mon June 07, 2021 06:47 PM

Hey,

Thanks for all that information; it is highly appreciated.

My DV is a nominal variable with more than 5 categories.

In my examples, I provided fictitious data. However, I am dealing with medical/mental conditions (only 5 in total), and I don't have multicollinearity issues among them. I was surprised that there is no multicollinearity among my factors. However, the absence of multicollinearity is one of the assumptions of multinomial logistic regression, so that's good in this case :)

May I ask how I can check for outliers in multinomial logistic regression? It seems that MLR does not have diagnostic features like those in binary logistic regression, so I am wondering whether it is permissible to use the binary logistic features to check the assumptions of multinomial logistic regression.

Thanks again.

------------------------------
Rose
------------------------------

Original Message:
Sent: Fri June 04, 2021 01:33 AM
From: Rick Marcantonio
Subject: Data quality check for a multiple response question

Hi, Rose.

Multiple response variables as such are limited to either Custom Tables or the MULT RESPONSE command.

Judging from your first post, it seems you have four component variables: back problems, high blood pressure, diabetes, and other. I assume that each is (or will be) coded 0-1. Given that, you could use them as independent variables in a logistic model in one of at least two ways.

First, if you are interested in the number of medical conditions as causative (without respect to which particular ones any patient or respondent has), then you could sum the responses and use the count variable in the model. Second, if you are concerned with how the presence/absence of each particular illness affects the dependent measure, then entering each variable as a factor would be preferable (that also raises the issue of interactions between those factors).
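The two options can be sketched in a few lines of Python (not SPSS syntax). The variable names and the three example cases are hypothetical. The sketch derives the count of conditions, plus a coarse "any condition" indicator, from the four 0/1 items; the items themselves stay available as separate predictors.

```python
# Hypothetical names for the four 0/1 component variables.
CONDITIONS = ("back", "hbp", "diabetes", "other")

def derive(case):
    """Return the count of selected conditions and an any-condition flag."""
    count = sum(case[name] for name in CONDITIONS)
    return {"n_conditions": count, "any_condition": int(count > 0)}

# Fictitious respondents.
cases = [
    {"back": 0, "hbp": 0, "diabetes": 0, "other": 0},
    {"back": 1, "hbp": 0, "diabetes": 0, "other": 0},
    {"back": 0, "hbp": 1, "diabetes": 1, "other": 1},
]

derived = [derive(case) for case in cases]
for row in derived:
    print(row)
```

The count treats every condition as interchangeable, so it cannot tell you which illness drives the effect. That is the trade-off between the two approaches.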

Of course, the use of those variables carries the usual assumptions - see this multinomial logistic example if you want more information.

------------------------------
Rick Marcantonio
Quality Assurance
IBM

Original Message:
Sent: Thu June 03, 2021 10:04 PM
From: Rose
Subject: Data quality check for a multiple response question

Hi again,

Is it acceptable to use a multiple response question as factors in a multinomial logistic regression model?

For instance, I am using medical conditions - a person could have selected one or several options.
Each response is a separate variable in SPSS, coded as applies (1) or does not apply (0).

How would logistic regression treat instances in which more than 1 condition was selected? Does that matter for the model? Maybe I should only use cases that have only one condition selected?

Thanks in advance!

------------------------------
Rose

Original Message:
Sent: Sun May 30, 2021 07:08 PM
From: Jon Peck
Subject: Data quality check for a multiple response question

Such conditions can be defined as standard COMPUTE commands, but if you have the Data Validation option (now included in Base), you can define single and cross-variable conditions and get a set of reports on violations.

Here is a compute that checks this particular condition.

* Flag cases where Q1 and the Q2 items disagree (Q1 coded 1 = yes, 2 = no, 3 = do not know).
compute invalid = (Q1 eq 1 and not any(1, Q2_1 to Q2_4)) or
    (Q1 ne 1 and any(1, Q2_1 to Q2_4)).
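As an illustration of the same check outside SPSS, here is a minimal Python sketch that labels each of the four scenarios from the original question and reports the percentage of flagged cases per state. It follows the coding given in that question (Q1: 1 = yes, 2 = no, 3 = don't know; Q2 items: 0/1). The `state` field, the scenario labels, the helper names, and the sample rows are assumptions made for the example, and "an answer was provided for Q2" is taken to mean that at least one Q2 item is coded 1.

```python
from collections import defaultdict

Q2_ITEMS = ("Q2_1", "Q2_2", "Q2_3", "Q2_4")

def classify(row):
    """Return a label for the inconsistency in this row, or None if consistent."""
    q1 = row.get("Q1")  # None stands for a blank Q1
    answered = any(row.get(item) == 1 for item in Q2_ITEMS)
    if q1 == 1 and not answered:
        return "yes_without_q2"
    if q1 == 2 and answered:
        return "no_with_q2"
    if q1 == 3 and answered:
        return "dont_know_with_q2"
    if q1 is None and answered:
        return "blank_with_q2"
    return None

def percent_invalid_by_state(rows):
    """Percentage of rows per state that fall into any flagged scenario."""
    totals = defaultdict(int)
    invalid = defaultdict(int)
    for row in rows:
        totals[row["state"]] += 1
        if classify(row) is not None:
            invalid[row["state"]] += 1
    return {state: 100.0 * invalid[state] / totals[state] for state in totals}

def make_row(state, q1, q2_codes):
    """Build one fictitious record from a state, a Q1 code, and four Q2 codes."""
    row = {"state": state, "Q1": q1}
    row.update(zip(Q2_ITEMS, q2_codes))
    return row

rows = [
    make_row("A", 1, (1, 0, 0, 0)),     # consistent
    make_row("A", 1, (0, 0, 0, 0)),     # yes, but no Q2 item selected
    make_row("A", 2, (0, 0, 0, 0)),     # consistent
    make_row("A", 2, (0, 1, 0, 0)),     # no, but a Q2 item selected
    make_row("A", None, (0, 0, 0, 1)),  # blank Q1, but a Q2 item selected
    make_row("B", 3, (0, 0, 1, 0)),     # don't know, but a Q2 item selected
    make_row("B", 1, (1, 0, 1, 0)),     # consistent
    make_row("B", 2, (0, 0, 0, 0)),     # consistent
    make_row("B", 3, (0, 0, 0, 0)),     # consistent
]

percentages = percent_invalid_by_state(rows)
print(percentages)
```

Keeping a separate label per scenario lets you tabulate each kind of error by state before deciding on a cutoff.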

--

Jon K Peck
jkpeck@gmail.com

Original Message:
Sent: 5/30/2021 6:31:00 PM
From: Rose
Subject: Data quality check for a multiple response question

Hey,

I have a multiple response question which should be answered if a "yes" response was selected on a previous question. However, I have noticed data entry errors.

Let's say that Q1 asks - do you have a medical condition? (coded as 1-yes, 2-no, 3-don't know)

Q2 should be answered if 'yes' was selected for Q1 (each Q2 item is coded as 0-no, 1-yes)
1. back problems
2. high blood pressure
3. diabetes
4. other

Some scenarios I have noticed...
1. Yes was selected for Q1; however, no answer was provided for Q2
2. No was selected for Q1; however, an answer(s) was provided for Q2
3. Don't know was selected for Q1; however, an answer(s) was provided for Q2
4. Q1 was left blank; however, an answer(s) was provided for Q2

I am dealing with a big data set, so I would like to know how I can perform a quality check for the scenarios above. I have these data from different states, so I would like to find out the percentage of cases affected by these scenarios (if any) in each state. Based on that, I can set a threshold for disqualifying (DQ) a state from the analysis.

Any guidance is appreciated.

Thanks.

------------------------------
Rose
------------------------------
#SPSSStatistics

9. RE: Data quality check for a multiple response question

Rick Marcantonio

Posted Mon June 07, 2021 08:10 PM

Yes, you can use logistic regression for that.

------------------------------
Rick Marcantonio
Quality Assurance
IBM
------------------------------

10. RE: Data quality check for a multiple response question

Rose

Posted Tue June 08, 2021 03:51 PM

Thanks again!

SPSS Statistics