Original Message:
Sent: 12/22/2023 2:39:00 PM
From: A. R. Afework
Subject: RE: Error with Relational Operators (using >= in a COMPUTE)
Ah. I appreciate your very thorough reply. I now understand what is happening. Still not necessarily satisfied with the behavior, but it's now clear to me why rounding the mean before doing the comparison gave a different answer (the expected one) despite the mean "already" being "exactly" 0.5. And since I understand, I can work around it.
Thank you for taking the time!
------------------------------
A. R. Afework
------------------------------
Original Message:
Sent: Fri December 22, 2023 01:34 PM
From: Jon Peck
Subject: Error with Relational Operators (using >= in a COMPUTE)
The computed sum is 10.9999999999999980 for case 127. This is due to the nature of the floating point arithmetic hardware used in all modern computers (at least all that I know about). Double precision floating point values have 53 binary bits of precision (the rest holds the sign and exponent) and represent numbers in terms of powers of 2. So, although the stored values are extremely close, there are infinitely many numbers that cannot be represented exactly, and all floating point arithmetic is therefore approximate. You can dig into the details of floating point arithmetic on Wikipedia (see below).
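If you want to see this in action, here is a minimal sketch you can run in an empty session (the variable names are just for the example):
* Sketch: three copies of 0.1 do not sum to exactly 0.3 in binary doubles.
DATA LIST FREE / dummy.
BEGIN DATA
1
END DATA.
COMPUTE sum3 = 0.1 + 0.1 + 0.1.
COMPUTE isequal = (sum3 = 0.3).
COMPUTE diff = sum3 - 0.3.
FORMATS isequal (F1.0) diff (F22.16).
LIST sum3 isequal diff.
isequal comes out 0 and diff is a tiny nonzero amount, even though the same arithmetic done in base 10 would be exact.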
If computations were carried out in base 10 (decimal) arithmetic, you would not see these differences with decimal values. It is possible to write code that does this, but it is hundreds of times slower than binary arithmetic, since it has to be done in software rather than in the usual hardware, so it is not used in scientific computation. The Decimal data type in Python provides this kind of arithmetic and could be used in SPSS via the SPSSINC TRANS extension command.
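If the Python 3 integration is installed (and assuming a version where BEGIN PROGRAM PYTHON3 is available), a quick sketch that just prints to the Viewer and does not touch your data shows the difference:
BEGIN PROGRAM PYTHON3.
# Sketch: binary doubles vs. base-10 Decimal arithmetic.
from decimal import Decimal
print(0.1 + 0.1 + 0.1 == 0.3)                                               # False: doubles are approximate
print(Decimal("0.1") + Decimal("0.1") + Decimal("0.1") == Decimal("0.3"))   # True: decimal arithmetic is exact here
END PROGRAM.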
SPSS Statistics provides a fuzz bits setting for the RND and TRUNC functions that deals with boundary cases. The help for Edit > Options > Data provides information on this, but it does not affect functions like SUM and MEAN.
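For reference, the optional arguments look roughly like this (a sketch only; check the Transformation Expressions help for the exact argument list, and treat the rounding unit, fuzz bits value, and new variable names as placeholders):
* Sketch: RND with an explicit rounding unit (second argument).
* The optional third argument overrides the default fuzz bits; values here are placeholders.
COMPUTE MeanRounded = RND(MeanScale, 0.0001).
COMPUTE MeanRounded2 = RND(MeanScale, 0.0001, 6).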
If you compute all the partial sums of the scale variables, you can see that for case 127 the first inexact number appears at sum16, i.e., SUM(Scale1 TO Scale16). BTW, you can abbreviate the syntax with TO in the variable list.
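For example, something along these lines (the partial-sum names are placeholders, and TO follows the variables' order in the file):
* Sketch: partial sums using the TO abbreviation.
COMPUTE sum15 = SUM(Scale1 TO Scale15).
COMPUTE sum16 = SUM(Scale1 TO Scale16).
* The original MEAN can be abbreviated the same way.
COMPUTE MeanScale2 = MEAN(Scale1 TO Scale25).
FORMATS sum15 sum16 MeanScale2 (F20.16).
LIST sum15 sum16 MeanScale2 /CASES=FROM 127 TO 127.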
In computing, floating-point arithmetic (FP) is arithmetic that represents subsets of real numbers using an integer with a fixed precision, called the significand, scaled by an integer exponent of a fixed base. Numbers of this form are called floating-point numbers.[1][2] For example, 12.345 is a floating-point number in base ten with five digits of precision:
12.345 = 12345 × 10^-3
However, unlike 12.345, 12.3456 is not a floating-point number in base ten with five digits of precision; it needs six digits of precision, and the nearest floating-point number with only five digits is 12.346. In practice, most floating-point systems use base two, though base ten (decimal floating point) is also common.
Floating-point arithmetic operations, such as addition and division, approximate the corresponding real number arithmetic operations by rounding any result that is not a floating-point number itself to a nearby floating-point number.[1][2] For example, in a floating-point arithmetic with five base-ten digits of precision, the sum 12.345 + 1.0001 = 13.3451 might be rounded to 13.345.
Original Message:
Sent: 12/22/2023 11:39:00 AM
From: A. R. Afework
Subject: RE: Error with Relational Operators (using >= in a COMPUTE)
Hi Jon,
Thank you for looking at my dataset and my question.
So for rows 119-127, I can see that if I increase the variables' display to 16 places after the decimal, SPSS has calculated the mean as slightly less than 0.5. On the plus side, this means that the greater-than-or-equal-to operation is fine. Unfortunately, this insight highlights a couple of issues:
- Originally, when it was created and when the calculation was done, the variable was set to have only two decimals. So SPSS is apparently storing additional decimal places but only displaying 2. This is strange.
- The MEAN function then seems to be the issue, since it is returning an incorrect value. For rows 119-127, all of those means are exactly 0.5.
- Specifically, case 127 has 22 individual scores, which sum to 11, so the mean is 0.5.
------------------------------
A. R. Afework
Original Message:
Sent: Fri December 22, 2023 11:06 AM
From: Jon Peck
Subject: Error with Relational Operators (using >= in a COMPUTE)
The calculations are correct. If, for example, you look at case 127, it looks like the value is .50000, but if you expand the number of decimals, you can see that the exact value is actually .4999999999999999
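In syntax, that is just a matter of widening the display format, for example (a sketch; this changes only the display, not the stored value):
* Sketch: show more decimals without changing the stored value.
FORMATS MeanScale (F20.16).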
Original Message:
Sent: 12/21/2023 2:09:00 PM
From: A. R. Afework
Subject: Error with Relational Operators (using >= in a COMPUTE)
Hello there,
We have run into a strange error. We have a large dataset where we compute 25 different scores and then the mean of those scores for each case. Each case where the mean score is 0.5 or higher is considered to have met the standard, and we report out the percentage who met the standard, overall and for various subgroups. We expected this to be really straightforward, but something unexpected is happening. The calculation of whether a case met the standard, which is simply a greater-than-or-equal-to test, is failing: for a small percentage of cases, the mean score is equal to the cutoff value, but the MetStandard variable is being assigned 0, and we do not understand why.
I am attaching a dataset for illustration purposes, in case that helps. It has 2500 cases. Each case has up to 25 scores (Scale1 through Scale25) that range from 0.2 to 1.0. The mean score and the variable indicating whether or not the case met the standard are computed as follows:
COMPUTE MeanScale=Mean(Scale1, Scale2, Scale3, Scale4, Scale5, Scale6, Scale7, Scale8, Scale9,
Scale10, Scale11, Scale12, Scale13, Scale14, Scale15, Scale16, Scale17, Scale18, Scale19, Scale20,
Scale21, Scale22, Scale23, Scale24, Scale25).
COMPUTE MetStandard=MeanScale >= 0.5.
There are 9 cases where MeanScale is 0.5, but MetStandard was assigned 0. I noticed that rounding MeanScale before doing the comparison seems to resolve this problem (at least in this particular dataset, though I do not know why that should matter, given that the affected cases were exactly 0.5). Is there some underlying limitation that I should be aware of that explains this? Are there other problems this issue can cause?
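For illustration, the kind of rounding I mean looks something like this (the number of decimals kept is just an example, and I am not sure it is the right fix):
COMPUTE MetStandard = (RND(MeanScale * 1000) / 1000 >= 0.5).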
Thank you for any guidance!
------------------------------
A. R. Afework
------------------------------