Oops. I need a bigger font :-)
-1.006483356943868
--
Original Message:
Sent: 9/22/2024 12:09:00 PM
From: Kirill Orlov
Subject: RE: Strange result of Box-Cox transform by ADP
Jon, did you get a Box-Cox lambda of -1.006 rather than 1.006?
------------------------------
Kirill Orlov
------------------------------
Original Message:
Sent: Sun September 22, 2024 10:48 AM
From: Jon Peck
Subject: Strange result of Box-Cox transform by ADP
Using STATS PREPROCESS VARIABLES,
STATS PREPROCESS VARIABLES=y ID=id
SORT=NO DATASET=normal
/NONLINEAR DONONLINEAR=YES METHOD=BOXCOX .
I get a lambda of 1.006. If I do Yeo-Johnson,
STATS PREPROCESS VARIABLES=y ID=id
SORT=NO DATASET=normalyj
/NONLINEAR DONONLINEAR=YES METHOD=YEOJOHNSON .
I get a lambda of -1.759. The histograms look slightly different, but this suggests that the result is not very sensitive to the transformation parameter. (Image attached)
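A note on comparing the two lambdas: under the textbook definitions, the Yeo-Johnson transform of a non-negative value y equals the Box-Cox transform of y + 1, so the two parameters are not on the same footing and need not agree. The Python sketch below illustrates this. It is an assumption that STATS PREPROCESS follows these exact definitions, and the function names are illustrative only.

```python
import math


def boxcox(y, lam):
    """Textbook Box-Cox transform; defined only for y > 0."""
    if y <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1.0) / lam


def yeojohnson(y, lam):
    """Textbook Yeo-Johnson transform; defined for any real y."""
    if y >= 0:
        # Non-negative branch: Box-Cox applied to y + 1.
        if lam == 0:
            return math.log1p(y)
        return ((y + 1.0) ** lam - 1.0) / lam
    # Negative branch: mirrored transform with parameter 2 - lam.
    if lam == 2:
        return -math.log1p(-y)
    return -(((1.0 - y) ** (2.0 - lam) - 1.0) / (2.0 - lam))


# Illustrative check with the Yeo-Johnson lambda quoted above.
y_value, lam = 0.5, -1.759
same = math.isclose(yeojohnson(y_value, lam), boxcox(y_value + 1.0, lam))
print("YJ(y, lam) == BoxCox(y + 1, lam):", same)
```

Because of the implicit shift by 1, a Yeo-Johnson lambda and a Box-Cox lambda fitted to the same positive variable can differ noticeably while producing similar-looking histograms.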
------------------------------
Jon Peck
Data Scientist
JKP Associates
Santa Fe
Original Message:
Sent: Sun September 22, 2024 06:22 AM
From: Kirill Orlov
Subject: Strange result of Box-Cox transform by ADP
Automatic Data Preparation (ADP) can apply a Box-Cox transformation to the target variable to make a skewed variable more normally distributed.
I am observing a strange, incongruous Box-Cox result from ADP that looks like a bug to me. ADP fails to find the optimal lambda parameter of the transform in a case where it surely should, since the data are simple and the optimal lambda value lies on the grid that ADP searches. An example follows.
*First create a standard normal variate X.
SET RNG=MC SEED=5676954.
input prog.
loop #case= 1 to 1000.
comp x= normal(1).
end case.
end loop.
end file.
end input prog.
exec.
dataset name data.
*and, after shifting it to the right so all its values are positive...
AGGREGATE
/OUTFILE=* MODE=ADDVARIABLES
/x_min=MIN(x).
comp x= x-x_min+1.
exec.
*... create the right-skew variate Y as 1/X.
comp y= 1/x.
exec.
*Y is the variate we want to apply Box-Cox transform to.
*Use ADP to do it.
comp factor= rnd(uniform(1)). /*(a dummy factor variable; it is not used, but ADP needs an input field to be specified)
exec.
var lev factor (nom).
*ADP doing only Box-Cox.
ADP
/FIELDS TARGET=y INPUT=factor
/PREPDATETIME DATEDURATION=NO TIMEDURATION=NO EXTRACTYEAR=NO EXTRACTMONTH=NO EXTRACTDAY=NO
EXTRACTHOUR=NO EXTRACTMINUTE=NO EXTRACTSECOND=NO
/SCREENING PCTMISSING=NO UNIQUECAT=NO SINGLECAT=NO
/ADJUSTLEVEL INPUT=NO TARGET=NO
/OUTLIERHANDLING INPUT=NO TARGET=NO
/REPLACEMISSING INPUT=NO TARGET=NO
/REORDERNOMINAL INPUT=NO TARGET=NO
/RESCALE INPUT=NONE TARGET=BOXCOX(MEAN=0 SD=1)
/TRANSFORM MERGESUPERVISED=NO MERGEUNSUPERVISED=NO BINNING=NONE SELECTION=NO CONSTRUCTION=NO
/CRITERIA SUFFIX(TARGET='_transformed' INPUT='_transformed')
/OUTFILE PREPXML='C:\Temp\spssadp_automatic.tmp'.
TMS IMPORT
/INFILE TRANSFORMATIONS='C:\Temp\spssadp_automatic.tmp' MODE=FORWARD (ROLES=UPDATE)
/SAVE TRANSFORMED=YES
/OUTFILE SYNTAX='C:\Temp\TransSyntax.sps'.
GRAPH
/HISTOGRAM(NORMAL)=Y_transformed.
*The Y_transformed variate is not much more normal than Y is.
*And, from the syntax file TransSyntax.sps we can learn that the lambda parameter used was: -3.
*However, the obviously best lambda should be -1, which transforms Y "back" to a normal variate.
*ADP does grid search for the best lambda from -3 to 3 by step 0.5, so it must find lambda=-1.
*Why did it stick to (the incorrect) lambda=-3 instead?
I hope SPSS statisticians or ADP developers will come to answer it.
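As a cross-check outside SPSS, here is a minimal Python sketch of the grid search described above. It is not ADP's implementation: it assumes the standard Box-Cox profile log-likelihood as the selection criterion, and it replaces the random draws with 1000 evenly spaced standard-normal quantiles so that the result is reproducible. The names boxcox, profile_loglik and best_lambda are illustrative only.

```python
import math
from statistics import NormalDist


def boxcox(y, lam):
    """Textbook Box-Cox transform of a strictly positive value."""
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1.0) / lam


def profile_loglik(ys, lam):
    """Box-Cox profile log-likelihood, up to an additive constant."""
    n = len(ys)
    z = [boxcox(y, lam) for y in ys]
    mean = sum(z) / n
    var = sum((v - mean) ** 2 for v in z) / n
    # Variance term plus the Jacobian of the transform.
    return -0.5 * n * math.log(var) + (lam - 1.0) * sum(math.log(y) for y in ys)


def best_lambda(ys, grid):
    """Return the grid value with the highest profile log-likelihood."""
    return max(grid, key=lambda lam: profile_loglik(ys, lam))


# Assumption: an idealized normal sample (quantiles), not SPSS's random draws.
n = 1000
x = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

# Shift so the minimum is 1, then take the reciprocal, as in the syntax above.
x_min = min(x)
x = [v - x_min + 1.0 for v in x]
y = [1.0 / v for v in x]

# Grid from -3 to 3 in steps of 0.5, as described for ADP.
grid = [-3.0 + 0.5 * k for k in range(13)]
lam_hat = best_lambda(y, grid)
print("best lambda on the grid:", lam_hat)
```

If a likelihood-based search over this grid selects -1 for data of this shape, then the grid itself cannot explain why ADP settles on -3.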
------------------------------
Kirill Orlov
------------------------------