Is Automatic Data Preparation, specifically Feature Construction, correct? Is its Algorithms description document correct?
The following ADP syntax performs only Feature Construction, and there is no target variable. Therefore only two things are done: 1) grouping of variables and 2) extraction of one principal component from each group. The syntax of the transformations is saved.
ADP
/FIELDS INPUT= a1 a2 a3 a4 a5 a6 a7 a8 a9 a10
/PREPDATETIME DATEDURATION=NO TIMEDURATION=NO EXTRACTYEAR=NO EXTRACTMONTH=NO EXTRACTDAY=NO
EXTRACTHOUR=NO EXTRACTMINUTE=NO EXTRACTSECOND=NO
/SCREENING PCTMISSING=NO UNIQUECAT=NO SINGLECAT=NO
/ADJUSTLEVEL INPUT=NO TARGET=NO
/OUTLIERHANDLING INPUT=NO TARGET=NO
/REPLACEMISSING INPUT=NO TARGET=NO
/REORDERNOMINAL INPUT=NO TARGET=NO
/RESCALE INPUT=NONE TARGET=NONE
/TRANSFORM MERGESUPERVISED=NO MERGEUNSUPERVISED=NO BINNING=NONE SELECTION=NO
CONSTRUCTION=YES(ROOT=feat)
/CRITERIA SUFFIX(TARGET='_tr' INPUT='_tr')
/OUTFILE PREPXML='C:\Temp\spss5732\spssadp_automatic.tmp'.
TMS IMPORT
/INFILE TRANSFORMATIONS='C:\Temp\spss5732\spssadp_automatic.tmp' MODE=FORWARD
/SAVE TRANSFORMED=NO
/OUTFILE SYNTAX='D:\Exercise\trans.sps'. /*PLEASE SPECIFY YOUR PATH HERE
EXECUTE.
ERASE FILE='C:\Temp\spss5732\spssadp_automatic.tmp'.
(The data file is attached)
The correlations between the 10 variables are:
According to the syntax saved by the ADP command...
COMPUTE feat01 = ((1.38160393061016*a3)+(1.45134863418947*a2)).
COMPUTE feat02 = ((0.883586027863073*a5)+((0.979405457980496*a4)+(0.540720965943866*a6))).
COMPUTE feat03 = ((0.728350503663626*a10)+(1.41946313742639*a9)).
VARIABLE ROLE
/NONE a2 a3 a4 a5 a6 a9 a10
/INPUT a1 a7 a8 feat01 feat02 feat03.
... the procedure formed the following three groups of variables (to replace them with their 1st PCs):
(a2,a3)
(a4,a5,a6)
(a9,a10)
Let us trace how that was done, following the algorithm described in the Algorithms document > Automatic Data Preparation > Continuous Predictor Handling > Feature Selection and Construction.
On the a_group=0.4 level, we encounter correlation 0.486 between a2 and a3, so we group them (step 4).
There is no other variable X with correlation min(r_xa2, r_xa3)>a_group, so we close this group (step 5).
We produce the PC (feat01) out of (a2,a3) and remove a2, a3 from the correlation matrix (steps 6, 8).
So far so good.
Still on the a_group=0.4 level, we encounter correlation 0.426 between a4 and a5, so we group them (step 4).
There is no other variable X with correlation min(r_xa4, r_xa5)>a_group, so we should close the group.
Unexpectedly to me, however, a6 is also included in the group, even though min(r_a6a4, r_a6a5)=0.337 (<0.4).
Question 1: Why? What's special here? Am I missing something?
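To make my reading of steps 3 to 8 concrete, here is a minimal Python sketch of the grouping logic as I understand it from the document. It is not ADP's actual code. The function name `greedy_groups` is my own, and the sketch simply drops a finished group instead of computing its principal component. The toy matrix uses the two figures quoted above (0.426 and 0.337). The a5/a6 value of 0.39 is a placeholder I made up for illustration, as is the assignment of 0.337 to the a4/a6 pair.

```python
from itertools import combinations


def greedy_groups(names, corr, alpha_group):
    """Group variables following the post's reading of the documented steps.

    This is an assumption-laden sketch, not the ADP implementation.
    `corr` maps frozenset({x, y}) to the correlation between x and y.
    """
    def r(x, y):
        return abs(corr[frozenset((x, y))])

    remaining = list(names)
    groups = []
    while len(remaining) >= 2:
        # Step 4 (as read in the post): take the most strongly correlated pair.
        x, y = max(combinations(remaining, 2), key=lambda p: r(*p))
        # Step 3 (as read in the post): stop when nothing exceeds the threshold.
        if r(x, y) <= alpha_group:
            break
        group = [x, y]
        # Step 5 (as read in the post): admit a variable only if its weakest
        # correlation with the current members still exceeds alpha_group.
        while True:
            scored = [
                (min(r(v, g) for g in group), v)
                for v in remaining
                if v not in group
            ]
            scored = [s for s in scored if s[0] > alpha_group]
            if not scored:
                break
            group.append(max(scored)[1])
        groups.append(group)
        # Steps 6 and 8: ADP would replace the group with its first PC;
        # this sketch only removes the members from further consideration.
        remaining = [v for v in remaining if v not in group]
    return groups


names = ["a4", "a5", "a6"]
corr = {
    frozenset(("a4", "a5")): 0.426,  # quoted in the post
    frozenset(("a4", "a6")): 0.337,  # the quoted minimum; pair assignment assumed
    frozenset(("a5", "a6")): 0.39,   # hypothetical placeholder value
}

print(greedy_groups(names, corr, 0.4))
print(greedy_groups(names, corr, 0.3))
```

Under this reading, a6 stays out of the group at a threshold of 0.4 and joins only once the threshold in effect drops below 0.337, which is why the observed group (a4,a5,a6) puzzles me.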
On the a_group=0.1 level, we could expect to stop the process (see step 3). Still, ADP continues and groups a9 with a10 (r=0.197).
Question 2: Why? Might there be some lapse in the description?
Question 3: Why were a1, a7, a8 never included together in a group, even though some correlations among them exceed a_group=0.1?
Finally, a general question.
Question 4: What is a sane motivation for extracting a principal component from a group held together by correlations as low as 0.1 or 0.2? In other words, why does the algorithm keep constructing (extracting) features down to an a_group level as low as 0.1?
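To put a number on this concern: for two standardized variables with correlation r, the correlation matrix has eigenvalues 1+|r| and 1-|r|, so the first principal component carries (1+|r|)/2 of the total variance. The short Python sketch below evaluates this for the two pairwise correlations quoted above. The helper name `first_pc_share` is my own, and the calculation assumes the component is extracted from the correlation matrix of exactly two variables, which may not match what ADP does internally.

```python
def first_pc_share(r):
    """Share of total variance carried by the first PC of two standardized
    variables whose correlation is r (eigenvalues are 1+|r| and 1-|r|)."""
    return (1 + abs(r)) / 2


# Correlations quoted in the post: (a2, a3) and (a9, a10).
for r in (0.486, 0.197):
    print(f"r = {r}: first PC share = {first_pc_share(r):.4f}")
```

For r=0.486 the component keeps roughly 74% of the variance of the pair. For r=0.197 it keeps only about 60%, so about 40% of the information in a9 and a10 is discarded.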
I hope very much that a statistician or developer who designed ADP will come by to answer my (probably silly) questions.
------------------------------
Kirill Orlov
------------------------------
#SPSSStatistics