SPSS Statistics

  • 1.  ADP, Feature Construction algorithm - a problem

    Posted Sat August 27, 2022 12:03 PM
    Edited by System Admin Fri January 20, 2023 04:43 PM
    Is Automatic Data Preparation, specifically Feature Construction, correct? Is its Algorithms description document correct?

    The following ADP syntax performs only Feature Construction, and there is no target variable. Therefore, only two things are performed: 1) grouping of variables and 2) extraction of one principal component from each group. The syntax of the transformations is saved.


    ADP
    /FIELDS INPUT= a1 a2 a3 a4 a5 a6 a7 a8 a9 a10
    /PREPDATETIME DATEDURATION=NO TIMEDURATION=NO EXTRACTYEAR=NO EXTRACTMONTH=NO EXTRACTDAY=NO
    EXTRACTHOUR=NO EXTRACTMINUTE=NO EXTRACTSECOND=NO
    /SCREENING PCTMISSING=NO UNIQUECAT=NO SINGLECAT=NO
    /ADJUSTLEVEL INPUT=NO TARGET=NO
    /OUTLIERHANDLING INPUT=NO TARGET=NO
    /REPLACEMISSING INPUT=NO TARGET=NO
    /REORDERNOMINAL INPUT=NO TARGET=NO
    /RESCALE INPUT=NONE TARGET=NONE
    /TRANSFORM MERGESUPERVISED=NO MERGEUNSUPERVISED=NO BINNING=NONE SELECTION=NO
    CONSTRUCTION=YES(ROOT=feat)
    /CRITERIA SUFFIX(TARGET='_tr' INPUT='_tr')
    /OUTFILE PREPXML='C:\Temp\spss5732\spssadp_automatic.tmp'.
    TMS IMPORT
    /INFILE TRANSFORMATIONS='C:\Temp\spss5732\spssadp_automatic.tmp' MODE=FORWARD
    /SAVE TRANSFORMED=NO
    /OUTFILE SYNTAX='D:\Exercise\trans.sps'. /*PLEASE SPECIFY YOUR PATH HERE
    EXECUTE.
    ERASE FILE='C:\Temp\spss5732\spssadp_automatic.tmp'.

    (The data file is attached)

    The correlations between the 10 variables are:

    According to the syntax saved by the ADP command...
    COMPUTE feat01 = ((1.38160393061016*a3)+(1.45134863418947*a2)).
    COMPUTE feat02 = ((0.883586027863073*a5)+((0.979405457980496*a4)+(0.540720965943866*a6))).
    COMPUTE feat03 = ((0.728350503663626*a10)+(1.41946313742639*a9)).
    VARIABLE ROLE
    /NONE a2 a3 a4 a5 a6 a9 a10
    /INPUT a1 a7 a8 feat01 feat02 feat03.

    ... the procedure formed the following three groups of variables (to replace them with their 1st PCs):
    (a2,a3)
    (a4,a5,a6)
    (a9,a10)
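    The grouping above can be read off the saved COMPUTE statements mechanically. A minimal Python sketch, assuming every term has the form (coefficient*variable) as in the three lines above; the function name groups_from_compute is hypothetical and not part of SPSS:

```python
import re

# The three COMPUTE lines exactly as saved by TMS IMPORT in the post.
SAVED_SYNTAX = """\
COMPUTE feat01 = ((1.38160393061016*a3)+(1.45134863418947*a2)).
COMPUTE feat02 = ((0.883586027863073*a5)+((0.979405457980496*a4)+(0.540720965943866*a6))).
COMPUTE feat03 = ((0.728350503663626*a10)+(1.41946313742639*a9)).
"""

def groups_from_compute(syntax):
    """Map each constructed feature to {source variable: coefficient}."""
    groups = {}
    for line in syntax.splitlines():
        # Match "COMPUTE <target> = <expression>." with the trailing period.
        m = re.match(r"\s*COMPUTE\s+(\w+)\s*=\s*(.+)\.\s*$", line)
        if not m:
            continue
        target, expr = m.groups()
        # Each term is assumed to look like (coefficient*variable).
        terms = re.findall(r"\(([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)\*(\w+)\)", expr)
        groups[target] = {var: float(coef) for coef, var in terms}
    return groups

for feature, coefs in groups_from_compute(SAVED_SYNTAX).items():
    print(feature, sorted(coefs))
```

    Each feature maps back to exactly the variables that VARIABLE ROLE sets to NONE.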

    Let us trace how that was done, based on the algorithm described in the Algorithms document > Automatic Data Preparation > Continuous Predictor Handling > Feature Selection and Construction.

    On the a_group=0.4 level, we encounter correlation 0.486 between a2 and a3, so we group them (step 4).
    There is no other variable X with correlation min(r_xa2, r_xa3)>a_group, so we close this group (step 5).
    We produce the PC (feat01) out of (a2,a3) and remove a2, a3 from the correlation matrix (steps 6, 8).
    So far so good.

    Still on the a_group=0.4 level, we encounter correlation 0.426 between a4 and a5, so we group them (step 4).
    There is no other variable X with correlation min(r_xa4, r_xa5)>a_group, so we should close the group. But, unexpectedly to me, a6 is also included in the group, even though min(r_a6a4, r_a6a5)=0.337 (<0.4).
    Question 1: Why? What's special here? Am I missing something?
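    The closing rule of step 5, as it is read in this post (not taken from the ADP source code), amounts to a single comparison. A minimal Python sketch; the function name may_join is hypothetical, and whether ADP compares signed or absolute correlations is not stated here, so the values are used as given:

```python
def may_join(corrs_with_members, a_group):
    """Return True if the candidate's weakest correlation with the
    current group members still exceeds the threshold a_group."""
    return min(corrs_with_members) > a_group

# Forming the pairs at a_group = 0.4.
print(may_join([0.486], 0.4))  # a3 with (a2): True
print(may_join([0.426], 0.4))  # a5 with (a4): True

# a6 against the group (a4, a5): the post reports only the minimum
# of the two correlations, 0.337, so only that value is passed.
print(may_join([0.337], 0.4))  # False
```

    Under this reading a6 fails the check at the 0.4 level, which is why its inclusion in feat02 is surprising.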

    On the a_group=0.1 level, we could expect to stop the process (see step 3). Still, ADP continues and groups a9 with a10 (r=0.197).
    Question 2: Why? Might there be some lapse in the description?

    Question 3: Why were a1, a7, a8 never included together in a group, even though some correlations among them exceed a_group=0.1?

    Finally, a general question.
    Question 4: What is a sound motivation for extracting a principal component from a group held together by correlations as low as 0.1 or 0.2? In other words, why does the algorithm keep constructing (extracting) features down to an a_group level as low as 0.1?
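    To put a number on Question 4: for two standardized variables with correlation r, the correlation matrix [[1, r], [r, 1]] has eigenvalues 1+r and 1-r, so the first principal component carries (1+|r|)/2 of the total variance. A minimal Python sketch under the assumption that the component is extracted from the correlation matrix (whether ADP does so is not confirmed here); the function name first_pc_share is hypothetical:

```python
def first_pc_share(r):
    """Share of total variance carried by the first principal component
    of two standardized variables with correlation r."""
    # Eigenvalues of [[1, r], [r, 1]] are 1 + r and 1 - r; they sum to 2.
    return (1 + abs(r)) / 2

# The two-variable groups from the post: (a2, a3) and (a9, a10).
for pair, r in (("a2,a3", 0.486), ("a9,a10", 0.197)):
    print(f"({pair}) r = {r}: share = {first_pc_share(r):.4f}")
```

    For the (a9,a10) pair the single component keeps roughly 60% of the variance of the two standardized variables and discards the rest, barely above the 50% that one uncorrelated variable out of two would carry.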

    I hope very much that a statistician/developer who designed ADP will come by to answer my (probably silly) questions.

    ------------------------------
    Kirill Orlov
    ------------------------------
    #SPSSStatistics

    Attachment(s)

    sav
    adp2.sav   99 KB 1 version


  • 2.  RE: ADP, Feature Construction algorithm - a problem

    Posted Tue August 30, 2022 09:39 AM
    PLEASE DO FORWARD my question to the SPSS developers who were responsible for the ADP feature.

    ------------------------------
    Kirill Orlov
    ------------------------------



  • 3.  RE: ADP, Feature Construction algorithm - a problem

    Posted Tue September 06, 2022 04:39 PM
    Thanks for letting us know about this, Kirill. I was unavailable for several days owing to illness.

    One of the statisticians has replicated your findings and has filed an issue against the code. He isn't sure where the problem is, but it will be looked into.

    Thanks.


    ------------------------------
    Rick Marcantonio
    Quality Assurance
    IBM
    ------------------------------