(updated with several improvements 1/5/2024)
I am excited to announce a beta-test version of a substantial new SPSS Statistics extension command for conditional inference classification and regression trees. It provides both estimation and prediction capabilities. Tree models are very intuitive, but they have two major problems. First, typical algorithms tend to overfit and therefore do not generalize well. Cross-validation can assess out-of-sample performance, but the trees may still be suboptimal. Second, trees quickly grow in size, and large trees are difficult to display or understand.
The new procedure, STATS CITREE, which is implemented in R, addresses both of these issues. Details are in the help.
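For readers new to the approach, here is a rough Python sketch of the general idea behind conditional inference trees: a node is split only if some predictor shows a statistically significant association with the dependent variable, which limits both overfitting and tree size. This is an illustration only, not the STATS CITREE code or the R package it uses. The function names, the simple covariance-type statistic, and the Bonferroni adjustment are simplifying assumptions made for the example.

```python
import random


def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Approximate permutation p-value for association between numeric x and y.

    Hypothetical helper: the statistic is |sum((x - mean_x) * (y - mean_y))|.
    """
    rng = random.Random(seed)
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    xc = [v - mean_x for v in x]
    yc = [v - mean_y for v in y]
    observed = abs(sum(a * b for a, b in zip(xc, yc)))

    hits = 0
    y_perm = yc[:]
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # break any real link between x and y
        stat = abs(sum(a * b for a, b in zip(xc, y_perm)))
        if stat >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


def choose_split_variable(predictors, y, alpha=0.05):
    """Return (name or None, adjusted p-values).

    The predictor with the smallest Bonferroni-adjusted p-value is chosen;
    None means no predictor is significant, so the node would not be split.
    """
    m = len(predictors)
    adjusted = {
        name: min(1.0, permutation_p_value(x, y) * m)
        for name, x in predictors.items()
    }
    best = min(adjusted, key=adjusted.get)
    return (best if adjusted[best] <= alpha else None), adjusted


if __name__ == "__main__":
    x = list(range(30))
    predictors = {"signal": x, "noise": [(v * 7) % 5 for v in x]}
    y = [2 * v + 1 for v in x]  # depends on "signal" only
    best, adjusted = choose_split_variable(predictors, y)
    print(best, adjusted)
```

Because growth stops when no test is significant, the significance level acts as the main control on tree size, rather than pruning after the fact.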
The procedure also provides linear and logistic regression equation estimates, with tests for where the coefficients differ across the subpopulations defined by the tree, which helps in selecting control variables.
The procedure, which has the usual dialog box and syntax in SPSS style, has some novel features.
- You can estimate a tree and then return to the estimated tree dialog to adjust plots and other output characteristics without reestimating, even if you didn’t save the model.
- Trees can be time-consuming to estimate, and this is the first extension procedure I am aware of that can take advantage of multiple CPUs to improve performance.
- The dialog box has a Vignettes tab that provides more extensive help than usual for extensions. I suggest that you start by reading the dialog box help vignette.
The procedure is not yet on the SPSS Statistics Extension Hub. To try it out, download STATS_CITREE.spe and install it via Extensions > Install local extension bundle. It will appear under Analyze > Classify.
This is a beta version with a lot of new code, so there are, doubtless, bugs. Please report what you find and add any comments or suggestions either via this forum or directly to me at jkpeck@gmail.com. Be sure to identify your platform and Statistics version.
The procedure has been tested on SPSS Statistics 29.0.2.0 and 30. It is expected to work on some older versions as long as the R Essentials are installed, so I set the minimum SPSS Statistics version to 27.
Known Problems:
- If you do predictions, the user interface may lock up for a minute or so after the procedure completes. The status bar will show Running INSERT CHART, even though that step has actually finished. The interface then returns to normal.
- If you choose node probabilities as the predicted value and you have two or more dependent variables, the values in the probabilities dataset will be correct, but the variable names will be wrong. This is a bug in the underlying R package.
#IBMChampion