In 1996 a student asked the distinguished statistician Brad Efron, “What are the most important problems in statistics?” Efron named a single problem: variable selection in regression. Statisticians have tackled this problem in many ways and have produced many solutions, but it remains the most important problem. This post reviews the options available in SPSS Statistics for automatic variable selection in the context of generalized linear models. Table 1 lists the procedures available for doing this.
In an ideal world, the correct variables for a model would be known from theory or past practice, or an experimental design would have made the issue moot. Most often, though, especially in the social sciences or business analytics, some variables are known but there is considerable uncertainty about others, and the choice affects the validity of conclusions about the questions of interest regarding variable effects, significance, and predictions.
The purpose of the analysis affects the choices to be made. Classical regression theory posits a set of assumptions that must be correct for the analysis to be valid. The model, i.e., the set of predictors and the functional form of the regression equation, is assumed to be known almost without discussion, and the focus is on the independence of the observations and assumptions about the error terms. Using automatic variable selection methods, however, affects the statistical properties of the results. If the wrong variables are included, or variables that should be included are omitted, the model is not valid. Furthermore, selecting variables based on the same data used for estimation biases the coefficients and the significance tests.
If the primary goal is accurate prediction, including some irrelevant variables might not matter much, but if the goal is testing a theory or explaining a phenomenon, extraneous variables are a problem. They also make the results more sensitive to small changes in the data and bias the estimates, and collecting data on extraneous variables may increase the cost of the analysis. Selection methods vary both in how they affect the estimation statistics and in their effectiveness.
Different algorithms will, of course, choose different sets of variables. SPSS provides many selection algorithms: using several different algorithms may give you a consensus on some variables and, along with your judgment and prior literature, help you to formulate a final model.
The table below shows all the choices of selection methods in SPSS – at least all the ones I could find. The next section contains comments on some of the methods, but it does not attempt to cover all that can be said about these methods. The procedure documentation and the statistical literature should be consulted.
Comments
REGRESSION: The traditional stepwise methods have long been used and are still popular, but they are probably the worst choices: they produce overstated significance levels, and they do not generally find the best model. If this procedure is used anyway, the Stepwise variant would be the best choice within it:
Stepwise. At each step, the independent variable not in the equation that has the smallest probability of F is entered, if that probability is sufficiently small. Variables already in the regression equation are removed if their probability of F becomes sufficiently large.
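To see why stepwise selection overstates significance, consider what happens at the very first step: the method picks the candidate with the strongest apparent relationship to the outcome, so even pure noise variables look impressive after selection. The following toy simulation (plain Python, not SPSS syntax; all data are made up) illustrates the inflation by selecting the best of ten noise predictors:

```python
import random
import statistics

def corr(xs, ys):
    """Pearson correlation of two equal-length lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) *
           sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

random.seed(1)
n, k, reps = 30, 10, 200
best = []
for _ in range(reps):
    y = [random.gauss(0, 1) for _ in range(n)]
    # k candidate predictors, all pure noise: none truly relates to y
    cands = [[random.gauss(0, 1) for _ in range(n)] for _ in range(k)]
    # the first stepwise step keeps the candidate with the largest |r|
    best.append(max(abs(corr(x, y)) for x in cands))

# the selected |r| is biased upward: its average is far above the
# typical |r| of a single prespecified noise variable at n = 30
print(round(statistics.mean(best), 2))
```

The nominal p-value reported for the winning variable pretends it was the only one tried, which is exactly the overstatement described above.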
LINEAR offers best subsets, which checks all or most of the forward stepwise possibilities using the AICC, a criterion that penalizes larger models. The Overfit Prevention Criterion (ASE) is based on the fit (average squared error, or ASE) on the overfit prevention set, a random subsample of approximately 30% of the original dataset that is not used to train the model. Best subsets can be time consuming, and it still produces overstated significance levels. LINEAR offers three entry and removal criteria.
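The small-sample penalty behind AICC can be sketched as follows. This is a common textbook form for linear models (log-likelihood up to an additive constant), shown here only to illustrate how the correction grows as the model gets large relative to the sample; it is not SPSS's internal computation:

```python
import math

def aicc(n, k, sse):
    """Corrected AIC for a linear model with k parameters fit to n
    observations, using the Gaussian log-likelihood up to a constant."""
    aic = n * math.log(sse / n) + 2 * k
    # small-sample correction: blows up as k approaches n, which is
    # how AICC penalizes models that are large relative to the data
    return aic + (2 * k * (k + 1)) / (n - k - 1)

# two fits to n = 50 points: the larger model reduces SSE only a
# little, so the penalty makes the smaller model preferable
print(aicc(50, 3, 120.0), aicc(50, 10, 110.0))
```

Under this criterion a modest drop in error does not justify seven extra parameters, which is the behavior that makes AICC useful for comparing subsets.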
STATS BAYES SELECTVARS provides tests for generalized linear models using Bayesian model averaging. It shows probabilities that coefficients are nonzero and patterns of variable inclusion in the top five models. The output consists of three tables and five optional plots to help in choosing what to include. It supports split files but not weights.
The STATS RELIMP command allows for forced variables. Its importance metrics are:
- Shapley Value: the incremental R2 for the variable, averaged over all models (labelled lmg in the output)
- First: the R2 for the variable when entered first
- Last: the incremental R2 when the variable is entered last
- Beta Sq: the square of the standardized coefficient; these values would sum to the R2 if the independent variables were uncorrelated
- Pratt: the standardized coefficient times the correlation
- CAR: marginal correlations after adjusting for correlations among the regressors
RELIMP is computationally intensive, so limit it to a dozen or so variables when using the Shapley Value. To handle larger numbers of variables, select variables in batches and then combine the better ones in a final round. The Shapley value criterion has the properties of efficiency (the full R2 is distributed among the variables), symmetry (equal contributors get equal values), additivity, and null player (a zero contribution yields a zero value). No significance statistics are provided, as they would be biased; using a holdout sample addresses this issue. RELIMP has the benefit of showing how variable effects vary with model size.
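The Shapley (lmg) averaging can be made concrete with a toy example. The subset R2 values below are invented numbers standing in for fits of every submodel; the point is only the mechanics of averaging a variable's incremental R2 over all entry orders, and why the shares add up to the full-model R2 (the efficiency property):

```python
from itertools import permutations

# hypothetical R2 for every subset of two predictors a and b
# (made-up values, not from any real regression)
r2 = {frozenset(): 0.0,
      frozenset("a"): 0.40,
      frozenset("b"): 0.30,
      frozenset("ab"): 0.50}

def shapley(var, variables):
    """Average the variable's incremental R2 over all entry orders."""
    orders = list(permutations(variables))
    total = 0.0
    for order in orders:
        entered_before = frozenset(order[:order.index(var)])
        total += r2[entered_before | {var}] - r2[entered_before]
    return total / len(orders)

vals = {v: shapley(v, "ab") for v in "ab"}
# efficiency: the two shares sum to the full-model R2 of 0.50
print(vals, sum(vals.values()))
```

With p predictors there are p! orderings and 2^p submodels to fit, which is why the command becomes expensive beyond a dozen or so variables.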
Tree models aren’t usually thought of as GLMs, but they do select variables based on significance tests or other criteria. STATS CITREE can estimate a regression model finding which subpopulations have significantly different regression coefficients and, hence, which subpopulation variables should be included in the equation or be estimated separately. The terminal nodes table can include the definition of each node in SPSS syntax, so you can easily construct dummy variables representing each terminal node and use them in a regression.
CITREE can produce tree, linear, and logistic models. It has many parameters that can be used to tune the results. The vignettes available via the Vignettes tab in the dialog box go into detail on these and on the resulting statistical properties. The statistical approach ensures that the right-sized tree is grown without additional (post-)pruning or cross-validation.
STATS EARTH simultaneously does variable selection, finds interactions up to a specified maximum order, and finds nonlinearities. It works by repeatedly trying linear splines, so the table of regression coefficients shows the terms of the surviving segments. You can constrain individual variables to enter linearly over their entire domain, which can provide a test for interactions of the significant variables. While the coefficient table can be difficult to read, plots of the individual variable effects show the nonlinearities clearly. One way to use EARTH is to let it determine the best set of terms, then compute the variables in the formula and run a regular linear regression.
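The linear splines EARTH fits are built from hinge functions of the form max(0, x − c) and max(0, c − x). The sketch below evaluates a hypothetical fitted equation with a single knot (the knot location and coefficients are invented for illustration), showing how such terms produce a piecewise linear effect:

```python
def hinge(x, c):
    """Hinge basis max(0, x - c): zero below the knot, linear above."""
    return max(0.0, x - c)

# a hypothetical EARTH-style fit with one knot at x = 5:
# yhat = 2 + 1.5*max(0, x - 5) - 0.8*max(0, 5 - x)
def yhat(x):
    return 2.0 + 1.5 * hinge(x, 5.0) - 0.8 * hinge(5.0, x)

# piecewise linear: slope 0.8 below the knot, slope 1.5 above it
print(yhat(3.0), yhat(5.0), yhat(7.0))
```

Reading a coefficient table full of such terms is awkward, which is why the per-variable effect plots are usually the clearer summary.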
LINEAR_LASSO, unlike ridge regression, drives coefficients to zero, effectively doing variable selection. LINEAR_ELASTIC_NET can also drive coefficients to zero. As with other automatic selection methods, using a holdout sample is necessary to get correct significance levels.
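The reason the lasso zeroes coefficients while ridge does not comes down to the soft-thresholding operator used in its coordinate-wise updates. This is a generic illustration of that operator, not SPSS's implementation:

```python
def soft_threshold(z, lam):
    """Lasso's coordinate-wise update: shrink z toward zero by lam,
    and set it exactly to zero when |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

# weak effects are zeroed out (selected away); strong ones are shrunk
print([soft_threshold(z, 0.5) for z in (-2.0, -0.3, 0.1, 1.2)])
# → [-1.5, 0.0, 0.0, 0.7]
```

Ridge's corresponding update only rescales coefficients toward zero, so they never land exactly on it; the flat zero region above is what performs the selection.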
STATS BORUTAFEATURES: The Boruta method is unusual. It randomly shuffles case values to create shadow variables and judges the real variables by comparing their performance to that of the shadow variables using a random forest algorithm.
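The shadow-variable idea can be sketched without a random forest: shuffle each predictor to break its relation to the outcome while preserving its distribution, then keep only real variables that outperform the best shadow. Here plain |correlation| stands in for the forest's importance measure, and the data are simulated, so this is a conceptual sketch rather than Boruta itself:

```python
import random
import statistics

def corr(xs, ys):
    """Pearson correlation of two equal-length lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    num = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    den = (sum((a - mx) ** 2 for a in xs) *
           sum((b - my) ** 2 for b in ys)) ** 0.5
    return num / den

random.seed(7)
n = 200
real = [random.gauss(0, 1) for _ in range(n)]     # truly related to y
noise = [random.gauss(0, 1) for _ in range(n)]    # unrelated to y
y = [r + random.gauss(0, 0.5) for r in real]

# shadow variables: shuffled copies that keep each distribution
# but destroy any relation to y
shadows = []
for x in (real, noise):
    s = x[:]
    random.shuffle(s)
    shadows.append(s)

# keep real variables whose importance beats the best shadow's
threshold = max(abs(corr(s, y)) for s in shadows)
keep = [name for name, x in [("real", real), ("noise", noise)]
        if abs(corr(x, y)) > threshold]
print(keep)
```

Boruta repeats this comparison over many forest fits and makes a statistical decision per variable, which is part of why the command is time consuming.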
NAIVEBAYES automatically bins scale variables, both dependent and predictor, using the declared measurement levels. A predictor subset size can be specified.
Automatic Selection Methods in SPSS Statistics
The methods in the table are listed in the approximate order in which they appear as you traverse the Analyze menu tree. Some of the methods are in optional modules, and some are extension commands. The extension commands are all free, but many need to be installed via the Extensions > Extension Hub menu.
STATS PERM provides resampling-based tests for simple two-group t tests, ANOVA, and regression. It does not assume normality, and the tests are appropriate even for small datasets where asymptotic properties might not be reliable.
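The logic of a permutation test is easy to show for the two-group case: under the null hypothesis the group labels are arbitrary, so every relabeling of the pooled data is equally likely, and the p-value is the fraction of relabelings at least as extreme as what was observed. A minimal exact version (generic illustration with toy data, not the STATS PERM implementation, which resamples rather than fully enumerating large problems):

```python
from itertools import combinations
from statistics import mean

def perm_test(g1, g2):
    """Exact two-sided permutation test for a difference in group
    means: enumerate every split of the pooled data into groups of
    the original sizes and count splits as extreme as observed."""
    pooled = g1 + g2
    observed = abs(mean(g1) - mean(g2))
    extreme = total = 0
    for idx in combinations(range(len(pooled)), len(g1)):
        a = [pooled[i] for i in idx]
        b = [pooled[i] for i in range(len(pooled)) if i not in idx]
        total += 1
        if abs(mean(a) - mean(b)) >= observed:
            extreme += 1
    return extreme / total

# clearly separated toy groups: only 2 of the 20 possible splits
# reach the observed difference, so p = 2/20
print(perm_test([1.0, 2.0, 3.0], [10.0, 11.0, 12.0]))  # → 0.1
```

Because the reference distribution is built from the data itself, no normality assumption is needed, and the test is exact even at these tiny sample sizes.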
Procedures marked with (*) are currently in QA.
Table 1: Automatic Variable Selection Methods in SPSS Statistics
| Procedure | Command Name | Model Type | Dependent Variables | Variable Selection Methods | Notes | Component |
|---|---|---|---|---|---|---|
| Multivariate Adaptive Regression Splines | STATS EARTH | GLM | scale, categorical | generalized cross validation (variant of AIC) | includes interaction effects | extension |
| Automatic Linear Modeling | LINEAR | regression | scale | forward stepwise, best subsets | criteria: AICC, F, Adj R2, ASE | Base |
| Generalized Boosted Regression | STATS GBM | GLM | categorical, scale | out of bag, cross validation, holdout | variable relative importance table; many distributions available | extension |
| Bayesian Regression Variable Selection (*) | STATS BAYES SELECTVARS | GLM (linear, logit, Poisson, Gamma) | scale | Bayesian Adaptive Sampling | shows patterns for top models; uses JZS or CCH prior | extension |
| Linear | REGRESSION | regression | scale | stepwise, remove, backward, forward | results are biased | Base |
| Permutation Tests (*) | STATS PERM | regression | scale | permutation tests for two-group t tests, ANOVA, and regression | normality not assumed; good when the sample is too small for asymptotic properties to hold | extension |
| Lasso | LINEAR_LASSO | regularized regression | scale | train/test cross validation, MSE | | Base |
| Elastic Net | LINEAR_ELASTIC_NET | regularized regression | scale | train/test cross validation, MSE | includes shrinkage and selection | Base |
| Regression Relative Importance | STATS RELIMP | regression | scale | Shapley value, first, last, beta square, Pratt, marginal correlations | | extension |
| Logistic Regression | LOGISTIC REGRESSION | logistic | dichotomous | forward selection (conditional, likelihood, Wald); backward elimination (conditional, likelihood, Wald) | | Standard |
| Naïve Bayes | NAIVEBAYES | - | categorical | best subset | | Base |
| Tree | TREE | tree | categorical, scale | CHAID, exhaustive CHAID, CRT, QUEST | | Professional |
| Conditional Inference Tree | STATS CITREE | tree (regression, logistic) | categorical, scale | quadratic maximum | many optional parameters | extension |
| C5.0 Decision Tree | STATS C5.0 TREE | tree | categorical, scale | C5.0 | | extension |
| Discriminant | DISCRIMINANT | discriminant | categorical | Wilks' lambda, unexplained variance, Mahalanobis distance, smallest F ratio, Rao's V | | Base |
| Ranfor Estimation and Prediction (random forest) | SPSSINC RANFOR | classification or regression forest | categorical, scale | node impurity reduction or variable importance | | extension |
| Boruta Feature Selection | STATS BORUTAFEATURES | classification or regression forest | categorical, scale | shadow variables | time consuming | extension |