SPSS Statistics

Automatic Variable Selection in SPSS Statistics

By Jon Peck posted Wed January 14, 2026 09:34 AM

  

In 1996 a student asked the distinguished statistician Brad Efron, “What are the most important problems in statistics?” Efron named a single problem: variable selection in regression. Statisticians have tackled this problem in many ways and have come up with many solutions, but it remains the most important problem. This post reviews the options available in SPSS Statistics for automatic variable selection in the context of generalized linear models. Table 1 lists the procedures available in SPSS Statistics for doing this.

In an ideal world, the correct variables for a model would be known from theory or past practice, or an experimental design would have made the issue moot. Most often, however, especially in the social sciences and business analytics, some variables are known but there is considerable uncertainty about others, and the choice affects the validity of conclusions about variable effects, significance, and predictions.

The purpose of the analysis affects the choices to be made. Classical regression theory posits a set of assumptions that must hold for the analysis to be valid. The model, i.e., the set of predictors and the functional form of the regression equation, is assumed to be known, almost without discussion, and the focus is on the independence of the observations and assumptions about the error terms. Automatic variable selection methods, however, affect the statistical properties of the results. If the wrong variables are included, or variables that should be included are omitted, the model is not valid. Furthermore, selecting variables based on the same data used for estimation biases both the coefficients and the significance tests.

If the primary goal is accurate prediction, including some irrelevant variables might not matter much, but if the goal is to test a theory or explain a relationship, extraneous variables are a problem. They also increase the sensitivity of results to small changes in the data and bias the results, and collecting data on extraneous variables may increase the cost of the analysis. Selection methods vary in how they affect the estimation statistics and in their effectiveness.

Different algorithms will, of course, choose different sets of variables.  SPSS provides many selection algorithms: using several different algorithms may give you a consensus on some variables and, along with your judgment and prior literature, help you to formulate a final model.

The table below shows all the choices of selection methods in SPSS – at least all the ones I could find.  The next section contains comments on some of the methods, but it does not attempt to cover all that can be said about these methods.  The procedure documentation and the statistical literature should be consulted.

Comments

REGRESSION: The traditional stepwise methods have long been used and are still popular, but they are probably the worst choices: they give overstated significance levels, and they generally do not find the best model. If stepwise selection is used anyway, the Stepwise variant would be the best choice in that procedure:
Stepwise. At each step, the independent variable not in the equation that has the smallest probability of F is entered, if that probability is sufficiently small. Variables already in the regression equation are removed if their probability of F becomes sufficiently large. 
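The overstated significance is easy to demonstrate outside SPSS: run a stepwise-style entry test on pure noise, and some predictor will usually look "significant." A minimal Python sketch (not SPSS code; the data are made up):

```python
# Demonstration: the forward-stepwise entry test (smallest p-value among
# candidates) on pure noise still "finds" significant predictors, which
# illustrates why stepwise p-values are overstated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 100, 50
X = rng.standard_normal((n, p))   # 50 predictors, all pure noise
y = rng.standard_normal(n)        # outcome unrelated to every predictor

# First forward step: pick the smallest simple-regression p-value
pvals = [stats.pearsonr(X[:, j], y)[1] for j in range(p)]
best = min(pvals)
print(f"smallest p-value among 50 noise predictors: {best:.4f}")
# With 50 independent tests, P(min p < .05) = 1 - 0.95**50 ≈ 0.92, so
# stepwise will usually enter at least one spurious variable.
```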

LINEAR offers best subsets, which checks all or most of the forward stepwise possibilities using the AICC, a criterion that penalizes larger models. The Overfit Prevention Criterion is based on the fit (average squared error, or ASE) of the overfit prevention set, a random subsample of approximately 30% of the original dataset that is not used to train the model. Best subsets can be time consuming, and it still produces overstated significance levels. LINEAR offers three entry and removal criteria.
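The best-subsets idea can be sketched generically: score every subset of predictors with the corrected AIC and keep the minimizer. This Python sketch uses simulated data and a standard least-squares AICc formula; it illustrates the criterion, not the LINEAR procedure's implementation:

```python
# Exhaustive best-subsets search scored by AICc, the small-sample-corrected
# AIC that penalizes larger models more heavily.
import itertools
import numpy as np

rng = np.random.default_rng(1)
n = 80
X = rng.standard_normal((n, 4))
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + rng.standard_normal(n)  # x0, x2 matter

def aicc(subset):
    """AICc for an OLS fit on the given predictor columns."""
    Z = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    rss = float(np.sum((y - Z @ beta) ** 2))
    k = Z.shape[1] + 1                     # coefficients + error variance
    return n * np.log(rss / n) + 2 * k + 2 * k * (k + 1) / (n - k - 1)

subsets = [s for r in range(1, 5) for s in itertools.combinations(range(4), r)]
best = min(subsets, key=aicc)
print("best subset by AICc:", best)        # expected to recover x0 and x2
```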

STATS BAYES SELECTVARS provides tests for generalized linear models using Bayesian model averaging.  It shows probabilities that coefficients are nonzero and patterns of variable inclusion in the top five models.  The output consists of three tables and five optional plots to help in choosing what to include.  It supports split files but not weights.
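The key output of Bayesian model averaging, the probability that each coefficient is nonzero, can be sketched with the common BIC approximation to posterior model probabilities. The SPSS extension uses proper priors and adaptive sampling; this Python sketch on simulated data only illustrates the idea of averaging inclusion over models:

```python
# Bayesian model averaging sketch: weight every submodel by its
# (BIC-approximated) posterior probability, then sum the weights of the
# models containing each variable to get its inclusion probability.
import itertools
import numpy as np

rng = np.random.default_rng(2)
n, p = 120, 4
X = rng.standard_normal((n, p))
y = 1.5 * X[:, 1] + rng.standard_normal(n)           # only x1 matters

def bic(subset):
    Z = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    rss = float(np.sum((y - Z @ beta) ** 2))
    return n * np.log(rss / n) + Z.shape[1] * np.log(n)

models = [s for r in range(p + 1) for s in itertools.combinations(range(p), r)]
bics = np.array([bic(m) for m in models])
w = np.exp(-0.5 * (bics - bics.min()))
w /= w.sum()                                          # posterior model weights
incl = [sum(wt for m, wt in zip(models, w) if j in m) for j in range(p)]
for j, pr in enumerate(incl):
    print(f"P(x{j} in model) = {pr:.3f}")
```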

The STATS RELIMP command allows for forced variables.  Its importance metrics are:

·        Shapley Value: the incremental R2 for the variable, averaged over all models (labelled lmg in the output)

·        First: the R2 for the variable when entered first

·        Last: the incremental R2 when the variable is entered last

·        Beta Sq: the square of the standardized coefficient. These values would sum to the R2 if the independent variables were uncorrelated.

·        Pratt: the standardized coefficient times the correlation

·        CAR: marginal correlations after adjusting for correlations among the regressors

RELIMP is computationally intensive, so you should limit it to a dozen or so variables if using the Shapley Value. Selecting variables in batches and then combining the better variables in a final round is one way to handle larger numbers of variables. The Shapley value criterion has the properties of efficiency (full distribution), symmetry (equal contributors get equal values), additivity, and null player (zero contribution yields a zero value). No significance statistics are provided, as they would be biased; using a holdout sample addresses this issue. RELIMP has the benefit of showing how variable effects vary with model size.
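The Shapley (lmg) metric above can be computed directly for a small problem: average each variable's incremental R2 over every ordering in which it can enter. A Python sketch on made-up correlated data, illustrating the computation and the efficiency property:

```python
# Shapley/lmg importance: each variable's incremental R2, averaged over all
# p! orderings. By the efficiency property, the values sum to the full R2.
import itertools
import numpy as np

rng = np.random.default_rng(3)
n = 200
x0 = rng.standard_normal(n)
x1 = 0.7 * x0 + rng.standard_normal(n)      # correlated with x0
x2 = rng.standard_normal(n)                 # irrelevant
X = np.column_stack([x0, x1, x2])
y = x0 + x1 + rng.standard_normal(n)

def r2(subset):
    if not subset:
        return 0.0
    Z = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return 1.0 - (y - Z @ beta).var() / y.var()

p = X.shape[1]
lmg = np.zeros(p)
orders = list(itertools.permutations(range(p)))
for order in orders:
    entered = []
    for j in order:
        before = r2(tuple(entered))
        entered.append(j)
        lmg[j] += r2(tuple(entered)) - before
lmg /= len(orders)
print("lmg values:", np.round(lmg, 3), " sum:", round(float(lmg.sum()), 3))
```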

Tree models aren’t usually thought of as GLMs, but they do select variables based on significance tests or other criteria.  STATS CITREE can estimate a regression model finding which subpopulations have significantly different regression coefficients and, hence, which subpopulation variables should be included in the equation or be estimated separately.  The terminal nodes table can include the definition of each node in SPSS syntax, so you can easily construct dummy variables representing each terminal node and use them in a regression.

CITREE can produce tree, linear, and logistic models.  It has many parameters that can be used to tune the results.  The vignettes available via the Vignettes tab in the dialog box go into detail on these and the resulting statistical properties.  The statistical approach ensures that the right-sized tree is grown without additional (post-)pruning or cross-validation.

STATS EARTH simultaneously does variable selection, finds interactions up to a specified maximum order, and finds nonlinearities.  It works by trying linear splines repeatedly, so the table of regression coefficients shows the terms of the surviving segments.  You can constrain individual variables to enter linearly over their entire domain; doing that can provide a test for interactions of the significant variables.  While the coefficient table can be difficult to read, plots of the individual variable effects show the nonlinearities clearly.  One way to use EARTH is to let it determine the best set of terms, then compute the variables in the formula and run a regular linear regression.
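The linear-spline terms that MARS-style methods search over are hinge functions, max(0, x - c): a pair of mirrored hinges at a knot c lets the fitted slope change at c. A tiny Python sketch (illustrating the basis, not the EARTH search itself):

```python
# Hinge (linear spline) basis: the kinked function y = |x| is represented
# exactly by an intercept plus the two mirrored hinges at knot c = 0,
# max(0, x) and max(0, -x).
import numpy as np

x = np.linspace(-2, 2, 41)
y = np.abs(x)                       # a function with a kink at 0
B = np.column_stack([np.ones_like(x), np.maximum(0, x), np.maximum(0, -x)])
coef, *_ = np.linalg.lstsq(B, y, rcond=None)
print(np.round(coef, 6))            # recovers [0, 1, 1]: slope 1 on each side
```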

LINEAR_LASSO, unlike ridge regression, drives coefficients to zero, effectively doing variable selection.  LINEAR_ELASTIC_NET can also drive coefficients to zero.  As with other automatic selection methods, a holdout sample is necessary to get correct significance levels.
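The selection effect of the L1 penalty can be seen in a generic numpy sketch (not the LINEAR_LASSO procedure; data and penalty value are made up): with enough penalty, coefficients of irrelevant predictors are set exactly to zero, while ridge regression would only shrink them.

```python
# Lasso via coordinate descent for (1/2n)||y - Xb||^2 + alpha*||b||_1.
# The soft-threshold update zeroes weak coefficients exactly.
import numpy as np

rng = np.random.default_rng(4)
n, p = 200, 10
X = rng.standard_normal((n, p))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.standard_normal(n)  # only x0, x1 matter

def lasso(X, y, alpha, iters=500):
    n, p = X.shape
    b = np.zeros(p)
    for _ in range(iters):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]          # partial residual
            rho = X[:, j] @ r / n
            z = X[:, j] @ X[:, j] / n
            b[j] = np.sign(rho) * max(abs(rho) - alpha, 0.0) / z
    return b

b = lasso(X, y, alpha=0.3)
print("nonzero coefficients at indices:", np.flatnonzero(b))
```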

STATS BORUTAFEATURES: The Boruta method is unusual.  It randomly shuffles case values to create shadow variables and judges the real variables by comparing their performance to the shadow variables using a random forest algorithm.
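The shadow-variable idea can be sketched in a few lines of Python. Boruta proper iterates and uses random forest importances with a statistical test; here a simple absolute-correlation importance stands in to keep the sketch self-contained, and the data are made up:

```python
# Boruta-style shadow comparison: shuffle each column to destroy its
# relation to y, then keep only real variables whose importance beats the
# best-performing shadow (which by construction is pure luck).
import numpy as np

rng = np.random.default_rng(5)
n, p = 300, 6
X = rng.standard_normal((n, p))
y = 2.0 * X[:, 0] + 1.0 * X[:, 3] + rng.standard_normal(n)  # x0, x3 matter

shadows = rng.permuted(X, axis=0)     # each column shuffled independently

def importance(M):
    return np.abs([np.corrcoef(M[:, j], y)[0, 1] for j in range(M.shape[1])])

real_imp, shadow_imp = importance(X), importance(shadows)
threshold = shadow_imp.max()          # the best any shuffled variable did
kept = np.flatnonzero(real_imp > threshold)
print("variables beating the best shadow:", kept)
```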

NAIVEBAYES automatically bins scale variables, both dependent and predictor, using the declared measurement levels.  A predictor subset size can be specified.

Automatic Selection Methods in SPSS Statistics

The methods in the table are listed in the approximate order in which they appear as you traverse the Analyze menu tree.  Some of the methods are in optional modules or are extension commands.  The extension commands are all free, but many need to be installed using the Extensions > Extension Hub menu.

STATS PERM provides resampling-based tests for two-group t tests, ANOVA, and regression.  It does not assume normality, and the tests are appropriate even for small datasets where asymptotic properties might not be reliable.
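The resampling idea is generic: compare the observed statistic with its distribution under random relabeling of the cases. A Python sketch of a two-group permutation test on made-up data (illustrating the idea, not the STATS PERM implementation):

```python
# Permutation test for a two-group mean difference: shuffle the group
# labels many times and see how often the shuffled difference is at least
# as large as the observed one.
import numpy as np

rng = np.random.default_rng(6)
a = np.array([4.1, 5.0, 6.2, 5.5, 4.8])      # made-up group samples
b = np.array([6.0, 6.8, 7.1, 5.9, 7.5])
pooled = np.concatenate([a, b])
observed = b.mean() - a.mean()

reps = 10_000
count = 0
for _ in range(reps):
    perm = rng.permutation(pooled)            # random relabeling
    count += (perm[5:].mean() - perm[:5].mean()) >= observed
p_value = (count + 1) / (reps + 1)            # add-one for validity
print(f"one-sided permutation p = {p_value:.4f}")
```

No normality assumption enters anywhere: the null distribution is built entirely from the data by reshuffling, which is why such tests remain valid for small samples.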

Procedures marked with (*) are currently in QA.

Table 1: Automatic Variable Selection Methods in SPSS Statistics

| Procedure | Command Name | Model Type | Dependent Variables | Variable Selection Methods | Notes | Component |
| --- | --- | --- | --- | --- | --- | --- |
| Multivariate Adaptive Regression Splines | STATS EARTH | GLM | scale, categorical | generalized cross validation (a variant of AIC) | includes interaction effects | extension |
| Automatic Linear Modeling | LINEAR | regression | scale | forward stepwise, best subsets (criteria: AICC, F, adjusted R2, ASE) | | Base |
| Generalized Boosted Regression | STATS GBM | GLM | categorical, scale | out of bag, cross validation, holdout | Variable Relative Importance table; many distributions available | extension |
| Bayesian Regression Variable Selection (*) | STATS BAYES SELECTVARS | GLM (linear, logit, Poisson, Gamma) | scale | Bayesian adaptive sampling | shows patterns for top models; uses JZS or CCH prior | extension |
| Linear | REGRESSION | regression | scale | stepwise, remove, backward, forward | results are biased | Base |
| Permutation Tests (*) | STATS PERM | regression | scale | permutation tests for two-group t tests, ANOVA, and regression | normality not assumed; good when the sample is too small for asymptotic properties to hold | extension |
| Lasso | LINEAR_LASSO | regularized regression | scale | train/test, cross validation MSE | | Base |
| Elastic Net | LINEAR_ELASTIC_NET | regularized regression | scale | train/test, cross validation MSE | includes shrinkage and selection | Base |
| Regression Relative Importance | STATS RELIMP | regression | scale | Shapley value, first, last, beta square, Pratt, marginal correlations | | extension |
| Logistic Regression | LOGISTIC REGRESSION | logistic | dichotomous | forward selection (conditional, likelihood, or Wald), backward elimination (conditional, likelihood, or Wald) | | Standard |
| Naïve Bayes | NAIVEBAYES | - | categorical | best subset | | Base |
| Tree | TREE | tree | categorical, scale | CHAID, exhaustive CHAID, CRT, QUEST | | Professional |
| Conditional Inference Tree | STATS CITREE | tree, regression, logistic | categorical, scale | quadratic, maximum | many optional parameters | extension |
| C5.0 Decision Tree | STATS C5.0 TREE | tree | categorical, scale | C5.0 | | extension |
| Discriminant | DISCRIMINANT | discriminant | categorical | Wilks' lambda, unexplained variance, Mahalanobis distance, smallest F ratio, Rao's V | | Base |
| Ranfor Estimation and Prediction (random forest) | SPSSINC RANFOR | classification or regression forest | categorical, scale | node impurity reduction or variable importance | | extension |
| Boruta Feature Selection | STATS BORUTAFEATURES | classification or regression forest | categorical, scale | shadow variables | time consuming | extension |
