Introduction to robust model selection in linear regression
Mallows’ $C_p$ is widely used for the selection of a model from among many
non-nested regression models. However, the statistic is a function of two
residual sums of squares; it is an aggregate statistic, a function of all the
observations. Thus $C_p$ suffers from the well-known lack of robustness of least
squares and provides no evidence of whether or how individual observations
or unidentified structure are affecting the choice of model. In this part of the toolbox
we use the robustness of the data-driven flexible trimming provided by the
forward search to choose regression models in the presence of outliers. Our
tools are new distributional results on the added t-test (Atkinson and Riani, 2002) and on $C_p$
(Riani and Atkinson, 2010) in the forward search and a powerful
new version of the $C_p$ plot, which we call a generalized candlestick plot.
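
As a point of reference, the aggregate statistic described above is usually written $C_p = R_p/s^2 - n + 2p$, where $R_p$ is the residual sum of squares of a submodel with $p$ parameters, $s^2$ is the residual mean square of the full model and $n$ is the number of observations. The Python sketch below computes it by ordinary least squares using NumPy only. The function names `rss` and `mallows_cp` and the simulated data are illustrative assumptions and are not part of the toolbox; the sketch shows the classical, non-robust statistic, not the forward search version.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares from an ordinary least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def mallows_cp(X_full, y, cols):
    """Mallows' C_p for the submodel using the columns `cols` of X_full.

    X_full must already contain the intercept column if one is wanted.
    """
    n, p_full = X_full.shape
    p = len(cols)
    s2 = rss(X_full, y) / (n - p_full)  # residual mean square of the full model
    return rss(X_full[:, cols], y) / s2 - n + 2 * p

# Simulated data (hypothetical): only the first explanatory variable matters.
rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ np.array([1.0, 2.0, 0.0, 0.0]) + rng.normal(size=n)

cp_full = mallows_cp(X, y, [0, 1, 2, 3])  # full model
cp_sub = mallows_cp(X, y, [0, 1])         # intercept plus first variable
```

For the full model the statistic equals the number of parameters exactly, which gives a convenient check on the computation. Both sums of squares are computed from all $n$ observations, so a single gross outlier can alter them; this is the lack of robustness noted above.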