
Technical introduction to Robust methods in regression

Notation

Consider the usual regression model

y_i = x_i^T \beta + u_i, \qquad i=1, \dots, n \qquad (1)

where the u_i are random errors which have common variance \sigma^2 and are independent of the covariates x_i. \beta, the location component of the linear model, is the parameter of interest; \sigma is the so-called scale component of the linear model.

Given an estimator of \beta, say \hat \beta, the residuals are defined as

r_i(\hat \beta)=y_i- x_i^T \hat \beta

The traditional least squares estimate of \beta is denoted by

\hat \beta = \hat \beta_{LS}=\hat \beta_{LS}(X,y)=(X'X)^{-1}X'y

We are concerned with the case where a certain proportion of the observations may not follow the model above. Traditional robust estimators attempt to limit the influence of outliers by replacing, in the estimation of \beta, the squared residuals with a less rapidly increasing loss function, or with a bounded function \rho of the residuals themselves.

 

The \rho function

In the literature of robust statistics, a \rho function denotes a function such that

  1. \rho(x) is a nondecreasing function of |x|
  2. \rho(0)=0
  3. \rho(x) is increasing for x>0 such that \rho(x) < \rho(\infty)

 

Perhaps the most popular choice for the \rho function is Tukey's biweight (bisquare) function

\rho(x) = \begin{cases} \frac{x^2}{2}-\frac{x^4}{2c^2}+\frac{x^6}{6c^4} & \mbox{if} \quad |x| \leq c \\ \\ \frac{c^2}{6} & \mbox{if} \quad |x| > c, \\ \end{cases}

where c>0 is a tuning constant which is linked to the breakdown point of the estimator of \beta.

 

 

Function \rho for Tukey's biweight is implemented in routine TBrho.
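For illustration, here is a minimal MATLAB sketch of the biweight \rho function; the name tbrho_sketch is ours, and the code is not the FSDA routine TBrho itself (which also performs input checking and supports further options):

function rhox = tbrho_sketch(x, c)
% Minimal sketch of Tukey's biweight rho function (illustrative only,
% not the FSDA routine TBrho). x: (scaled) residuals, c: tuning constant.
in = (abs(x) <= c);                                  % indicator of |x| <= c
rhox = (x.^2/2 - x.^4/(2*c^2) + x.^6/(6*c^4)) .* in ...
       + (c^2/6) .* ~in;                             % constant c^2/6 for |x| > c
end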

 

 

The \psi function

In the literature of robust statistics, a \psi (psi) function denotes the derivative of a \rho function (\psi = \rho') such that

  1. \psi is odd and \psi(x) \geq 0 for x \geq 0

Function \psi for Tukey's biweight

\psi(x) = \begin{cases} x- \frac{2x^3}{c^2}+\frac{x^5}{c^4} = x \Big[ 1- \Big( \frac{x}{c}\Big)^2 \Big]^2 & \mbox{if} \quad |x| \leq c \\ \\ 0 & \mbox{if} \quad |x| > c, \\ \end{cases}

is implemented in routine TBpsi.
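Analogously, a minimal MATLAB sketch of the biweight \psi function (again illustrative only, not the FSDA routine TBpsi):

function psix = tbpsi_sketch(x, c)
% Minimal sketch of Tukey's biweight psi function (illustrative only,
% not the FSDA routine TBpsi).
psix = x .* (1 - (x/c).^2).^2 .* (abs(x) <= c);      % zero for |x| > c
end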

 

Regression, scale and affine equivariance

We say that an estimate \tilde \beta of \beta is regression, scale and affine equivariant if it satisfies the following three properties:

\begin{array}{llll} \tilde \beta (X, y + X \gamma) & = & \tilde \beta(X,y)+ \gamma & \quad \textrm{for all} \quad \gamma \in R^p \\ \tilde \beta (X, \lambda y) & = & \lambda \tilde \beta(X,y) & \quad \textrm{for all} \quad \lambda \in R \\ \tilde \beta(X A, \lambda y) & = & A^{-1} \tilde\beta(X,\lambda y) & \quad \textrm{for all} \quad A \quad \textrm{with} \quad \det(A) \neq 0 \end{array}

These are desirable properties because they allow us to know how the estimate changes under linear transformations of the data.

 

The regression M-estimate of location

The regression M-estimate of location (say \hat \beta_M) is the value that minimizes the following objective function

\sum_{i=1}^{n} \rho \Bigg( \frac{ r_i(\hat\beta_M)}{\sigma} \Bigg) = \min. \qquad (2)

 

Differentiation of equation (2) with respect to \beta gives

\sum_{i=1}^{n} \psi \Bigg( \frac{ r_i(\hat\beta_M)}{\sigma} \Bigg) x_i = 0 \qquad (3)

 

Remark 1: if function \psi is monotone, the solutions to this equation are called monotone regression M-estimates, while if \psi is redescending the solutions to (3) are called redescending regression M-estimates.

Remark 2: it is possible to show that regression M-estimates have the following asymptotic distribution:

\hat \beta_M \approx N \Bigg( \beta, \sigma^2 \frac{E \psi (u/\sigma)^2}{[E \psi' (u/\sigma)]^2} (X'X)^{-1} \Bigg)

Thus the approximate covariance matrix of an M-estimate differs only by a constant factor from that of the LS estimate. Hence, its efficiency under normally distributed errors does not depend on X.
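Equation (3) is typically solved by iteratively reweighted least squares (IRLS): writing \psi(t) = t \, w(t) with w(t)=\psi(t)/t turns (3) into the normal equations of a weighted least squares fit. The following MATLAB sketch, which assumes \sigma known and uses tbpsi_sketch above, is illustrative only and is not an FSDA routine (in particular, a real implementation would use a robust starting value rather than LS):

function beta = mreg_irls_sketch(X, y, c, sigma)
% Minimal IRLS sketch of a regression M-estimate with known scale sigma
% (illustrative only, not an FSDA routine).
beta = X \ y;                                        % LS starting value
for iter = 1:200
    r = (y - X*beta) / sigma;                        % scaled residuals
    w = ones(size(r));                               % w_i = psi(r_i)/r_i, limit 1 at 0
    nz = (r ~= 0);
    w(nz) = tbpsi_sketch(r(nz), c) ./ r(nz);
    sw = sqrt(w);                                    % weighted LS step:
    betanew = (X .* sw) \ (y .* sw);                 % min sum_i w_i (y_i - x_i'b)^2
    if norm(betanew - beta) <= 1e-10 * (1 + norm(beta))
        beta = betanew;
        return
    end
    beta = betanew;
end
end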

 

The regression M-estimate of scale

In equation (2) it is assumed that \sigma is known. However, when this condition is not fulfilled, it is necessary to use an auxiliary robust scale estimate (\hat \sigma) to make \hat \beta_M scale equivariant.

An M-estimator of scale \hat \sigma_M is defined as the solution of the following equation

\frac{1}{n} \sum_{i=1}^n \rho_1 \Bigg( \frac{r_i}{\hat \sigma_M} \Bigg)=K, \qquad (4)

where

r_i = r_i(\hat \beta_M), \hat \sigma_M = \hat \sigma(r_1(\hat \beta_M), \dots, r_n(\hat \beta_M))

and K is a constant which is linked to the breakdown point of the scale estimator.

It is worthwhile to notice that in this equation we have used the symbol \rho_1, because the \rho function which is used to obtain the scale estimator is not necessarily the same as that used in (2).

It is interesting to notice that if in equation (4) we choose \rho_1=I(|t|>1) and K=1/2, we obtain as a solution for the M-estimator of scale the median of the absolute values of the residuals. In general, if \rho_1=I(|t|>1), we obtain as a solution the h-th order statistic of the |r|_i with h=n-[nK].
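A standard way to solve equation (4) numerically is the fixed-point iteration \hat \sigma_{k+1}^2 = \hat \sigma_k^2 \cdot \frac{1}{nK} \sum_i \rho_1(r_i/\hat \sigma_k). A minimal MATLAB sketch follows (illustrative only, not the FSDA routine Mscale); here \rho_1 is taken to be the unnormalized biweight above, so K must be chosen smaller than \rho_1(\infty)=c^2/6:

function sigma = mscale_sketch(r, c, K)
% Minimal fixed-point sketch of the M-estimate of scale solving
% (1/n) * sum rho1(r_i/sigma) = K (illustrative only, not FSDA's Mscale).
sigma = median(abs(r)) / 0.6745;                     % normalized MAD as start
for iter = 1:100
    signew = sigma * sqrt(mean(tbrho_sketch(r/sigma, c)) / K);
    if abs(signew/sigma - 1) < 1e-10
        sigma = signew;
        return
    end
    sigma = signew;
end
end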

 

The regression S-estimates and the LMS estimate

If we take the minimum value of \hat \sigma_M which satisfies equation (4), we obtain the so-called S-estimate of scale (\hat \sigma_S) and the associated estimate of the vector of regression coefficients (\hat \beta_S).

The name S-estimator comes from the fact that it is derived from a scale statistic in an implicit way. More formally,

\hat \beta_S = \arg \min \limits_{\beta} \; \hat \sigma_M(r(\beta))

where

r_i = r_i(\hat \beta_S), \hat \sigma_S = \hat \sigma(r_1(\hat \beta_S), \dots, r_n(\hat \beta_S))

Because an S-estimate is also an M-estimate, it follows that the asymptotic distribution of an S-estimate is the same as that of an M-estimate. To see this, note that \hat \beta_S satisfies the following inequality:

\sum_{i=1}^n \rho \Bigg( \frac{r_i(\hat \beta_S)}{\hat \sigma_S}\Bigg) \leq \sum_{i=1}^n \rho \Bigg( \frac{r_i(\tilde \beta)}{\hat \sigma_S}\Bigg)

for all \tilde \beta \in R^p.

The regression S-estimate of location (\hat \beta_S) and scale (\hat \sigma_S) can be found using routine Sreg. For each elemental subset which is extracted (and which satisfies certain conditions), the minimum value of the scale is computed by calling routine Mscale. For the subsets with the 5 best (option bestr) minimum values of the scale, a series of refining steps (C-steps) is performed. A crude sketch of this subsampling scheme is given below.
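The following MATLAB sketch conveys only the basic idea behind Sreg (illustrative: the real routine adds the conditions on the subsets, the C-steps, and a consistency factor for the scale):

function [beta, sigma] = sreg_sketch(X, y, c, K, nsamp)
% Crude subsampling sketch of the regression S-estimate (illustrative only,
% not the FSDA routine Sreg). Uses mscale_sketch from above.
[n, p] = size(X);
sigma = Inf;  beta = zeros(p, 1);
for j = 1:nsamp
    idx = randperm(n, p);                            % elemental subset of size p
    b = X(idx, :) \ y(idx);                          % exact fit to the subset
    s = mscale_sketch(y - X*b, c, K);                % M-scale of its residuals
    if s < sigma                                     % keep the candidate with
        sigma = s;  beta = b;                        % the smallest scale
    end
end
end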

 

Remark: if we choose as \rho function \rho(t)=I(|t|>1) and in equation (4) we put K=1/2, we minimize the median of the squares (absolute values) of the residuals and we therefore obtain the Least Median of Squares (LMS) estimator.

In symbols, call

|r|_{(1)} \leq |r|_{(2)} \leq \dots \leq |r|_{(n)}

the ordered absolute values of the residuals; then \hat \beta_{LMS} is defined as

\hat \beta_{LMS}= \arg \min \limits_{\beta} \; \textrm{med} \; |r(\beta)|_i

\hat \beta_{LMS} can be computed using routine LXS (option lms=1).

More generally, for a generic K, a solution \hat \sigma of equation (4) is the h-th order statistic of the |r|_i (that is, |r|_{(h)}), with h=n-[nK].
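In code, the LMS objective for a candidate \beta is simply the median of the absolute residuals; the search over elemental-subset candidates (performed by LXS) is analogous to the S-estimator sketch above. A hypothetical one-liner:

lmscrit = @(b) median(abs(y - X*b));                 % med_i |r_i(beta)|, to be minimized over beta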

 

The regression MM-estimate of location

The MM-regression estimator \hat \beta_{MM} is defined as any local minimum of the following function f

f(\hat \beta_{MM})=\frac{1}{n} \sum_{i=1}^n \rho_2 \Big(\frac{r_i}{\hat \sigma} \Big)

where \rho_2 is possibly another \rho function. Function f is minimized with respect to \beta for fixed \hat \sigma. Among the possible local minima which have been found, we choose the one for which f is smallest. In this equation \hat \sigma is any scale estimator satisfying equation (4); it is common, however, to use \hat \sigma_S (the minimum value).

The consistency and asymptotic distribution of MM-estimates when the observed data follow the central model (1) has been studied by Yohai (1987) for the case of random covariates, and by Salibian-Barrera (2006) for fixed designs. Consistency and asymptotic distribution of S-estimators has been studied by Rousseeuw and Yohai (1984), Davies (1990) and Salibian-Barrera (2006).

\hat \beta_{MM} can be computed using routines MMreg and MMregcore. More precisely, MMregcore assumes that the user supplies an estimate of \sigma; function MMreg, on the other hand, uses function Sreg to compute preliminary estimates of \sigma and \beta.
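Combining the sketches above gives a minimal illustration of the two MM stages. The tuning constants are assumptions for the biweight (c = 1.547 with K = \rho(\infty)/2 gives a 50% breakdown S-scale; c = 4.685 gives roughly 95% efficiency at the normal model), and, unlike MMreg, the IRLS sketch starts from LS rather than from \hat \beta_S, so the snippet is for exposition only:

c0 = 1.547;                                          % biweight tuning for 50% breakdown
K  = c0^2 / 12;                                      % K = rho(inf)/2, since rho(inf) = c0^2/6
[betaS, sigmaS] = sreg_sketch(X, y, c0, K, 1000);    % S-step: high-breakdown beta and scale
betaMM = mreg_irls_sketch(X, y, 4.685, sigmaS);      % M-step with fixed S-scale, efficient rho_2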

 

The regression L-estimate of scale and the LTS estimate

An alternative to using an M-scale is to use an L-estimate of scale. Then we can define scale estimates as linear combinations of the |r|_{(i)} in one of the two following forms:

\hat \sigma_L = \sum_{i=1}^n a_i |r|_{(i)} \quad \textrm{or} \quad \hat \sigma_L = \Bigg(\sum_{i=1}^n a_i |r|_{(i)}^2 \Bigg)^{1/2}

A particular version of the second form is the \alpha-trimmed squares scale, where \alpha is in the interval (0,1) and n-h=[n \alpha] of the largest absolute residuals are trimmed. If \alpha=1/2, the corresponding regression estimate is called the least trimmed squares (LTS) estimator \hat \beta_{LTS}. More formally,

\hat \beta_{LTS} = \arg \min \limits_{\beta} \sum_{i=1}^{[(n+p+1)/2]} |r(\beta)|_{(i)}^2

Notice that asymptotically the trimming fraction [(n+p+1)/2]/n tends to 1/2.

\hat \beta_{LTS} can be computed using routine LXS (option lms=0), while the trimming proportion is controlled by option alpha.
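A minimal MATLAB sketch of the LTS objective for a candidate \beta (illustrative only; the actual minimization over \beta is performed by LXS):

function crit = lts_crit_sketch(b, X, y, h)
% Sum of the h smallest squared residuals for candidate b (illustrative only);
% by default h = [(n+p+1)/2], which corresponds to alpha = 1/2.
r2 = sort((y - X*b).^2);                             % ordered squared residuals
crit = sum(r2(1:h));                                 % trimmed sum of squares
end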

 

