Technical introduction to Robust methods in regression

Notation

Consider the usual regression model

 

\[ y_i = x_i^T\beta +u_i \qquad i=1, \ldots, n, \qquad\qquad (1) \]

where the $u_i$ are random errors which have common variance $\sigma^2$ and are independent of the covariates $x_i$. The vector $\beta$, the location component of the linear model, is the parameter of interest; $\sigma$ is the so-called scale component of the linear model.

Given an estimator of $\beta$, say $\hat \beta,$ the residuals are defined as

$$r_i(\hat \beta)=y_i- x_i^T \hat \beta$$

The traditional least squares (LS) estimate of $\beta$ is denoted by

$$\hat \beta = \hat \beta_{LS}=\hat \beta_{LS}(X,y)=(X'X)^{-1}X'y$$
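Throughout this page, $X$ denotes the $n \times p$ matrix with rows $x_i^T$ and $y$ the vector of responses. The routines cited below (TBrho, Sreg, LXS, ...) belong to the toolbox and are not reproduced here; the short Python/NumPy sketches scattered through the page, such as the following one for the LS fit on artificial data, are only illustrations of the formulas:

    import numpy as np

    # Artificial data following y_i = x_i^T beta + u_i (illustrative only)
    rng = np.random.default_rng(0)
    n, p = 100, 3
    X = rng.normal(size=(n, p))                # rows of X are the covariate vectors x_i^T
    beta_true = np.array([1.0, -2.0, 0.5])
    y = X @ beta_true + rng.normal(size=n)     # errors u_i with sigma = 1

    # Least squares estimate beta_LS = (X'X)^{-1} X'y and its residuals
    beta_LS = np.linalg.solve(X.T @ X, X.T @ y)
    r = y - X @ beta_LS                        # r_i(beta_hat) = y_i - x_i^T beta_hat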

We are concerned with the case where a certain proportion of the observations may not follow the model above. Traditional robust estimators attempt to limit the influence of outliers by replacing, in the estimation of $\beta$, the squared residuals with a less rapidly increasing loss function, or with a bounded function $\rho$ of the residuals themselves.

 

The $\rho$ function

In the literature of robust statistics, a $\rho$ function denotes a function such that

  1. $\rho(x)$ is a non-decreasing function of $|x|$;
  2. $\rho(0)=0$;
  3. $\rho(x)$ is increasing for $x>0$ whenever $\rho(x) < \rho(\infty)$.

 

Perhaps the most popular choice for the $\rho$ function is Tukey's biweight (bisquare) function 

\[ \rho(x) = \begin{cases} \frac{x^2}{2}-\frac{x^4}{2c^2}+\frac{x^6}{6c^4} & \mbox{if} \quad |x| \leq c \\ \\ \frac{c^2}{6} & \mbox{if} \quad |x| > c, \\ \end{cases} \] where $c>0$ is a tuning constant which is linked to the breakdown point of the estimator of $\beta$.

 

 

Function $\rho$ for Tukey's biweight is implemented in routine TBrho.
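As an illustration of the formula above (and not of the TBrho routine itself), here is a minimal NumPy sketch; the value of $c$ used in the example is arbitrary:

    import numpy as np

    def tukey_rho(x, c):
        """Tukey's biweight rho function, as in the formula above."""
        x = np.asarray(x, dtype=float)
        val = x**2 / 2 - x**4 / (2 * c**2) + x**6 / (6 * c**4)
        return np.where(np.abs(x) <= c, val, c**2 / 6)

    # rho equals 0 at 0, is increasing on [0, c] and is bounded by c^2/6
    c = 2.0
    print(tukey_rho([0.0, 1.0, c, 5.0], c))    # the last two values both equal c^2/6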

 

 

The $\psi$ function

In the literature of robust statistics, a $\psi$ (psi) function denotes a function $\psi$ which is the derivative of a $\rho$ function ($\psi = \rho'$) and is such that

  1. $\psi$ is odd and $\psi(x) \geq 0$ for $x \geq 0$

Function $\psi$ for Tukey's biweight

\[ \psi(x) = \begin{cases} x- \frac{2x^3}{c^2}+\frac{x^5}{c^4} = x \Big[ 1- \Big( \frac{x}{c}\Big)^2 \Big]^2 & \mbox{if} \quad |x| \leq c \\ \\ 0 & \mbox{if} \quad |x| > c, \\ \end{cases} \]

is implemented in routine TBpsi.
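Analogously, a minimal sketch of the biweight $\psi$ function (an illustration of the formula above, not of the TBpsi routine):

    import numpy as np

    def tukey_psi(x, c):
        """Derivative of Tukey's biweight rho: x*(1-(x/c)^2)^2 inside [-c, c], 0 outside."""
        x = np.asarray(x, dtype=float)
        return np.where(np.abs(x) <= c, x * (1 - (x / c) ** 2) ** 2, 0.0)

    # psi is odd, non-negative on [0, c] and vanishes ("redescends") beyond c
    c = 2.0
    print(tukey_psi([-1.0, 0.0, 1.0, 3.0], c))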

 

Regression, scale and affine equivariance

We say that an estimate $\tilde \beta$ of $\beta$ is regression, scale and affine equivariant if it satisfies the following three properties:

\[ \begin{array}{lclcl} \tilde \beta (X, y + X \gamma) & = & \tilde \beta(X,y)+ \gamma & \quad \textrm{for all} \quad & \gamma \in R^p \\ \tilde \beta (X, \lambda y) & = & \lambda \tilde \beta(X,y) & \quad \textrm{for all} \quad & \lambda \in R \\ \tilde \beta(X A, \lambda y) & = & A^{-1} \tilde\beta(X,\lambda y) & \quad \textrm{for all} \quad & A \ \textrm{with} \ \det(A) \neq 0 \end{array} \]

These are desirable properties because they allow us to know how the estimate changes under linear transformations of the data.
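As a quick numerical illustration (on hypothetical data), the LS estimate satisfies all three properties:

    import numpy as np

    def beta_ls(X, y):
        # Least squares fit (X'X)^{-1} X'y
        return np.linalg.solve(X.T @ X, X.T @ y)

    rng = np.random.default_rng(1)
    n, p = 50, 3
    X, y = rng.normal(size=(n, p)), rng.normal(size=n)
    gamma, lam = rng.normal(size=p), 2.5
    A = rng.normal(size=(p, p))                 # almost surely non-singular

    b = beta_ls(X, y)
    print(np.allclose(beta_ls(X, y + X @ gamma), b + gamma))     # regression equivariance
    print(np.allclose(beta_ls(X, lam * y), lam * b))             # scale equivariance
    print(np.allclose(beta_ls(X @ A, lam * y),
                      np.linalg.solve(A, beta_ls(X, lam * y))))  # affine equivariance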

 

The regression M-estimate of location

The regression M-estimate of location (say $\hat \beta_M$) is the value that minimizes the following objective function:

$$\sum_{i=1}^{n} \rho \Bigg( \frac{ r_i(\hat\beta_M)}{\sigma} \Bigg) = \textrm{min}. \qquad\qquad (2)$$

 

Differentiation of equation (2) gives

$$\sum_{i=1}^{n} \psi \Bigg( \frac{ r_i(\hat\beta_M)}{\sigma} \Bigg) x_i = 0. \qquad\qquad (3)$$
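For fixed $\sigma$, equation (3) is typically solved by iteratively reweighted least squares, with weights $w_i = \psi(r_i/\sigma)/(r_i/\sigma)$. A minimal sketch under this assumption, using the biweight $\psi$ with the illustrative constant $c=4.685$ (a commonly quoted high-efficiency value, used here only as an example):

    import numpy as np

    def tukey_psi(x, c):
        x = np.asarray(x, dtype=float)
        return np.where(np.abs(x) <= c, x * (1 - (x / c) ** 2) ** 2, 0.0)

    def m_estimate_irls(X, y, sigma, c=4.685, beta0=None, maxit=100, tol=1e-8):
        """Solve sum_i psi(r_i/sigma) x_i = 0 by iteratively reweighted least squares.
        sigma is treated as known; beta0 should ideally be a robust starting value."""
        beta = np.linalg.solve(X.T @ X, X.T @ y) if beta0 is None else np.asarray(beta0, float)
        for _ in range(maxit):
            r = (y - X @ beta) / sigma
            small = np.abs(r) < 1e-12
            # weights w_i = psi(r_i)/r_i, with the limiting value 1 at r_i = 0
            w = np.where(small, 1.0, tukey_psi(r, c) / np.where(small, 1.0, r))
            Xw = X * w[:, None]
            beta_new = np.linalg.solve(Xw.T @ X, Xw.T @ y)   # weighted LS step
            if np.max(np.abs(beta_new - beta)) < tol:
                return beta_new
            beta = beta_new
        return beta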

 

Remark 1: if the function $\psi$ is monotone, the solutions to this equation are called monotone regression M-estimates, while if $\psi$ is redescending, the solutions to (3) are called redescending regression M-estimates.

Remark 2: it is possible to show that regression M-estimates have the following asymptotic distribution:

\[\hat \beta_M \approx N \Bigg( \beta, \sigma^2 \frac{E \psi (u/\sigma)^2}{[E \psi' (u/\sigma)]^2} (X'X)^{-1} \Bigg)\]

Thus the approximate covariance matrix of an M-estimate differs only by a constant factor from that of the LS estimate. Hence, its efficiency under normal errors does not depend on $X$.

 

The regression M-estimate of scale

In equation (2) it is assumed that $\sigma$ is known. However, when this condition is not fulfilled, it is necessary to use an auxiliary robust scale estimate $(\hat \sigma)$ to make $\hat \beta_M$ scale equivariant.

An M-estimator of scale $\hat \sigma_M$ is defined as the solution to the following equation:

$$\frac{1}{n} \sum_{i=1}^n \rho_1 \Bigg( \frac{r_i}{\hat \sigma_M} \Bigg)=K, \qquad\qquad (4)$$

where

$$r_i = r_i(\hat \beta_M), \hat \sigma_M = \hat \sigma(r_1(\hat \beta_M), \dots, r_n(\hat \beta_M))$$

and $K$ is a constant which is linked to the breakdown point of the scale estimator.

It is worthwhile to notice that in this equation we have used the symbol $\rho_1$, because the $\rho$ function used to obtain the scale estimator is not necessarily the same as the one used in (2).

It is interesting to notice that if in equation (4) we choose $\rho_1(t)=I(|t|>1)$ and $K=1/2$, the solution for the M-estimator of scale is the median of the absolute values of the residuals. In general, if $\rho_1(t)=I(|t|>1)$, the solution is the $h$-th order statistic of the $|r_i|$, with $h=n-[nK]$.
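Equation (4) can be solved numerically by a simple fixed-point iteration on $\sigma$. A minimal sketch (not the Mscale routine), where the defaults $c=1.547$ and $K=c^2/12$ (half the maximum of $\rho_1$) are illustrative choices commonly associated with a 50% breakdown point:

    import numpy as np

    def tukey_rho(x, c):
        x = np.asarray(x, dtype=float)
        return np.where(np.abs(x) <= c,
                        x**2 / 2 - x**4 / (2 * c**2) + x**6 / (6 * c**4), c**2 / 6)

    def mscale(r, c=1.547, K=None, maxit=200, tol=1e-10):
        """M-estimate of scale: solve (1/n) sum rho_1(r_i/sigma) = K for sigma
        by the fixed-point iteration sigma^2 <- sigma^2 * mean(rho_1(r/sigma)) / K."""
        r = np.asarray(r, dtype=float)
        K = c**2 / 12.0 if K is None else K
        sigma = np.median(np.abs(r)) / 0.6745        # MAD-based starting value
        for _ in range(maxit):
            sigma_new = sigma * np.sqrt(np.mean(tukey_rho(r / sigma, c)) / K)
            if abs(sigma_new - sigma) <= tol * sigma:
                break
            sigma = sigma_new
        return sigma_new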

 

The regression S-estimates and the LMS estimate

If we take the minimum value of $\hat \sigma_M$ which satisfies equation (4), we obtain the so-called S-estimate of scale $(\hat \sigma_S)$ and the associated estimate of the vector of regression coefficients $(\hat \beta_S)$.

The name S-estimator comes from the fact that it is derived from a scale statistic in an implicit way. More formally,

$$\hat \beta_S = \arg\min \limits_{\beta} \; \hat \sigma_M(r(\beta)),$$

where

$$r_i = r_i(\hat \beta_S), \hat \sigma_S = \hat \sigma(r_1(\hat \beta_S), \dots, r_n(\hat \beta_S))$$

Because an S-estimate is also an M-estimate, it follows that the asymptotic distribution of an S-estimate is the same as that of an M-estimate. To see this, note that $\hat \beta_S$ satisfies the following inequality:

$$\sum_{i=1}^n \rho \Bigg( \frac{r_i(\hat \beta_S)}{\hat \sigma_S}\Bigg) \leq \sum_{i=1}^n \rho \Bigg( \frac{r_i(\tilde \beta)}{\hat \sigma_S}\Bigg)$$

for all $\tilde \beta \in R^p$.

The regression S-estimate of location $(\hat \beta_S)$ and of scale $(\hat \sigma_S)$ can be found using routine Sreg. For each elemental subset which is extracted (and which satisfies certain conditions), the minimum value of the scale is computed by calling routine Mscale. For the subsets with the 5 best (bestr) values of the scale, a series of refining steps (C-steps) is carried out.
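As a purely conceptual illustration of the subsampling idea (and not of the actual algorithm in Sreg), the following sketch draws random elemental subsets of $p$ observations, fits each exactly, and keeps the candidate with the smallest residual M-scale; no refining C-steps are applied. The helper functions repeat the tukey_rho and mscale sketches given earlier.

    import numpy as np

    def tukey_rho(x, c):
        x = np.asarray(x, dtype=float)
        return np.where(np.abs(x) <= c,
                        x**2 / 2 - x**4 / (2 * c**2) + x**6 / (6 * c**4), c**2 / 6)

    def mscale(r, c=1.547, K=None, maxit=200, tol=1e-10):
        # Fixed-point iteration for (1/n) sum rho_1(r_i/sigma) = K (see the sketch above)
        K = c**2 / 12.0 if K is None else K
        sigma = np.median(np.abs(r)) / 0.6745
        for _ in range(maxit):
            new = sigma * np.sqrt(np.mean(tukey_rho(r / sigma, c)) / K)
            if abs(new - sigma) <= tol * sigma:
                return new
            sigma = new
        return sigma

    def s_estimate(X, y, nsamp=500, c=1.547, seed=None):
        """Crude S-estimator: minimize the residual M-scale over candidate fits
        obtained from random elemental subsets of size p (no C-steps)."""
        rng = np.random.default_rng(seed)
        n, p = X.shape
        best_sigma, best_beta = np.inf, None
        for _ in range(nsamp):
            idx = rng.choice(n, size=p, replace=False)
            try:
                beta = np.linalg.solve(X[idx], y[idx])    # exact fit through p points
            except np.linalg.LinAlgError:
                continue                                  # singular subset: skip it
            sigma = mscale(y - X @ beta, c=c)
            if sigma < best_sigma:
                best_sigma, best_beta = sigma, beta
        return best_beta, best_sigma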

 

Remark: if we choose as $\rho_1$ the function $\rho_1(t)=I(|t|>1)$ and in equation (4) we put $K=1/2$, we minimize the median of the squares (equivalently, of the absolute values) of the residuals and we therefore obtain the Least Median of Squares (LMS) estimator. 

In symbols, call

$$|r|_{(1)} \leq |r|_{(2)} \leq \dots \leq |r|_{(n)}$$

the ordered absolute values of the residuals; then $\hat \beta_{LMS}$ is defined as

\[ \hat \beta_{LMS}= \arg\min \limits_{\beta} \; \textrm{med}_i \, |r_i(\beta)|. \]

$\hat \beta_{LMS}$ can be computed using routine LXS (option lms=1)

More generally, for a generic $K$, a solution $\hat \sigma$ to equation (4) is the $h$-th order statistic of the $|r_i|$ (that is, $|r|_{(h)}$), with $h=n-[nK]$.
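A crude, purely illustrative sketch of the LMS idea, minimizing the median absolute residual over random elemental subsets (this is not the algorithm used by LXS):

    import numpy as np

    def lms_estimate(X, y, nsamp=1000, seed=None):
        """Crude LMS sketch: minimize the median absolute residual over fits to
        random elemental subsets of size p (no local refinement)."""
        rng = np.random.default_rng(seed)
        n, p = X.shape
        best_obj, best_beta = np.inf, None
        for _ in range(nsamp):
            idx = rng.choice(n, size=p, replace=False)
            try:
                beta = np.linalg.solve(X[idx], y[idx])
            except np.linalg.LinAlgError:
                continue
            obj = np.median(np.abs(y - X @ beta))     # med_i |r_i(beta)|
            if obj < best_obj:
                best_obj, best_beta = obj, beta
        return best_beta, best_obj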

 

The regression MM-estimate of location

The MM-regression estimator $\hat \beta_{MM}$ is defined as any local minimum of the following function $f$:

$$f(\hat \beta_{MM})=\frac{1}{n} \sum_{i=1}^n \rho_2 \Big(\frac{r_i}{\hat \sigma} \Big)$$

where $\rho_2$ is possibly another $\rho$ function. Function $f$ is minimized with respect to $\beta$ for fixed $\hat \sigma$. Among the possible local minima which have been found, we choose the one at which $f$ is smallest. In this equation $\hat \sigma$ is any scale estimator satisfying equation (4); it is common, however, to use $\hat \sigma_S$ (the minimum value).

The consistency and asymptotic distribution of MM-estimates when the observed data follow the central model (1) have been studied by Yohai (1987) for the case of random covariates, and by Salibian-Barrera (2006) for fixed designs. The consistency and asymptotic distribution of S-estimators have been studied by Rousseeuw and Yohai (1984), Davies (1990) and Salibian-Barrera (2006).

$\hat \beta_{MM}$ can be computed using routines MMreg and MMregcore. More precisely, MMregcore assumes that the user supplies an estimate of $\sigma$; function MMreg, on the other hand, uses function Sreg to compute preliminary estimates of $\sigma$ and $\beta$.
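If a preliminary robust fit and scale are available (for instance from the S-estimation sketch above), the final M-step can be carried out by iteratively reweighted least squares. In the following sketch, the function mm_step and the constant $c_2 = 4.685$ (a commonly quoted high-efficiency value for the biweight) are assumptions of the example and not part of MMreg or MMregcore:

    import numpy as np

    def tukey_psi(x, c):
        x = np.asarray(x, dtype=float)
        return np.where(np.abs(x) <= c, x * (1 - (x / c) ** 2) ** 2, 0.0)

    def mm_step(X, y, beta_init, sigma, c2=4.685, maxit=100, tol=1e-8):
        """Final M-step of an MM-estimator: keep sigma fixed and, starting from a
        highly robust fit beta_init (e.g. an S-estimate), decrease
        (1/n) sum rho_2(r_i/sigma) by iteratively reweighted least squares."""
        beta = np.asarray(beta_init, dtype=float)
        for _ in range(maxit):
            r = (y - X @ beta) / sigma
            small = np.abs(r) < 1e-12
            w = np.where(small, 1.0, tukey_psi(r, c2) / np.where(small, 1.0, r))
            Xw = X * w[:, None]
            beta_new = np.linalg.solve(Xw.T @ X, Xw.T @ y)
            if np.max(np.abs(beta_new - beta)) < tol:
                return beta_new
            beta = beta_new
        return beta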

 

The regression L-estimate of scale and the LTS estimate

An alternative to using an M-scale is to use an L-estimate of scale. Then we can define scale estimates as linear combinations of the $|r|_{(i)}$ in one of the two following forms:

$$\hat \sigma_L = \sum_{i=1}^n a_i |r|_{(i)} \quad \textrm{or} \quad \hat \sigma_L = \Bigg(\sum_{i=1}^n a_i |r|_{(i)}^2 \Bigg)^{1/2}$$

A particular version of the second form is the $\alpha$-trimmed squares scale, where $\alpha$ is in the interval $(0,1)$ and $n-h=[n \alpha]$ of the largest absolute residuals are trimmed. If $\alpha=1/2$, the corresponding regression estimate is called the least trimmed squares (LTS) estimate $\hat \beta_{LTS}$. More formally,

$$\hat \beta_{LTS} = \arg\min \limits_{\beta} \sum_{i=1}^{[(n+p+1)/2]} |r(\beta)|_{(i)}^2$$  

Notice that asymptotically the trimming proportion $[(n+p+1)/2]/n$ tends to $1/2$.

$\hat \beta_{LTS}$ can be computed using routine LXS (option lms=0); the trimming proportion is controlled by option alpha.
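The criterion being minimized is easy to write down; a minimal sketch of the LTS objective (the minimization over $\beta$ itself requires a subsampling strategy similar to the ones sketched above):

    import numpy as np

    def lts_objective(beta, X, y, h=None):
        """Sum of the h smallest squared residuals; by default h = [(n+p+1)/2] as in the text."""
        n, p = X.shape
        h = (n + p + 1) // 2 if h is None else h
        r2 = np.sort((y - X @ beta) ** 2)
        return r2[:h].sum()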