Introduction to robust estimators in linear regression

Standard linear regression models are based on certain assumptions, such as a normal distribution of the errors in the observed responses and homoscedasticity (constant error variance). If the distribution of the errors is asymmetric or prone to contamination, the model assumptions are invalidated, and parameter estimates, confidence intervals, and other computed statistics become unreliable. Outliers occur very frequently in real data, and they often go unnoticed because nowadays data are processed by computers, without careful inspection or screening. Not only can the response variable be outlying, but so can the explanatory variables, leading to so-called leverage points. Both types of outliers may totally spoil an analysis carried out by ordinary least squares. Often, such influential points remain hidden to the user, because they do not always show up in the usual least squares plots.

To remedy this problem, new statistical techniques have been developed that are not so easily affected by outliers. These are the so-called robust methods. The Statistics Toolbox function robustfit implements a robust fitting method that is less sensitive than ordinary least squares to large changes in small parts of the data. The robust regression implemented in the Statistics Toolbox works by assigning a weight to each data point. Weighting is done automatically and iteratively using a process called iteratively reweighted least squares (IRLS). In the first iteration, each point is assigned equal weight and the model coefficients are estimated using ordinary least squares. At subsequent iterations, the weights are recomputed so that points farther from the model predictions in the previous iteration are given lower weight, and the model coefficients are then recomputed using weighted least squares. The process continues until the values of the coefficient estimates converge within a specified tolerance.
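
To make the scheme concrete, here is a minimal sketch of iteratively reweighted least squares with Tukey's bisquare weight function on simulated data containing a few contaminated responses. The data, the tuning constant and the convergence rule are illustrative choices, not the exact internals of robustfit.

    n = 100;
    X = [ones(n,1) (1:n)'];               % design matrix with an intercept term
    y = X*[2; 0.5] + randn(n,1);          % simulated responses
    y(1:5) = y(1:5) + 20;                 % contaminate a few observations
    c = 4.685;                            % bisquare tuning constant (about 95% efficiency)
    beta = X\y;                           % first iteration: ordinary least squares
    for iter = 1:50
        r = y - X*beta;                   % residuals from the current fit
        s = mad(r,1)/0.6745;              % robust scale estimate based on the MAD
        u = r/(c*s);                      % scaled residuals
        w = (abs(u) < 1).*(1 - u.^2).^2;  % bisquare weights: gross outliers get weight 0
        sw = sqrt(w);
        betaNew = (sw.*X)\(sw.*y);        % weighted least squares step
        if norm(betaNew - beta) < 1e-8*norm(beta)
            break                         % coefficient estimates have converged
        end
        beta = betaNew;
    end

An essentially equivalent fit is returned by b = robustfit(X(:,2), y), since robustfit adds the intercept column itself and uses the bisquare weight function by default.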

Two popular regression estimators which asymptotically have a breakdown point equal to 0.5 are Least Median of Squares (LMS) and Least Trimmed Squares (LTS) (Rousseeuw and Leroy, 1987). The LTS regression method minimizes the sum of the $h$ smallest squared residuals, where $h$ must be at least half the number of observations. The LMS estimator, on the other hand, minimizes the median of the squared residuals. LTS and LMS are very robust methods in the sense that the estimated regression fit is not unduly influenced by outliers in the data, even if there are several of them. Due to this robustness, we can detect outliers by their large LTS (LMS) residuals.
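
In symbols, writing $r_i(\beta) = y_i - x_i^T\beta$ for the residuals and $r^2_{(1)}(\beta) \le \dots \le r^2_{(n)}(\beta)$ for the ordered squared residuals of a sample of size $n$, the two estimators are

$$ \hat{\beta}_{LTS} = \arg\min_{\beta} \sum_{i=1}^{h} r^2_{(i)}(\beta), \qquad n/2 \le h \le n, $$

$$ \hat{\beta}_{LMS} = \arg\min_{\beta} \; \mathrm{med}_i \, r^2_i(\beta). $$

The maximum breakdown point is attained when $h$ is close to $n/2$ (a standard choice is approximately $(n+p+1)/2$, with $p$ the number of regression parameters); taking $h = n$ gives back ordinary least squares.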

LTS and LMS use a weight function which is discontinuous. A viable alternative is to use a continuous weight function. This leads to the S and MM estimators. 
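
In outline, the S estimator is the value of $\beta$ which minimizes a robust M estimate of the scale of the residuals,

$$ \hat{\beta}_{S} = \arg\min_{\beta} \, \hat{\sigma}(\beta), \qquad \text{where } \hat{\sigma}(\beta) \text{ solves } \; \frac{1}{n}\sum_{i=1}^{n} \rho\!\left(\frac{r_i(\beta)}{\hat{\sigma}}\right) = K, $$

with $\rho$ a bounded, continuous loss function (for example Tukey's biweight) and $K$ chosen so that $\hat{\sigma}$ is consistent at the normal model. The MM estimator then keeps the S scale $\hat{\sigma}_S$ fixed and minimizes $\sum_i \rho_1\bigl(r_i(\beta)/\hat{\sigma}_S\bigr)$ for a second loss function $\rho_1$ tuned for high efficiency, starting the iterations from $\hat{\beta}_S$ so that the high breakdown point is retained.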

Because of the way in which models are fitted, whether by least squares or by LTS, LMS, S, or MM, we can lose information about the effect of individual observations on inferences about the form and parameters of the model. The methods developed in this toolbox reveal how the fitted regression model depends on individual observations and on groups of observations. Robust procedures can sometimes reveal this structure, but they do so by downweighting or discarding some observations. The novelty in our approach to data analysis is to combine robustness and a “forward” search through the data with regression diagnostics and data visualization tools. We provide easily understood plots that use information from the whole sample to display the effect of each observation on a wide variety of aspects of the fitted model.
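
The core idea of the forward search can be conveyed by a short sketch: fit the model to a small, robustly chosen subset of the observations, then repeatedly enlarge the subset with the observations that agree best with the current fit, monitoring fitted quantities as the subset grows. The code below is only an illustrative skeleton, with a crude random-subset initialisation in place of an exact robust start and with just the coefficient path being monitored; the routines in the toolbox implement the full procedure and many more monitored statistics.

    n = 100;  p = 2;
    X = [ones(n,1) randn(n,1)];           % design matrix with an intercept term
    y = X*[2; 0.5] + randn(n,1);          % simulated responses
    y(1:5) = y(1:5) + 20;                 % a few contaminated responses
    m0 = p + 1;                           % size of the initial subset
    bestObj = inf;
    for trial = 1:500                     % crude robust initialisation (LMS-type criterion)
        s = randperm(n, m0);
        b = X(s,:)\y(s);
        obj = median((y - X*b).^2);
        if obj < bestObj, bestObj = obj; subset = s; end
    end
    Beta = nan(p, n);                     % coefficient path along the search
    for m = m0:n
        b = X(subset,:)\y(subset);        % least squares fit on the current subset
        Beta(:,m) = b;
        [~, idx] = sort((y - X*b).^2);    % order all observations by agreement with the fit
        subset = idx(1:min(m+1, n));      % next subset: the m+1 best-fitting units
    end

Plotting the rows of Beta against the subset size $m$ shows how the coefficient estimates evolve as the most remote observations, typically the outliers, enter the fit in the last steps of the search.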

This chapter considers three procedures.