
Introduction to robust estimators in linear regression

Standard linear regression models are based on certain assumptions, such as normally distributed errors in the observed responses and homoscedasticity. If the distribution of errors is asymmetric or prone to contamination, the model assumptions are invalidated, and parameter estimates, confidence intervals, and other computed statistics become unreliable. Outliers occur very frequently in real data, and they often go unnoticed because nowadays data are processed by computers, without careful inspection or screening. Not only can the response variable be outlying, but also the explanatory variables, leading to so-called leverage points. Both types of outliers may totally spoil an analysis carried out by ordinary least squares. Often, such influential points remain hidden to the user, because they do not always show up in the usual least squares plots.


To remedy this problem, new statistical techniques have been developed that are not so easily affected by outliers: the so-called robust methods. The Statistics Toolbox function robustfit implements a robust fitting method that is less sensitive than ordinary least squares to large changes in small parts of the data. Robust regression in the Statistics Toolbox works by assigning a weight to each data point. Weighting is done automatically and iteratively, using a process called iteratively reweighted least squares (IRLS). In the first iteration, each point is assigned equal weight and the model coefficients are estimated by ordinary least squares. At subsequent iterations, the weights are recomputed so that points farther from the model predictions of the previous iteration receive lower weight, and the model coefficients are then recomputed using weighted least squares. The process continues until the coefficient estimates converge within a specified tolerance.
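As a concrete illustration, the following sketch (with simulated data, purely for illustration) contrasts an ordinary least squares fit with robustfit on a sample whose last two responses have been contaminated:

    % Simulated data: a straight line with two contaminated responses.
    x = (1:10)';
    y = 2 + 3*x + 0.5*randn(10,1);
    y(9:10) = y(9:10) + 30;          % two outlying responses
    bols = [ones(10,1) x] \ y;       % ordinary least squares
    brob = robustfit(x, y);          % iteratively reweighted least squares

Here brob should remain close to the true coefficients (2, 3), whereas bols is pulled towards the two outliers; the final weights, available from the second output of robustfit, flag the downweighted points.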


Two popular regression estimators which asymptotically have a breakdown point equal to 0.5 are Least Median of Squares (LMS) and Least Trimmed Squares (LTS) (Rousseeuw and Leroy, 1987). The LTS regression method minimizes the sum of the h smallest squared residuals, where h must be at least half the number of observations. The Least Median of Squares estimator, on the other hand, minimizes the median of the squared residuals. LTS and LMS are very robust methods in the sense that the estimated regression fit is not unduly influenced by outliers in the data, even if there are several of them. Thanks to this robustness, outliers can be detected by their large LTS (LMS) residuals.
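In symbols, if \(r_{(1)}^2(\beta) \le \dots \le r_{(n)}^2(\beta)\) denote the ordered squared residuals for a candidate coefficient vector \(\beta\) and \(n\) is the sample size, the two estimators are

\[ \hat{\beta}_{\mathrm{LTS}} = \arg\min_{\beta} \sum_{i=1}^{h} r_{(i)}^2(\beta), \qquad n/2 \le h \le n, \]
\[ \hat{\beta}_{\mathrm{LMS}} = \arg\min_{\beta} \operatorname{med}_i \, r_i^2(\beta). \]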


LTS and LMS implicitly use a weight function which is discontinuous: each observation is either fully retained or completely trimmed. A viable alternative is to use a continuous weight function, which downweights observations gradually according to the size of their residuals. This leads to the S and MM estimators.
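One common choice of continuous weight function (shown here as an illustration; tuning details vary across implementations) is Tukey's biweight,

\[ w(u) = \begin{cases} \left(1 - (u/c)^2\right)^2 & \text{if } |u| \le c, \\ 0 & \text{otherwise,} \end{cases} \]

where \(u\) is a scaled residual and the tuning constant \(c\) trades robustness against efficiency: observations are downweighted smoothly as their residuals grow, instead of being abruptly trimmed.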

Because of the way in which models are fitted, either by least squares or by LTS, LMS, S, or MM, we can lose information about the effect of individual observations on inferences about the form and parameters of the model. The methods developed in this toolbox reveal how the fitted regression model depends on individual observations and on groups of observations. Robust procedures can sometimes reveal this structure, but downweight or discard some observations. The novelty in our approach to data analysis is to combine robustness and a “forward” search through the data with regression diagnostics and data visualization tools. We provide easily understood plots that use information from the whole sample to display the effect of each observation on a wide variety of aspects of the fitted model.

This chapter considers three procedures (a combined usage sketch follows the list):

  • LXS implements the Least Trimmed Squares and Least Median of Squares estimators. Through the option rew it is also possible to compute either their raw or their reweighted versions. In these estimators the percentage of trimming is fixed a priori, and it is not possible to appraise the effect that each statistical unit exerts on the fitted model.

  • Sreg and MMreg implement the S and MM estimators in linear regression. In the S procedure, similarly to what happens for LXS, the breakdown point is fixed a priori. Once a robust estimate of the scale is found, a nominal efficiency can be achieved by using the S estimate as the starting point of an iterative procedure leading to the MM estimator.

  • FSR and FSReda provide two routines which implement the forward search in linear regression.
    FSR implements an automatic outlier detection procedure whose simultaneous size is close to the nominal level and which has high power.
    FSReda has exploratory purposes: it stores a series of quantities at each step of the forward search (residuals, leverage, minimum deletion residual outside the subset, maximum studentized residual, the units belonging to the subset, and other test statistics). The joint analysis of the plots which monitor the progression of these statistics along the forward search makes it immediately clear which observations differ from the bulk of the data. These may be individual observations that do not belong to the general model, that is, outliers; or there may be a subset of the data that is systematically different from the majority. Monitoring the progression of the statistics along the search not only enables the identification of such observations, but also lets us appraise the effect that they have on parameter estimates and on inferences about models and their suitability.
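As promised, here is a minimal combined sketch that runs the three procedures on the same simulated data. It assumes the FSDA toolbox is installed and on the MATLAB path; the data are purely illustrative, and the option and field names follow the toolbox interface as documented.

    % Simulated regression data with a cluster of five outliers.
    n = 100; p = 2;
    X = randn(n, p);
    y = X*ones(p,1) + 0.5*randn(n,1);
    y(1:5) = y(1:5) + 10;

    outLXS = LXS(y, X, 'lms', 1);       % LMS; use 'lms', 0 for LTS,
                                        % and 'rew', true for the reweighted version
    outS   = Sreg(y, X);                % S estimator, breakdown point fixed a priori
    outMM  = MMreg(y, X);               % MM estimator, starting from an S fit
    outFSR = FSR(y, X);                 % automatic outlier detection
    outEDA = FSReda(y, X, outLXS.bsb);  % monitoring along the forward search,
                                        % started from the subset stored in outLXS.bsb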

