To remedy this problem, new statistical techniques have been developed that are not so easily affected by outliers. These are the so-called robust methods. The Statistics Toolbox function robustfit implements a robust fitting method that is less sensitive than ordinary least squares to large changes in small parts of the data. Robust regression in the Statistics Toolbox works by assigning a weight to each data point. Weighting is done automatically and iteratively using a process called iteratively reweighted least squares. In the first iteration, each point is assigned equal weight and the model coefficients are estimated by ordinary least squares. At subsequent iterations, the weights are recomputed so that points farther from the model predictions of the previous iteration are given lower weight, and the coefficients are recomputed by weighted least squares. The process continues until the coefficient estimates converge within a specified tolerance.
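As a minimal illustration of this process, the fragment below fits the same artificial data (made up purely for demonstration) by ordinary least squares and by robustfit; the single gross outlier pulls the OLS slope, while the robust fit assigns it a weight close to zero:

    % Illustrative data: a linear trend with one gross outlier
    x = (1:10)';
    y = 2*x + 1 + 0.1*randn(10,1);
    y(10) = 50;                          % contaminate the last observation

    bOLS = [ones(10,1) x] \ y;           % ordinary least squares fit
    [bROB, stats] = robustfit(x, y);     % iteratively reweighted least squares
                                         % (robustfit adds the intercept itself)
    disp([bOLS bROB]);                   % compare intercept and slope estimates
    disp(stats.w(10));                   % final weight of the outlier (near 0)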
Two popular regression estimators which asymptotically have a breakdown point equal to 0.5 are Least Median of Squares (LMS) and Least Trimmed Squares (LTS) (Rousseeuw and Leroy, 1987). The LTS regression method minimizes the sum of the $h$ smallest squared residuals, where $h$ must be at least half the number of observations. The LMS estimator, on the other hand, minimizes the median of the squared residuals. LTS and LMS are very robust methods in the sense that the estimated regression fit is not unduly influenced by outliers in the data, even if there are several of them. Due to this robustness, outliers can be detected by their large LTS (LMS) residuals.
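In symbols, if $r_{(1)}^2(\beta) \leq \cdots \leq r_{(n)}^2(\beta)$ denote the ordered squared residuals for a candidate coefficient vector $\beta$, the two estimators solve

\[
\hat{\beta}_{\mathrm{LTS}} = \arg\min_{\beta} \sum_{i=1}^{h} r_{(i)}^2(\beta),
\qquad
\hat{\beta}_{\mathrm{LMS}} = \arg\min_{\beta} \mathop{\mathrm{med}}_{i}\, r_i^2(\beta),
\]

where $h$ is the trimming parameter introduced above, with $n/2 \leq h \leq n$.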
LTS and LMS are based on a discontinuous weight function: each observation is either fully retained or fully discarded. A viable alternative is to use a continuous weight function, which leads to the S and MM estimators.
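A common continuous choice is Tukey's biweight, for which an observation with scaled residual $u$ receives the weight

\[
w(u) =
\begin{cases}
\bigl(1 - (u/c)^2\bigr)^2, & |u| \le c,\\
0, & |u| > c,
\end{cases}
\]

which decays smoothly to zero instead of jumping; the tuning constant $c$ governs the trade-off between breakdown point and efficiency.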
Because of the way in which models are fitted, whether by least squares or by LTS, LMS, S, or MM, we can lose information about the effect of individual observations on inferences about the form and parameters of the model. The methods developed in this toolbox reveal how the fitted regression model depends on individual observations and on groups of observations. Robust procedures can sometimes reveal this structure, but they do so while downweighting or discarding some observations. The novelty of our approach to data analysis is to combine robustness and a “forward” search through the data with regression diagnostics and data visualization tools. We provide easily understood plots that use information from the whole sample to display the effect of each observation on a wide variety of aspects of the fitted model. This chapter considers three procedures.
LXS implements the Least Trimmed Squares and Least Median of Squares estimators. Using option rew it is also possible to compute either their raw or their reweighted version. In these estimators the percentage of trimming is fixed a priori, and it is not possible to appraise the effect that each statistical unit exerts on the fitted model.
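As a sketch of the calling conventions, and assuming the FSDA option names (the option 'lms' switches between LMS and LTS, and the option rew mentioned above requests the reweighted version; exact option values should be checked against the toolbox documentation):

    outLMS = LXS(y, X);                       % LMS is the FSDA default
    outLTS = LXS(y, X, 'lms', 0);             % a value different from 1 requests LTS
    outREW = LXS(y, X, 'lms', 0, 'rew', 1);   % reweighted LTS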
Sreg and MMreg implement the S and MM estimators in linear regression. In the S procedure, similarly to what happens for LXS, the breakdown point is fixed a priori. Once a robust estimate of the scale has been found, a nominal efficiency can be obtained by using the S estimate as the starting point of an iterative procedure leading to the MM estimator.
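A minimal sketch of this two-step recipe, assuming the FSDA calling conventions (the option names 'bdp' for the breakdown point of Sreg and 'eff' for the nominal efficiency of MMreg are taken from the toolbox documentation and should be checked against your version):

    % S estimate: breakdown point fixed a priori (here 50%)
    outS  = Sreg(y, X, 'bdp', 0.5);

    % MM estimate: starts from the robust S scale, tuned for 95% efficiency
    outMM = MMreg(y, X, 'eff', 0.95);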
FSR and FSReda are two routines which implement the forward search in linear regression. FSR provides an automatic outlier detection procedure with a simultaneous size close to the nominal value and high power.
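In its simplest form a call could look as follows (the output field ListOut, which collects the units declared outliers, is taken from the FSDA documentation and is an assumption to verify):

    out = FSR(y, X);          % automatic forward-search outlier detection
    disp(out.ListOut);        % indices of the units flagged as outliers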
FSReda serves exploratory purposes. It makes it possible to store a series of quantities along the forward search (residuals, leverage, minimum deletion residual outside the subset, maximum studentized residual, the units belonging to the subset at each step, and other test statistics). The joint analysis of the plots which monitor the progression of these statistics along the forward search makes it easy to detect the observations that differ from the bulk of the data. These may be individual observations that do not belong to the general model, that is, outliers; or there may be a subset of the data that is systematically different from the majority. Monitoring the progression of the statistics along the search not only enables the identification of such observations, but also lets us appraise the effect that they have on parameter estimates and on inferences about models and their suitability.
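A typical exploratory session could look as follows, again assuming FSDA conventions (the subset field bs returned by LXS and the plotting functions resfwdplot and mdrplot are taken from the toolbox documentation and should be verified against your version):

    [outL] = LXS(y, X, 'nsamp', 1000);   % robust fit giving the initial subset
    [out]  = FSReda(y, X, outL.bs);      % forward search, storing monitored quantities
    resfwdplot(out);                     % scaled residuals along the search
    mdrplot(out);                        % minimum deletion residual along the search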