
Introduction to Robust Statistics

Awareness of outliers in some form or another has existed for at least 2000 years. Thucydides, in the third book of his History of the Peloponnesian War (III 20, 3-4), describes how in 428 BC the Plataeans used concepts of robust statistics to estimate the height of the ladders needed to overcome the fortifications built by the Peloponnesians and the Boeotians who were besieging their city.

"The same winter the Plataeans, who were still being besieged by the Peloponnesians and Boeotians, distressed by the failure of their provisions, and seeing no hope of relief from Athens, nor any other means of safety, formed a scheme with the Athenians besieged with them for escaping, if possible, by forcing their way over the enemy's walls; the attempt having been suggested by Theaenetus, son of Tolmides, a soothsayer, and Eupompides, son of Daimachus, one of their generals. At first all were to join: afterwards, half hung back, thinking the risk great; about two hundred and twenty, however, voluntarily persevered in the attempt, which was carried out in the following way. Ladders were made to match the height of the enemy's wall, which they measured by the layers of bricks, the side turned towards them not being thoroughly whitewashed. These were counted by many persons at once; and though some might miss the right calculation, most would hit upon it, particularly as they counted over and over again, and were no great way from the wall, but could see it easily enough for their purpose. The length required for the ladders was thus obtained, being calculated from the breadth of the brick”.

This simple example, which has been brought to our attention by Spyros Arsenis (EC, JRC), shows that robustness of statistical methods in the sense of insensitivity to grossly wrong measurements is probably as old as the experimental approach to science.

Already a century and a half ago Edgeworth stated that "The method of Least Squares is seen to be our best course when we have thrown overboard a certain portion of our data – a sort of sacrifice which often has to be made by those who sail upon the stormy seas of Probability”.

Providing sound methods for the analysis of data gathered through experimentation or observation is the goal of robust statistics.

This is how Stephan Morgenthaler (2007) opens his magisterial survey of robust statistics. We are grateful to him for allowing us to use an excerpt of his survey as an introduction to the history and prospects of the statistical field addressed by the FSDA toolbox: we could hardly have matched the incisiveness and insight that distinguish all his contributions. Besides, the help pages of a software library are not the most appropriate medium for adding a new item to the already long list of good surveys of robust statistics. The rest of this page quotes, in blue text, Stephan's views. For a complete picture, we warmly recommend reading the original paper, which includes the views of several authoritative discussants (Christophe Croux, Peter Filzmoser, Laurie Davies, Ursula Gather, Ricardo A. Maronna, Victor J. Yohai, Hannu Oja, Frank Critchley, Daniel Pena, Peter J. Rousseeuw, Stefan Van Aelst, Anthony C. Atkinson, Marco Riani and Andrea Cerioli).

The first generation of robustness dealt mainly with outlier nomination, formal outlier rejection tools, and graphical methods that could help in spotting outliers. This may seem an easy task, but in reality it is not. For the complex data types encountered in multiple regression or multivariate estimation, finding the outliers can be overwhelming.

An essential ingredient of all data is their inconclusive nature. Knowledge gained through data and experimentation is on some level uncertain and fluid. Dating back to Gauss and the least squares method, statisticians have argued in favor of particular statistical methods on the basis of some optimality property. The usual argument was "my method is more efficient than any other”. Conveniently forgotten in this type of argumentation was the fact that efficiency was defined with respect to quite strong assumptions about the data. Within the emerging theoretical thinking about statistics, some started to question the validity of deducing the method to be used from the assumed model. Around the same time, more formal robustness demands appeared. At first, this touched mostly on the validity of tests (see for example Pearson, 1931) and was studied with the help of concrete alternatives to the base model. How did the test that was calibrated for a particular model behave under alternative specifications? In particular, how robust was it to non-normality (Box, 1953)? This question can equally well be applied to estimators. What happens to the efficiency of an estimator if the data distribution does not conform to the model under which the estimator is optimal? Tukey (1960) discussed a telling example in which very low frequency events could utterly destroy the average performance of so-called optimal estimators. Even in situations where data sampled from a perturbed distribution could hardly be distinguished from data sampled from the model distribution, the seemingly best method was severely hampered. In parallel with these ideas, distribution-free tests and estimators were being developed. These seem at first blush a good answer to the critique of the overreliance on a particular model. The analysis of the methods derived from distribution-free procedures has led to a better understanding of the mechanics of typical robust procedures.
The median, which can be derived from the sign test, offers an exemplary case study. It is a nonlinear function of the data, focuses its attention only on the central part of the sample, and does not change its value drastically even when up to half of the observations are faulty. Those are properties a robust estimator should possess. However, its asymptotic efficiency at the normal model is a lowly 64%. But, then, stepping from a particular model to a completely unspecified distribution is maybe too radical a change.
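The 64% figure is easy to check empirically. The following is a minimal Monte Carlo sketch in Python (using NumPy; the sample size and replication count are arbitrary choices): the relative efficiency of the median at the normal model is the ratio of the variance of the sample mean to that of the sample median, which tends to 2/π ≈ 0.64.

```python
import numpy as np

# Monte Carlo check of the median's efficiency at the normal model.
# Asymptotically var(median) / var(mean) -> pi/2, so the relative
# efficiency of the median is 2/pi ~ 0.64.
rng = np.random.default_rng(42)
n, reps = 101, 20000                # odd n: the sample median is one order statistic
samples = rng.normal(0.0, 1.0, size=(reps, n))
var_mean = samples.mean(axis=1).var()
var_median = np.median(samples, axis=1).var()
efficiency = var_mean / var_median  # close to 2/pi ~ 0.64
print(f"relative efficiency of the median: {efficiency:.3f}")
```

With these settings the simulated efficiency lands near the asymptotic value of 2/π, illustrating the price the median pays for its robustness at the exact normal model.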

Few books in statistics have been as original and thought-provoking as Peter J. Huber’s Robust Statistics (1981). It pulled together the various vague ideas, case studies and emerging theories published previously and proposed a coherent theory for second-generation robustness. In Huber’s view, robustness starts with a traditional parametric model and then proceeds by considering broadened assumptions under which the estimation and testing of the parameter still make sense. Instead of assuming a perfect normal model for the measurement errors, under which the mean is the most efficient summary, he asked what would happen if the data were roughly normal, but not exactly so.

More formally, Huber introduced a flexible class of estimators called "M estimators", which became a very useful tool, and derived properties like consistency and asymptotic normality. These estimators are just a slight generalization of maximum likelihood estimators.

Instead of solving $-\sum \log f(x_i-T)=\min$ or $-\sum f'(x_i-T)/f(x_i-T)=0$ for the location estimate $T$ of $\theta$, with the $x_i$ distributed independently with density $f(x_i-\theta)$, Huber solves $\sum \rho(x_i-T)=\min$ or $\sum \psi(x_i-T)=0$, where $\psi=\rho'$, without assuming that $\rho$ and $\psi$ are of the form $-\log f$ and $-f'/f$ for any probability density $f$; and even if they are, $f$ need not be the density of the $x_i$.
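These estimating equations can be turned into a short numerical sketch. The Python example below (an illustration, not FSDA code) solves $\sum \psi(x_i-T)=0$ for Huber's $\psi$, the identity clipped at $\pm c$, by iterative reweighting; the tuning constant c = 1.345 and the MAD scale estimate are conventional choices of ours, not part of the quoted text.

```python
import numpy as np

def huber_location(x, c=1.345, tol=1e-9, max_iter=200):
    """Solve sum(psi((x_i - T) / s)) = 0 for T, where psi is the
    identity clipped at +-c and s is a fixed MAD scale estimate."""
    s = 1.4826 * np.median(np.abs(x - np.median(x)))  # robust scale
    T = np.median(x)                                  # robust starting point
    for _ in range(max_iter):
        u = (x - T) / s
        # psi(u) / u gives the weight: 1 in the middle, c/|u| in the tails
        w = np.where(np.abs(u) <= c, 1.0, c / np.abs(u))
        T_new = np.sum(w * x) / np.sum(w)
        if abs(T_new - T) < tol:
            return T_new
        T = T_new
    return T

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 95), np.full(5, 50.0)])  # 5 gross errors
print(np.mean(x), huber_location(x))  # the mean is dragged upwards, T is not
```

On this contaminated sample the arithmetic mean is pulled far from the bulk of the data, while the M-estimate stays close to the center of the clean 95%.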

He thus broadened the circumstances under which the performance of the statistical summary would be judged. In such a context the old style "most efficient statistical method" no longer exists. Its place is taken by a compromise method, whose efficiency remains high over the whole set of broadened circumstances. One can make this mathematically precise by considering for each statistical method its worst performance and then searching for the method with the best worst-case performance. It turns out that the least squares method, which is so popular among users of statistics, is particularly vulnerable when judged by its worst-case performance. This follows from the fact that least squares estimators are linear functions of the observations and thus any perturbation in any of the observations is linearly transmitted to the estimators. This in turn leads to a catastrophic worst-case performance.

Robust methods bridge the gap between narrowly optimal classical methods and the distribution-free methods.

Huber introduced the "gross error model": instead of believing in a strict parametric model of the form $G(x-\theta)$ for known $G$ (e.g., $G$ = the standard normal), he assumes that a (known) fraction $\epsilon$ $(0 \leq \epsilon < 1)$ of the data may consist of gross errors with an arbitrary (unknown) distribution $H(x-\theta)$ (where $\theta$ is only introduced to preserve the form of a location model). The distribution underlying the observations is thus $F(x-\theta)=(1-\epsilon)G(x-\theta)+\epsilon H(x-\theta)$.
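Sampling from the gross error model is straightforward: each observation comes from $G$ with probability $1-\epsilon$ and from $H$ with probability $\epsilon$. A small Python sketch (the specific choices of $\theta$, $\epsilon$ and $H$ are ours, purely for illustration):

```python
import numpy as np

def sample_gross_error(n, theta, eps, H_sampler, rng):
    """Draw n observations from F = (1-eps) * N(theta, 1) + eps * H(. - theta)."""
    gross = rng.random(n) < eps          # which observations are gross errors
    x = theta + rng.normal(0.0, 1.0, n)  # the "good" part, G = standard normal
    x[gross] = theta + H_sampler(gross.sum(), rng)
    return x

rng = np.random.default_rng(7)
# 10% contamination by a much wider, symmetric H = N(0, 10^2):
x = sample_gross_error(10_000, theta=3.0, eps=0.1,
                       H_sampler=lambda m, r: r.normal(0.0, 10.0, m), rng=rng)
print(np.median(x), np.var(x))  # median stays near theta; variance is inflated
```

Because this particular $H$ is symmetric about zero, the median of the mixture is still $\theta$, while the variance is dominated by the small contaminated fraction, which is exactly the situation robust location estimators are designed for.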

In Huber’s world, the user must specify the size of the neighborhood. The estimation of the mean μ of a normal distribution is an illuminating example of this theory. Broadening this model can be done in various ways, for example by considering mixtures of 1−ϵ parts of the normal with ϵ parts of an arbitrary contamination. If the contamination is symmetric about μ, the new parameter in the enlarged problem is the median of the distribution, and the inference problem remains well-posed. For ϵ = 0, Huber’s estimator is the arithmetic mean and for ϵ = 0.5 it is the median. For intermediate values it turns out to be similar to a trimmed mean with a trimming rate that is monotone increasing with ϵ. More precisely, the solution corresponds to a modified least squares criterion in which the quadratic is replaced by a function with linear growth in the tail and quadratic behavior in the middle. The corresponding least-favorable model retains an exact normal form in the middle, but the tails are modified to a log-linear rather than a log-quadratic form. In this simple location model, robust methods correspond quite closely to the procedures that were earlier developed for data containing outliers and thus, robustness essentially turns out to offer protection against unanticipated heavy tails. For restricted parameter spaces additional phenomena may occur. In estimating scale, for example, there is a danger of implosion, that is, of seriously underestimating the scale. Because the deviations from the model that matter most are heavy tails, one can construct robust models directly. The estimators derived in traditional ways from robust models already have robustness properties and no additional fiddling is necessary. Methods of this kind are described in Morgenthaler and Tukey (1991) and Lange et al. (1989).
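The criterion with "linear growth in the tail and quadratic behavior in the middle" can be written down explicitly. A Python sketch of Huber's ρ and its derivative ψ (the cut-off value c = 1.5 is an arbitrary illustration), checking that the two branches join smoothly at ±c and that the tails grow linearly with slope c:

```python
import numpy as np

def rho(x, c):
    """Huber's loss: quadratic for |x| <= c, linear beyond."""
    a = np.abs(x)
    return np.where(a <= c, 0.5 * a**2, c * a - 0.5 * c**2)

def psi(x, c):
    """Its derivative: the identity clipped at +-c."""
    return np.clip(x, -c, c)

c = 1.5
# The two branches agree at the junction |x| = c ...
assert np.isclose(rho(c, c), 0.5 * c**2)
# ... and beyond it rho grows linearly with slope c:
assert np.isclose(rho(2 * c, c) - rho(c, c), c * c)
# psi matches the numerical derivative of rho everywhere:
x = np.linspace(-5, 5, 201)
num_deriv = (rho(x + 1e-6, c) - rho(x - 1e-6, c)) / 2e-6
assert np.allclose(num_deriv, psi(x, c), atol=1e-4)
```

Letting c grow recovers the pure quadratic (the mean), while shrinking c towards zero makes ρ proportional to |x| (the median), which is the interpolation between mean and median described above.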

Other important contributions to robustness that emerged out of the idea of considering neighborhoods of models are the use of functional analytic tools to derive robustness indicators such as the influence function and related optimality theories (Hampel, 1974, Hampel et al., 1986). Such tools are based on the change in the performance of statistical methods under slight modifications of the original model. The influence function in particular has had a big impact. Its study suggests that robustness can be achieved by modifying the traditional likelihood score function through truncation. The maximal bias under arbitrary contaminations of size ϵ is another basic quantity. It leads to the notion of breakdown (Hodges, 1967, Hampel, 1974, and Rousseeuw, 1984) and bias robustness (see for example Donoho and Liu, 1988 or Martin, Yohai and Zamar, 1989).
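The influence function has a simple finite-sample approximation, the sensitivity curve: add one observation at position x to a fixed sample and record the rescaled change in the estimate. A Python sketch (the sample and evaluation points are arbitrary choices) contrasting the unbounded sensitivity of the mean with the bounded sensitivity of the median:

```python
import numpy as np

def sensitivity_curve(estimator, sample, x):
    """(n+1) * (T(sample with x added) - T(sample)): a finite-sample
    approximation of the influence function of T at the point x."""
    n = len(sample)
    return (n + 1) * (estimator(np.append(sample, x)) - estimator(sample))

rng = np.random.default_rng(3)
sample = rng.normal(0.0, 1.0, 99)
for x in (1.0, 10.0, 100.0):
    sc_mean = sensitivity_curve(np.mean, sample, x)
    sc_median = sensitivity_curve(np.median, sample, x)
    print(f"x = {x:6.1f}   mean: {sc_mean:8.2f}   median: {sc_median:6.2f}")
```

For the mean the curve equals x minus the sample mean, so it grows without bound; for the median it is constant once x passes the central order statistics, which is the bounded-influence behavior that truncating the score function is designed to produce.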

In the three decades since the laying of the foundations of robustness, the theory has been further consolidated. All the strands we mentioned above, from outlier identification through nonparametric and semiparametric modeling, have been generalized and expanded. The robustness indicator used when constructing methods, be it a high breakdown point, a limited susceptibility to bias, a low gross error sensitivity or a minimax performance over a set of alternative models, is crucial. Because none of the possibilities is intrinsically preferable and all choices are defensible, the availability of robust methods has been greatly enhanced. Some of the books and articles describing such enhancements are as follows: Maronna (1976), Kleiner et al. (1979), Rousseeuw and Leroy (1987), Tyler (1987), Morgenthaler and Tukey (1991), Kent and Tyler (1991), Maronna et al. (1992), Rousseeuw and Croux (1993), Davies (1993, 1995), van Aelst and Rousseeuw (2000).

Robust inference poses computational challenges. It is often necessary to solve a high-dimensional quasi-convex optimization problem or even explore a search space to minimize some quite arbitrary criterion function. The workhorse of robust statistics has been local linearization and the iteratively reweighted least squares algorithm. It has the advantage of being quite easily added to existing statistical packages, but does not always converge to the global optimum. For high-dimensional problems, stochastic optimizers (simulated annealing or even guided random selection of potential solutions) have been proposed and have proved useful.
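The iteratively reweighted least squares idea is easy to illustrate on a regression problem. The Python sketch below (our own toy example, not any package's implementation: the Huber weight function, the tuning constant and the simulated data are all illustrative choices) starts from the ordinary least squares fit and repeatedly re-solves a weighted least squares problem with weights that shrink for large standardized residuals:

```python
import numpy as np

# Simulated regression data: true line y = 2 + 0.5 x, with 10% of the
# responses shifted upwards by 20 (gross outliers).
rng = np.random.default_rng(0)
n = 100
x = rng.uniform(0, 10, n)
y = 2.0 + 0.5 * x + rng.normal(0, 0.5, n)
y[:10] += 20.0
X = np.column_stack([np.ones(n), x])

# Ordinary least squares: dragged towards the outliers.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Iteratively reweighted least squares with Huber weights.
c = 1.345                                             # usual Huber tuning constant
beta = beta_ols.copy()
for _ in range(50):
    r = y - X @ beta
    s = 1.4826 * np.median(np.abs(r - np.median(r)))  # MAD residual scale
    u = r / s
    w = np.where(np.abs(u) <= c, 1.0, c / np.abs(u))  # Huber weights
    sw = np.sqrt(w)
    beta_new, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    if np.max(np.abs(beta_new - beta)) < 1e-8:
        beta = beta_new
        break
    beta = beta_new

print("OLS :", beta_ols)
print("IRLS:", beta)
```

Each iteration is a weighted least squares solve, which is why the scheme slots so easily into existing least squares machinery; but since the weights depend on the current fit, the iteration is only a local method and may stop at a local optimum of the underlying criterion.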

The current state of robustness

Statistics, and with it robustness, is undergoing big changes of emphasis, mainly for two reasons, both driven by the increase in cheap computing power. More complex techniques that require sophisticated numerical simulation are available to today’s users. At the same time, the models being fitted to data have also become much more intricate, which raises the question of what robustness means in this changed context.

The first generation methods relied on least squares applied to cleaned data. For data sets without apparent outliers, no cleaning was needed and the raw data were used. This seems quite reasonable. Whether or not trimming or downweighting needs to be applied to an observation can be decided after looking at the data and need not be decided beforehand. Even though it is not explicitly built into the method, second generation robust techniques do in fact act conditionally on the data, at least to some extent. Huber’s location estimator will be exactly equal to the mean for well-behaved data. Similarly, whether remote points in the predictor space are present should be checked in the actual data. Why should we worry about possible contamination at such points if there are none present in the data? For protection against heavy tailed error distributions one can, for example, formulate an extended model containing various error distributions with various tail indices. The most efficient estimator in the extended model averages over the estimates corresponding to the various tail indices (Morgenthaler and Tukey, 1991). Such an estimator is explicitly conditional on the observed data configuration. Another possible solution to the challenge discussed above consists in searching for the simplest accurate description of a large portion of the data. This point of view is similar to hoping that after outlier removal, good fits to the remaining observations can be obtained. In a prediction problem, for example, a model that fits 80% of the observations reasonably well with a linear equation and a single predictor might well be considered preferable to one that manages to increase this proportion to 100%, but at the cost of including three additional predictors in a nonlinear fashion. The 20% of the observations not accounted for by the simple model are sometimes called outliers, but this is a misleading term in our context.
Statistical methods that are robust in this sense provide alternative and often surprising fits. In principle, quite distinct robust answers may well be equally good and may all provide insights. High-breakdown regression estimates (Rousseeuw, 1984, Davies, 1990) and forward search procedures (Atkinson and Riani, 2000, Atkinson, Riani and Cerioli, 2004) provide examples of such methodology. As a bonus, these methods offer effective outlier identification.

The future of robustness

The stability of the conclusions drawn from experimental and observational data remains a pressing problem. The hope that the study of robustness would give birth to an entirely new way of doing statistics, a new way that would replace the classical methods entirely, has not been fulfilled. There seem to be several reasons for this. For one, statistics as a whole has moved on and incorporated the teachings of robustness. Also, leaving standard models such as the Gaussian, the Poisson, or the Binomial behind and abandoning the fitting by likelihood, multiplies the available choices. This is reflected in the diversity of the robust methods a user can choose from. The fact that no new standards have been created has not helped the acceptance of robustness.

More subtly though, the robust school has influenced modern statistics heavily. All statisticians today are aware of the dangers of a data analysis that owes more to model assumptions than experimental facts. And most research papers contain at least an acknowledgment of the robustness aspects of the problem being studied. Because the models being used have become more realistic and accurate, the need for subsequent robustness work has diminished. Statistics has been and remains today most helpful in situations where the random noise is at a relatively high level. These are precisely the situations where reliance on a restricted probability model can do the most damage by leading to totally wrong conclusions. The detection of a tiny signal hidden among a heap of Gaussian noise, for example, might be feasible if things were exactly this way, but when exact Gaussianity is replaced by a broader set of circumstances, the problem might become unsolvable.

Robust statistics is well-suited to play a leading role in two areas: routine data analyses performed with the help of statistical packages and data mining. For a long time, a lot of statisticians who advocated the use of robust procedures believed that robust methods would replace the standard least squares techniques that are preprogrammed in statistical packages. Such programs are mostly used by people untrained in statistics and thus the lack of robustness would ultimately prove to be too costly. That this has not happened is presumably because these users believe that the superiority of least squares is a fact. The costs of using run-of-the-mill and inadequate models are quite real, however. One merely has to think of all the assertions of public health findings that cannot be confirmed by subsequent research and which have cost billions in research money. Other fields have other traditions. The data mining and statistical learning community is quite open to trying alternative methods. There, the emphasis is not so much on confirming experimental findings as on finding patterns and rules that can be used in a predictive manner. Robust methods, with their ability to focus on parts of the data only, can produce surprising and useful results in this context. Robust principal components or robust classification methods, for example, are sure to have an impact in this growing field.

For a recent survey of robust classification methods see Garcia-Escudero et al. (2010). Other general references are those listed by Huber and Ronchetti (2009) in their excellent book:

A chronologically ordered list of robust regression estimators: M (Huber, 1973a); GM (Mallows, 1975), with variants by Hampel, Krasker, and Welsch; RM (Siegel, 1982); LMS and LTS (Rousseeuw, 1984); S (Rousseeuw and Yohai, 1984); MM (Yohai, 1987); τ (Yohai and Zamar, 1988); and SRC (Simpson, Ruppert, and Carroll, 1992). For an excellent critical review of the most important estimates developed in that period, see Davies (1993). In the last decade, some conceptual consolidation has occurred—see the presentation of robust regression estimates in Maronna, Martin, and Yohai (2006)—but the situation remains bewildering. Already the discussants of Bickel (1976) had complained about the multiplicity of robust procedures and about their conceptual and computational complexity. Since then, the collection of estimates to choose from has become so extensive that it is worse than bewildering, namely counterproductive.

On the stability of robust estimators

In his survey, Morgenthaler remarked that, in spite of a consolidated tradition full of positive results, robust methods have not yet replaced the predominant standard least squares techniques in real applications. On the possible reasons behind this fact, we quote again Huber and Ronchetti (2009). In discussing approaches to robust regression, and the S-estimation minimization problem in particular, they write (Section 7.12, pp. 195-198):

... with non-convex minimization, like here, one usually runs into the problem of getting stuck in local minima. Moreover, Davies (1993, Section 1.6) points out an inherent instability of S-estimators; in my opinion, any such instability disqualifies an estimate from being called robust. In fact, there are no known high breakdown point estimators of regression that are demonstrably stable; I think that for the time being they should be classified just as that, namely, as high breakdown point approaches rather than as robust methods.

Note that the mentioned instability has to do with the use of resampling methods for solving the optimization problem (locally, unless all possible subsamples are extracted), which invalidates the desirable property of permutation invariance. In fact, simple swaps of the rows of the data matrix X will presumably give rise to different estimates in runs started from different extracted subsamples. It must be acknowledged that this lack of reproducibility is, indeed, disturbing.
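The effect can be simulated with a deliberately crude subset-based fit. The Python sketch below (a toy LTS-type estimator of our own, not FSDA code) keeps the best of nsamp random elemental fits, judged by the sum of the h smallest squared residuals; re-running it with different random seeds shows how the answer fluctuates when few subsets are drawn and stabilises as more are drawn:

```python
import numpy as np

def lts_subset_fit(X, y, h, nsamp, rng):
    """Toy LTS-type fit: best of nsamp random elemental (p-point) fits,
    judged by the sum of the h smallest squared residuals."""
    n, p = X.shape
    best_crit, best_beta = np.inf, None
    for _ in range(nsamp):
        idx = rng.choice(n, size=p, replace=False)
        try:
            beta = np.linalg.solve(X[idx], y[idx])   # exact fit to p points
        except np.linalg.LinAlgError:
            continue                                 # singular subset, skip it
        crit = np.sort((y - X @ beta) ** 2)[:h].sum()
        if crit < best_crit:
            best_crit, best_beta = crit, beta
    return best_beta

# Regression data with 20% of the responses shifted downwards.
rng = np.random.default_rng(1)
n = 60
x = rng.uniform(0, 10, n)
y = 1.0 + 2.0 * x + rng.normal(0, 0.3, n)
y[:12] -= 15.0
X = np.column_stack([np.ones(n), x])
h = int(0.75 * n)

# Ten runs with 20 subsets each, then ten runs with 1000 subsets each:
slopes_small = [lts_subset_fit(X, y, h, 20, np.random.default_rng(s))[1]
                for s in range(10)]
slopes_large = [lts_subset_fit(X, y, h, 1000, np.random.default_rng(s))[1]
                for s in range(10)]
print("spread with   20 subsets:", np.std(slopes_small))
print("spread with 1000 subsets:", np.std(slopes_large))
```

The seed-to-seed spread of the estimate shrinks as the number of extracted subsets grows, which is exactly the rationale for raising the subset defaults discussed below.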

FSDA addresses the instability of resampling-based estimators with an efficient sampling strategy which allows the user to increase considerably the number of subsets to extract, without incurring a critical degradation of computational performance. In other popular robust libraries (rrcov, robustbase and LIBRA), the default number of subsets extracted by the fast algorithms for MCD, LTS and S estimation is typically 500 (both for regression and for multivariate analysis estimators). In FSDA we raised this number to 1000 for all methods except the traditional LTS algorithm (without concentration steps), for which we extract the 3000 subsets suggested in PROGRESS by Rousseeuw and Hubert (1997) for combinations of n and p giving rise to datasets considered, at that time, large. We encourage FSDA users to experiment with numbers of subsets larger than these defaults, to appreciate the improved stability and accuracy of the estimators at no major performance cost.