Introduction to robust estimators in multivariate analysis

The normal distribution, perhaps following data transformation, has a central place in the analysis of multivariate data. Mahalanobis distances provide the standard test for outliers in such data. However, it is well known that the estimates of the mean and covariance matrix found by using all the data are extremely sensitive to the presence of outliers. When there are many outliers the parameter estimates may be so distorted that the outliers are ‘masked’: the Mahalanobis distances fail to reveal any outliers, or flag as outlying observations that are not in fact so.
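
The masking effect can be seen in a small simulated example. The following is a minimal sketch in Python (using NumPy and SciPy rather than the toolbox's own routines) in which the mean and covariance are estimated from all the data, including a cluster of shifted observations, and the classical distances are compared with a chi-squared cutoff.

```python
# Illustrative sketch (not part of the toolbox): classical squared Mahalanobis
# distances computed from the full-sample mean and covariance, compared with a
# chi-squared cutoff. With enough outliers these estimates are distorted and
# the contaminated points may no longer exceed the cutoff (masking).
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
n, v = 100, 3
Y = rng.multivariate_normal(np.zeros(v), np.eye(v), size=n)
Y[:15] += 4.0                      # shift 15 observations to form a cluster of outliers

mu = Y.mean(axis=0)                # non-robust location estimate
S = np.cov(Y, rowvar=False)        # non-robust covariance estimate
resid = Y - mu
d2 = np.einsum('ij,jk,ik->i', resid, np.linalg.inv(S), resid)   # squared distances

cutoff = chi2.ppf(0.99, df=v)      # nominal 1% cutoff for each distance
print("observations flagged:", np.flatnonzero(d2 > cutoff))
```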

Accordingly, several researchers have suggested the use of robust parameter estimates in the calculation of the distances. For example, Rousseeuw and van Zomeren (1990) used minimum volume ellipsoid estimators of both parameters in the calculation of the Mahalanobis distances. More recent work, such as Pison et al. (2002) or Hardin and Rocke (2005), uses the minimum covariance determinant (MCD) estimator.
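
As a hedged illustration, the sketch below repeats the previous example using an MCD estimator. It relies on the generic implementation in scikit-learn (`MinCovDet`), not on the toolbox's own functions; with the robust location and scatter, the shifted observations now obtain large distances and are no longer masked.

```python
# A minimal sketch, assuming scikit-learn is available: robust Mahalanobis
# distances based on the minimum covariance determinant (MCD) estimator.
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(0)
n, v = 100, 3
Y = rng.multivariate_normal(np.zeros(v), np.eye(v), size=n)
Y[:15] += 4.0                              # same contaminated sample as before

mcd = MinCovDet(random_state=0).fit(Y)     # reweighted MCD location and scatter
d2_rob = mcd.mahalanobis(Y)                # squared robust Mahalanobis distances
print("observations flagged:", np.flatnonzero(d2_rob > chi2.ppf(0.99, df=v)))
```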

Remark: There is now evidence that the application of highly robust procedures can produce a proportion of outliers much larger than expected. The implication is that the size of the outlier test may be very much larger than the nominal 5% or 1%. Many of these methods are designed to test whether individual observations are outlying. Like Becker and Gather (1999), however, we stress the importance of multiple outlier testing and focus on simultaneous tests of outlyingness. Note that in this toolbox we develop methods that are intended, when the samples are multivariate normal, to find outliers in $\alpha \%$ of the data sets.
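
To make the distinction concrete, the sketch below contrasts a pointwise cutoff with a simple Bonferroni-adjusted simultaneous cutoff. The adjustment shown is a generic illustration of simultaneous testing, not the calibration actually used in the toolbox.

```python
# Hedged illustration of pointwise versus simultaneous outlier testing.
# The Bonferroni-style per-observation level alpha/n (used here purely as an
# example) keeps the probability of declaring any outlier in an uncontaminated
# sample of size n close to the simultaneous level alpha.
import numpy as np
from scipy.stats import chi2

n, v, alpha = 200, 3, 0.01
pointwise_cutoff    = chi2.ppf(1 - alpha, df=v)        # each point tested at 1%
simultaneous_cutoff = chi2.ppf(1 - alpha / n, df=v)    # Bonferroni-adjusted cutoff

# Expected number of false detections in a clean sample of size n:
print("pointwise   :", n * alpha)          # about 2 spurious 'outliers' per data set
print("simultaneous:", n * (alpha / n))    # about 0.01, i.e. alpha per data set
```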

This chapter considers the following procedures.

An alternative approach to robust multivariate estimation is based on projecting the sample points onto a set of univariate directions. Peña and Prieto (2001) suggested considering the set of $2v$ directions obtained by maximizing and minimizing the kurtosis coefficient of the projected data. They also proposed an iterative algorithm in which each observation is repeatedly tested for outlyingness in these directions, using subsamples of decreasing size from which potential outliers have been removed. Their final robust estimates, $\hat \mu_{PP}$ and $\hat \Sigma_{PP}$, are computed from the observations that are not declared outliers at the end of the iterations. A calibration factor $k_{PP}$ is still required to allow for bias in the estimation of $\Sigma$. The resulting (squared) robust Mahalanobis distances

$$ d_{(PP)i}=k_{PP}(y_i-\hat \mu_{PP})^T \hat \Sigma_{PP}^{-1}(y_i-\hat \mu_{PP}), \quad \quad \quad i=1,\dots,n, $$

are compared with the $\{v(n-1)/(n-v)\}F_{v,n-v} $ distribution.
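
The comparison in the display above can be coded directly once robust estimates are available. The sketch below assumes placeholder values for $\hat \mu_{PP}$, $\hat \Sigma_{PP}$ and $k_{PP}$ (it does not implement the kurtosis-based projection algorithm itself) and uses the scaled $F$ cutoff given above.

```python
# Sketch of the comparison with the {v(n-1)/(n-v)} F_{v,n-v} distribution.
# mu_pp, Sigma_pp and k_pp are placeholders standing in for the robust estimates
# and calibration factor produced by the Pena-Prieto procedure.
import numpy as np
from scipy.stats import f

def pp_distances(Y, mu_pp, Sigma_pp, k_pp):
    """Squared robust Mahalanobis distances d_(PP)i."""
    resid = Y - mu_pp
    return k_pp * np.einsum('ij,jk,ik->i', resid, np.linalg.inv(Sigma_pp), resid)

def scaled_f_cutoff(n, v, conf=0.99):
    """Cutoff from the {v(n-1)/(n-v)} F_{v,n-v} reference distribution."""
    return v * (n - 1) / (n - v) * f.ppf(conf, v, n - v)

# Example with placeholder estimates: standard normal data, identity scatter, k_pp = 1.
rng = np.random.default_rng(1)
n, v = 100, 3
Y = rng.standard_normal((n, v))
d2 = pp_distances(Y, np.zeros(v), np.eye(v), k_pp=1.0)
print("observations flagged:", np.flatnonzero(d2 > scaled_f_cutoff(n, v)))
```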

A comparison of the above procedures with the forward search in terms of size and power can be found in Riani et al. (2009). In this paper the authors show that the procedure implemented in function FSM has superior power as well as good size and is therefore to be recommended. Similar computations for regression can be found in Perrotta and Torti (2010).