Introduction to robust clustering

It is now widely recognized that the presence of outliers can affect the results of any statistical analysis, and cluster analysis is no exception. Traditional classification methods which assume spherical populations (e.g. $k$-means) or which allow for different sizes and orientations of the groups (e.g. Mclust methods) are well known to produce unreliable results when outliers are present. Recently, special attention in the robust clustering literature has been devoted to classification methods based on trimming, which try to discard the most outlying observations while carrying out the clustering process. The idea of trimming, together with the need to consider groups of different sizes and orientations, leads to the maximization of rather complicated objective functions with many parameters, a high computational cost due to the ``combinatorial'' nature of the problem, and constraints needed to avoid spurious solutions.

Several ``mixture modeling'' and ``crisp clustering'' approaches to model-based clustering can be found in the literature. Mixture modeling approaches assume that the data at hand $y_1, \ldots, y_n$ in $R^v$ come from a probability distribution with density $\sum_{j=1}^k \pi_j \phi( \cdot, \theta_j)$, with $\phi( \cdot, \theta_j)$ being $v$-variate (generally multivariate normal) densities with parameters $\theta_j$, $j=1, \ldots, k$. Generally $\theta_j= (\mu_j, \Sigma_j)$, where $\mu_j$ is the population mean and $\Sigma_j$ is the covariance matrix for component $j$. This leads to likelihoods of the form \begin{equation} \prod_{i=1}^n \sum_{j=1}^k \pi_j \phi (y_i; \; \theta_j). \label{mixlik} \end{equation} On the other hand, ``crisp'' (0--1) clustering approaches assume classification likelihoods of the form \begin{equation} \prod_{j=1}^k \prod _{i\in R_j} \phi (y_i; \; \theta_j), \label{clalik} \end{equation} where $R_j$ contains the indexes of the observations assigned to group $j$, with the constraint that $\# \bigcup_{j=1}^k R_j=n$. In order to discard a fraction $\alpha$ of the most outlying observations and to take into account the different sizes of the groups when making the final group assignments, Garc\'{\i}a-Escudero et al. (2008), in the context of crisp assignment, suggested maximizing the following expression (TCLUST):

\begin{equation} \prod_{j=1}^k \prod _{i\in R_j} \pi'_j \phi (y_i; \; \theta_j), \label{tclustlik} \end{equation}

or, in the context of mixture models,

\begin{equation} \prod _{i\in R} \sum_{j=1}^k \pi_j \phi (y_i; \; \theta_j), \label{mixtlik1} \end{equation}

with the constraint that the set of untrimmed observations has cardinality $[n(1-\alpha)]$, that is $\# \bigcup_{j=1}^k R_j=[n(1-\alpha)]$ in equation~(\ref{tclustlik}) and $\# R=[n(1-\alpha)]$ in equation~(\ref{mixtlik1}), where the symbol $[ \cdot ]$ denotes the integer part. Note that in equation~(\ref{tclustlik}) we have used the symbol $\pi'_j$ to stress that these parameters have a completely different interpretation from the $\pi_j$ in equations~(\ref{mixlik}) and (\ref{mixtlik1}): they take into account the different sizes of the groups when making the final group assignments and they are not the weights of the mixture likelihood. Through the optional parameter mixt the user can specify whether to maximize (\ref{tclustlik}) or (\ref{mixtlik1}). The TCLUST method also imposes scatter constraints on the group covariance matrices. More specifically, if $\lambda_l(\hat \Sigma_j)$ ($l=1, \ldots, v$; $j=1, \ldots, k$) are the estimated eigenvalues of the group covariance matrix $\hat \Sigma_j$, TCLUST in each iteration of the maximization routine imposes the constraint:

\begin{equation} \frac{ \max_{l=1, \ldots, v} \max_{j=1, \ldots, k} \lambda_l( \hat \Sigma_j)}{ \min_{l=1, \ldots, v} \min_{j=1, \ldots, k} \lambda_l( \hat \Sigma_j)} \leq \mbox{restrfactor}. \label{restreig} \end{equation}
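To make the role of the constraint in equation~(\ref{restreig}) more concrete, the following sketch (written in Python, purely for illustration; it is not the TCLUST/FSDA implementation) checks the eigenvalue ratio of a set of candidate group covariance matrices and enforces a given restrfactor by simply clipping the eigenvalues. The actual algorithm described in Fritz et al. (2013) uses an optimal truncation of the eigenvalues rather than this naive clipping.

\begin{verbatim}
import numpy as np

def eigenratio(covs):
    # ratio between the largest and the smallest eigenvalue over all groups
    eig = np.concatenate([np.linalg.eigvalsh(S) for S in covs])
    return eig.max() / eig.min()

def enforce_restriction(covs, restrfactor):
    # naive enforcement: clip every eigenvalue into [max_eig/restrfactor, max_eig]
    # and rebuild each covariance matrix from its eigendecomposition
    all_eig = np.concatenate([np.linalg.eigvalsh(S) for S in covs])
    upper = all_eig.max()
    lower = upper / restrfactor
    restricted = []
    for S in covs:
        vals, vecs = np.linalg.eigh(S)
        vals = np.clip(vals, lower, upper)
        restricted.append(vecs @ np.diag(vals) @ vecs.T)
    return restricted

# toy example: two scatter matrices whose eigenvalue ratio is 100
covs = [np.diag([10.0, 1.0]), np.diag([0.1, 0.5])]
print(eigenratio(covs))                               # 100.0
print(eigenratio(enforce_restriction(covs, 12)))      # <= 12
\end{verbatim}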

Note that the classic $k$-means procedure is obtained simply by setting $\alpha=0$ and $\pi'_j=1$ in equation~(\ref{tclustlik}) and restrfactor $=1$ in equation~(\ref{restreig}). The idea of trimming under the eigenvalue-ratio constraint of equation (\ref{restreig}) can also be applied in the context of the mixture likelihood given in equation (\ref{mixlik}), with important consequences.

In the crisp assignment case, in each iteration of the maximization process the selection of the $[n(1-\alpha)]$ units is made by taking the $[n(1-\alpha)]$ largest values of $\phi^*_i$, where $\phi^*_i = \max_{j=1, \ldots, k} \hat \pi'_j \phi (y_i; \; \hat \theta_j)$ and $\hat \pi'_j$ is estimated as the proportion of untrimmed observations assigned to group $j$. The centers and the covariance matrices are then estimated by the unweighted sample means and sample covariance matrices of the observations assigned to each group. On the other hand, in the context of mixture modelling, the quantities \[ \phi^*_{ij} = \frac{\hat \pi_j \phi (y_i; \; \hat \theta_j)}{ \sum_{l=1}^{k} \hat \pi_l \phi (y_i; \; \hat \theta_l) } \qquad i=1, 2, \ldots, n, \qquad j=1, \ldots, k \] are interpreted as posterior probabilities. The criterion for selecting the $[n(1-\alpha)]$ untrimmed units remains the same (the $[n(1-\alpha)]$ largest values of $\phi^*_i$); however, the centers and the covariance matrices are updated with the weighted sample means and weighted sample covariance matrices, with weights given by the posterior probabilities. The posterior probabilities of the trimmed units are set to 0 for each group. Similarly, the $\hat \pi_j$ are updated as $\sum_{i=1}^n \phi^*_{ij}/[n(1-\alpha)]$. A feasible algorithm aimed at approximately maximizing the objective function was given in Garc\'{\i}a-Escudero et al. (2008) and Fritz et al. (2013). These algorithms belong to the family of Classification EM algorithms and, to perform the data-driven trimming, make use of the so-called ``concentration'' steps (as those behind the Fast-MCD algorithm in Rousseeuw and Van Driessen, 2006).
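As an illustration of one such concentration step in the crisp-assignment case, the sketch below (Python, purely illustrative; it omits the eigenvalue restriction of equation~(\ref{restreig}) and is not the actual tclust code) computes $\hat \pi'_j \phi(y_i; \hat \theta_j)$ for every unit and group, keeps the $[n(1-\alpha)]$ units with the largest maxima, and refreshes the weights, centers and covariance matrices from the untrimmed crisp assignments. In the mixture variant the same selection rule is used, but the updates would instead be weighted means and covariances with weights $\phi^*_{ij}$.

\begin{verbatim}
import numpy as np
from scipy.stats import multivariate_normal

def concentration_step(Y, pi, mu, Sigma, alpha):
    # one crisp concentration step (eigenvalue restriction omitted for brevity)
    n, v = Y.shape
    k = len(pi)
    h = int(np.floor(n * (1 - alpha)))              # number of untrimmed units

    # density of each unit under each group, weighted by pi'_j
    D = np.column_stack([pi[j] * multivariate_normal.pdf(Y, mu[j], Sigma[j])
                         for j in range(k)])        # n x k
    best = D.argmax(axis=1)                         # crisp assignment
    phi_star = D.max(axis=1)                        # phi*_i
    keep = np.argsort(phi_star)[-h:]                # indexes of untrimmed units

    # update weights, centers and scatter matrices from the untrimmed units
    for j in range(k):
        idx = keep[best[keep] == j]
        pi[j] = len(idx) / h
        if len(idx) > v:                            # enough units to estimate
            mu[j] = Y[idx].mean(axis=0)
            Sigma[j] = np.cov(Y[idx], rowvar=False)
    return pi, mu, Sigma, best, keep
\end{verbatim}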

The maximum number of concentration steps to be performed is given by the input parameter refsteps of routine tclust. To approximate the global optimum, the procedure is initialized nsamp times and concentration steps are performed until convergence or until refsteps is reached. When processing more complex data sets, higher values of nsamp and refsteps have to be specified (obviously implying extra computation time). However, if more than 10 per cent of the iterations do not converge, a warning message is issued, indicating that nsamp has to be increased.
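To make the interplay between nsamp and refsteps concrete, here is a minimal Python sketch of the outer loop, assuming the concentration_step function sketched above and a hypothetical objective helper that evaluates the trimmed classification likelihood of equation~(\ref{tclustlik}); it illustrates the random-restart scheme and is not the actual routine.

\begin{verbatim}
import warnings
import numpy as np

def tclust_sketch(Y, k, alpha, nsamp=300, refsteps=15, tol=1e-8, seed=None):
    # illustrative outer loop: nsamp random starts, each refined with at most
    # refsteps concentration steps; the best solution found is retained
    rng = np.random.default_rng(seed)
    n, v = Y.shape
    best_obj, best_fit, not_converged = -np.inf, None, 0

    for _ in range(nsamp):
        # random start: k subsets of v+1 units give initial centers/scatters
        pi = [1.0 / k] * k
        mu, Sigma = [], []
        for _ in range(k):
            sub = rng.choice(n, size=v + 1, replace=False)
            mu.append(Y[sub].mean(axis=0))
            Sigma.append(np.cov(Y[sub], rowvar=False) + 1e-6 * np.eye(v))

        obj, prev_obj, converged = -np.inf, -np.inf, False
        for _ in range(refsteps):
            pi, mu, Sigma, best, keep = concentration_step(Y, pi, mu, Sigma, alpha)
            obj = objective(Y, pi, mu, Sigma, best, keep)  # hypothetical helper
            if abs(obj - prev_obj) < tol:
                converged = True
                break
            prev_obj = obj

        not_converged += (not converged)
        if obj > best_obj:
            best_obj, best_fit = obj, (pi, mu, Sigma, best, keep)

    if not_converged > 0.10 * nsamp:
        warnings.warn("more than 10% of the initializations did not converge; "
                      "consider increasing nsamp (and possibly refsteps)")
    return best_fit, best_obj
\end{verbatim}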