Multivariate Clustering Data Sets

File Description of Data Set
cluster3 The cluster3 dataset was simulated by Gordaliza, García-Escudero & Mayo-Iscar during the Workshop ADVANCES IN ROBUST DATA ANALYSIS AND CLUSTERING held in Ispra on October 21st-25th 2013. It is a bivariate dataset of 1000 observations. It presents three components with radial outliers centered at the origin.

geyser2 The geyser2 dataset (R "tclust" library and Fritz, García-Escudero & Mayo-Iscar (2012)) is a bivariate dataset of 271 observations, obtained from the first column of the Old Faithful Geyser data (R MASS library and Härdle (1991)). It contains the eruption length and the length of the previous eruption, in minutes, for 271 eruptions of this geyser.
M5data The dataset was proposed by García-Escudero et al. (2008) for assessing some trimming-based robust clustering methods. The data are obtained from three bivariate normal distributions with fixed centers but different scales and proportions. One of the components overlaps heavily with another one. A 10% background noise, uniformly distributed in a rectangle containing the three mixture components but without overlapping much with them, is added. The third column contains the component id.
mixture100 The mixture100 dataset was simulated by Fritz, García-Escudero & Mayo-Iscar (2012, page 14, fig. 8). It can be interpreted either as a mixture of three components or as a mixture of two components with a 10% outlier proportion.
structurednoise The structurednoise dataset was simulated by Fritz, García-Escudero & Mayo-Iscar (2012, page 13, fig. 7 c-d). It is composed of two evident elliptical clusters plus a structured noise pattern with a “helix” shape, which accounts for 10% of the data.
X The X dataset was simulated by Gordaliza, García-Escudero & Mayo-Iscar during the Workshop ADVANCES IN ROBUST DATA ANALYSIS AND CLUSTERING held in Ispra on October 21st-25th 2013. It is a bivariate dataset of 200 observations. It presents two parallel components without contamination.
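
The construction described for M5data (three bivariate normal components plus 10% uniform background noise, with a component id in the third column) can be sketched as follows. This is only an illustration of the general recipe: the function name, the centers, scales and proportions, and the use of 0 as the id for noise points are all assumptions, not the values used to build M5data. The sketch also does not try to keep the noise away from the clusters, as the original data does.

```python
import random


def simulate_noisy_mixture(n=1000, noise_fraction=0.10, seed=0):
    """Three bivariate normal clusters plus uniform background noise.

    Every numeric setting below is a hypothetical choice for illustration;
    none of them is taken from the M5data dataset itself.
    """
    rng = random.Random(seed)

    # (center_x, center_y, standard deviation, share of the non-noise points)
    components = [
        (0.0, 8.0, 1.0, 0.2),
        (8.0, 0.0, 2.0, 0.4),
        (-8.0, -8.0, 3.0, 0.4),
    ]

    n_noise = round(n * noise_fraction)
    n_clean = n - n_noise

    rows = []
    for label, (cx, cy, sd, share) in enumerate(components, start=1):
        for _ in range(round(n_clean * share)):
            # Spherical normal component: independent coordinates, same scale.
            rows.append((rng.gauss(cx, sd), rng.gauss(cy, sd), label))

    # Noise is drawn uniformly in the bounding rectangle of the clustered
    # points and tagged with id 0 (an assumed convention).
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    for _ in range(n_noise):
        x = rng.uniform(min(xs), max(xs))
        y = rng.uniform(min(ys), max(ys))
        rows.append((x, y, 0))

    return rows
```

Each returned row is a tuple (x, y, id), mirroring the three-column layout described for M5data.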

Regression Clustering Data Sets

File Description of Data Set
pinus The pinus dataset was introduced by García-Escudero et al. (2010) and further discussed by Dotto et al. (2016). It consists of the heights and diameters of a sample of 362 Pinus nigra trees, located in the north of Palencia (Spain).

girdles Real trade data: 153 imports of girdles and panty girdles from Israel to Austria.
sprockets Real trade data: 1681 imports of toothed wheels, chain sprockets and other transmission elements from Switzerland to Austria.
TDuniform Simulated trade data from a uniform distribution.
TDtweedie Simulated trade data from a Tweedie distribution.

The FSDA team thanks Carlos Gabriel Matrán Bea, Alfonso Gordaliza Ramos, Luis Angel García Escudero and Agustín Mayo Iscar (University of Valladolid) for sharing the data and for the continuous collaboration and lively discussions on robust clustering.