Multivariate Clustering Data Sets

File Description of Data Set
BigTradeEvents   The data consist of counts of particular events concerning unusual weights or values derived from the customs declarations that EU authorities collect for all categories of traded goods. The 76 rows of the dataset represent broad categories of goods, identified by a two-digit code (the so-called "Chapters" of the Harmonised System (HS) Nomenclature developed by the World Customs Organization), while the 25 columns correspond to countries, indicated by their two-letter code (following ISO 3166-1 alpha-2). The dataset refers to the 2010-2022 period and contains 28,523 counts.

The cluster3 dataset was simulated by Gordaliza, García-Escudero & Mayo-Iscar during the Workshop ADVANCES IN ROBUST DATA ANALYSIS AND CLUSTERING held in Ispra on October 21st-25th 2013. It is a bivariate dataset of 1000 observations. It presents three components with radial outliers centered at the origin.

Rows 1-300: group 1, rows 301-600: group 2, rows 601-950: group 3, rows 951-1000: outliers.
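
The row layout given for cluster3 can be turned into a label vector. The following is a minimal sketch in Python with NumPy; the names `group_sizes` and `labels` are illustrative, and coding the outliers as 0 is an assumption, not part of the dataset description.

```python
import numpy as np

# Group sizes implied by the stated row ranges:
# rows 1-300, 301-600, 601-950 and 951-1000.
# Label 0 for the outliers is an assumed coding.
group_sizes = {1: 300, 2: 300, 3: 350, 0: 50}

# One label per row, in the order the rows appear in the dataset.
labels = np.repeat(list(group_sizes.keys()), list(group_sizes.values()))
```

Since Python indexing is 0-based, row 951 of the dataset (the first outlier) corresponds to `labels[950]`.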


The cvode data set used in García-Escudero L.A., Mayo-Iscar A. and Riani M. (2022) refers to SARS-CoV-2 symptoms on 156 patients from ASL3 Genoa Hospital in Italy.

x1 = heart rate, the number of beats of the heart per minute;

x2 = Oxygen Uptake Efficiency Slope, an index of functional reserve derived from the logarithmic relation between oxygen uptake and minute ventilation during incremental exercise;

x3 = watts reached by the patient during the stress test on a cycle ergometer (stationary bike) at the aerobic threshold, that is, when the patient 'begins to struggle';

x4 = watts peak (watts reached at maximum effort during the exercise test on the exercise bike);

x5 = value of the maximum repetition, the maximum force of muscle contraction of the quadriceps femoris of the dominant limb, expressed in kg;

x6 = previous variable corrected on the subject (in relation to the patient's weight).

id = doctor classification: 1 = covid patient, 0 = non-covid patient.

Data have been collected by the "Post-COVID Outpatient Rehabilitation Center ASL3 Liguria Region Health System" and their use was approved by the Ethics Committee of the Liguria region (Italy).



The data are 424 readings on four properties of cows suffering from phlegmon, a form of foot rot. The four variables y1, y2, y3 and y4 are numerical properties calculated from photographic measurements of the cows. The data come from measurements at seven different farms (variable id).

These data have been analyzed in Riani M., Atkinson A.C., Cerioli A., Corbellini A. (2019), Efficient robust methods via monitoring for clustering and multivariate data analysis, Pattern Recognition, Volume 88, Pages 246-260.



There are 200 rows and 8 columns. The variables describe 5 morphological measurements of the species Leptograpsus variegatus collected at Fremantle, W. Australia. The first column is species (0 = B, 1 = O) and the second column is sex (1 = male, 2 = female).

FL frontal lobe size (mm).
RW rear width (mm).
CL carapace length (mm).
CW carapace width (mm).
BD body depth (mm).

The source is Campbell, N.A. and Mahon, R.J. (1974) A multivariate study of variation in two species of rock crab of genus Leptograpsus. Australian Journal of Zoology 22, 417–425.
facemasks There are 352 imports of FFP2 and FFP3 masks (product 6307909810) into the European Union, extracted on one day in November 2020. Vertical axes: traded value, horizontal axes: traded weight (W) and number of units (SU).
W = weight declared
SU = supplementary units declared
V = value declared
In this example it is not at all clear how many groups are present, or whether there are outliers.
The source is Torti F., Riani M., Morelli G. (2021). Semiautomatic robust regression clustering of international trade data, Statistical Methods and Applications, (open access) 30, 863–894.

Flea beetle measurements. These data are from a paper by A. A. Lubischew, "On the Use of Discriminant Functions in Taxonomy", Biometrics, Dec 1962, pp. 455-477.

  • tars1, width of the first joint of the first tarsus in microns (the sum of measurements for both tarsi)

  • tars2, the same for the second joint

  • head, the maximal width of the head between the external edges of the eyes in 0.01 mm

  • ade1, the maximal width of the aedeagus in the fore-part in microns

  • ade2, the front angle of the aedeagus (1 unit = 7.5 degrees)

  • ade3, the aedeagus width from the side in microns

  • species, which species is being examined - concinna, heptapotamica, heikertingeri

geyser The geyser dataset is taken from the MASS library (Venables and Ripley, 2002). It contains information on 272 successive eruptions of the Old Faithful geyser in Yellowstone National Park, Wyoming. The variables are:

- y1: the duration of the ith eruption
- y2: the waiting time to the start of that eruption from the start of eruption i − 1.
geyser2 The geyser2 dataset (R "tclust" library and Fritz, García-Escudero & Mayo-Iscar (2012)) is a bivariate dataset of 271 observations, obtained from the first column of the Old Faithful Geyser data (R MASS library and Härdle (1991)). For each of 271 eruptions of this geyser, it contains the eruption length and the length of the previous eruption, both in minutes.
kidney Presence or absence of chronic kidney disease from diagnostic features. This dataset contains 203 rows and 12 columns. Column 1 (named cclcass) contains the doctor classification. Columns 2-12 contain the following 11 measurements: age, blood_pressure, blood_glucose_random, blood_urea, serum_creatinine, sodium, potassium, hemoglobin, packed_cell_volume, white_blood_cell_count, red_blood_cell_count. See  for original source
M5data  The data are obtained from three bivariate normal distributions with fixed centers but different scales and proportions. One of the components strongly overlaps another one. A 10% background noise is added, uniformly distributed in a rectangle containing the three mixture components, but without overlapping much with them. The third column contains the component id.
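
The M5data design (a three-component bivariate normal mixture plus 10% uniform background noise) can be sketched as follows in Python with NumPy. This is only an illustration of the general scheme: the centers, scales, proportions, sample size, and the id 0 for noise points are all hypothetical values chosen here, not the ones used to generate M5data. The sketch also draws the noise over the whole bounding rectangle and makes no attempt to reproduce the component overlap or to keep the noise away from the components.

```python
import numpy as np

def simulate_m5_like(n=2000, noise_fraction=0.10, seed=0):
    """Simulate data in the spirit of M5data (all parameters hypothetical)."""
    rng = np.random.default_rng(seed)

    # Hypothetical fixed centers, per-component standard deviations
    # (spherical covariances) and mixing proportions.
    centers = np.array([[0.0, 8.0], [8.0, 0.0], [-8.0, -8.0]])
    scales = np.array([1.0, 3.0, 6.0])
    proportions = np.array([0.2, 0.4, 0.4])

    n_noise = int(round(noise_fraction * n))
    n_clean = n - n_noise

    # Draw a component for each clean point, then the point itself.
    component = rng.choice(3, size=n_clean, p=proportions)
    clean = centers[component] + scales[component, None] * rng.standard_normal((n_clean, 2))

    # Uniform background noise in the rectangle containing the clean points.
    low, high = clean.min(axis=0), clean.max(axis=0)
    noise = rng.uniform(low, high, size=(n_noise, 2))

    points = np.vstack([clean, noise])
    # Component ids 1..3; id 0 for noise is an assumed coding.
    ids = np.concatenate([component + 1, np.zeros(n_noise, dtype=int)])
    return np.column_stack([points, ids])

data = simulate_m5_like()
```

As in M5data, the third column of `data` holds the component id.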
melody There are 776 observations with 18 variables (features of melodies), and a true class vector as the 19th variable. There are two classes: folk songs from Luxembourg and from Warmia/Poland. These are the "true" classes. The features were produced by a software package called FANTASTIC. These data have been used in Coretto and Hennig (2017).
mixture100 The mixture100 dataset has been simulated by Fritz, García-Escudero & Mayo-Iscar, 2012 (page 14 fig. 8). It could either be interpreted as a mixture of three components or a mixture of two components with a 10% outlier proportion.
oliveoil This data set represents eight chemical measurements on different specimens of olive oil produced in various regions in Italy (northern Apulia, southern Apulia, Calabria, Sicily, inland Sardinia and coast Sardinia, eastern and western Liguria, Umbria) and further classifiable into three macro-areas: Centre-North, South, Sardinia. There are 572 rows, each corresponding to a different specimen of olive oil, and 10 columns. The first and the second column correspond to the macro-area and the region of origin of the olive oils respectively; here, the term "region" refers to a geographical area and only partially to administrative borders. Columns 3-10 represent the following eight chemical measurements on the acid components for the oil specimens: palmitic, palmitoleic, stearic, oleic, linoleic, linolenic, arachidic, eicosenoic.
spam A data set collected at Hewlett-Packard Labs that classifies 4601 e-mails as spam or non-spam. In addition to this class label there are 57 variables indicating the frequency of certain words and characters in the e-mail. The first 48 variables contain the frequency of the variable name (e.g., business) in the e-mail. If the variable name starts with num (e.g., num650) then it indicates the frequency of the corresponding number (e.g., 650). The variables 49-54 indicate the frequency of the characters ‘;’, ‘(’, ‘[’, ‘!’, ‘$’, and ‘#’. The variables 55-57 contain the average, longest and total run-length of capital letters. Variable 58 indicates the type of the mail and is either "nonspam" or "spam", i.e. unsolicited commercial e-mail.
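
The column layout of the spam data set can be summarised with a small lookup helper. This is a minimal sketch in Python; the function name `spam_variable_group` and the group names it returns are illustrative and not part of the data set.

```python
from collections import Counter

def spam_variable_group(index):
    """Return the variable group for a 1-based column index (1..58)."""
    if not 1 <= index <= 58:
        raise ValueError("column index must be between 1 and 58")
    if index <= 48:
        return "word_or_number_frequency"  # variables 1-48
    if index <= 54:
        return "character_frequency"       # variables 49-54
    if index <= 57:
        return "capital_run_length"        # variables 55-57
    return "class_label"                   # variable 58

# Number of columns falling in each group.
group_counts = Counter(spam_variable_group(i) for i in range(1, 59))
```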
USArrest This data set contains statistics, in arrests per 100,000 residents for assault, murder, and rape in each of the 50 US states in 1973. Also given is the percent of the population living in urban areas. McNeil, D. R. (1977) Interactive Data Analysis. New York: Wiley.
structurednoise The structurednoise dataset has been simulated by Fritz, García-Escudero & Mayo-Iscar, 2012 (page 13 fig. 7 c-d). It is composed of two evident elliptical clusters plus a structured noise pattern with a "helix" shape, which accounts for 10% of the data.
thyroid Data on five laboratory tests administered to a sample of 215 patients. The tests are used to predict whether a patient's thyroid can be classified as euthyroidism (normal thyroid gland function), hypothyroidism (underactive thyroid not producing enough thyroid hormone) or hyperthyroidism (overactive thyroid producing and secreting excessive amounts of the free thyroid hormones T3 and/or thyroxine T4). Diagnosis of thyroid operation was based on a complete medical record, including anamnesis, scan, etc.
Diagnosis of thyroid operation: Hypo (1), Normal (2), and Hyper (3).
RT3U T3-resin uptake test (percentage).
T4 Total Serum thyroxin as measured by the isotopic displacement method.
T3 Total serum triiodothyronine as measured by radioimmuno assay.
TSH Basal thyroid-stimulating hormone (TSH) as measured by radioimmuno assay.
DTSH Maximal absolute difference of the TSH value after injection of 200 micrograms of thyrotropin-releasing hormone as compared to the basal value.
X The X dataset has been simulated by Gordaliza, García-Escudero & Mayo-Iscar during the Workshop ADVANCES IN ROBUST DATA ANALYSIS AND CLUSTERING held in Ispra on October 21st-25th 2013. It is a bivariate dataset of 200 observations. It presents two parallel components without contamination.
wholesale Wholesale customers dataset. The data set refers to clients of a wholesale distributor. It includes the annual spending in monetary units (m.u.) on diverse product categories. The variables are
REGION: customers' Region - Lisbon (coded as 1), Porto (coded as 2) or Other (coded as 3) (Nominal);
FRESH: annual spending (m.u.) on fresh products (Continuous);
MILK: annual spending (m.u.) on milk products (Continuous);
GROCERY: annual spending (m.u.) on grocery products (Continuous);
FROZEN: annual spending (m.u.) on frozen products (Continuous);
DETERGENTS_PAPER: annual spending (m.u.) on detergents and paper products (Continuous);
DELICATESSEN: annual spending (m.u.) on delicatessen products (Continuous);
CHANNEL: customers' Channel - Horeca (Hotel/Restaurant/Café) or Retail channel (Nominal).
Horeca is coded as 1 and Retail channel is coded as 2.
Source: Abreu, N. (2011). Analise do perfil do cliente Recheio e desenvolvimento de um sistema promocional. Mestrado em Marketing, ISCTE-IUL, Lisbon.
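
The numeric codes of the two nominal variables in the wholesale data can be decoded with plain lookup tables. This is a minimal sketch in Python; the names `REGION_LABELS`, `CHANNEL_LABELS` and `decode_customer` are illustrative and not part of the data set.

```python
# Code-to-label tables taken from the variable descriptions above.
REGION_LABELS = {1: "Lisbon", 2: "Porto", 3: "Other"}
CHANNEL_LABELS = {1: "Horeca", 2: "Retail"}

def decode_customer(region_code, channel_code):
    """Translate the REGION and CHANNEL codes of one customer into labels."""
    return REGION_LABELS[region_code], CHANNEL_LABELS[channel_code]
```

For example, `decode_customer(2, 1)` returns `("Porto", "Horeca")`. An unknown code raises a `KeyError`, which makes coding errors visible instead of silently mislabelling a customer.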
wine Data from the machine learning repository. A chemical analysis of 178 Italian wines from three different cultivars yielded 13 measurements. This dataset is often used to test and compare the performance of various classification algorithms. There are 3 classes. Additional information can be found in Wine recognition dataset.
fakenews_v Data from the Convolutional Neural Network (CNN) embeddings representation of 1697 tweets, including 1287 true and 412 fake tweets, which have been used for validation purposes by a novel procedure for disinformation detection (see reference). The dataset lists the true tweets first and then the fake ones, as reflected by the grouping variable in the first column (1 = true, 2 = fake). The CNN embeddings consist of 300 variables. RobPCA (in the form described by Hubert et al., 2005) was run on the CNN embeddings to reduce the dimensionality of the problem; fakenews_v therefore contains the first 5 scores returned by RobPCA, which were found to explain more than 70% of the total variance. The second column reports an indicator of potential outliers found by RobPCA (1 = good units, 0 = outliers).
fakenews_t The same as fakenews_v, but fakenews_t is used for testing. It includes 1700 tweets divided into 1294 true and 406 fake tweets.

Regression Clustering Data Sets

File Description of Data Set

The pinus dataset was introduced by García-Escudero et al. (2010) and further discussed by Dotto et al. (2016). It consists of the heights and diameters of a sample of 362 Pinus nigra trees, located in the north of Palencia (Spain).

girdles Real trade data. In particular: 153 imports of girdles and panty girdles from Israel to Austria.
sprockets Real trade data. In particular: 1681 imports of toothed wheels, chain sprockets and other transmission elements from Switzerland to Austria.
TDuniform Simulated trade data from uniform distribution.
TDtweedie Simulated trade data from a Tweedie distribution.

The FSDA team thanks Carlos Gabriel Matrán Bea, Alfonso Gordaliza Ramos, Luis Angel García Escudero and Agustín Mayo Iscar (University of Valladolid) for sharing the data and for the continuous collaboration and lively discussions on robust clustering.