Multivariate Data Sets

File Description of Data Set
baby The babyfood dataset, introduced by Box and Draper (1987), consists of 27 readings on the viscosity of 27 baby foods. There are five explanatory variables (x1, ..., x5) and four dependent variables:

- y1 the initial viscosity;
- y2 the viscosity after 3 months;
- y3 the viscosity after 6 months;
- y4 the viscosity after 9 months.
biochemical The biochemical dataset (from Seber, 1984, Table 9.12) contains measurements of phosphate and chloride in the urine of 12 men with similar weights. Observation 3, which has the lowest phosphate value, stands out clearly from the rest. However, the normal Q-Q plot of phosphate does not reveal any atypical value, and neither does the Q-Q plot of chloride. The atypical character of observation 3 is therefore visible only when both variables are considered simultaneously. Omitting this observation has no important effect on the means or variances, but the correlation almost doubles in magnitude; that is, the outlier roughly halves the correlation relative to the value obtained without it.
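The effect described for observation 3 can be illustrated with a small sketch. The numbers below are synthetic and purely illustrative (they are not the Seber data), and the helper name correlation_with_and_without is hypothetical. One point has the lowest x value but an unremarkable y value, so it is atypical only jointly, and the correlation is compared with and without it.

```python
import numpy as np

def correlation_with_and_without(x, y, idx):
    """Return (r_all, r_drop): Pearson correlation on all points
    and after removing the observation at position idx."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r_all = np.corrcoef(x, y)[0, 1]
    keep = np.ones(len(x), dtype=bool)
    keep[idx] = False
    r_drop = np.corrcoef(x[keep], y[keep])[0, 1]
    return r_all, r_drop

# Synthetic data (assumption, not the biochemical dataset):
# the first point has the lowest x, but its y lies inside the range of the others.
x = np.array([1.8, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0])
y = np.array([6.0, 2.2, 2.4, 3.1, 3.3, 4.2, 4.4, 5.1, 5.4, 6.2, 6.3, 7.1])

r_all, r_drop = correlation_with_and_without(x, y, idx=0)
print(f"with outlier: {r_all:.3f}, without: {r_drop:.3f}")
```

In this toy example a single bivariate outlier noticeably lowers the correlation even though neither marginal distribution flags it as extreme.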
bus The data set bus (Hettich and Bay, 1999) corresponds to a study in automatic vehicle recognition (Siebert, 1987). Each of the 218 rows corresponds to a view of a bus silhouette, and contains 18 attributes of the image.
clus2over The clus2over dataset, introduced by Atkinson and Riani (2007), is a synthetic dataset with 1000 five-dimensional observations generated from two multivariate normal distributions.
DS12 The dataset DS12 comes from a laboratory of an international pharmaceutical company. It relates to a problem of validation of immunoassays used for anti-drug antibodies detection. There are 50 units, corresponding to chemical measurements on serum samples taken from multiple groups of 50 individuals not exposed to the agent or drug responsible for a factor to be detected (negative control samples). There are 12 variables, one for each replicate of measurements on the individuals. The variables are not obviously correlated. Moreover the three groups, corresponding to three different plates where the serum samples were analyzed, are quite distinct.
DS17 The dataset DS17 comes from an international pharmaceutical company. It relates to a problem of validation of immunoassays used for anti-drug antibodies detection. There are 50 units, corresponding to chemical measurements on serum samples taken from multiple groups of 50 individuals, and 17 variables, one variable for each replicate of measurements on the individuals. The variables are highly correlated. The individuals are 25 males and 25 females, but it is not clear whether the two groups should be treated as a single population. There seem to be 3 outliers, which would suggest excluding three individuals from the set of 50 negative control samples.
databri The databri dataset, introduced by Atkinson et al. (2004), is a modification of the sixty_eighty dataset that reduces the sharpness of the division between groups. It consists of 170 observations and two variables.
diabetes The diabetes dataset, introduced by Reaven and Miller (1979), consists of 145 observations (patients). For each patient three measurements are reported:

- plasma glucose response to oral glucose;
- plasma insulin response to oral glucose;
- degree of insulin resistance.
dyestuff The dyestuff dataset, introduced by Box and Draper (1987), consists of 64 observations at the points of a 2^6 factorial and three responses: strength, hue and brightness. Box and Draper show that only three explanatory variables have significant effects on the responses. 
dystroall The dystroall dataset, introduced by Rencher (1995), consists of 194 observations (patients affected by Duchenne Muscular Dystrophy). For each patient 6 variables are reported:

- age;
- month of the year;
- level of creatine kinase;
- level of hemopexin;
- level of lactate dehydrogenase;
- level of pyruvate kinase.
dystrosmall The dystrosmall dataset is a subset of dystroall with 73 observations.
electrodes The electrodes dataset, given and described by Flury and Riedwyl (1988), contains measurements of 50+50 electrodes respectively from two machines manufacturing supposedly identical electrodes. There are five measurements on each electrode: y1, y2 and y5 are diameters, while y3 and y4 are lengths.
emilia2001 The dataset of the municipalities in Emilia Romagna, introduced by Atkinson et al. (2004), contains records for the 341 municipalities of Emilia Romagna (an Italian region) on 28 demographic variables, which are:

- population aged less than 10;
- population aged more than 75;
- single-member families;
- widows and widowers;
- population aged more than 25 who are graduates;
- population aged over 6 having no education;
- activity rate;
- unemployment rate;
- standardised natural increase in population;
- standardised change in population due to migration;
- average birth rate over 1992-94;
- fecundity: three-year average birth rate amongst women of child-bearing age;
- occupied houses built since 1982;
- occupied houses with 2 or more WCs;
- occupied houses with fixed heating system;
- TV licence holders;
- number of cars for 100 inhabitants;
- luxury cars;
- working in hotels and restaurants;
- working in banking and finance;
- average declared income amongst those filing income tax returns;
- inhabitants filing income tax returns;
- residents employed in factories and public services;
- employees employed in factories with more than 10 employees;
- employees employed in factories with more than 50 employees;
- artisanal enterprises;
- entrepreneurs and skilled self-employed among those of working age.
fondi The fondi data set, introduced by Zani (2000), consists of 103 investment funds operating in Italy since April 1996. The variables are:

- short term (12 month) performance;
- medium term (36 month) performance;
- medium term (36 month) volatility.
fondi_large The fondi_large dataset is a superset of the 103 investment funds described above. It contains data on the profitability of 309 investment funds, 99 of which report a loss. It is clear from scatterplots of the data that the negative responses have a lower variance than the positive ones and a different relationship with the explanatory variables. Because the data include negative responses, the Box-Cox transformation cannot be used. A robust version of an extension to the Yeo-Johnson transformation, which allows different transformations for positive and negative responses, is needed to analyze these data.
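As a minimal sketch of the idea behind such an extension, the function below applies the usual Yeo-Johnson formulas but with one parameter for non-negative values and a separate one for negative values. The function name extended_yeo_johnson and the parameter names lam_pos and lam_neg are hypothetical; the sketch covers only the transformation itself, not the robust estimation of the parameters.

```python
import numpy as np

def extended_yeo_johnson(y, lam_pos, lam_neg):
    """Yeo-Johnson transform with separate parameters for
    non-negative (lam_pos) and negative (lam_neg) responses."""
    y = np.asarray(y, dtype=float)
    out = np.empty_like(y)
    pos = y >= 0
    neg = ~pos

    # Non-negative branch: Box-Cox applied to y + 1.
    if lam_pos == 0:
        out[pos] = np.log1p(y[pos])
    else:
        out[pos] = ((y[pos] + 1.0) ** lam_pos - 1.0) / lam_pos

    # Negative branch: Box-Cox applied to 1 - y with parameter 2 - lambda.
    if lam_neg == 2:
        out[neg] = -np.log1p(-y[neg])
    else:
        out[neg] = -((1.0 - y[neg]) ** (2.0 - lam_neg) - 1.0) / (2.0 - lam_neg)
    return out

# Illustrative values only (not taken from fondi_large).
returns = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(extended_yeo_johnson(returns, lam_pos=0.5, lam_neg=1.5))
```

With lam_pos = lam_neg = 1 the transformation leaves the data unchanged, and setting both parameters to the same value gives the ordinary Yeo-Johnson transformation.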
geyser The geyser dataset is taken from the MASS library (Venables and Ripley, 2002). It contains information on 272 successive eruptions of the Old Faithful geyser in Yellowstone National Park, Wyoming. The variables are:

- y1: the duration of the ith eruption;
- y2: the waiting time to the start of that eruption from the start of eruption i − 1.
head The Swiss Heads dataset was introduced by Flury and Riedwyl (1988). It contains information on six variables describing the dimensions of the heads of 200 twenty-year-old Swiss soldiers.
ir The ir dataset, introduced by Anderson (1935), contains 150 measurements on three species of iris (50 measurements on each species). Four measurements of characteristic dimensions of the flowers were made:

- y1: sepal length;
- y2: sepal width;
- y3: petal length;
- y4: petal width.
milk The milk dataset, introduced by Daudin, Duby and Trecourt (1988), consists of 85 observations about the composition of milk containers. For each container 8 measures are reported:

- y1: density;
- y2: fat content;
- y3: protein content;
- y4: casein content;
- y5: cheese dry substances measured in the factory;
- y6: cheese dry substances measured in the laboratory;
- y7: milk dry substance;
- y8: cheese produced.
mussels The horse mussels dataset, introduced by Cook and Weisberg (1994), consists of 82 observations on horse mussels from New Zealand. The following measurements are reported:

- shell length, mm;
- shell width, mm;
- shell height, mm;
- shell mass, grams;
- muscle mass, grams.
quality The quality dataset, introduced by Atkinson et al. (2004), consists of information on the quality of life in 103 Italian provinces. For each province 6 measures are reported:

- y1: average amount of bank deposits per inhabitant;
- y2: number of robberies per 100000 inhabitants;
- y3: number of housebreakings per 100000 inhabitants;
- y4: number of suicides, committed or attempted, per 100000 inhabitants;
- y5: number of gyms per 100000 inhabitants;
- y6: average expenditure on theatre and concerts per inhabitant.
recordfg The recordfg (National Track Records for Women) data set, introduced by Johnson and Wichern (1997), contains the national records for women from 55 countries for seven races which are:

- y1: 100 metres in seconds;
- y2: 200 metres in seconds;
- y3: 400 metres in seconds;
- y4: 800 metres in seconds;
- y5: 1500 metres in seconds;
- y6: 3000 metres in seconds;
- y7: marathon.
sixty_eighty The sixty_eighty dataset, introduced by Atkinson et al. (2004), is a simulated bivariate dataset with two well-separated clusters: a very dense cluster of 60 observations and a diffuse cluster of 80 observations.
swiss_banknotes The swiss_banknotes dataset, introduced by Flury and Riedwyl (1988), contains 200 records on Swiss bank notes printed before the Second World War, 100 of which are genuine and 100 forged. For each bank note 6 size measurements are reported:

- length of bank note near the top;
- left-hand height of bank note;
- right-hand height of bank note;
- distance from bottom of bank note to beginning of patterned border;
- distance from top of bank note to beginning of patterned border;
- diagonal distance.
three_clust_2outl The three_clust_2outl dataset, introduced by Atkinson et al. (2004), is a simulated bivariate dataset, similar to the sixty_eighty dataset, but with an additional third cluster (units 141-158) and two outliers (units 159 and 160). The sizes of the groups are therefore 80, 60, 18 and 2.
wine The data set wine is a subset of the data given in Hettich and Bay (1999). It contains, for each of 59 wines grown in the same region in Italy, the quantities of 13 constituents. The original purpose of the analysis (DeVel, Aeberhard and Coomans, 1993) was to classify wines from different cultivars by means of these measurements. Here we report only one cultivar. The plots of the classical squared distances as a function of observation number and their Q-Q plots with respect to the chi-squared distribution do not show any clear outliers. With a robust estimate, however, at least seven points stand out clearly.
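The classical squared distances mentioned here are squared Mahalanobis distances computed from the sample mean and sample covariance matrix. The sketch below shows one way to compute them; the function name classical_squared_distances is hypothetical, and the data are random numbers with the same shape as the wine data (59 rows, 13 columns), not the wine measurements themselves. The robust estimate is not shown.

```python
import numpy as np

def classical_squared_distances(X):
    """Squared Mahalanobis distances from the sample mean,
    using the sample covariance matrix (divisor n - 1)."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    # Solve a linear system rather than forming the inverse explicitly.
    solved = np.linalg.solve(cov, centered.T).T
    return np.einsum("ij,ij->i", centered, solved)

# Synthetic stand-in for a 59 x 13 data matrix (assumption, not the wine data).
rng = np.random.default_rng(0)
n, p = 59, 13
X = rng.normal(size=(n, p))

d2 = classical_squared_distances(X)
print(d2.sum(), (n - 1) * p)       # the two values agree up to rounding
print(d2.max(), (n - 1) ** 2 / n)  # no distance can exceed this bound
```

Two properties of these distances help explain why outliers can go unnoticed: they always sum to (n - 1)p, and none can exceed (n - 1)^2 / n, because the outliers themselves enter the mean and covariance used to measure them.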