Multivariate Data Sets

File Description of Data Set
baby The babyfood dataset, introduced by Box and Draper (1987), consists of 27 readings on the viscosity of 27 babyfoods. There are 5 explanatory variables (x1, ..., x5) and 4 dependent variables:

- y1 the initial viscosity;
- y2 the viscosity after 3 months;
- y3 the viscosity after 6 months;
- y4 the viscosity after 9 months.
biochemical The Biochemical dataset (from Seber, 1984, Table 9.12) contains measurements of phosphate and chloride in the urine of 12 men with similar weights. Observation 3, which has the lowest phosphate value, stands out clearly from the rest. However, neither the normal Q-Q plot of phosphate nor that of chloride reveals any atypical value. Thus the atypical character of observation 3 is visible only when both variables are considered simultaneously. Omitting this observation has no important effect on the means or variances, but the correlation almost doubles in magnitude; that is, the outlier roughly halves the correlation relative to its value without the outlier.
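The effect described for the biochemical data can be reproduced on a toy example. The sketch below uses made-up values (not the Seber measurements) and a hypothetical helper `pearson`; it adds one point whose coordinates each lie inside the marginal ranges but which is jointly atypical, and compares the correlation with and without it.

```python
from statistics import mean


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Made-up illustrative values, NOT the data from Seber (1984).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.0, 7.1, 7.9]

# One extra point: x = 1.5 and y = 7.5 are each unremarkable on their own,
# but the pair is far from the main linear trend.
x_out = x + [1.5]
y_out = y + [7.5]

r_clean = pearson(x, y)
r_with_outlier = pearson(x_out, y_out)
print(f"without outlier: {r_clean:.3f}, with outlier: {r_with_outlier:.3f}")
```

Neither marginal distribution flags the added point, yet the correlation drops markedly, which is why such observations are visible only when both variables are examined together.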
bus The data set bus (Hettich and Bay, 1999) corresponds to a study in automatic vehicle recognition (Siebert, 1987). Each of the 218 rows corresponds to a view of a bus silhouette, and contains 18 attributes of the image.
car The car dataset is a contingency table reporting the number of adults, out of 1578, who stated that a particular make of vehicle had a specific quality. In all there are 11713 counts. These data are taken from the 2014 Auto Brand Perception survey by Consumer Reports.
clus2over The clus2over dataset, introduced by Atkinson and Riani (2007), is a synthetic dataset with 1000 five-dimensional observations generated from two multivariate normal distributions.
clothes The clothes data contains a contingency table between the 28 member states of the European Union (data collected well before Brexit) and 5 price segments. These are occurrences of country trade flows, for a wide set of clothes: x1 denotes the lowest price segment and x5 the highest price segment. In all there are 4373 counts.
clothes33 The clothes33 data contains a contingency table between the 28 member states of the European Union (data collected well before Brexit) and 33 price segments. These are occurrences of country trade flows, for a wide set of clothes: x1 denotes the lowest price segment and x33 the highest price segment. In all there are 11874 counts.
csdPerceptions The csdPerceptions dataset contains a contingency table displaying the proportion of people in a sample who associate various attributes with different brands of carbonated soft drink (CSD). The contingency table has size 8x11: the rows are the 8 brands and the columns are the 11 attributes. This dataset has been widely used in the correspondence analysis literature.
citiesItaly The citiesItaly dataset contains 107 records for the 107 provinces of Italy for seven indicators concerning the quality of life. The 7 variables are:
addedval = added value per capita
depost = amount of bank deposits per inhabitant
unemploy = unemployment rate
export = percentage of export divided by GNP of the province
bankrup = percentage of bankruptcy declarations on active companies
billsover = bills overdue indicator (percentage of bills protested on bills issued)
The data come from the Italian financial newspaper Il Sole 24 Ore, which every year produces the quality of life ranking of the Italian provinces.
DS12 The dataset DS12 comes from a laboratory of an international pharmaceutical company. It relates to a problem of validation of immunoassays used for anti-drug antibodies detection. There are 50 units, corresponding to chemical measurements on serum samples taken from multiple groups of 50 individuals not exposed to the agent or drug responsible for a factor to be detected (negative control samples). There are 12 variables, one for each replicate of measurements on the individuals. The variables are not obviously correlated. Moreover, the three groups, corresponding to three different plates where the serum samples were analyzed, are quite distinct.
DS17 The dataset DS17 comes from an international pharmaceutical company. It relates to a problem of validation of immunoassays used for anti-drug antibodies detection. There are 50 units, corresponding to chemical measurements on serum samples taken from multiple groups of 50 individuals, and 17 variables, one variable for each replicate of measurements on the individuals. The variables are highly correlated. The individuals are 25 males and 25 females, but it is not clear whether the two groups have to be treated as a single population. There seem to be 3 outliers, which would suggest excluding three individuals from the set of 50 negative control samples.
databri The databri dataset, introduced by Atkinson et al. (2004), is a modification of the sixty_eighty dataset, to reduce the sharpness of division between groups. It consists of 170 observations and two variables.
diabetes The diabetes dataset, introduced by Reaven and Miller (1979), consists of 145 observations (patients). For each patient three measurements are reported:

- plasma glucose response to oral glucose;
- plasma insulin response to oral glucose;
- degree of insulin resistance.
dyestuff The dyestuff dataset, introduced by Box and Draper (1987), consists of 64 observations at the points of a 2^6 factorial and three responses: strength, hue and brightness. Box and Draper show that only three explanatory variables have significant effects on the responses. 
dystroall The dystroall dataset, introduced by Rencher (1995), consists of 194 observations (patients affected by Duchenne Muscular Dystrophy). For each patient 6 variables are reported:

- age;
- month of the year;
- level of creatine kinase;
- level of hemopexin;
- level of lactate dehydrogenase;
- level of pyruvate kinase.
dystrosmall The dystrosmall dataset is a subset of dystroall with 73 observations.
electrodes The electrodes dataset, given and described by Flury and Riedwyl (1988), contains measurements of 100 electrodes, 50 from each of two machines manufacturing supposedly identical electrodes. There are five measurements on each electrode: y1, y2 and y5 are diameters, while y3 and y4 are lengths.
emilia2001 The dataset of the municipalities in Emilia Romagna, introduced by Atkinson et al. (2004), contains 341 records for 341 municipalities of Emilia Romagna (an Italian region) for 28 demographic variables, which are:

- population aged less than 10;
- population aged more than 75;
- single-member families;
- widows and widowers;
- population aged more than 25 who are graduates;
- population aged over 6 having no education;
- activity rate;
- unemployment rate;
- standardised natural increase in population;
- standardised change in population due to migration;
- average birth rate over 1992-94;
- fecundity: three-year average birth rate amongst women of child-bearing age;
- occupied houses built since 1982;
- occupied houses with 2 or more WCs;
- occupied houses with fixed heating system;
- TV licence holders;
- number of cars for 100 inhabitants;
- luxury cars;
- working in hotels and restaurants;
- working in banking and finance;
- average declared income amongst those filing income tax returns;
- inhabitants filing income tax returns;
- residents employed in factories and public services;
- employees employed in factories with more than 10 employees;
- employees employed in factories with more than 50 employees;
- artisanal enterprises;
- entrepreneurs and skilled self-employed among those of working age.
fondi The fondi data set, introduced by Zani (2000), consists of 103 investment funds operating in Italy since April 1996. The variables are:

- short term (12 month) performance;
- medium term (36 month) performance;
- medium term (36 month) volatility.
fondi_large This dataset is a superset of the 103 investment funds described above. It reports the profitability of 309 investment funds, 99 of which report a loss. It is clear from scatterplots of the data that the negative responses have a lower variance than the positive ones and a different relationship with the explanatory variables. Because the data include negative responses, the Box-Cox transformation cannot be used. A robust version of an extension to the Yeo-Johnson transformation, which allows different transformations for positive and negative responses, is needed to analyze these data.
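As a rough illustration of the idea, the sketch below implements the standard Yeo-Johnson transformation with two separate parameters, one for non-negative and one for negative responses. The function name and the parameter names `lam_pos` and `lam_neg` are hypothetical, and the sketch covers only the transformation itself: it makes no attempt at the robust estimation of the parameters that the analysis of these data requires.

```python
import math


def yeo_johnson_extended(y, lam_pos, lam_neg):
    """Yeo-Johnson transformation with separate parameters by sign of y.

    lam_pos is applied when y >= 0 and lam_neg when y < 0. Setting
    lam_pos == lam_neg gives the ordinary Yeo-Johnson transformation.
    """
    if y >= 0:
        if abs(lam_pos) < 1e-12:
            return math.log1p(y)                      # limit as lambda -> 0
        return ((y + 1.0) ** lam_pos - 1.0) / lam_pos
    if abs(lam_neg - 2.0) < 1e-12:
        return -math.log1p(-y)                        # limit as lambda -> 2
    return -((1.0 - y) ** (2.0 - lam_neg) - 1.0) / (2.0 - lam_neg)


# Made-up responses containing both gains and losses.
responses = [-2.0, -0.5, 0.0, 0.5, 3.0]
transformed = [yeo_johnson_extended(v, lam_pos=0.5, lam_neg=1.5) for v in responses]
print(transformed)
```

With both parameters equal to 1 the transformation is the identity, which gives a convenient sanity check. Unlike Box-Cox, the transformation is defined for negative values, which is why it suits data that include losses.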
head The Swiss Heads dataset was introduced by B. Flury and H. Riedwyl (1988). It contains information on six variables describing the dimensions of the heads of 200 twenty-year-old Swiss soldiers.
ir The ir dataset, introduced by Anderson (1935), contains 150 measurements on three species of iris (50 measurements on each species). Four measurements of characteristic dimensions of the flowers were made:

- y1: sepal length;
- y2: sepal width;
- y3: petal length;
- y4: petal width.
milk The milk dataset, introduced by Daudin, Duby and Trecourt (1988), consists of 85 observations about the composition of milk containers. For each container 8 measures are reported:

- y1: density;
- y2: fat content;
- y3: protein content;
- y4: casein content;
- y5: cheese dry substances measured in the factory;
- y6: cheese dry substances measured in the laboratory;
- y7: milk dry substance;
- y8: cheese produced.
mobilephone The mobilephone dataset contains a contingency table displaying the number of people in a sample who associate various mobile carriers with various attributes. The contingency table has size 8x11. In the rows there are the 8 brands and in the columns the 19 attributes. This dataset has been widely used in the correspondence analysis literature. The correspondence analysis chart quickly shows that One-tel was seen by consumers as being associated with "Here today, gone tomorrow" (the company was in financial trouble at the time of the study); that the new entrants to the market, AAPT and Virgin, are associated with "Don't know much about them"; and that the market leader, Telstra, skews towards "Good coverage" and "Bureaucratic".
mussels The horse mussels dataset, introduced by Cook and Weisberg (1994), consists of 82 observations on horse mussels from New Zealand. The following measurements are reported:
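Correspondence analysis of tables such as this one starts from the departures of the observed counts from independence. The sketch below, on a made-up 2x2 table (not the mobilephone data) and with a hypothetical helper name, computes the standardized residuals and the total inertia, that is, the chi-squared statistic divided by the grand total.

```python
import math


def standardized_residuals(table):
    """Return (observed - expected) / sqrt(expected) for each cell, plus the grand total.

    Expected counts are those under independence of rows and columns.
    """
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    resid = []
    for i, row in enumerate(table):
        resid_row = []
        for j, obs in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            resid_row.append((obs - expected) / math.sqrt(expected))
        resid.append(resid_row)
    return resid, n


# Made-up counts: rows are two hypothetical brands, columns two hypothetical attributes.
counts = [[20, 10],
          [10, 20]]

resid, n = standardized_residuals(counts)
chi2 = sum(r * r for row in resid for r in row)
total_inertia = chi2 / n
print(resid, chi2, total_inertia)
```

Positive residuals mark brand and attribute pairs that occur more often than independence would predict; these are the associations that a correspondence analysis chart displays as proximity.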

- shell length, mm;
- shell width, mm;
- shell height, mm;
- shell mass, grams;
- muscle mass, grams.
quality The quality dataset, introduced by Atkinson et al. (2004), consists of information on the quality of life in 103 Italian provinces. For each province 6 measures are reported:

- y1: average amount of bank deposits per inhabitant;
- y2: number of robberies per 100000 inhabitants;
- y3: number of housebreakings per 100000 inhabitants;
- y4: number of suicides, committed or attempted, per 100000 inhabitants;
- y5: number of gyms per 100000 inhabitants;
- y6: average expenditure on theatre and concerts per inhabitant.
recordfg The recordfg (National Track Records for Women) data set, introduced by Johnson and Wichern (1997), contains the national records for women from 55 countries for seven races which are:

- y1: 100 metres in seconds;
- y2: 200 metres in seconds;
- y3: 400 metres in seconds;
- y4: 800 metres in seconds;
- y5: 1500 metres in seconds;
- y6: 3000 metres in seconds;
- y7: marathon.
sixty_eighty The sixty_eighty dataset, introduced by Atkinson et al. (2004), is a simulated bivariate dataset with two well separated clusters: a very dense cluster of 60 observations and a diffuse cluster of 80 observations.
swiss_banknotes The swiss_banknotes dataset, introduced by Flury and Riedwyl (1988), contains 200 records on Swiss bank notes, printed before the Second World War, 100 of which are genuine and 100 forged. For each bank note 6 measurements of the size are reported:
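A dataset with the same general structure can be simulated as follows. This is only a sketch: the cluster centres, the standard deviations and the seed are assumptions chosen for illustration and are not the values used to generate sixty_eighty.

```python
import random
from statistics import pstdev


def simulate_two_clusters(n_dense=60, n_diffuse=80, seed=0):
    """Simulate a dense and a diffuse bivariate normal cluster.

    Centres (0, 0) and (6, 6) and standard deviations 0.3 and 1.5 are
    illustrative assumptions, not the parameters of the original data.
    """
    rng = random.Random(seed)
    dense = [(rng.gauss(0.0, 0.3), rng.gauss(0.0, 0.3)) for _ in range(n_dense)]
    diffuse = [(rng.gauss(6.0, 1.5), rng.gauss(6.0, 1.5)) for _ in range(n_diffuse)]
    labels = [0] * n_dense + [1] * n_diffuse
    return dense + diffuse, labels


points, labels = simulate_two_clusters()

# Spread of the first coordinate within each cluster.
spread_dense = pstdev(p[0] for p, g in zip(points, labels) if g == 0)
spread_diffuse = pstdev(p[0] for p, g in zip(points, labels) if g == 1)
print(len(points), spread_dense, spread_diffuse)
```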

- length of bank note near the top;
- left-hand height of bank note;
- right-hand height of bank note;
- distance from bottom of bank note to beginning of patterned border;
- distance from top of bank note to beginning of patterned border;
- diagonal distance.
three_clust_2outl The three_clust_2outl dataset, introduced by Atkinson et al. (2004), is a simulated bivariate dataset, similar to the sixty_eighty dataset, but with an additional third cluster (units 141-158) and two outliers (units 159 and 160). The sizes of the groups are therefore 80, 60, 18 and 2.
wine The data set wine is a part of one given in Hettich and Bay (1999). It contains, for each of 59 wines grown in the same region in Italy, the quantities of 13 constituents. The original purpose of the analysis (DeVel, Aeberhard and Coomans, 1993) was to classify wines from different cultivars by means of these measurements. Here we report only one cultivar. The plots of the classical squared distances as a function of observation number and their Q-Q plots with respect to the chi-squared distribution do not show any clear outliers. When a robust estimate is used instead, at least seven points stand out clearly.
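The contrast between classical and robust distances can be illustrated on a toy bivariate example. The sketch below uses made-up points (not the wine data, which have 13 variables) and hypothetical helper names. It computes classical squared Mahalanobis distances and compares them with the 0.99 quantile of the chi-squared distribution on 2 degrees of freedom. The "reference subset" estimate is only a crude, hand-picked stand-in for a robust estimator and is not the estimator used for the wine analysis.

```python
import math


def mean_cov_2d(points):
    """Sample mean and unbiased covariance (sxx, sxy, syy) of bivariate points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def mahalanobis_sq(point, center, cov):
    """Squared Mahalanobis distance, using the closed-form 2x2 inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - center[0], point[1] - center[1]
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det


def chi2_quantile_2df(q):
    """Quantile of chi-squared with 2 degrees of freedom: -2 * log(1 - q)."""
    return -2.0 * math.log(1.0 - q)


# Made-up data: eight points near a line plus one jointly atypical point (last).
data = [(1.0, 1.2), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8),
        (5.0, 5.1), (6.0, 6.0), (7.0, 7.1), (8.0, 7.9), (1.5, 7.5)]
cutoff = chi2_quantile_2df(0.99)

# Classical distances: mean and covariance estimated from all points.
center_all, cov_all = mean_cov_2d(data)
d2_classical = [mahalanobis_sq(p, center_all, cov_all) for p in data]

# Stand-in for a robust fit: estimates from a hand-picked subset (first 8 points).
center_ref, cov_ref = mean_cov_2d(data[:8])
d2_reference = [mahalanobis_sq(p, center_ref, cov_ref) for p in data]

print(max(d2_classical) > cutoff, d2_reference[-1] > cutoff)
```

With n = 9 points a classical squared distance can never exceed (n-1)^2/n, about 7.11, which is below the cutoff of about 9.21, so the classical distances cannot flag any point here. Once the estimates are no longer contaminated by the atypical point, its distance becomes very large.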