Flexible Statistics and Data Analysis Toolbox Functions

Multivariate Data Sets (data matrices and contingency tables)

Data matrices

File	Description of Data Set
`baby`	The babyfood dataset, introduced by Box and Draper (1987), consists of 27 readings on the viscosity of baby food. There are 5 explanatory variables (x1, ..., x5) and 4 dependent variables: - y1 the initial viscosity; - y2 the viscosity after 3 months; - y3 the viscosity after 6 months; - y4 the viscosity after 9 months.
`biochemical`	The Biochemical dataset (from Seber, 1984, Table 9.12) contains measurements of phosphate and chloride in the urine of 12 men with similar weights. Observation 3, which has the lowest phosphate value, stands out clearly from the rest. However, the normal Q-Q plot of phosphate does not reveal any atypical value, and the same is true of the Q-Q plot of chloride. The atypical character of observation 3 is therefore visible only when both variables are considered simultaneously. Omitting this observation has no important effect on the means or variances, but the correlation almost doubles in magnitude; that is, the outlier reduces the correlation to about half the value obtained without it.
`bus`	The data set bus (Hettich and Bay, 1999) corresponds to a study in automatic vehicle recognition (Siebert, 1987). Each of the 218 rows corresponds to a view of a bus silhouette, and contains 18 attributes of the image.
`clus2over`	The clus2over dataset, introduced by Atkinson and Riani (2007), is a synthetic dataset with 1000 five-dimensional observations generated from two multivariate normal distributions.
`citiesItaly`	The citiesItaly dataset contains 103 records, one for each of the former 103 provinces of Italy, on seven indicators concerning the quality of life. The variables are: addedval = added value per capita; depost = amount of bank deposits per inhabitant; unemploy = unemployment rate; export = percentage of export divided by GNP of the province; bankrup = percentage of bankruptcy declarations on active companies; billsover = bills overdue indicator (percentage of bills protested on bills issued). The data come from the Italian financial newspaper Il Sole 24 Ore, which every year produces the quality of life ranking of the Italian provinces.
`citiesItaly2024`	The citiesItaly2024 dataset contains 107 records, one for each of the 107 provinces of Italy, on 12 indicators from the 2024 survey of quality of life. The 12 variables are: Deposit = Bank Deposits; Bankrup = Bankrupt Companies; UrbanFra = Index of Urban Fragility; Paym30D = Invoice Payments Within 30 Days; ElecPar = Electoral Participation; QualLif = Quality of Life of Children, Young People and the Elderly; Protest = Protests per capita; SalaryA = Average Annual Salary; SpendingA = Family Spending; Employm = Employment Rate; AddedVa = Value Added Per Capita; LowISEE = Families with low ISEE. The data come from the Italian financial newspaper Il Sole 24 Ore, which every year produces the quality of life ranking of the Italian provinces (IlSole24ORE/QDV2024). `citiesItaly2024.Properties.UserData` holds a cell array with two elements. The first element contains the geotable with the shape information for each province. To load the geotable properly you need to have the Mapping Toolbox installed; otherwise the following warnings are issued: "Warning: Cannot load an object of class 'mappolyshape': Its class cannot be found." and "Warning: The variable 'Shape' failed to load, and has been replaced with an empty array. This might have happened if its class does not exist on the path." The second element contains an array of size 107x2 with the latitudes and longitudes of each province.
`DS12`	The dataset DS12 comes from a laboratory of an international pharmaceutical company. It relates to a problem of validation of immunoassays used for anti-drug antibodies detection. There are 50 units, corresponding to chemical measurements on serum samples taken from multiple groups of 50 individuals not exposed to the agent or drug responsible for a factor to be detected (negative control samples). There are 12 variables, one for each replicate of measurements on the individuals. The variables are not obviously correlated. Moreover the three groups, corresponding to three different plates where the serum samples were analyzed, are quite distinct.
`DS17`	The dataset DS17 comes from an international pharmaceutical company. It relates to a problem of validation of immunoassays used for anti-drug antibodies detection. There are 50 units, corresponding to chemical measurements on serum samples taken from multiple groups of 50 individuals, and 17 variables, one variable for each replicate of measurements on the individuals. The variables are highly correlated. The individuals are 25 males and 25 females, but it is not clear whether the two groups should be treated as a single population. There seem to be 3 outliers, which would suggest excluding three individuals from the set of 50 negative control samples.
`databri`	The databri dataset, introduced by Atkinson et al. (2004), is a modification of the sixty_eighty dataset, to reduce the sharpness of division between groups. It consists of 170 observations and two variables.
`diabetes`	The diabetes dataset, introduced by Reaven and Miller (1979), consists of 145 observations (patients). For each patient three measurements are reported: - plasma glucose response to oral glucose; - plasma insulin response to oral glucose; - degree of insulin resistance.
`dyestuff`	The dyestuff dataset, introduced by Box and Draper (1987), consists of 64 observations at the points of a 2^6 factorial and three responses: strength, hue and brightness. Box and Draper show that only three explanatory variables have significant effects on the responses.
`dystroall`	The dystroall dataset, introduced by Rencher (1995), consists of 194 observations (patients affected by Duchenne Muscular Dystrophy). For each patient 6 variables are reported: - age; - month of the year; - level of creatine kinase; - level of hemopexin; - level of lactate dehydrogenase; - level of pyruvate kinase.
`dystrosmall`	The dystrosmall dataset is a subset of dystroall with 73 observations.
`electrodes`	The electrodes dataset, given and described by Flury and Riedwyl (1988), contains measurements of 100 electrodes, 50 from each of two machines manufacturing supposedly identical electrodes. There are five measurements on each electrode: y1, y2 and y5 are diameters, while y3 and y4 are lengths.
`emilia2001`	The dataset of the municipalities in Emilia Romagna, introduced by Atkinson et al. (2004), contains 341 records for 341 municipalities of Emilia Romagna (an Italian region) for 28 demographic variables, which are: - population aged less than 10; - population aged more than 75; - single-member families; - widows and widowers; - population aged more than 25 who are graduates; - those aged over 6 having no education; - activity rate; - unemployment rate; - standardised natural increase in population; - standardised change in population due to migration; - average birth rate over 1992-94; - fecundity: three-year average birth rate amongst women of child-bearing age; - occupied houses built since 1982; - occupied houses with 2 or more WCs; - occupied houses with fixed heating system; - TV licence holders; - number of cars per 100 inhabitants; - luxury cars; - working in hotels and restaurants; - working in banking and finance; - average declared income amongst those filing income tax returns; - inhabitants filing income tax returns; - residents employed in factories and public services; - employees employed in factories with more than 10 employees; - employees employed in factories with more than 50 employees; - artisanal enterprises; - entrepreneurs and skilled self-employed among those of working age.
`fat`	Physical measurements of 251 males. The variables are: - body_fat: Percent body fat using Brozek's equation, 457/Density - 414.2; - body_fat_siri: Percent body fat using Siri's equation, 495/Density - 450; - density: Density (gm/cm^3); - age: Age (yrs); - weight: Weight (lbs); - height: Height (inches); - BMI: Adiposity index = Weight/Height^2 (kg/m^2); - ffweight: Fat Free Weight = (1 - fraction of body fat) * Weight, using Brozek's formula (lbs); - neck: Neck circumference (cm); - chest: Chest circumference (cm); - abdomen: Abdomen circumference (cm) "at the umbilicus and level with the iliac crest"; - hip: Hip circumference (cm); - thigh: Thigh circumference (cm); - knee: Knee circumference (cm); - ankle: Ankle circumference (cm); - bicep: Extended biceps circumference (cm); - forearm: Forearm circumference (cm); - wrist: Wrist circumference (cm) "distal to the styloid processes". Note that observation 182 in the original dataset has been removed because it reported a percent body fat estimate equal to 0. The purpose is to predict body_fat from the other measurements. The source of the data is attributed to Dr. A. Garth Fisher, Human Performance Research Center, Brigham Young University, Provo, Utah 84602.
`fondi`	The fondi data set, introduced by Zani (2000), consists of 103 investment funds operating in Italy since April 1996. The variables are: - short term (12 month) performance; - medium term (36 month) performance; - medium term (36 month) volatility.
`fondi_large`	The fondi_large dataset is a superset of the 103 investment funds described above. There are 309 investment funds, 99 of which report a loss in profitability. It is clear from scatterplots of the data that the negative responses have a lower variance than the positive ones and a different relationship with the explanatory variables. Because the data include negative responses, the Box-Cox transformation cannot be used. A robust version of an extension to the Yeo-Johnson transformation, which allows different transformations for positive and negative responses, is needed to analyze these data.
`head`	The Swiss Heads dataset was introduced by B. Flury and H. Riedwyl (1988). It contains information on six variables describing the dimensions of the heads of 200 twenty-year-old Swiss soldiers.
`hprice`	Sales prices of 546 houses in the city of Windsor, Ontario, Canada, during July, August and September, 1987. The variables are: - lotsize: the lot size of a property in square feet; - bedrooms: number of bedrooms; - bathrms: number of full bathrooms; - stories: number of stories excluding basement; - driveway: does the house have a driveway? - recroom: does the house have a recreational room? - fullbase: does the house have a full finished basement? - gashw: does the house use gas for hot water heating? - airco: does the house have central air conditioning? - garagepl: number of garage places; - prefarea: is the house located in the preferred neighbourhood of the city? - price: sale price of a house. The reference is Verbeek, Marno (2004) A Guide to Modern Econometrics, John Wiley and Sons, chapter 3. Journal of Applied Econometrics data archive: http://qed.econ.queensu.ca/jae/ .
`milk`	The milk dataset, introduced by Daudin, Duby and Trecourt (1988), consists of 85 observations about the composition of milk containers. For each container 8 measures are reported: - y1: density; - y2: fat content; - y3: protein content; - y4: casein content; - y5: cheese dry substances measured in the factory; - y6: cheese dry substances measured in the laboratory; - y7: milk dry substance; - y8: cheese produced.
`mussels`	The horse mussels dataset, introduced by Cook and Weisberg (1994), consists of 82 observations on horse mussels from New Zealand. The reported variables are: - shell length, mm; - shell width, mm; - shell height, mm; - shell mass, grams; - muscle mass, grams.
`quality`	The quality dataset, introduced by Atkinson et al. (2004), consists of information on the quality of life in 103 Italian provinces. For each province 6 measures are reported: - y1: average amount of bank deposits per inhabitant; - y2: number of robberies per 100000 inhabitants; - y3: number of housebreakings per 100000 inhabitants; - y4: number of suicides, committed or attempted, per 100000 inhabitants; - y5: number of gyms per 100000 inhabitants; - y6: average expenditure on theatre and concerts per inhabitant.
`recordfg`	The recordfg (National Track Records for Women) data set, introduced by Johnson and Wichern (1997), contains the national records for women from 55 countries for seven races which are: - y1: 100 metres in seconds; - y2: 200 metres in seconds; - y3: 400 metres in seconds; - y4: 800 metres in seconds; - y5: 1500 metres in seconds; - y6: 3000 metres in seconds; - y7: marathon.
`sixty_eighty`	The sixty_eighty dataset, introduced by Atkinson et al. (2004), is a simulated bivariate dataset with two well separated clusters: a very dense cluster of 60 observations and a diffuse cluster of 80 observations.
`swiss_banknotes`	The swiss_banknotes dataset, introduced by Flury and Riedwyl (1988), contains 200 records on Swiss bank notes, printed before the Second World War, 100 of which are genuine and 100 forged. For each bank note 6 size measurements are reported: - length of bank note near the top; - left-hand height of bank note; - right-hand height of bank note; - distance from bottom of bank note to beginning of patterned border; - distance from top of bank note to beginning of patterned border; - diagonal distance.
`three_clust_2outl`	The three_clust_2outl dataset, introduced by Atkinson et al. (2004), is a simulated bivariate dataset, similar to sixty_eighty dataset, but with a third additional cluster (units 141-158) and two outliers (units 159 and 160). The sizes of the groups are therefore 80, 60, 18 and 2.
`USArrest`	This data set contains statistics, in arrests per 100,000 residents, for assault, murder, and rape in each of the 50 US states in 1973. Also given is the percent of the population living in urban areas. The reference is McNeil, D. R. (1977) Interactive Data Analysis. New York: Wiley.
`wine`	The original reference of this dataset is Hettich and Bay (1999). It contains, for each of 178 wines grown in the same region in Italy, the quantities of 13 constituents. There are 3 cultivars: rows 1:60 refer to the first cultivar, rows 61:131 to the second cultivar and rows 132:178 to the third cultivar. The original purpose of the analysis (DeVel, Aeberhard and Coomans, 1993) was to classify wines from different cultivars by means of these measurements. The variables are: - Alcohol = Alcohol; - Malicacid = Malic acid; - Ash = Ash; - Acl = Alcalinity of ash; - Mg = Magnesium; - Phenols = Total phenols; - Flavanoids; - Nonflavanoid.phenols = Nonflavanoid phenols; - Proanth = Proanthocyanins; - Colorint = Color intensity; - Hue = Hue; - OD = OD280/OD315 of diluted wines; - Proline. The plots of the classical squared distances as a function of observation number and their Q-Q plots with respect to the chi-squared distribution do not show any clear outliers. With a robust estimate, at least seven points stand out clearly.
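
The two body-fat equations and the fat-free-weight formula quoted in the `fat` entry can be checked with a few lines of arithmetic. The sketch below is illustrative Python, not toolbox code: the function names are hypothetical, and the density and weight values are made up for the example rather than taken from the dataset.

```python
def brozek_body_fat(density):
    """Percent body fat from Brozek's equation: 457/Density - 414.2."""
    return 457.0 / density - 414.2


def siri_body_fat(density):
    """Percent body fat from Siri's equation: 495/Density - 450."""
    return 495.0 / density - 450.0


def fat_free_weight(weight_lbs, density):
    """Fat Free Weight = (1 - fraction of body fat) * Weight, using Brozek's formula."""
    fraction = brozek_body_fat(density) / 100.0
    return (1.0 - fraction) * weight_lbs


if __name__ == "__main__":
    density = 1.05   # hypothetical density, not a dataset value
    weight = 180.0   # hypothetical weight in lbs, not a dataset value
    print("Brozek:", round(brozek_body_fat(density), 2))
    print("Siri:", round(siri_body_fat(density), 2))
    print("Fat-free weight:", round(fat_free_weight(weight, density), 2))
```

Both equations are decreasing functions of density, so a denser body gives a lower percent body fat estimate under either formula.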

Contingency tables

File	Description of Data Set
`car`	The car dataset is a contingency table reporting the number of adults, out of 1578, who stated that a particular make of vehicle had a specific quality. In all there are 11713 counts. These data are taken from the 2014 Auto Brand Perception survey by Consumer Reports.
`cinema`	The cinema dataset contains a contingency table between age (in classes) and opinion on the movie watched. The age classes are 16-24, 25-34, 35-44, 45-54, 55-64, 65-74, 75+. The reactions to the movie are BAD, AVERAGE, GOOD, VERY GOOD. In all, 1357 people were interviewed.
`clothes`	The clothes data contains a contingency table between the 28 member states of the European Union (data collected well before Brexit) and 5 price segments. These are occurrences of country trade flows, for a wide set of clothes: x1 denotes the lowest price segment and x5 the highest price segment. In all there are 4373 counts.
`clothes33`	The clothes33 data contains a contingency table between the 28 member states of the European Union (data collected well before Brexit) and 33 price segments. These are occurrences of country trade flows, for a wide set of clothes: x1 denotes the lowest price segment and x33 the highest price segment. In all there are 11874 counts.
`csdPerceptions`	The csdPerceptions dataset contains a contingency table displaying the proportion of people in a sample who associate various attributes with different brands of carbonated soft drink (CSD). The contingency table has size 8x11: the rows contain the 8 brands and the columns the 11 attributes. This dataset has been widely used in the correspondence analysis literature.
`mobilephone`	The mobilephone dataset contains a contingency table displaying the number of people in a sample who associate various mobile carriers with various attributes. The contingency table has size 8x11. In the rows there are the 8 brands and in the columns the 19 attributes. This dataset has been widely used in the correspondence analysis literature. The correspondence analysis chart quickly shows that One-tel was seen by consumers as being associated with "Here today, gone tomorrow" (the company was in financial trouble at the time of the study), that the new entrants to the market, AAPT and Virgin, are shown as "Don’t know much about them", and that the market leader, Telstra, skews towards "Good coverage" and "Bureaucratic".
`SportHealth`	The SportHealth dataset contains a contingency table between "Physical Activity Frequency" and self-assessed "Quality of Life Ratings". The number of people interviewed is 303.
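
Each file above stores a two-way table of counts, so the totals quoted in the descriptions (for example, the number of people interviewed in `cinema`) are sums over all cells, and the row profiles used in correspondence analysis are each row divided by its own total. The following Python sketch illustrates both quantities on a small made-up table; the function names and the counts are hypothetical and do not come from any of the datasets listed.

```python
def grand_total(table):
    """Sum of all cell counts in a contingency table given as a list of rows."""
    return sum(sum(row) for row in table)


def row_profiles(table):
    """Divide each row by its row total; an all-zero row is returned as zeros."""
    profiles = []
    for row in table:
        total = sum(row)
        if total == 0:
            profiles.append([0.0 for _ in row])
        else:
            profiles.append([count / total for count in row])
    return profiles


if __name__ == "__main__":
    # Hypothetical 2x3 table of counts (rows: groups, columns: categories).
    counts = [
        [10, 20, 30],
        [5, 5, 10],
    ]
    print("Grand total:", grand_total(counts))
    for profile in row_profiles(counts):
        print([round(p, 3) for p in profile])
```

Every non-empty row profile sums to 1, which makes rows with very different totals directly comparable.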
