Regression Data Sets

File Description of Data Set
algae The algae dataset (from Hettich and Bay, 1999) contains 90 measurements at a river in some place in Europe. There are 11 predictors. The first three are categorical: the season of the year, river size (small, medium and large) and fluid velocity (low, medium and high). The other eight are the concentrations of several chemical substances. The response is the logarithm of the abundance of a certain class of algae. The normal Q-Q plot of the residuals corresponding to the LS estimate gives the impression of short-tailed residuals, while the residuals from a robust fit indicate the existence of at least two outliers, namely observations 36 and 77.
aircraft The aircraft dataset (Rousseeuw and Leroy, 1987, page 154, table 22) contains 23 single-engine aircraft built over the years 1947-1979, from the Office of Naval Research. The dependent variable is cost in units of $100,000 (last column) and the explanatory variables are aspect ratio, lift-to-drag ratio, weight of plane (in pounds) and maximal thrust. Based on MCD without correction factor, 4 observations, including observation 15, are detected as outliers. Based on the corrected MCD, observation 15 is no longer detected as an outlier.
animals The animals dataset contains brain and body weights for 65 species of land animals. References for this dataset are contained in
Balancesheets The Balancesheets dataset contains 6 balance sheet items of 1405 Italian capital companies extracted from the Bureau van Dijk AIDA database. The items are:
- Y = Return On Sales (ROS),
- X1 = Labour share,
- X2 = ratio of tangible fixed assets to added value,
- X3 = ratio of intangible assets to total assets,
- X4 = ratio of industrial equipment to total assets,
- X5 = firm’s interest burden.
credit_card Credit card data, introduced by Riani (2011 forthcoming), are formed by 1,000 observations on the most active customers of an Italian bank. There is one response and nine explanatory variables which are:
- x1: Direct debts to the bank;
- x2: Assigned debts from third parties;
- x3: Amount of shares (in thousands of Euros);
- x4: Amount invested in investment funds (in thousands of Euros);
- x5: Amount of money invested in insurance products from the bank (in thousands of Euros);
- x6: Amount invested in bonds (in thousands of Euros);
- x7: Number of telepasses (Italian electronic toll collection system) of the current account holder;
- x8: Number of persons from the bank dealing with the management of the portfolio of the customer;
- x9: Index of use of point of sale services;
- y: Amount of use of credit, debit and pre-paid card services.
Through the analysis of the generalized candlestick plot, it is possible to identify a subset of significant variables and identify the outliers with reference to these significant variables.
fish   Two websites,  and  present data on the weight of 159 fish caught in a lake near Tampere, Finland. Interest is in the relationship between weight and five measurements of dimensions of the fish. There are 7 species of fish including pike. These behave rather differently from the other six species. The variables are:
- Species
- Weight
- Length from the nose to the beginning of the tail (in cm)
- Length from the nose to the notch of the tail (in cm)
- Length from the nose to the end of the tail (in cm)
- Height
- Width
fishery Data extracted from monthly aggregates (flows) of trade declarations (Riani et al. 2008). The dataset is formed by 677 flows of a fishery product imported in the European Union from a third country in a period of one year. Among the many variables available we provide:
- x: the quantity of the trade flow;
- y: the value of the trade flow.

By regressing the variable "value" against the "quantity" one can see that the dataset is characterized by the presence of a mixture of linear groups, which roughly correspond to the clusters indicated by the subject matter expert. Riani et al. (2008) have shown how the FS can estimate such a mixture, allocate the units to the components of the mixture and identify in the dataset possible outliers, i.e. units that do not belong to any component. The three identified components are consistent with the clusters identified by the subject matter experts. The dataset is one among thousands of similar datasets that have to be analyzed automatically, for which there is no subject matter classification available.
fishery2002 It has the characteristics of the fishery dataset, but it also contains information on the importing EU country (declarant) and on the period (date) in which the transaction took place. It refers to the year 2002.
fishery2003 It has the characteristics of the fishery dataset, but it also contains information on the importing EU country (declarant) and on the period (date) in which the transaction took place. It refers to the year 2003.
fishery2004 It has the characteristics of the fishery dataset, but it also contains information on the importing EU country (declarant) and on the period (date) in which the transaction took place. It refers to the year 2004.
forbes Forbes' data on air pressure in the Alps and the boiling point of water (Weisberg, 1985). There are 17 observations on the boiling point of water at different pressures, obtained from measurements at a variety of elevations in the Alps. The purpose of the experiment was to allow prediction of pressure from boiling point, which is easily measured, and so to provide an estimate of altitude: the higher the altitude, the lower the pressure. The variables are:
- x: boiling point
- y: 100×log(pressure)
The dataset is characterized by one clear outlier.
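The coding of the response can be sketched as follows. This is only an illustration: the base of the logarithm is not stated above and base 10 is assumed here, the function names are hypothetical, and the pressure value is made up rather than taken from the dataset.

```python
import math


def to_response(pressure):
    """Code a pressure reading as y = 100 * log(pressure).

    Base 10 is an assumption; the description above only says "log".
    """
    return 100.0 * math.log10(pressure)


def to_pressure(y):
    """Invert the coding: recover pressure from y = 100 * log10(pressure)."""
    return 10.0 ** (y / 100.0)


# Hypothetical pressure value, used only to show the round trip.
pressure = 25.0
y = to_response(pressure)
print(round(y, 2), round(to_pressure(y), 6))
```

A prediction of y from the boiling point can be turned back into a pressure with the inverse function.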
gasoline The data in Table 1 of Chen, Lockhart and Stephens (2002) are 107 readings, with response the distance driven and explanatory variable the amount of gasoline consumed. The variables are:
- x: liters of gasoline used
- y: distance driven in kilometers
The fanplot provides a very clear indication that the gasoline data have no transformation potential.
hawkins Hawkins' data simulated to baffle data analysts (by Hawkins). There are 128 observations and eight explanatory variables. The scatter plot matrix of the data does not reveal an interpretable structure; there seems to be no relationship between y and seven of the eight explanatory variables, the exception being X8. However, with the FS it is possible to find four groups in the data.
hospitalFS Hospital data (Neter et al., 1996). Data on the logged survival time of 108 patients undergoing liver surgery, together with four potential explanatory variables. The data consist of 54 observations plus a further 54 observations, introduced to check the model fitted to the first 54. Their comparison suggests there is no systematic difference between the two sets. However, by looking at some FS plots (Riani and Atkinson, 2007), we conclude that these two groups are significantly different.
illness07 Kleinbaum and Kupper (1978, p.148) describe observational data on the assessment of mental illness of 53 patients. A psychiatrist assigns values for mental retardation and degree of distrust of doctors in newly hospitalized patients (explanatory variables). After six months of treatment, a value is assigned for the degree of illness of each patient (response). Atkinson, Riani and Corbellini (2020) explore the Box-Cox transformation of degree of illness with regression on the two initial assessments. The data support the log transformation. There is significant regression on both variables, with a t value of 2.88 for the relationship with the initial assessment of retardation and -2.21 for distrust of doctors. The QQ-plots of residuals show an appreciable improvement in normality after transformation.
JohnDraper John and Draper (1980) present data on the subjective assessment of the thickness of pipe. Five inspectors assessed wall thickness at four different locations on the pipe. The experiment was repeated three times. The sixty responses are a multiple of the difference between the inspector's assessment and the 'true' value determined by an ultrasonic reader. If both readings were available, the Box-Cox transformation could be applied to all 120 readings and the difference analysed in the transformed scale. But the ultrasonic readings are no longer known.
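For readers unfamiliar with the Box-Cox family mentioned in several entries, the following is a minimal sketch of the basic (unnormalized) transformation, in which the log transformation is the case lambda = 0. The function name and the response values are hypothetical; this is not the toolbox's implementation, which may use a normalized form.

```python
import math


def box_cox(y, lam):
    """Basic Box-Cox transformation of a strictly positive response.

    lam = 1 leaves the data untransformed (up to a shift),
    lam = 0.5 is the square-root scale, lam = 0 is the log scale.
    """
    if y <= 0:
        raise ValueError("Box-Cox requires a strictly positive response")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1.0) / lam


# Hypothetical response values, not taken from any dataset in this list.
responses = [1.0, 4.0, 9.0]
for lam in (1.0, 0.5, 0.0):
    print(lam, [round(box_cox(y, lam), 4) for y in responses])
```

As lambda approaches 0, the power form converges to the logarithm, which is why the log is treated as a member of the same family.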
krafft The krafft dataset (Jalali-Heravi and Knouz, 2002) contains the measures for 32 chemical compounds of a physical property called the Krafft point, together with several molecular descriptors, in order to find a predictive equation. The dataset is characterised by two linear structures. The points in the smaller group correspond to compounds called sulfonates. The LS estimate fits neither of the two sub-groups.
loyalty Loyalty cards data. They contain 509 observations on the behavior of customers with loyalty cards from a supermarket chain in Northern Italy. The response (y) is the amount, in Euros, spent at the shop over six months and the explanatory variables are:
- x1, the number of visits to the supermarket in the six month period;
- x2, the age of the customer;
- x3, the number of members of the customer's family.
By transforming the data it is possible to see in the FS plots a group of customers characterized by a different purchasing behaviour (Atkinson and Riani, 2006).
Marketing_Data An advertising experiment relating social media budget to sales (in thousands of dollars), with 200 experiments. The variables are:
- x1, youtube
- x2, facebook
- x3, newspaper
- y, sales
The source of the data is the Kaggle dataset "Marketing Linear Multiple Regression".
The dataset has been used in Riani, Atkinson and Corbellini (2022).
mineral The mineral dataset (Smith, Campbell and Lichfield, 1984) contains the measurement of the contents (in parts per million) of 2 chemical elements (zinc and copper) in 53 samples of rocks in Western Australia. Observation 15 stands out as clearly atypical, having a very large abscissa and too high an ordinate. The LS and L1 fits are seen to be influenced more by this observation than by the rest. By contrast, the LS fit omitting observation 15 gives a good fit to the rest of the data. Neither the Q-Q plot nor the plot of residuals vs. fitted values for the LS estimate reveals the existence of an outlier, as would be indicated by an exceptionally large residual. However, the plot of residuals vs. fitted values shows an approximate linear relationship between residuals and fitted values (excepting observation 15) and this indicates that the fit is not correct.
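The way a single point with a very large abscissa can dominate an LS fit may be illustrated with a small sketch. The numbers below are synthetic and chosen only for illustration; they are not the mineral data, and the helper name ols_line is hypothetical.

```python
def ols_line(xs, ys):
    """Return (intercept, slope) of the least squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Synthetic bulk of the data: close to the line y = 1 + 0.5 x.
xs = [float(x) for x in range(1, 11)]
ys = [1.0 + 0.5 * x + (0.1 if int(x) % 2 else -0.1) for x in xs]

# One synthetic high-leverage point: large x, and y far above the line.
outlier = (40.0, 60.0)

intercept_clean, slope_clean = ols_line(xs, ys)
intercept_all, slope_all = ols_line(xs + [outlier[0]], ys + [outlier[1]])

print("slope without the atypical point:", round(slope_clean, 3))
print("slope with the atypical point:   ", round(slope_all, 3))
```

In this synthetic case the slope fitted to all points is pulled far from the slope of the bulk of the data, so the atypical point itself need not show an exceptionally large residual.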
ms212 Students in an introductory statistics class at the University of Queensland participated in a simple experiment. They took their own pulse rate. They were then randomized to run in place for one minute or to sit for that minute. Then everyone measured their pulse rate again. There are 109 complete observations, nine explanatory variables covering physiological and lifestyle data and the two pulse rates. One research question was: how does the difference in pulse rate before and after the minute depend on lifestyle and physiological measurements? It is expected to depend heavily on whether the students ran or not. The data, posted by John Eccleston and Richard Wilson, are available with a more complete description at
multiple_regression Multiple regression data showing the effect of masking (Atkinson and Riani, 2000). There are 60 observations on a response y with the values of three explanatory variables. The scatter plot matrix of the data shows y increasing with each of x1, x2 and x3. The plot of residuals against fitted values shows no obvious pattern. However the FS finds that there are 6 masked outliers.
oats The data set oats (Scheffe, 1959, p. 138) lists the yield of grain for eight varieties of oats in five replications of a randomized-block experiment. Fitting by LS yields residuals with no noticeable structure and the usual F-tests for row and column effects have highly significant p-values of 0.00002 and 0.001, respectively. To show the effect of outliers on the classical procedure, five data values have been modified (see oats_mod dataset).
oats_mod A modification of the oats dataset, made to show the effect of outliers on the classical procedure. As for the oats data, the normal Q-Q plot of t(i) shows nothing suspicious. But the p-values of the F-tests are quite high. The diagnostics have thus failed to point out a departure from the model.
ozone Ozone data: ozone concentration at Upland (CA, USA) as a function of eight meteorological variables (Breiman and Friedman, 1985). Data come from the first 80 observations on a series of measurements of ozone concentration and meteorological variables in California, starting from the beginning of 1976. The variables are:
- x1: Sandburg air force base temperature;
- x2: inversion base height (feet);
- x3: Daggett pressure gradient (mm Hg);
- x4: visibility (miles);
- x5: Vandenburg 500 millibar height (m);
- x6: humidity (percent);
- x7: inversion base temperature;
- x8: wind speed (mph);
- y: Upland ozone concentration.
Through the analysis of the fan plot it is possible to see how the data need appropriate transformation before being analyzed.
ozone_330_obs This dataset is a superset of the 80 ozone observations described above. Through the analysis of the fan plot, the minimum deletion residual plot and the generalised candlestick plot, it is possible to see that the data should be transformed to the logarithmic scale, that only two out of eight variables are significant and that four outliers can be detected.
P12119085 P12119085 dataset (Rousseeuw et al. 2018) contains monthly trade volumes of imports of plants (used primarily in perfumery, pharmacy or for insecticidal, fungicidal or similar purposes) from Kenya into the UK in a four-year period. A downward level shift is evident in position 27-28.
P17049075 P17049075 dataset (Rousseeuw et al. 2018) contains monthly trade volumes of imports of sugar from Ukraine into Lithuania in a four-year period. A downward level shift is evident in position 35.
poison Poison data (by Box and Cox, 1964) are about the time to death of animals in a 3 × 4 factorial experiment with four observations at each factor combination. There are no outliers or influential observations that cannot be reconciled with the greater part of the data by a suitable transformation.
rats The rats dataset (Bond, 1979) corresponds to an experiment on the speed of learning of rats. Times were recorded for a rat to go through a shuttlebox in successive attempts. If the time exceeded 5 seconds, the rat received an electric shock for the duration of the next attempt. The data are the number of shocks received and the average time for all attempts between shocks. The relationship between the variables is roughly linear except for observations 1, 2, and 4. The LS line does not fit the bulk of the data, being a compromise between those three points and the rest.
salinity Measurements on water in Pamlico Sound, North Carolina. The data are taken from Ruppert and Carroll (1980). There are 28 observations on the salinity of water in the spring in Pamlico Sound, North Carolina. Analysis of the data was originally undertaken as part of a project for forecasting the shrimp harvest. The response is the biweekly average of salinity. There are three explanatory variables: the salinity in the previous two-week time period, a dummy variable for the time period during March and April and the river discharge. Thus the variables are:
- x1: salinity lagged two weeks
- x2: trend, a dummy variable for the time period
- x3: water flow, that is, river discharge
- y: biweekly average salinity.
The data seem to include one outlier. This could either be omitted, or changed to agree with the rest of the data. We make this change and use the forward search to show that the “corrected” observation is not in any way outlying or influential.
stack_loss Brownlee's stack loss data on the oxidation of ammonia (Brownlee, 1965). There are observations from 21 days of operation of a plant for the oxidation of ammonia as a stage in the production of nitric acid. The variables are:
- x1: air flow
- x2: cooling water inlet temperature
- x3: 10 × (acid concentration -50)
- y: stack loss; 10 times the percentage of ingoing ammonia escaping unconverted up a stack, or chimney.
The air flow (x1) measures the rate of operation of the plant. The nitric oxides produced are absorbed in a countercurrent absorption tower; x2 is the inlet temperature of cooling water circulating through coils in this tower and x3 is proportional to the concentration of acid in the tower. Small values of the response correspond to efficient absorption of the nitric oxides. Standard statistical techniques identify some observations as outliers. However, through FS plots of t-statistics, R² and leverage it is possible to identify a more complex structure in the dataset.
stars The stars dataset, introduced by Humphreys (1978), consists of observations on the light intensity (y2) and the surface temperature measured in Kelvin (y1) of 47 stars of the CYG OB1 cluster, which can be observed in the direction of the constellation Cygnus. Both variables are on the logarithmic scale (base 10). The temperature is the explanatory variable. The dataset is interesting from the statistical point of view because it contains four strong outliers (observations 11, 20, 30, and 34), corresponding to giant stars, which affect the OLS regression parameters. Robustness aspects linked to this dataset are discussed in Rousseeuw and Leroy (1987, p. 27).
TableF61_Greene The dataset (Greene 2012, chapter 9) contains cost data for U.S. airlines: 90 observations on 6 firms for 15 years, 1970-1984. These data are a subset of a larger data set provided by Professor Moshe Kim. They were originally constructed by Christensen Associates of Madison, Wisconsin. The variables are: I = Airline, T = Year, Q = Output in revenue passenger miles, index number, C = Total cost in $1000, PF = Fuel price, LF = Load factor, the average capacity utilization of the fleet.
TableF91_Greene The dataset (Greene 2003, chapter 11) gives monthly credit card expenditure for 100 individuals, sampled from a larger sample of 13,444 people.
toxicity The toxicity dataset (Maguna, Nunez, Okulik and Castro, 2003) contains the measurement of the aquatic toxicity of 38 carboxylic acids, together with nine molecular descriptors, in order to find a predicting equation for y = log(toxicity). The plot of the residuals vs. fit and the normal Q-Q plot for the LS estimate show no outliers. On the other hand, the 85% normal efficiency MM-estimate shows 10 outliers, in particular observations 13, 23, 32, 34, 35, 36.
tradeH The tradeH dataset contains data on coalfish traded from Maldives to Italy. The two variables are the value (dependent variable) and quantity (independent variable) exchanged. The data are characterized by strong heteroscedasticity.
wool Number of cycles to failure of samples of worsted yarn in a 3³ experiment (Box and Cox, 1964). The wool data give the number of cycles to failure of a worsted yarn under cycles of repeated loading. The results are from a single 3³ factorial experiment. The three factors and their levels are:
- x1: length of test specimen (25, 30, 35 cm)
- x2: amplitude of loading cycle (8, 9, 10 mm)
- x3: load (40, 45, 50 g).
The number of cycles to failure ranges from 90, for the shortest specimen subject to the most severe conditions, to 3,636 for observation 19, which comes from the longest specimen subjected to the mildest conditions. In their analysis Box and Cox (1964) recommend that the data be fitted after the log transformation of y. The FS plots explain the effect of the ordering of the data during the FS on the estimates of regression coefficients and the error variance and on a score statistic for transformation of the response.
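The 27 runs of the 3³ design can be enumerated as in the sketch below. The variable names are hypothetical, and the run order (first factor varying slowest, last factor fastest) is an assumption rather than a documented property of the data file; under that assumption the 19th run is the longest specimen under the mildest conditions, which agrees with the description above.

```python
from itertools import product

# Factor levels as listed above.
lengths_cm = (25, 30, 35)
amplitudes_mm = (8, 9, 10)
loads_g = (40, 45, 50)

# Full factorial design: every combination of the three factors.
# Assumed ordering: length varies slowest, load varies fastest.
design = list(product(lengths_cm, amplitudes_mm, loads_g))

print(len(design))   # number of runs in the 3^3 design
print(design[18])    # 19th run (1-based) under the assumed ordering
```

The shortest specimen under the most severe conditions corresponds to the combination (25, 10, 50) in this enumeration.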