Regression Data Sets

File Description of Data Set
affairs Infidelity data, known as Fair’s Affairs. Cross-section data from a survey conducted by Psychology Today in 1969.
The variables are
gender: Factor indicating gender.
age: Numeric variable coding age in years:
17.5 = under 20, 22 = 20–24, 27 = 25–29, 32 = 30–34, 37 = 35–39, 42 = 40–44, 47 = 45–49, 52 = 50–54, 57 = 55 or over.
yearsmarried: Numeric variable coding number of years married: 0.125 = 3 months or less, 0.417 = 4–6 months, 0.75 = 6 months–1 year, 1.5 = 1–2 years, 4 = 3–5 years, 7 = 6–8 years, 10 = 9–11 years, 15 = 12 or more years.
children: Factor. Are there children in the marriage?
religiousness: Numeric variable coding religiousness: 1 = anti, 2 = not at all, 3 = slightly, 4 = somewhat, 5 = very.
education: Numeric variable coding level of education: 9 = grade school, 12 = high school graduate, 14 = some college, 16 = college graduate, 17 = some graduate work, 18 = master’s degree, 20 = Ph.D., M.D., or other advanced degree.
occupation: Numeric variable coding occupation according to Hollingshead classification (reverse numbering).
affairs (y): Numeric variable. How often engaged in extramarital sexual intercourse during the past year? 0 = none, 1 = once, 2 = twice, 3 = 3 times, 7 = 4–10 times, 12 = monthly, 12 = weekly, 12 = daily.
In the example of Kleiber and Zeileis (2008, p. 142), the number of times a person engaged in extramarital sexual intercourse ("affairs") in the past year is regressed on the person's age, number of years married, religiousness, occupation, and own rating of the marriage. The dependent variable is left-censored at zero and not right-censored. Hence this is a standard Tobit model, which can be estimated by the functions regressCens and FSRedaCens.
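To make "left-censored at zero" concrete, the following is a minimal Python sketch of the log-likelihood of a standard Tobit model. It is an illustration under stated assumptions, not the implementation used by regressCens or FSRedaCens: the function name tobit_loglik and the toy data are hypothetical, and errors are assumed normal with constant standard deviation sigma.

```python
import math

def norm_logpdf(z):
    """Log density of the standard normal distribution at z."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)

def norm_cdf(z):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def tobit_loglik(y, X, beta, sigma, left=0.0):
    """Log-likelihood of a Tobit model left-censored at `left`.

    y: list of responses; X: list of regressor rows; beta: coefficients.
    (Hypothetical helper, written only for illustration.)
    """
    total = 0.0
    for yi, xi in zip(y, X):
        mu = sum(b * x for b, x in zip(beta, xi))
        if yi <= left:
            # Censored observation: probability that the latent response <= left
            total += math.log(norm_cdf((left - mu) / sigma))
        else:
            # Uncensored observation: normal density of the scaled residual
            total += norm_logpdf((yi - mu) / sigma) - math.log(sigma)
    return total

# Toy data (hypothetical): an intercept and one regressor
y = [0.0, 0.0, 1.0, 3.0]
X = [[1.0, 0.5], [1.0, 1.0], [1.0, 2.0], [1.0, 3.5]]
ll = tobit_loglik(y, X, beta=[-1.0, 1.0], sigma=1.0)
```

Maximizing this function over beta and sigma gives the Tobit estimates; censored units contribute a probability rather than a density, which is why ordinary least squares is not appropriate here.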
air_pollution This dataset on air pollution and mortality in 60 metropolitan statistical areas in the United States accompanies the book by Ramsey, F. L. and Schafer, D. W. (2013). The Statistical Sleuth: A Course in Methods of Data Analysis (3rd ed.). Cengage Learning.
The response variable is the total age-adjusted mortality from all causes, counted as deaths per 100,000 population ('Mortality'). The explanatory variables are listed below:
Precip: Mean annual precipitation (in inches)
Humidity: Percent relative humidity (annual average at 1:00pm)
JanTemp: Mean January temperature (degrees Fahrenheit)
JulyTemp: Mean July temperature (degrees Fahrenheit)
Over65: Percentage of the population aged 65 years or over
House: Population per household
Educ: Median number of school years completed for persons 25 years or older
Sound: Percentage of housing that is sound with all facilities
Density: Population density (in persons per square mile of urbanized area)
NonWhite: Percentage of population that is nonwhite
WhiteCol: Percentage of employment in white collar occupations
Poor: Percentage of households with annual income under 3,000 USD in 1960
HC: Relative pollution potential of hydrocarbons
NOX: Relative pollution potential of oxides of nitrogen
SO2: Relative pollution potential of sulfur dioxide
The explanatory variables include four climate variables, eight demographic variables and three pollution-related variables. "Relative pollution potential" is the product of the tons emitted per day per square kilometer and a factor correcting for the area dimension and exposure. The three pollution variables are skewed.
algae The algae dataset (from Hettich and Bay, 1999) contains 90 measurements from a river somewhere in Europe. There are 11 predictors. The first three are categorical: the season of the year, river size (small, medium and large) and fluid velocity (low, medium and high). The other eight are the concentrations of several chemical substances. The response is the logarithm of the abundance of a certain class of algae. The normal Q-Q plot of the residuals corresponding to the LS estimate gives the impression of short-tailed residuals, while the residuals from a robust fit indicate the existence of at least two outliers, namely observations 36 and 77.
aircraft The aircraft dataset (Rousseeuw and Leroy, 1987, page 154, table 22) contains 23 single-engine aircraft built over the years 1947-1979, from the Office of Naval Research. The dependent variable is cost in units of $100,000 (last column) and the explanatory variables are aspect ratio, lift-to-drag ratio, weight of plane (in pounds) and maximal thrust. Based on the MCD without correction factor, 4 observations, including observation 15, are detected as outliers. Based on the corrected MCD, observation 15 is no longer detected as an outlier.
animals The animals dataset contains brain and body weights for 65 species of land animals. References for this dataset are given at https://vincentarelbundock.github.io/Rdatasets/doc/robustbase/Animals2.html
autompg The auto mpg dataset has 398 rows and 9 columns and provides mileage, horsepower, model year, and other technical specifications for cars. The number of rows without missing values is 392. The website from which the data were downloaded, https://code.datasciencedojo.com/datasciencedojo/datasets/tree/master/Auto MPG, states that "this dataset is recommended for learning and practicing your skills in exploratory data analysis, data visualization, and regression modelling techniques". The variables are:
cylinders (X1): number of cylinders in the engine;
displacement (X2): engine displacement (in cubic inches);
horsepower (X3): engine horsepower;
weight (X4): vehicle weight (in pounds);
acceleration (X5): time to accelerate from 0 to 60 mph (in seconds);
modelYear (X6): model year;
origin (X7): origin of car (1: American, 2: European, 3: Japanese);
mpg (y): fuel efficiency measured in miles per gallon (mpg).
The goal is to predict mpg (the fuel-efficiency of a car).
A different version of this dataset, called carbig, is also contained in the Statistics and Machine Learning Toolbox. The number of observations in carbig is 406 because it contains observations with missing values. Note that load carbig loads each variable into the workspace separately, and origin is a char array. In contrast, load autompg loads the dataset in table format (as usual for all the FSDA datasets). Because the original variable model had repeated entries, we added the row number in front of the char representing the model so that this variable could be used as RowNames in the table.
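The row-name construction described above can be sketched in Python as follows. This is only an illustration of the idea: the separator (a single space), the function name unique_row_names and the example model names are assumptions, since the exact format used in the FSDA table is not stated here.

```python
def unique_row_names(models):
    """Prefix each model name with its 1-based row number so names are unique."""
    return [f"{i} {name}" for i, name in enumerate(models, start=1)]

# Hypothetical entries with a repeated model name
models = ["ford pinto", "toyota corolla", "ford pinto"]
row_names = unique_row_names(models)
```

Row names must be unique in a table, so the row number is enough to disambiguate repeated models while keeping the original label readable.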
Balancesheets The Balancesheets dataset contains 6 balance sheet items for 1405 Italian capital companies extracted from the Bureau van Dijk AIDA database. The items are:
- X1: labour share;
- X2: ratio of tangible fixed assets to added value;
- X3: ratio of intangible assets to total assets;
- X4: ratio of industrial equipment to total assets;
- X5: firm’s interest burden;
- y: Return On Sales (ROS).
bank_data There are 1,949 univariate observations on the amount of money made from individual personal banking customers over a year for an Italian bank. Because of the linking of products, it is not straightforward for the bank to attribute the profit to individual sources. The bank made a preliminary classification of its 700 products into 48 macrocategories (macroservices). Among these 48 macrocategories, the 13 most important ones according to the bank are listed below and form our set of explanatory variables. All explanatory variables are discrete, taking values 0, 1, 2, ..., that is, the number of services (inside each macroservice) that each customer has signed up for: number of credit cards, number of domestic direct debits, number of current accounts and so forth. The variables are:
- X1 = Personal loans;
- X2 = Financing and hire-purchase;
- X3 = Mortgages;
- X4 = Life insurance;
- X5 = Share account;
- X6 = Bond account;
- X7 = Current account;
- X8 = Salary deposits;
- X9 = Debit cards;
- X10 = Credit cards;
- X11 = Telephone banking;
- X12 = Domestic direct debits;
- X13 = Money transfers;
- y = Profit/loss.
bank_proft The data are the annual profit from 1903 customers, all of whom were selected by the bank as the target for a specific campaign. The data are available in the FSDA toolbox under the title BankProfit. The nine explanatory variables are either amounts at a particular time point, or totals over the year. Together with the response they are:
- X1 : number of products bought by the customers;
- X2 : current account balance plus holding of bonds issued by the bank;
- X3 : holding of investments for which the bank acted as an agent;
- X4 : amount in deposit and savings accounts with the bank;
- X5 : number of activities in all accounts;
- X6 : total value of all transactions;
- X7 : total value of debit card spending (recorded with a negative sign);
- X8 : number of credit and debit cards;
- X9 : total value of credit card spending;
- y : annual profit or loss per customer.
The matrix R containing prior information is given in file bank_proftR.txt.
Additional information about these data is given in Atkinson, Corbellini and Riani (2016), TEST, DOI 10.1007/s11749-017-0542-6.
cement Heat evolved in setting of cement, as a function of its chemical composition.

13 observations on the following 5 variables.

x1: percentage weight in clinkers of 3CaO.Al2O3
x2: percentage weight in clinkers of 3CaO.SiO2
x3: percentage weight in clinkers of 4CaO.Al2O3.Fe2O3
x4: percentage weight in clinkers of 2CaO.SiO2
y: heat evolved (calories/gram)
The source is Woods, H., Steinour, H. H. and Starke, H. R. (1932) Effect of composition of Portland cement on heat evolved during hardening. Industrial Engineering and Chemistry, 24, 1207–1214.
ConsLoyaltyRet Consumer loyalty in retail. The data contain the results of a consumer loyalty questionnaire based on 1711 units and come from the website https://data.world/cesarpolo/consumer-loyalty-in-retail.
The response is Loyalty. The variables are:
- CustomerID: ID of the customer. The records are ordered according to NegativePublicity;
- X1 : Price;
- X2 : Quality;
- X3 : CommunityOutreach;
- X4 : Trust;
- X5 : CustomerSatisfaction;
- X6 : NegativePublicity;
- y : Loyalty (loyalty of the customer);
All variables are taken from the subjective responses to the questionnaire. Consequently, the variables may be highly nonlinear and may benefit from transformations to produce a simple regression model.
credit_card Credit card data, introduced by Riani and Atkinson (2010), consist of 1,000 observations on the most active customers of an Italian bank. There is one response and nine explanatory variables, which are:
- x1: Direct debts to the bank;
- x2: Assigned debts from third parties;
- x3: Amount of shares (in thousands of Euros);
- x4: Amount invested in investment funds (in thousands of Euros);
- x5: Amount of money invested in insurance products from the bank (in thousands of Euros);
- x6: Amount invested in bonds (in thousands of Euros);
- x7: Number of telepasses (Italian electronic toll collection system) of the current account holder;
- x8: Number of persons from the bank dealing with the management of the portfolio of the customer;
- x9: Index of use of point of sale services;
- y: Amount of use of credit, debit and pre-paid card services.
Through the analysis of the generalized candlestick plot, it is possible to identify a subset of significant variables and identify the outliers with reference to these significant variables.
D1 Dataset introduced in Atkinson et al. (2024). There are 3 explanatory variables and a response. Sample size is 200. The question is whether these data need transformation of the response.
D2 Dataset introduced in Atkinson et al. (2024). There are 3 explanatory variables and a response. Sample size is 200. The question is whether these data need transformation of the response.
D3 Dataset introduced in Atkinson et al. (2024). There are 3 explanatory variables and a response. Sample size is 200. The question is whether these data need transformation of the response.
Esselunga Marketing data derived from loyalty card information for four branches of the Esselunga supermarket chain in the city of Parma, Italy. There are 493 observations (customer records) on expenditure in a series of categories depending both on the goods being purchased and on the pricing level (line). There are two lines provided by the supermarket: a bargain-priced line, often for essential products and not labelled with the supermarket’s name, and the standard labelled supermarket brand. There are also goods from established brands, for example Barilla, Fanta and Ferrero, including Ferrero Rocher and Kinder. The response is total expenditure on standard non-food supermarket lines and there are ten explanatory variables:
- X1: Total expenditure, branded lines;
- X2: Household products, branded lines;
- X3: Meat, branded lines;
- X4: Expenditure on special promotions of standard supermarket lines;
- X5: Meat, standard supermarket line;
- X6: Personal care, standard supermarket line;
- X7: Groceries, standard supermarket line;
- X8: Groceries, bargain prices;
- X9: Total expenditure, bargain prices;
- X10: Frequency of purchases, branded lines.
fat Body measurements to predict percentage of body fat in males. A data set containing 18 physical measurements of 252 males. Most of the variables can be measured with a scale or tape measure. Can they be used to predict the percentage of body fat? If so, this offers an easy alternative to an underwater weighing technique.
The variables are
- body_fat: Percent body fat using Brozek's equation, 457/Density - 414.2
- body_fat_siri: Percent body fat using Siri's equation, 495/Density - 450
- density: Density (gm/cm^3)
- age: Age (yrs)
- weight: Weight (lbs)
- height: Height (inches)
- BMI: Adiposity index = Weight/Height^2 (kg/m^2)
- ffweight: Fat Free Weight = (1 - fraction of body fat) * Weight, using Brozek's formula (lbs)
- neck: Neck circumference (cm)
- chest: Chest circumference (cm)
- abdomen: Abdomen circumference (cm) "at the umbilicus and level with the iliac crest"
- hip: Hip circumference (cm)
- thigh: Thigh circumference (cm)
- knee: Knee circumference (cm)
- ankle: Ankle circumference (cm)
- bicep: Extended biceps circumference (cm)
- forearm: Forearm circumference (cm)
- wrist: Wrist circumference (cm) "distal to the styloid processes"
This data set comes from the collection of the Journal of Statistics Education at http://jse.amstat.org/datasets/fat.txt
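The two body fat equations and the fat free weight formula listed above can be written out as a short Python sketch. The function names and the example density and weight are hypothetical; the formulas are exactly those given in the variable descriptions.

```python
def brozek_body_fat(density):
    """Percent body fat from body density, Brozek's equation: 457/Density - 414.2."""
    return 457.0 / density - 414.2

def siri_body_fat(density):
    """Percent body fat from body density, Siri's equation: 495/Density - 450."""
    return 495.0 / density - 450.0

def fat_free_weight(weight_lbs, body_fat_percent):
    """Fat free weight in lbs: (1 - fraction of body fat) * Weight."""
    return (1.0 - body_fat_percent / 100.0) * weight_lbs

# Hypothetical subject: density 1.05 and weight 180 lbs
density = 1.05
bf_brozek = brozek_body_fat(density)
bf_siri = siri_body_fat(density)
ffw = fat_free_weight(180.0, bf_brozek)
```

Both equations are decreasing functions of density, so body_fat and body_fat_siri are deterministic transformations of the density variable and should not be used together with it as explanatory variables.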
fish Two websites, https://www.kaggle.com/aungpyaeap/fish-market and http://jse.amstat.org/datasets/fishcatch.txt, present data on the weight of 159 fish caught in a lake near Tampere, Finland. Interest is in the relationship between weight and five measurements of dimensions of the fish. There are 7 species of fish including pike. These behave rather differently from the other six species. The variables are:
- Species
- Weight
- Length from the nose to the beginning of the tail (in cm)
- Length from the nose to the notch of the tail (in cm)
- Length from the nose to the end of the tail (in cm)
- Height
- Width
fishery Data extracted from monthly aggregates (flows) of trade declarations (Riani et al. 2008). The dataset is formed by 677 flows of a fishery product imported in the European Union from a third country in a period of one year. Among the many variables available we provide:
 
- x: the quantity of the trade flow;
- y: the value of the trade flow.

By regressing the variable "value" against the "quantity" one can see that the dataset is characterized by the presence of a mixture of linear groups, which roughly correspond to the clusters indicated by the subject matter expert. Riani et al. (2008) have shown how the FS can estimate such a mixture, allocate the units to the components of the mixture and identify in the dataset possible outliers, i.e. units that do not belong to any component. The three identified components are consistent with the clusters identified by the subject matter experts. The dataset is one among thousands of similar datasets that have to be analyzed automatically, for which there is no subject matter classification available.
fishery2002 It has the characteristics of dataset fishery but it contains information on the importing EU country (declarant) and on the period (date) in which the transaction took place. It refers to the year 2002.
fishery2003 It has the characteristics of dataset fishery but it contains information on the importing EU country (declarant) and on the period (date) in which the transaction took place. It refers to the year 2003.
fishery2004 It has the characteristics of dataset fishery but it contains information on the importing EU country (declarant) and on the period (date) in which the transaction took place. It refers to the year 2004.
forbes Forbes' data on air pressure in the Alps and the boiling point of water (Weisberg, 1985). There are 17 observations on the boiling point of water at different pressures, obtained from measurements at a variety of elevations in the Alps. The purpose of the experiment was to allow prediction of pressure from boiling point, which is easily measured, and so to provide an estimate of altitude: the higher the altitude, the lower the pressure. The variables are:
- x: boiling point
- y: 100×log(pressure)
The dataset is characterized by one clear outlier.
gasoline The data in Table 1 of Chen, Lockhart and Stephens (2002) are 107 readings with the distance driven as response and the amount of gasoline consumed as explanatory variable. The variables are:
- x: liters of gasoline used
- y: distance driven in kilometers
The fan plot provides a very clear indication that the gasoline data have no transformation potential.
hawkins Hawkins' data simulated to baffle data analysts (by Hawkins). There are 128 observations and eight explanatory variables. The scatter plot matrix of the data does not reveal an interpretable structure; there seems to be no relationship between y and seven of the eight explanatory variables, the exception being X8. However with the FS it is possible to find four groups in the data.
hospitalFS Hospital data (Neter et al., 1996). Data on the logged survival time of 108 patients undergoing liver surgery, together with four potential explanatory variables. The data consist of 54 observations plus a further 54 observations, introduced to check the model fitted to the first 54. Comparing the two sets suggests that there is no systematic difference between them. However, by looking at some FS plots (Riani and Atkinson, 2007), we conclude that these two groups are significantly different.
illness07 Kleinbaum and Kupper (1978, p. 148) describe observational data on the assessment of mental illness of 53 patients. A psychiatrist assigns values for mental retardation and degree of distrust of doctors in newly hospitalized patients (explanatory variables). After six months of treatment, a value is assigned for the degree of illness of each patient (response). Atkinson, Riani and Corbellini (2020) explore the Box-Cox transformation of degree of illness with regression on the two initial assessments. The data support the log transformation. There is significant regression on both variables with a t value of 2.88 for the relationship with the initial assessment of retardation and -2.21 for distrust of doctors. The QQ-plots of residuals show an appreciable improvement in normality after transformation.
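Several datasets in this table are used to study transformation of the response. As a reminder, the Box-Cox family can be sketched in Python as below. This is a minimal illustration and not FSDA code: the function names are hypothetical, and the normalized version (scaled by the geometric mean so that fits for different values of the parameter are comparable) is the usual textbook form, which is assumed here.

```python
import math

def box_cox(y, lam):
    """Box-Cox power transformation of strictly positive values y."""
    if any(v <= 0 for v in y):
        raise ValueError("Box-Cox requires strictly positive responses")
    if abs(lam) < 1e-12:
        # The limit of (y^lam - 1)/lam as lam -> 0 is the logarithm
        return [math.log(v) for v in y]
    return [(v ** lam - 1.0) / lam for v in y]

def normalized_box_cox(y, lam):
    """Box-Cox transformation scaled by the geometric mean of y."""
    if any(v <= 0 for v in y):
        raise ValueError("Box-Cox requires strictly positive responses")
    gm = math.exp(sum(math.log(v) for v in y) / len(y))
    if abs(lam) < 1e-12:
        return [gm * math.log(v) for v in y]
    return [(v ** lam - 1.0) / (lam * gm ** (lam - 1.0)) for v in y]

# lam = 0 is the log transformation, lam = 1 leaves the data untransformed
# apart from a shift, lam = 0.5 is the square root, lam = -1 the reciprocal.
z_log = box_cox([1.0, 10.0, 100.0], 0.0)
```

Saying that "the data support the log transformation" corresponds to the value 0 of the transformation parameter being compatible with the data.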
Income1 Income1 dataset from the Census Bureau. Atkinson et al. (2023) describe income data taken from the United States Census Bureau, Annual Social and Economic Supplements (2021), https://www2.census.gov/programs-surveys/cps/datasets/2021/march/asecpub21csv.zip. The data are a random sample of 200 observations on the following variables:
- AR = Number of persons in household
- HOTHVAL = All other types of income except HEARNVAL Recode - Total other household income
- HSSVAL = household income - social security
- HTOTVAL = total household income (dollar amount).
The goal is to predict HTOTVAL.
Income2 Income from a municipality. Atkinson et al. (2023) describe income data. They are a sample of 200 observations of full time employees from a municipality in Northern Italy who have declared extra income from investment sources. The variables are:
- Age = Age of the person (the minimum is 19 and the maximum is 73).
- Educational-num = Number of years of education (the minimum value is 5, primary school, and the maximum value is 16, bachelor degree)
- Gender = 1 is Male and 0 is Female
- ExtraGain = Income from investment sources (profit-losses) apart from wages/salary
- Hours = total number of declared hours worked during the week. The minimum value is 35 and the maximum is 99.
- Income = total yearly income (Euro amount).
The goal is to predict the income level based on the individual’s personal information.
inttrade International trade data: 180 import flows. Two regression lines (two traded prices) are evident. First column is Quantity and second column is Value.
inttrade1 International trade data 1: 3867 import flows of product with TARIC code 4801000000 from Switzerland into Italy (POD_4801000000_CH_IT). First column is Quantity and second column is Value.
inttrade2 International trade data 2: 1302 import flows of product with TARIC code 0307491800. First column is Quantity and second column is Value.
inttrade3 International trade data 3: 389 import flows of product with TARIC code 0307591000 from Serbia into Italy (POD_0307591000_SN_IT). First column is Quantity and second column is Value.
JohnDraper John and Draper (1980) present data on the subjective assessment of the thickness of pipe. Five inspectors assessed wall thickness at four different locations on the pipe. The experiment was repeated three times. The sixty responses are a multiple of the difference between the inspector's assessment and the 'true' value determined by an ultrasonic reader. If both readings were available the Box-Cox transformation could be applied to all 120 readings and the difference analysed in the transformed scale. But the ultrasonic readings are no longer known.
krafft The krafft dataset (Jalali-Heravi and Knouz, 2002) contains the measures for 32 chemical compounds of a physical property called the Krafft point, together with several molecular descriptors, in order to find a predictive equation. The dataset is characterised by two linear structures. The points in the smaller group correspond to compounds called sulfonates. The LS estimate fits neither of the two sub-groups.
leafpine Shortleaf pine data. It contains 70 observations on the volume in cubic feet of shortleaf pine, from Bruce and Schumacher (1935), together with the girth of each tree (that is, the diameter at breast height, in inches) and the height of the tree in feet. The variables are:
- x1, girth of the tree;
- x2, height of the tree;
- y, volume of the tree.
The girth and, to a lesser extent the height, are easily measured, but it is the volume of usable timber that determines the value of a tree.
loyalty Loyalty cards data. They contain 509 observations on the behavior of customers with loyalty cards from a supermarket chain in Northern Italy. The response (y) is the amount, in Euros, spent at the shop over six months and the explanatory variables are:
- x1, the number of visits to the supermarket in the six month period;
- x2, the age of the customer;
- x3, the number of members of the customer's family.
By transforming the data it is possible to see in the FS plots a group of customers characterized by a different purchasing behaviour (Atkinson and Riani, 2006).
Marketing_Data Data from an advertising experiment relating social media budget to sales (in thousands of dollars). There are 200 experiments. The variables are:
- x1, youtube
- x2, facebook
- x3, newspaper
- y, sales
The source of the data is the Kaggle dataset "Marketing Linear Multiple Regression".
The dataset has been used in Riani, Atkinson and Corbellini (2022).
mandible Mandible length and gestational age for 167 foetuses from the 12th week of gestation onwards. The variables are
- Age, the gestational age (in weeks)
- Length, the mandible length (in mm)
The source of this dataset is Patrick Royston and Douglas G. Altman (1994) Regression using fractional polynomials of continuous covariates: Parsimonious parametric modelling. Applied Statistics, 43(3), 429–467.
mineral The mineral dataset (Smith, Campbell and Lichfield, 1984) contains the measurement of the contents (in parts per million) of 2 chemical elements (zinc and copper) in 53 samples of rocks in Western Australia. Observation 15 stands out as clearly atypical, having a very large abscissa and too high an ordinate. The LS and L1 fits are seen to be influenced more by this observation than by the rest. By contrast, the LS fit omitting observation 15 gives a good fit to the rest of the data. Neither the Q–Q plot nor the plot of residuals vs. fitted values for the LS estimate reveals the existence of an outlier as indicated by an exceptionally large residual. However, the second plot shows an approximately linear relationship between residuals and fitted values (excepting observation 15) and this indicates that the fit is not correct.
ms212 Pulse (heart) rate dataset. Students in an introductory statistics class at the University of Queensland participated in a simple experiment. They took their own pulse rate. They were then randomized to run in place for one minute or to sit for that minute. Then everyone measured their pulse rate again. There are 109 complete observations, nine explanatory variables covering physiological and lifestyle data and the two pulse rates. One research question was how the difference in pulse rate before and after the minute depends on lifestyle and physiological measurements. It is expected to depend heavily on whether the students ran or not. The data, posted by John Eccleston and Richard Wilson, are available with a more complete description at http://www.statsci.org/data/oz/ms212.html
multiple_regression Multiple regression data showing the effect of masking (Atkinson and Riani, 2000). There are 60 observations on a response y with the values of three explanatory variables. The scatter plot matrix of the data shows y increasing with each of x1, x2 and x3. The plot of residuals against fitted values shows no obvious pattern. However the FS finds that there are 6 masked outliers.
nci60 NCI-60 cancer cell panel data. The data set is a pre-processed version of the NCI-60 cancer cell panel as used in Alfons, Croux & Gelper (2013). One observation was removed since all values in the gene expression data were missing. The number of observations is 59 and the number of explanatory variables is 100. The response (which in the dataset is called y) is KRT18 antibody, which constitutes the variable with the largest MAD. The response variable measures the expression levels of the protein keratin 18, which is known to be persistently expressed in carcinomas. The explanatory variables are the most correlated 100 variables.
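The two selection steps mentioned above (choosing the variable with the largest MAD as response, then keeping the variables most correlated with it) can be sketched in Python as follows. This is an illustration only: the function names and toy data are hypothetical, and the use of the plain Pearson correlation in absolute value is an assumption, since the correlation measure used in the original pre-processing is not stated here.

```python
import math
import statistics

def mad(values):
    """Median absolute deviation from the median (unscaled)."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def top_correlated(columns, response, k):
    """Names of the k columns with the largest absolute correlation with response."""
    ranked = sorted(columns,
                    key=lambda name: abs(pearson(columns[name], response)),
                    reverse=True)
    return ranked[:k]

# Toy data (hypothetical)
y = [1.0, 2.0, 3.0, 4.0, 5.0]
columns = {
    "a": [2.0, 4.0, 6.0, 8.0, 10.0],  # correlation 1 with y
    "b": [5.0, 3.0, 4.0, 1.0, 2.0],   # correlation -0.8 with y
    "c": [1.0, 1.0, 2.0, 1.0, 1.0],   # correlation 0 with y
}
selected = top_correlated(columns, y, k=2)
```

With 59 observations and 100 retained explanatory variables, the resulting regression problem has more variables than units, which is why this dataset is typically used with sparse or regularized robust methods.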
oats The data set oats (Scheffe, 1959, p. 138) lists the yield of grain for eight varieties of oats in five replications of a randomized-block experiment. Fitting by LS yields residuals with no noticeable structure and the usual F-tests for row and column effects have highly significant p-values of 0.00002 and 0.001, respectively. To show the effect of outliers on the classical procedure, five data values have been modified (see oats_mod dataset).
oats_mod A modification of the oats dataset that shows the effect of outliers on the classical procedure. As for the oats data, the normal Q-Q plot of t(i) shows nothing suspicious. However, the p-values of the F-tests are quite high. The diagnostics have thus failed to point out a departure from the model.
ozone Ozone data: ozone concentration at Upland (CA, USA) as a function of eight meteorological variables (Breiman and Friedman, 1985). Data come from the first 80 observations on a series of measurements of ozone concentration and meteorological variables in California, starting from the beginning of 1976. The variables are:
- x1: Sandburg air force base temperature;
- x2: inversion base height (feet);
- x3: Daggett pressure gradient (mm Hg);
- x4: visibility (miles);
- x5: Vandenburg 500 millibar height (m);
- x6: humidity (percent);
- x7: inversion base temperature;
- x8: wind speed (mph);
- y: Upland ozone concentration.
Through the analysis of the fan plot it is possible to see how the data need appropriate transformation before being analyzed.
ozone_330_obs These data are a superset of the 80 ozone observations described above. Through the analysis of the fan plot, the minimum deletion residual plot and the generalised candlestick plot, it is possible to see that the data should be transformed to the logarithmic scale, that only two out of eight variables are significant and that four outliers can be detected.
poison Poison data (by Box and Cox, 1964) are about the time to death of animals in a 3 × 4 factorial experiment with four observations at each factor combination. There are no outliers or influential observations that cannot be reconciled with the greater part of the data by a suitable transformation.
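The layout of a replicated 3 × 4 factorial experiment such as this one can be enumerated with a few lines of Python. The sketch below only illustrates the design structure: the function name is hypothetical and the factor levels are coded by integers, since the factor names are not given in this description.

```python
from itertools import product

def factorial_layout(levels_a, levels_b, replicates):
    """All (level of A, level of B, replicate) triples of a replicated two-factor design."""
    return [(a, b, r)
            for a, b in product(levels_a, levels_b)
            for r in range(1, replicates + 1)]

# 3 x 4 factorial with four observations at each factor combination
layout = factorial_layout(range(1, 4), range(1, 5), 4)
n_obs = len(layout)
```

The 12 factor combinations with four observations each give 48 observations in total.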
pollution see dataset air_pollution in this table
rats The rats dataset (Bond, 1979) corresponds to an experiment on the speed of learning of rats. Times were recorded for a rat to go through a shuttlebox in successive attempts. If the time exceeded 5 seconds, the rat received an electric shock for the duration of the next attempt. The data are the number of shocks received and the average time for all attempts between shocks. The relationship between the variables is roughly linear except for observations 1, 2, and 4. The LS line does not fit the bulk of the data, being a compromise between those three points and the rest.
salinity Salinity dataset. Measurements on water in Pamlico Sound, North Carolina. The data are taken from Ruppert and Carroll (1980). There are 28 observations on the salinity of water in the spring in Pamlico Sound, North Carolina. Analysis of the data was originally undertaken as part of a project for forecasting the shrimp harvest. The response is the biweekly average of salinity. There are three explanatory variables: the salinity in the previous two-week time period, a dummy variable for the time period during March and April and the river discharge. Thus the variables are:
- x1: salinity lagged two weeks
- x2: trend, a dummy variable for the time period
- x3: water flow, that is, river discharge
- y: biweekly average salinity.
The data seem to include one outlier. This could either be omitted, or changed to agree with the rest of the data. We make this change and use the forward search to show that the “corrected” observation is not in any way outlying or influential.
stack_loss Stack loss dataset. Brownlee's stack loss data on the oxidation of ammonia (Brownlee, 1965). There are observations from 21 days of operation of a plant for the oxidation of ammonia as a stage in the production of nitric acid. The variables are:
- x1: air flow
- x2: cooling water inlet temperature
- x3: 10 × (acid concentration - 50)
- y: stack loss; 10 times the percentage of ingoing ammonia escaping unconverted up a stack, or chimney.
The air flow (x1) measures the rate of operation of the plant. The nitric oxides produced are absorbed in a countercurrent absorption tower; x2 is the inlet temperature of cooling water circulating through coils in this tower and x3 is proportional to the concentration of acid in the tower. Small values of the response correspond to efficient absorption of the nitric oxides. Standard statistical techniques identify some observations as outliers. However, through FS plots of t-statistics, R2 and leverage, it is possible to identify a more complex structure in the dataset.
stars The stars dataset, introduced by Humphreys (1978), consists of 47 observations on the light intensity (y2) and the surface temperature measured in Kelvin (y1) of 47 stars of the CYG OB1 cluster, which can be observed in the direction of the constellation Cygnus. Both variables are on the logarithmic scale (base 10). The temperature is the explanatory variable. The dataset is interesting from the statistical point of view because it contains four strong outliers (observations 11, 20, 30, and 34), corresponding to giant stars, which affect the OLS regression parameters. Robustness aspects linked to this dataset are discussed in Rousseeuw and Leroy (1987, p. 27).
TableF61_Greene The dataset (Greene 2012, chapter 9) contains cost data for U.S. airlines: 90 observations on 6 firms for 15 years, 1970–1984. These data are a subset of a larger data set provided by Professor Moshe Kim. They were originally constructed by Christensen Associates of Madison, Wisconsin. The variables are: I = Airline, T = Year, Q = Output in revenue passenger miles, index number, C = Total cost in $1000, PF = Fuel price, LF = Load factor, the average capacity utilization of the fleet.
TableF91_Greene The dataset (Greene 2003, chapter 11) gives monthly credit card expenditure for 100 individuals, sampled from a larger sample of 13,444 people.
toxicity The toxicity dataset (Maguna, Nunez, Okulik and Castro, 2003) contains measurements of the aquatic toxicity of 38 carboxylic acids, together with nine molecular descriptors, in order to find a predicting equation for y = log(toxicity). The plot of the residuals vs. fit and the normal Q-Q plot for the LS estimate show no outliers. By contrast, the 85% normal efficiency MM-estimate shows 10 outliers, in particular observations 13, 23, 32, 34, 35, 36.
tradeH The tradeH dataset contains data on coalfish traded from the Maldives to Italy. The two variables are the value (dependent variable) and the quantity (independent variable) exchanged. The data are characterized by strong heteroscedasticity.
valueadded Data on value added by industry from the UNIDO INDSTAT database.
- dependent variable: life expectancy (life_exp) from HDI
- independent variables: low, medium and high, that is, the shares of Manufacturing Value Added at low, medium and high technology level
- hdi_score, exp_school and mean_school: other components of HDI
- mvapc: manufacturing value added per capita
- countrygroup and group code: country groupings according to industrial development
- country (UN country code), countrycode (ISO country code) and countryname
wool Number of cycles to failure of samples of worsted yarn in a 3 × 3 × 3 experiment (Box and Cox, 1964). The wool data give the number of cycles to failure of a worsted yarn under cycles of repeated loading. The results are from a single 3³ factorial experiment. The three factors and their levels are:
- x1: length of test specimen (25, 30, 35 cm)
- x2: amplitude of loading cycle (8, 9, 10 mm)
- x3: load (40, 45, 50 g).
The number of cycles to failure ranges from 90, for the shortest specimen subject to the most severe conditions, to 3,636 for observation 19, which comes from the longest specimen subjected to the mildest conditions. In their analysis, Box and Cox (1964) recommend that the data be fitted after the log transformation of y. The FS plots show how the ordering of the data during the FS affects the estimates of the regression coefficients and the error variance, and a score statistic for transformation of the response.
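The design described above has 3 × 3 × 3 = 27 factor combinations. The Python sketch below enumerates them from the listed levels and applies the log transformation to the two extreme responses quoted in the text. The variable names are hypothetical, the enumeration order is not claimed to match the row order of the dataset, and the natural logarithm is an assumption (another base only rescales the transformed response).

```python
import itertools
import math

# Factor levels as listed in the description
length_cm = [25, 30, 35]     # x1: length of test specimen
amplitude_mm = [8, 9, 10]    # x2: amplitude of loading cycle
load_g = [40, 45, 50]        # x3: load

# Full 3^3 factorial design: every combination of the three factors
design = list(itertools.product(length_cm, amplitude_mm, load_g))
print(len(design))  # 27

# Shortest specimen under the most severe conditions, and
# longest specimen under the mildest conditions
most_severe = (min(length_cm), max(amplitude_mm), max(load_g))
mildest = (max(length_cm), min(amplitude_mm), min(load_g))

# Smallest and largest numbers of cycles to failure quoted in the text
y_min, y_max = 90, 3636

# On the raw scale the extremes differ by a factor of about 40;
# on the log scale the difference is log(y_max) - log(y_min)
ratio = y_max / y_min
log_range = math.log(y_max) - math.log(y_min)
print(most_severe, mildest, ratio, log_range)
```

The large ratio between the extreme responses is one reason a transformation of y is worth considering for these data.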
Time series
The following datasets are stored in timetable format (the names of these datasets always begin with the letters TT).
TTaccessories Weekly time series of export quantities (in kg) for a product in the category of parts and accessories of motor vehicles to Belarus (first column) and Kazakhstan (second column) in the period 4 Jan 2021 to 11 Sep 2023.
TTplant Monthly time series of quantities (in tons) of plants (used primarily in perfumery, pharmacy or for insecticidal, fungicidal or similar purposes) imported from Kenya into the UK in the period Jan 2008 - Dec 2018. In principle, dataset TTplant should be a superset of dataset P12119085 contained in .mat file TTP12119085. However, there are small differences because the trade data have been revised over the years to correct possible mistakes.
TTP12119085 The P12119085 dataset (Rousseeuw et al. 2018) contains monthly trade volumes of imports of plants (used primarily in perfumery, pharmacy or for insecticidal, fungicidal or similar purposes) from Kenya into the UK in a four-year period. The first observation refers to August 2008. A downward level shift is evident at positions 27–28. A superset of this dataset is dataset plant contained in .mat file TTplant.
TTP17049075 The P17049075 dataset (Rousseeuw et al. 2018) contains monthly trade volumes of imports of sugar from Ukraine into Lithuania in a four-year period. The first observation refers to August 2008. A downward level shift is evident at position 35. A superset of this dataset is dataset sugar contained in .mat file TTsugar.
TTsalmon Monthly time series of the consumption price of salmon per kg in Denmark. The observed period goes from January 2010 to July 2022. This dataset comes from EUMOFA (European Commission).
TTsesame Monthly time series of quantities (in kg) of sesame seeds imported from India (first column) and from Pakistan (second column) into the EU. The observed period goes from February 2020 to September 2022.
TTsugar Monthly time series of quantities (in tons) and values of sugar imported from Ukraine into Lithuania. The observed period goes from January 2008 to January 2018 (this is the product with nomenclature code P17049075).
In principle, dataset TTsugar should be a superset of dataset P17049075 contained in .mat file TTP17049075. However, there are small differences because the trade data have been revised over the years to correct possible mistakes.