univariatems performs preliminary univariate robust model selection in linear regression
In the situations in which the number of potential explanatory variables is very large it is necessary to preliminary understand the subset of variables which surely must be excluded before running the proper variable selection procedure. This procedure estimates a univariate regression model (with intercept) between each column of X and the response. Just the variables that have an R2 greater than (or a -value smaller than) a certain threshold in the univariate regressions are retained. The p-value or R2 threshold are based on robust univariate models (with intercept), but the unrobust models can be chosen using option thresh. Option fsr enables the user to select the preferred robust regression procedure.
Call of univariatems with a personalized threshold for p-value.Tsel
=univariatems(y
,
X
,
Name, Value
)
n=30;
out=simulateLM(n,'R2',0.8);
y=out.y;
% Just the first 3 variables are significant.
X=[out.X randn(n,57)];
Tsel=univariatems(y,X);
disp('Table with the details of the vars that are significant in univ. regr.')
Generate a regression model with 3 expl. variables.
n=30;
out=simulateLM(n,'R2',0.5);
y=out.y;
% Just the first 3 variables are significant.
X=[out.X randn(n,7)];
% The variables that are retained are those for which the robust p-value
% of univariate regression is not greater than 0.2
mypval=0.20;
Tsel=univariatems(y,X,'thresh',mypval);
n=30;
out=simulateLM(n,'R2',0.8,'nexpl',5);
y=out.y;
% Just the first 10 variables are significant.
X=[out.X randn(n,20)];
% The variables that are retained are those for which the robust R2
% of univariate regression is not smaller than 0.08.
thresh=struct;
thresh.R2rob=0.08;
Tsel=univariatems(y,X,'thresh',thresh);
disp(Tsel)
rng(1)
n=30;
out=simulateLM(n,'R2',0.8);
y=out.y;
% Just variables 6:8 are significant.
X=[randn(n,5) out.X randn(n,52)];
% Use a threshold of p-values based on non robust models.
Tsel=univariatems(y,X);
Xtab=array2table(X(:,Tsel.i),'VariableNames',Tsel.Properties.RowNames);
mdl=stepwiselm(Xtab,y,'Upper','linear','Verbose',0);
fprintf('<strong>Final chosen model after univariatems filter and then stepwiselm</strong>')
disp(mdl)
disp('<strong>Final chosen model after applying directly stepwiselm</strong>')
mdltrad=stepwiselm(X,y,'Upper','linear','Verbose',0);
disp(mdltrad)
% Lasso part
cvlassoParameter=5;
[B,FitInfo] = lasso(Xtab{:,:},y,'CV',cvlassoParameter);
idxLambdaMinMSE = FitInfo.IndexMinMSE;
seqp=1:size(Xtab,2);
minMSEModelPredictors = seqp(B(:,idxLambdaMinMSE)~=0);
mdl=fitlm(Xtab(:,minMSEModelPredictors),y);
disp('<strong>Final chosen model after univariatems filter and then lasso</strong>')
disp(mdl)
disp('<strong>Final chosen model after applying directly lasso</strong>')
[B,FitInfo] = lasso(X,y,'CV',cvlassoParameter);
idxLambdaMinMSE = FitInfo.IndexMinMSE;
seqp=1:size(X,2);
minMSEModelPredictors = seqp(B(:,idxLambdaMinMSE)~=0);
mdl=fitlm(X(:,minMSEModelPredictors),y,'VarNames',["X"+minMSEModelPredictors "y"]);
disp(mdl)
<strong>Final chosen model after univariatems filter and then stepwiselm</strong> Linear regression model: y ~ 1 + X6 + X7 + X8 Estimated Coefficients: Estimate SE tStat pValue ________ _______ _______ __________ (Intercept) 0.046141 0.17791 0.25935 0.79741 X6 1.0712 0.16059 6.6705 4.4672e-07 X7 0.94031 0.1986 4.7348 6.7681e-05 X8 0.83425 0.19591 4.2584 0.00023769 Number of observations: 30, Error degrees of freedom: 26 Root Mean Squared Error: 0.962 R-squared: 0.763, Adjusted R-Squared: 0.735 F-statistic vs. constant model: 27.9, p-value = 2.78e-08 <strong>Final chosen model after applying directly stepwiselm</strong> Linear regression model: y ~ 1 + x6 + x7 + x8 + x14 + x33 + x38 + x42 + x43 + x60 Estimated Coefficients: Estimate SE tStat pValue ________ ________ _______ __________ (Intercept) 0.036482 0.11885 0.30695 0.76205 x6 1.0523 0.097845 10.755 9.1827e-10 x7 1.3944 0.14186 9.8291 4.224e-09 x8 0.75831 0.12019 6.3095 3.6942e-06 x14 -0.17313 0.076607 -2.2599 0.035139 x33 -0.26114 0.10221 -2.555 0.018873 x38 0.47328 0.13463 3.5155 0.0021758 x42 0.68126 0.17095 3.9852 0.00072821 x43 0.33055 0.11266 2.9339 0.0082075 x60 -0.85072 0.18011 -4.7234 0.00013016 Number of observations: 30, Error degrees of freedom: 20 Root Mean Squared Error: 0.551 R-squared: 0.94, Adjusted R-Squared: 0.913 F-statistic vs. constant model: 35, p-value = 2.59e-10 <strong>Final chosen model after univariatems filter and then lasso</strong> Linear regression model: y ~ 1 + X6 + X32 + X7 + X8 + X26 + X15 + X50 Estimated Coefficients: Estimate SE tStat pValue ________ _______ _______ __________ (Intercept) 0.039614 0.18901 0.20959 0.83592 X6 0.9015 0.18904 4.7689 9.2322e-05 X32 0.22098 0.21242 1.0403 0.3095 X7 0.83778 0.22394 3.741 0.0011319 X8 0.79432 0.20381 3.8974 0.00077435 X26 0.091074 0.25484 0.35738 0.72421 X15 0.10139 0.17995 0.56344 0.57884 X50 0.19768 0.19432 1.0173 0.32009 Number of observations: 30, Error degrees of freedom: 22 Root Mean Squared Error: 0.977 R-squared: 0.793, Adjusted R-Squared: 0.728 F-statistic vs. constant model: 12.1, p-value = 3.03e-06 <strong>Final chosen model after applying directly lasso</strong> Linear regression model: y ~ 1 + X4 + X6 + X7 + X8 + X11 + X19 + X32 + X48 Estimated Coefficients: Estimate SE tStat pValue ________ _______ _______ __________ (Intercept) 0.1483 0.14682 1.01 0.32397 X4 0.2335 0.13506 1.7289 0.098499 X6 0.90094 0.14003 6.4338 2.2382e-06 X7 0.69224 0.16398 4.2216 0.00038249 X8 0.56179 0.17276 3.2518 0.0038152 X11 -0.40163 0.15509 -2.5896 0.0171 X19 -0.33449 0.15965 -2.0952 0.04846 X32 0.48524 0.16388 2.961 0.0074569 X48 0.32956 0.13666 2.4115 0.025125 Number of observations: 30, Error degrees of freedom: 21 Root Mean Squared Error: 0.712 R-squared: 0.895, Adjusted R-Squared: 0.855 F-statistic vs. constant model: 22.4, p-value = 1.23e-08
rng(1)
n=30;
thresh=struct;
thresh.pval=0.10;
out=simulateLM(n,'R2',0.8);
y=out.y;
% Just the first 3 variables are significant.
X=[out.X randn(n,57)];
% Suppose that it is known that variable 11:20 must have a positive sign in
% univariate regressions.
theoreticalSigns=zeros(1,size(X,2));
theoreticalSigns(11:20)=1;
% and that variables 41:50 must have a negative sign in
% univariate regressions.
theoreticalSigns(41:50)=-1;
% call to univariatems with option theoreticalSigns.
[Tsel,Texcl]=univariatems(y,X,'theoreticalSigns',theoreticalSigns);
This example is taken from https://stats.oarc.ucla.edu/r/dae/tobit-models/ Consider the situation in which we have a measure of academic aptitude (scaled 200-800) which we want to model using reading and math test scores, as well as, the type of program the student is enrolled in (academic, general, or vocational). For this example, tobit regression (see function regressCens) is more appropriate. We just use this dataset to show the case in which one of the variables is categorical and X is table.
link="https://stats.idre.ucla.edu/stat/data/tobit.csv";
XX=readtable(link,"ReadRowNames",true);
XX.prog=categorical(XX.prog);
% Define y and X
y=XX(:,"apt");
X=XX(:,["read", "math" "prog"]);
% In this case both y and X are tables
[Tsel,Texcl]=univariatems(y,X);
y
— Response variable.
Vector or table with just one column.A vector with n elements that contains the response variables.
Missing values (NaN's) and infinite values (Inf's) are allowed, since observations (rows) with missing or infinite values will automatically be excluded from the computations.
Data Types: vector or table
X
— Predictor variables.
Matrix or table.Data matrix of explanatory variables (also called 'regressors') of dimension (n-by-p). Note that, in this case, p can be greater than n. Rows of X represent observations, and columns represent variables. Missing values (NaN's) and infinite values (Inf's) are allowed, since observations (rows) with missing or infinite values will automatically be excluded from the computations. If X is table for the variables, which are defined as categorical function dummyvar is called, and the p-value (R2) refers to the univariate model with all the levels (minus 1) of the categorical variable.
Data Types: matrix or table
Specify optional comma-separated pairs of Name,Value
arguments.
Name
is the argument name and Value
is the corresponding value. Name
must appear
inside single quotes (' '
).
You can specify several name and value pair arguments in any order as
Name1,Value1,...,NameN,ValueN
.
'fsr',false
, 'h',round(n*0,75)
, 'lms',1
, 'nsamp',1000
, 'PredictorNames',{'X1','X2'}
, 'thresh',0.10
,
fsr
—Method to use for robust regression.boolean.If fsr is true univariate robust is based on function FSR (forward search) else it is based on function LXS (LTS or LMS). The default value of fsr is true.
Example: 'fsr',false
Data Types: logical
h
—The number of observations that have determined the robust
regression estimator for each univariate regression.scalar.h is an integer greater or equal than [(n+p+1)/2], but smaller than n.
Example: 'h',round(n*0,75)
Data Types: double
lms
—Criterion to use to find the robust beta
in the univariate regressions.scalar.If lms=1 (default) Least Median of Squares is computed, else Least Trimmed of Squares is computed.
Example: 'lms',1
Data Types: double
nsamp
—Number of subsamples that will be extracted to find the
robust estimator.scalar.If nsamp=0, all subsets will be extracted.
They will be (n choose p).
Remark. If the number of all possible subset is <1000, the default is to extract all subsets otherwise just 1000.
Example: 'nsamp',1000
Data Types: double
PredictorNames
—names of the explanatory variables.cell array of characters | string array.Cell array of characters or string array of length p containing the names of the explanatory variables.
If PredictorNames is missing and X is not a table, the following sequence of strings will be automatically created for labelling the column of matrix X ("X1", "X2", "X3", ..., "Xp").
Example: 'PredictorNames',{'X1','X2'}
Data Types: cell array of characters or string array
thresh
—threshold that defines the variables to retain.scalar | struct.The default value of thresh is 0.10, that is all variables which in univariate robust regression had a p-value smaller than 0.10 are retained.
If thresh is a struct it is possible to specify whether the threshold is based on (robust) p-values or (robust) R2.
If thresh is a struct it may contain one of the the following fields:
Value | Description |
---|---|
pval |
all variables that, in univariate regression, have a p-value smaller or equal to thresh.pval are selected. |
pvalrob |
all variables that, in univariate regression, have a robust p-value smaller or equal to thresh.pvalr are selected. |
R2 |
all variables that, in univariate regression, have a R2 square greater or equal to thresh.R2 are selected. |
R2rob |
all variables that, in univariate regression, have a robust R2 square greater or equal to thresh.R2r are selected. Note that, if thresh is a struct with both fields an error is produced because just one between thresh.pval, thresh.pvalrob, thresh.R2, thresh.R2rob must be present. |
Example: 'thresh',0.10
Data Types: double
theoreticalSigns
—theoretical sign that the beta from univariate
regression must have.vector of length p.1 denotes a positive sign for the corresponding variable, while -1 denotes a negative sign for the corresponding variable. 0 denotes that any sign is possible. For example, if p is equal to 5, if theoreticalSigns=[1 1 0 -1 1] means that variables 1, 2, and 5 must have a positive robust estimate of the slope in the univariate regression, while variable 4 must have a negative estimate for the robust beta coefficient. Finally, variable 3 can have any sign.
Example: 'theoreticalSigns',[-1 1 1 0 -1].
If theoreticalSigns is empty or it is not specified, no filter
based on sign is applied.
Data Types: double
Tsel
—Details of the variables that were
important from univariate analysis.
tableThe details of table Tsel are as follows.
The rownames of this table contain the names of the variables.
1st col: the indexes of the important variables 2nd-4th col: estimates of beta p-value and R2 from non robust univariate regression 5th-7th col: estimates of beta p-value and R2 from robust univariate regression 8th col: number of units declared as outliers in robust univariate regression.
The rows of table Tsel are ordered in terms of variable importance, in the sense that row 1 refers to the variable with highest robust R2 (smallest robust p-value). Row 2 contains the variable with the second highest robust R2 (second smallest robust p-value)...
If no explanatory variable survives the criteria, Tsel is a 0×8 empty table.
Texcl
—Details of the variables that were
not important from univariate analysis.
tableThe details of table Texcl are as follows.
The rownames of this table contain the names of the selected variables.
1st col: the indexes of the important variables;
2nd-4th col: estimates of beta p-value and R2 from non robust univariate regression;
5th-7th col: estimates of beta p-value and R2 from robust univariate regression;
8th col: number of units declared as outliers in robust univariate regression.
If no explanatory variable is excluded Texcl is a 0×8 empty table.