univariatems performs preliminary univariate robust model selection in linear regression
When the number of potential explanatory variables is very large, it is useful to understand, before running the proper variable selection procedure, which subset of variables can surely be excluded. This procedure estimates a univariate regression model (with intercept) between each column of X and the response. Only the variables whose univariate regression has an R2 greater than (or a p-value smaller than) a certain threshold are retained. The p-value or R2 threshold is based on robust univariate models (with intercept), but non-robust models can be chosen using option thresh. Option fsr enables the user to select the preferred robust regression procedure.
Call of univariatems with a personalized threshold for p-value.
Tsel = univariatems(y, X, Name, Value)
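A minimal non-robust sketch of this screening idea (illustration only; the actual function also fits robust univariate models via FSR or LXS and can filter on R2 or on theoretical signs) could look like this:
% Minimal non-robust sketch of the univariate screening idea (illustration only)
thresh = 0.10;                              % p-value threshold
p = size(X,2);
pval = zeros(p,1);
for j = 1:p
    mdlj = fitlm(X(:,j), y);                % univariate regression with intercept
    pval(j) = mdlj.Coefficients.pValue(2);  % p-value of the slope
end
keep = find(pval <= thresh);                % indexes of the retained variables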
n=30;
out=simulateLM(n,'R2',0.8);
y=out.y;
% Just the first 3 variables are significant
X=[out.X randn(n,57)];
Tsel=univariatems(y,X);
disp('Table with the details of the vars which are significant in univ. regr.')
Generate a regression model with 3 expl. variables.
n=30;
out=simulateLM(n,'R2',0.5);
y=out.y;
% Just the first 3 variables are significant
X=[out.X randn(n,7)];
% The variables which are retained are those for which the robust p-value
% of univariate regression is not greater than 0.2
mypval=0.20;
Tsel=univariatems(y,X,'thresh',mypval);
n=30;
out=simulateLM(n,'R2',0.8,'nexpl',5);
y=out.y;
% Just the first 5 variables are significant
X=[out.X randn(n,20)];
% The variables which are retained are those for which the robust R2
% of univariate regression is not smaller than 0.08
thresh=struct;
thresh.R2rob=0.08;
Tsel=univariatems(y,X,'thresh',thresh);
disp(Tsel)
rng(1)
n=30;
out=simulateLM(n,'R2',0.8);
y=out.y;
% Just variables 6:8 are significant
X=[randn(n,5) out.X randn(n,52)];
% Use a threshold of p-values based on non robust models
Tsel=univariatems(y,X);
Xtab=array2table(X(:,Tsel.i),'VariableNames',Tsel.Properties.RowNames);
mdl=stepwiselm(Xtab,y,'Upper','linear','Verbose',0);
fprintf('<strong>Final chosen model after univariatems filter and then stepwiselm</strong>')
disp(mdl)
disp('<strong>Final chosen model after applying directly stepwiselm</strong>')
mdltrad=stepwiselm(X,y,'Upper','linear','Verbose',0);
disp(mdltrad)
% Lasso part
cvlassoParameter=5;
[B,FitInfo] = lasso(Xtab{:,:},y,'CV',cvlassoParameter);
idxLambdaMinMSE = FitInfo.IndexMinMSE;
seqp=1:size(Xtab,2);
minMSEModelPredictors = seqp(B(:,idxLambdaMinMSE)~=0);
mdl=fitlm(Xtab(:,minMSEModelPredictors),y);
disp('<strong>Final chosen model after univariatems filter and then lasso</strong>')
disp(mdl)
disp('<strong>Final chosen model after applying directly lasso</strong>')
[B,FitInfo] = lasso(X,y,'CV',cvlassoParameter);
idxLambdaMinMSE = FitInfo.IndexMinMSE;
seqp=1:size(X,2);
minMSEModelPredictors = seqp(B(:,idxLambdaMinMSE)~=0);
mdl=fitlm(X(:,minMSEModelPredictors),y,'VarNames',["X"+minMSEModelPredictors "y"]);
disp(mdl)
<strong>Final chosen model after univariatems filter and then stepwiselm</strong>
Linear regression model:
    y ~ 1 + X6 + X7 + X8

Estimated Coefficients:
                   Estimate      SE        tStat       pValue
                   ________    _______    _______    __________
    (Intercept)    0.046141    0.17791    0.25935       0.79741
    X6               1.0712    0.16059     6.6705    4.4672e-07
    X7              0.94031     0.1986     4.7348    6.7681e-05
    X8              0.83425    0.19591     4.2584    0.00023769

Number of observations: 30, Error degrees of freedom: 26
Root Mean Squared Error: 0.962
R-squared: 0.763,  Adjusted R-Squared: 0.735
F-statistic vs. constant model: 27.9, p-value = 2.78e-08

<strong>Final chosen model after applying directly stepwiselm</strong>
Linear regression model:
    y ~ 1 + x6 + x7 + x8 + x14 + x33 + x38 + x42 + x43 + x60

Estimated Coefficients:
                   Estimate        SE        tStat       pValue
                   ________    ________    _______    __________
    (Intercept)    0.036482     0.11885     0.30695       0.76205
    x6               1.0523    0.097845      10.755    9.1827e-10
    x7               1.3944     0.14186      9.8291     4.224e-09
    x8              0.75831     0.12019      6.3095    3.6942e-06
    x14            -0.17313    0.076607     -2.2599      0.035139
    x33            -0.26114     0.10221      -2.555      0.018873
    x38             0.47328     0.13463      3.5155     0.0021758
    x42             0.68126     0.17095      3.9852    0.00072821
    x43             0.33055     0.11266      2.9339     0.0082075
    x60            -0.85072     0.18011     -4.7234    0.00013016

Number of observations: 30, Error degrees of freedom: 20
Root Mean Squared Error: 0.551
R-squared: 0.94,  Adjusted R-Squared: 0.913
F-statistic vs. constant model: 35, p-value = 2.59e-10

<strong>Final chosen model after univariatems filter and then lasso</strong>
Linear regression model:
    y ~ 1 + X6 + X32 + X7 + X8 + X26 + X15 + X50

Estimated Coefficients:
                   Estimate       SE       tStat       pValue
                   ________    _______    _______    __________
    (Intercept)    0.039614    0.18901    0.20959       0.83592
    X6               0.9015    0.18904     4.7689    9.2322e-05
    X32             0.22098    0.21242     1.0403        0.3095
    X7              0.83778    0.22394      3.741     0.0011319
    X8              0.79432    0.20381     3.8974    0.00077435
    X26            0.091074    0.25484    0.35738       0.72421
    X15             0.10139    0.17995    0.56344       0.57884
    X50             0.19768    0.19432     1.0173       0.32009

Number of observations: 30, Error degrees of freedom: 22
Root Mean Squared Error: 0.977
R-squared: 0.793,  Adjusted R-Squared: 0.728
F-statistic vs. constant model: 12.1, p-value = 3.03e-06

<strong>Final chosen model after applying directly lasso</strong>
Linear regression model:
    y ~ 1 + X4 + X6 + X7 + X8 + X11 + X19 + X32 + X48

Estimated Coefficients:
                   Estimate      SE        tStat       pValue
                   ________    _______    _______    __________
    (Intercept)      0.1483    0.14682       1.01       0.32397
    X4               0.2335    0.13506     1.7289      0.098499
    X6              0.90094    0.14003     6.4338    2.2382e-06
    X7              0.69224    0.16398     4.2216    0.00038249
    X8              0.56179    0.17276     3.2518     0.0038152
    X11            -0.40163    0.15509    -2.5896        0.0171
    X19            -0.33449    0.15965    -2.0952       0.04846
    X32             0.48524    0.16388      2.961     0.0074569
    X48             0.32956    0.13666     2.4115      0.025125

Number of observations: 30, Error degrees of freedom: 21
Root Mean Squared Error: 0.712
R-squared: 0.895,  Adjusted R-Squared: 0.855
F-statistic vs. constant model: 22.4, p-value = 1.23e-08
rng(1)
n=30;
thresh=struct;
thresh.pval=0.10;
out=simulateLM(n,'R2',0.8);
y=out.y;
% Just the first 3 variables are significant
X=[out.X randn(n,57)];
% Suppose that it is known that variables 11:20 must have a positive sign in
% univariate regressions
theoreticalSigns=zeros(1,size(X,2));
theoreticalSigns(11:20)=1;
% and that variables 41:50 must have a negative sign in
% univariate regressions
theoreticalSigns(41:50)=-1;
% call to univariatems with option theoreticalSigns
[Tsel,Texcl]=univariatems(y,X,'theoreticalSigns',theoreticalSigns);
This example is taken from https://stats.oarc.ucla.edu/r/dae/tobit-models/. Consider the situation in which we have a measure of academic aptitude (scaled 200-800) which we want to model using reading and math test scores, as well as the type of program the student is enrolled in (academic, general, or vocational). For this example tobit regression (see function regressCens) is more appropriate. We just use this dataset to show the case in which one of the variables is categorical and X is a table.
XX=readtable("https://stats.idre.ucla.edu/stat/data/tobit.csv","ReadRowNames",true);
XX.prog=categorical(XX.prog);
% Define y and X
y=XX(:,"apt");
X=XX(:,["read", "math" "prog"]);
% In this case both y and X are tables
[Tsel,Texcl]=univariatems(y,X);
y
— Response variable.
Vector or table with just one column. A vector with n elements that contains the response variable.
Missing values (NaN's) and infinite values (Inf's) are allowed, since observations (rows) with missing or infinite values will automatically be excluded from the computations.
Data Types: vector or table
X
— Predictor variables.
Matrix or table. Data matrix of explanatory variables (also called 'regressors') of dimension n-by-p. Note that in this case p can be greater than n. Rows of X represent observations, and columns represent variables. Missing values (NaN's) and infinite values (Inf's) are allowed, since observations (rows) with missing or infinite values will automatically be excluded from the computations. If X is a table, function dummyvar is called for the variables which are defined as categorical, and the p-value (R2) refers to the univariate model with all the levels (minus 1) of the categorical variable.
Data Types: matrix or table
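To illustrate the treatment of a categorical variable described above, the following sketch (illustration only, assuming a table X with a categorical column named prog as in the tobit example above) shows how dummyvar expands the variable and how a joint p-value for all its levels can be obtained:
% Sketch of the joint test for one categorical predictor (illustration only)
D = dummyvar(X.prog);       % one dummy column per level of the categorical variable
D = D(:,2:end);             % drop one level (reference category)
mdl = fitlm(D, y{:,1});     % univariate model containing all remaining levels
pvalCat = coefTest(mdl);    % F-test that all level coefficients are zero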
Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside single quotes (' '). You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.
'fsr',false, 'h',round(n*0.75), 'lms',1, 'nsamp',1000, 'PredictorNames',{'X1','X2'}, 'thresh',0.10
fsr
—Method to use for robust regression. Boolean. If fsr is true, univariate robust regression is based on function FSR (forward search), else it is based on function LXS (LTS or LMS). The default value of fsr is true.
Example: 'fsr',false
Data Types: logical
h
—The number of observations that have determined the robust regression estimator for each univariate regression. Scalar. h is an integer greater than or equal to [(n+p+1)/2] but smaller than n.
Example: 'h',round(n*0.75)
Data Types: double
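For instance, reading p in this formula as the two parameters of each univariate fit (intercept plus one slope), with n=30 the constraint becomes 16 ≤ h ≤ 29, and h=round(30*0.75)=23 satisfies it.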
lms
—Criterion to use to find the robust beta in the univariate regressions. Scalar. If lms=1 (default) Least Median of Squares is computed, else Least Trimmed Squares is computed.
Example: 'lms',1
Data Types: double
nsamp
—Number of subsamples which will be extracted to find the robust estimator. Scalar. If nsamp=0, all subsets will be extracted. They will be (n choose p).
Remark: if the number of all possible subsets is <1000 the default is to extract all subsets, otherwise just 1000.
Example: 'nsamp',1000
Data Types: double
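For instance, reading p as the size of the elemental subsets of each univariate fit (two units, for intercept plus slope), with n=30 there are nchoosek(30,2)=435 possible subsets; since 435 < 1000, by default all of them would be extracted.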
PredictorNames
—Names of the explanatory variables. Cell array of characters | string array. Cell array of characters or string array of length p containing the names of the explanatory variables.
If PredictorNames is missing and X is not a table, the following sequence of strings will be automatically created for labelling the columns of matrix X: "X1", "X2", "X3", ..., "Xp".
Example: 'PredictorNames',{'X1','X2'}
Data Types: cell array of characters or string array
thresh
—Threshold which defines the variables to retain. Scalar | struct. The default value of thresh is 0.10, that is, all variables whose p-value in univariate robust regression is smaller than 0.10 are retained.
If thresh is a struct, it is possible to specify whether the threshold is based on (robust) p-values or (robust) R2.
If thresh is a struct, it may contain one of the following fields:
Value | Description |
---|---|
pval | all variables which in univariate regression have a p-value smaller than or equal to thresh.pval are selected. |
pvalrob | all variables which in univariate regression have a robust p-value smaller than or equal to thresh.pvalrob are selected. |
R2 | all variables which in univariate regression have an R2 greater than or equal to thresh.R2 are selected. |
R2rob | all variables which in univariate regression have a robust R2 greater than or equal to thresh.R2rob are selected. |
Note that if thresh is a struct with more than one of these fields an error is produced, because just one among thresh.pval, thresh.pvalrob, thresh.R2 and thresh.R2rob must be present.
Example: 'thresh',0.10
Data Types: double
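For instance, to retain only the variables whose non-robust univariate p-value does not exceed 0.05, the pval field listed above can be used:
thresh = struct;
thresh.pval = 0.05;        % threshold on the non-robust p-values
Tsel = univariatems(y, X, 'thresh', thresh);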
theoreticalSigns
—Theoretical sign which the beta from univariate regression must have. Vector of length p. 1 denotes a positive sign for the corresponding variable, -1 denotes a negative sign, and 0 denotes that any sign is possible. For example, if p is equal to 5, theoreticalSigns=[1 1 0 -1 1] means that variables 1, 2 and 5 must have a positive robust estimate of the slope in the univariate regression, while variable 4 must have a negative estimate of the robust beta coefficient. Finally, variable 3 can have any sign. If theoreticalSigns is empty or not specified, no filter based on sign is applied.
Example: 'theoreticalSigns',[-1 1 1 0 -1]
Data Types: double
Tsel
—Details of the variables which were important from univariate analysis.
Table. The details of table Tsel are as follows.
The rownames of this table contain the names of the variables.
1st col: the indexes of the important variables;
2nd-4th col: estimates of beta, p-value and R2 from non-robust univariate regression;
5th-7th col: estimates of beta, p-value and R2 from robust univariate regression;
8th col: number of units declared as outliers in robust univariate regression.
The rows of table Tsel are ordered in terms of variable importance, in the sense that row 1 refers to the variable with the highest robust R2 (smallest robust p-value), row 2 contains the variable with the second highest robust R2 (second smallest robust p-value), and so on.
If no explanatory variable survives the criteria, Tsel is a 0×8 empty table.
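The table can then be used to subset X for a subsequent selection step, as in the stepwiselm/lasso example above (column i holds the indexes of the retained variables):
% Keep only the columns that survived the univariate screening
Xsel = array2table(X(:,Tsel.i), 'VariableNames', Tsel.Properties.RowNames);
mdl  = stepwiselm(Xsel, y, 'Upper', 'linear', 'Verbose', 0);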
Texcl
—Details of the variables which were not important from univariate analysis.
Table. The details of table Texcl are as follows.
The rownames of this table contain the names of the excluded variables.
1st col: the indexes of the excluded variables;
2nd-4th col: estimates of beta, p-value and R2 from non-robust univariate regression;
5th-7th col: estimates of beta, p-value and R2 from robust univariate regression;
8th col: number of units declared as outliers in robust univariate regression.
If no explanatory variable is excluded, Texcl is a 0×8 empty table.