univariatems performs preliminary univariate robust model selection in linear regression
When the number of potential explanatory variables is very large, it is useful to understand, before running the proper variable selection procedure, which subset of variables can surely be excluded. This procedure estimates a univariate regression model (with intercept) between each column of X and the response. Only the variables whose univariate regression has an R2 greater than (or a p-value smaller than) a certain threshold are retained. The p-value or R2 threshold is based on robust univariate models (with intercept), but non-robust models can be chosen using option thresh. Option fsr enables the user to select the preferred robust regression procedure.
Call of univariatems with a personalized threshold for p-value.
Tsel = univariatems(y, X, Name, Value)
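A minimal non-robust sketch of this screening idea (illustration only; the actual function also fits robust univariate models via FSR or LXS and can filter on R2 or on theoretical signs) could look like this:
% Minimal non-robust sketch of the univariate screening idea (illustration only)
thresh = 0.10;                              % p-value threshold
p = size(X,2);
pval = zeros(p,1);
for j = 1:p
    mdlj = fitlm(X(:,j), y);                % univariate regression with intercept
    pval(j) = mdlj.Coefficients.pValue(2);  % p-value of the slope
end
keep = find(pval <= thresh);                % indexes of the retained variables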
n=30;
out=simulateLM(n,'R2',0.8);
y=out.y;
% Just the first 3 variables are significant
X=[out.X randn(n,57)];
Tsel=univariatems(y,X);
disp('Table with the details of the vars which are significant in univ. regr.')
Generate a regression model with 3 expl. variables.
n=30;
out=simulateLM(n,'R2',0.5);
y=out.y;
% Just the first 3 variables are significant
X=[out.X randn(n,7)];
% The variables which are retained are those for which the robust p-value
% of univariate regression is not greater than 0.2
mypval=0.20;
Tsel=univariatems(y,X,'thresh',mypval);
n=30;
out=simulateLM(n,'R2',0.8,'nexpl',5);
y=out.y;
% Just the first 5 variables are significant
X=[out.X randn(n,20)];
% The variables which are retained are those for which the robust R2
% of univariate regression is not smaller than 0.08
thresh=struct;
thresh.R2rob=0.08;
Tsel=univariatems(y,X,'thresh',thresh);
disp(Tsel)
rng(1)
n=30;
out=simulateLM(n,'R2',0.8);
y=out.y;
% Just variables 6:8 are significant
X=[randn(n,5) out.X randn(n,52)];
% Use a threshold of p-values based on non robust models
Tsel=univariatems(y,X);
Xtab=array2table(X(:,Tsel.i),'VariableNames',Tsel.Properties.RowNames);
mdl=stepwiselm(Xtab,y,'Upper','linear','Verbose',0);
fprintf('<strong>Final chosen model after univariatems filter and then stepwiselm</strong>')
disp(mdl)
disp('<strong>Final chosen model after applying directly stepwiselm</strong>')
mdltrad=stepwiselm(X,y,'Upper','linear','Verbose',0);
disp(mdltrad)
% Lasso part
cvlassoParameter=5;
[B,FitInfo] = lasso(Xtab{:,:},y,'CV',cvlassoParameter);
idxLambdaMinMSE = FitInfo.IndexMinMSE;
seqp=1:size(Xtab,2);
minMSEModelPredictors = seqp(B(:,idxLambdaMinMSE)~=0);
mdl=fitlm(Xtab(:,minMSEModelPredictors),y);
disp('<strong>Final chosen model after univariatems filter and then lasso</strong>')
disp(mdl)
disp('<strong>Final chosen model after applying directly lasso</strong>')
[B,FitInfo] = lasso(X,y,'CV',cvlassoParameter);
idxLambdaMinMSE = FitInfo.IndexMinMSE;
seqp=1:size(X,2);
minMSEModelPredictors = seqp(B(:,idxLambdaMinMSE)~=0);
mdl=fitlm(X(:,minMSEModelPredictors),y,'VarNames',["X"+minMSEModelPredictors "y"]);
disp(mdl)
<strong>Final chosen model after univariatems filter and then stepwiselm</strong>
Linear regression model:
    y ~ 1 + X6 + X7 + X8

Estimated Coefficients:
                   Estimate      SE        tStat       pValue
                   ________    _______    _______    __________
    (Intercept)    0.046141    0.17791    0.25935       0.79741
    X6               1.0712    0.16059     6.6705    4.4672e-07
    X7              0.94031     0.1986     4.7348    6.7681e-05
    X8              0.83425    0.19591     4.2584    0.00023769

Number of observations: 30, Error degrees of freedom: 26
Root Mean Squared Error: 0.962
R-squared: 0.763,  Adjusted R-Squared: 0.735
F-statistic vs. constant model: 27.9, p-value = 2.78e-08

<strong>Final chosen model after applying directly stepwiselm</strong>
Linear regression model:
    y ~ 1 + x6 + x7 + x8 + x14 + x33 + x38 + x42 + x43 + x60

Estimated Coefficients:
                   Estimate        SE        tStat       pValue
                   ________    ________    _______    __________
    (Intercept)    0.036482     0.11885     0.30695       0.76205
    x6               1.0523    0.097845      10.755    9.1827e-10
    x7               1.3944     0.14186      9.8291     4.224e-09
    x8              0.75831     0.12019      6.3095    3.6942e-06
    x14            -0.17313    0.076607     -2.2599      0.035139
    x33            -0.26114     0.10221      -2.555      0.018873
    x38             0.47328     0.13463      3.5155     0.0021758
    x42             0.68126     0.17095      3.9852    0.00072821
    x43             0.33055     0.11266      2.9339     0.0082075
    x60            -0.85072     0.18011     -4.7234    0.00013016

Number of observations: 30, Error degrees of freedom: 20
Root Mean Squared Error: 0.551
R-squared: 0.94,  Adjusted R-Squared: 0.913
F-statistic vs. constant model: 35, p-value = 2.59e-10

<strong>Final chosen model after univariatems filter and then lasso</strong>
Linear regression model:
    y ~ 1 + X6 + X32 + X7 + X8 + X26 + X15 + X50

Estimated Coefficients:
                   Estimate       SE       tStat       pValue
                   ________    _______    _______    __________
    (Intercept)    0.039614    0.18901    0.20959       0.83592
    X6               0.9015    0.18904     4.7689    9.2322e-05
    X32             0.22098    0.21242     1.0403        0.3095
    X7              0.83778    0.22394      3.741     0.0011319
    X8              0.79432    0.20381     3.8974    0.00077435
    X26            0.091074    0.25484    0.35738       0.72421
    X15             0.10139    0.17995    0.56344       0.57884
    X50             0.19768    0.19432     1.0173       0.32009

Number of observations: 30, Error degrees of freedom: 22
Root Mean Squared Error: 0.977
R-squared: 0.793,  Adjusted R-Squared: 0.728
F-statistic vs. constant model: 12.1, p-value = 3.03e-06

<strong>Final chosen model after applying directly lasso</strong>
Linear regression model:
    y ~ 1 + X4 + X6 + X7 + X8 + X11 + X19 + X32 + X48

Estimated Coefficients:
                   Estimate      SE        tStat       pValue
                   ________    _______    _______    __________
    (Intercept)      0.1483    0.14682       1.01       0.32397
    X4               0.2335    0.13506     1.7289      0.098499
    X6              0.90094    0.14003     6.4338    2.2382e-06
    X7              0.69224    0.16398     4.2216    0.00038249
    X8              0.56179    0.17276     3.2518     0.0038152
    X11            -0.40163    0.15509    -2.5896        0.0171
    X19            -0.33449    0.15965    -2.0952       0.04846
    X32             0.48524    0.16388      2.961     0.0074569
    X48             0.32956    0.13666     2.4115      0.025125

Number of observations: 30, Error degrees of freedom: 21
Root Mean Squared Error: 0.712
R-squared: 0.895,  Adjusted R-Squared: 0.855
F-statistic vs. constant model: 22.4, p-value = 1.23e-08
rng(1)
n=30;
thresh=struct;
thresh.pval=0.10;
out=simulateLM(n,'R2',0.8);
y=out.y;
% Just the first 3 variables are significant
X=[out.X randn(n,57)];
% Suppose that it is known that variables 11:20 must have a positive sign in
% univariate regressions
theoreticalSigns=zeros(1,size(X,2));
theoreticalSigns(11:20)=1;
% and that variables 41:50 must have a negative sign in
% univariate regressions
theoreticalSigns(41:50)=-1;
% call to univariatems with option theoreticalSigns
[Tsel,Texcl]=univariatems(y,X,'theoreticalSigns',theoreticalSigns);
This example is taken from https://stats.oarc.ucla.edu/r/dae/tobit-models/. Consider the situation in which we have a measure of academic aptitude (scaled 200-800) which we want to model using reading and math test scores, as well as the type of program the student is enrolled in (academic, general, or vocational). For this example tobit regression (see function regressCens) is more appropriate. We just use this dataset to show the case in which one of the variables is categorical and X is a table.
XX=readtable("https://stats.idre.ucla.edu/stat/data/tobit.csv","ReadRowNames",true);
XX.prog=categorical(XX.prog);
% Define y and X
y=XX(:,"apt");
X=XX(:,["read", "math" "prog"]);
% In this case both y and X are tables
[Tsel,Texcl]=univariatems(y,X);
y
— Response variable.
Vector or table with just one column. A vector with n elements that contains the response variable.
Missing values (NaN's) and infinite values (Inf's) are allowed, since observations (rows) with missing or infinite values will automatically be excluded from the computations.
Data Types: vector or table
X
— Predictor variables.
Matrix or table. Data matrix of explanatory variables (also called 'regressors') of dimension n-by-p. Note that in this case p can be greater than n. Rows of X represent observations, and columns represent variables. Missing values (NaN's) and infinite values (Inf's) are allowed, since observations (rows) with missing or infinite values will automatically be excluded from the computations. If X is a table, function dummyvar is called for the variables which are defined as categorical, and the p-value (R2) refers to the univariate model with all the levels (minus 1) of the categorical variable.
Data Types: matrix or table
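To illustrate the treatment of a categorical variable described above, the following sketch (illustration only, assuming a table X with a categorical column named prog as in the tobit example above) shows how dummyvar expands the variable and how a joint p-value for all its levels can be obtained:
% Sketch of the joint test for one categorical predictor (illustration only)
D = dummyvar(X.prog);       % one dummy column per level of the categorical variable
D = D(:,2:end);             % drop one level (reference category)
mdl = fitlm(D, y{:,1});     % univariate model containing all remaining levels
pvalCat = coefTest(mdl);    % F-test that all level coefficients are zero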
Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside single quotes (' '). You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.
'fsr',false, 'h',round(n*0.75), 'lms',1, 'nsamp',1000, 'PredictorNames',{'X1','X2'}, 'thresh',0.10
fsr
—Method to use for robust regression. Boolean. If fsr is true, univariate robust regression is based on function FSR (forward search), else it is based on function LXS (LTS or LMS). The default value of fsr is true.
Example: 'fsr',false
Data Types: logical
h
—The number of observations that have determined the robust regression estimator for each univariate regression. Scalar. h is an integer greater than or equal to [(n+p+1)/2] but smaller than n.
Example: 'h',round(n*0.75)
Data Types: double
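For instance, reading p in this formula as the two parameters of each univariate fit (intercept plus one slope), with n=30 the constraint becomes 16 ≤ h ≤ 29, and h=round(30*0.75)=23 satisfies it.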
lms
—Criterion to use to find the robust beta in the univariate regressions. Scalar. If lms=1 (default) Least Median of Squares is computed, else Least Trimmed Squares is computed.
Example: 'lms',1
Data Types: double
nsamp
—Number of subsamples which will be extracted to find the robust estimator. Scalar. If nsamp=0, all subsets will be extracted. They will be (n choose p).
Remark: if the number of all possible subsets is <1000 the default is to extract all subsets, otherwise just 1000.
Example: 'nsamp',1000
Data Types: double
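For instance, reading p as the size of the elemental subsets of each univariate fit (two units, for intercept plus slope), with n=30 there are nchoosek(30,2)=435 possible subsets; since 435 < 1000, by default all of them would be extracted.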
PredictorNames
—Names of the explanatory variables. Cell array of characters | string array. Cell array of characters or string array of length p containing the names of the explanatory variables.
If PredictorNames is missing and X is not a table, the following sequence of strings will be automatically created for labelling the columns of matrix X: "X1", "X2", "X3", ..., "Xp".
Example: 'PredictorNames',{'X1','X2'}
Data Types: cell array of characters or string array
thresh
—Threshold which defines the variables to retain. Scalar | struct. The default value of thresh is 0.10, that is, all variables whose p-value in univariate robust regression is smaller than 0.10 are retained.
If thresh is a struct, it is possible to specify whether the threshold is based on (robust) p-values or (robust) R2.
If thresh is a struct, it may contain one of the following fields:
Value | Description |
---|---|
pval | all variables which in univariate regression have a p-value smaller than or equal to thresh.pval are selected. |
pvalrob | all variables which in univariate regression have a robust p-value smaller than or equal to thresh.pvalrob are selected. |
R2 | all variables which in univariate regression have an R2 greater than or equal to thresh.R2 are selected. |
R2rob | all variables which in univariate regression have a robust R2 greater than or equal to thresh.R2rob are selected. |
Note that if thresh is a struct with more than one of these fields an error is produced, because just one among thresh.pval, thresh.pvalrob, thresh.R2 and thresh.R2rob must be present.
Example: 'thresh',0.10
Data Types: double
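For instance, to retain only the variables whose non-robust univariate p-value does not exceed 0.05, the pval field listed above can be used:
thresh = struct;
thresh.pval = 0.05;        % threshold on the non-robust p-values
Tsel = univariatems(y, X, 'thresh', thresh);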
theoreticalSigns
—Theoretical sign which the beta from univariate regression must have. Vector of length p. 1 denotes a positive sign for the corresponding variable, -1 denotes a negative sign, and 0 denotes that any sign is possible. For example, if p is equal to 5, theoreticalSigns=[1 1 0 -1 1] means that variables 1, 2 and 5 must have a positive robust estimate of the slope in the univariate regression, while variable 4 must have a negative estimate of the robust beta coefficient. Finally, variable 3 can have any sign. If theoreticalSigns is empty or not specified, no filter based on sign is applied.
Example: 'theoreticalSigns',[-1 1 1 0 -1]
Data Types: double
Tsel
—Details of the variables which were important from univariate analysis.
Table. The details of table Tsel are as follows.
The rownames of this table contain the names of the variables.
1st col: the indexes of the important variables;
2nd-4th col: estimates of beta, p-value and R2 from non-robust univariate regression;
5th-7th col: estimates of beta, p-value and R2 from robust univariate regression;
8th col: number of units declared as outliers in robust univariate regression.
The rows of table Tsel are ordered in terms of variable importance, in the sense that row 1 refers to the variable with the highest robust R2 (smallest robust p-value), row 2 contains the variable with the second highest robust R2 (second smallest robust p-value), and so on.
If no explanatory variable survives the criteria, Tsel is a 0×8 empty table.
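The table can then be used to subset X for a subsequent selection step, as in the stepwiselm/lasso example above (column i holds the indexes of the retained variables):
% Keep only the columns that survived the univariate screening
Xsel = array2table(X(:,Tsel.i), 'VariableNames', Tsel.Properties.RowNames);
mdl  = stepwiselm(Xsel, y, 'Upper', 'linear', 'Verbose', 0);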
Texcl
—Details of the variables which were not important from univariate analysis.
Table. The details of table Texcl are as follows.
The rownames of this table contain the names of the excluded variables.
1st col: the indexes of the excluded variables;
2nd-4th col: estimates of beta, p-value and R2 from non-robust univariate regression;
5th-7th col: estimates of beta, p-value and R2 from robust univariate regression;
8th col: number of units declared as outliers in robust univariate regression.
If no explanatory variable is excluded, Texcl is a 0×8 empty table.