univariatems

univariatems performs preliminary univariate robust model selection in linear regression

Syntax

  • Tsel=univariatems(y,X)
  • Tsel=univariatems(y,X,Name,Value)
  • [Tsel, Texcl]=univariatems(___)

Description

When the number of potential explanatory variables is very large, it is useful to identify, before running the proper variable selection procedure, the subset of variables which can surely be excluded. This procedure estimates a univariate regression model (with intercept) between each column of X and the response. Only the variables whose univariate regression has an R2 greater than (or a p-value smaller than) a certain threshold are retained. By default the p-value or R2 threshold is based on robust univariate models (with intercept), but non-robust models can be chosen using option thresh. Option fsr enables the user to select the preferred robust regression procedure.
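As an illustration of the screening idea (not the actual implementation, which by default uses robust fits), the non-robust version of the procedure can be sketched with fitlm; the names y, X and the 0.10 cut-off below are assumptions for the sketch:

```matlab
% Sketch of the non-robust univariate screening (illustration only:
% univariatems uses robust fits by default). Assumes numeric y (n-by-1)
% and X (n-by-p).
pvalThresh=0.10;
p=size(X,2);
pval=zeros(p,1);
for j=1:p
    mdl=fitlm(X(:,j),y);                 % univariate regression with intercept
    pval(j)=mdl.Coefficients.pValue(2);  % p-value of the slope
end
selected=find(pval<=pvalThresh);         % columns that survive the screening
```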

Tsel = univariatems(y, X) Call of univariatems with all default arguments.

Tsel = univariatems(y, X, Name, Value) Call of univariatems with a personalized threshold for p-value.

[Tsel, Texcl] = univariatems(___) Call of univariatems with a personalized threshold based on R2rob.

Examples


  • Call of univariatems with all default arguments.
  • n=30;
    out=simulateLM(n,'R2',0.8);
    y=out.y;
    % Just the first 3 variables are significant
    X=[out.X randn(n,57)];
    Tsel=univariatems(y,X);
    disp('Table with the details of the vars which are significant in univ. regr.')
    disp(Tsel)

  • Call of univariatems with a personalized threshold for p-value.
  • Generate a regression model with 3 expl. variables.

    n=30;
    out=simulateLM(n,'R2',0.5);
    y=out.y;
    % Just the first 3 variables are significant
    X=[out.X randn(n,7)];
    % The variables which are retained are those for which the robust p-value
    % of univariate regression is not greater than 0.2
    mypval=0.20;
    Tsel=univariatems(y,X,'thresh',mypval);

  • Call of univariatems with a personalized threshold based on R2rob.
  • n=30;
    out=simulateLM(n,'R2',0.8,'nexpl',5);
    y=out.y;
    % Just the first 5 variables are significant
    X=[out.X randn(n,20)];
    % The variables which are retained are those for which the robust R2 
    % of univariate regression is not smaller than 0.08
    thresh=struct;
    thresh.R2rob=0.08;
    Tsel=univariatems(y,X,'thresh',thresh);
    disp(Tsel)

    Related Examples


  • Compare stepwiselm and lasso with and without preliminary calling univariatems.
  • rng(1)
    n=30;
    out=simulateLM(n,'R2',0.8);
    y=out.y;
    % Just variables 6:8 are significant
    X=[randn(n,5) out.X randn(n,52)];
    % Use the default threshold based on robust p-values
    Tsel=univariatems(y,X);
    Xtab=array2table(X(:,Tsel.i),'VariableNames',Tsel.Properties.RowNames);
    mdl=stepwiselm(Xtab,y,'Upper','linear','Verbose',0);
    fprintf('<strong>Final chosen model after univariatems filter and then stepwiselm</strong>')
    disp(mdl)
    disp('<strong>Final chosen model after applying directly stepwiselm</strong>')
    mdltrad=stepwiselm(X,y,'Upper','linear','Verbose',0);
    disp(mdltrad)
    % Lasso part
    cvlassoParameter=5;
    [B,FitInfo] = lasso(Xtab{:,:},y,'CV',cvlassoParameter);
    idxLambdaMinMSE = FitInfo.IndexMinMSE;
    seqp=1:size(Xtab,2);
    minMSEModelPredictors = seqp(B(:,idxLambdaMinMSE)~=0);
    mdl=fitlm(Xtab(:,minMSEModelPredictors),y);
    disp('<strong>Final chosen model after univariatems filter and then lasso</strong>')
    disp(mdl)
    disp('<strong>Final chosen model after applying directly lasso</strong>')
    [B,FitInfo] = lasso(X,y,'CV',cvlassoParameter);
    idxLambdaMinMSE = FitInfo.IndexMinMSE;
    seqp=1:size(X,2);
    minMSEModelPredictors = seqp(B(:,idxLambdaMinMSE)~=0);
    mdl=fitlm(X(:,minMSEModelPredictors),y,'VarNames',["X"+minMSEModelPredictors "y"]);
    disp(mdl)
    <strong>Final chosen model after univariatems filter and then stepwiselm</strong>
    Linear regression model:
        y ~ 1 + X6 + X7 + X8
    
    Estimated Coefficients:
                       Estimate      SE        tStat       pValue  
                       ________    _______    _______    __________
    
        (Intercept)    0.046141    0.17791    0.25935       0.79741
        X6               1.0712    0.16059     6.6705    4.4672e-07
        X7              0.94031     0.1986     4.7348    6.7681e-05
        X8              0.83425    0.19591     4.2584    0.00023769
    
    
    Number of observations: 30, Error degrees of freedom: 26
    Root Mean Squared Error: 0.962
    R-squared: 0.763,  Adjusted R-Squared: 0.735
    F-statistic vs. constant model: 27.9, p-value = 2.78e-08
    <strong>Final chosen model after applying directly stepwiselm</strong>
    
    Linear regression model:
        y ~ 1 + x6 + x7 + x8 + x14 + x33 + x38 + x42 + x43 + x60
    
    Estimated Coefficients:
                       Estimate       SE        tStat       pValue  
                       ________    ________    _______    __________
    
        (Intercept)    0.036482     0.11885    0.30695       0.76205
        x6               1.0523    0.097845     10.755    9.1827e-10
        x7               1.3944     0.14186     9.8291     4.224e-09
        x8              0.75831     0.12019     6.3095    3.6942e-06
        x14            -0.17313    0.076607    -2.2599      0.035139
        x33            -0.26114     0.10221     -2.555      0.018873
        x38             0.47328     0.13463     3.5155     0.0021758
        x42             0.68126     0.17095     3.9852    0.00072821
        x43             0.33055     0.11266     2.9339     0.0082075
        x60            -0.85072     0.18011    -4.7234    0.00013016
    
    
    Number of observations: 30, Error degrees of freedom: 20
    Root Mean Squared Error: 0.551
    R-squared: 0.94,  Adjusted R-Squared: 0.913
    F-statistic vs. constant model: 35, p-value = 2.59e-10
    <strong>Final chosen model after univariatems filter and then lasso</strong>
    
    Linear regression model:
        y ~ 1 + X6 + X32 + X7 + X8 + X26 + X15 + X50
    
    Estimated Coefficients:
                       Estimate      SE        tStat       pValue  
                       ________    _______    _______    __________
    
        (Intercept)    0.039614    0.18901    0.20959       0.83592
        X6               0.9015    0.18904     4.7689    9.2322e-05
        X32             0.22098    0.21242     1.0403        0.3095
        X7              0.83778    0.22394      3.741     0.0011319
        X8              0.79432    0.20381     3.8974    0.00077435
        X26            0.091074    0.25484    0.35738       0.72421
        X15             0.10139    0.17995    0.56344       0.57884
        X50             0.19768    0.19432     1.0173       0.32009
    
    
    Number of observations: 30, Error degrees of freedom: 22
    Root Mean Squared Error: 0.977
    R-squared: 0.793,  Adjusted R-Squared: 0.728
    F-statistic vs. constant model: 12.1, p-value = 3.03e-06
    <strong>Final chosen model after applying directly lasso</strong>
    
    Linear regression model:
        y ~ 1 + X4 + X6 + X7 + X8 + X11 + X19 + X32 + X48
    
    Estimated Coefficients:
                       Estimate      SE        tStat       pValue  
                       ________    _______    _______    __________
    
        (Intercept)      0.1483    0.14682       1.01       0.32397
        X4               0.2335    0.13506     1.7289      0.098499
        X6              0.90094    0.14003     6.4338    2.2382e-06
        X7              0.69224    0.16398     4.2216    0.00038249
        X8              0.56179    0.17276     3.2518     0.0038152
        X11            -0.40163    0.15509    -2.5896        0.0171
        X19            -0.33449    0.15965    -2.0952       0.04846
        X32             0.48524    0.16388      2.961     0.0074569
        X48             0.32956    0.13666     2.4115      0.025125
    
    
    Number of observations: 30, Error degrees of freedom: 21
    Root Mean Squared Error: 0.712
    R-squared: 0.895,  Adjusted R-Squared: 0.855
    F-statistic vs. constant model: 22.4, p-value = 1.23e-08
    

  • Example of use of option theoreticalSigns.
  • rng(1)
    n=30;
    thresh=struct;
    thresh.pval=0.10;
    out=simulateLM(n,'R2',0.8);
    y=out.y;
    % Just the first 3 variables are significant
    X=[out.X randn(n,57)];
    % Suppose that it is known that variables 11:20 must have a positive sign in
    % univariate regressions
    theoreticalSigns=zeros(1,size(X,2));
    theoreticalSigns(11:20)=1;
    % and that variables 41:50 must have a negative sign in
    % univariate regressions
    theoreticalSigns(41:50)=-1;
    % call to univariatems with option theoreticalSigns
    [Tsel,Texcl]=univariatems(y,X,'theoreticalSigns',theoreticalSigns);

  • Example of X as a table and a variable inside X is categorical.
  • This example is taken from https://stats.oarc.ucla.edu/r/dae/tobit-models/. Consider the situation in which we have a measure of academic aptitude (scaled 200-800) which we want to model using reading and math test scores, as well as the type of program the student is enrolled in (academic, general, or vocational). For this example tobit regression (see function regressCens) would be more appropriate; we just use this dataset to show the case in which one of the variables is categorical and X is a table.

    XX=readtable("https://stats.idre.ucla.edu/stat/data/tobit.csv","ReadRowNames",true);
    XX.prog=categorical(XX.prog);
    % Define y and X
    y=XX(:,"apt");
    X=XX(:,["read", "math" "prog"]);
    % In this case both y and X are tables
    [Tsel,Texcl]=univariatems(y,X);

    Input Arguments


    y — Response variable. Vector or table with just one column.

    A vector with n elements that contains the response variable.

    Missing values (NaN's) and infinite values (Inf's) are allowed, since observations (rows) with missing or infinite values will automatically be excluded from the computations.

    Data Types: vector or table

    X — Predictor variables. Matrix or table.

    Data matrix of explanatory variables (also called 'regressors') of dimension (n-by-p). Note that in this case p can be greater than n. Rows of X represent observations, and columns represent variables. Missing values (NaN's) and infinite values (Inf's) are allowed, since observations (rows) with missing or infinite values will automatically be excluded from the computations. If X is a table, function dummyvar is called for the variables which are defined as categorical, and the p-value (R2) refers to the univariate model containing all the levels (minus 1) of the categorical variable.

    Data Types: matrix or table
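    The categorical expansion mentioned above can be illustrated with dummyvar; the variable prog below is a made-up example:

```matlab
% dummyvar creates one indicator column per level of the categorical
% variable; dropping one level avoids collinearity with the intercept.
prog=categorical({'academic';'general';'vocational';'general'});
D=dummyvar(prog);   % 4-by-3 indicator matrix, one column per level
Dred=D(:,2:end);    % all levels minus 1, as used in the univariate model
```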

    Name-Value Pair Arguments

    Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside single quotes (' '). You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

    Example: 'fsr',false , 'h',round(n*0.75) , 'lms',1 , 'nsamp',1000 , 'PredictorNames',{'X1','X2'} , 'thresh',0.10

    fsr — Method to use for robust regression. Boolean.

    If fsr is true, the robust univariate regressions are based on function FSR (forward search); otherwise they are based on function LXS (LTS or LMS). The default value of fsr is true.

    Example: 'fsr',false

    Data Types: logical

    h — The number of observations that determine the robust regression estimator in each univariate regression. Scalar.

    h is an integer greater than or equal to [(n+p+1)/2] but smaller than n.

    Example: 'h',round(n*0.75)

    Data Types: double

    lms — Criterion to use to find the robust beta in the univariate regressions. Scalar.

    If lms=1 (default), Least Median of Squares is computed; otherwise Least Trimmed Squares is computed.

    Example: 'lms',1

    Data Types: double

    nsamp — Number of subsamples which will be extracted to find the robust estimator. Scalar.

    Number of subsamples which will be extracted to find the robust estimator. If nsamp=0, all (n choose p) subsets will be extracted.

    Remark: if the number of all possible subsets is <1000 the default is to extract all subsets, otherwise just 1000.

    Example: 'nsamp',1000

    Data Types: double

    PredictorNames — Names of the explanatory variables. Cell array of characters | string array.

    Cell array of characters or string array of length p containing the names of the explanatory variables.

    If PredictorNames is missing and X is not a table, the following sequence of strings is automatically created for labelling the columns of matrix X: "X1", "X2", "X3", ..., "Xp".

    Example: 'PredictorNames',{'X1','X2'}

    Data Types: cell array of characters or string array

    thresh — Threshold which defines the variables to retain. Scalar | struct.

    The default value of thresh is 0.10, that is, all variables which in univariate robust regression have a p-value smaller than 0.10 are retained.

    If thresh is a struct it is possible to specify whether the threshold is based on (robust) p-values or (robust) R2.

    If thresh is a struct it may contain exactly one of the following fields:

    pval

    all variables which in univariate regression have a (non-robust) p-value smaller than or equal to thresh.pval are selected.

    pvalrob

    all variables which in univariate regression have a robust p-value smaller than or equal to thresh.pvalrob are selected.

    R2

    all variables which in univariate regression have an R2 greater than or equal to thresh.R2 are selected.

    R2rob

    all variables which in univariate regression have a robust R2 greater than or equal to thresh.R2rob are selected.

    Note that if thresh is a struct with more than one of these fields an error is produced, because just one among thresh.pval, thresh.pvalrob, thresh.R2 and thresh.R2rob must be present.

    Example: 'thresh',0.10

    Data Types: double
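    For instance, a struct threshold based on robust p-values (field pvalrob, not shown in the examples above) can be passed as follows, assuming y and X are already defined:

```matlab
% Retain only the variables whose robust univariate p-value is <= 0.05
thresh=struct;
thresh.pvalrob=0.05;
Tsel=univariatems(y,X,'thresh',thresh);
```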

    theoreticalSigns — Theoretical sign which the beta from each univariate regression must have. Vector of length p.

    1 denotes a positive sign for the corresponding variable, -1 denotes a negative sign, and 0 denotes that any sign is possible. For example, if p is equal to 5, theoreticalSigns=[1 1 0 -1 1] means that variables 1, 2 and 5 must have a positive robust estimate of the slope in the univariate regression, while variable 4 must have a negative estimate of the robust beta coefficient. Finally, variable 3 can have any sign. If theoreticalSigns is empty or not specified, no filter based on sign is applied.

    Example: 'theoreticalSigns',[-1 1 1 0 -1]

    Data Types: double

    Output Arguments


    Tsel — Details of the variables which were important from univariate analysis. Table.

    The details of table Tsel are as follows.

    The rownames of this table contain the names of the variables.

    1st col: the indexes of the important variables;

    2nd-4th col: estimates of beta, p-value and R2 from non-robust univariate regression;

    5th-7th col: estimates of beta, p-value and R2 from robust univariate regression;

    8th col: number of units declared as outliers in robust univariate regression.

    The rows of table Tsel are ordered in terms of variable importance, in the sense that row 1 refers to the variable with the highest robust R2 (smallest robust p-value), row 2 to the variable with the second highest robust R2 (second smallest robust p-value), and so on.

    If no explanatory variable survives the criteria, Tsel is a 0×8 empty table

    Texcl — Details of the variables which were not important from univariate analysis. Table.

    The details of table Texcl are as follows.

    The rownames of this table contain the names of the excluded variables.

    1st col: the indexes of the excluded variables;

    2nd-4th col: estimates of beta, p-value and R2 from non-robust univariate regression;

    5th-7th col: estimates of beta, p-value and R2 from robust univariate regression;

    8th col: number of units declared as outliers in robust univariate regression.

    If no explanatory variable is excluded Texcl is a 0×8 empty table.

    References

    See Also


    This page has been automatically generated by our routine publishFS