mdMARsimulate

mdMARsimulate generates missing values under a MAR logistic mechanism.

Syntax

  • Ymar = mdMARsimulate(Y)
  • Ymar = mdMARsimulate(Y,Name,Value)
  • [Ymar,out] = mdMARsimulate(___)

Description

This function introduces missing values in a data matrix Y using a Missing At Random (MAR) mechanism. The probability that an entry is missing is modeled through a logistic regression whose covariates are the observed columns of Y specified by obsCols.

The logistic link is evaluated using the Statistics and Machine Learning Toolbox probability distribution object created by pdLogistic = makedist('Logistic','mu',0,'sigma',1);

Therefore, if $\eta_{ij} = \alpha_j + x_i' \beta_j$, where $x_i$ contains the values of row i of Y in the columns obsCols, the missingness probability is computed as $\Pr(M_{ij}=1 \mid x_i)$ = cdf(pdLogistic,eta_ij), which is equal to

\[ \frac{1}{1+\exp(-\eta_{ij})}. \]

If obsCols=1 and missCols=2:p, the mechanism is

\[ \Pr(M_{ij}=1 \mid Y_{i1}) = \mathrm{cdf}(\mathrm{pdLogistic},\, \alpha_j + \beta_j Y_{i1}), \qquad j = 2, \ldots, p, \]

where $M_{ij}=1$ denotes a missing entry. The intercept $\alpha_j$ is chosen so that the expected missingness proportion in the corresponding missing column is equal to missRate.
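
The calibration of the intercept can be illustrated with a minimal sketch for a single missing column. It assumes that the "expected missingness proportion" is approximated by the sample mean of the logistic probabilities over the n rows, and it evaluates the logistic cdf in closed form rather than through makedist, so that no toolbox is required. The variable names (x, betaj, alphaj, g) are illustrative and are not taken from the function's source code.

```matlab
% Sketch: calibrate the intercept alpha_j for one missing column so that the
% average missingness probability equals the target rate.
rng(1708)                                % reproducible draw
n        = 1000;
x        = randn(n,1);                   % observed driver column (finite values)
betaj    = 1.5;                          % logistic slope for this column
missRate = 0.3;                          % target missingness proportion

% Closed form of the Logistic(mu=0,sigma=1) cdf
logisticCdf = @(eta) 1./(1+exp(-eta));

% Mean probability minus target: increasing in alpha, so the root is unique
g = @(alpha) mean(logisticCdf(alpha + betaj*x)) - missRate;
alphaj = fzero(g,[-10 10]);              % [-10 10] brackets the root here

prob = logisticCdf(alphaj + betaj*x);    % Pr(M_ij = 1 | x_i)
M    = rand(n,1) < prob;                 % Bernoulli missingness indicators

fprintf('alpha = %.4f, mean prob = %.4f, realized rate = %.4f\n', ...
    alphaj, mean(prob), mean(M));
```

The mean of prob matches missRate up to the tolerance of fzero, while the realized proportion mean(M) fluctuates around it because the indicators are random draws.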

Ymar = mdMARsimulate(Y): MAR missingness driven by the first variable.

Ymar = mdMARsimulate(Y,Name,Value): Different logistic slopes for different missing columns.

[Ymar,out] = mdMARsimulate(___): MAR mechanism driven by two observed variables.

Examples

  • MAR missingness driven by the first variable.
    rng(1708)
    n = 1000;
    p = 4;
    Y = randn(n,p);
    % Column 1 drives the missingness generated in columns 2:p
    [Ymar,out] = mdMARsimulate(Y,'missRate',0.3,'beta',1.5, ...
        'obsCols',1,'missCols',2:p,'plots',1);
    mdpattern(Ymar)
    disp(out.patternTable)
    Detailed explanation of the "Missing data pattern figure"
    Top axis contains the names of the variables.
    Big circle means missing value; smaller filled dot represents non-missing value.
    Left axis shows the number of observations for each pattern.
    The sum of the numbers on the left axis is n, the total number of rows.
    Right axis counts the variables with missing values.
    The number of missing values for each variable is shown on the bottom axis.
    
    ans =
    
      9×6 table
    
                       NrowsWithPattern     Y1       Y2        Y3        Y4      NvarWithMis
                       ________________    ____    ______    ______    ______    ___________
    
        Pattern1            436.00         1.00      1.00      1.00      1.00        0.00   
        Pattern2            113.00         1.00      1.00      1.00      0.00        1.00   
        Pattern3             98.00         1.00      1.00      0.00      1.00        1.00   
        Pattern4             60.00         1.00      1.00      0.00      0.00        2.00   
        Pattern5             83.00         1.00      0.00      1.00      1.00        1.00   
        Pattern6             54.00         1.00      0.00      1.00      0.00        2.00   
        Pattern7             66.00         1.00      0.00      0.00      1.00        2.00   
        Pattern8             90.00         1.00      0.00      0.00      0.00        3.00   
        totPatOrMis        1000.00         0.00    293.00    314.00    317.00      924.00   
    
        Pattern_0obs_1mis    Count     Proportion
        _________________    ______    __________
    
            {'0000'}         436.00       0.44   
            {'0001'}         113.00       0.11   
            {'0010'}          98.00       0.10   
            {'0111'}          90.00       0.09   
            {'0100'}          83.00       0.08   
            {'0110'}          66.00       0.07   
            {'0011'}          60.00       0.06   
            {'0101'}          54.00       0.05   
    
    

  • Different logistic slopes for different missing columns.
    rng(1708)
    Y = randn(1000,4);
    % One slope per missing column: columns 2, 3 and 4 use 0.5, 1.5 and 3
    [Ymar,out] = mdMARsimulate(Y,'missRate',0.3,'obsCols',1, ...
        'missCols',2:4,'beta',[0.5 1.5 3],'msg',true);
    disp(out.alpha)
    Target NA proportion in missCols:     0.3         0.3         0.3
    Generated NA proportion in missCols:  0.3110
    New NA proportion in missCols:        0.3110
    Obtained NA proportion in missCols:   0.3110
    Obtained NA proportion in all Ymar:   0.2333
             -0.90         -1.20         -1.88
    
    

  • MAR mechanism driven by two observed variables.
    rng(1708)
    Y = randn(1000,5);
    % Rows of B refer to obsCols (1 and 2), columns of B to missCols (3, 4 and 5)
    B = [1.2 0.5 1.0; -0.7 1.5 0.2];
    [Ymar,out] = mdMARsimulate(Y,'missRate',[0.2 0.3 0.4], ...
        'obsCols',[1 2],'missCols',3:5,'beta',B);
    disp(out.patternTable)
        Pattern_0obs_1mis    Count     Proportion
        _________________    ______    __________
    
            {'00000'}        376.00       0.38   
            {'00001'}        158.00       0.16   
            {'00010'}        142.00       0.14   
            {'00011'}        109.00       0.11   
            {'00101'}         80.00       0.08   
            {'00100'}         76.00       0.08   
            {'00111'}         34.00       0.03   
            {'00110'}         25.00       0.03   
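
The pattern strings shown in the tables of these examples (0 for observed, 1 for missing, sorted by decreasing count) can be reproduced from any matrix containing NaN values. The following is a minimal sketch on a small hand-made matrix; it is not the code used by mdMARsimulate, and the variable names are illustrative.

```matlab
% Sketch: build pattern strings (0 = observed, 1 = missing) and their counts.
Ymar = [ 1  NaN  3;
         4   5   6;
         7  NaN  9;
         1   2  NaN;
         2   3   4];

M = isnan(Ymar);                              % logical missingness mask
[patterns,~,idx] = unique(M,'rows');          % distinct row patterns
counts = accumarray(idx,1);                   % number of rows per pattern
[counts,order] = sort(counts,'descend');      % most frequent pattern first
patterns = patterns(order,:);

labels = cellstr(char(patterns + '0'));       % e.g. '010' for a NaN in column 2
T = table(labels,counts,counts/size(Ymar,1), ...
    'VariableNames',{'Pattern_0obs_1mis','Count','Proportion'});
disp(T)
```

The counts always sum to the number of rows, which is a quick consistency check on any pattern table.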
    
    

    Input Arguments

    Y — Input data matrix. Matrix.

    n-by-p numeric matrix. Rows are observations and columns are variables. Missing values already present in Y are preserved. The columns used to drive the MAR mechanism, specified by option obsCols, must contain finite values.

    Data Types: single | double
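
The two requirements stated for Y (finite values in the driver columns, and preservation of the NaN values already present) can be checked as in the sketch below. This is an illustration under assumptions, not the validation code of the function: newMask is a hypothetical mask standing in for the newly generated missing cells.

```matlab
% Sketch of the input requirements described for Y (not the toolbox code).
Y = randn(6,4);
Y(2,3) = NaN;                      % missing value already present in Y
obsCols  = 1;
missCols = 2:4;

% The driver columns must be finite, otherwise the linear predictor is undefined
Yobs = Y(:,obsCols);
if any(~isfinite(Yobs(:)))
    error('Columns listed in obsCols must contain finite values only.');
end

% Hypothetical mask of newly generated missing cells (rows 1 and 5 of missCols)
newMask = false(size(Y));
newMask([1 5],missCols) = true;

Ymar = Y;
Ymar(newMask) = NaN;               % NaN values are only added, never removed

preserved = all(isnan(Ymar(isnan(Y))));   % old NaN cells are still NaN
nMissing  = sum(isnan(Ymar(:)));          % 6 new cells plus 1 pre-existing
```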

    Name-Value Pair Arguments

    Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside single quotes (' '). You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

    Example: 'obsCols',[1 3], 'missCols',2:5, 'missRate',0.2, 'beta',1.5, 'alphaInterval',[-20 20], 'plots',true, 'msg',true

    obsCols — Observed columns driving the MAR mechanism. Vector.

    Vector containing the indices of the fully observed variables used in the logistic missingness model. The default value is 1.

    Example: 'obsCols',[1 3]

    Data Types: double

    missCols — Columns in which missing values are generated. Vector.

    Vector containing the column indices in which additional missing values are introduced. The default value is 2:p.

    Example: 'missCols',2:5

    Data Types: double

    missRate — Desired missingness proportion in missCols. Scalar | vector.

    Scalar in the interval [0,1], or a row/column vector with one value for each column specified in missCols. If missRate is scalar, the same target missingness proportion is used for all columns in missCols.

    The default value is 0.3.

    Example: 'missRate',0.2

    Data Types: double

    beta — Logistic regression coefficients. Scalar | vector | matrix.

    If beta is scalar, the same coefficient is used for all variables in obsCols and missCols. If beta is a vector with length equal to numel(obsCols), the same linear predictor is used for all columns in missCols. If beta is a vector with length equal to numel(missCols) and numel(obsCols)=1, a different coefficient is used for each missing column. If beta is a matrix of size numel(obsCols)-by-numel(missCols), each missing column has its own logistic slope vector. The default value is 1.5.

    Example: 'beta',1.5

    Data Types: double
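
The four accepted shapes of beta can all be reduced to a numel(obsCols)-by-numel(missCols) matrix, which is also the size reported for out.beta. The sketch below shows one way to perform this expansion. The branch order and the error message are assumptions made for illustration, not the function's actual implementation.

```matlab
% Sketch: expand beta to a numel(obsCols)-by-numel(missCols) matrix.
obsCols  = [1 2];
missCols = 3:5;
beta     = [1.2; -0.7];            % one coefficient per observed column

q = numel(obsCols);
m = numel(missCols);

if isscalar(beta)
    B = repmat(beta,q,m);          % same coefficient everywhere
elseif isvector(beta) && numel(beta)==q
    B = repmat(beta(:),1,m);       % same linear predictor for every missing column
elseif isvector(beta) && q==1 && numel(beta)==m
    B = beta(:)';                  % one slope per missing column
elseif isequal(size(beta),[q m])
    B = beta;                      % already in full matrix form
else
    error('beta has a size that is incompatible with obsCols and missCols.');
end
disp(B)
```

With the values above, every column of B equals [1.2; -0.7], so the three missing columns share the same linear predictor and differ only through their intercepts.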

    alphaInterval — Initial interval for alpha search. Vector.

    Two-element vector used as the starting bracketing interval for fzero. If the root is not bracketed, the interval is expanded automatically. The default value is [-10 10].

    Example: 'alphaInterval',[-20 20]

    Data Types: double
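
The automatic expansion of the bracketing interval can be sketched as follows. The documentation does not state the expansion rule, so doubling the width around the midpoint and capping the number of expansions (maxExpansions) are assumptions chosen for illustration. The starting interval is deliberately too narrow for the extreme target rate used here.

```matlab
% Sketch: widen a starting interval until it brackets the root, then call fzero.
rng(1708)
x        = randn(1000,1);          % observed driver column
betaj    = 1.5;                    % logistic slope
missRate = 0.99;                   % extreme target, needs a large intercept
g = @(alpha) mean(1./(1+exp(-(alpha + betaj*x)))) - missRate;

interval      = [-1 1];            % deliberately too narrow
maxExpansions = 50;                % safety cap (assumption)
k = 0;
while sign(g(interval(1))) == sign(g(interval(2)))
    k = k + 1;
    if k > maxExpansions
        error('Could not bracket the root of the calibration equation.');
    end
    c = mean(interval);            % keep the midpoint fixed
    w = diff(interval);            % current width
    interval = c + [-w w];         % double the width (assumed rule)
end
alphaj = fzero(g,interval);
fprintf('bracket = [%g %g], alpha = %.4f\n',interval(1),interval(2),alphaj);
```

Because the mean probability is increasing in alpha, a sign change at the endpoints is enough to guarantee a unique root inside the final interval.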

    plots — Plot missingness patterns. Boolean.

    If plots is equal to 1 (true), a bar plot of the missingness patterns is produced. If plots is equal to 0 (false), no plot is produced. The default value is 0 (false).

    Example: 'plots',true

    Data Types: double

    msg — Level of output to display. Boolean.

    If msg is true, a compact summary of the target and obtained missingness proportions is printed on the screen. The default value is false.

    Example: 'msg',true

    Data Types: logical

    Output Arguments

    Ymar — Data matrix with MAR missing values. Matrix.

    n-by-p matrix equal to Y except for the additional NaN values generated in the columns specified by missCols.

    out — Diagnostic information. Structure.

    Structure which contains the following compact diagnostic fields:

    Value                        Description
    alpha                        1-by-numel(missCols) vector containing the intercepts used in the logistic missingness models.
    beta                         numel(obsCols)-by-numel(missCols) matrix containing the logistic slope coefficients used for each missing column.
    missRateTarget               1-by-numel(missCols) vector containing the target missingness proportions.
    missRateGeneratedMissCols    Scalar containing the proportion of generated Bernoulli missing indicators in missCols.
    missRateNewMissCols          Scalar containing the proportion of newly generated NaN values in missCols, excluding cells that were already NaN in Y.
    missRateObtainedMissCols     Scalar containing the final proportion of NaN values in missCols.
    missRateObtainedAll          Scalar containing the final proportion of NaN values in the whole matrix Ymar.
    patternTable                 Table summarizing missingness patterns, counts and proportions. In the pattern strings, 0 means observed and 1 means missing.
    obsCols                      Columns used to drive the MAR mechanism.
    missCols                     Columns in which missing values were generated.
    class                        'mdMARsimulate'.

    Large diagnostic objects such as the full probability matrix, the missingness masks and the LogisticDistribution object are not stored in out, in order to keep the output structure compact.
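
The difference between the four realized proportions stored in out only shows up when Y already contains NaN values. The sketch below computes analogous quantities from hypothetical masks. The constant probability 0.3 is a placeholder for the logistic probabilities, and the choice of all cells of missCols as the denominator of the "new" proportion is an assumption, since the documentation does not specify it.

```matlab
% Sketch: four realized proportions computed from hypothetical masks.
rng(1708)
n = 1000; p = 4;
Y = randn(n,p);
Y(1:50,2) = NaN;                                 % pre-existing missing values
missCols = 2:p;

prob    = 0.3*ones(n,numel(missCols));           % placeholder probabilities
genMask = rand(n,numel(missCols)) < prob;        % generated Bernoulli indicators
oldMask = isnan(Y(:,missCols));                  % cells already NaN in Y

Ymar = Y;
tmp  = Ymar(:,missCols);
tmp(genMask) = NaN;                              % add the generated missing cells
Ymar(:,missCols) = tmp;

finalMask = isnan(Ymar(:,missCols));
rateGenerated = mean(genMask(:));                % cf. missRateGeneratedMissCols
rateNew       = mean(genMask(:) & ~oldMask(:));  % cf. missRateNewMissCols
rateObtained  = mean(finalMask(:));              % cf. missRateObtainedMissCols
rateAll       = mean(isnan(Ymar(:)));            % cf. missRateObtainedAll
```

When Y has no missing values, the first three quantities coincide, and the proportion over the whole matrix is the proportion in missCols scaled by numel(missCols)/p; this agrees with the second example, where 0.3110 in three of four columns gives 0.2333 overall.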

    References

    Little, R.J.A. and Rubin, D.B. (2019), "Statistical Analysis with Missing Data", 3rd edition, Wiley.

    Rubin, D.B. (1976), Inference and missing data, "Biometrika", Vol. 63, pp. 581-592.
