mdpattern

mdpattern finds and plots missing data patterns

Syntax

  • Mispat=mdpattern(Y)
  • Mispat=mdpattern(Y,Name,Value)
  • [Mispat,tMisAndOut]=mdpattern(___)

Description

Mispat = mdpattern(Y) mdpattern with table input.

Mispat = mdpattern(Y, Name, Value) Example of the use of options dispresults and plots.

[Mispat, tMisAndOut] = mdpattern(___) Example of mdpattern with timetable input.

Examples

  • mdpattern with table input.
  • Load the nhanes data. The nhanes data is a dataset with 25 observations on the following 4 variables.

    % age, Age group (1=20-39, 2=40-59, 3=60+).
    % bmi, Body mass index (kg/m**2).
    % hyp, Hypertensive (1=no,2=yes).
    % chl, Total serum cholesterol (mg/dL).
    % namvar array of strings containing the names of the columns of X.
    namvar=["age"  "bmi" "hyp" "chl"];
    X=[1   NaN  NaN  NaN
    2 22.7   1 187
    1   NaN   1 187
    3   NaN  NaN  NaN
    1 20.4   1 113
    3   NaN  NaN 184
    1 22.5   1 118
    1 30.1   1 187
    2 22.0   1 238
    2   NaN  NaN  NaN
    1   NaN  NaN  NaN
    2   NaN  NaN  NaN
    3 21.7   1 206
    2 28.7   2 204
    1 29.6   1  NaN
    1   NaN  NaN  NaN
    3 27.2   2 284
    2 26.3   2 199
    1 35.3   1 218
    3 25.5   2  NaN
    1   NaN  NaN  NaN
    1 33.2   1 229
    1 27.5   1 131
    3 24.9   1  NaN
    2 27.4   1 186];
    Xtable=array2table(X,VariableNames=namvar);
    [Mispat,tMisAndOut]=mdpattern(Xtable);
    Detailed explanation of the "Missing data pattern figure"
    Top axis contains the names of the variables.
    Big circle means missing value; smaller filled dot represents non missing value.
    Left axis shows the number of observations for each pattern
    For example number 13 shows that the associated pattern is repeated 13 times.
    The sum of the numbers on the left axis is n, the total number of rows.
    Right axis counts the variables with missing values and
    it is equal to the number of big circles in the corresponding row.
    The number of missing values for each variable is shown on the bottom axis.

  • Example of the use of options dispresults and plots.
  • Load the nhanes data. The nhanes data is a dataset with 25 observations on the following 4 variables.

    % age, Age group (1=20-39, 2=40-59, 3=60+).
    % bmi, Body mass index (kg/m**2).
    % hyp, Hypertensive (1=no,2=yes).
    % chl, Total serum cholesterol (mg/dL).
    % namvar array of strings containing the names of the columns of X.
    namvar=["age"  "bmi" "hyp" "chl"];
    X=[1   NaN  NaN  NaN
    2 22.7   1 187
    1   NaN   1 187
    3   NaN  NaN  NaN
    1 20.4   1 113
    3   NaN  NaN 184
    1 22.5   1 118
    1 30.1   1 187
    2 22.0   1 238
    2   NaN  NaN  NaN
    1   NaN  NaN  NaN
    2   NaN  NaN  NaN
    3 21.7   1 206
    2 28.7   2 204
    1 29.6   1  NaN
    1   NaN  NaN  NaN
    3 27.2   2 284
    2 26.3   2 199
    1 35.3   1 218
    3 25.5   2  NaN
    1   NaN  NaN  NaN
    1 33.2   1 229
    1 27.5   1 131
    3 24.9   1  NaN
    2 27.4   1 186];
    Xtable=array2table(X,VariableNames=namvar);
    % Plot is not shown.
    plots=false;
    % option dispresults shows a detailed explanation of
    % the content of two output matrices in the command window.
    dispresults=true;
    [Mispat,tMisAndOut]=mdpattern(Xtable,'plots',plots,'dispresults',dispresults);
    Table which shows missing values patterns
                       NrowsWithPattern    age    hyp    bmi    chl    NvarWithMis
                       ________________    ___    ___    ___    ___    ___________
    
        Pattern1              13            1      1      1      1          0     
        Pattern2               3            1      1      1      0          1     
        Pattern3               1            1      1      0      1          1     
        Pattern4               1            1      0      0      1          2     
        Pattern5               7            1      0      0      0          3     
        totPatOrMis           25            0      8      9     10         27     
    
    0 means missing value and 1 represents non missing value
    First column contains the number of observations for each pattern
    For example number 13 shows that the associated pattern is repeated 13 times
    The sum of the numbers in the first column is n, that is the total number of rows
    The last column shows the number of variables with missing values for that particular pattern
    ------------------------
    Missing value and outlier report
                Mean     Median     Stdev      MAD      Count_miss    Perc_miss    outInf    outSup
               ______    ______    _______    ______    __________    _________    ______    ______
    
        age      1.76        2     0.83066    1.4826         0            0          0         0   
        bmi    26.562    26.75      4.2152    4.5961         9           36          0         0   
        hyp    1.2353        1     0.43724         0         8           32          0         4   
        chl     191.4      187      45.215    28.169        10           40          3         1   
    
    Columns outInf and outSup contain the number of units which are
    above x0.75+1.5*IQR or below x0.25-1.5*IQR, where IQR is the interquartile range
    

  • Example of mdpattern with timetable input.
  • TT = readtimetable('outages.csv');
    [A,B]=mdpattern(TT(:,["Loss" "Customers" ]))

    Related Examples

  • An example with 2 simulated patterns of missing values.
  • close all
    n=10000;
    p=10;
    X=randn(n,p);
    % Create first missing  data pattern
    n1=300; n2=3;
    rowsWithMis=randsample(n,n1);
    colsWithMis=randsample(p,n2);
    X(rowsWithMis,colsWithMis)=NaN;
    % Create second missing  data pattern
    n1=120; n2=5;
    rowsWithMis=randsample(n,n1);
    colsWithMis=randsample(p,n2);
    X(rowsWithMis,colsWithMis)=NaN;
    mdpattern(X);
    Detailed explanation of the "Missing data pattern figure"
    Top axis contains the names of the variables.
    Big circle means missing value; smaller filled dot represents non missing value.
    Left axis shows the number of observations for each pattern
    For example number 9582 shows that the associated pattern is repeated 9582 times.
    The sum of the numbers on the left axis is n, the total number of rows.
    Right axis counts the variables with missing values and
    it is equal to the number of big circles in the corresponding row.
    The number of missing values for each variable is shown on the bottom axis.

    Input Arguments

    Name-Value Pair Arguments

    Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside single quotes (' '). You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

    Example: 'Lc',{'Y1' 'Y2' 'Y3' 'Y4'} , 'plots',false , 'dispresults',true

    Lc —Vector of column labels. Cell array of characters | string array.

    Lc contains the labels of the columns of the input array Y. This option is unnecessary if Y is a table, because in that case Lc=Y.Properties.VariableNames.

    Example: 'Lc',{'Y1' 'Y2' 'Y3' 'Y4'}

    Data Types: cell array of characters or string.
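    The Lc option matters mainly when Y is a plain numeric matrix. The snippet below is a hypothetical usage sketch: the small matrix and the labels are chosen only for illustration, and the call is guarded so that it is skipped when FSDA (and therefore mdpattern) is not on the MATLAB path.

```matlab
% Hypothetical usage sketch: pass column labels for a numeric matrix.
% The data and the labels are illustrative only.
X = [1 NaN NaN
     2 22.7 1
     1 NaN  1
     3 21.7 1];
labels = {'age' 'bmi' 'hyp'};   % one label per column of X

% mdpattern belongs to the FSDA toolbox; skip the call if it is missing.
if exist('mdpattern', 'file') == 2
    Mispat = mdpattern(X, 'Lc', labels, 'plots', false);
    disp(Mispat)
end
```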

    plots —Plot on the screen. Boolean.

    If plots = true (default), a plot which displays the missing data patterns is shown on the screen.

    The top axis contains the names of the variables. A big circle means missing value; a smaller filled dot represents a non missing value. The left axis shows the number of observations for each pattern. For example, the number 40 shows that the associated pattern is repeated 40 times.

    The sum of the numbers on the left axis is n, the total number of rows. The right axis counts the variables with missing values and is equal to the number of big circles in the corresponding row.

    Example: 'plots',false

    Data Types: Boolean

    dispresults —Display results on the screen. Boolean.

    If dispresults is true, the two output tables Mispat and tMisAndOut are displayed on the screen.

    The default value of dispresults is false.

    Example: 'dispresults',true

    Data Types: Boolean

    Output Arguments

    Mispat —missing values pattern. table

    Table with size (k+1)x(v+2), where k is the total number of missing values patterns which are present in the data matrix and v is the number of variables. The first k rows contain the patterns (1 represents a non missing value and 0 a missing value).

    The last row contains n and then the total number of missing values in each column. The first column contains the number of observations for each pattern.

    The columns of Mispat are sorted in non decreasing number of missing values. The last column contains the number of variables with missing values for each pattern.
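    The quantities stored in Mispat can be illustrated with base MATLAB functions. The following is a minimal sketch, not the FSDA implementation: it uses six rows taken from the nhanes example, the variable names are hypothetical, and the patterns come out in the order returned by unique, which need not match the row order used by mdpattern.

```matlab
% Sketch (not the FSDA source): tabulate missing data patterns.
% 1 = non missing value, 0 = missing value, as in Mispat.
X = [1 NaN  NaN NaN
     2 22.7 1   187
     1 NaN  1   187
     3 NaN  NaN 184
     1 29.6 1   NaN
     2 27.4 1   186];

observed = ~isnan(X);                       % n-by-v indicator matrix
[pat, ~, idx] = unique(observed, 'rows');   % distinct patterns
nRowsWithPattern = accumarray(idx, 1);      % observations per pattern
nVarWithMis = sum(~pat, 2);                 % missing variables per pattern
nMisPerVar = sum(~observed, 1);             % missing values per column

% The pattern counts add up to n, the number of rows of X.
assert(sum(nRowsWithPattern) == size(X, 1))
disp([nRowsWithPattern double(pat) nVarWithMis])
```

    Sorting the columns of pat by nMisPerVar would give the non decreasing column order described for Mispat.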

    tMisAndOut —missing values and univariate outliers for each variable.

    The rows of this table are associated with the variables.

    The columns contain a series of statistics.

    More precisely: columns 1:4 contain the mean, the median, the standard deviation and the rescaled MAD (median absolute deviation).

    The fifth column (Count_miss) contains the number of missing values for each variable.

    The sixth column (Perc_miss) contains the percentage of missing data for each variable.

    The seventh and eighth columns (outInf and outSup) contain the number of outliers in the left and right tail of the distribution, respectively. The criterion to decide whether a unit is an outlier is based on the boxplot concept: the outliers are the units which are below x0.25-1.5*IQR or above x0.75+1.5*IQR, where IQR is the interquartile range.
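    The boxplot rule and the rescaled MAD can be sketched as follows. This is an illustration under stated assumptions rather than the FSDA code: it uses the non missing chl values of the nhanes example, it assumes that the quantiles are computed as MATLAB's quantile function does, and it takes 1.4826 as the rescaling constant of the MAD.

```matlab
% Sketch of the boxplot outlier rule; not the FSDA source code.
% chl values from the nhanes example, missing values already removed.
x = [187 187 113 184 118 187 238 206 204 284 199 218 229 131 186]';

q = quantile(x, [0.25 0.75]);         % x0.25 and x0.75
iqrx = q(2) - q(1);                   % interquartile range
outInf = sum(x < q(1) - 1.5*iqrx);    % units below the lower fence
outSup = sum(x > q(2) + 1.5*iqrx);    % units above the upper fence

% Rescaled MAD: the factor 1.4826 (an assumption of this sketch) makes
% the MAD comparable with the standard deviation under normality.
madResc = 1.4826 * median(abs(x - median(x)));

fprintf('outInf=%d outSup=%d MAD=%.3f\n', outInf, outSup, madResc);
```

    For these values the sketch gives outInf=3, outSup=1 and a rescaled MAD of 28.169, in line with the chl row of the report shown in the second example.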

    References

    Schafer, J.L. (1997). "Analysis of Incomplete Multivariate Data". London: Chapman & Hall.
