# mdpattern

mdpattern finds and plots missing data patterns

## Syntax

• Mispat=mdpattern(Y)
• Mispat=mdpattern(Y,Name,Value)
• [Mispat,tMisAndOut]=mdpattern(___)

## Description

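 Mispat = mdpattern(Y) finds and plots the missing data patterns in Y. See the example "mdpattern with table input".

 Mispat = mdpattern(Y, Name, Value) uses additional options specified by one or more Name,Value pair arguments. See the example "Example of the use of options dispresults and plots".

 [Mispat, tMisAndOut] = mdpattern(___) also returns tMisAndOut, a table of missing values and univariate outliers for each variable. See the example "Example of mdpattern with timetable input".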

## Examples


### mdpattern with table input.

Load the nhanes data. The nhanes data is a dataset with 25 observations on the following 4 variables.

% age, Age group (1=20-39, 2=40-59, 3=60+).
% bmi, Body mass index (kg/m**2).
% hyp, Hypertensive (1=no,2=yes).
% chl, Total serum cholesterol (mg/dL).
% namvar array of strings containing the names of the columns of X.
namvar=["age"  "bmi" "hyp" "chl"];
X=[1   NaN  NaN  NaN
2 22.7   1 187
1   NaN   1 187
3   NaN  NaN  NaN
1 20.4   1 113
3   NaN  NaN 184
1 22.5   1 118
1 30.1   1 187
2 22.0   1 238
2   NaN  NaN  NaN
1   NaN  NaN  NaN
2   NaN  NaN  NaN
3 21.7   1 206
2 28.7   2 204
1 29.6   1  NaN
1   NaN  NaN  NaN
3 27.2   2 284
2 26.3   2 199
1 35.3   1 218
3 25.5   2  NaN
1   NaN  NaN  NaN
1 33.2   1 229
1 27.5   1 131
3 24.9   1  NaN
2 27.4   1 186];
Xtable=array2table(X,VariableNames=namvar);
[Mispat,tMisAndOut]=mdpattern(Xtable);
Detailed explanation of the "Missing data pattern figure"
Top axis contains the names of the variables.
Big circle means missing value; smaller filled dot represents non missing value.
Left axis shows the number of observations for each pattern
For example number 13 shows that the associated pattern is repeated 13 times.
The sum of the numbers on the left axis is n, the total number of rows.
Right axis counts the variables with missing values and
it is equal to the number of big circles in the corresponding row.
The number of missing values for each variable is shown on the bottom axis.


### Example of the use of options dispresults and plots.

Load the nhanes data. The nhanes data is a dataset with 25 observations on the following 4 variables.

% age, Age group (1=20-39, 2=40-59, 3=60+).
% bmi, Body mass index (kg/m**2).
% hyp, Hypertensive (1=no,2=yes).
% chl, Total serum cholesterol (mg/dL).
% namvar array of strings containing the names of the columns of X.
namvar=["age"  "bmi" "hyp" "chl"];
X=[1   NaN  NaN  NaN
2 22.7   1 187
1   NaN   1 187
3   NaN  NaN  NaN
1 20.4   1 113
3   NaN  NaN 184
1 22.5   1 118
1 30.1   1 187
2 22.0   1 238
2   NaN  NaN  NaN
1   NaN  NaN  NaN
2   NaN  NaN  NaN
3 21.7   1 206
2 28.7   2 204
1 29.6   1  NaN
1   NaN  NaN  NaN
3 27.2   2 284
2 26.3   2 199
1 35.3   1 218
3 25.5   2  NaN
1   NaN  NaN  NaN
1 33.2   1 229
1 27.5   1 131
3 24.9   1  NaN
2 27.4   1 186];
Xtable=array2table(X,VariableNames=namvar);
% The plot is not shown.
plots=false;
% Option dispresults is set to true, therefore a detailed explanation of
% the content of the two output tables is shown in the command window.
dispresults=true;
[Mispat,tMisAndOut]=mdpattern(Xtable,'plots',plots,'dispresults',dispresults);
Table which shows missing values patterns
               NrowsWithPattern    age    hyp    bmi    chl    NvarWithMis
               ________________    ___    ___    ___    ___    ___________

Pattern1              13            1      1      1      1          0
Pattern2               3            1      1      1      0          1
Pattern3               1            1      1      0      1          1
Pattern4               1            1      0      0      1          2
Pattern5               7            1      0      0      0          3
totPatOrMis           25            0      8      9     10         27

0 means missing value and 1 represents non missing value
First column contains the number of observations for each pattern
For example number 13 shows that the associated pattern is repeated 13 times
The sum of the numbers in the first column is n, that is the total number of rows
The last column shows the number of variables with missing values for that particular pattern
------------------------
Missing value and outlier report
        Mean     Median     Stdev      MAD      Count_miss    Perc_miss    outInf    outSup
       ______    ______    _______    ______    __________    _________    ______    ______

age      1.76        2     0.83066    1.4826         0            0          0         0
bmi    26.562    26.75      4.2152    4.5961         9           36          0         0
hyp    1.2353        1     0.43724         0         8           32          4         0
chl     191.4      187      45.215    28.169        10           40          1         3

Columns outInf and outSup contain the number of units which are
above x0.75+1.5*IQR or below x0.25-1.5*IQR, where IQR is the interquartile range


### Example of mdpattern with timetable input.

TT = readtimetable('outages.csv');
[A,B]=mdpattern(TT(:,["Loss" "Customers" ]))

## Related Examples


### An example with 2 simulated patterns of missing values.

close all
n=10000;
p=10;
X=randn(n,p);
% Create first missing  data pattern
n1=300; n2=3;
rowsWithMis=randsample(n,n1);
colsWithMis=randsample(p,n2);
X(rowsWithMis,colsWithMis)=NaN;
% Create second missing  data pattern
n1=120; n2=5;
rowsWithMis=randsample(n,n1);
colsWithMis=randsample(p,n2);
X(rowsWithMis,colsWithMis)=NaN;
mdpattern(X);
Detailed explanation of the "Missing data pattern figure"
Top axis contains the names of the variables.
Big circle means missing value; smaller filled dot represents non missing value.
Left axis shows the number of observations for each pattern
For example number 9581 shows that the associated pattern is repeated 9581 times.
The sum of the numbers on the left axis is n, the total number of rows.
Right axis counts the variables with missing values and
it is equal to the number of big circles in the corresponding row.
The number of missing values for each variable is shown on the bottom axis.


## Input Arguments

### Y — Input data. Matrix | table | timetable.

Data matrix (2D array), table or timetable containing $n$ observations on $v$ quantitative variables.

Data Types: single | double

### Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside single quotes (' '). You can specify several name and value pair arguments in any order as  Name1,Value1,...,NameN,ValueN.

Example:  'Lc',{'Y1' 'Y2' 'Y3' 'Y4'} , 'plots',false , 'dispresults',false 

### Lc — Vector of column labels. Cell array of characters | string array.

Lc contains the labels of the columns of the input array Y. This option is unnecessary if Y is a table, because in this case Lc=Y.Properties.VariableNames;

Example:  'Lc',{'Y1' 'Y2' 'Y3' 'Y4'} 

Data Types: cell array of characters | string array

### plots — Plot on the screen. Boolean.

If plots = true (default), a plot which displays the missing data patterns is shown on the screen.

The top axis contains the names of the variables. A big circle means a missing value; a smaller filled dot represents a non-missing value. The left axis shows the number of observations for each pattern. For example, the number 40 shows that the associated pattern is repeated 40 times.

The sum of the numbers on the left axis is n, the total number of rows. The right axis counts the variables with missing values and is equal to the number of big circles in the corresponding row.

Example:  'plots',false 

Data Types: Boolean

### dispresults — Display results on the screen. Boolean.

If dispresults is true (default), the two output tables Mispat and tMisAndOut are displayed on the screen.

Example:  'dispresults',false 

Data Types: Boolean

## Output Arguments

### Mispat — Missing values pattern. Table.

Table with size (k+1)x(v+2), where k is the total number of missing values patterns which are present in the data matrix. The first k rows contain the patterns.

The last row contains n and then the total number of missing values in each column. The first column contains the number of observations for each pattern.

The columns of Mispat which refer to the variables are sorted in non-decreasing number of missing values. The last column contains the number of variables with missing values for each pattern.
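The layout described above can be reproduced for a small matrix using base MATLAB functions only. The following is an illustrative sketch, not the code used inside mdpattern: the toy matrix and the variable names obs, pat, nRows, nVarMis and nMisByVar are assumptions introduced here.

```matlab
% Toy data (hypothetical): 5 observations on 3 variables
X = [1 NaN 3; 4 5 6; 7 NaN 9; NaN NaN 1; 2 3 4];
% 1 = observed value, 0 = missing value (same coding as in Mispat)
obs = double(~isnan(X));
% Distinct patterns and, for each row of X, the index of its pattern
[pat,~,idx] = unique(obs,'rows');
% Number of observations sharing each pattern
nRows = accumarray(idx,1);
% Number of variables with missing values in each pattern
nVarMis = sum(pat==0,2);
% Order the patterns from the most complete to the least complete
[nVarMis,ord] = sort(nVarMis);
pat = pat(ord,:);
nRows = nRows(ord);
% Total number of missing values in each column
nMisByVar = sum(obs==0,1);
% One row per pattern: count, pattern, number of variables with missing
disp([nRows pat nVarMis])
```

The sum of nRows equals the number of rows of X, which mirrors the first entry of the last row of Mispat.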

### tMisAndOut — Missing values and univariate outliers for each variable. Table.

The rows of this table are associated with the variables.

The columns contain a series of statistics.

More precisely:

Columns 1 to 4 contain the mean, median, standard deviation and rescaled MAD (median absolute deviation).

Fifth column (Count_miss) contains the number of missing values for each variable.

Sixth column (Perc_miss) contains the percentage of missing data for each variable.

The seventh and eighth columns (outInf and outSup) contain the number of outliers respectively in the left and right tail of the distribution. The criterion to decide whether a unit is an outlier is based on the boxplot concept, that is, the outliers are the units which are above x0.75+1.5*IQR or below x0.25-1.5*IQR, where IQR is the interquartile range.
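The missing value counts, the rescaled MAD and the boxplot rule can be sketched with base MATLAB functions as follows. This is a hedged illustration on toy data, not the implementation used by mdpattern. The variable names are hypothetical, the factor 1.4826 is the usual consistency constant for normal data (it agrees with the MAD reported for age in the example output), and the quantile convention used for x0.25 and x0.75 is an assumption, so counts near the fences may differ from those returned by the function.

```matlab
% Toy data (hypothetical): 4 observations on 3 variables
X = [1 NaN 3; 2 5 NaN; 3 NaN 9; 4 7 12];
n = size(X,1);
isMis = isnan(X);
countMiss = sum(isMis,1);              % missing values per variable
percMiss = 100*countMiss/n;            % percentage of missing values
% Rescaled MAD: 1.4826 * median absolute deviation from the median
med = median(X,1,'omitnan');
scaledMAD = 1.4826*median(abs(X-med),1,'omitnan');

% Boxplot rule for one variable, after dropping its missing values
x = [1 2 3 4 NaN 5 6 7 8 100];
xs = sort(x(~isnan(x)));
m = numel(xs);
% Assumed quantile convention: sorted values are the (i-0.5)/m quantiles,
% with linear interpolation in between
q = @(p) interp1(((1:m)-0.5)/m, xs, p);
x25 = q(0.25);
x75 = q(0.75);
IQR = x75-x25;
nBelow = sum(xs < x25-1.5*IQR);        % units below the lower fence
nAbove = sum(xs > x75+1.5*IQR);        % units above the upper fence
```

In this toy vector only the value 100 lies beyond a fence, so nAbove is 1 and nBelow is 0.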

## References

Schafer, J.L. (1997). "Analysis of Incomplete Multivariate Data". London: Chapman & Hall.