FSM gives an automatic outlier detection procedure in multivariate analysis

n=200; v=3; randn('state', 123456); Y=randn(n,v); % Contaminated data Ycont=Y; Ycont(1:5,:)=Ycont(1:5,:)+3; [out]=FSM(Ycont);

------------------------- Signal detection loop dmin(195,200)>99.9% and dmin(196,200)>99.9% and rmin(194,200)>99% drmin(194,200)>99.9% and dmin(195,200)>99.9% and rmin(196,200)>99% dmin(195,200)>99.999% ------------------- Signal validation Validated signal ------------------------------- Start resuperimposing envelopes from step m=194 Superimposition stopped because d_{min}(195,198)>99% envelope $d_{min}(195,198)>99$\% envelope Subsample of 197 units is homogeneous ---------------------------- Final output Number of units declared as outliers=3 Summary of the exceedances 1 99 999 9999 99999 0 6 5 3 3

FSM with plots showing envelope superimposition.

n=200; v=3; randn('state', 123456); Y=randn(n,v); % Contaminated data Ycont=Y; Ycont(1:5,:)=Ycont(1:5,:)+3; [out]=FSM(Ycont,'plots',2);

------------------------- Signal detection loop dmin(195,200)>99.9% and dmin(196,200)>99.9% and rmin(194,200)>99% drmin(194,200)>99.9% and dmin(195,200)>99.9% and rmin(196,200)>99% dmin(195,200)>99.999% ------------------- Signal validation Validated signal ------------------------------- Start resuperimposing envelopes from step m=194 Superimposition stopped because d_{min}(195,198)>99% envelope $d_{min}(195,198)>99$\% envelope Subsample of 197 units is homogeneous ---------------------------- Final output Number of units declared as outliers=3 Summary of the exceedances 1 99 999 9999 99999 0 6 5 3 3

n=200; v=3; randn('state', 123456); Y=randn(n,v); % Contaminated data Ycont=Y; Ycont(1:5,:)=Ycont(1:5,:)+3; plots=struct; plots.ncoord=1; [out]=FSM(Ycont,'plots',plots);

------------------------- Signal detection loop dmin(195,200)>99.9% and dmin(196,200)>99.9% and rmin(194,200)>99% drmin(194,200)>99.9% and dmin(195,200)>99.9% and rmin(196,200)>99% dmin(195,200)>99.999% ------------------- Signal validation Validated signal ------------------------------- Start resuperimposing envelopes from step m=194 Superimposition stopped because d_{min}(195,198)>99% envelope $d_{min}(195,198)>99$\% envelope Subsample of 197 units is homogeneous ---------------------------- Final output Number of units declared as outliers=3 Summary of the exceedances 1 99 999 9999 99999 0 6 5 3 3

n=1000; v=10; Y=randn(n,v); [out]=FSM(Y,'init',200,'plots',0);

n=100; v=3; Y=randn(n,v); % Contaminated data Ycont=Y; Ycont(1:5,:)=Ycont(1:5,:)+3; [out]=FSM(Ycont,'m0',5,'crit','md');

load('swiss_banknotes'); % Monitor the exceedances of Minimum Mahalanobis Distance [out]=FSM(swiss_banknotes.data(101:200,:),'plots',1); % Control minimum and maximum on the x axis plots=struct; plots.xlim=[60 90]; [out]=FSM(swiss_banknotes.data(101:200,:),'plots',plots); % Monitor the exceedances of Minimum Mahalanobis Distance using normal coordinates for mmd. plots.ncoord=1; [out]=FSM(swiss_banknotes.data(101:200,:),'plots',plots);

`Y`

— Input data.
Matrix.n x v data matrix; n observations and v variables. Rows of Y represent observations, and columns represent variables.

Missing values (NaN's) and infinite values (Inf's) are allowed, since observations (rows) with missing or infinite values will automatically be excluded from the computations.

**
Data Types: **`single|double`

Specify optional comma-separated pairs of `Name,Value`

arguments.
`Name`

is the argument name and `Value`

is the corresponding value. `Name`

must appear
inside single quotes (`' '`

).
You can specify several name and value pair arguments in any order as ```
Name1,Value1,...,NameN,ValueN
```

.

```
'm0',5
```

,```
'crit','md'
```

,```
'rf',0.9
```

,```
'init',50
```

,```
'plots',2
```

,```
'bonflev',0.7
```

,```
'msg',0
```

,```
'nocheck',1
```

`m0`

—Initial subset size or vector which contains the list of the units forming
initial subset.scalar | vector.The default is to start the search with v+1 units which consisting of those observations which are not outlying on any scatterplot, found as the intersection of all points lying within a robust contour containing a specified portion of the data (Riani and Zani 1997) and inside the univariate boxplot. Remark: if m0 is a vector option below crit is ignored.

**Example: **```
'm0',5
```

**Data Types: **`double`

`crit`

—It specified the criterion to be used to
initialize the search.character.if crit='md' the units which form initial subset are those which have the smallest m0 pseudo Mahalanobis distances computed using procedure unibiv (bivariate robust ellipses).

if crit='biv' sorting is done first in terms of times units fell outside robust bivariate ellipses and then in terms of pseudoMD. In other words, the units forming initial subset are chosen first among the set of those which never fell outside robust bivariate ellipses then among those which fell only once outside bivariate ellipses ... up to reach m0.

if crit='uni' sorting is done first in terms of times units fell outside univariate boxplots and then in terms of pseudoMD. In other words, the units forming initial subset are chosen first among the set of those which never fell outside univariate boxplots then among those which fell only once outside univariate boxplots... up to reach m0.

Remark: as the user can see the starting point of the search is not going to affect at all the results of the analysis. The user can explore this point with his own datasets.

Remark: if crit='biv' the user can also supply in scalar rf (see below) the confidence level of the bivariate ellipses.

**Example: **```
'crit','md'
```

**Data Types: **`char`

`rf`

—confidence level for bivariate ellipses.scalar.The default is 0.95. This option is useful only if crit='biv'.

**Example: **```
'rf',0.9
```

**Data Types: **`double`

`init`

—Point where to start monitoring required diagnostics.scalar.Note that if bsb is suppliedinit>=length(bsb). If init is not specified it will be set equal to floor(n*0.6).

**Example: **```
'init',50
```

**Data Types: **`double`

`plots`

—plot of minimum Mahalanobis distance.scalar | structure.If plots is a missing value or is a scalar equal to 0 no plot is produced.

If plots is a scalar equal to 1 (default) the plot of minimum MD with envelopes based on n observations and the scatterplot matrix with the outliers highlighted is produced.

If plots is a scalar equal to 2 the additional plots of envelope resuperimposition are produced.

If plots is a structure it may contain the following fields:

Value | Description |
---|---|

`ylim` |
vector with two elements controlling minimum and maximum on the y axis. Default value is '' (automatic scale); |

`xlim` |
vector with two elements controlling minimum and maximum on the x axis. Default value is '' (automatic scale); |

`resuper` |
vector which specifies for which steps it is necessary to show the plots of resuperimposed envelopes if resuper is not supplied a plot of each step in which the envelope is resuperimposed is shown. Example if resuper =[85 87] plots of resuperimposedenvelopes are shown at steps m=85 and m=87; |

`ncoord` |
scalar. If ncoord=1 plots are shown in normal coordinates else (default) plots are shown in traditional mmd coordinates; |

`labeladd` |
If this option is '1', the outliers in the spm are labelled with the unit row index. The default value is labeladd='', i.e. no label is added; |

`nameY` |
cell array of strings containing the labels of the variables. As default value, the labels which are added are Y1, ...Yv; |

`lwd` |
Scalar which controls line width of the curve which contains the monitoring of minimum Mahalanobis distance. Default line of lwd=2. |

`lwdenv` |
Scalar which controls linewidth of the envelopes. Default value of lwdenv=2. |

**Example: **```
'plots',2
```

**Data Types: **`double`

`bonflev`

—option that might be used to identify extreme outliers
when the distribution of the data is strongly non normal.scalar.In these circumstances, the general signal detection rule based on consecutive exceedances cannot be used. In this case bonflev can be:

- a scalar smaller than 1, which specifies the confidence level for a signal and a stopping rule based on the comparison of the minimum deletion residual with a Bonferroni bound. For example if bonflev=0.99 the procedure stops when the trajectory exceeds for the first time the 99% bonferroni bound.

- A scalar value greater than 1. In this case the procedure stops when the residual trajectory exceeds for the first time this value.

Default value is ' ', which means to rely on general rules based on consecutive exceedances.

**Example: **```
'bonflev',0.7
```

**Data Types: **`double`

`msg`

—It controls whether to display or not messages
on the screen.scalar.If msg==1 (default) messages about the progression of the search are displayed on the screen otherwise only error messages will be displayed.

**Example: **```
'msg',0
```

**Data Types: **`double`

`nocheck`

—It controls whether to perform checks on matrix Y.scalar.If nocheck is equal to 1 no check is performed.

As default nocheck=0.

**Example: **```
'nocheck',1
```

**Data Types: **`double`

`out`

— description
StructureStructure which contains the following fields

Value | Description |
---|---|

`outliers` |
k x 1 vector containing the list of the units declared as outliers or NaN if the sample is homogeneous |

`mmd` |
(n-init) x 2 matrix. 1st col = fwd search index; 2nd col = value of minimum Mahalanobis Distance in each step of the fwd search. |

`Un` |
(n-init) x 11 Matrix which contains the unit(s) included in the subset at each step of the fwd search. REMARK: in every step the new subset is compared with the old subset. Un contains the unit(s) present in the new subset but not in the old one. Un(1,2) for example contains the unit included in step init+1. Un(end,2) contains the units included in the final step of the search. |

`nout` |
2 x 5 matrix containing the number of times mmd went out of particular quantiles. First row contains quantiles 1 99 99.9 99.99 99.999 per cent; Second row contains the frequency distribution. It is NaN if bonflev threshold is used. |

`loc` |
1 x v vector containing location of the data. |

`cov` |
v x v robust estimate of covariance matrix. |

`md` |
n x 1 vector containing the estimates of the robust Mahalanobis distances (in squared units). This vector contains the distances of each observation from the location of the data, relative to the scatter matrix cov. |

`class` |
'FSM'. |

Riani, M., Atkinson, A.C. and Cerioli, A. (2009), Finding an unknown number of multivariate outliers, "Journal of the Royal Statistical Society Series B", Vol. 71, pp. 201-221.

Cerioli, A., Farcomeni, A. and Riani M. (2014), Strong consistency and robustness of the Forward Search estimator of multivariate location and scatter, "Journal of Multivariate Analysis", Vol. 126, pp. 167-183, http://dx.doi.org/10.1016/j.jmva.2013.12.010