tclustreg performs robust linear grouping analysis

The X data have been introduced by Gordaliza, Garcia-Escudero & Mayo-Iscar (2013). The dataset presents two parallel components without contamination.

X = load('X.txt'); y1 = X(:,end); X1 = X(:,1:end-1); k = 2 ; restrfact = 5; alpha1 = 0.05 ; alpha2 = 0.01; out = tclustreg(y1,X1,k,restrfact,alpha1,alpha2); restrfact = 2; alpha1 = 0.05 ; alpha2 = 0.01; out = tclustreg(y1,X1,k,restrfact,alpha1,alpha2,'mixt',2); % Comparison with robust linear grouping out = rlga(X,k,alpha2);

ClaLik with untrimmed units selected using crisp criterion Total estimated time to complete tclustreg: 3.15 seconds MixLik with untrimmed units selected using h largest lik contributions Total estimated time to complete tclustreg: 4.74 seconds RLGA Algorithm k =2 biter =7 niter =10 Finished.

clear all; close all; load fishery; X = fishery.data; % some jittering is necessary because duplicated units are not treated: % this needs to be addressed X = X + 10^(-8) * abs(randn(677,2)); %tclustreg on fishery data y1 = X(:,end); X1 = X(:,1:end-1); k = 3 ; restrfact = 50; alpha1 = 0.04 ; alpha2 = 0.01; out = tclustreg(y1,X1,k,restrfact,alpha1,alpha2,'intercept',0,'mixt',0); %lga and rlga on fishery data figure('name','RLGA'); out=lga(X,3); clickableMultiLegend('1','2','3','data1','data2','data3'); axis manual; alpha = 0.95; figure('name','LGA'); out=rlga(X,3,1-alpha); clickableMultiLegend('0','1','2','3','data1','data2','data3'); axis manual;

clear all; close all; load fishery; X=fishery.data; % some jittering is necessary because duplicated units are not treated % in tclustreg: this needs to be addressed X = X + 10^(-8) * abs(randn(677,2)); y1 = X(:,end); X1 = X(:,1:end-1); % some arbitrary weights for the units we = sqrt(X1)/sum(sqrt(X1)); % tclustreg required parameters k = 2; restrfact = 50; alpha1 = 0.04 ; alpha2 = 0.01; % now tclust is run on each combination of mixt and wtrim options disp('mixt = 0; wtrim = 0;'); disp('standard tclustreg, with classification likelihood and without thinning' ); mixt = 0; wtrim = 0; out = tclustreg(y1,X1,k,restrfact,alpha1,alpha2,'intercept',0,'mixt',mixt,'wtrim',wtrim); disp('mixt = 2; wtrim = 0;'); disp('mixture likelihood, no thinning' ); mixt = 2; wtrim = 0; out = tclustreg(y1,X1,k,restrfact,alpha1,alpha2,'intercept',0,'mixt',mixt,'wtrim',wtrim); disp('mixt = 0; wtrim = 1;'); disp('classification likelihood, thinning based on user weights' ); mixt = 0; wtrim = 1; out = tclustreg(y1,X1,k,restrfact,alpha1,alpha2,'intercept',0,'mixt',mixt,'we',we,'wtrim',wtrim); disp('mixt = 2; wtrim = 1;'); disp('mixture likelihood, thinning based on user weights' ); mixt = 2; wtrim = 1; out = tclustreg(y1,X1,k,restrfact,alpha1,alpha2,'intercept',0,'mixt',mixt,'we',we,'wtrim',wtrim); disp('mixt = 0; wtrim = 2;'); disp('classification likelihood, thinning based on retention probabilities' ); mixt = 0; wtrim = 2; we = []; out = tclustreg(y1,X1,k,restrfact,alpha1,alpha2,'intercept',0,'mixt',mixt,'wtrim',wtrim); disp('mixt = 2; wtrim = 2;'); disp('mixture likelihood, thinning based on retention probabilities' ); mixt = 2; wtrim = 2; we = []; out = tclustreg(y1,X1,k,restrfact,alpha1,alpha2,'intercept',0,'mixt',mixt,'wtrim',wtrim); disp('mixt = 0; wtrim = 3;'); disp('classification likelihood, thinning based on bernoulli weights' ); mixt = 0; wtrim = 3; we = []; out = tclustreg(y1,X1,k,restrfact,alpha1,alpha2,'intercept',0,'mixt',mixt,'wtrim',wtrim); disp('mixt = 2; wtrim = 3;'); disp('mixture likelihood, thinning based on bernoulli weights' ); mixt = 2; wtrim = 3; we = []; out = tclustreg(y1,X1,k,restrfact,alpha1,alpha2,'intercept',0,'mixt',mixt,'wtrim',wtrim); disp('mixt = 0; wtrim = 4;'); disp('classification likelihood, tandem thinning based on bernoulli weights' ); mixt = 0; wtrim = 4; we = []; out = tclustreg(y1,X1,k,restrfact,alpha1,alpha2,'intercept',0,'mixt',mixt,'wtrim',wtrim); disp('mixt = 2; wtrim = 4;'); disp('mixture likelihood, tandem thinning based on bernoulli weights' ); mixt = 2; wtrim = 4; we = []; out = tclustreg(y1,X1,k,restrfact,alpha1,alpha2,'intercept',0,'mixt',mixt,'wtrim',wtrim);

Generate mixture of regression using MixSimReg, with an average overlapping at centroids = 0.01. Use all default options.

rng(372,'twister'); p=3; k=2; Q=MixSimreg(k,p,'BarOmega',0.001); n=500; [y,X,id]=simdatasetreg(n,Q.Pi,Q.Beta,Q.S,Q.Xdistrib); % plot the dataset yXplot(y,X); % run tclustreg out=tclustreg(y,X,k,50,0.01,0.01,'intercept',1);

Generate mixture of regression using MixSimReg, with an average overlapping at centroids =0.01.

rng(372,'twister'); p=3; k=2; Q=MixSimreg(k,p,'BarOmega',0.001); n=200; [y,X,id]=simdatasetreg(n,Q.Pi,Q.Beta,Q.S,Q.Xdistrib); % plot the dataset yXplot(y,X); % Generate the elemental subsets used in tclustreg once and for all. nsamp = 100; ncomb = bc(n,p); method = [10*ones(n/2,1); ones(n/2,1)]; % weighted sampling using weights in "method" msg = 0; for i=1:k C(:,(i-1)*p+1:i*p) = subsets(nsamp, n, p, ncomb, msg, method); end % tclustreg using samples in C out=tclustreg(y,X,k,50,0.01,0.01,'nsamp',C);

`y`

— Response variable.
Vector.A vector with n elements that contains the response variable.

y can be either a row or a column vector.

**
Data Types: **`single|double`

`X`

— Explanatory variables (also called 'regressors').
Matrix.Data matrix of dimension $(n \times p-1)$. Rows of X represent observations, and columns represent variables. Missing values (NaN's) and infinite values (Inf's) are allowed, since observations (rows) with missing or infinite values will automatically be excluded from the computations.

**
Data Types: **`single|double`

`k`

— Number of clusters.
Scalar.This is a guess on the number of data groups.

**
Data Types: **`single|double`

`restrfact`

— restriction factor for regression residuals and covariance
matrices of explanatory variables.
Scalar or vector with two
elements.If restrfact is a scalar it controls the differences among group scatters of the residuals. The value 1 is the strongest restriction. If restrfactor is a vector with two elements the first element controls the differences among group scatters of the residuals and the second the differences among covariance matrices of the explanatory variables.

**
Data Types: **`single|double`

`alphaLik`

— Trimming level.
Scalar.alpha1 is a value between 0 and 0.5 or an integer specifying the number of observations which have to be trimmed. If alphaLik=0 there is no trimming. More in detail, if 0<alphaLik<1 clustering is based on h=fix(n*(1-alphaLik)) observations.

Else if alphaLik is an integer greater than 1 clustering is based on h=n-floor(alphaLik). More in detail, likelihood contributions are sorted and the units associated with the smallest n-h contributions are trimmed.

**
Data Types: **`single|double`

`alphaX`

— Second-level trimming or constrained weighted model for X.
Scalar.alphaX is a value between [0 and 1).

- If alphaX is in the interval [0 0.5] it indicates the proportion of units subject to second level trimming. In particular, if alphaX=0 there is no second-level trimming.

Note that alphaX is usually smaller than alphaLik.

- If alphaX is in the interval (0.5 1) it indicates a Bonferronized confidence level to be used to identify the units subject to second level trimming.

- If alphaX is 1 constrained weighted model for X is assumed (see Gershenfeld (1997)). The CWM estimator is able to take into account different distributions for the explanatory variables across groups, so overcoming an intrinsic limitation of mixtures of regression, where they are implicitly assumed equally distributed.

**
Data Types: **`single|double`

Specify optional comma-separated pairs of `Name,Value`

arguments.
`Name`

is the argument name and `Value`

is the corresponding value. `Name`

must appear
inside single quotes (`' '`

).
You can specify several name and value pair arguments in any order as ```
Name1,Value1,...,NameN,ValueN
```

.

```
'intercept',1
```

,```
'mixt',1
```

,```
'equalweights',true
```

,```
'nsamp',1000
```

,```
'refsteps',15
```

,```
'plots',1
```

,```
'wtrim',1
```

,```
'we',[0.2 0.2 0.2 0.2 0.2]
```

,```
'msg',1
```

`intercept`

—Indicator for constant term.scalar.If 1, a model with constant term will be fitted (default), else no constant term will be included.

**Example: **```
'intercept',1
```

**Data Types: **`double`

`mixt`

—mixture modelling or crisp assignment.scalar.Option mixt specifies whether mixture modelling or crisp assignment approach to model based clustering must be used.

In the case of mixture modelling parameter mixt also controls which is the criterior to find the untrimmed units in each step of the maximization If mixt >=1 mixture modelling is assumed else crisp assignment.

In mixture modelling the likelihood is given by

\[ \prod_{i=1}^n \sum_{j=1}^k \pi_j \phi (y_i \; x_i' , \beta_j , \sigma_j), \] while in crisp assignment the likelihood is given by \[ \prod_{j=1}^k \prod _{i\in R_j} \pi_j \phi (y_i \; x_i' , \beta_j , \sigma_j), \]where $R_j$ contains the indexes of the observations which are assigned to group $j$, Remark - if mixt>=1 previous parameter equalweights is automatically set to 1.

Parameter mixt also controls the criterion to select the units to trim if mixt == 2 the h units are those which give the largest contribution to the likelihood that is the h largest values of

\[ \sum_{j=1}^k \pi_j \phi (y_i \; x_i' , \beta_j , \sigma_j) \qquad i=1, 2, ..., n \] elseif mixt==1 the criterion to select the h units is exactly the same as the one which is used in crisp assignment. That is: the n units are allocated to a cluster according to criterion \[ \max_{j=1, \ldots, k} \hat \pi'_j \phi (y_i \; x_i' , \beta_j , \sigma_j) \]and then these n numbers are ordered and the units associated with the largest h numbers are untrimmed.

**Example: **```
'mixt',1
```

**Data Types: **`single | double`

`equalweights`

—cluster weights in the concentration and assignment steps.logical.A logical value specifying whether cluster weights shall be considered in the concentration, assignment steps and computation of the likelihood.

if equalweights = true we are (ideally) assuming equally sized groups by maximizing the likelihood

**Example: **```
'equalweights',true
```

**Data Types: **`Logical`

`nsamp`

—number of subsamples to extract.scalar | matrix with k*p columns.If nsamp is a scalar it contains the number of subsamples which will be extracted.

If nsamp=0 all subsets will be extracted.

If the number of all possible subset is <300 the default is to extract all subsets, otherwise just 300.

If nsamp is a matrix it contains in the rows the indexes of the subsets which have to be extracted. nsamp in this case can be conveniently generated by function subsets.

nsamp must have k*p columns. The first p columns are used to estimate the regression coefficient of group 1... the last p columns are used to estimate the regression coefficient of group k

**Example: **```
'nsamp',1000
```

**Data Types: **`double`

`refsteps`

—Number of refining iterations.scalar.Number of refining iterations in each subsample. Default is 10.

refsteps = 0 means "raw-subsampling" without iterations.

**Example: **```
'refsteps',15
```

**Data Types: **`single | double`

`plots`

—Plot on the screen.scalar.A flag to control the generation of the plots.

If plots=1 a plot is showed on the screen with the final allocation (and if size(X,2)==2 with the lines associated to the groups)

**Example: **```
'plots',1
```

**Data Types: **`double`

`wtrim`

—Application of observation weights.scalar.A flag taking values [0, 1, 2, 3, 4] to control the application of weights on the observations.

- If \texttt{wtrim}=0 (no weights) and \texttt{mixt}=0, the algorithm reduces to the standard tclustreg algorithm.

- If \texttt{wtrim}=0 and \texttt{mixt}=2, the maximum posterior probability $D\_i$ of equation 7 of Garcia et al. 2010 is computing by maximizing the log-likelihood contributions of the mixture model of each observation.

- If \texttt{wtrim} = 1, trimming is done by weighting the observations using values specified in vector \texttt{we}.

In this case, vector \texttt{we} must be supplied by the user. For instance, \texttt{we} = $X$.

- If \texttt{wtrim} = 2, trimming is again done by weighting the observations using values specified in vector \texttt{we}.

In this case, vector \texttt{we} is computed from the data as a function of the density estimate $\mbox{pdfe}$.

Specifically, the weight of each observation is the probability of retaining the observation, computed as

\[\mbox{pretain}_{i g} = 1 - \mbox{pdfe}_{ig}/\max_{ig}(\mbox{pdfe}_{ig})\]- If \texttt{wtrim} = 3, trimming is again done by weighting the observations using values specified in vector \texttt{we}. In this case, each element $we_i$ of vector \texttt{we} is a Bernoulli random variable with probability of success $\mbox{pdfe}_{ig}$. In the clustering framework this is done under the constraint that no group is empty.

- If \texttt{wtrim} = 4, trimming is done with the tandem approach of Cerioli and Perrotta (2014).

**Example: **```
'wtrim',1
```

**Data Types: **`double`

`we`

—Vector of observation weights.vector.A vector of size n-by-1 containing the weights to apply to each observation. Default value is a vector of ones.

**Example: **```
'we',[0.2 0.2 0.2 0.2 0.2]
```

**Data Types: **`double`

`msg`

—Level of output to display.scalar.Scalar which controls whether to display or not messages on the screen.

If msg==0 nothing is displayed on the screen.

If msg==1 (default) messages are displayed on the screen about estimated time to compute the estimator or the number of subsets in which there was no convergence.

If msg==2 detailed messages are displayed. For example the information at iteration level.

**Example: **```
'msg',1
```

**Data Types: **`single | double`

`out`

— description
StructureStructure containing the following fields

Value | Description |
---|---|

`bopt` |
$p-1 \times k$ matrix containing the regression parameters. |

`sigmaopt_0` |
$k$ row vector containing the estimated group variances. |

`sigmaopt_cons` |
$k$ row vector containing the estimated group variances corrected with asymptotic consistency factor. |

`sigmaopt_pison` |
$k$ row vector containing the estimated group variances corrected with asymptotic consistency factor and small sample correction factor of Pison et al. |

`numopt` |
$k$ column vector containing the number of observations in each cluster after the second trimming. . |

`vopt` |
Scalar. The value of the target function. |

`weopt` |
$n$ vector containing the weigths of each observation, i.e. its contribution to the estimates. |

`asig_obs_to_group` |
$n$ vector containing the cluster assigments of all n observations (trimmed observations excluded). |

`asig_obs_to_group_before_tr` |
$n$ vector containing the cluster assigments of all n observations (trimmed observations included). |

`trim1levelopt` |
$n$ vector containing the 1st level trimmed units (0) and 1st level untrimmed (1) units. |

`trim2levelopt` |
$n$ vector containing the 2nd level trimmed units (0) and 2nd level untrimmed (1) units. |

`postprobopt` |
$n$ vector containing the final posterior probability |

`varargout`

—C : Indexes of extracted subsamples.
` `

MatrixMatrix of size nsamp-by-k*p containing (in the rows) the indices of the subsamples extracted for computing the estimate.

(2008), A General Trimming Approach to Robust Cluster Analysis. Annals of Statistics, Vol.36, 1324-1345. Technical Report available at:

http://www.eio.uva.es/inves/grupos/representaciones/trTCLUST.pdf

Cerioli, A. and Perrotta, D. (2014). "Robust Clustering Around Regression Lines with High Density Regions". Advances in Data Analysis and Classification, Volume 8, Issue 1, p. 5-26.