tclustICgpcm computes tclust for different number of groups k and restr. factors $c_{det}$ and $c_{shw}$
tclustICgpcm (where the two letters IC stand for 'Information Criterion') and gpcm stands for Gaussian parimonious clustering models computes the values of BIC (MIXMIX), ICL (MIXCLA) or CLA (CLACLA), for different values of k (number of groups) and different values of $c_{det}$ (restriction factor for determinants), $c_{shw}$ (restriction factor for within group shape element) and for a prespecified level of trimming. If Parallel Computing toolbox is installed, parfor is used to compute tclust for different values of restriction factors. In order to minimize randomness, given k, the same subsets are used for each value of the restriction factors.
Example of use of input options alpha and typeIC.out
=tclustICgpcm(Y
,
Name, Value
)
Use a small number fo subsets in order to reduce execution time.
rng(100) nsamp=20; Y=load('geyser2.txt'); % If no trimming is used 4 groups are found. out=tclustICgpcm(Y,'nsamp',nsamp);
k=1 k=2 k=3 k=4 k=5
Y=load('geyser2.txt'); pa=struct; pa.pars='VVV'; rng(100) nsamp=20; alpha=0.1; whichIC='MIXMIX'; % Just check from k=2 to k=4 kk=2:4; outIC=tclustICgpcm(Y,'pa',pa,'alpha',alpha,'whichIC',whichIC,'kk',kk,'nsamp',nsamp);
rng('default') Reinitialize the random number generator to its startup configuration
rng(20000); ktrue=3; % n = number of observations n=150; % v= number of dimensions v=2; % Imposed average overlap BarOmega=0.04; restrfact=5; outg=MixSim(ktrue,v,'BarOmega',BarOmega, 'restrfactor',restrfact); % data generation given centroids and cov matrices [Y,id]=simdataset(n, outg.Pi, outg.Mu, outg.S); % Number of subsamples to extract (option nsamp) is very small % therefore a great variability is allowed. outIC=tclustICgpcm(Y,'nsamp',50);
cdet and pa.shw.
Y=load('geyser2.txt'); nsamp=100; pa=struct; pa.cdet=[2 4]; pa.shw=[8 16 32]; kk=[2 3 4 6]; out=tclustICgpcm(Y,'pa',pa,'cleanpool',false,'plots',0,'alpha',0.1,'whichIC','CLACLA','kk',kk,'nsamp',nsamp);
Y
— Input data.
Matrix.Data matrix containing n observations on v variables Rows of Y represent observations, and columns represent variables. Observations (rows) with missing (NaN) or or infinite (Inf) values will automatically be excluded from the computations.
Data Types: double
Specify optional commaseparated pairs of Name,Value
arguments.
Name
is the argument name and Value
is the corresponding value. Name
must appear
inside single quotes (' '
).
You can specify several name and value pair arguments in any order as
Name1,Value1,...,NameN,ValueN
.
, 'kk',1:4
, 'whichIC','CLACLA'
, 'alpha',0
, 'nsamp',1000
, 'RandNumbForNini',''
, 'refsteps',10
, 'reftol',1e05
, 'equalweights',true
, 'startv1',1
, 'plots',1
, 'numpool',4
, 'cleanpool',1
, 'msg',1
, 'nocheck',10
, 'Ysave',1
, 'warmup',true
, 'UnitsSameGroup',[12 20]
pa
—Constraints to apply and model specification.structure.Structure containing the following fields:
Value  Description 

pars 
type of Gaussian Parsimonious Clustering Model. Character. A 3 letter word in the set: 'VVE','EVE','VVV','EVV','VEE','EEE','VEV','EEV','VVI', 'EVI','VEI','EEI','VII','EII'. If pa is not supplied VVV model is used. 
cdet 
scalar o vector containing the values of the restriction factors for ratio of determinants which have to be tested. If pa.cdet=1 all determinants are forced to be equal. If this field is not present the values [1, 2, 4, 8, 16, 32, 64, 128] are used. See section More About for additional details. 
shw 
scalar o vector containing the values of the restriction factors to the elements of the shape matrices inside each group. If pa.shw=1 all diagonal elements of the shape matrix of cluster j (with j=1, ..., k) will be equal. If this field is not present the values [1, 2, 4, 8, 16, 32, 64, 128] are used. 
shb 
scalar in the interval [1 Inf) which specifies the the restriction which has to be applied to the elements of the shape matrices across each group. If this field is not present pa.shb=128 is used. 
maxiterS 
positive integer which specifies the maximum number of iterations to obtain the restricted shape matrix. This parameter is used by routine restrshapepars. The default value of pa.maxiterS is 5. 
maxiterR 
positive integer which specifies the maximum number of iterations to obtain the common rotation matrix in presence of varying shape. This parameter is used by routine cpcV. The default value of pa.maxiterR is 20. 
maxiterDSR 
positive integer which specifies the maximum number of iterations to obtain the requested restricted determinants, shape matrices and rotation. For all parametrizations pa.maxiterDSR is set to 1 apart from for the specifications 'VVE', 'EVE' and 'VEE'. The default value of pa.maxiterDSR is 20. 
tolS 
tolerance to use to exit the iterative procedure for estimating the shape. Scalar. The iterative procedures stops when the relative difference of a certain output matrix is smaller than itertol in two consecutive iterations. The default value of pa.tol is 1e12. 
zerotol 
tolerance value to declare all input values equal to 0 in the eigenvalues restriction routine (file restreigen.m) or in the final reconstruction of covariance matrices. The default value of zerotol is 1e10. 
msg 
boolean which if set equal to true enables to monitor the relative change of the estimates of lambda Gamma and Omega in each iteration. The defaul value of pa.msg is false, that is nothing is displayed in each iteration. 
userepmat 
scalar, which specifies whether to use implicit expansion or bsxfun. pa.userepmat =2 implies implicit expansion, pa.userepmat=1 implies use of bsxfun. The default is to use implicit expansion (faster) if verLessThanFS('9.1') is false and bsxfun if MATLAB is older than 2016b. Data Types  struct 
Example:
Data Types: struct
Example  pa=struct; pa.cdet=10;
kk
—number of mixture components.integer vector.Integer vector specifying the number of mixture components (clusters) for which the BIC is to be calculated.
Vector. The default value of kk is 1:5.
Example: 'kk',1:4
Data Types: int16  int32  single  double
whichIC
—type of information criterion.character.Character which specifies which information criteria must be computed for each k (number of groups) and each value of the restriction factors $c_{det}$ and $c_{shw}$.
Possible values for whichIC are:
'MIXMIX' = a mixture model is fitted and to compute the information criterion the mixture likelihood is used. This option corresponds to the use of the Bayesian Information criterion (BIC). In output structure out just the matrix out.MIXMIX is given. If this input option is not specified, MIXMIX is assumed.
'MIXCLA' = a mixture model is fitted but to compute the information criterion the classification likelihood is used. This option corresponds to the use of the Integrated Complete Likelihood (ICL). In output structure out just the matrix out.MIXCLA is given.
'CLACLA' = everything is based on the classification likelihood. This information criterion will be called CLA. In output structure out just the matrix out.CLACLA is given.
Example: 'whichIC','CLACLA'
Data Types: character
alpha
—global trimming level.fraction  number of observations which have to be trimmed.alpha is a scalar between 0 and 0.5 or an integer specifying the number of observations to be trimmed. If alpha = 0 all observations are considered.
More in detail, if 0 < alpha < 1 clustering is based on h = fix(n*(1alpha)) observations. Else if alpha is an integer greater than 1 clustering is based on h=nfloor(alpha).
The default value of alpha which is used is 0.
Example: 'alpha',0
Data Types: single  double
nsamp
—number of subsamples to extract.scalar  matrix.If nsamp is a scalar it contains the number of subsamples which will be extracted. If nsamp=0 all subsets will be extracted.
Remark  if the number of all possible subset is <300 the default is to extract all subsets, otherwise just 300.
 If nsamp is a matrix it contains in the rows the indexes of the subsets which have to be extracted. nsamp in this case can be conveniently generated by function subsets.
nsamp can have k columns or k*(v+1) columns. If nsamp has k columns the k initial centroids each iteration i are given by X(nsamp(i,:),:) and the covariance matrices are equal to the identity.
 If nsamp has k*(v+1) columns the initial centroids and covariance matrices in iteration i are computed as follows X1=X(nsamp(i,:),:) mean(X1(1:v+1,:)) contains the initial centroid for group 1 cov(X1(1:v+1,:)) contains the initial cov matrix for group 1 1 mean(X1(v+2:2*v+2,:)) contains the initial centroid for group 2 cov((v+2:2*v+2,:)) contains the initial cov matrix for group 2 1 ...
mean(X1((k1)*v+1:k*(v+1))) contains the initial centroids for group k cov(X1((k1)*v+1:k*(v+1))) contains the initial cov matrix for group k REMARK  if nsamp is not a scalar option option below startv1 is ignored. More precisely if nsamp has k columns startv1=0 elseif nsamp has k*(v+1) columns option startv1=1.
Example: 'nsamp',1000
Data Types: double
RandNumbForNini
—Preextracted random numbers to initialize proportions.matrix.Matrix with size kbysize(nsamp,1) containing the random numbers which are used to initialize the proportions of the groups. This option is effective just if nsamp is a matrix which contains preextracted subsamples and k is a scalar. The purpose of this option is to enable the user to replicate the results.
The default value of RandNumbForNini is empty, that is random numbers from uniform are used.
Example: 'RandNumbForNini',''
Data Types: single  double
refsteps
—Number of refining iterations.scalar.Number of refining iterations in subsample. Default is 15. refsteps = 0 means "rawsubsampling" without iterations.
Example: 'refsteps',10
Data Types: single  double
reftol
—scalar.default value of tolerance for the refining steps.The default value is 1e14;
Example: 'reftol',1e05
Data Types: single  double
equalweights
—cluster weights in the concentration and assignment steps.logical.A logical value specifying whether cluster weights shall be considered in the concentration, assignment steps and computation of the likelihood. The default value is false.
Example: 'equalweights',true
Data Types: Logical
startv1
—how to initialize centroids and cov matrices.scalar.If startv1 is 1 then initial centroids and and covariance matrices are based on (v+1) observations randomly chosen, else each centroid is initialized taking a random row of input data matrix and covariance matrices are initialized with identity matrices.
Remark 1 in order to start with a routine which is in the required parameter space, eigenvalue restrictions are immediately applied. The default value of startv1 is 1.
Remark 2  option startv1 is used just if nsamp is a scalar (see for more details the help associated with nsamp).
Example: 'startv1',1
Data Types: single  double
plots
—Plot on the screen.scalar.If plots = 1, three plots are produced.
The first is two panels plot. Left panel contains the Information criterion as function of k for each value of $c_{det}$ given the best value of $c_{shw}$ to the best found value. The kind of information criterion which is dispalyed depends on input option whichIC. It can be BIC for mixture likelihood (MIXMIX) or ICL (MIXCLA) or BIC for classification likelihood (CLACLA).
The second plot is a heatmap of the information criterion for the values of $c_{det}$ and $c_{shw}$ which have been considered.
The third plot is a two panels plot. The panel on the left contains the values of IC for $c_{shb}$ given the best values of $c_{det}$ and $c_{shw}$. This plot enables to choose the best value of $c_{shb}$. The panel on the right contains the values of IC in correspondence of the cases of no rotation, equal rotation and varying rotation.
The default value of plots is 0 that is no plot is shown.
Example: 'plots',1
Data Types: single  double
numpool
—number of pools for parellel computing.scalar.If numpool > 1, the routine automatically checks if the Parallel Computing Toolbox is installed and distributes the random starts over numpool parallel processes. If numpool <= 1, the random starts are run sequentially. By default, numpool is set equal to the number of physical cores available in the CPU (this choice may be inconvenient if other applications are running concurrently). The same happens if the numpool value chosen by the user exceeds the available number of cores. REMARK 1: up to R2013b, there was a limitation on the maximum number of cores that could be addressed by the parallel processing toolbox (8 and, more recently, 12). From R2014a, it is possible to run a local cluster of more than 12 workers.
REMARK 2: Unless you adjust the cluster profile, the default maximum number of workers is the same as the number of computational (physical) cores on the machine.
REMARK 3: In modern computers the number of logical cores is larger than the number of physical cores. By default, MATLAB is not using all logical cores because, normally, hyperthreading is enabled and some cores are reserved to this feature.
REMARK 4: It is because of Remarks 3 that we have chosen as default value for numpool the number of physical cores rather than the number of logical ones. The user can increase the number of parallel pool workers allocated to the multiple start monitoring by:
 setting the NumWorkers option in the local cluster profile settings to the number of logical cores (Remark 2). To do so go on the menu "HomeParallelManage Cluster Profile" and set the desired "Number of workers to start on your local machine".
 setting numpool to the desired number of workers;
Therefore, *if a parallel pool is not already open*, UserOption numpool (if set) overwrites the number of workers set in the local/current profile. Similarly, the number of workers in the local/current profile overwrites default value of 'numpool' obtained as feature('numCores') (i.e. the number of physical cores).
Example: 'numpool',4
Data Types: double
cleanpool
—clean pool.scalar.cleanpool is 1 if the parallel pool has to be cleaned after the execution of the routine. Otherwise it is 0. The default value of cleanpool is 1. Clearly this option has an effect just if previous option numpool is >
1.
Example: 'cleanpool',1
Data Types: single  double
msg
—Message on the screen.scalar.Scalar which controls whether to display or not messages about code execution.
Example: 'msg',1
Data Types: single  double
nocheck
—Check input arguments.scalar.If nocheck is equal to 1 no check is performed on matrix Y. As default nocheck=0.
Example: 'nocheck',10
Data Types: single  double
Ysave
—save input matrix.scalar.Scalar that is set to 1 to request that the input matrix Y is saved into the output structure out. Default is 1, that is matrix Y is saved inside output structure out.
Example: 'Ysave',1
Data Types: single  double
warmup
—warming up step.boolean.If warmup is true (default) we investigate for each candidate value of k the 14 GPCMs and identify the best model. If we find varying determinants then the sequence of values of cdet is investigated else cdet is always set to 1. Similarly, if we find a shape different from the identity the sequence of values of cshw is investigated else cshw is set equal to 1. Moreover in this step we control the type of rotation. If warmp is false we always start from VVV specificationa and we investigate each value of cdet and cshw. On a later stage we investigate the optimal value of cshb and the type of rotation.
Example: 'warmup',true
Data Types: boolean
UnitsSameGroup
—Units with same labels.numeric vector.List of the units which must (whenever possible) have the same label. For example if UnitsSameGroup=[20 26], it means that group which contains unit 20 is always labelled with number 1. Similarly, the group which contains unit 26 is always labelled with number 2, (unless it is found that unit 26 already belongs to group 1). In general, group which contains unit UnitsSameGroup(r) where r=2, ...length(kk)1 is labelled with number r (unless it is found that unit UnitsSameGroup(r) has already been assigned to groups 1, 2, ..., r1). The default value of UnitsSameGroup is '' that is consistent labels are not imposed.
Example: 'UnitsSameGroup',[12 20]
Data Types: single  double
out
— description
Structurestructure which contains the following fields:
Value  Description 

CLACLA 
3D array of size 5times8times8 if kk and pa.cdet and pa.cshw are not specififed else it is a 3D array of size length(kk)timeslength(pa.cdet)timeslength(pa.shw) containinig the value of the penalized classification likelihood. This output is present only if 'whichIC' is 'CLACLA'. 
IDXCLA 
3D array of size 5times8times8 if kk and pa.cdet and pa.cshw are not specififed else it is a 3D array of size length(kk)timeslength(pa.cdet)timeslength(pa.shw) containinig the assignment of each unit using the classification model. This output is present only if 'whichIC' is 'CLACLA'. 
MIXMIX 
3D array of size 5times8times8 if kk and pa.cdet and pa.cshw are not specififed else it is a 3D array of size length(kk)timeslength(pa.cdet)timeslength(pa.shw) containinig the value of the penalized mixture likelihood. This output is present only if 'whichIC' is 'MIXMIX'. 
MIXCLA 
D array of size 5times8times8 if kk and pa.cdet and pa.cshw are not specififed else it is a 3D array of size length(kk)timeslength(pa.cdet)timeslength(pa.shw) containinig the value of ICL. This output is present only if 'whichIC' is 'MIXCLA'. 
IDXMIX 
3D array of size 5times8times8 if kk and pa.cdet and pa.cshw are not specififed else it is a 3D array of size length(kk)timeslength(pa.cdet)timeslength(pa.shw) containinig the assignment of each unit using using the mixture model. This output is present only if 'whichIC' is 'MIXMIX'. 
kk 
vector containing the values of k (number of components) which have been considered. This vector is equal to input optional argument kk if kk had been specified else it is equal to 1:5. 
ccdet 
vector containing the values of $c_det$ (values of the restriction factor for ratio of the determinant) which have been considered. This vector is equal to input argument pa.cdet else it is equal to [1, 2, 4, 8, 16, 32, 64, 128]. 
kbest 
scalar containing optimal value of k. 
ccshw 
vector containing the values of $c_det$ (values of the restriction factor for ratio of the shape elements inside each group) which have been considered. This vector is equal to input argument pa.cdet else it is equal to [1, 2, 4, 8, 16, 32, 64, 128]. 
cdetbest 
scalar containing optimal value of restriction among determinants. 
cshwbest 
scalar containing optimal value of restriction among shape elements inside each group. 
cshbbest 
scalar containing optimal value of restriction among ordered shape elements across groups. 
BICbest 
scalar containing optimal value of BIC (smallest value of BIC). 
modelbest 
character identifying the best specification among the 14 GPCMs. 
alpha 
scalar containing the value of the trimming out.modelbest = character identifying the best specification among the 14 GPCMs. level which has been used. 
idx 
nby1 vector containing assignment of each unit to each of the k groups. Cluster names are integer numbers from 1 to k. 0 indicates trimmed observations. Note that this is the assignment using cshb=128; 
idxcshbbest 
nby1 vector containing assignment of each unit to each of the k groups. Cluster names are integer numbers from 1 to k. 0 indicates trimmed observations. Note that this is the assignment which used best value of cshb; 
pa 
strcture containing the the gpcm parameters (pa.pars, pa.shb, pa.shw, pa.cdet) for the best model. 
Y 
Original data matrix Y. The field is present if option Ysave is set to 1. out.alpha = scalar. Level of trimming which has been used. 
BICshb 
value of the information criterion for each candidate value of $c_{shb}$. Matrix with two columns and r rows. The first column contains the values of candidate values of $c_{shb}$ which have been used. Note that if max(cshb) is smaller or equal than $cshwbest^((v1)/v)$. The second column contains the associated values of the information crterion. r is the number of candidate values of cshb which have been considered. 
IDXshb 
matrix with size nbyr which contains the classification for each candidate value of cshb. The number of columns of this matrix is equal to the number of rows of out.BICshb. 
BICrot 
value of the information criterion for each type or rotation. First element refers to $I$ (no rotation) the second element refers to $E$ (equal rotation) and the third element refers to $V$ (varying rotation). 
IDXrot 
matrix with size nby3 which contains the classification for each type of rotation. First column refers to $I$ (no rotation), the second column refers to $E$ (equal rotation) and the third column refers to $V$ (varying rotation). 
Cerioli, A., GarciaEscudero, L.A., MayoIscar, A. and Riani M. (2017), Finding the Number of Groups in ModelBased Clustering via Constrained Likelihoods, "Journal of Computational and Graphical Statistics", pp. 404416, https://doi.org/10.1080/10618600.2017.1390469
tclustIC

tclust

tclustICsol

tclustICplot

carbikeplot