tkmeans

tkmeans computes trimmed k-means

Syntax

out =tkmeans(Y,k,alpha)example
out =tkmeans(Y,k,alpha,Name,Value)example
[out , varargout]=tkmeans(___)example

Description

tkmeans(Y, k, alpha) partitions the points in the n-by-v data matrix Y into k clusters. This partition minimizes the trimmed sum, over all clusters, of the within-cluster sums of point-to-cluster-centroid distances. Rows of Y correspond to points, columns correspond to variables. tkmeans returns inside structure out an n-by-1 vector IDX containing the cluster indices of each point. By default, tkmeans uses (squared) Euclidean distances.

example

out =tkmeans(Y, k, alpha) Trimmed k-means using geyser data (1).

example

out =tkmeans(Y, k, alpha, Name, Value) Trimmed k-means using geyser data (2).

example

[out , varargout] =tkmeans(___) Trimmed k-means using M5data and different plots.

Examples

expand all

Trimmed k-means using geyser data (1).

3 groups and trimming level of 3 percent

close all
Y=load('geyser2.txt');
out=tkmeans(Y,3,0.03,'plots',1);

Total estimated time to complete trimmed k means:  0.35 seconds

Click here for the graphical output of this example (link to Ro.S.A. website).

Trimmed k-means using geyser data (2).

option weights =1

close all
Y=load('geyser2.txt');
out=tkmeans(Y,3,0.03,'plots',1,'weights',1);

Trimmed k-means using M5data and different plots.

Weights =1

Y=load('M5data.txt');
close all
out=tkmeans(Y(:,1:2),3,0,'plots',1);
out=tkmeans(Y(:,1:2),3,0.1,'plots','ellipse');
out=tkmeans(Y(:,1:2),3,0.1,'plots','contour');
out=tkmeans(Y(:,1:2),3,0.1,'plots','contourf');
out=tkmeans(Y(:,1:2),3,0.1,'plots','boxplotb');
cascade;
% using a structure for plots
contOpt = struct;
contOpt.cmap = summer 
out=tkmeans(Y(:,1:2),3,0.1,'plots',contOpt)
cascade

Related Examples

expand all

Trimmed k-means using structured noise.

The data have been generated using the following R instructions set.seed (0) v <- runif (100, -2 * pi, 2 * pi) noise <- cbind (100 + 25 * sin (v), 10 + 5 * v) x <- rbind ( rmvnorm (360, c (0.0, 0), matrix (c (1, 0, 0, 1), ncol = 2)), rmvnorm (540, c (5.0, 10), matrix (c (6, -2, -2, 6), ncol = 2)), noise)

close all
Y=load('structurednoise.txt');
out=tkmeans(Y(:,1:2),2,0.1,'plots',1);
out=tkmeans(Y(:,1:2),5,0.15,'plots',1);
cascade

Trimmed k-means using mixture100 data.

The data have been generated using the following R instructions set.seed (100) mixt <- rbind (rmvnorm (360, c ( 0, 0), matrix (c (1, 0, 0, 1), ncol = 2)), rmvnorm (540, c ( 5, 10), matrix (c (6, -2, -2, 6), ncol = 2)), rmvnorm (100, c (2.5, 5), matrix (c (50, 0, 0, 50), ncol = 2)))

close all
Y=load('mixture100.txt');
out=tkmeans(Y(:,1:2),3,0,'plots',1);
out=tkmeans(Y(:,1:2),2,0.05,'plots',1);
cascade

Input Arguments

expand all

`Y` — Input data. Matrix.

Data matrix containining n observations on v variables.

Rows of Y represent observations, and columns represent variables.

Missing values (NaN's) and infinite values (Inf's) are allowed, since observations (rows) with missing or infinite values will automatically be excluded from the computations.

Data Types: single| double

`k` — Number of groups. Scalar.

Data Types: single| double

`alpha` — Global trimming level. Scalar.

alpha is a scalar between 0 and 0.5.

Il alpha = 0 tkmeans reduces to kmeans.

Data Types: single| double

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside single quotes (' '). You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example:

 'nsamp',0
, 'refsteps',20
, 'reftol',0.0001
, 'weights',1
, 'plots', 1
, 'msg',1
, 'nocheck',1
, 'nomes',1
, 'Ysave',1

`nsamp` —Number of subsamples.integer.

Number of subsamples which will be extracted to find the partition. If nsamp=0 all subsets will be extracted. They will be (n choose k).

Remark: if the number of all possible subset is <300 the default is to extract all subsets, otherwise just 300.

Example: 'nsamp',0

Data Types: double

`refsteps` —Number of iterations.scalar.

Number of refining iterations in each subsample (default = 15).

Example: 'refsteps',20

Data Types: double

`reftol` —Tolerance.scalar.

Default value of tolerance for the refining steps. The default value is 1e-14.

Example: 'reftol',0.0001

Data Types: double

`weights` —Cluster weights.integer.

A dummy scalar, specifying whether cluster weights (1) shall be considered in the concentration and assignment steps. If weights=1 in the assignment step to the squared Euclidean distance of unit i to group j log n_j is substracted. The default is no cluster weights.

Example: 'weights',1

Data Types: double

`plots` —Plot on the screen.scalar, char | struct.

- If plots=0 (default), plots are not generated.

- If plot=1, a plot with the classification is shown on the screen (using the spmplot function). The plot can be: * for v=1, an histogram of the univariate data.

* for v=2, a bivariate scatterplot.

* for v>2, a scatterplot matrix generated by spmplot.

When v>=2 plots offers the following additional features (for v=1 the behaviour is forced to be as for plots=1): - plots='contourf' adds in the background of the bivariate scatterplots a filled contour plot. The colormap of the filled contour is based on grey levels as default.

This argument may also be inserted in a field named 'type' of a structure. In the latter case it is possible to specify the additional field 'cmap', which changes the default colors of the color map used. The field 'cmap' may be a three-column matrix of values in the range [0,1] where each row is an RGB triplet that defines one color.

Check the colormap function for additional informations.

- plots='contour' adds in the background of the bivariate scatterplots a contour plot. The colormap of the contour is based on grey levels as default. This argument may also be inserted in a field named 'type' of a structure.

In the latter case it is possible to specify the additional field 'cmap', which changes the default colors of the color map used. The field 'cmap' may be a three-column matrix of values in the range [0,1] where each row is an RGB triplet that defines one color.

Check the colormap function for additional informations.

- plots='ellipse' superimposes confidence ellipses to each group in the bivariate scatterplots. The size of the ellipse is chi2inv(0.95,2), i.e. the confidence level used by default is 95%. This argument may also be inserted in a field named 'type' of a structure. In the latter case it is possible to specify the additional field 'conflev', which specifies the confidence level to use and it is a value between 0 and 1.

- plots='boxplotb' superimposes on the bivariate scatterplots the bivariate boxplots for each group, using the boxplotb function. This argument may also be inserted in a field named 'type' of a structure.

If plots is a struct it may contain the following fields:

Value Description

Value	Description
`type`	a char specifying the type of superimposition Choices are 'contourf', 'contour', 'ellipse' or 'boxplotb'. REMARK - The labels=0 are automatically excluded from the overlaying phase, considering them as outliers.

type

a char specifying the type of superimposition Choices are 'contourf', 'contour', 'ellipse' or 'boxplotb'.

REMARK - The labels=0 are automatically excluded from the overlaying phase, considering them as outliers.

Example: 'plots', 1

Data Types: single | double | string

`msg` —Message on the screen.scalar.

Scalar which controls whether to display or not messages on the screen. If msg=1 (default) messages are displayed on the screen about estimated time to compute the estimator else no message is displayed on the screen.

Example: 'msg',1

Data Types: double

`nocheck` —Check input.scalar.

If nocheck is equal to 1 no check is performed on matrix Y.

As default nocheck=0.

Example: 'nocheck',1

Data Types: double

`nomes` —Estimated time message.scalar.

If nomes is equal to 1 no message about estimated time to compute tkemans is displayed, else if nomes is equal to 0 (default), a message about estimated time is displayed.

Example: 'nomes',1

Data Types: double

`Ysave` —Saving Y.scalar.

Scalar that is set to 1 to request that the input matrix Y is saved into the output structure out.

Default is 0, i.e. no saving is done.

Example: 'Ysave',1

Data Types: double

Output Arguments

expand all

`out` — description Structure

Structure which contains the following fields

Value	Description
`idx`	n-by-1 vector containing assignment of each unit to each of the k groups. Cluster names are integer numbers from 1 to k, 0 indicates trimmed observations.
`muopt`	k-by-v matrix containing cluster centroids locations. Robust estimate of final centroids of the groups.
`sigmaopt`	v-by-v-by-k empirical covariance matrices of the groups found by tkmeans.
`BoxTest`	Structure containing the results of the Box test of equality of covariance matrices.
`bs`	k-by-1 vector containing the units forming initial subset associated with muopt.
`D`	n-by-k matrix containing squared Euclidean distances from each point to every centroid.
`siz`	Matrix of size k-by-3 1st col = sequence from 0 to k 2nd col = number of observations in each cluster 3rd col = percentage of observations in each cluster Remark: 0 denotes unassigned units.
`weights`	Numerical vector of length k, containing the weights of each cluster. If input option weights=1 out.weights=(1/k, ...., 1/k) else if input option weights <> 1 out.weights=(n1/n, ..., nk/n).
`h`	Scalar. Number of observations that have determined the centroids (number of untrimmed units).
`obj`	Scalar. Value of the objective function which is minimized (value of the best returned solution).
`Y`	Original data matrix Y. The field is present only if option Ysave is set to 1.
`emp`	"Empirical" statistics computed on final classification. Scalar or structure. When convergence is reached, out.emp=0. When convergence is not obtained, this field is a structure which contains the statistics of interest: idxemp (ordered from 0 to k, k being the number of groups with at least one observation and 0 representing the possible group of outliers), muemp, sigmaemp and sizemp, which are the empirical counterparts of idx, muopt, sigmaopt and siz.

`varargout` —C : Subsamples extracted. Cell

nsamp-by-k initial subsamples extracted.

More About

expand all

Additional Details

This iterative algorithm initializes k clusters randomly and performs "concentration steps" in order to improve the current cluster assignment.

The number of maximum concentration steps to be performed is given by input parameter refsteps. For approximately obtaining the global optimum, the system is initialized nsamp times and concentration steps are performed until convergence or refsteps is reached. When processing more complex data sets higher values of nsamp and refsteps have to be specified (obviously implying extra computation time). However, if more then half of the iterations do not converge, a warning message is issued, indicating that nsamp has to be increased.

References

Garcia-Escudero, L.A., Gordaliza, A., Matran, C. and Mayo-Iscar, A. (2008), A General Trimming Approach to Robust Cluster Analysis. Annals of Statistics, Vol. 36, 1324-1345.

Documentation

tkmeans

Syntax

Description

Examples

Trimmed k-means using geyser data (1).

Trimmed k-means using geyser data (2).

Trimmed k-means using M5data and different plots.

Related Examples

Trimmed k-means using structured noise.

Trimmed k-means using mixture100 data.

Input Arguments

`Y` — Input data. Matrix.

`k` — Number of groups. Scalar.

`alpha` — Global trimming level. Scalar.

Name-Value Pair Arguments

`nsamp` —Number of subsamples.integer.

`refsteps` —Number of iterations.scalar.

`reftol` —Tolerance.scalar.

`weights` —Cluster weights.integer.

`plots` —Plot on the screen.scalar, char | struct.

`msg` —Message on the screen.scalar.

`nocheck` —Check input.scalar.

`nomes` —Estimated time message.scalar.

`Ysave` —Saving Y.scalar.

Output Arguments

`out` — description Structure

`varargout` —C : Subsamples extracted. Cell

More About

Additional Details

References

See Also

Documentation

tkmeans

Syntax

Description

Examples

Trimmed k-means using geyser data (1).

Trimmed k-means using geyser data (2).

Trimmed k-means using M5data and different plots.

Related Examples

Trimmed k-means using structured noise.

Trimmed k-means using mixture100 data.

Input Arguments

Y — Input data. Matrix.

k — Number of groups. Scalar.

alpha — Global trimming level. Scalar.

Name-Value Pair Arguments

nsamp —Number of subsamples.integer.

refsteps —Number of iterations.scalar.

reftol —Tolerance.scalar.

weights —Cluster weights.integer.

plots —Plot on the screen.scalar, char | struct.

msg —Message on the screen.scalar.

nocheck —Check input.scalar.

nomes —Estimated time message.scalar.

Ysave —Saving Y.scalar.

Output Arguments

out — description Structure

varargout —C : Subsamples extracted. Cell

More About

Additional Details

References

See Also

`Y` — Input data. Matrix.

`k` — Number of groups. Scalar.

`alpha` — Global trimming level. Scalar.

`nsamp` —Number of subsamples.integer.

`refsteps` —Number of iterations.scalar.

`reftol` —Tolerance.scalar.

`weights` —Cluster weights.integer.

`plots` —Plot on the screen.scalar, char | struct.

`msg` —Message on the screen.scalar.

`nocheck` —Check input.scalar.

`nomes` —Estimated time message.scalar.

`Ysave` —Saving Y.scalar.

`out` — description Structure

`varargout` —C : Subsamples extracted. Cell