# MixSimreg

MixSimreg generates k regression hyperplanes in p dimensions with given overlap

## Syntax

• out=MixSimreg(k,p)example
• out=MixSimreg(k,p,Name,Value)example

## Description

MixSimreg(k,p) generates k groups in p dimensions. It is possible to control the average and maximum or standard deviation of overlapping.

Notation and background.

Given two generic clusters $i$ and $j$ with $i \ne j=1,...,k$, indexed by $\phi(x,\mu_i,\sigma_i^2)$ and $\phi(x,\mu_j, \sigma_j^2)$ with probabilities of occurrence $\pi_i$ and $\pi_j$, the misclassification probability with respect to cluster $i$ (denoted with $w_{j|i}$) is defined as

$Pr[\pi_i \phi(x,\mu_i,\sigma_i^2) < \pi_j \phi(x,\mu_j,\sigma_j^2)]$

where, in the regression context, $\mu_i={\overline x}_i' \beta_i$ and $\mu_j= \overline x_j' \beta_j$. We assume that the length of vectors $x_i$, $x_j$, $\beta_i$, and $\beta_j$ is $p$ (number of explanatory variables including or excluding intercept). In our implementation, the distribution of the elements of vectors $\beta_i$ ($\beta_j$) can be 'Normal' (with parameters $\mu$ and $\sigma$), 'HalfNormal' (with parameter $\sigma$) or uniform (with parameters $a$ and $b$). Same thing for the distribution of the elements of $x_i$ ($x_j$). However, while the parameters of the distributions are the same for all elements of $\beta$ in all groups, the parameters of the distribution of the elements of vectors $x_i$ ($x_j$) can vary for each group and each explanatory variable. In other words, it is possible to specify (say) that the distribution of the second explanatory variable in the first group is $U(2, 3)$ while the distribution of the third explanatory variable in the second group is $U(2, 10)$.

The matrix containing the misclassification probabilities $w_{j|i}$ is called OmegaMap.

The probability of overlapping between groups i and j is given by

$w_{j|i} + w_{i|j} \qquad i,j=1,2, ..., k$

The diagonal elements of OmegaMap are equal to 1.

The average overlap (BarOmega, in the code) is defined as the sum of the off diagonal elements of OmegaMap (containing the misclassification probabilities) divided by $k*(k-1)/2$.

The maximum overlap (MaxOmega, in the code) is defined as:

$\max (w_{j|i} + w_{i|j}) \qquad i \ne j=1,2, ..., k$

The probability of overlapping $w_{j|i}$ is nothing but the cdf of a linear combination of non central $\chi^2$ distributions with 1 degree of freedom plus a linear combination of $N(0,1)$ evaluated in a point $c$.

The coefficients of the linear combinations of non central $\chi^2$ and $N(0,1)$ and point c depend on $\sigma_{j|i}^2 = \sigma_i^2/\sigma_j^2$.

This probability is computed using routine ncx2mixtcdf

 out =MixSimreg(k, p) Example 1: Mixture of regression with prefixed average overlap.

 out =MixSimreg(k, p, Name, Value) Example 2: Mixture of regression with prefixed average overlap.

## Examples

expand all

### Example 1: Mixture of regression with prefixed average overlap.

Generate mixture of regression using an average overlapping at centroids =0.01. Use all default options 1) Beta is generated according to random normal for each group with mu=0 and sigma=1 2) X in each dimension and each group is generated according to U(0, 1) 3) regression hyperplanes contain intercepts

p=5;
k=3;
out=MixSimreg(k,p,'BarOmega',0.01);
n=200;
% out.Xdistrib.BarX in this case has dimension 5-by-3 and is equal to
% 1.0000    1.0000    1.0000
% 0.5000    0.5000    0.5000
% 0.5000    0.5000    0.5000
% 0.5000    0.5000    0.5000
% 0.5000    0.5000    0.5000
% Probabilities of overlapping are evaluated at
% out.Beta(:,1)'*out.Xdistrib.BarX(:,1) ... out.Beta(:,3)'*out.Xdistrib.BarX(:,3)
[y,X,id]=simdatasetreg(n,out.Pi,out.Beta,out.S,out.Xdistrib);
yXplot(y,X(:,2:end),'group',id);

### Example 2: Mixture of regression with prefixed average overlap.

Generate mixture of regression hyperplanes using an average overlapping at centroids =0.01.

% 1) we use all the default options for Beta (random normal for each group with
% mu=0.5 and sigma=1)
% 2) X in the second dimension for the third group is generated according to U(1, 3)
rng(10,'twister')
% Specify the distribution of the explanatory variables
p=5;
k=3;
Xdistrib=struct;
Xdistrib.type='Uniform';
Xdistrib.a=zeros(p-1,k);
Xdistrib.a(2,3)=1;
Xdistrib.b=ones(p-1,k);
Xdistrib.b(2,3)=3;
% 3) regression hyperplanes contain intercepts
Q=MixSimreg(k,p,'BarOmega',0.01,'Xdistrib',Xdistrib);
n=200;
% Q.Xdistrib.BarX in this case has dimension 5-by-3 and is equal to
%     1.0000    1.0000    1.0000
%     0.5000    0.5000    0.5000
%     0.5000    0.5000    2.0000
%     0.5000    0.5000    0.5000
%     0.5000    0.5000    0.5000
% Probabilitties of overlapping are evaluated at
% Q.Beta(:,1)'*Q.Xdistrib.BarX(:,1) ... Q.Beta(:,3)'*Q.Xdistrib.BarX(:,3)
[y,X,id]=simdatasetreg(n,Q.Pi,Q.Beta,Q.S,Q.Xdistrib);
yXplot(y,X(:,2:end),'group',id);

## Related Examples

expand all

### Example 3: Mixture of regression with prefixed average overlap.

Exactly as before but now the distribution of beta is N(0 6)

rng(10,'twister')
p=5;
k=3;
% Specify the distribution of the explanatory variables
Xdistrib=struct;
Xdistrib.type='Uniform';
Xdistrib.a=zeros(p-1,k);
Xdistrib.a(2,3)=1;
Xdistrib.b=ones(p-1,k);
Xdistrib.b(2,3)=3;
% Specify the distribution of the beta coefficients
n=200;
% Probabilitties of overlapping are evaluated at
% Q.Beta(:,1)'*Q.Xdistrib.BarX(:,1) ... Q.Beta(:,3)'*Q.Xdistrib.BarX(:,3)
% Q.betadistrib in this case is equal to
%      type: 'Normal'
%     sigma: 6
%        mu: 0.5000
[y,X,id]=simdatasetreg(n,Q.Pi,Q.Beta,Q.S,Q.Xdistrib);
yXplot(y,X(:,2:end),'group',id)

### Example 4: Internation trade data example.

All slopes are positive (beta generated using half normal) p=1 and there is no intercept

rng(10,'twister')
p=1;
k=5;
Xdistrib=struct;
Xdistrib.intercept=0;
Xdistrib.type='Uniform';
Xdistrib.a=zeros(p,k);
Xdistrib.b=10*ones(p,k);
n=200;
[y,X,id]=simdatasetreg(n,Q.Pi,Q.Beta,Q.S,Q.Xdistrib);
yXplot(y,X,'group',id)

### Example 5: Another international trade data example.

Here the strips of certain groups are limited up to certain values There is no intercept.

% In this example we compare high and low overlap among regression hyperplanes
p=1;
k=4;
Xdistrib=struct;
Xdistrib.intercept=0;
Xdistrib.type='Uniform';
Xdistrib.a=zeros(p,k);
Xdistrib.b=[4 2 10 5];
n=200;
% Strong overlap BarOmega=0.2
close all
rng(10,'twister')
[y,X,id]=simdatasetreg(n,Q.Pi,Q.Beta,Q.S,Q.Xdistrib);
yXplot(y,X,'group',id,'tag','Strong_Overlap')
set(gcf,'Name','Strong overlap')
% Small overlap BarOmega=0.01
rng(10,'twister')
[y,X,id]=simdatasetreg(n,Q.Pi,Q.Beta,Q.S,Q.Xdistrib);
yXplot(y,X,'group',id,'tag','Small_Overlap')
set(gcf,'Name','Small overlap')

### Example 6: Betadistrib with a specific parameter for each group.

rng(10,'twister')
clear all
close all
intercept = 1;
n=200;
p=1;
k=2;
Xdistrib=struct;
Xdistrib.intercept=intercept;
Xdistrib.type='Uniform';
Xdistrib.a=zeros(p,k);
Xdistrib.b=10*ones(p,k);
% beta distributed as HalfNormal
% beta distributed as normal
[y1,X1,id1]=simdatasetreg(n,Q1.Pi,Q1.Beta,Q1.S,Q1.Xdistrib);
[y2,X2,id2]=simdatasetreg(n,Q2.Pi,Q2.Beta,Q2.S,Q2.Xdistrib);
yXplot(y1,X1,'group',id1,'tag','scalar')
title('Betadistrib is a scalar: same parameters for all betas')
yXplot(y2,X2,'group',id2,'tag','vector')
title('Betadistrib is a vector: a parameter for each beta')
cascade;

### Example 7: A Tweedie-based international trade data example.

This is like example 5, but the explanatory variable follow the Tweedie distribution.

n=500;
p=1;
k=3;
% generate a Tweedie distributed explanatory variable
[alpha,theta,delta] = deal(0.5 , 5 , 1); % Barabesi
X = twdrnd(alpha,theta,delta,n);
% Distribution of the explanatory variable: the mean the Tweedie is
% specified
Xdistrib=struct;
Xdistrib.intercept=0;
Xdistrib.type='User';
Xdistrib.BarX = mean(X);
% Distribution of the regression parameters
% Overlap level: BarOmega=0.1
close all
rng(10,'twister')
% The X-component
Q.Xdistrib.X = X;
% Now simulate the dataset
[y,X,id]=simdatasetreg(n,Q.Pi,Q.Beta,Q.S,Q.Xdistrib);
% And plot it
yXplot(y,X,'group',id,'tag','Strong_Overlap');
set(gcf,'Name','explanatory variable is Tweedie distributed');
title('Three components with the same Tweedie-distributed x-variable');

### Example 8: Another Tweedie-based international trade data example.

This is like example 7, but with group-specific Tweedie distributions.

n=500;
p=1;
k=3;
% generate three Tweedie distributed explanatory variables
[alpha,theta,delta] = deal(0.5 , 5 , 1);
alpha2 = 0.7;
alpha3 = 0.9;
X1 = twdrnd(alpha,theta,delta,floor(n/3));
X2 = twdrnd(alpha2,theta,delta,floor(n/3));
X3 = twdrnd(alpha3,theta,delta,n-2*floor(n/3));
% Distribution of the explanatory variable: the mean the three Tweedie
% components is specified
Xdistrib=struct;
Xdistrib.intercept=0;
Xdistrib.type='User';
Xdistrib.BarX = [mean(X1) , mean(X2) , mean(X3)];
% Distribution of the regression parameters
% Overlap level: BarOmega=0.1
close all
rng(10,'twister')
% The three X-components
Q.Xdistrib.X = [X1 ; X2 ; X3];
% Now simulate the dataset
[y,X,id]=simdatasetreg(n,Q.Pi,Q.Beta,Q.S,Q.Xdistrib);
% And plot it
yXplot(y,X,'group',id,'tag','Strong_Overlap');
set(gcf,'Name','explanatory variable is Tweedie distributed');
title('Three components with different Tweedie-distributed explanatory variables');

## Input Arguments

### k — Number of groups (components). Scalar.

Desired number of groups.

Data Types: int16|int32|int64|single|double

### p — Number of explanatory variables for each regression hyperplane (including intercept). Scalar.

Desired number of variables.

Data Types: int16|int32|int64|single|double

### Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside single quotes (' '). You can specify several name and value pair arguments in any order as  Name1,Value1,...,NameN,ValueN.

Example:  'BarOmega',0.05 , 'MaxOmega',0.05 , 'StdOmega',0.05 , 'hom',false , 'restrfactor',8 , 'PiLow',0.1 , 'Xdistrib',1 , 'betadistrib',1 , 'resN',3 , 'tol',[0.01 0.02] , 'lim',0.001 , 'Display','off' 

### BarOmega —Requested average overlap.scalar.

Value of desired average overlap. The default value is ''

Example:  'BarOmega',0.05 

Data Types: double

### MaxOmega —Requested maximum overlap.scalar.

Value of desired maximum overlap. If BarOmega is empty the default value of MaxOmega is 0.15.

Example:  'MaxOmega',0.05 

Data Types: double

### StdOmega —Requested std of overlap.scalar.

Value of desired standard deviation of overlap.

Remark 1: The probability of overlapping between two clusters $i$ and $j$ ($i \ne j=1, 2, ..., k$), called $p_{ij}$, is defined as the sum of the two misclassification probabilities $p_{ij}=w_{j|i} + w_{i|j}$.

Remark 2: it is possible to specify up to two values among BarOmega MaxOmega and StdOmega.

Example:  'StdOmega',0.05 

Data Types: double

### hom —Equal Sigmas.scalar boolean.

Scalar boolean which specifies if the desired clusters have to be heterogeneous or homogeneous: hom=false (default) ==> heterogeneous, hom=true ==> homogeneous \Sigma_1 = ... = \Sigma_k

Example:  'hom',false 

Data Types: boolean

### restrfactor —restriction factor.scalar.

Scalar in the interval $[1, \infty]$ which specifies the maximum ratio to allow between the largest $\sigma^2$ and the smallest $\sigma^2$ which are generated. If, for example, restrfactor=10, after generating the mixtures of regression lines (hyperplanes) we check that the ratio $\sigma^2_i/\sigma^2_j$, $i \ne j=1, ..., k$, is not larger than restrfactor. In order to apply this restriction, which is typical of tclust, we call routine restreigen.

Example:  'restrfactor',8 

Data Types: double

### PiLow —Smallest miximg proportion.scalar.

Value of the smallest mixing proportion (if 'PiLow' is not reachable with respect to k, equal proportions are taken; PiLow = 1.0 implies equal proportions by default).

PiLow must be a number in the interval (0 1]

Example:  'PiLow',0.1 

Data Types: double

### Xdistrib —distribution to use for each explanatory variable.scalar | structure.

It specifies the distribution to use for each explanatory variable and each group. Once chosen, the distribution is fixed for each explanatory variable and each group; however, the parameters of the chosen distribution may vary across variables and groups. For example, once decided that the distibution of the X is uniform, the second variable of the first group can be defined in [a21 b21] while the third variable of the second group can be defined in [a32 b32].

- If Xdistrib = 1 the default is to assume that the explanatory variables come from U(0, 1) and that the first explanatory variable is a constant term (the intercept).

- If Xdistrib is a structure, it may contain information about the distribution (in the fieldname 'type') and the parameters of the distribution. The following options are admitted for Xdistrib:

Value Description
intercept

scalar equal to 1 if intercept is present. The default value of Xdistrib.intercept is 1.

The other fields of Xdistrib depend on the distribution which is chosen.

type

string which identifies the kind of distribution.

Possibile values are: 'Normal'; NORMAL DISTRIBUTION N(mu, sigma); In this case the use must supply mu and sigma.

'Uniform'; UNIFORM DISTRIBUTION U(a, b).

'HalfNormal'; HALF NORMAL DISTRIBUTION Half(sigma)= |N(0 sigma)|.

'User'. OTHER DISTRIBUTION. In this case the user must directly provide means of the p explanatory variables for each group.

mu

matrix of size (p-1)-by-k if Xdistrib.intercept=1 or p-by-k if Xdistrib.intercept=0 containing the parameters mu for each explanatory variable and each group. The default value of Xdistrib.mu is 0.5*ones(p-1, k).

Xdistrib.mu is used just if X.distrib.type is normal.

sigma

matrix of size (p-1)-by-k if (Xdistrib.intercept=1) or p-by-k if (Xdistrib.intercept=0) containing the parameters sigma for each explanatory variable and each group.

The default value of Xdistrib.sigma is ones(p-1,k).

Xdistrib.sigma is used just if X.distrib.type is 'Normal' or if Xdistrib.type is 'HalfNormal'.

a

matrix of size (p-1)-by-k if Xdistrib.intercept=1 or p-by-k if Xdistrib.intercept=0 containing the parameters a (the lower limits) for each explanatory variable and each group. The default value of Xdistrib.a is zeros(p-1, k).

Xdistrib.a is used just if Xdistrib.type is 'Uniform'

b

matrix of size (p-1)-by-k if Xdistrib.intercept=1 or p-by-k if Xdistrib.intercept=0 containing the parameters b (the upper limits) for each explanatory variable and each group. The default value of Xdistrib.b is ones(p-1, k).

Xdistrib.b is used just if Xdistrib.type is 'Uniform'

BarX

(p-1)-by k matrix if intercept is present or p-by-k matrix if intercept is not present containing the means of the p explanatory variables for each group.

Xdistrib.BarX is used just if Xdistrib.type is 'User'

Example:  'Xdistrib',1 

Data Types: double

### betadistrib —distribution to use for regression coefficients.scalar | structure.

It specifies the distribution to use for each element of the vectors of regression coefficients.

Scalar or structure.

Once chosen, the distribution together with its parameters is fixed for each element of beta, across each group.

- If betadistrib = 1 the default is to assume that the vector of regression coefficients come from N(0, 1).

- If betadistrib is a structure it may contain information about the distribution (in the fieldname type) and the parameters of the distribution.

Value Description
type

string which identifies the kind of distribution.

Possibile values are: 'Normal'; NORMAL DISTRIBUTION N(mu, sigma); In this case the user must supply mu and sigma.

'Uniform'; UNIFORM DISTRIBUTION U(a, b).

'HalfNormal'; HALF NORMAL DISTRIBUTION Half(sigma)= |N(0 sigma)|.

'User'. OTHER DISTRIBUTION.

mu

scalar, containing parameter mu for the distribution of each element of beta across each group. The default value of betadistrib.mu is 0.

sigma

scalar, containing parameter sigma for the distribution of each element of beta across each group. The default value of betadistrib.sigma is 1.

a

scalar, containing parameter a for the distribution of each element of beta across each group. The default value of betadistrib.a is 0.

b

scalar, containing parameter b for the distribution of each element of beta across each group. The default value of betadistrib.b is 1.

Beta

matrix of size (p-1)-by k (if intercept is present) or p-by-k (if intercept is not present) containing the vectors of regression coefficients for the k groups.

Example:  'betadistrib',1 

Data Types: double

### resN —maximum number of attempts.scalar integer.

Maximum number of mixture re-simulations to find a simulation setting with prespecified level of overlapping.

The default value of resN is 100.

Example:  'resN',3 

Data Types: double

### tol —tolerance.vector of length 2.

- tol(1) (which will be called tolmap) specifies the tolerance between the requested and empirical misclassification probabilities (default is 1e-06).

- tol(2) (which will be called tolnxc2) specifies the tolerance to use in routine ncx2mixtcdf (which computes the cdf of linear combinations of non central chi2 distributions). The default value of tol(2) 1e-06.

Example:  'tol',[0.01 0.02] 

Data Types: double

### lim —maximum number of integration terms to use inside routine ncx2mixtcdf.integer.

Default is 1e06.

REMARK: Parameters tolncx2=tol(2) and lim are used by function ncx2mixtcdf which computes the cdf of a linear combination of non central chi2 r.v.. This is the probability of misclassification.

Example:  'lim',0.001 

Data Types: double

### Display —Level of display.logical.

- 'off' displays no output.

- 'notify' (default) displays output if requested overlap cannot be reached in a particular simulation.

- 'iter' displays output at each iteration of each simulation.

Example:  'Display','off' 

Data Types: char

## Output Arguments

### out — description Structure

Structure which contains the following fields

Value Description
OmegaMap

matrix of misclassification probabilities (k-by-k);

OmegaMap(i,j) = $w_{j|i}$ is the probability that X, coming from the $i$-th component (group), is classified to the $j$-th component.

BarOmega

scalar. Value of average overlap. BarOmega is computed as (sum(sum(OmegaMap))-k)/(0.5*k(k-1))

MaxOmega

scalar. Value of maximum overlap. MaxOmega is the maximum of OmegaMap(i,j)+OmegaMap(j,i) (i ~= j)=1, 2, ..., k. In other words, MaxOmega= OmegaMap(rcMax(1),rcMax(2))+OmegaMap(rcMax(2),rcMax(1))

StdOmega

scalar. Value of standard deviation (std) of overlap.

StdOmega is the standard deviation of the k*(k-1)/2 probabilities of overlapping.

rcMax

vector of length 2. It containes the row and column numbers associated with the pair of components producing maximum overlap 'MaxOmega'

fail

scalar, flag value. 0 indicates a successful mixture generation, 1 represents failure.

Pi

vector of length k containing the mixing proportions.

Clearly, sum(out.Pi)=1.

Mu

vector of length k, consisting of components' mean vectors for each regression hyperplane.

out.Mu(1)=BarX'Beta(:,1) ... out.Mu(p)=BarX'Beta(:,k)

S

k-by-1 vector containing the variances for the k groups.

Beta

p-by-k matrix containing (in each column) the regression coefficients for each group.

Xdistrib

distribution of X which has been used (struct).

betadistrib

distribution of beta which has been used (struct).

## References

Maitra, R. and Melnykov, V. (2010), Simulating data to study performance of finite mixture modeling and clustering algorithms, "The Journal of Computational and Graphical Statistics", Vol. 19, pp. 354-376. [to refer to this publication we will use "MM2010 JCGS"]

Melnykov, V., Chen, W.-C. and Maitra, R. (2012), MixSim: An R Package for Simulating Data to Study Performance of Clustering Algorithms, "Journal of Statistical Software", Vol. 51, pp. 1-25.

Davies, R. (1980), The distribution of a linear combination of chi-square random variables, "Applied Statistics", Vol. 29, pp. 323-333.

Parlett, B.N. and Reinsch, C. (1969), Balancing a matrix for calculation of eigenvalues and eigenvectors, "Numerische Mathematik", Vol. 13, pp. 293-304.

Parlett, B.N. and Reinsch, C. (1971), Balancing a Matrix for Calculation of Eigenvalues and Eigenvectors, in Bauer, F.L. Eds, "Handbook for Automatic Computation", Vol. 2, pp. 315-326, Springer.

Garcia-Escudero, L.A., Gordaliza, A., Matran, C. and Mayo-Iscar, A. (2008), A General Trimming Approach to Robust Cluster Analysis. Annals of Statistics, Vol. 36, 1324-1345.

Torti F., Perrotta D., Riani, M. and Cerioli A. (2019). Assessing Robust Methodologies for Clustering Linear Regression Data, "Advances in Data Analysis and Classification", Vol. 13, pp. 227–257, https://doi.org/10.1007/s11634-018-0331-4