MixSimreg

MixSimreg generates k regression hyperplanes in p dimensions with given overlap

Syntax

Description

MixSimreg(k,p) generates k groups in p dimensions. It is possible to control the average and maximum or standard deviation of overlapping.

Notation and background.

Given two generic clusters $i$ and $j$ with $i \ne j=1,...,k$, indexed by $\phi(x,\mu_i,\sigma_i^2)$ and $\phi(x,\mu_j, \sigma_j^2)$ with probabilities of occurrence $\pi_i$ and $\pi_j$, the misclassification probability with respect to cluster $i$ (denoted with $w_{j|i}$) is defined as

\[ Pr[\pi_i \phi(x,\mu_i,\sigma_i^2) < \pi_j \phi(x,\mu_j,\sigma_j^2)] \]

where, in the regression context, $\mu_i={\overline x}_i' \beta_i$ and $\mu_j= \overline x_j' \beta_j$. We assume that the length of vectors $x_i$, $x_j$, $\beta_i$, and $\beta_j$ is $p$ (number of explanatory variables including or excluding intercept). In our implementation, the distribution of the elements of vectors $\beta_i$ ($\beta_j$) can be 'Normal' (with parameters $\mu$ and $\sigma$), 'HalfNormal' (with parameter $\sigma$) or uniform (with parameters $a$ and $b$). Same thing for the distribution of the elements of $x_i$ ($x_j$). However, while the parameters of the distributions are the same for all elements of $\beta$ in all groups, the parameters of the distribution of the elements of vectors $x_i$ ($x_j$) can vary for each group and each explanatory variable. In other words, it is possible to specify (say) that the distribution of the second explanatory variable in the first group is $U(2, 3)$ while the distribution of the third explanatory variable in the second group is $U(2, 10)$.

The matrix containing the misclassification probabilities $w_{j|i}$ is called OmegaMap.

The probability of overlapping between groups i and j is given by

\[ w_{j|i} + w_{i|j} \qquad i,j=1,2, ..., k \]

The diagonal elements of OmegaMap are equal to 1.

The average overlap (BarOmega, in the code) is defined as the sum of the off diagonal elements of OmegaMap (containing the misclassification probabilities) divided by $k*(k-1)/2$.

The maximum overlap (MaxOmega, in the code) is defined as:

\[ \max (w_{j|i} + w_{i|j}) \qquad i \ne j=1,2, ..., k \]

The probability of overlapping $w_{j|i}$ is nothing but the cdf of a linear combination of non central $\chi^2$ distributions with 1 degree of freedom plus a linear combination of $N(0,1)$ evaluated in a point $c$.

The coefficients of the linear combinations of non central $\chi^2$ and $N(0,1)$ depend on the eigenvalues and eigenvectors of matrix $\Sigma_{j|i} = \Sigma^{0.5}_i \Sigma^{-1}_j \Sigma^{0.5}_i$.

Point c depends on the same eigenvalues and eigenvectors of matrix $\Sigma_{j|i}$, the mixing proportions $\pi_i$ and $\pi_j$ and the determinants $|\Sigma_i|$ and $|\Sigma_j|$. This probability is computed using routine ncx2mixtcdf

example

out =MixSimreg(k, p) Example 1: Mixture of regression with prefixed average overlap.

example

out =MixSimreg(k, p, Name, Value) Example 2: Mixture of regression with prefixed average overlap.

Examples

expand all

  • Example 1: Mixture of regression with prefixed average overlap.
  • Generate mixture of regression using an average overlapping at centroids =0.01. Use all default options 1) Beta is generated according to random normal for each group with mu=0 and sigma=1 2) X in each dimension and each group is generated according to U(0, 1) 3) regression hyperplanes contain intercepts

        p=5;
        k=3;
        Q=MixSimreg(k,p,'BarOmega',0.01);
        n=200;
        % Q.Xdistrib.BarX in this case has dimension 5-by-3 and is equal to
        % 1.0000    1.0000    1.0000
        % 0.5000    0.5000    0.5000
        % 0.5000    0.5000    0.5000
        % 0.5000    0.5000    0.5000
        % 0.5000    0.5000    0.5000
        % Probabilities of overlapping are evaluated at
        % Q.Beta(:,1)'*Q.Xdistrib.BarX(:,1) ... Q.Beta(:,3)'*Q.Xdistrib.BarX(:,3)
        [y,X,id]=simdatasetreg(n,Q.Pi,Q.Beta,Q.S,Q.Xdistrib);
        yXplot(y,X(:,2:end),'group',id);
    

  • Example 2: Mixture of regression with prefixed average overlap.
  • Generate mixture of regression hyperplanes using an average overlapping at centroids =0.01. 1) we use all the default options for Beta (random normal for each group with mu=0.5 and sigma=1) 2) X in the second dimension for the third group is generated according to U(1, 3)

        rng(10,'twister')
        % Specify the distribution of the explanatory variables
        Xdistrib=struct;
        Xdistrib.type='Uniform';
        Xdistrib.a=zeros(p-1,k);
        Xdistrib.a(2,3)=1;
        Xdistrib.b=ones(p-1,k);
        Xdistrib.b(2,3)=3;
        % 3) regression hyperplanes contain intercepts
        Q=MixSimreg(k,p,'BarOmega',0.01,'Xdistrib',Xdistrib);
        n=200;
        % Q.Xdistrib.BarX in this case has dimension 5-by-3 and is equal to
        %     1.0000    1.0000    1.0000
        %     0.5000    0.5000    0.5000
        %     0.5000    0.5000    2.0000
        %     0.5000    0.5000    0.5000
        %     0.5000    0.5000    0.5000
        % Probabilitties of overlapping are evaluated at
        % Q.Beta(:,1)'*Q.Xdistrib.BarX(:,1) ... Q.Beta(:,3)'*Q.Xdistrib.BarX(:,3)
        [y,X,id]=simdatasetreg(n,Q.Pi,Q.Beta,Q.S,Q.Xdistrib);
        yXplot(y,X(:,2:end),'group',id);
    

    Related Examples

  • Example 3: Mixture of regression with prefixed average overlap.
  • Exactly as before but now the distribution of beta is N(0 6)

        rng(10,'twister')
        p=5;
        k=3;
        % Specify the distribution of the explanatory variables
        Xdistrib=struct;
        Xdistrib.type='Uniform';
        Xdistrib.a=zeros(p-1,k);
        Xdistrib.a(2,3)=1;
        Xdistrib.b=ones(p-1,k);
        Xdistrib.b(2,3)=3;
        % Specify the distribution of the beta coefficients
        betadistrib=struct;
        betadistrib.type='Normal';
        betadistrib.sigma=6;
        Q=MixSimreg(k,p,'BarOmega',0.01,'Xdistrib',Xdistrib,'betadistrib',betadistrib);
        n=200;
        % Probabilitties of overlapping are evaluated at
        % Q.Beta(:,1)'*Q.Xdistrib.BarX(:,1) ... Q.Beta(:,3)'*Q.Xdistrib.BarX(:,3)
        % Q.betadistrib in this case is equal to
        %      type: 'Normal'
        %     sigma: 6
        %        mu: 0.5000
        [y,X,id]=simdatasetreg(n,Q.Pi,Q.Beta,Q.S,Q.Xdistrib);
        yXplot(y,X(:,2:end),'group',id)
    

  • Example 4: Internation trade data example.
  • All slopes are positive (beta generated using half normal) p=1 and there is no intercept

        rng(10,'twister')
        p=1;
        k=5;
        Xdistrib=struct;
        Xdistrib.intercept=0;
        Xdistrib.type='Uniform';
        Xdistrib.a=zeros(p,k);
        Xdistrib.b=10*ones(p,k);
    
        betadistrib=struct;
        betadistrib.type='HalfNormal';
        betadistrib.sigma=6;
        Q=MixSimreg(k,p,'BarOmega',0.01,'Xdistrib',Xdistrib,'betadistrib',betadistrib);
        n=200;
    
        [y,X,id]=simdatasetreg(n,Q.Pi,Q.Beta,Q.S,Q.Xdistrib);
        yXplot(y,X,'group',id)
    

  • Example 5: Another international trade data example.
  • Here the strips of certain groups are limited up to certain values There is no intercept. In this example we compare high and low overlap among regression hyperplanes

        p=1;
        k=4;
        Xdistrib=struct;
        Xdistrib.intercept=0;
        Xdistrib.type='Uniform';
        Xdistrib.a=zeros(p,k);
        Xdistrib.b=[4 2 10 5];
    
        betadistrib=struct;
        betadistrib.type='HalfNormal';
        betadistrib.sigma=6;
        n=200;
    
        % Strong overlap BarOmega=0.2
        close all
        rng(10,'twister')
        Q=MixSimreg(k,p,'BarOmega',0.2,'Xdistrib',Xdistrib,'betadistrib',betadistrib);
        [y,X,id]=simdatasetreg(n,Q.Pi,Q.Beta,Q.S,Q.Xdistrib);
        yXplot(y,X,'group',id,'tag','Strong_Overlap')
        set(gcf,'Name','Strong overlap')
    
        % Small overlap BarOmega=0.01
        rng(10,'twister')
        Q=MixSimreg(k,p,'BarOmega',0.01,'Xdistrib',Xdistrib,'betadistrib',betadistrib);
        [y,X,id]=simdatasetreg(n,Q.Pi,Q.Beta,Q.S,Q.Xdistrib);
        yXplot(y,X,'group',id,'tag','Small_Overlap')
        set(gcf,'Name','Small overlap')
    

  • Example 6: Betadistrib with a specific parameter for each group.
  •     rng(10,'twister')
        clear all
        close all
        intercept = 1;
        n=200;
        p=1;
        k=2;
    
        Xdistrib=struct;
        Xdistrib.intercept=intercept;
        Xdistrib.type='Uniform';
        Xdistrib.a=zeros(p,k);
        Xdistrib.b=10*ones(p,k);
    
        % beta distributed as HalfNormal
        betadistrib=struct;
        betadistrib.type='HalfNormal';
        betadistrib.sigma=6;
    
        % beta distributed as normal
        betadistrib1=struct;
        betadistrib1.type='Normal';
        betadistrib2 = betadistrib1;
        betadistrib1.mu=  -1;
        betadistrib1.sigma=1;
        betadistrib2.mu=[-1 , 1];
        betadistrib2.sigma=[0.01 , 1];
    
        Q1=MixSimreg(k,p+intercept,'BarOmega',0.01,'Xdistrib',Xdistrib,'betadistrib',betadistrib1);
        Q2=MixSimreg(k,p+intercept,'BarOmega',0.01,'Xdistrib',Xdistrib,'betadistrib',betadistrib2);
    
        [y1,X1,id1]=simdatasetreg(n,Q1.Pi,Q1.Beta,Q1.S,Q1.Xdistrib);
        [y2,X2,id2]=simdatasetreg(n,Q2.Pi,Q2.Beta,Q2.S,Q2.Xdistrib);
        yXplot(y1,X1,'group',id1,'tag','scalar')
        title('Betadistrib is a scalar: same parameters for all betas')
        yXplot(y2,X2,'group',id2,'tag','vector')
        title('Betadistrib is a vector: a parameter for each beta')
        cascade;
    
    

    Input Arguments

    expand all

    k — Number of groups (components). Scalar.

    Desired number of groups.

    Data Types: int16|int32|int64|single|double

    p — Number of explanatory variables for each regression hyperplane (including intercept). Scalar.

    Desired number of variables.

    Data Types: int16|int32|int64|single|double

    Name-Value Pair Arguments

    Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside single quotes (' '). You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

    Example: 'BarOmega',0.05 , 'MaxOmega',0.05 , 'StdOmega',0.05 , 'hom',false , 'restrfactor',8 , 'PiLow',0.1 , 'Xdistrib',1 , 'betadistrib',1 , 'resN',3 , 'tol',[0.01 0.02] , 'lim',0.001 , 'Display','off'

    BarOmega —Requested average overlap.scalar.

    Value of desired average overlap. The default value is ''

    Example: 'BarOmega',0.05

    Data Types: double

    MaxOmega —Requested maximum overlap.scalar.

    Value of desired maximum overlap. If BarOmega is empty the default value of MaxOmega is 0.15.

    Example: 'MaxOmega',0.05

    Data Types: double

    StdOmega —Requested std of overlap.scalar.

    Value of desired standard deviation of overlap.

    Remark 1: The probability of overlapping between two clusters i and j (i \ne j=1, 2, ..., k), called pij, is defined as the sum of the two misclassification probabilities pij=w_{j|i} + w_{i|j} Remark 2: it is possible to specify up to two values among BarOmega MaxOmega and StdOmega.

    Example: 'StdOmega',0.05

    Data Types: double

    hom —Equal Sigmas.scalar boolean.

    Scalar boolean which specifies if the desired clusters have to be heterogeneous or homogeneous:

    hom=false (default) ==> heterogeneous, hom=true ==> homogeneous \Sigma_1 = ... = \Sigma_k

    Example: 'hom',false

    Data Types: boolean

    restrfactor —restriction factor.scalar.

    Scalar in the interval [1 \infty] which specifies the maximum ratio to allow between the largest \sigma^2 and the smallest \sigma^2 which are generated. If, for example, restrfactor=10, after generating the covariance matrices we check that the ratio \sigma^2_i/sigma^2_j i \ne j=1, ..., k is not larger than restrfactor. In order to apply this restriction (which is typical of tclust, we call routine restreigen.m)

    Example: 'restrfactor',8

    Data Types: double

    PiLow —Smallest miximg proportion.scalar.

    Value of the smallest mixing proportion (if 'PiLow' is not reachable with respect to k, equal proportions are taken; PiLow = 1.0 implies equal proportions by default).

    PiLow must be a number in the interval (0 1]

    Example: 'PiLow',0.1

    Data Types: double

    Xdistrib —distribution to use for each explanatory variable.scalar | structure.

    It specifies the distribution to use for each explanatory variable and each group. Once chosen, the distribution is fixed for each explanatory variable and each group; however, the parameters of the chosen distribution may vary across variables and groups. For example, once decided that the distibution of the X is uniform, the second variable of the first group can be defined in [a21 b21] while the third variable of the second group can be defined in [a32 b32].

    - If Xdistrib = 1 the default is to assume that the explanatory variables come from U(0, 1) and that the first explanatory variable is a constant term (the intercept).

    - If Xdistrib is a structure, it may contain information about the distribution (in the fieldname 'type') and the parameters of the distribution. The following options are admitted for Xdistrib:

    Value Description
    intercept

    scalar equal to 1 if intercept is present. The default value of Xdistrib.intercept is 1.

    The other fields of Xdistrib depend on the distribution which is chosen.

    type

    string which identifies the kind of distribution.

    Possibile values are:

    'Normal'; NORMAL DISTRIBUTION N(mu, sigma); In this case the use must supply mu and sigma.

    'Uniform'; UNIFORM DISTRIBUTION U(a, b).

    'HalfNormal'; HALF NORMAL DISTRIBUTION Half(sigma)= |N(0 sigma)|.

    'User'. OTHER DISTRIBUTION. In this case the user must directly provide means of the p explanatory variables for each group.

    mu

    matrix of size (p-1)-by-k if Xdistrib.intercept=1 or p-by-k if Xdistrib.intercept=0 containing the parameters mu for each explanatory variable and each group. The default value of Xdistrib.mu is 0.5*ones(p-1, k).

    Xdistrib.mu is used just if X.distrib.type is normal.

    sigma

    matrix of size (p-1)-by-k if (Xdistrib.intercept=1) or p-by-k if (Xdistrib.intercept=0) containing the parameters sigma for each explanatory variable and each group.

    The default value of Xdistrib.sigma is ones(p-1,k).

    Xdistrib.sigma is used just if X.distrib.type is 'Normal' or if Xdistrib.type is 'HalfNormal'.

    a

    matrix of size (p-1)-by-k if Xdistrib.intercept=1 or p-by-k if Xdistrib.intercept=0 containing the parameters a (the lower limits) for each explanatory variable and each group. The default value of Xdistrib.a is zeros(p-1, k).

    Xdistrib.a is used just if Xdistrib.type is 'Uniform'

    b

    matrix of size (p-1)-by-k if Xdistrib.intercept=1 or p-by-k if Xdistrib.intercept=0 containing the parameters b (the upper limits) for each explanatory variable and each group. The default value of Xdistrib.b is ones(p-1, k).

    Xdistrib.b is used just if Xdistrib.type is 'Uniform'

    BarX

    (p-1)-by k matrix if intercept is present or p-by-k matrix if intercept is not present containing the means of the p explanatory variables for each group.

    Xdistrib.BarX is used just if Xdistrib.type is 'User'

    Example: 'Xdistrib',1

    Data Types: double

    betadistrib —distribution to use for regression coefficients.scalar | structure.

    It specifies the distribution to use for each element of the vectors of regression coefficients.

    Scalar or structure.

    Once chosen, the distribution together with its parameters is fixed for each element of beta, across each group.

    - If betadistrib = 1 the default is to assume that the vector of regression coefficients come from N(0, 1).

    - If betadistrib is a structure it may contain information about the distribution (in the fieldname type) and the parameters of the distribution.

    The following options are admitted for betadistrib:

    Value Description
    type

    string which identifies the kind of distribution.

    Possibile values are:

    'Normal'; NORMAL DISTRIBUTION N(mu, sigma); In this case the user must supply mu and sigma.

    'Uniform'; UNIFORM DISTRIBUTION U(a, b).

    'HalfNormal'; HALF NORMAL DISTRIBUTION Half(sigma)= |N(0 sigma)|.

    'User'. OTHER DISTRIBUTION.

    mu

    scalar, containing parameter mu for the distribution of each element of beta across each group. The default value of betadistrib.mu is 0.

    betadistrib.mu is used just if betadistrib.type is normal.

    sigma

    scalar, containing parameter sigma for the distribution of each element of beta across each group. The default value of betadistrib.sigma is 1.

    betadistrib.sigma is used just if betadistrib.type is Normal or if betadistrib is HalfNormal.

    a

    scalar, containing parameter a for the distribution of each element of beta across each group. The default value of betadistrib.a is 0.

    betadistrib.a is used just if betadistrib.type is 'Uniform'.

    b

    scalar, containing parameter b for the distribution of each element of beta across each group. The default value of betadistrib.b is 1.

    betadistrib.a is used just if betadistrib.type is 'Uniform'.

    Beta

    matrix of size (p-1)-by k (if intercept is present) or p-by-k (if intercept is not present) containing the vectors of regression coefficients for the k groups.

    betadistrib.Beta is used just if betadistrib.type is 'User'.

    Example: 'betadistrib',1

    Data Types: double

    resN —maximum number of attempts.scalar integer.

    Maximum number of mixture re-simulations to find a simulation setting with prespecified level of overlapping.

    The default value of resN is 100.

    Example: 'resN',3

    Data Types: double

    tol —tolerance.vector of length 2.

    - tol(1) (which will be called tolmap) specifies the tolerance between the requested and empirical misclassification probabilities (default is 1e-06).

    - tol(2) (which will be called tolnxc2) specifies the tolerance to use in routine ncx2mixtcdf (which computes the cdf of linear combinations of non central chi2 distributions). The default value of tol(2) 1e-06.

    Example: 'tol',[0.01 0.02]

    Data Types: double

    lim —maximum number of integration terms to use inside routine ncx2mixtcdf.integer.

    Default is 1e06.

    REMARK: Parameters tolncx2=tol(2) and lim are used by function ncx2mixtcdf which computes the cdf of a linear combination of non central chi2 r.v.. This is the probability of misclassification.

    Example: 'lim',0.001

    Data Types: double

    Display —Level of display.logical.

    - 'off' displays no output.

    - 'notify' (default) displays output if requested overlap cannot be reached in a particular simulation.

    - 'iter' displays output at each iteration of each simulation.

    Example: 'Display','off'

    Data Types: char

    Output Arguments

    expand all

    out — description Structure

    Structure which contains the following fields

    Value Description
    OmegaMap

    matrix of misclassification probabilities (k-by-k);

    OmegaMap(i,j) = w_{j|i} is the probability that X, coming from the i-th component (group), is classified to the j-th component.

    BarOmega

    scalar. Value of average overlap. BarOmega is computed as (sum(sum(OmegaMap))-k)/(0.5*k(k-1))

    MaxOmega

    scalar. Value of maximum overlap. MaxOmega is the maximum of OmegaMap(i,j)+OmegaMap(j,i) (i ~= j)=1, 2, ..., k. In other words, MaxOmega= OmegaMap(rcMax(1),rcMax(2))+OmegaMap(rcMax(2),rcMax(1))

    StdOmega

    scalar. Value of standard deviation (std) of overlap.

    StdOmega is the standard deviation of the k*(k-1)/2 probabilities of overlapping

    rcMax

    vector of length 2. It containes the row and column numbers associated with the pair of components producing maximum overlap 'MaxOmega' fail = scalar, flag value. 0 indicates a successful mixture generation, 1 represents failure.

    Pi

    vector of length k containing the mixing proportions.

    Clearly, sum(out.Pi)=1.

    Beta

    p-by-k matrix containing (in each column) the regression coefficients for each group.

    Mu

    vector of length k, consisting of components' mean vectors for each regression hyperplane.

    out.Mu(1)=BarX'Beta(:,1) ... out.Mu(p)=BarX'Beta(:,k)

    S

    k-by-1 vector containing the variances for the k groups

    References

    Maitra, R. and Melnykov, V. (2010), Simulating data to study performance of finite mixture modeling and clustering algorithms, The Journal of Computational and Graphical Statistics, 2:19, 354-376. (to refer to this publication we will use "MM2010 JCGS").

    Melnykov, V., Chen, W.-C., and Maitra, R. (2012), MixSim: An R Package for Simulating Data to Study Performance of Clustering Algorithms, Journal of Statistical Software, 51:12, 1-25.

    Davies, R. (1980), The distribution of a linear combination of chi-square random variables, Applied Statistics, vol. 29, pp. 323-333.

    Parlett, B. N. and C. Reinschthe (1971), Balancing a Matrix for Calculation of Eigenvalues and Eigenvectors, Handbook for Auto. Comp., Vol. II, Linear Algebra, pp. 315-326.

    Parlett, B. N. and C. Reinschthe (1969), Balancing a matrix for calculation of eigenvalues and eigenvectors, Numerische Mathematik, 19, Volume 13, Issue 4, pp 293-304.

    Garcia-Escudero, L.A., Gordaliza, A., Matran, C. and Mayo-Iscar, A. (2008), A General Trimming Approach to Robust Cluster Analysis. Annals of Statistics, Vol.36, 1324-1345. Technical Report available at:

    www.eio.uva.es/inves/grupos/representaciones/trTCLUST.pdf

    Riani M., Cerioli A., Perrotta D. and Torti F. (2015), Simulating mixtures of multivariate data with fixed cluster overlap in FSDA, Advances in data analysis and classification. Volume 9, Issue 4, pp 461-481, DOI 10.1007/s11634-015-0223-9.

    This page has been automatically generated by our routine publishFS