subsets

subsets creates a matrix of indexes where rows are distinct p-subsets extracted from a set of n elements

Syntax

  • C=subsets(nsamp, n, p)example
  • C=subsets(nsamp, n, p, ncomb)example
  • C=subsets(nsamp, n, p, ncomb, msg)example
  • C=subsets(nsamp, n, p, ncomb, msg, method)example
  • [C,nselected]=subsets(___)example

Description

example

C =subsets(nsamp, n, p) Create a matrix with the indexes of 5 subsets when n=100, p=3.

example

C =subsets(nsamp, n, p, ncomb) Create a matrix with the indexes of 5 subsets when n=100, p=3.

example

C =subsets(nsamp, n, p, ncomb, msg) Create a matrix with the indexes of 5 subsets when n=100, p=3.

example

C =subsets(nsamp, n, p, ncomb, msg, method) Create a matrix with the indexes of 5 subsets when n=100, p=3.

example

[C, nselected] =subsets(___) Extract 80000 subsets and check they are unique.

Examples

expand all

  • Create a matrix with the indexes of 5 subsets when n=100, p=3.
  • Only default arguments used.

    C = subsets(5,100,3)
    C =
    
      5×3 int8 matrix
    
        48    97    99
        67    85    90
        52    65    75
        16    49    83
        22    97   100
    
    

  • Create a matrix with the indexes of 5 subsets when n=100, p=3.
  • Use information on the number of combinations to speed up generation.

    ncomb = bc(100,3);
    C = subsets(5,100,3,ncomb)
    C =
    
      5×3 int8 matrix
    
       33   44   75
       13   26   71
        4   80   99
        2   31   39
       10   21   94
    
    

  • Create a matrix with the indexes of 5 subsets when n=100, p=3.
  • Also inform about the time taken for the operation.

    ncomb = bc(1000,5);
    C = subsets(500000,1000,5,ncomb,1);
    Warning: you have specified to extract more than 100000 subsets
    To iterate so many times may be lengthy
    

  • Create a matrix with the indexes of 5 subsets when n=100, p=3.
  • As the previous example, but in addition returns in nselected the number of combinations.

    ncomb = bc(1000,5);
    [C , nselected] = subsets(500000,1000,5,ncomb,1);

  • Extract 80000 subsets and check they are unique.
  • C=subsets(80000,100,5);
    size(unique(C,'rows'))

    Related Examples

    expand all

  • Sampling without replacement and the hypergeometric distribution.
  • clear all; close all;
    % parameters
    n      = 100;
    p      = 3;
    nsamp  = 1000000;
    ncomb  = bc(n,p);
    msg    = 0;
    % Sampling without repetition nsamp p-subsets from a dataset of n units.
    C = subsets(nsamp, n, p, ncomb, msg);
    if verLessThan('matlab','8.4')
    hist(double(C(:))); xlim([1 n]);
    else
    histogram(double(C(:)),'Normalization','pdf','BinMethod','Integers'); xlim([1 n]);
    % this superimposes a line with the unit counts
    frC = tabulateFS(double(C(:)));
    hold on; plot(1:n,frC(:,3)/100,'r-','LineWidth',3);
    end
    % The hypergeometric distribution hygepdf(X,M,K,N) computes the probability
    % of drawing exactly X of a possible K items in N drawings without
    % replacement from a group of M objects. For drawings with replacement,
    % the distribution would be binomial.
    hpdf = hygepdf(0:p,n,n/2,p);
    % Say that the n/2 target items (which determine the success of a draw) are
    % in the subset formed by units 1,2,...n/2. Let's then count how many times
    % we get units from this group.
    c   = C<=n/2;
    sc  = sum(c,2);
    tab = tabulateFS(sc);
    tab = (tab(:,2)/sum(tab(:,2)))';
    disp('Probability of getting 0 to p successes in p drawns (hypergeometric pdf):');
    disp(hpdf);
    disp('Frequencies of the 0 to p successes in the p drawns (subsets output):');
    disp(tab);
    Warning: number of subsets which have been chosen (1000000) are greater than possibile number (161700)
    All subsets are extracted
    Probability of getting 0 to p successes in p drawns (hypergeometric pdf):
        0.1212    0.3788    0.3788    0.1212
    
    Frequencies of the 0 to p successes in the p drawns (subsets output):
        0.1212    0.3788    0.3788    0.1212
    
    
    Click here for the graphical output of this example (link to Ro.S.A. website)

  • Weighted sampling without replacement and the non-central Wallenius hypergeometric distribution.
  • clear all; close all;
    % parameters
    n      = 500;
    p      = 3;
    nsamp  = 50000;
    ncomb  = bc(n,p);
    msg    = 0;
    % Sampling probability of the first n/2 units is 10 times larger than the others n/2.
    method = [10*ones(n/2,1); ones(n/2,1)];
    % no need to normalize weights: method = method(:)' / sum(method);
    C = subsets(nsamp, n, p, ncomb, msg, method);
    if verLessThan('matlab','8.4')
    hist(double(C(:))); xlim([1 n]);
    else
    histogram(double(C(:)),'Normalization','pdf','BinMethod','Integers');
    end
    % Here we address the case when the sampling (without replacement) is biased,
    % in the sense that the probabilities to select the units in the sample are
    % proportional to weights provided using option 'method'. In this case, the
    % extraction probabilities follow Wallenius' noncentral hypergeometric
    % distribution. The sampling scheme is the same of that of the hypergeometric
    % distribution but, in addition, the success and failure are associated with
    % weights w1 and w2 and we will say that the odds ratio is W = w1 / w2. The
    % function is then called as: wpdf = WNChygepdf(x,N,K,M,W).
    for i = 0:p
    wpdf(i+1) = WNChygepdf(i,p,n/2,n,10);
    end
    % counts of the actual samples
    c   = C<=n/2;
    sc  = sum(c,2);
    tab = tabulateFS(sc);
    tab = (tab(:,2)/sum(tab(:,2)))';
    disp('Probability of getting 0 to p successes in p weighted drawns (non-central hypergeometric pdf):');
    disp(wpdf);
    disp('Frequencies of the 0 to p successes in the p weighted drawns (subsets output):');
    disp(tab);
    % The non-central hypergeometric is also available in the R package
    % BiasedUrn. In the example above, where there are just two groups and one
    % weight defining the ratio between the units in the two groups, the function
    % to use is dWNCHypergeo (for Wallenius' distribution):
    %
    % dWNCHypergeo(c(0,1,2,3), 50, 50, 3, 10)
    % [1] 0.0007107089 0.0225823308 0.2296133830 0.7470935773
    %
    % The general syntax of the function is:
    % dWNCHypergeo(x, m1, m2, n, odds)
    % x  = Number of red balls sampled.
    % m1 = Initial number of red balls in the urn.
    % m2 = Initial number of white balls in the urn.
    % n  = Total number of balls sampled.
    % N  = Total number of balls in urn before sampling.
    % odds = Probability ratio of red over white balls.
    % p = Cumulative probability.
    % nran = Number of random variates to generate.
    % mu = Mean x.
    % precision = Desired precision of calculation.
    Probability of getting 0 to p successes in p weighted drawns (non-central hypergeometric pdf):
        0.0007    0.0225    0.2262    0.7505
    
    Frequencies of the 0 to p successes in the p weighted drawns (subsets output):
        0.0007    0.0224    0.2292    0.7476
    
    
    Click here for the graphical output of this example (link to Ro.S.A. website)

  • Weighted sampling without replacement, with negative weights.
  • clear all; close all;
    n = 200;
    p = 3;
    nsamp = 10000;
    ncomb = bc(n,p);
    msg = 0;
    method = [-4*ones(n/4,1); -2*ones(n/4,1) ; -1*ones(n/4,1); -4*ones(n/4,1)];
    C = subsets(nsamp, n, p, ncomb, msg, method);
    if verLessThan('matlab','8.4')
    hist(double(C(:))); xlim([1 n]);
    else
    histogram(double(C(:)),'Normalization','pdf','BinMethod','Integers');
    end

  • Function subset used in clustering or mixture modeling simulations.
  • clear all; close all;
    % parameters
    n      = 100;       %number of units
    p      = 2;         %number of variables
    k      = 3;         %number of groups
    nsamp  = 500;       %number of samples
    ncomb  = bc(n,p);
    msg    = 0;
    % A dataset simulated using MixSim
    rng(372,'twister');
    Q=MixSimreg(k,p,'BarOmega',0.001);
    [y,X,id]=simdatasetreg(n,Q.Pi,Q.Beta,Q.S,Q.Xdistrib);
    % Some user-defined weights for weighted sampling, provided as a vector of "method" option.
    method = [1*ones(n/2,1); ones(n/2,1)];
    % C must be a nsamp-by-k*p matrix, to contain the extraction of nsamp p-combinations k times.
    % This can be easily done as follows:
    for i=1:k
    Ck(:,(i-1)*p+1:i*p) = subsets(nsamp, n, p, ncomb, msg, method);
    end
    % Ck is then provided, e.g., to tclustreg as follows:
    out=tclustreg(y,X,k,50,0.01,0.01,'nsamp',Ck);
    Warning: columns 1  2 are constant and just col 1 has been kept!
    ClaLik with untrimmed units selected using crisp criterion
    sample = 1, concentration step = 1
    sample = 1, concentration step = 2
    sample = 1, concentration step = 3
    sample = 1, concentration step = 4
    sample = 1, concentration step = 5
    sample = 1, concentration step = 6
    sample = 1, concentration step = 7
    sample = 1, concentration step = 8
    sample = 1, concentration step = 9
    sample = 1, concentration step = 10
    sample = 1, concentration step = 11
    sample = 42, concentration step = 9
    sample = 42, concentration step = 10
    sample = 42, concentration step = 11
    Total estimated time to complete tclustreg: 16.97 seconds 
    sample = 58, concentration step = 7
    sample = 58, concentration step = 8
    sample = 58, concentration step = 9
    sample = 58, concentration step = 10
    sample = 58, concentration step = 11
    sample = 134, concentration step = 5
    sample = 134, concentration step = 6
    sample = 134, concentration step = 7
    sample = 134, concentration step = 8
    sample = 134, concentration step = 9
    sample = 134, concentration step = 10
    sample = 134, concentration step = 11
    sample = 157, concentration step = 4
    sample = 157, concentration step = 5
    sample = 157, concentration step = 6
    sample = 157, concentration step = 7
    sample = 157, concentration step = 8
    sample = 157, concentration step = 9
    sample = 157, concentration step = 10
    sample = 157, concentration step = 11
    sample = 190, concentration step = 10
    sample = 190, concentration step = 11
    sample = 256, concentration step = 3
    sample = 256, concentration step = 4
    sample = 256, concentration step = 5
    sample = 256, concentration step = 6
    sample = 256, concentration step = 7
    sample = 256, concentration step = 8
    sample = 256, concentration step = 9
    sample = 256, concentration step = 10
    sample = 256, concentration step = 11
    
    Click here for the graphical output of this example (link to Ro.S.A. website)

    Input Arguments

    expand all

    nsamp — Number of subsamples which have to be extracted. Scalar; if nsamp=0 all subsets will be extracted; they will be (n choose p).

    Data Types: single|double

    n — Number of observations of the dataset. Scalar.

    Data Types: single|double

    p — Size of the subsets. Scalar.

    In regression with p explanatory variable the size of the elmental subsets is p;

    in multivariate analysis, in presence of v variables, the size of the elemental subsets is v+1.

    Data Types: single|double

    Optional Arguments

    ncomb — scalar (n choose p). If the user has already computed this value it can supply it directly, otherwise the program will calculate it automatically.

    Example: C=subsets(20,10,3,120)

    Data Types: single|double

    msg — scalar which controls whether to display or not messages on the screen. If msg=1 (default), messages are displayed on the screen about estimated time.

    Example: C=subsets(20,10,3,120,0)

    Data Types: boolean

    method — Sampling methods. Scalar or vector.

    Methods used to extract the subsets. See section 'More About' of function randsampleFS for details about the sampling methods. Default is method = 1.

    - Scalar, from 0 to 3 determining the (random sample without replacement) method to be used.

    - Vector of weights: in such a case, Weighted Sampling Without Replacement is applied using that vector of weights.

    Example: randsampleFS(100,10,2)

    Data Types: single|double

    Output Arguments

    expand all

    C —The indices of the subsets which need to be extracted. Matrix with nselected rows and p columns (stored in int16 format)

    Data Types - single|double

    nselected —Number of rows of matrix C. Scalar

    Data Types - single|double

    References

    Wong, C.K. and M.C. Easton (1980) "An Efficient Method for Weighted Sampling Without Replacement", SIAM Journal of Computing, 9(1):111-113.

    This page has been automatically generated by our routine publishFS