SparseTableTest

SparseTableTest computes an independence test for large and sparse contingency tables

Syntax

out=SparseTableTest(N)
out=SparseTableTest(N,Name,Value)

Description

This function implements a new test of independence between the distribution of the row variable ('outcomes') and the column variable ('treatments'), which is especially suited to the analysis of large and sparse $I$-by-$J$ contingency tables. The procedure collapses the original table into a 2-by-2 table for each cell whose count exceeds a small number (set by the optional input parameter 'threshold') and then tests each collapsed table for independence. Any test can be used: the Fisher exact test (default), the Barnard test, or a member of the power divergence family of Cressie and Read.
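As a rough illustration of the collapsing step, the sketch below builds the 2-by-2 table associated with one cell by contrasting its row with all other rows and its column with all other columns. This cell-versus-rest layout is an assumption about how the collapse is done, and the helper name collapseCell is hypothetical: it is not part of the toolbox.

```matlab
% Hypothetical helper (not part of the toolbox): collapse an I-by-J table
% into the 2-by-2 table associated with cell (i,j).
% ASSUMPTION: the collapse contrasts row i with the other rows and
% column j with the other columns.
collapseCell = @(N,i,j) [ ...
    N(i,j),                sum(N(i,:)) - N(i,j); ...
    sum(N(:,j)) - N(i,j),  sum(N(:)) - sum(N(i,:)) - sum(N(:,j)) + N(i,j)];

% 2-by-3 contingency table used for illustration.
T = [2365 944 412; 249 585 276];
% Collapsed table for cell (1,1).
T11 = collapseCell(T,1,1);
disp(T11)
```

The collapsed table keeps the grand total of the original table, so each 2-by-2 test uses all the observations.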

Because of the Bonferroni inequality, a sufficient condition for attaining a significance level $\alpha$ for this test (i.e., the probability of detecting a positive association between two levels of the response variables when in fact there is no such association) is that the test on each cell of the $I$-by-$J$ table rejects at a significance level equal to $\alpha$ divided by the number of comparisons made. An additional benefit of the procedure is that it highlights the contribution of each single entry of the original $I$-by-$J$ two-way table to the association. The original idea of this test is due to Spyros Arsenis (Joint Research Centre of the European Commission) and has been successfully applied to the analysis of contingency tables coming from international trade data.
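The Bonferroni argument can be written out in one line. The symbols $m$ (number of cells actually tested) and $R_k$ (the event that the $k$-th test rejects under the null hypothesis) are introduced here only for illustration.

```latex
\Pr\Bigl(\bigcup_{k=1}^{m} R_k\Bigr)
  \;\le\; \sum_{k=1}^{m} \Pr(R_k)
  \;\le\; m \cdot \frac{\alpha}{m}
  \;=\; \alpha
```

The bound holds whatever the dependence among the $m$ tests, which matters here because all the collapsed tables are built from the same data.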

out =SparseTableTest(N) runs the test with all default options.

out =SparseTableTest(N, Name, Value) runs the test with additional options specified by one or more Name,Value pair arguments, for example the Cressie and Read test on the collapsed contingency tables.

Examples

SparseTableTest with all default options.

% Fix the random seed for reproducibility.
rng default;
x1 = unidrnd(3,50,1);
x2 = unidrnd(3,50,1);
% Cross-tabulate x1 and x2.
InputTable = crosstab(x1,x2);
% Run the test with all default options.
out = SparseTableTest(InputTable);

Cressie and Read test on collapsed contingency table.

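% la = value of lambda in the power divergence family
% (2/3 is the value suggested by Cressie and Read).
la=2/3;
% Input is a matrix.
% T = contingency table for car accident type (rows) by
% accident severity (columns).
T=[2365 944 412; 249 585 276];
out=SparseTableTest(T,'testname',la);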

Related Examples

Chi-squared test on collapsed contingency table.

The input is a data matrix, so the contingency table has to be built first.

load smoke
% X = original data matrix
X=smoke{:,:};
% Chi-squared test is used on collapsed 2-by-2 tables.
% Cells which have a frequency smaller than or equal to 15 are ignored.
out=SparseTableTest(X,'datamatrix',true,'threshold',15,'testname',1);
% show the output obtained
RejectedBonf = out.RejectedBonf
RejectedSidak = out.RejectedSidak
TestResults = out.TestResults

RejectedBonf =

  5×4 logical array

   0   0   0   0
   0   0   0   0
   1   0   0   0
   1   0   0   0
   0   0   0   0


RejectedSidak =

  5×4 logical array

   0   0   0   0
   0   0   0   0
   1   0   0   0
   1   0   0   0
   0   0   0   0


TestResults =

       Inf       Inf       Inf       Inf
       Inf       Inf       Inf       Inf
    0.0018       Inf       Inf       Inf
    0.0023    0.2340    0.1432       Inf
       Inf       Inf       Inf       Inf
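The two rejection matrices shown above can be recovered from the displayed p-values with the sketch below. It assumes that the number of comparisons is the number of tests actually performed (the finite entries of TestResults), that rejection uses a strict inequality, and that the Sidak per-test level is the usual $1-(1-\alpha)^{1/m}$. None of these is stated explicitly on this page, so treat the code as an illustration, not as the function's implementation.

```matlab
% p-values as displayed above (rounded to 4 decimals); Inf = untested cell.
TestResults = [Inf Inf Inf Inf; Inf Inf Inf Inf; 0.0018 Inf Inf Inf; ...
               0.0023 0.2340 0.1432 Inf; Inf Inf Inf Inf];
alpha = 0.01;                        % default significance level
m = sum(isfinite(TestResults(:)));   % ASSUMPTION: number of tests performed
threshBonf  = alpha/m;               % Bonferroni per-test level
threshSidak = 1-(1-alpha)^(1/m);     % Sidak per-test level
% Untested cells (Inf) can never be rejected.
RejectedBonf  = TestResults < threshBonf;
RejectedSidak = TestResults < threshSidak;
% Row and column indices of the cells flagged by the Bonferroni rule.
[iRej,jRej] = find(RejectedBonf);
```

With four tests the two per-test levels are almost identical (0.0025 versus roughly 0.00251), which is why both rules flag the same two cells in this example.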

Input Arguments

`N` — Contingency table (default) or n-by-2 input dataset. Matrix or Table.

Matrix or table which contains the input contingency table (say of size I-by-J) or the original n-by-2 data matrix. In the latter case the contingency table is computed as crosstab(N(:,1),N(:,2)).

By default the procedure assumes that the input is a contingency table.

Data Types: single | double

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside single quotes (' '). You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: 'threshold',3, 'alpha',0.05, 'testname',1, 'datamatrix',true

`threshold` — Threshold to select collapsed contingency tables. Scalar.

Scalar which specifies the cell count above which a collapsed contingency table is produced and tested. Cells whose count is smaller than or equal to threshold are not tested. The default value of threshold is 2.

Example: 'threshold',3

Data Types: single | double | int32 | int64

`alpha` — Significance level. Scalar value in the range (0,1).

Significance level of the hypothesis test, specified as the comma-separated pair consisting of 'alpha' and a scalar value in the range (0,1). The default value of alpha is 0.01.

Example: 'alpha',0.05

Data Types: single | double

`testname` — Test to use on collapsed 2-by-2 tables. char | double.

If testname is a number, it identifies the value of $\lambda$ of the power divergence family to use. See function CressieRead for further details. If testname is a character array, possible values are 'Fisher' (to use the Fisher exact test, see function fishertest) or 'Barnard' (to use the Barnard exact test, see function barnardtest). The default value of testname is 1, that is the $\chi^2$ test is used. Note also that fishertest was introduced in MATLAB in release 2014b.

Example: 'testname',1

Data Types: single | double | char
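To make the role of $\lambda$ concrete, the sketch below evaluates the power divergence statistic of Cressie and Read, $\frac{2}{\lambda(\lambda+1)}\sum_{ij} n_{ij}\left[(n_{ij}/e_{ij})^{\lambda}-1\right]$, on a 2-by-2 table, where $e_{ij}$ are the expected counts under independence. The helper names expected and powerDiv are hypothetical and the code is not taken from function CressieRead. The formula as written applies only to $\lambda$ different from 0 and -1.

```matlab
% Hypothetical helpers (not toolbox functions).
% Expected counts under independence for an observed table O.
expected = @(O) sum(O,2)*sum(O,1)/sum(O(:));
% Power divergence statistic, valid for la different from 0 and -1.
powerDiv = @(O,la) 2/(la*(la+1)) * sum(sum(O.*((O./expected(O)).^la - 1)));

O = [2365 1356; 249 861];       % an illustrative 2-by-2 table
% Pearson chi-squared statistic computed directly.
pearson = sum(sum((O-expected(O)).^2./expected(O)));
statChi2 = powerDiv(O,1);       % lambda = 1 reproduces Pearson chi-squared
statCR   = powerDiv(O,2/3);     % lambda = 2/3, as in 'testname',2/3
```

Setting $\lambda=1$ gives the Pearson statistic because the observed and expected counts have the same total, so the sum reduces to $\sum_{ij}(n_{ij}-e_{ij})^2/e_{ij}$.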

`datamatrix` — Data matrix or contingency table. Boolean.

If datamatrix is true, the first input argument N is interpreted as a data matrix. If datamatrix is false, N is treated as a contingency table. The default value of datamatrix is false, that is the procedure automatically considers N as a contingency table.

Example: 'datamatrix',true

Data Types: logical

Output Arguments

`out` — Structure

Structure which contains the following fields:

Value	Description
`TestResults`	p-values based on collapsed contingency tables. I-by-J matrix. The $(i,j)$-th entry of the TestResults matrix is the p-value of the selected test (see testname) based on the collapsed $(i,j)$-th table. If the $(i,j)$-th entry of the input contingency table is smaller than or equal to the input parameter threshold, the test is not performed and the corresponding $(i,j)$-th entry of matrix TestResults is equal to Inf.
`RejectedBonf`	Results of the tests based on the Bonferroni threshold. I-by-J Boolean matrix. The $(i,j)$-th entry of the RejectedBonf matrix is true if the corresponding test based on the collapsed $(i,j)$-th table is significant when the Bonferroni threshold is used.
`RejectedSidak`	Results of the tests based on the Sidak threshold. I-by-J Boolean matrix. The $(i,j)$-th entry of the RejectedSidak matrix is true if the corresponding test based on the collapsed $(i,j)$-th table is significant when the Sidak threshold is used.

More About

Additional Details

$N$ = $I$-by-$J$ contingency table. The $(i,j)$-th element is equal to $n_{ij}$, $i=1, 2, \ldots, I$ and $j=1, 2, \ldots, J$. The sum of the elements of $N$ is $n$ (the grand total). The sum of the elements of the $i$-th row of the contingency table is denoted by $n_{i.}$ (n_idot in the code). The sum of the elements of the $j$-th column of the contingency table is denoted by $n_{.j}$ (n_dotj in the code).

$P$ = $I$-by-$J$ table containing the correspondence matrix (proportions). The $(i,j)$-th element is equal to $p_{ij}=n_{ij}/n$, $i=1, 2, \ldots, I$ and $j=1, 2, \ldots, J$. The sum of the elements of $P$ is 1.

$P^*$ = $I$-by-$J$ table containing the correspondence matrix (proportions) under the hypothesis of independence. The $(i,j)$-th element is equal to $p_{ij}^*=p_{i.}p_{.j}$, where $p_{i.}=n_{i.}/n$ and $p_{.j}=n_{.j}/n$, $i=1, 2, \ldots, I$ and $j=1, 2, \ldots, J$.

The sum of the elements of $P^*$ is 1.
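The quantities defined above can be computed in a few lines. The sketch below uses a small 2-by-3 table for illustration. The variable names n_idot and n_dotj follow the names quoted above, while P and Pstar are chosen here for readability.

```matlab
N = [2365 944 412; 249 585 276];   % illustrative 2-by-3 contingency table
n = sum(N(:));                     % grand total
n_idot = sum(N,2);                 % row totals (I-by-1 vector)
n_dotj = sum(N,1);                 % column totals (1-by-J vector)
P = N/n;                           % correspondence matrix
% Proportions under independence: outer product of the marginal proportions.
Pstar = (n_idot/n)*(n_dotj/n);
```

Both P and Pstar sum to 1, and Pstar has the same row and column totals as P.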

References

Arsenis, S. and Riani, M. (2019), Data mining large contingency tables standard approaches and a new method, in preparation.
