# SparseTableTest

SparseTableTest computes an independence test for large and sparse contingency tables

## Syntax

• out=SparseTableTest(N)
• out=SparseTableTest(N,Name,Value)

## Description

This function implements a new test of independence between the row variable ('outcomes') and the column variable ('treatments'), which is especially suited to the analysis of large and sparse $I$-by-$J$ contingency tables. The procedure collapses the original table into a 2-by-2 table for each cell whose count is greater than a small number (set by the optional input parameter 'threshold'), and then tests each collapsed table for independence. Any of the following tests can be used: the Fisher exact test (default), the Barnard test, or a test from the power divergence family of Cressie and Read.
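
The collapsing step can be sketched as follows. This is an illustration only, not the function's actual code: it assumes the usual way of collapsing an $I$-by-$J$ table around cell $(i,j)$, in which the four entries are the cell itself, the rest of row $i$, the rest of column $j$, and all remaining cells. The table is the one used in the Cressie and Read example on this page; the choice of cell (2,3) and the variable names are arbitrary.

```matlab
% Illustrative sketch (assumed collapsing scheme, not taken from the FSDA source)
N = [2365 944 412; 249 585 276];   % contingency table from the example on this page
i = 2; j = 3;                      % cell of interest, chosen arbitrarily

n     = sum(N(:));                 % grand total
nidot = sum(N(i,:));               % total of row i
ndotj = sum(N(:,j));               % total of column j

% 2-by-2 table: [cell (i,j), rest of row i; rest of column j, all other cells]
collapsed = [N(i,j),         nidot - N(i,j); ...
             ndotj - N(i,j), n - nidot - ndotj + N(i,j)];
disp(collapsed)
```

The collapsed table keeps the grand total of the original table, so each 2-by-2 test uses all $n$ observations.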

Because of the Bonferroni inequality, a sufficient condition for attaining a significance level $\alpha$ for this test (i.e., the probability of detecting a positive association between two levels of the response variables when in fact there is no such association) is that the test for each cell of the $I$-by-$J$ table rejects at a significance level equal to $\alpha$ divided by the number of comparisons. An additional benefit of the procedure is that it highlights the contribution of each single entry of the original $I$-by-$J$ two-way table to the association. The original idea of this test is due to Spyros Arsenis (Joint Research Centre of the European Commission) and has been successfully applied to the analysis of contingency tables from international trade data.
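
The Bonferroni argument can be written out explicitly. As a sketch, let $m$ denote the number of comparisons (a symbol introduced here for illustration) and let $A_k$ be the event that the $k$-th test rejects although there is no association. If every test is run at level $\alpha/m$, then

$$
P\Big(\bigcup_{k=1}^{m} A_k\Big) \;\le\; \sum_{k=1}^{m} P(A_k) \;\le\; m \cdot \frac{\alpha}{m} \;=\; \alpha .
$$

The bound holds whatever the dependence between the individual tests, which matters here because all collapsed tables are built from the same original table.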

out=SparseTableTest(N) runs SparseTableTest with all default options.

out=SparseTableTest(N,Name,Value) specifies additional options using one or more Name,Value pair arguments, for example the Cressie and Read test on the collapsed contingency tables.

## Examples

### SparseTableTest with all default options.

    % For reproducibility
    rng default;
    x1 = unidrnd(3,50,1);
    x2 = unidrnd(3,50,1);
    % Cross-tabulate x1 and x2.
    InputTable = crosstab(x1,x2);
    out = SparseTableTest(InputTable);


### Cressie and Read test on collapsed contingency table.

    la=2/3;
    % Input is a matrix.
    % T = Contingency Table for Car Accident Type (rows) by
    % Accident Severity (columns)
    T=[2365 944 412; 249 585 276];
    out=SparseTableTest(T,'testname',la);


### Chi-squared test on collapsed contingency table.

    % Input is a data matrix and the contingency table has to be built.
    load smoke
    % X = original data matrix
    X=smoke.data;
    % Chi-squared test is used on collapsed 2-by-2 tables.
    % Cells with a frequency less than or equal to 15 are ignored.
    out=SparseTableTest(X,'datamatrix',true,'threshold',15,'testname',1);
    % show the output obtained
    RejectedBonf = out.RejectedBonf
    RejectedSidak = out.RejectedSidak
    TestResults = out.TestResults

    RejectedBonf =

      5×4 logical array

       0   0   0   0
       0   0   0   0
       1   0   0   0
       1   0   0   0
       0   0   0   0

    RejectedSidak =

      5×4 logical array

       0   0   0   0
       0   0   0   0
       1   0   0   0
       1   0   0   0
       0   0   0   0

    TestResults =

           Inf       Inf       Inf       Inf
           Inf       Inf       Inf       Inf
        0.0018       Inf       Inf       Inf
        0.0023    0.2340    0.1432       Inf
           Inf       Inf       Inf       Inf



## Input Arguments

### N — Contingency table (default) or n-by-2 input dataset. Matrix or Table.

Matrix or table which contains either the input contingency table (say of size I-by-J) or the original n-by-2 data matrix. In the latter case the contingency table is computed as crosstab(N(:,1),N(:,2)).

By default the procedure assumes that the input is a contingency table.
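
As a minimal sketch of the data matrix case, the snippet below cross-tabulates the two columns of a small data matrix. The values of X are made up for illustration, and crosstab requires the Statistics and Machine Learning Toolbox.

```matlab
% Hypothetical 6-by-2 data matrix: one row per observation,
% column 1 = row category, column 2 = column category
X = [1 1; 1 2; 2 1; 2 2; 2 2; 3 1];

% Contingency table built from the two columns of X
N = crosstab(X(:,1), X(:,2));
disp(N)

% With FSDA installed, the data matrix can be passed directly
% (call documented on this page):
% out = SparseTableTest(X,'datamatrix',true);
```

Each entry of N counts the observations that share the same pair of categories, so the entries sum to the number of rows of X.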

Data Types: single | double

### Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside single quotes (' '). You can specify several name and value pair arguments in any order as  Name1,Value1,...,NameN,ValueN.

Example:  'threshold',3 , 'alpha',0.05 , 'testname',1 , 'datamatrix',true 

### threshold — Threshold to select collapsed contingency tables. Scalar.

Scalar which specifies the count above which a collapsed contingency table is produced: only cells of the input table with a count greater than threshold are tested. The default value of threshold is 2.
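
The effect of threshold can be sketched with a logical mask. The table below is hypothetical, and the snippet only illustrates the selection rule stated on this page (cells with a count less than or equal to threshold are not tested); it is not the function's own code.

```matlab
% Hypothetical sparse 3-by-3 contingency table
N = [12 0 3; 1 7 2; 0 2 25];
threshold = 2;              % documented default

% Cells with a count strictly greater than threshold are tested
tested = N > threshold;
nTests = nnz(tested);       % number of cells that would be tested

disp(tested)
fprintf('Cells tested: %d out of %d\n', nTests, numel(N));
```

In a sparse table most cells fall at or below the threshold, so the number of 2-by-2 tests can be much smaller than $I \times J$.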

Example:  'threshold',3 

Data Types: single | double | int32 | int64

### alpha — Significance level. Scalar value in the range (0,1).

Significance level of the hypothesis test, specified as the comma-separated pair consisting of 'alpha' and a scalar value in the range (0,1). The default value of alpha is 0.01.

Example:  'alpha',0.05 

Data Types: single | double

### testname — Test to use on collapsed 2-by-2 tables. Char | double.

If testname is a number, it specifies the value of $\lambda$ of the power divergence family to use. See function CressieRead for further details. If testname is a character array, possible values are 'Fisher' (Fisher exact test, see function fishertest) or 'Barnard' (Barnard exact test, see function barnardtest). The default value of testname is 1, that is, the $\chi^2$ test is used. Note also that fishertest was introduced in MATLAB in release 2014b.
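
To make the role of $\lambda$ concrete, the sketch below evaluates the textbook power divergence statistic $\frac{2}{\lambda(\lambda+1)}\sum_{ij} n_{ij}\left[(n_{ij}/e_{ij})^{\lambda}-1\right]$ on a 2-by-2 table, where $e_{ij}$ are the expected counts under independence. It does not call the FSDA function CressieRead, whose implementation and p-value computation may differ; the table T and the asymptotic $\chi^2_1$ p-value are assumptions made for illustration. chi2cdf requires the Statistics and Machine Learning Toolbox.

```matlab
% Illustrative 2-by-2 table (values chosen for the example)
T = [276 834; 412 3309];
n = sum(T(:));

% Expected counts under independence: (row totals * column totals) / n
E = sum(T,2) * sum(T,1) / n;

% Power divergence statistic for a given lambda (lambda ~= 0 and ~= -1)
powdiv = @(la) 2/(la*(la+1)) * sum(T(:) .* ((T(:)./E(:)).^la - 1));

statChi2 = powdiv(1);       % lambda = 1 gives Pearson's chi-squared statistic
statCR   = powdiv(2/3);     % lambda = 2/3, as in the Cressie and Read example

% Direct Pearson statistic, for comparison with lambda = 1
pearson = sum((T(:) - E(:)).^2 ./ E(:));

% Asymptotic p-value with (2-1)*(2-1) = 1 degree of freedom (assumption)
pvalCR = chi2cdf(statCR, 1, 'upper');
fprintf('lambda=1: %.4f, Pearson: %.4f, lambda=2/3: %.4f\n', statChi2, pearson, statCR);
```

Setting $\lambda=1$ recovers the usual $\chi^2$ statistic, which is why testname equal to 1 corresponds to the $\chi^2$ test.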

Example:  'testname',1 

Data Types: single | double | char

### datamatrix — Data matrix or contingency table. Boolean.

If datamatrix is true, the first input argument N is interpreted as a data matrix; if it is false, N is treated as a contingency table. The default value of datamatrix is false, that is, the procedure treats N as a contingency table.

Example:  'datamatrix',true 

Data Types: logical

## Output Arguments

### out — Structure

Structure which contains the following fields:

| Value | Description |
| --- | --- |
| TestResults | p-values based on collapsed contingency tables. I-by-J matrix. The $(i,j)$-th entry of the TestResults matrix is the p-value of the selected test (see testname) based on the collapsed $(i,j)$-th table. If the $(i,j)$-th entry of the input matrix N is less than or equal to the input parameter threshold, the test is not performed and the corresponding $(i,j)$-th entry of TestResults is equal to Inf. |
| RejectedBonf | Results of the tests based on the Bonferroni threshold. Boolean matrix. The $(i,j)$-th entry of the RejectedBonf matrix is true if the corresponding test based on the collapsed $(i,j)$-th table is significant. The Bonferroni threshold is used. |
| RejectedSidak | Results of the tests based on the Sidak threshold. Boolean matrix. The $(i,j)$-th entry of the RejectedSidak matrix is true if the corresponding test based on the collapsed $(i,j)$-th table is significant. The Sidak threshold is used. |
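
The relation between TestResults and the two rejection matrices can be sketched as follows. The p-values are copied from the output of the smoke example on this page. The snippet assumes that the number of comparisons equals the number of cells actually tested (the finite entries of TestResults) and uses the standard Bonferroni ($\alpha/m$) and Sidak ($1-(1-\alpha)^{1/m}$) thresholds; these are assumptions, not a description of the function's internals, although they reproduce the flags printed in that example.

```matlab
% p-values from the smoke example on this page (Inf = cell not tested)
TestResults = [Inf    Inf    Inf    Inf;
               Inf    Inf    Inf    Inf;
               0.0018 Inf    Inf    Inf;
               0.0023 0.2340 0.1432 Inf;
               Inf    Inf    Inf    Inf];

alpha = 0.01;                          % documented default significance level
m = nnz(isfinite(TestResults));        % assumed number of comparisons

bonfThresh  = alpha / m;               % Bonferroni threshold
sidakThresh = 1 - (1 - alpha)^(1/m);   % Sidak threshold (standard definition)

% Inf entries never fall below the thresholds, so untested cells stay false
RejectedBonf  = TestResults < bonfThresh;
RejectedSidak = TestResults < sidakThresh;

disp(RejectedBonf)
disp(RejectedSidak)
```

The Sidak threshold is always slightly larger than the Bonferroni one, so any cell rejected by Bonferroni is also rejected by Sidak.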

$N$ = $I$-by-$J$ contingency table. The $(i,j)$-th element is equal to $n_{ij}$, $i=1, 2, \ldots, I$ and $j=1, 2, \ldots, J$. The sum of the elements of $N$ is $n$ (the grand total). The sum of the elements of the $i$-th row of the contingency table is denoted by $n_{i.}$ (n_idot in the code). The sum of the elements of the $j$-th column of the contingency table is denoted by $n_{.j}$ (n_dotj in the code).

$P$ = $I$-by-$J$ table containing the correspondence matrix (proportions). The $(i,j)$-th element is equal to $p_{ij}=n_{ij}/n$, $i=1, 2, \ldots, I$ and $j=1, 2, \ldots, J$. The sum of the elements of $P$ is 1.

$P^*$ = $I$-by-$J$ table containing the correspondence matrix (proportions) under the hypothesis of independence. The $(i,j)$-th element is equal to $p_{ij}^*=p_{i.}p_{.j}$, where $p_{i.}=n_{i.}/n$ and $p_{.j}=n_{.j}/n$, $i=1, 2, \ldots, I$ and $j=1, 2, \ldots, J$. The sum of the elements of $P^*$ is 1.
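
The quantities defined above can be computed in a few lines. The sketch uses the contingency table from the Cressie and Read example on this page; the variable names P and Pstar are chosen here for illustration.

```matlab
N = [2365 944 412; 249 585 276];   % contingency table from the example on this page
n = sum(N(:));                     % grand total

P = N / n;                         % correspondence matrix, p_ij = n_ij / n

pidot = sum(P, 2);                 % row masses    p_i. = n_i. / n (I-by-1)
pdotj = sum(P, 1);                 % column masses p_.j = n_.j / n (1-by-J)

Pstar = pidot * pdotj;             % p*_ij = p_i. * p_.j under independence

fprintf('sum(P) = %.4f, sum(Pstar) = %.4f\n', sum(P(:)), sum(Pstar(:)));
```

Both matrices sum to 1, and they have the same row and column totals; they differ only in how the mass is spread over the cells.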

## References

Arsenis, S. and Riani, M. (2019), Data mining large contingency tables standard approaches and a new method, in preparation.