# corrNominal

corrNominal measures the strength of association between two unordered (nominal) categorical variables.

## Syntax

• out=corrNominal(N)
• out=corrNominal(N,Name,Value)

## Description

corrNominal computes $\chi^2$, $\Phi$, Cramer's $V$, Goodman-Kruskal's $\lambda_{y|x}$, Goodman-Kruskal's $\tau_{y|x}$, and Theil's $H_{y|x}$ (uncertainty coefficient).

All these indexes measure the association between two unordered qualitative variables.

Additional details about these indexes can be found in the "More About" section or in the "Output Arguments" section of this document.

out=corrNominal(N) computes the indexes with all the default options.

out=corrNominal(N,Name,Value) computes the indexes using additional options specified as Name,Value pair arguments, for example conflev.

## Examples


### corrNominal with all the default options.

Rows of N indicate the type of Bachelor degree: 'Economics', 'Law', 'Literature'. Columns of N indicate the employment type: 'Private_firm', 'Public_firm', 'Freelance', 'Unemployed'.

    N=[150  80  20  50
        80 250  30 140
        30  50   0 120];
    out=corrNominal(N);

    Chi2 index
      221.2405

    Phi index
        0.4704

    Cramer's V
        0.3326

    Test of H_0: independence between rows and columns
                         Coeff          se      zscore        pval
                      ________    ________      ______  __________

    CramerV             0.3326    0.021769      15.278           0
    GKlambdayx         0.22581    0.028383      7.9556  1.7764e-15
    tauyx             0.091674    0.013524      6.7788  1.2121e-11
    Hyx                0.08716    0.011265      7.7374  1.0214e-14

    -----------------------------------------
    Indexes and 95% confidence limits
                         Value    StandardError    ConflimL    ConflimU
                      ________    _____________    ________    ________

    CramerV             0.3326         0.021769     0.28993     0.37687
    GKlambdayx         0.22581         0.028383     0.17018     0.28144
    tauyx             0.091674         0.013524    0.065168     0.11818
    Hyx                0.08716         0.011265    0.065082     0.10924



### Example of option conflev.

Use data from Goodman and Kruskal (1954).

    N=[1768  807  189  47
        946 1387  746  53
        115  438  288  16];
    out=corrNominal(N,'conflev',0.99);

    Chi2 index
       1.0735e+03

    Phi index
        0.3973

    Cramer's V
        0.2810

    Test of H_0: independence between rows and columns
                         Coeff           se    zscore    pval
                      ________    _________    ______    ____

    CramerV            0.28095    0.0085086     33.02       0
    GKlambdayx         0.19239     0.012158    15.825       0
    tauyx             0.080883    0.0046282    17.476       0
    Hyx               0.075341    0.0041619    18.102       0

    -----------------------------------------
    Indexes and 99% confidence limits
                         Value    StandardError    ConflimL    ConflimU
                      ________    _____________    ________    ________

    CramerV            0.28095        0.0085086     0.25904     0.30314
    GKlambdayx         0.19239         0.012158     0.16108     0.22371
    tauyx             0.080883        0.0046282    0.068962    0.092805
    Hyx               0.075341        0.0041619    0.064621    0.086061



### corrNominal with option dispresults.

    N=[ 6 14 17 9;
       30 32 17 3];
    out=corrNominal(N,'dispresults',false);


### Example which starts from the original data matrix.

    N=[26 26 23 18  9;
        6  7  9 14 23];
    % From the contingency table reconstruct the original data matrix.
    n11=N(1,1); n12=N(1,2); n13=N(1,3); n14=N(1,4); n15=N(1,5);
    n21=N(2,1); n22=N(2,2); n23=N(2,3); n24=N(2,4); n25=N(2,5);
    x11=[1*ones(n11,1) 1*ones(n11,1)];
    x12=[1*ones(n12,1) 2*ones(n12,1)];
    x13=[1*ones(n13,1) 3*ones(n13,1)];
    x14=[1*ones(n14,1) 4*ones(n14,1)];
    x15=[1*ones(n15,1) 5*ones(n15,1)];
    x21=[2*ones(n21,1) 1*ones(n21,1)];
    x22=[2*ones(n22,1) 2*ones(n22,1)];
    x23=[2*ones(n23,1) 3*ones(n23,1)];
    x24=[2*ones(n24,1) 4*ones(n24,1)];
    x25=[2*ones(n25,1) 5*ones(n25,1)];
    % X original data matrix
    X=[x11; x12; x13; x14; x15; x21; x22; x23; x24; x25];
    out=corrNominal(X,'datamatrix',true)


## Input Arguments

### N — Contingency table (default) or n-by-2 input dataset. Matrix or Table.

Matrix or table which contains the input contingency table (say of size I-by-J) or the original data matrix.

In the latter case the contingency table is computed as N=crosstab(N(:,1),N(:,2)). By default the procedure assumes that the input is a contingency table.

Data Types: single | double
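
The relation between the two input forms can be checked with a short sketch. It expands the contingency table used in the last example into an n-by-2 data matrix and rebuilds the table with crosstab. The variable names (rowIdx, colIdx, Nback) are illustrative only, and repelem is assumed to be available (MATLAB R2015a or later).

```matlab
% Contingency table from the example that starts from the original data matrix.
N = [26 26 23 18  9;
      6  7  9 14 23];
[I, J] = size(N);

% All (row, column) index pairs, in the same column-major order as N(:).
[rowIdx, colIdx] = ndgrid(1:I, 1:J);

% Repeat each pair n_ij times to obtain the n-by-2 data matrix.
X = repelem([rowIdx(:) colIdx(:)], N(:), 1);

% crosstab (Statistics and Machine Learning Toolbox) rebuilds the table.
Nback = crosstab(X(:,1), X(:,2));
disp(isequal(Nback, N))
```

A matrix such as X is what corrNominal expects when it is called with 'datamatrix',true.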

### Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside single quotes (' '). You can specify several name and value pair arguments in any order as  Name1,Value1,...,NameN,ValueN.

Example:  'NoStandardErrors',true , 'dispresults',false , 'Lr',{'a' 'b' 'c'} , 'Lc',{'c1' 'c2' 'c3' 'c4'} , 'datamatrix',true , 'conflev',0.99 

### NoStandardErrors — Just indexes without standard errors and p-values. Boolean.

If NoStandardErrors is true, only the indexes are computed, without standard errors and p-values; that is, no inferential measure is given. The default value of NoStandardErrors is false.

Example:  'NoStandardErrors',true 

Data Types: Boolean

### dispresults — Display results on the screen. Boolean.

If dispresults is true (default), all the summary results of the analysis are displayed on the screen.

Example:  'dispresults',false 

Data Types: Boolean

### Lr — Vector of row labels. Cell.

Cell containing the labels of the rows of the input contingency matrix N. This option is unnecessary if N is a table, because in this case Lr=N.Properties.RowNames;

Example:  'Lr',{'a' 'b' 'c'} 

Data Types: cell array of strings

### Lc — Vector of column labels. Cell.

Cell containing the labels of the columns of the input contingency matrix N. This option is unnecessary if N is a table, because in this case Lc=N.Properties.VariableNames;

Example:  'Lc',{'c1' 'c2' 'c3' 'c4'} 

Data Types: cell array of strings
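
As a small illustration of why Lr and Lc are unnecessary for table input, the sketch below attaches the labels of the first example to N with array2table and reads them back from the table properties. The calls to corrNominal are left as comments because they need FSDA on the path; the variable name Ntab is illustrative.

```matlab
N  = [150  80  20  50
       80 250  30 140
       30  50   0 120];
Lr = {'Economics' 'Law' 'Literature'};
Lc = {'Private_firm' 'Public_firm' 'Freelance' 'Unemployed'};

% A table carries its own row and column labels.
Ntab = array2table(N, 'RowNames', Lr, 'VariableNames', Lc);
disp(Ntab.Properties.RowNames')
disp(Ntab.Properties.VariableNames)

% With FSDA installed, labels can be passed explicitly or through the table:
% out = corrNominal(N, 'Lr', Lr, 'Lc', Lc);
% out = corrNominal(Ntab);
```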

### datamatrix — Data matrix or contingency table. Boolean.

If datamatrix is true, the first input argument N is interpreted as a data matrix; if it is false, N is treated as a contingency table. The default value of datamatrix is false, that is, the procedure considers N to be a contingency table.

Example:  'datamatrix',true 

Data Types: logical

### conflev — Confidence level to be used to compute confidence intervals. Scalar.

The default value of conflev is 0.95, that is 95 per cent confidence intervals are computed for all the indexes (note that this option is ignored if NoStandardErrors=true).

Example:  'conflev',0.99 

Data Types: double
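
The limits printed for GKlambdayx, tauyx and Hyx in the examples are numerically consistent with a symmetric normal-approximation interval, value ± z·se. The sketch below reproduces the GKlambdayx limits of the first example under that assumption. It is inferred from the displayed output, not taken from the function's source, and it does not apply to CramerV, whose interval is based on the noncentral $\chi^2$ distribution.

```matlab
conflev = 0.95;          % confidence level
value   = 0.22581;       % GKlambdayx as printed in the first example
se      = 0.028383;      % its printed standard error

% Two-sided standard normal quantile, written with base MATLAB only.
z = sqrt(2) * erfinv(conflev);

confLimL = value - z * se;
confLimU = value + z * se;
fprintf('%.0f%% limits: [%.5f, %.5f]\n', 100 * conflev, confLimL, confLimU);
```

Raising conflev to 0.99 increases z from about 1.96 to about 2.58, which widens the interval.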

## Output Arguments

### out — Structure

Structure which contains the following fields:

Value Description
N

$I$-by-$J$ array containing the contingency table referred to the active rows (i.e. the rows which participated in the fit).

The $(i,j)$-th element is equal to $n_{ij}$, $i=1, 2, \ldots, I$ and $j=1, 2, \ldots, J$. The sum of the elements of out.N is $n$ (the grand total).

Ntable

Same as out.N but in table format (with row and column names).

This output is present only if your MATLAB version is R2013b or later.

Chi2

1-by-2 vector which contains $\chi^2$ index, and p-value.

Phi

1-by-2 vector which contains the $\Phi$ index and its p-value. Phi is a chi-square-based measure of association that involves dividing the chi-square statistic by the sample size and taking the square root of the result. More precisely $\Phi= \sqrt{ \frac{\chi^2}{n} }$. This index lies in the interval $[0 , \sqrt{\min[(I-1),(J-1)]}]$.

CramerV

1-by-4 vector which contains Cramer's V index, standard error, z test, and p-value. Cramer's V index is index $\Phi$ divided by its maximum. More precisely $V= \sqrt{\frac{\Phi^2}{\min[(I-1),(J-1)]}}=\sqrt{\frac{\chi^2}{n \min[(I-1),(J-1)]}}$

The range of Cramer's V is [0, 1]. A Cramer's V in the range [0, 0.3] is considered weak, [0.3, 0.7] medium and > 0.7 strong.

In order to compute the confidence interval for this index we first find a confidence interval $[\Delta_L, \Delta_U]$ for the noncentrality parameter $\Delta$ of the $\chi^2$ distribution with $df=(I-1)(J-1)$ degrees of freedom (see Smithson (2003), pp. 39-41). The confidence interval for $\Delta$ is then transformed into one for $V$ by the following transformation

$V_L=\sqrt{\frac{\Delta_L+ df }{n \min[(I-1),(J-1)]}}$ and $V_U=\sqrt{\frac{\Delta_U+ df }{n \min[(I-1),(J-1)]}}$
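
The construction can be sketched as follows. This is not the ncpci routine used by corrNominal. It is a stand-in that assumes the standard inversion of the noncentral $\chi^2$ distribution function with ncx2cdf and fzero (Statistics and Machine Learning Toolbox). The search bracket and all variable names are assumptions made for illustration.

```matlab
N = [150  80  20  50
      80 250  30 140
      30  50   0 120];
conflev = 0.95;
alpha   = 1 - conflev;

n = sum(N(:));
[I, J] = size(N);
E    = sum(N, 2) * sum(N, 1) / n;            % expected counts under independence
chi2 = sum((N(:) - E(:)).^2 ./ E(:));
df   = (I - 1) * (J - 1);
k    = min(I - 1, J - 1);

% ncx2cdf(chi2, df, Delta) decreases as Delta grows, so each limit is a root.
g = @(Delta, p) ncx2cdf(chi2, df, Delta) - p;
bracket = [1e-6, 4 * chi2 + 20];             % assumed wide enough for this table

if chi2cdf(chi2, df) > 1 - alpha / 2
    DeltaL = fzero(@(d) g(d, 1 - alpha / 2), bracket);
else
    DeltaL = 0;                              % observed chi2 too small: lower limit at 0
end
DeltaU = fzero(@(d) g(d, alpha / 2), bracket);

% Transform the limits for Delta into limits for V.
V  = sqrt(chi2 / (n * k));
VL = sqrt((DeltaL + df) / (n * k));
VU = sqrt((DeltaU + df) / (n * k));
fprintf('V = %.4f, limits [%.4f, %.4f]\n', V, VL, VU);
```

The result can be compared with the ConflimL and ConflimU values printed for CramerV in the first example.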

GKlambdayx

1-by-4 vector which contains index $\lambda_{y|x}$ of Goodman and Kruskal, standard error, z test, and p-value.

$\lambda_{y|x} = \frac{\sum_{i=1}^I r_i- r}{n-r}$, where $r_i =\max_j(n_{ij})$ and $r =\max_j(n_{.j})$.

tauyx

1-by-4 vector which contains tau index $\tau_{y|x}$, standard error, z test and p-value.

$\tau_{y|x}= \frac{\sum_{i=1}^I \sum_{j=1}^J f_{ij}^2/f_{i.} -\sum_{j=1}^J f_{.j}^2 }{1-\sum_{j=1}^J f_{.j}^2 }$

Hyx

1-by-4 vector which contains the uncertainty coefficient index (proposed by Theil) $H_{y|x}$, standard error, z test and p-value.

$H_{y|x}= \frac{\sum_{i=1}^I \sum_{j=1}^J f_{ij} \log( f_{ij}/ (f_{i.}f_{.j}))}{-\sum_{j=1}^J f_{.j} \log f_{.j} }$

TestInd

4-by-4 array containing index values (first column), standard errors (second column), zscores (third column), p-values (fourth column).

TestIndtable

4-by-4 table containing index values (first column), standard errors (second column), zscores (third column), p-values (fourth column).

This output is present only if your MATLAB version is R2013b or later.

ConfLim

4-by-4 array containing index values (first column), standard errors (second column), lower confidence limit (third column), upper confidence limit (fourth column).

ConfLimtable

4-by-4 table containing index values (first column), standard errors (second column), lower confidence limit (third column), upper confidence limit (fourth column).

This output is present only if your MATLAB version is R2013b or later.
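
As a check on the definitions of Chi2, Phi and CramerV given above, the following sketch recomputes the three indexes for the table of the first example directly from the formulas. It uses base MATLAB only and does not call corrNominal. The variable names are illustrative.

```matlab
N = [150  80  20  50
      80 250  30 140
      30  50   0 120];
n = sum(N(:));                        % grand total
[I, J] = size(N);

% Expected counts under independence: n_{i.} * n_{.j} / n.
E = sum(N, 2) * sum(N, 1) / n;

chi2 = sum((N(:) - E(:)).^2 ./ E(:)); % Pearson chi-square statistic
Phi  = sqrt(chi2 / n);
V    = Phi / sqrt(min(I - 1, J - 1)); % Phi divided by its maximum
fprintf('chi2 = %.4f, Phi = %.4f, V = %.4f\n', chi2, Phi, V);
```

The values agree with the Chi2 index, Phi index and Cramer's V printed in the first example (221.2405, 0.4704 and 0.3326).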

## More About

$\lambda_{y|x}$ is a measure of association that reflects the proportional reduction in error when values of the independent variable (variable in the rows of the contingency table) are used to predict values of the dependent variable (variable in the columns of the contingency table). The range of $\lambda_{y|x}$ is [0, 1]. A value of 1 means that the independent variable perfectly predicts the dependent variable. On the other hand, a value of 0 means that the independent variable does not help in predicting the dependent variable.
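
The proportional reduction in error can be made concrete with the table of the first example. In the sketch below (base MATLAB, illustrative variable names), predicting the modal column without knowing the row gives n - r errors, while predicting the modal column within each row gives n - sum(ri) errors.

```matlab
N = [150  80  20  50
      80 250  30 140
      30  50   0 120];
n = sum(N(:));

ri = max(N, [], 2);          % largest count in each row
r  = max(sum(N, 1));         % largest column total

errWithoutX = n - r;         % errors when the row is ignored
errWithX    = n - sum(ri);   % errors when the row is known
lambda = (errWithoutX - errWithX) / errWithoutX;
fprintf('lambda = %.5f\n', lambda);
```

Here the errors fall from 620 to 480, so lambda = 140/620, which matches the GKlambdayx value of 0.22581 printed in the first example.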

More generally, let $V(y)$ be a measure of variation for the marginal distribution $(f_{.1}=n_{.1}/n, \ldots, f_{.J}=n_{.J}/n)$ of the response $y$ and let $V(y|i)$ denote the same measure computed for the conditional distribution $(f_{1|i}=n_{i1}/n_{i.}, \ldots, f_{J|i}=n_{iJ}/n_{i.})$ of $y$ at the $i$-th setting of the explanatory variable $x$. A proportional reduction in variation measure has the form

$\frac{V(y) - E[V(y|x)]}{V(y)}$

where $E[V(y|x)]$ is the expectation of the conditional variation taken with respect to the distribution of $x$. When $x$ is a categorical variable having marginal distribution $(f_{1.}, \ldots, f_{I.})$,

$E[V(y|x)]= \sum_{i=1}^I (n_{i.}/n) V(y|i) = \sum_{i=1}^I f_{i.} V(y|i)$

If we take as measure of variation $V(y)$ the Gini index

$V(y)=1 -\sum_{j=1}^J f_{.j}^2 \qquad V(y|i)=1 -\sum_{j=1}^J f_{j|i}^2$

we obtain the index of proportional reduction in variation $\tau_{y|x}$ of Goodman and Kruskal.

$\tau_{y|x}= \frac{\sum_{i=1}^I \sum_{j=1}^J f_{ij}^2/f_{i.} -\sum_{j=1}^J f_{.j}^2 }{1-\sum_{j=1}^J f_{.j}^2 }$

If, on the other hand, we take as measure of variation $V(y)$ the entropy index

$V(y)=-\sum_{j=1}^J f_{.j} \log f_{.j} \qquad V(y|i) = -\sum_{j=1}^J f_{j|i} \log f_{j|i}$

we obtain the index $H_{y|x}$ (uncertainty coefficient of Theil).

$H_{y|x}= \frac{\sum_{i=1}^I \sum_{j=1}^J f_{ij} \log( f_{ij}/ (f_{i.}f_{.j}))}{-\sum_{j=1}^J f_{.j} \log f_{.j} }$

The range of $\tau_{y|x}$ and $H_{y|x}$ is [0, 1].
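
Both indexes follow from the same proportional reduction in variation, with only the variation measure changed. The sketch below applies this to the table of the first example with the Gini index (squared frequencies) and the entropy index. The helper names (gini, entr, plogp, prv) are illustrative and are not part of corrNominal. The term 0·log 0 is taken as 0.

```matlab
N = [150  80  20  50
      80 250  30 140
      30  50   0 120];
n  = sum(N(:));
fi = sum(N, 2) / n;                       % row marginals f_{i.} (I-by-1)
fj = sum(N, 1) / n;                       % column marginals f_{.j} (1-by-J)
Fcond = bsxfun(@rdivide, N, sum(N, 2));   % row i holds f_{j|i} = n_{ij}/n_{i.}

% Variation measures applied to each row of their argument.
gini  = @(p) 1 - sum(p.^2, 2);
plogp = @(p) p .* log(p + (p == 0));      % equals 0 where p is 0
entr  = @(p) -sum(plogp(p), 2);

% Proportional reduction in variation: (V(y) - E[V(y|x)]) / V(y).
prv = @(V) (V(fj) - fi' * V(Fcond)) / V(fj);

tau = prv(gini);
H   = prv(entr);
fprintf('tau = %.6f, H = %.5f\n', tau, H);
```

The two results match the tauyx and Hyx values printed in the first example (0.091674 and 0.08716).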

A large value of the index represents a strong association, in the sense that we can guess $y$ much better when we know $x$ than when we do not.

In other words, $\tau_{y|x}=H_{y|x} =1$ is equivalent to no conditional variation, in the sense that for each $i$ there is a category $j$ such that $f_{j|i}=1$. For example, a value of:

$\tau_{y|x}=0.85$ indicates that knowledge of x reduces error in predicting values of y by 85 per cent (when the variation measure which is used is the Gini's index).

$H_{y|x}=0.85$ indicates that knowledge of x reduces error in predicting values of y by 85 per cent (when the variation measure which is used is the entropy index).

## References

Agresti, A. (2002), "Categorical Data Analysis", John Wiley & Sons. [pp. 23-26]

Goodman, L.A. and Kruskal, W.H. (1959), Measures of association for cross classifications II: Further Discussion and References, "Journal of the American Statistical Association", Vol. 54, pp. 123-163.

Goodman, L.A. and Kruskal, W.H. (1963), Measures of association for cross classifications III: Approximate Sampling Theory, "Journal of the American Statistical Association", Vol. 58, pp. 310-364.

Goodman, L.A. and Kruskal, W.H. (1972), Measures of association for cross classifications IV: Simplification of Asymptotic Variances, "Journal of the American Statistical Association", Vol. 67, pp. 415-421.

Liebetrau, A.M. (1983), "Measures of Association", Sage University Papers Series on Quantitative Applications in the Social Sciences, 07-004, Newbury Park, CA: Sage. [pp. 49-56]

Smithson, M.J. (2003), "Confidence Intervals", Quantitative Applications in the Social Sciences Series, No. 140. Thousand Oaks, CA: Sage. [pp. 39-41]

## Acknowledgements

In order to find the confidence interval for the non centrality parameter of the Chi-squared distribution we use routine ncpci from the Effect Size Toolbox. [Code by Harald Hentschke (University of Tübingen) and Maik Stüttgen (University of Bochum)].