# corrOrdinal

corrOrdinal measures the strength of association between two ordered categorical variables.

## Syntax

• out=corrOrdinal(N)
• out=corrOrdinal(N,Name,Value)

## Description

corrOrdinal computes Goodman-Kruskal's $\gamma$, Kendall's $\tau_a$, $\tau_b$ and $\tau_c$, and Somers' $d_{y|x}$.

All these indexes measure the correlation between two ordered qualitative variables and range from -1 to 1. The sign of the coefficient indicates the direction of the relationship, and its absolute value indicates the strength: values close to 1 in absolute value indicate a strong relationship between the two variables, while values close to 0 indicate little or no relationship. In more detail:

$\gamma$ is a symmetric measure of association.

Kendall's $\tau_a$ is a symmetric measure of association that does not take ties into account. A tie occurs when the two observations of a pair have the same value on at least one of the two variables.

Kendall's $\tau_b$ is a symmetric measure of association which takes ties into account. Although $\tau_b$ ranges from -1 to 1, a value of -1 or +1 can be obtained only from square tables.

$\tau_c$ (also called Stuart-Kendall $\tau_c$) is a symmetric measure of association which makes an adjustment for table size in addition to a correction for ties. Although $\tau_c$ ranges from -1 to 1, a value of -1 or +1 can be obtained only from square tables.

Somers' $d$ is an asymmetric extension of $\tau_b$ in that it uses a correction only for pairs that are tied on the independent variable (which this implementation assumes to be on the rows of the contingency table).

out=corrOrdinal(N) runs corrOrdinal with all the default options.

out=corrOrdinal(N,Name,Value) specifies additional options as Name,Value pairs, for example to compare the calculation of tau-b with that which comes from the MATLAB function corr.

## Examples


### corrOrdinal with all the default options.

Rows of N indicate the results of a written test with levels 'Sufficient', 'Good' and 'Very good'. Columns of N indicate the results of an oral test with the same levels.

    N=[20    40    20;
       10    45    45;
        0     5    15];
    out=corrOrdinal(N);
    % Because the asymptotic 95 per cent confidence limits do not contain
    % zero, this indicates a statistically significant positive association
    % between the written and the oral examination.

    Test of H_0: independence between rows and columns
    The standard errors are computed under H_0
              Coeff        se       zscore       pval
             _______    ________    ______    __________

    gamma        0.5    0.098239    5.0896     3.588e-07
    taua     0.18342    0.047553    3.8571    0.00011474
    taub     0.30557    0.060038    5.0896     3.588e-07
    tauc     0.27375    0.053786    5.0896     3.588e-07
    dyx      0.31466    0.061823    5.0896     3.588e-07

    -----------------------------------------
    Indexes and 95% confidence limits
    The standard error are computed under H_1
              Value     StandardError    ConflimL    ConflimU
             _______    _____________    ________    ________

    gamma        0.5        0.0876       0.32831     0.67169
    taua     0.18342      0.011904       0.16009     0.20675
    taub     0.30557       0.05852       0.19087     0.42027
    tauc     0.27375      0.053786       0.16833     0.37917
    dyx      0.31466      0.059899       0.19726     0.43205



### Compare calculation of tau-b with that which comes from Matlab function corr.

    % Starting from a contingency table, create the original data matrix to
    % be able to call corr.
    N=[20    23    20;
       21    25    22;
       18    18    19];
    n11=N(1,1); n12=N(1,2); n13=N(1,3);
    n21=N(2,1); n22=N(2,2); n23=N(2,3);
    n31=N(3,1); n32=N(3,2); n33=N(3,3);
    x11=[1*ones(n11,1) 1*ones(n11,1)];
    x12=[1*ones(n12,1) 2*ones(n12,1)];
    x13=[1*ones(n13,1) 3*ones(n13,1)];
    x21=[2*ones(n21,1) 1*ones(n21,1)];
    x22=[2*ones(n22,1) 2*ones(n22,1)];
    x23=[2*ones(n23,1) 3*ones(n23,1)];
    x31=[3*ones(n31,1) 1*ones(n31,1)];
    x32=[3*ones(n32,1) 2*ones(n32,1)];
    x33=[3*ones(n33,1) 3*ones(n33,1)];
    % X original data matrix
    X=[x11; x12; x13; x21; x22; x23; x31; x32; x33];
    % Find taub and pvalue of taub using MATLAB routine corr
    [RHO,pval]=corr(X,'type','Kendall');
    % Compute tau-b using FSDA corrOrdinal routine.
    out=corrOrdinal(X,'datamatrix',true,'dispresults',false);
    disp(['tau-b from MATLAB routine corr=' num2str(RHO(1,2))])
    disp(['tau-b from FSDA routine corrOrdinal=' num2str(out.taub(1))])
    % Remark: the p-values are slightly different
    disp(['pvalue of H0:taub=0 from MATLAB routine corr=' num2str(pval(1,2))])
    disp(['pvalue of H0:taub=0 from FSDA routine corrOrdinal=' num2str(out.taub(4))])

    tau-b from MATLAB routine corr=0.0083449
    tau-b from FSDA routine corrOrdinal=0.0083449
    pvalue of H0:taub=0 from MATLAB routine corr=0.89952
    pvalue of H0:taub=0 from FSDA routine corrOrdinal=0.89914
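The cell-by-cell expansion used in this example can be written more compactly. The following is only a sketch, not part of FSDA: it assumes MATLAB R2015a or later (for `repelem`), and the names `r`, `c` and `Ncheck` are introduced here for illustration. The rows of `X` come out in a different order than in the example, which does not matter for rank-based measures such as Kendall's tau.

```matlab
% Contingency table of the example above.
N = [20 23 20;
     21 25 22;
     18 18 19];
[I, J] = size(N);
% Row and column index of every cell, in the same column-major order as N(:).
[r, c] = ndgrid(1:I, 1:J);
% Repeat each (row, column) pair N(i,j) times (repelem needs R2015a or later).
X = [repelem(r(:), N(:)), repelem(c(:), N(:))];
% Sanity check: cross-tabulating X must give back the original table.
Ncheck = accumarray(X, 1, [I J]);
disp(isequal(Ncheck, N))
```

The resulting `X` can then be passed to corr or to corrOrdinal with 'datamatrix',true exactly as in the example.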


### corrOrdinal with option conflev.

    N=[26 26 23 18  9;
        6  7  9 14 23];
    out=corrOrdinal(N,'conflev',0.999);


### corrOrdinal with option NoStandardErrors.

    N=[26 26 23 18  9;
        6  7  9 14 23];
    out=corrOrdinal(N,'NoStandardErrors',true);


### Income and job satisfaction.

Relationship between income (with levels '< 5000', '5000-25000' and '>25000') and job satisfaction (with levels 'Dissatisfied', 'Moderately satisfied' and 'Very satisfied') for a sample of 300 persons. The input data is the MATLAB table Ntable:

    N = [24 23 30;19 43 57;13 33 58];
    rownam={'Less_than_5000',  'Between_5000_and_25000' 'Greater_than_25000'};
    colnam= {'Dissatisfied' 'Moderately_satisfied' 'Very_satisfied'};
    if verLessThan('matlab','8.2.0') ==0
        Ntable=array2table(N,'RowNames',matlab.lang.makeValidName(rownam),'VariableNames',matlab.lang.makeValidName(colnam));
        %  Check relationship
        out=corrOrdinal(Ntable);
    else
        out=corrOrdinal(N);
    end


### Input is the contingency table in matrix format, labels for rows and columns are supplied.

    N=[20    40    20;
       10    45    45;
        0     5    15];
    % labels for rows and columns
    labels_rows= {'Sufficient' 'Good' 'Very_good'};
    labels_columns= {'Sufficient' 'Good' 'Very_good'};
    out=corrOrdinal(N,'Lr',labels_rows,'Lc',labels_columns,'dispresults',false);
    if verLessThan('matlab','8.2.0') ==0
        % out.Ntable uses labels for rows and columns which are supplied
        disp(out.Ntable)
    end


## Input Arguments

### N — Contingency table (default) or n-by-2 input dataset. Matrix or Table.

Matrix or table which contains the input contingency table (say of size I-by-J) or the original data matrix.

In the latter case the contingency table is obtained as N=crosstab(N(:,1),N(:,2)). By default, the procedure assumes that the input is a contingency table.

Data Types: single | double
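To make the conversion from an n-by-2 data matrix to a contingency table concrete, here is a minimal sketch that uses only base MATLAB functions (`unique` and `accumarray`) in place of crosstab. The data matrix `X` is made up for illustration, and the code is not taken from the FSDA source.

```matlab
% Hypothetical n-by-2 data matrix: one row per observation,
% column 1 = row variable, column 2 = column variable (ordered codes).
X = [1 1; 1 2; 2 2; 2 2; 3 1; 3 3; 3 3];
% Map the observed levels to consecutive indices 1..I and 1..J (sorted order).
[~, ~, ri] = unique(X(:,1));
[~, ~, ci] = unique(X(:,2));
% Count how many observations fall in each (row level, column level) cell.
N = accumarray([ri ci], 1);
disp(N)
```

Because `unique` returns the levels in sorted order, the rows and columns of `N` follow the natural ordering of the codes, which is what an ordinal measure requires.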

### Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside single quotes (' '). You can specify several name and value pair arguments in any order as  Name1,Value1,...,NameN,ValueN.

Example:  'NoStandardErrors',true , 'dispresults',false , 'Lr',{'a' 'b' 'c'} , 'Lc',{'c1' 'c2' 'c3' 'c4'} , 'datamatrix',true , 'conflev',0.99 

### NoStandardErrors — Just indexes without standard errors and p-values. Boolean.

If NoStandardErrors is true, only the indexes are computed, without standard errors and p-values; that is, no inferential measure is given. The default value of NoStandardErrors is false.

Example:  'NoStandardErrors',true 

Data Types: Boolean

### dispresults — Display results on the screen. Boolean.

If dispresults is true (default), all the summary results of the analysis are displayed on the screen.

Example:  'dispresults',false 

Data Types: Boolean

### Lr — Vector of row labels. Cell.

Cell containing the labels of the rows of the input contingency matrix N. This option is unnecessary if N is a table, because in this case Lr=N.Properties.RowNames.

Example:  'Lr',{'a' 'b' 'c'} 

Data Types: cell array of strings

### Lc — Vector of column labels. Cell.

Cell containing the labels of the columns of the input contingency matrix N. This option is unnecessary if N is a table, because in this case Lc=N.Properties.VariableNames.

Example:  'Lc',{'c1' 'c2' 'c3' 'c4'} 

Data Types: cell array of strings

### datamatrix — Data matrix or contingency table. Boolean.

If datamatrix is true, the first input argument N is interpreted as a data matrix; if it is false, N is treated as a contingency table. The default value of datamatrix is false, that is, the procedure automatically considers N as a contingency table.

Example:  'datamatrix',true 

Data Types: logical

### conflev — Confidence levels to be used to compute confidence intervals. Scalar.

The default value of conflev is 0.95, that is 95 per cent confidence intervals are computed for all the indexes (note that this option is ignored if NoStandardErrors=true).

Example:  'conflev',0.99 

Data Types: double
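As an illustration of how conflev enters the confidence limits, the sketch below assumes that the limits are the usual normal-approximation interval (value plus or minus a standard normal quantile times the standard error computed under H_1). This assumption is consistent with the limits printed for $\gamma$ in the first example, but the code is not taken from the FSDA source; it uses the base-MATLAB function `erfcinv` so that no toolbox is needed.

```matlab
conflev = 0.95;     % confidence level, as in the 'conflev' option
value   = 0.5;      % gamma from the first example
se      = 0.0876;   % its standard error under H_1 (rounded, from the example output)
% Two-sided standard normal quantile; equivalent to
% norminv(1-(1-conflev)/2) from the Statistics Toolbox.
z = sqrt(2) * erfcinv(1 - conflev);
ConflimL = value - z * se;
ConflimU = value + z * se;
fprintf('%.5f  %.5f\n', ConflimL, ConflimU)
```

Up to the rounding of the standard error, the two numbers agree with the limits 0.32831 and 0.67169 shown for gamma in the first example.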

## Output Arguments

### out — description Structure

Structure which contains the following fields:

Value Description
N

$I$-by-$J$ array containing the contingency table referred to active rows (i.e. referred to the rows which participated in the fit).

The $(i,j)$-th element is equal to $n_{ij}$, $i=1, 2, \ldots, I$ and $j=1, 2, \ldots, J$. The sum of the elements of out.N is $n$ (the grand total).

Ntable

Same as out.N but in table format (with row and column names).

This output is present only if your MATLAB version is R2013b or later.

gam

1 x 4 vector which contains Goodman and Kruskal's gamma index, standard error, test and p-value.

taua

1 x 4 vector which contains index $\tau_a$, standard error, test and p-value.

taub

1 x 4 vector which contains index $\tau_b$, standard error, test and p-value.

tauc

1 x 4 vector which contains index $\tau_c$, standard error, test and p-value.

som

1 x 4 vector which contains Somers index $d_{y|x}$, standard error, test and p-value.

TestInd

5-by-4 matrix containing index values (first column), standard errors (second column), zscores (third column), p-values (fourth column). Note that the standard errors in this matrix are computed assuming the null hypothesis of independence.

TestIndtable

5-by-4 table containing index values (first column), standard errors (second column), zscores (third column), p-values (fourth column). Note that the standard errors in this table are computed assuming the null hypothesis of independence.

ConfLim

5-by-4 matrix containing index values (first column), standard errors (second column), lower confidence limit (third column), upper confidence limit (fourth column).

Note that the standard errors in this matrix are computed not assuming the null hypothesis of independence.

ConfLimtable

5-by-4 table containing index values (first column), standard errors (second column), lower confidence limit (third column), upper confidence limit (fourth column).

Note that the standard errors in this table are computed not assuming the null hypothesis of independence.

All these indexes are based on concordant and discordant pairs.

A pair of observations is concordant if the subject who is higher on one variable also is higher on the other variable, and a pair of observations is discordant if the subject who is higher on one variable is lower on the other variable.

Let $C$ be the total number of concordant pairs (concordances) and $D$ the total number of discordant pairs (discordances). If $C > D$ the variables have a positive association, but if $C < D$ then the variables have a negative association.

In symbols, given an $I \times J$ contingency table, the number of observations concordant with an observation in cell $(i,j)$ is

$a_{ij} = \sum_{k<i} \sum_{l<j} n_{kl} + \sum_{k>i} \sum_{l>j} n_{kl}$

and the number of observations discordant with it is

$b_{ij} = \sum_{k>i} \sum_{l<j} n_{kl} + \sum_{k<i} \sum_{l>j} n_{kl}$

Twice the number of concordances, $C$ is given by:

$2 \times C = \sum_{i=1}^I \sum_{j=1}^J n_{ij} a_{ij}$

Twice the number of discordances, $D$ is given by:

$2 \times D = \sum_{i=1}^I \sum_{j=1}^J n_{ij} b_{ij}$
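The definitions of $a_{ij}$, $b_{ij}$, $C$ and $D$ can be checked on the table of the first example with a few lines of base MATLAB. This is a plain transcription of the formulas above, written for readability rather than speed, and is not the code used inside corrOrdinal; the variable names are chosen here for illustration.

```matlab
N = [20 40 20; 10 45 45; 0 5 15];   % table of the first example
[I, J] = size(N);
a = zeros(I, J);   % a(i,j): observations concordant with cell (i,j)
b = zeros(I, J);   % b(i,j): observations discordant with cell (i,j)
for i = 1:I
    for j = 1:J
        % upper-left block plus lower-right block
        a(i,j) = sum(sum(N(1:i-1, 1:j-1))) + sum(sum(N(i+1:I, j+1:J)));
        % lower-left block plus upper-right block
        b(i,j) = sum(sum(N(i+1:I, 1:j-1))) + sum(sum(N(1:i-1, j+1:J)));
    end
end
% Each pair is counted twice in the weighted sums, hence the division by 2.
C = sum(sum(N .* a)) / 2;
D = sum(sum(N .* b)) / 2;
fprintf('C = %d, D = %d\n', C, D)
```

For this table the sketch gives C = 5475 and D = 1825, so the association is positive.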

Goodman-Kruskal's $\gamma$ statistic is equal to the ratio:

$\gamma= \frac{C-D}{C+D}$

$\tau_a$ is equal to concordant minus discordant pairs, divided by a factor which takes into account the total number of pairs.

$\tau_a= \frac{C-D}{0.5 n(n-1)}$

$\tau_b$ is equal to concordant minus discordant pairs divided by a term representing the geometric mean between the number of pairs not tied on x and the number not tied on y.

More precisely:

$\tau_b= \frac{C-D}{\sqrt{ (0.5 n(n-1)-T_x)(0.5 n(n-1)-T_y)}}$

where $T_x= \sum_{i=1}^I 0.5 n_{i.}(n_{i.}-1)$ and $T_y=\sum_{j=1}^J 0.5 n_{.j}(n_{.j}-1)$. Note that $|\tau_b| \leq |\gamma|$.

$\tau_c$ is equal to concordant minus discordant pairs multiplied by a factor that adjusts for table size.

$\tau_c= \frac{C-D}{ n^2(m-1)/(2m)}$

where $m= \min(I,J)$.

Somers' $d_{y|x}$ is an asymmetric extension of $\gamma$ that differs only in the inclusion of the number of pairs not tied on the independent variable. More precisely

$d_{y|x} = \frac{C-D}{0.5 n(n-1)-T_x}$
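As a worked check, the five formulas above can be evaluated on the table of the first example. The sketch below uses only base MATLAB and hypothetical variable names (`npairs`, `Tx`, `Ty`); it follows the formulas in the text and is not the FSDA implementation.

```matlab
N = [20 40 20; 10 45 45; 0 5 15];   % table of the first example
[I, J] = size(N);
n = sum(N(:));
C = 0; D = 0;
for i = 1:I
    for j = 1:J
        C = C + N(i,j) * sum(sum(N(i+1:I, j+1:J)));   % concordant pairs
        D = D + N(i,j) * sum(sum(N(i+1:I, 1:j-1)));   % discordant pairs
    end
end
npairs = 0.5 * n * (n - 1);          % total number of pairs
ni = sum(N, 2);  nj = sum(N, 1);     % row and column totals
Tx = sum(0.5 * ni .* (ni - 1));      % pairs tied on the row variable
Ty = sum(0.5 * nj .* (nj - 1));      % pairs tied on the column variable
m  = min(I, J);
gam  = (C - D) / (C + D);
taua = (C - D) / npairs;
taub = (C - D) / sqrt((npairs - Tx) * (npairs - Ty));
tauc = (C - D) / (n^2 * (m - 1) / (2 * m));
dyx  = (C - D) / (npairs - Tx);
fprintf('%.5f %.5f %.5f %.5f %.5f\n', gam, taua, taub, tauc, dyx)
```

The values agree with the first column of the output shown in the first example (0.5, 0.18342, 0.30557, 0.27375 and 0.31466).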

Null hypothesis: the corresponding index is equal to 0.

Alternative hypothesis (one-sided): index < 0 or index > 0.
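For readers who want to relate a z-score to a p-value, the sketch below converts the z-scores of $\gamma$ and $\tau_a$ from the first example into normal tail probabilities with the base-MATLAB function `erfc`. It is an illustration under the assumption of an asymptotically normal test statistic; the p-values printed in the first example appear to correspond to the two-sided probability, and halving them gives the one-sided one.

```matlab
z = [5.0896 3.8571];                  % z-scores of gamma and tau_a (first example)
pTwoSided = erfc(abs(z) / sqrt(2));   % 2*(1 - Phi(|z|)) for a standard normal
pOneSided = pTwoSided / 2;            % upper-tail probability 1 - Phi(|z|)
disp([pTwoSided; pOneSided])
```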

In order to compute confidence intervals and test hypotheses, this routine computes the standard error of the various indexes.

Note that the expression of the standard errors which is used to compute the confidence intervals is different from the expression which is used to test the null hypothesis of no association (no relationship or independence) between the two variables.

For Goodman-Kruskal's $\gamma$ index we have that:

$var(\gamma) = \frac{4}{(C + D)^4} \sum_{i=1}^I \sum_{j=1}^J n_{ij} (D a_{ij} - C b_{ij} )^2$

In the formulas that follow, $d_{ij}=a_{ij}- b_{ij}$.

The variance of $\gamma$ assuming the independence hypothesis is:

$var_0(\gamma) =\frac{1}{(C + D)^2} \left( \sum_{i=1}^I \sum_{j=1}^J n_{ij} d_{ij}^2 -4(C-D)^2/n \right)$
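Both variance formulas for $\gamma$ can be verified numerically on the table of the first example. The following sketch transcribes them in base MATLAB, with `varH1` and `varH0` as illustrative names; it is not the FSDA source code.

```matlab
N = [20 40 20; 10 45 45; 0 5 15];   % table of the first example
[I, J] = size(N);  n = sum(N(:));
a = zeros(I, J); b = zeros(I, J);
for i = 1:I
    for j = 1:J
        a(i,j) = sum(sum(N(1:i-1, 1:j-1))) + sum(sum(N(i+1:I, j+1:J)));
        b(i,j) = sum(sum(N(i+1:I, 1:j-1))) + sum(sum(N(1:i-1, j+1:J)));
    end
end
d = a - b;
C = sum(sum(N .* a)) / 2;
D = sum(sum(N .* b)) / 2;
% Variance used for the confidence interval (no independence assumed).
varH1 = 4 / (C + D)^4 * sum(sum(N .* (D * a - C * b).^2));
% Variance under the null hypothesis of independence.
varH0 = 1 / (C + D)^2 * (sum(sum(N .* d.^2)) - 4 * (C - D)^2 / n);
fprintf('se under H_1: %.4f, se under H_0: %.6f\n', sqrt(varH1), sqrt(varH0))
```

The square roots match the standard errors reported for gamma in the first example, 0.0876 under H_1 and 0.098239 under H_0.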

As concerns $\tau_a$ we have that:

$var(\tau_a)= \frac{2}{n(n-1)} \left\{ \frac{2(n-2)}{n(n-1)^2} \sum_{i=1}^I \sum_{j=1}^J (d_{ij} - \overline d)^2 + 1 - \tau_a^2 \right\}$

where $\overline d = \sum_{i=1}^I \sum_{j=1}^J d_{ij} /n$ and both double sums are taken only over the cells $(i,j)$ such that $n_{ij}>0$.

The variance of $\tau_a$ assuming the independence hypothesis is:

$var_0(\tau_a) =\frac{2 (2n+5)}{9n(n-1) }$
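The two expressions for $\tau_a$ can be evaluated on the table of the first example as follows. The sketch reads the formula for $var(\tau_a)$ literally, that is with unweighted sums restricted to the cells with a positive count; this reading is an assumption, chosen because it reproduces the printed standard error. Variable names such as `nz` and `dbar` are illustrative.

```matlab
N = [20 40 20; 10 45 45; 0 5 15];   % table of the first example
[I, J] = size(N);  n = sum(N(:));
a = zeros(I, J); b = zeros(I, J);
for i = 1:I
    for j = 1:J
        a(i,j) = sum(sum(N(1:i-1, 1:j-1))) + sum(sum(N(i+1:I, j+1:J)));
        b(i,j) = sum(sum(N(i+1:I, 1:j-1))) + sum(sum(N(1:i-1, j+1:J)));
    end
end
d = a - b;
CminusD = sum(sum(N .* d)) / 2;
taua = CminusD / (0.5 * n * (n - 1));
nz = N > 0;                      % only cells with a positive count enter the sums
dbar = sum(d(nz)) / n;
varH1 = 2 / (n * (n - 1)) * ...
    (2 * (n - 2) / (n * (n - 1)^2) * sum((d(nz) - dbar).^2) + 1 - taua^2);
varH0 = 2 * (2 * n + 5) / (9 * n * (n - 1));   % depends on n only
fprintf('se under H_1: %.6f, se under H_0: %.6f\n', sqrt(varH1), sqrt(varH0))
```

The results agree with the standard errors printed for taua in the first example (0.011904 and 0.047553).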

As concerns $\tau_b$ we have that:

$var(\tau_b)= \frac{n}{w^4} \left\{ n \sum_{i=1}^I \sum_{j=1}^J n_{ij} \tau_{ij}^2 - \left( \sum_{i=1}^I \sum_{j=1}^J n_{ij}\tau_{ij}\right)^2 \right\}$

where

$\tau_{ij} = 2n d_{ij} +2(C-D) n_{.j} w /n^3+2(C-D) (n_{i.}/n) \sqrt{ w_c/w_r} \qquad \mbox{and} \qquad w= \sqrt{w_rw_c}$

with $w_r= n^2- \sum_{i=1}^I n_{i.}^2$ and $w_c= n^2- \sum_{j=1}^J n_{.j}^2$.

The variance of $\tau_b$ assuming the independence hypothesis is:

$var_0(\tau_b) =\frac{4}{w_r w_c} \left\{ \sum_{i=1}^I \sum_{j=1}^J n_{ij} d_{ij} ^2 -4(C-D)^2/n \right\}$

As concerns Stuart's $\tau_c$ we have that:

$var(\tau_c)= \frac{4m^2}{(m-1)^2 n^4} \left\{ \sum_{i=1}^I \sum_{j=1}^J n_{ij} d_{ij} ^2 -4(C-D)^2/n \right\}$

The variance of $\tau_c$ assuming the independence hypothesis is:

$var_0(\tau_c) =var(\tau_c)$

As concerns $d_{y|x}$ we have that:

$var( d_{y|x})= \frac{4}{w_r^4} \sum_{i=1}^I \sum_{j=1}^J n_{ij} \left( w_r d_{ij} -2(C-D) (n-n_{i.}) \right)^2$

where $w_r= n^2- \sum_{i=1}^I n_{i.}^2$

The variance of $d_{y|x}$ assuming the independence hypothesis is:

$var_0(d_{y|x}) = \frac{4}{w_r^2} \left\{ \sum_{i=1}^I \sum_{j=1}^J n_{ij} d_{ij} ^2 -4(C-D)^2/n \right\}$
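The null variances of $\tau_b$, $\tau_c$ and $d_{y|x}$ share the same bracketed term, which the sketch below computes once (as `S`) for the table of the first example. The quantity $w_c$ is taken to be the column analogue of $w_r$, that is $n^2-\sum_{j} n_{.j}^2$; this is an assumption that reproduces the printed standard error of $\tau_b$. The code also evaluates $var(d_{y|x})$ with the square applied to each term of the double sum. It is an illustration in base MATLAB, not the FSDA source.

```matlab
N = [20 40 20; 10 45 45; 0 5 15];   % table of the first example
[I, J] = size(N);  n = sum(N(:));
a = zeros(I, J); b = zeros(I, J);
for i = 1:I
    for j = 1:J
        a(i,j) = sum(sum(N(1:i-1, 1:j-1))) + sum(sum(N(i+1:I, j+1:J)));
        b(i,j) = sum(sum(N(i+1:I, 1:j-1))) + sum(sum(N(1:i-1, j+1:J)));
    end
end
d = a - b;
CminusD = sum(sum(N .* d)) / 2;
ni = sum(N, 2);                     % row totals n_{i.} (I-by-1)
wr = n^2 - sum(ni.^2);
wc = n^2 - sum(sum(N, 1).^2);       % assumed column analogue of w_r
m  = min(I, J);
S  = sum(sum(N .* d.^2)) - 4 * CminusD^2 / n;   % term shared by the null variances
se0_taub = sqrt(4 / (wr * wc) * S);
se0_tauc = sqrt(4 * m^2 / ((m - 1)^2 * n^4) * S);
se0_dyx  = sqrt(4 / wr^2 * S);
% Variance of d_{y|x} without assuming independence;
% n - n_{i.} is constant along each row of the table.
R = repmat(n - ni, 1, J);
se1_dyx = sqrt(4 / wr^4 * sum(sum(N .* (wr * d - 2 * CminusD * R).^2)));
fprintf('%.6f %.6f %.6f %.6f\n', se0_taub, se0_tauc, se0_dyx, se1_dyx)
```

The four numbers agree with the standard errors of the first example: 0.060038, 0.053786 and 0.061823 under H_0, and 0.059899 for dyx under H_1.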

From the theoretical point of view, Simon (1978) showed that all sample measures having the same numerator $(C-D)$ have the same efficacy and hence the same local power, for testing independence.

## References

Agresti, A. (2002), "Categorical Data Analysis", John Wiley & Sons. [pp. 57-59]

Agresti, A. (2010), "Analysis of Ordinal Categorical Data", Second Edition, Wiley, New York, pp. 194-195.

Hollander, M., Wolfe, D.A., Chicken, E. (2014), "Nonparametric Statistical Methods", Third edition, Wiley.

Goktas, A. and Oznur, I. (2011), A comparison of the most commonly used measures of association for doubly ordered square contingency tables via simulation, "Metodoloski zvezki", Vol. 8, pp. 17-37, [available at: www.stat-d.si/mz/mz8.1/goktas.pdf]

Goodman, L.A. and Kruskal, W.H. (1954), Measures of association for cross classifications, "Journal of the American Statistical Association", Vol. 49, pp. 732-764.

Goodman, L.A. and Kruskal, W.H. (1959), Measures of association for cross classifications II: Further Discussion and References, "Journal of the American Statistical Association", Vol. 54, pp. 123-163.

Goodman, L.A. and Kruskal, W.H. (1963), Measures of association for cross classifications III: Approximate Sampling Theory, "Journal of the American Statistical Association", Vol. 58, pp. 310-364.

Goodman, L.A. and Kruskal, W.H. (1972), Measures of association for cross classifications IV: Simplification of Asymptotic Variances, "Journal of the American Statistical Association", Vol. 67, pp. 415-421.

Liebetrau, A.M. (1983), "Measures of Association", Sage University Papers Series on Quantitative Applications in the Social Sciences, 07-004, Newbury Park, CA: Sage. [pp. 49-56]

Brown, M.B. and Benedetti, J.K. (1977), Sampling Behavior of Tests for Correlation in Two-Way Contingency Tables, "Journal of the American Statistical Association", Vol. 72, pp. 309-315.

Simon, G. (1978), Alternative analysis for the singly ordered contingency table, "Journal of the American Statistical Association", Vol. 69, pp. 971-976.

## Acknowledgements

This file was inspired by Trujillo-Ortiz, A. and R. Hernandez-Walls.

gkgammatst: Goodman-Kruskal's gamma test. URL address http://www.mathworks.com/matlabcentral/fileexchange/42645-gkgammatst