corrNominal

corrNominal measures the strength of association between two unordered (nominal) categorical variables.

Syntax

Description

corrNominal computes $\chi^2$, $\Phi$, Cramer's $V$, Goodman-Kruskal's $\lambda_{y|x}$, Goodman-Kruskal's $\tau_{y|x}$, and Theil's $H_{y|x}$ (uncertainty coefficient).

All these indexes measure the association between two unordered qualitative variables.

Additional details about these indexes can be found in the "More About" section or in the "Output Arguments" section of this document.

out = corrNominal(N) corrNominal with all the default options.

out = corrNominal(N, Name, Value) Example of option conflev.

Examples

  • corrNominal with all the default options.
  • Rows of N indicate the type of Bachelor degree ('Economics', 'Law', 'Literature'); columns of N indicate the employment type ('Private_firm', 'Public_firm', 'Freelance', 'Unemployed').

        N=[150	80	20	50
            80	250	30	140
            30	50	0	120];
        out=corrNominal(N);
    
    Chi2 index
      221.2405
    
    Phi index
        0.4704
    
    Cramer's V 
        0.3326
    
    Test of H_0: independence between rows and columns
                       Coeff         se       zscore       pval   
                      ________    ________    ______    __________
    
        CramerV         0.3326    0.021769    15.278             0
        GKlambdayx     0.22581    0.028383    7.9556    1.7764e-15
        tauyx         0.091674    0.013524    6.7788    1.2121e-11
        Hyx            0.08716    0.011265    7.7374    1.0214e-14
    
    -----------------------------------------
    Indexes and 95% confidence limits
                       Value      StandardError    ConflimL    ConflimU
                      ________    _____________    ________    ________
    
        CramerV         0.3326      0.021769        0.28993    0.37687 
        GKlambdayx     0.22581      0.028383        0.17018    0.28144 
        tauyx         0.091674      0.013524       0.065168    0.11818 
        Hyx            0.08716      0.011265       0.065082    0.10924 
    
    

  • Example of option conflev.
  • Use data from Goodman and Kruskal (1954).

        N=[1768   807    189 47
           946   1387    746 53
           115    438    288 16];
        out=corrNominal(N,'conflev',0.99);
    
    Chi2 index
       1.0735e+03
    
    Phi index
        0.3973
    
    Cramer's V 
        0.2810
    
    Test of H_0: independence between rows and columns
                       Coeff         se        zscore    pval
                      ________    _________    ______    ____
    
        CramerV        0.28095    0.0085086     33.02     0  
        GKlambdayx     0.19239     0.012158    15.825     0  
        tauyx         0.080883    0.0046282    17.476     0  
        Hyx           0.075341    0.0041619    18.102     0  
    
    -----------------------------------------
    Indexes and 99% confidence limits
                       Value      StandardError    ConflimL    ConflimU
                      ________    _____________    ________    ________
    
        CramerV        0.28095      0.0085086       0.25904     0.30314
        GKlambdayx     0.19239       0.012158       0.16108     0.22371
        tauyx         0.080883      0.0046282      0.068962    0.092805
        Hyx           0.075341      0.0041619      0.064621    0.086061
    
    

    Related Examples

  • corrNominal with option dispresults.
        N=[ 6 14 17 9;
           30 32 17 3];
        out=corrNominal(N,'dispresults',false);
    

  • Example which starts from the original data matrix.
        N=[26 26 23 18 9;
            6  7  9 14 23];
        % From the contingency table reconstruct the original data matrix.
        n11=N(1,1); n12=N(1,2); n13=N(1,3); n14=N(1,4); n15=N(1,5);
        n21=N(2,1); n22=N(2,2); n23=N(2,3); n24=N(2,4); n25=N(2,5);
        x11=[1*ones(n11,1) 1*ones(n11,1)];
        x12=[1*ones(n12,1) 2*ones(n12,1)];
        x13=[1*ones(n13,1) 3*ones(n13,1)];
        x14=[1*ones(n14,1) 4*ones(n14,1)];
        x15=[1*ones(n15,1) 5*ones(n15,1)];
        x21=[2*ones(n21,1) 1*ones(n21,1)];
        x22=[2*ones(n22,1) 2*ones(n22,1)];
        x23=[2*ones(n23,1) 3*ones(n23,1)];
        x24=[2*ones(n24,1) 4*ones(n24,1)];
        x25=[2*ones(n25,1) 5*ones(n25,1)];
        % X original data matrix
        X=[x11; x12; x13; x14; x15; x21; x22; x23; x24; x25];
        out=corrNominal(X,'datamatrix',true)
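
The cell-by-cell reconstruction above can also be written compactly. The following is a minimal sketch, assuming a MATLAB release that provides repelem (R2015a or later); the names rowIdx, colIdx and Ncheck are illustrative. The rows of X come out in a different order than in the example above, but the contingency table they generate is the same.

```matlab
% Contingency table from the example above
N=[26 26 23 18 9;
    6  7  9 14 23];
[I,J]=size(N);
% Row and column index of every cell, listed in column-major order
[rowIdx,colIdx]=ndgrid(1:I,1:J);
% Repeat each (row, column) pair as many times as the cell count
X=[repelem(rowIdx(:),N(:)) repelem(colIdx(:),N(:))];
% Sanity check with base MATLAB: cross-tabulating X gives back N
Ncheck=accumarray(X,1,[I J]);
disp(isequal(Ncheck,N))
```

X can then be passed to corrNominal with 'datamatrix',true exactly as in the example above.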
    

    Input Arguments

    N — Contingency table (default) or n-by-2 input dataset. Matrix or Table.

    Matrix or table which contains the input contingency table (say of size I-by-J) or the original data matrix.

    In the latter case the contingency table is computed as N=crosstab(N(:,1),N(:,2)). By default the procedure assumes that the input is a contingency table.

    Data Types: single | double

    Name-Value Pair Arguments

    Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside single quotes (' '). You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

    Example: 'NoStandardErrors',true, 'dispresults',false, 'Lr',{'a' 'b' 'c'}, 'Lc',{'c1' 'c2' 'c3' 'c4'}, 'datamatrix',true, 'conflev',0.99

    NoStandardErrors — Just indexes without standard errors and p-values. Boolean.

    If NoStandardErrors is true, only the indexes are computed, without standard errors and p-values; that is, no inferential measure is given. The default value of NoStandardErrors is false.

    Example: 'NoStandardErrors',true

    Data Types: Boolean

    dispresults — Display results on the screen. Boolean.

    If dispresults is true (default), all the summary results of the analysis are displayed on the screen.

    Example: 'dispresults',false

    Data Types: Boolean

    Lr — Vector of row labels. Cell.

    Cell containing the labels of the rows of the input contingency matrix N. This option is unnecessary if N is a table, because in this case Lr=N.Properties.RowNames.

    Example: 'Lr',{'a' 'b' 'c'}

    Data Types: cell array of strings

    Lc — Vector of column labels. Cell.

    Cell containing the labels of the columns of the input contingency matrix N. This option is unnecessary if N is a table, because in this case Lc=N.Properties.VariableNames.

    Example: 'Lc',{'c1' 'c2' 'c3' 'c4'}

    Data Types: cell array of strings
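
The two ways of supplying labels described for Lr and Lc can be sketched as follows. This is an illustration only: the labels are those of the first example, Ntable, out1 and out2 are illustrative names, and the corrNominal calls are guarded so that the snippet also runs when the routine is not on the MATLAB path.

```matlab
N=[150  80 20  50
    80 250 30 140
    30  50  0 120];
Lr={'Economics' 'Law' 'Literature'};
Lc={'Private_firm' 'Public_firm' 'Freelance' 'Unemployed'};
% Table input: the labels are stored in the table properties
Ntable=array2table(N,'RowNames',Lr,'VariableNames',Lc);
disp(Ntable)
% The calls below need corrNominal on the MATLAB path
if exist('corrNominal','file')==2
    % Matrix input with explicit labels
    out1=corrNominal(N,'Lr',Lr,'Lc',Lc,'dispresults',false);
    % Table input: Lr and Lc are taken from the table
    out2=corrNominal(Ntable,'dispresults',false);
end
```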

    datamatrix — Data matrix or contingency table. Boolean.

    If datamatrix is true, the first input argument N is interpreted as a data matrix; if it is false, N is treated as a contingency table. The default value of datamatrix is false, that is, the procedure considers N a contingency table.

    Example: 'datamatrix',true

    Data Types: logical

    conflev — Confidence level to be used to compute confidence intervals. Scalar.

    The default value of conflev is 0.95, that is, 95 per cent confidence intervals are computed for all the indexes (note that this option is ignored if NoStandardErrors=true).

    Example: 'conflev',0.99

    Data Types: double
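
For GKlambdayx, tauyx and Hyx, the limits printed in the examples are numerically consistent with a normal-approximation interval of the form value ± z·se. This is an observation drawn from the printed tables, not a statement about the internals of the routine. A minimal sketch using the values and standard errors of the first example (zcrit and ConfLimCheck are illustrative names; the limits for CramerV are obtained differently, as described in the Output Arguments section):

```matlab
conflev=0.95;
% Values and standard errors of GKlambdayx, tauyx, Hyx (first example)
val=[0.22581; 0.091674; 0.08716];
se =[0.028383; 0.013524; 0.011265];
% Standard normal quantile computed with base MATLAB only
zcrit=-sqrt(2)*erfcinv(2*(1-(1-conflev)/2));
% Lower and upper limits, one row per index
ConfLimCheck=[val-zcrit*se val+zcrit*se];
disp(ConfLimCheck)
```

The rows of ConfLimCheck can be compared with the ConflimL and ConflimU columns printed in the first example.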

    Output Arguments

    out — Structure

    Structure which contains the following fields:

    Value Description
    N

    $I$-by-$J$ array containing the contingency table referred to active rows (i.e. referred to the rows which participated in the fit).

    The $(i,j)$-th element is equal to $n_{ij}$, $i=1, 2, \ldots, I$ and $j=1, 2, \ldots, J$. The sum of the elements of out.N is $n$ (the grand total).

    Ntable

    Same as out.N but in table format (with row and column names).

    This output is present only if your MATLAB version is 2013b or later.

    Chi2

    1-by-2 vector which contains $\chi^2$ index, and p-value.

    Phi

    1-by-2 vector which contains the $\Phi$ index and its p-value. Phi is a chi-square-based measure of association that involves dividing the chi-square statistic by the sample size and taking the square root of the result. More precisely \[ \Phi= \sqrt{ \frac{\chi^2}{n} } \] This index lies in the interval $[0 , \sqrt{\min[(I-1),(J-1)]}]$.
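
As a worked illustration of the two definitions above, the following sketch computes $\chi^2$ and $\Phi$ directly from the contingency table of the first example. It uses base MATLAB only; the p-value is taken from the upper tail of the $\chi^2$ distribution through gammainc, which is an implementation choice of this sketch and not necessarily what corrNominal does internally.

```matlab
N=[150  80 20  50
    80 250 30 140
    30  50  0 120];
n=sum(N(:));
% Expected counts under independence: (row total * column total)/n
E=sum(N,2)*sum(N,1)/n;
chi2=sum((N(:)-E(:)).^2./E(:));
df=(size(N,1)-1)*(size(N,2)-1);
% Upper tail probability of a chi-square variable with df degrees of freedom
pval=gammainc(chi2/2,df/2,'upper');
Phi=sqrt(chi2/n);
fprintf('chi2=%.4f  Phi=%.4f  pval=%.3g\n',chi2,Phi,pval)
```

The values of chi2 and Phi can be compared with the "Chi2 index" and "Phi index" lines printed in the first example.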

    CramerV

    1 x 4 vector which contains Cramer's V index, standard error, z test, and p-value. Cramer's V index is index $\Phi$ divided by its maximum. More precisely \[ V= \sqrt{\frac{\Phi^2}{\min[(I-1),(J-1)]}}=\sqrt{\frac{\chi^2}{n \min[(I-1),(J-1)]}} \]

    The range of Cramer's V is [0, 1]. A Cramer's V in the range [0, 0.3] is considered weak, in [0.3, 0.7] medium, and > 0.7 strong.
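
A short sketch of the relation between $\Phi$ and $V$, using the $\chi^2$ value printed in the first example (a 3-by-4 table with $n=1000$). The verbal label follows the thresholds quoted above; since the quoted intervals share the boundary 0.3, assigning that boundary to "medium" is an assumption of this sketch.

```matlab
chi2=221.2405;          % Chi2 index printed in the first example
n=1000;                 % grand total of that table
I=3; J=4;               % number of rows and columns
Phi=sqrt(chi2/n);
PhiMax=sqrt(min(I-1,J-1));   % maximum attainable value of Phi
V=Phi/PhiMax;
% Verbal label based on the thresholds quoted in the text
if V>0.7
    label='strong';
elseif V>=0.3
    label='medium';
else
    label='weak';
end
fprintf('V=%.4f (%s)\n',V,label)
```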

    In order to compute the confidence interval for this index we first find a confidence interval $[\Delta_L, \Delta_U]$ for the noncentrality parameter $\Delta$ of the $\chi^2$ distribution with $df=(I-1)(J-1)$ degrees of freedom (see Smithson (2003), pp. 39-41). The confidence interval for $\Delta$ is then transformed into one for $V$ by the following transformation

    \[ V_L=\sqrt{\frac{\Delta_L+ df }{n \min[(I-1),(J-1)]}} \] and \[ V_U=\sqrt{\frac{\Delta_U+ df }{n \min[(I-1),(J-1)]}} \]
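
The two steps above can be sketched as follows. This is not the ncpci routine used by corrNominal: it is a minimal stand-in that assumes the Statistics and Machine Learning Toolbox (for ncx2cdf) and finds each limit of $\Delta$ with fzero. The search brackets assume an association strong enough for $\Delta_L$ to be positive, as is the case for the first example, whose $\chi^2$ value is used here.

```matlab
chi2=221.2405; n=1000; I=3; J=4;   % values from the first example
conflev=0.95;
df=(I-1)*(J-1);
alpha=1-conflev;
% Delta_L: noncentrality for which chi2 is the (1-alpha/2) quantile
DeltaL=fzero(@(d) ncx2cdf(chi2,df,d)-(1-alpha/2),[0 chi2]);
% Delta_U: noncentrality for which chi2 is the (alpha/2) quantile
DeltaU=fzero(@(d) ncx2cdf(chi2,df,d)-alpha/2,[chi2 4*chi2+100]);
% Transform the limits for Delta into limits for V
k=min(I-1,J-1);
VL=sqrt((DeltaL+df)/(n*k));
VU=sqrt((DeltaU+df)/(n*k));
fprintf('V limits: [%.4f, %.4f]\n',VL,VU)
```

The resulting limits can be compared with the CramerV row of the "Indexes and 95% confidence limits" table in the first example.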

    GKlambdayx

    1 x 4 vector which contains index $\lambda_{y|x}$ of Goodman and Kruskal, standard error, z test, and p-value.

    \[ \lambda_{y|x} = \frac{\sum_{i=1}^I r_i- r}{n-r} \] \[ r_i =\max_j(n_{ij}) \] \[ r =\max_j(n_{.j}) \]
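
As a worked illustration, $\lambda_{y|x}$ is the sum of the row maxima minus the largest column total, divided by $n$ minus the largest column total. A minimal sketch on the table of the first example (ri, r and lambdayx are illustrative names):

```matlab
N=[150  80 20  50
    80 250 30 140
    30  50  0 120];
n=sum(N(:));
ri=max(N,[],2);        % largest count in each row
r=max(sum(N,1));       % largest column total
lambdayx=(sum(ri)-r)/(n-r);
fprintf('lambda(y|x)=%.5f\n',lambdayx)
```

Here the row maxima are 150, 250 and 120, the largest column total is 380, and the index is 140/620, in agreement with the GKlambdayx value printed in the first example.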

    tauyx

    1 x 4 vector which contains tau index $\tau_{y|x}$, standard error, z test, and p-value.

    \[ \tau_{y|x}= \frac{\sum_{i=1}^I \sum_{j=1}^J f_{ij}^2/f_{i.} -\sum_{j=1}^J f_{.j}^2 }{1-\sum_{j=1}^J f_{.j}^2 } \]
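
A direct transcription of the formula for $\tau_{y|x}$, applied to the table of the first example. This is a sketch with illustrative variable names; it assumes that no row total is zero.

```matlab
N=[150  80 20  50
    80 250 30 140
    30  50  0 120];
F=N/sum(N(:));         % relative frequencies f_ij
fi=sum(F,2);           % row marginals f_i.
fj=sum(F,1);           % column marginals f_.j
% sum_i sum_j f_ij^2/f_i.
term1=sum(sum(F.^2,2)./fi);
% sum_j f_.j^2
term2=sum(fj.^2);
tauyx=(term1-term2)/(1-term2);
fprintf('tau(y|x)=%.6f\n',tauyx)
```

The result can be compared with the tauyx value printed in the first example.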

    Hyx

    1 x 4 vector which contains the uncertainty coefficient index (proposed by Theil) $H_{y|x}$, standard error, z test, and p-value.

    \[ H_{y|x}= \frac{\sum_{i=1}^I \sum_{j=1}^J f_{ij} \log( f_{ij}/ (f_{i.}f_{.j}))}{-\sum_{j=1}^J f_{.j} \log f_{.j} } \]
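
A minimal sketch of the uncertainty coefficient on the table of the first example, with the entropy of $y$, $-\sum_j f_{.j}\log f_{.j}$, in the denominator so that the index is non-negative. Empty cells are skipped, following the usual convention $0\log 0=0$; the sketch assumes that all column totals are positive. Variable names are illustrative.

```matlab
N=[150  80 20  50
    80 250 30 140
    30  50  0 120];
F=N/sum(N(:));         % relative frequencies f_ij
fi=sum(F,2);           % row marginals f_i.
fj=sum(F,1);           % column marginals f_.j
Find=fi*fj;            % products f_i.*f_.j (independence frequencies)
pos=F>0;               % empty cells contribute 0 to the sum
num=sum(F(pos).*log(F(pos)./Find(pos)));
Hy=-sum(fj.*log(fj));  % entropy of the marginal distribution of y
Hyx=num/Hy;
fprintf('H(y|x)=%.5f\n',Hyx)
```

The result can be compared with the Hyx value printed in the first example.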

    TestInd

    4-by-4 array containing index values (first column), standard errors (second column), zscores (third column), p-values (fourth column).

    TestIndtable

    4-by-4 table containing index values (first column), standard errors (second column), zscores (third column), p-values (fourth column).

    This output is present only if your MATLAB version is 2013b or later.

    ConfLim

    4-by-4 array containing index values (first column), standard errors (second column), lower confidence limit (third column), upper confidence limit (fourth column).

    ConfLimtable

    4-by-4 table containing index values (first column), standard errors (second column), lower confidence limit (third column), upper confidence limit (fourth column).

    This output is present only if your MATLAB version is 2013b or later.

    More About

    Additional Details

    $\lambda_{y|x}$ is a measure of association that reflects the proportional reduction in error when values of the independent variable (variable in the rows of the contingency table) are used to predict values of the dependent variable (variable in the columns of the contingency table). The range of $\lambda_{y|x}$ is [0, 1]. A value of 1 means that the independent variable perfectly predicts the dependent variable. On the other hand, a value of 0 means that the independent variable does not help in predicting the dependent variable.

    More generally, let $V(y)$ be a measure of variation for the marginal distribution $(f_{.1}=n_{.1}/n, ..., f_{.J}=n_{.J}/n)$ of the response $y$ and let $V(y|i)$ denote the same measure computed for the conditional distribution $(f_{1|i}=n_{i1}/n_{i.}, ..., f_{J|i}=n_{iJ}/n_{i.})$ of $y$ at the $i$-th setting of the explanatory variable $x$. A proportional reduction in variation measure has the form

    \[ \frac{V(y) - E[V(y|x)]}{V(y)} \] where $E[V(y|x)]$ is the expectation of the conditional variation taken with respect to the distribution of $x$. When $x$ is a categorical variable having marginal distribution $(f_{1.}, \ldots, f_{I.})$, \[ E[V(y|x)]= \sum_{i=1}^I (n_{i.}/n) V(y|i) = \sum_{i=1}^I f_{i.} V(y|i) \] If we take as measure of variation $V(y)$ the Gini index \[ V(y)=1 -\sum_{j=1}^J f_{.j}^2 \qquad V(y|i)=1 -\sum_{j=1}^J f_{j|i}^2 \]

    we obtain the index of proportional reduction in variation $\tau_{y|x}$ of Goodman and Kruskal.

    \[ \tau_{y|x}= \frac{\sum_{i=1}^I \sum_{j=1}^J f_{ij}^2/f_{i.} -\sum_{j=1}^J f_{.j}^2 }{1-\sum_{j=1}^J f_{.j}^2 } \] If, on the other hand, we take as measure of variation $V(y)$ the entropy index \[ V(y)=-\sum_{j=1}^J f_{.j} \log f_{.j} \qquad V(y|i) = -\sum_{j=1}^J f_{j|i} \log f_{j|i} \]

    we obtain the index $H_{y|x}$ (uncertainty coefficient of Theil).

    \[ H_{y|x}= \frac{\sum_{i=1}^I \sum_{j=1}^J f_{ij} \log( f_{ij}/ (f_{i.}f_{.j}))}{-\sum_{j=1}^J f_{.j} \log f_{.j} } \]

    The range of $\tau_{y|x}$ and $H_{y|x}$ is [0, 1].

    A large value of the index represents a strong association, in the sense that we can guess $y$ much better when we know $x$ than when we do not.

    In other words, $\tau_{y|x}=H_{y|x} =1$ is equivalent to no conditional variation, in the sense that for each $i$ there is a $j$ such that $f_{j|i}=1$. For example, a value of:

    $\tau_{y|x}=0.85$ indicates that knowledge of x reduces error in predicting values of y by 85 per cent (when the variation measure which is used is the Gini's index).

    $H_{y|x}=0.85$ indicates that knowledge of x reduces error in predicting values of y by 85 per cent (when the variation measure which is used is the entropy index).
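
The general proportional-reduction-in-variation form described in this section can be sketched with the variation measure passed as a function handle, so that the same loop yields $\tau_{y|x}$ (Gini index) and $H_{y|x}$ (entropy index). This is an illustration on the table of the first example, with illustrative names (gini, entr, prv); it is not the code used by corrNominal.

```matlab
N=[150  80 20  50
    80 250 30 140
    30  50  0 120];
F=N/sum(N(:));
fi=sum(F,2);                   % marginal distribution of x (rows)
fj=sum(F,1);                   % marginal distribution of y (columns)
Fcond=bsxfun(@rdivide,F,fi);   % row i holds the conditional f_{j|i}
% Two variation measures acting on a probability vector p
gini=@(p) 1-sum(p.^2);
entr=@(p) -sum(p(p>0).*log(p(p>0)));
measures={gini,entr};
prv=zeros(1,2);
for k=1:2
    V=measures{k};
    EV=0;                      % E[V(y|x)], weighted by f_i.
    for i=1:size(F,1)
        EV=EV+fi(i)*V(Fcond(i,:));
    end
    prv(k)=(V(fj)-EV)/V(fj);   % proportional reduction in variation
end
fprintf('tau=%.5f  H=%.5f\n',prv(1),prv(2))
```

The two entries of prv can be compared with the tauyx and Hyx values printed in the first example.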

    References

    Agresti, A. (2002). Categorical Data Analysis. John Wiley & Sons, pp. 23-26.

    Goodman, L. A. and Kruskal, W. H. (1959). Measures of association for cross classifications II: Further Discussion and References, Journal of the American Statistical Association, 54, pp. 123-163.

    Goodman, L. A. and Kruskal, W. H. (1963). Measures of association for cross classifications III: Approximate Sampling Theory, Journal of the American Statistical Association, 58, pp. 310-364.

    Goodman, L. A. and Kruskal, W. H. (1972). Measures of association for cross classifications IV: Simplification of Asymptotic Variances. Journal of the American Statistical Association, 67, pp. 415-421.

    Liebetrau, A. M. (1983). Measures of Association, Sage University Papers Series on Quantitative Applications in the Social Sciences, 07-004, Newbury Park, CA: Sage, pp. 49-56.

    Smithson, M.J. (2003). Confidence Intervals, Quantitative Applications in the Social Sciences Series, No. 140. Thousand Oaks, CA: Sage, pp. 39-41.

    Acknowledgements

    In order to find the confidence interval for the noncentrality parameter of the Chi-squared distribution we use routine ncpci from the Effect Size Toolbox Code by Harald Hentschke (University of Tübingen) and Maik Stüttgen (University of Bochum).
