
CHI-SQUARE INDEPENDENCE TEST

Name:
    CHI-SQUARE INDEPENDENCE TEST (LET)
Type:
    Analysis Command
Purpose:
    Perform a chi-square test of independence for a two-way contingency table.
Description:
    If we have N observations with two variables where each observation can be classified into one of R mutually exclusive categories for variable one and one of C mutually exclusive categories for variable two, then a cross-tabulation of the data results in a two-way contingency table (also referred to as an RxC contingency table). The resulting contingency table has R rows and C columns.

    A common question with regard to a two-way contingency table is whether the row and column variables are independent. By independence, we mean that the row and column variables are unassociated (i.e., knowing the value of the row variable will not help us predict the value of the column variable and likewise knowing the value of the column variable will not help us predict the value of the row variable).

    A more technical definition for independence is that

      P(row i, column j) = P(row i)*P(column j)       for all i,j

    One test of this hypothesis is the chi-square test for independence.

      H0: The row and column variables of the two-way table are independent
      Ha: The row and column variables of the two-way table are not independent
      Test Statistic:
        \( T = \sum_{i=1}^{r}{\sum_{j=1}^{c}{\frac{(O_{ij} - E_{ij})^2} {E_{ij}}}} \)

      where

        r = the number of rows in the contingency table
        c = the number of columns in the contingency table
        Oij = the observed frequency of the ith row and jth column
        Eij = the expected frequency of the ith row and jth column
          = \( \frac{R_i C_j}{N} \)
        Ri = the sum of the observed frequencies for row i
        Cj = the sum of the observed frequencies for column j
        N = the total sample size

      Significance Level: \( \alpha \)
      Critical Region: T > CHSPPF(\( 1 - \alpha \),(r-1)*(c-1))

      where CHSPPF is the percent point function of the chi-square distribution and (r-1)*(c-1) is the degrees of freedom

      Conclusion: Reject the independence hypothesis if the value of the test statistic is greater than this chi-square critical value.
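
    The test above can be sketched outside Dataplot as follows. This is a minimal illustration in Python (standard library only), not Dataplot's implementation: it uses the Pearson form of the statistic, summing (Oij - Eij)^2/Eij over all cells, and takes the critical value as the 1 - alpha quantile of the chi-square distribution. The helper names (chi2_cdf, chi2_ppf, chi_square_independence) are made up for this sketch; chi2_ppf stands in for CHSPPF and is computed by a series expansion plus bisection. The table is the one used in the Program section of this page.

```python
import math

def chi2_cdf(x, df):
    """Chi-square CDF via the series for the regularized lower incomplete gamma."""
    if x <= 0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-16 * total:
        term *= z / (a + n)
        total += term
        n += 1
    return min(1.0, total * math.exp(-z + a * math.log(z) - math.lgamma(a)))

def chi2_ppf(p, df):
    """Percent point function by bisection on the CDF (stand-in for CHSPPF)."""
    lo, hi = 0.0, df + 50.0 * math.sqrt(2.0 * df) + 50.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def chi_square_independence(table):
    """Return (T, degrees of freedom) for an r x c table of observed counts."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    t = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / n   # Eij = Ri*Cj/N
            t += (obs - exp) ** 2 / exp
    df = (len(row_tot) - 1) * (len(col_tot) - 1)
    return t, df

table = [[5, 29, 14, 16],
         [15, 54, 14, 10],
         [20, 84, 17, 94],
         [68, 119, 26, 7]]

alpha = 0.05
t, df = chi_square_independence(table)
critical = chi2_ppf(1 - alpha, df)
reject = t > critical
print(f"T = {t:.4f}, df = {df}, critical value = {critical:.2f}, reject = {reject}")
```

    For this table the statistic and the 95% critical value should agree with the "WITHOUT YATES CONTINUITY CORRECTION" values in the sample output at the end of this page.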

    This test statistic can also be formulated as

      \( \sum_{i=1}^{r}{\sum_{j=1}^{c}{d_{ij}^2}} \)

    where

      \( d_{ij} = \frac{O_{ij} - E_{ij}} {\sqrt{E_{ij}}} \)

    The dij are referred to as the standardized residuals. The squared residual for a cell is that cell's contribution to the chi-square test statistic.
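
    As an illustration of the residual formulation, the following Python sketch (standard library only; the function name is hypothetical and this is not Dataplot code) computes dij = (Oij - Eij)/sqrt(Eij) for the table used in the Program section, checks that the squared residuals sum to the test statistic, and locates the cell with the largest absolute residual.

```python
import math

def standardized_residuals(table):
    """Return the matrix of (Oij - Eij) / sqrt(Eij) for an r x c table."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    resid = []
    for i, row in enumerate(table):
        out = []
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / n
            out.append((obs - exp) / math.sqrt(exp))
        resid.append(out)
    return resid

table = [[5, 29, 14, 16],
         [15, 54, 14, 10],
         [20, 84, 17, 94],
         [68, 119, 26, 7]]

d = standardized_residuals(table)

# The squared residuals add up to the chi-square test statistic.
t = sum(x * x for row in d for x in row)

# Cell (1-based row, column) with the largest absolute residual.
largest = max(((abs(x), i + 1, j + 1)
               for i, row in enumerate(d)
               for j, x in enumerate(row)))
print(f"T = {t:.4f}")
print(f"largest |d| = {largest[0]:.2f} at row {largest[1]}, column {largest[2]}")
```

    The sign of each residual indicates whether the cell was observed more often (positive) or less often (negative) than expected under independence.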

Syntax 1:
    CHI-SQUARE INDEPENDENCE TEST <y1> <y2>
                            <SUBSET/EXCEPT/FOR qualification>
    where <y1> is the first response variable;
                <y2> is the second response variable;
    and where the <SUBSET/EXCEPT/FOR qualification> is optional.

    This syntax is used for the case where you have raw data (i.e., the data has not yet been cross tabulated into a two-way table).
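
    To make the distinction between raw and cross-tabulated data concrete, the Python sketch below (standard library only) turns two raw variables into a two-way table of counts. The data values and the function name are hypothetical, and the ordering of categories (sorted) is an assumption of this sketch rather than a statement about how Dataplot orders levels.

```python
from collections import Counter

def cross_tabulate(y1, y2):
    """Count how often each (y1, y2) pair occurs and arrange the counts as a table."""
    rows = sorted(set(y1))          # categories of the first variable
    cols = sorted(set(y2))          # categories of the second variable
    counts = Counter(zip(y1, y2))
    table = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, table

# Hypothetical raw data: one entry per observation.
y1 = [1, 1, 2, 2, 2, 1, 2, 1, 2, 2]
y2 = [1, 2, 1, 2, 2, 1, 2, 1, 2, 2]

rows, cols, table = cross_tabulate(y1, y2)
for label, counts_row in zip(rows, table):
    print(label, counts_row)
```

    The resulting table is the kind of matrix that Syntax 2 expects as input.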

Syntax 2:
    CHI-SQUARE INDEPENDENCE TEST <m>
                            <SUBSET/EXCEPT/FOR qualification>
    where <m> is a matrix containing the two-way table;
    and where the <SUBSET/EXCEPT/FOR qualification> is optional.

    This syntax is used for the case where the data have already been cross-tabulated into a two-way contingency table.

Syntax 3:
    CHI-SQUARE INDEPENDENCE TEST <n11> <n12> <n21> <n22>
    where <n11> is a parameter containing the value for row 1, column 1 of a 2x2 table;
                <n12> is a parameter containing the value for row 1, column 2 of a 2x2 table;
                <n21> is a parameter containing the value for row 2, column 1 of a 2x2 table;
                and <n22> is a parameter containing the value for row 2, column 2 of a 2x2 table.

    This syntax is used for the special case where you have a 2x2 table. In this case, you can enter the 4 values directly, although you do need to be careful that the parameters are entered in the order expected above.
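
    The Python sketch below (standard library only, hypothetical counts and function names) illustrates why the order of the four values matters. For a 2x2 table the Pearson statistic reduces to the well-known shortcut N*(n11*n22 - n12*n21)^2/(R1*R2*C1*C2); the sketch checks the shortcut against the general formula and shows that swapping two of the values changes the result. It does not describe Dataplot's internal computation.

```python
def chi_square_general(table):
    """Pearson chi-square statistic for an r x c table of observed counts."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    return sum((obs - row_tot[i] * col_tot[j] / n) ** 2 / (row_tot[i] * col_tot[j] / n)
               for i, row in enumerate(table)
               for j, obs in enumerate(row))

def chi_square_2x2(n11, n12, n21, n22):
    """Shortcut form of the same statistic for a 2x2 table."""
    r1, r2 = n11 + n12, n21 + n22
    c1, c2 = n11 + n21, n12 + n22
    n = r1 + r2
    return n * (n11 * n22 - n12 * n21) ** 2 / (r1 * r2 * c1 * c2)

# Hypothetical counts, entered in the order n11, n12, n21, n22.
t_shortcut = chi_square_2x2(10, 20, 30, 40)
t_general = chi_square_general([[10, 20], [30, 40]])

# Entering n12 and n22 in the wrong order describes a different table.
t_swapped = chi_square_2x2(10, 40, 30, 20)

print(f"shortcut = {t_shortcut:.4f}, general = {t_general:.4f}, swapped = {t_swapped:.4f}")
```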

Examples:
    CHI-SQUARE INDEPENDENCE TEST Y1 Y2
    CHI-SQUARE INDEPENDENCE TEST M
    CHI-SQUARE INDEPENDENCE TEST N11 N12 N21 N22
Note:
    The chi-square approximation is asymptotic. This means that the critical values may not be valid if the expected frequencies are too small.

    Cochran suggests that if the minimum expected frequency is less than 1 or if more than 20% of the expected frequencies are less than 5, the approximation may be poor. However, Conover suggests that this is probably too conservative, particularly if r and c are not too small. He suggests that the minimum expected frequency should be at least 0.5 and at least half the expected frequencies should be greater than 1.

    In any event, if there are too many low expected frequencies, you can do one of the following:

    1. If rows or columns with small expected frequencies can be intelligently combined, then this may result in expected frequencies that are sufficiently large.

    2. Use Fisher's exact test.
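
    The two rules of thumb above can be checked before relying on the chi-square approximation. The following Python sketch (standard library only) is one reading of those rules, with Cochran's rule taken as "minimum expected frequency at least 1 and no more than 20% of expected frequencies below 5" and Conover's as "minimum at least 0.5 and at least half greater than 1". The function names and the sparse example table are hypothetical.

```python
def expected_frequencies(table):
    """Flat list of Eij = Ri*Cj/N for an r x c table of observed counts."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    return [rt * ct / n for rt in row_tot for ct in col_tot]

def check_rules(table):
    """Summarize the expected frequencies and apply both rules of thumb."""
    exp = expected_frequencies(table)
    smallest = min(exp)
    frac_below_5 = sum(e < 5 for e in exp) / len(exp)
    frac_above_1 = sum(e > 1 for e in exp) / len(exp)
    return {
        "min_expected": smallest,
        "cochran_ok": smallest >= 1 and frac_below_5 <= 0.20,
        "conover_ok": smallest >= 0.5 and frac_above_1 >= 0.5,
    }

# Hypothetical sparse table: fails Cochran's rule but passes Conover's.
sparse = [[1, 2, 12],
          [2, 3, 30]]

# Table from the Program section of this page: passes both rules.
dense = [[5, 29, 14, 16],
         [15, 54, 14, 10],
         [20, 84, 17, 94],
         [68, 119, 26, 7]]

print("sparse:", check_rules(sparse))
print("dense: ", check_rules(dense))
```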
Note:
    Conover points out that there are really 3 distinct tests:

    1. Only N is fixed. The row and column totals are not fixed (i.e., they are random).

    2. Either the row totals or the column totals are fixed beforehand.

    3. Both the row totals and the column totals are fixed beforehand.

    Note that in all three cases, the test statistic and the chi-square approximation are the same. What differs is the exact distribution of the test statistic. When either the row or column totals (or both) are fixed, the possible number of contingency tables is reduced.

    As long as the expected frequencies are sufficiently large, the chi-square approximation should be adequate for practical purposes.

Note:
    Some authors recommend using a continuity correction for this test. With the Yates correction, the absolute difference |Oij - Eij| in each cell is reduced by 0.5 before it is squared. Dataplot performs this test both with and without the continuity correction.
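
    As a rough check on the corrected statistic, the Python sketch below (standard library only, hypothetical function name) assumes the usual Yates form of the correction, in which each cell contributes (|Oij - Eij| - 0.5)^2/Eij. Cells whose absolute difference is smaller than 0.5 are not treated specially here; whether Dataplot handles them differently is not stated on this page. Applied to the table in the Program section, the result should be close to the corrected value shown in the sample output.

```python
def chi_square_yates(table):
    """Chi-square statistic with 0.5 subtracted from |O - E| in every cell."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    t = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / n
            t += (abs(obs - exp) - 0.5) ** 2 / exp
    return t

def chi_square_plain(table):
    """Uncorrected Pearson statistic, for comparison."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    return sum((obs - row_tot[i] * col_tot[j] / n) ** 2 / (row_tot[i] * col_tot[j] / n)
               for i, row in enumerate(table)
               for j, obs in enumerate(row))

table = [[5, 29, 14, 16],
         [15, 54, 14, 10],
         [20, 84, 17, 94],
         [68, 119, 26, 7]]

t_plain = chi_square_plain(table)
t_yates = chi_square_yates(table)
print(f"without correction: {t_plain:.4f}")
print(f"with correction:    {t_yates:.4f}")
```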
Note:
    The following information is written to the file dpst1f.dat (in the current directory):

      Column 1 - row id
      Column 2 - column id
      Column 3 - row total
      Column 4 - column total
      Column 5 - expected frequency (Eij)
      Column 6 - observed frequency (Oij)

    To read this information into Dataplot, enter

      SKIP 1
      READ DPST1F.DAT ROWID COLID ROWTOT COLTOT ...
                  EXPFREQ OBSFREQ
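
    To show what the six columns contain, the Python sketch below (standard library only) builds one record per cell in the column order listed above, using the table from the Program section. It only mirrors the column layout; the function name is hypothetical, and the order of the records (row by row) and the 1-based ids are assumptions of this sketch, not a description of the actual contents of dpst1f.dat.

```python
def cell_records(table):
    """One tuple per cell: (row id, column id, row total, column total, Eij, Oij)."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    records = []
    for i, row in enumerate(table, start=1):
        for j, obs in enumerate(row, start=1):
            rt, ct = row_tot[i - 1], col_tot[j - 1]
            records.append((i, j, rt, ct, rt * ct / n, obs))
    return records

table = [[5, 29, 14, 16],
         [15, 54, 14, 10],
         [20, 84, 17, 94],
         [68, 119, 26, 7]]

records = cell_records(table)
for rec in records[:4]:
    print("{:3d} {:3d} {:5d} {:5d} {:10.4f} {:5d}".format(*rec))
```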
Note:
    The ASSOCIATION PLOT command can be used to plot the standardized residuals of the chi-square analysis.

    The ODDS RATIO INDEPENDENCE TEST is an alternative test for independence based on the LOG(odds ratio).

Default:
    None
Synonyms:
    None
Related Commands:
Reference:
    Conover (1999), "Practical Nonparametric Statistics", Third Edition, Wiley, pp. 204-216.

    Friendly (2000), "Visualizing Categorical Data", SAS Institute Inc., p. 90.

    Cochran (1952), "The Chi-Square Test of Goodness of Fit", Annals of Mathematical Statistics, 23, pp. 315-345.

Applications:
    Categorical Data Analysis
Implementation Date:
    2007/3
Program:
     
    . Example from page 61 of Friendly
    read matrix m
     5  29 14 16
    15  54 14 10
    20  84 17 94
    68 119 26  7
    end of data
    .
    chi-square independence test m
        
    The following output is generated:
               CHI-SQUARE TEST FOR INDEPENDENCE (RXC TABLE)
      
     NULL HYPOTHESIS: THE TWO VARIABLES ARE INDEPENDENT
     ALTERNATIVE HYPOTHESIS: THE TWO VARIABLES ARE NOT INDEPENDENT
      
     SAMPLE 1:
     NUMBER OF OBSERVATIONS                    =      592
     NUMBER OF LEVELS (ROWS)                   =        4
      
     SAMPLE 2:
     NUMBER OF OBSERVATIONS                    =      592
     NUMBER OF LEVELS (COLUMNS)                =        4
      
     WITHOUT YATES CONTINUITY CORRECTION:
     CHI-SQUARE TEST STATISTIC                =    138.2898
     DEGREES OF FREEDOM                       =        9
     CDF VALUE OF TEST STATISTIC              =    1.000000
      
     WITH YATES CONTINUITY CORRECTION:
     CHI-SQUARE TEST STATISTIC                =    132.0374
     DEGREES OF FREEDOM                       =        9
     CDF VALUE OF TEST STATISTIC              =    1.000000
      
      
     WITHOUT YATES CONTINUITY CORRECTION
                                           NULL HYPOTHESIS   NULL
     NULL          CONFIDENCE    CRITICAL  ACCEPTANCE        HYPOTHESIS
     HYPOTHESIS    LEVEL         VALUE     INTERVAL          CONCLUSION
     ===================================================================
     INDEPENDENT      50.0%        8.34     (0,0.500)        REJECT
     INDEPENDENT      80.0%       12.24     (0,0.800)        REJECT
     INDEPENDENT      90.0%       14.68     (0,0.900)        REJECT
     INDEPENDENT      95.0%       16.92     (0,0.950)        REJECT
     INDEPENDENT      97.5%       19.02     (0,0.975)        REJECT
     INDEPENDENT      99.0%       21.67     (0,0.990)        REJECT
      
     WITH YATES CONTINUITY CORRECTION
                                           NULL HYPOTHESIS   NULL
     NULL          CONFIDENCE    CRITICAL  ACCEPTANCE        HYPOTHESIS
     HYPOTHESIS    LEVEL         VALUE     INTERVAL          CONCLUSION
     ===================================================================
     INDEPENDENT      50.0%        8.34     (0,0.500)        REJECT
     INDEPENDENT      80.0%       12.24     (0,0.800)        REJECT
     INDEPENDENT      90.0%       14.68     (0,0.900)        REJECT
     INDEPENDENT      95.0%       16.92     (0,0.950)        REJECT
     INDEPENDENT      97.5%       19.02     (0,0.975)        REJECT
     INDEPENDENT      99.0%       21.67     (0,0.990)        REJECT
        
Date created: 07/25/2007
Last updated: 12/11/2023

Please email comments on this WWW page to alan.heckert@nist.gov.