Dataplot Vol 1 Auxiliary Chapter

KOLMOGOROV SMIRNOV TWO SAMPLE

Name:
    ... KOLMOGOROV SMIRNOV TWO SAMPLE TEST
Type:
    Analysis Command
Purpose:
    Perform a Kolmogorov-Smirnov two sample test that two data samples come from the same distribution. Note that we are not specifying what that common distribution is.
Description:
    The one sample Kolmogorov-Smirnov (K-S) test is based on the empirical distribution function (ECDF). Given N data points Y1, Y2, ..., YN, the ECDF is defined as

      E(i) = n(i)/N

    where n(i) is the number of points less than Yi and the Yi are ordered from smallest to largest. This is a step function that increases by 1/N at the value of each data point. We can plot the empirical distribution function together with the cumulative distribution function of a given distribution. The one sample K-S test is based on the maximum distance between these two curves. That is,

      D = max |F(Y(i)) - E(i)|

    where F is the theoretical cumulative distribution function.
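
    As an illustration of the step function described above, the following is a minimal Python sketch of an ECDF. It is not Dataplot code, and the function name ecdf is hypothetical. It assumes the common convention of counting points less than or equal to x, so the function jumps by 1/N exactly at each data value.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return a function x -> fraction of sample points <= x (assumed convention)."""
    ordered = sorted(sample)
    n = len(ordered)

    def evaluate(x):
        # bisect_right counts how many ordered values are <= x.
        return bisect_right(ordered, x) / n

    return evaluate


# Hypothetical data, used only to show the 1/N steps.
e = ecdf([3, 1, 2, 4])
print(e(0), e(1), e(2.5), e(4))
```

    With four points, each step has height 1/4, so the values climb from 0 to 1 in increments of 0.25.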

    The two sample K-S test is a variation of this. However, instead of comparing an empirical distribution function to a theoretical distribution function, we compare the two empirical distribution functions. That is,

      D = max |E1(i) - E2(i)|

    where E1 and E2 are the empirical distribution functions for the two samples. Note that both E1 and E2 are evaluated at every point in both samples.
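
    The two sample statistic can be sketched in Python as follows. This is an illustrative sketch, not Dataplot's implementation. The function name ks_two_sample_statistic is hypothetical, and each ECDF is assumed to count points less than or equal to the evaluation point. No scaling is applied to D, so it may not match the scaling of a given published table.

```python
from bisect import bisect_right


def ks_two_sample_statistic(sample1, sample2):
    """Return D = max |E1(x) - E2(x)| over every point x in either sample."""
    s1 = sorted(sample1)
    s2 = sorted(sample2)
    d = 0.0
    # Evaluate both ECDFs at each observed value from both samples.
    for x in s1 + s2:
        e1 = bisect_right(s1, x) / len(s1)  # fraction of sample 1 <= x
        e2 = bisect_right(s2, x) / len(s2)  # fraction of sample 2 <= x
        d = max(d, abs(e1 - e2))
    return d


# Hypothetical data: completely separated samples give the maximum value D = 1.
print(ks_two_sample_statistic([1, 2, 3], [4, 5, 6]))
# Interleaved samples give a smaller D.
print(ks_two_sample_statistic([1, 3], [2, 4]))
```

    Because both ECDFs are step functions that change only at data values, checking the pooled data points is enough to find the maximum difference.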

    More formally, the Kolmogorov-Smirnov two sample test statistic can be defined as follows.

    H0: The two samples come from a common distribution.
    Ha: The two samples do not come from a common distribution.
    Test Statistic: The Kolmogorov-Smirnov two sample test statistic is defined as

      D = max |E1(i) - E2(i)|

    where E1 and E2 are the empirical distribution functions for the two samples.

    Significance Level: alpha
    Critical Region: The hypothesis that the two samples come from a common distribution is rejected if the test statistic, D, is greater than the critical value obtained from a table. There are several variations of these tables in the literature that use somewhat different scalings for the K-S test statistic and critical regions. These alternative formulations should be equivalent, but it is necessary to ensure that the test statistic is calculated in a way that is consistent with how the critical values were tabulated.

    Dataplot uses the critical values from Chakravarti, Laha, and Roy (see Reference: below).
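
    The rejection rule itself is a simple comparison, sketched below in Python. The function name ks_conclusions and the label "DO NOT REJECT H0" are hypothetical. The cutoff values are copied from the sample output in the Program section of this page and apply only to that data set; they are not general critical values.

```python
def ks_conclusions(statval, cutoffs):
    """Reject H0 at each alpha level where the statistic exceeds the cutoff."""
    return {
        alpha: "REJECT H0" if statval > cutoff else "DO NOT REJECT H0"
        for alpha, cutoff in cutoffs.items()
    }


# Cutoffs and statistic taken from the sample output on this page.
cutoffs = {"10%": 0.37, "5%": 0.41, "1%": 0.49}
print(ks_conclusions(1.0, cutoffs))
```

    The statistic and the cutoffs must use the same scaling for this comparison to be meaningful.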

    The quantile-quantile plot, bihistogram, and Tukey mean-difference plot are graphical alternatives to the two sample K-S test.

Syntax:
    KOLMOGOROV SMIRNOV TWO SAMPLE TEST <y1> <y2>
                            <SUBSET/EXCEPT/FOR qualification>
    where <y1> is the first response variable;
                <y2> is the second response variable;
    and where the <SUBSET/EXCEPT/FOR qualification> is optional.
Examples:
    KOLMOGOROV-SMIRNOV TWO SAMPLE TEST Y1 Y2
    KOLMOGOROV-SMIRNOV TWO SAMPLE TEST Y1 Y2 SUBSET Y2 > 0
Note:
    The KOLMOGOROV-SMIRNOV TWO SAMPLE TEST command automatically saves the following parameters.

      STATVAL - value of the K-S two sample statistic
      CUTUPP90 - 90% critical value (alpha = 0.10) for the K-S two sample test statistic
      CUTUPP95 - 95% critical value (alpha = 0.05) for the K-S two sample test statistic
      CUTUPP99 - 99% critical value (alpha = 0.01) for the K-S two sample test statistic

    These parameters can be used in subsequent analysis.

Default:
    None
Synonyms:
    The word TEST in the command is optional. Also, TWO can be entered as 2. For example,

      KOLMOGOROV SMIRNOV 2 SAMPLE Y1 Y2
Related Commands:
    KOLMOGOROV SMIRNOV GOODNESS OF FIT TEST = Perform Kolmogorov-Smirnov goodness of fit test.
    CHI-SQUARE TWO SAMPLE TEST = Perform chi-square two sample test.
    BIHISTOGRAM = Generates a bihistogram.
    QUANTILE-QUANTILE PLOT = Generates a quantile-quantile plot.
    TUKEY MEAN DIFFERENCE PLOT = Generates a Tukey mean difference plot.
Reference:
    "Handbook of Methods of Applied Statistics, Volume I", Chakravarti, Laha, and Roy, John Wiley, 1967, pp. 392-394.

    "Numerical Recipes in Fortran: The Art of Scientific Computing", Second Edition, Press, Teukolsky, Vetterling, and Flannery, Cambridge University Press, 1992, pp. 614-622.

Applications:
    Distributional Analysis
Implementation Date:
    1998/12
Program:
    SKIP 25
    READ AUTO83B.DAT Y1 Y2
    .
    DELETE Y2 SUBSET Y2 < 0
    KOLMOGOROV-SMIRNOV TWO SAMPLE TEST Y1 Y2

    The following output is generated.

          ************************************************
          **  KOLMOGOROV-SMIRNOV TWO SAMPLE TEST Y1 Y2  **
          ************************************************
     
     
                      KOLMOGOROV-SMIRNOV TWO SAMPLE TEST
     
    NULL HYPOTHESIS H0:      TWO SAMPLES COME FROM THE SAME (UNSPECIFIED)
    DISTRIBUTION
    ALTERNATE HYPOTHESIS HA: TWO SAMPLES COME FROM DIFFERENT DISTRIBUTIONS
     
    SAMPLE:
       NUMBER OF OBSERVATIONS FOR SAMPLE 1 =      249
       NUMBER OF OBSERVATIONS FOR SAMPLE 2 =       79
     
    TEST:
    KOLMOGOROV-SMIRNOV TEST STATISTIC     =    1.000000
     
       ALPHA LEVEL         CUTOFF              CONCLUSION
               10%        0.37000               REJECT H0
                5%        0.41000               REJECT H0
                1%        0.49000               REJECT H0
        

Date created: 6/5/2001
Last updated: 4/4/2003
Please email comments on this WWW page to alan.heckert@nist.gov.