
# EXTREME STUDENTIZED DEVIATE TEST

Name:
EXTREME STUDENTIZED DEVIATE TEST
Type:
Analysis Command
Purpose:
Perform a generalized extreme studentized deviate (ESD) test for outliers.
Description:
The generalized extreme Studentized deviate (ESD) test is used to detect one or more outliers in a univariate data set that follows an approximately normal distribution.

The primary limitation of the Grubbs test and the Tietjen-Moore test is that the suspected number of outliers, k, must be specified exactly. If k is not specified correctly, this can distort the conclusions of these tests. On the other hand, the generalized ESD test only requires that an upper bound for the suspected number of outliers be specified.

Given the upper bound, r, the generalized ESD test essentially performs r separate tests: a test for one outlier, a test for two outliers, and so on up to r outliers.

The generalized ESD test is defined for the hypothesis:

H0: There are no outliers in the data set

Ha: There are up to r outliers in the data set

Test Statistic: Compute

$$R_{1} = \frac{\max_{i}|x_{i} - \bar{x}|}{s}$$

with $$\bar{x}$$ and s denoting the sample mean and sample standard deviation, respectively. Remove the observation that maximizes $$|x_{i} - \bar{x}|$$ and then recompute the above statistic with n - 1 observations. Repeat this process until r observations have been removed. This results in the r test statistics R1, R2, ..., Rr.

Significance Level: $$\alpha$$

Critical Region: Corresponding to the r test statistics, compute the following r critical values

$$\lambda_{i} = \frac{(n-i) \, t_{n-i-1,p}} {\sqrt{(n-i-1+t_{n-i-1,p}^{2}) (n-i+1)}}$$

where i = 1, 2, ..., r, $$t_{\nu,p}$$ is the 100p percentage point from the t distribution with $$\nu$$ degrees of freedom and $$p = 1 - \frac{\alpha}{2(n-i+1)}$$. The number of outliers is determined by finding the largest i such that Ri > $$\lambda_{i}$$. If there is no such i, no outliers are declared. Simulation studies by Rosner indicate that this critical value approximation is very accurate for n ≥ 25 and reasonably accurate for n ≥ 15.
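The procedure above can be sketched in Python. This is a minimal illustration under stated assumptions, not Dataplot's implementation: the function names (`t_ppf`, `esd_statistics`, `esd_critical_value`, `count_outliers`) are hypothetical, and the t percent point is obtained by simple numerical integration and bisection, chosen only to keep the sketch free of third-party dependencies.

```python
import math
import statistics


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x, df, steps=1000):
    """CDF for x >= 0 via Simpson's rule on [0, x] (steps must be even)."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    return 0.5 + total * h / 3


def t_ppf(p, df):
    """Percent point for 0.5 < p < 1 by bisection (assumes the answer < 100)."""
    lo, hi = 0.0, 100.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def esd_statistics(data, r):
    """Return R_1..R_r, removing the most extreme point after each step."""
    values = list(data)
    stats = []
    for _ in range(r):
        mean = statistics.mean(values)
        sd = statistics.stdev(values)
        extreme = max(values, key=lambda x: abs(x - mean))
        stats.append(abs(extreme - mean) / sd)
        values.remove(extreme)
    return stats


def esd_critical_value(n, i, alpha):
    """lambda_i = (n-i) t / sqrt((n-i-1+t^2)(n-i+1)), t at p = 1 - alpha/(2(n-i+1))."""
    p = 1 - alpha / (2 * (n - i + 1))
    df = n - i - 1
    t = t_ppf(p, df)
    return (n - i) * t / math.sqrt((df + t * t) * (n - i + 1))


def count_outliers(data, r, alpha=0.05):
    """Largest i with R_i > lambda_i, or 0 if there is none."""
    n = len(data)
    stats = esd_statistics(data, r)
    hits = [i for i, r_i in enumerate(stats, start=1)
            if r_i > esd_critical_value(n, i, alpha)]
    return max(hits, default=0)
```

Calling `count_outliers(y, 10)` on the 54 Rosner observations listed in the Program section should agree with the example output, which identifies 3 outliers at the 5% level.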

Note that although the generalized ESD is essentially Grubbs test applied sequentially, there are a few important distinctions:

• The generalized ESD test adjusts the critical values for the number of outliers being tested. The sequential application of Grubbs test makes no such adjustment.

• If there is significant masking, applying Grubbs test sequentially may stop too soon. The example below identifies 3 outliers at the 5% level when using the generalized ESD test. However, trying to use Grubbs test sequentially would stop at the first iteration and declare no outliers.

• Grubbs test allows one-sided tests (i.e., you can specify a minimum test or the maximum test) in addition to two-sided tests (both the minimum and the maximum value are tested). The generalized ESD test is restricted to two-sided tests.
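The difference between the two stopping rules can be illustrated with the test statistics and 5% critical values from the Summary Table in the example output of this document. The sketch below is illustrative only: the function names are hypothetical, and the sequential rule reuses the generalized ESD critical values as a stand-in for Grubbs critical values. That substitution does not affect the result here, because the two-sided Grubbs critical value coincides with $$\lambda_{1}$$ at the first step, which is where the sequential rule stops.

```python
# Test statistics R_i and 5% critical values lambda_i for i = 1..10,
# copied from the Summary Table of the example output in this document.
R = [3.11890, 2.94297, 3.17942, 2.81018, 2.81557,
     2.84817, 2.27932, 2.31036, 2.10158, 2.06717]
LAM_05 = [3.15879, 3.15142, 3.14388, 3.13616, 3.12824,
          3.12012, 3.11179, 3.10324, 3.09445, 3.08542]


def sequential_count(stats, crit):
    """Sequential single-outlier rule: stop at the first non-significant step."""
    count = 0
    for r_i, lam_i in zip(stats, crit):
        if r_i <= lam_i:
            break
        count += 1
    return count


def generalized_esd_count(stats, crit):
    """Generalized ESD rule: largest i with R_i > lambda_i, else 0."""
    hits = [i for i, (r_i, lam_i) in enumerate(zip(stats, crit), start=1)
            if r_i > lam_i]
    return max(hits, default=0)


print(sequential_count(R, LAM_05))       # 0: R_1 does not exceed lambda_1
print(generalized_esd_count(R, LAM_05))  # 3: R_3 exceeds lambda_3
```

Because R1 = 3.11890 falls just below 3.15879, the sequential rule never reaches the third step, where the statistic is significant.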
Syntax 1:
EXTREME STUDENTIZED DEVIATE TEST <y>
<SUBSET/EXCEPT/FOR qualification>
where <y> is the response variable being tested;
and where the <SUBSET/EXCEPT/FOR qualification> is optional.
Syntax 2:
EXTREME STUDENTIZED DEVIATE MULTIPLE TEST <y1> ... <yk>
<SUBSET/EXCEPT/FOR qualification>
where <y1> ... <yk> is a list of up to k response variables;
and where the <SUBSET/EXCEPT/FOR qualification> is optional.

This syntax performs an extreme studentized deviate test on <y1> then on <y2> and so on. Up to 30 response variables can be specified.

Note that the syntax

EXTREME STUDENTIZED DEVIATE MULTIPLE TEST Y1 TO Y4

is supported. This is equivalent to

EXTREME STUDENTIZED DEVIATE MULTIPLE TEST Y1 Y2 Y3 Y4
Syntax 3:
EXTREME STUDENTIZED DEVIATE REPLICATED TEST <y> <x1> ... <xk>
<SUBSET/EXCEPT/FOR qualification>
where <y> is the response variable;
<x1> ... <xk> is a list of up to k group-id variables;
and where the <SUBSET/EXCEPT/FOR qualification> is optional.

This syntax performs a cross-tabulation of <x1> ... <xk> and performs an extreme studentized deviate test for each unique combination of cross-tabulated values. For example, if X1 has 3 levels and X2 has 2 levels, there will be a total of 6 extreme studentized deviate tests performed.
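The cross-tabulation step can be pictured with a short Python sketch. It is not Dataplot code, and the function name `split_by_groups` is hypothetical: it only shows how a response variable splits into one subset per unique combination of group-id values, each of which would then be tested separately.

```python
from collections import defaultdict


def split_by_groups(y, *group_ids):
    """Map each observed combination of group-id values to its responses."""
    cells = defaultdict(list)
    for value, *levels in zip(y, *group_ids):
        cells[tuple(levels)].append(value)
    return dict(cells)


# Twelve made-up responses, X1 with 3 levels and X2 with 2 levels.
y = [float(k) for k in range(12)]
x1 = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
x2 = [1, 2] * 6

cells = split_by_groups(y, x1, x2)
print(len(cells))  # 6 cells, so 6 separate tests
```

Only combinations that actually occur in the data produce a cell, so the number of tests can be smaller than the product of the level counts.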

Up to six group-id variables can be specified.

Note that the syntax

EXTREME STUDENTIZED DEVIATE REPLICATED TEST Y X1 TO X4

is supported. This is equivalent to

EXTREME STUDENTIZED DEVIATE REPLICATED TEST Y X1 X2 X3 X4
Examples:
EXTREME STUDENTIZED DEVIATE TEST Y1
EXTREME STUDENTIZED DEVIATE TEST Y1 LABID
EXTREME STUDENTIZED DEVIATE MULTIPLE TEST Y1 Y2 Y3
EXTREME STUDENTIZED DEVIATE REPLICATED TEST Y X1 X2
EXTREME STUDENTIZED DEVIATE TEST Y1 SUBSET TAG > 2
EXTREME STUDENTIZED DEVIATE MINIMUM TEST Y1
EXTREME STUDENTIZED DEVIATE MAXIMUM TEST Y1
Note:
The upper bound on the number of outliers to test for is specified with the command

LET NOUTLIER = <value>
Note:
Masking and swamping are two issues that can affect outlier tests.

Masking can occur when we specify too few outliers in the test. For example, if we are testing for a single outlier when there are in fact two (or more) outliers, these additional outliers may influence the value of the test statistic enough so that no points are declared as outliers.

On the other hand, swamping can occur when we specify too many outliers in the test. For example, if we are testing for two outliers when there is in fact only a single outlier, both points may be declared outliers.

The possibility of masking and swamping is an important reason to complement formal outlier tests with graphical methods. Graphics can often help identify cases where masking or swamping may be an issue.

Also, masking is one reason that trying to apply a single outlier test sequentially can fail. If there are multiple outliers, masking may cause the outlier test for the first outlier to return a conclusion of no outliers (and so the testing for any additional outliers is not done). Also, applying a single outlier test sequentially does not properly adjust the critical value for the overall test.

The masking/swamping issue explains the primary advantage of the generalized ESD test. When there is masking or swamping, it is not uncommon for the conclusion about the presence of outliers to change as the specified number of outliers changes. By weakening the assumption that the exact number of potential outliers is known to the assumption that only an upper bound is known (and we can always pick this upper bound a little high if we do not have a good handle on it), we are more likely to avoid distortions caused by masking or swamping.
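A small made-up data set shows how masking arises. In the Python sketch below (illustrative only, with a hypothetical helper `esd_statistic`), two nearby extreme values inflate the sample standard deviation, so the statistic for the most extreme point is smaller while both are present than the statistic obtained after one of them has been removed.

```python
import statistics


def esd_statistic(values):
    """max |x_i - mean| / s for the given values."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return max(abs(x - mean) for x in values) / sd


# Ten well-behaved points plus two suspicious values (invented data).
base = [-2.0, -1.0, -1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 2.0]
with_two = base + [10.0, 10.5]
with_one = base + [10.0]

r_first = esd_statistic(with_two)   # both extreme values still present
r_second = esd_statistic(with_one)  # after 10.5 has been removed
print(round(r_first, 3), round(r_second, 3))
```

The first statistic is about 2.13 and the second about 2.83, so a test that looks only at the first step sees weaker evidence than the data actually contain.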

Note:
Tests for outliers are dependent on knowing the distribution of the data. The generalized ESD test assumes that the data come from an approximately normal distribution. For this reason, it is strongly recommended that the extreme studentized deviate test be complemented with a normal probability plot. If the data are not approximately normally distributed, then the generalized ESD test may be detecting the non-normality of the data rather than the presence of outliers.
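For readers who want to see what a normal probability plot consists of, the following Python sketch computes its coordinates: sorted data against normal percent points at a set of plotting positions. The positions use Filliben's approximation to the uniform order statistic medians. That Dataplot uses exactly these positions is an assumption here, and the function name is hypothetical.

```python
from statistics import NormalDist


def normal_probability_points(data):
    """Return (theoretical percent point, sorted value) pairs."""
    n = len(data)
    ordered = sorted(data)
    # Filliben's approximation to the uniform order statistic medians.
    medians = [(i - 0.3175) / (n + 0.365) for i in range(1, n + 1)]
    medians[-1] = 0.5 ** (1 / n)
    medians[0] = 1 - medians[-1]
    normal = NormalDist()
    return [(normal.inv_cdf(m), x) for m, x in zip(medians, ordered)]


points = normal_probability_points([3.0, 1.0, 2.0, 5.0, 4.0])
for theoretical, observed in points:
    print(f"{theoretical:8.4f} {observed:6.2f}")
```

If the data are approximately normal, the points fall close to a straight line. Isolated points that depart from the line at either end are candidates for the outlier test.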
Note:
You can specify the number of digits in the generalized ESD output with the command

SET WRITE DECIMALS <value>
Note:
The EXTREME STUDENTIZED DEVIATE TEST command automatically saves the following parameters:

STATVAL   = the value of the test statistic
PVAL      = the p-value of the test statistic
CUTOFF0   = the 0 percent point of the reference distribution
CUTOFF50  = the 50 percent point of the reference distribution
CUTOFF75  = the 75 percent point of the reference distribution
CUTOFF90  = the 90 percent point of the reference distribution
CUTOFF95  = the 95 percent point of the reference distribution
CUTOFF975 = the 97.5 percent point of the reference distribution
CUTOFF99  = the 99 percent point of the reference distribution

If the MULTIPLE or REPLICATED option is used, these values will be written to the file "dpst1f.dat" instead.

Note:
In addition to the EXTREME STUDENTIZED DEVIATE TEST command, the following command can also be used:

LET A = EXTREME STUDENTIZED DEVIATE Y

In addition to the above LET command, built-in statistics are supported for 20+ different commands (enter HELP STATISTICS for details).

Default:
None
Synonyms:
ESD is a synonym for EXTREME STUDENTIZED DEVIATE
MULTIPLE ESD is a synonym for ESD MULTIPLE
REPLICATION ESD is a synonym for ESD REPLICATION
Related Commands:
TIETJEN-MOORE                = Perform the Tietjen-Moore outlier test.
GRUBB TEST                   = Perform a Grubbs outlier test.
DIXON TEST                   = Perform a Dixon outlier test.
ANDERSON DARLING TEST        = Perform an Anderson Darling normality test.
WILKS SHAPIRO NORMALITY TEST = Perform a Wilks Shapiro normality test.
HISTOGRAM                    = Generate a histogram.
PROBABILITY PLOT             = Generate a probability plot.
BOX PLOT                     = Generate a box plot.
References:
Rosner, Bernard (May 1983), "Percentage Points for a Generalized ESD Many-Outlier Procedure," Technometrics, Vol. 25, No. 2, pp. 165-172.

Iglewicz and Hoaglin (1993), "Volume 16: How to Detect and Handle Outliers," The ASQC Basic Reference in Quality Control: Statistical Techniques, Edward F. Mykytka, Ph.D., Editor.

Applications:
Outlier Detection
Implementation Date:
2009/11
2011/08: Fixed bug where the table for "Conclusions (2-Tailed Test)" was printing the critical values in an inverted order
Program:

.  Step 1: Data from Rosner paper
.
serial read y
-0.25 0.68 0.94 1.15 1.20 1.26 1.26 1.34 1.38 1.43 1.49 1.49 1.55 1.56
1.58 1.65 1.69 1.70 1.76 1.77 1.81 1.91 1.94 1.96 1.99 2.06 2.09 2.10
2.14 2.15 2.23 2.24 2.26 2.35 2.37 2.40 2.47 2.54 2.62 2.64 2.90 2.92
2.92 2.93 3.21 3.26 3.30 3.59 3.68 4.30 4.64 5.34 5.42 6.01
end of data
.
.  Step 2: Generate a normal probability plot
.
title case asis
title offset 2
label case asis
title Normal Probability Plot
y1label Sorted Data
x1label Theoretical Percent Points
char circle
char fill on
char hw 1.2 0.8
line blank
normal prob plot y
.
.  Step 3: Perform the generalized ESD outlier test
.
set write decimals 5
let noutlier = 10
extreme studentized deviate test y

The following output is generated.

            Generalized Extreme Studentized Deviate Test for
Multiple Outliers (Assumption: Normality)

Response Variable: Y

Summary Statistics:
Number of Observations:                              54
Sample Minimum:                                -0.25000
Sample Maximum:                                 6.00999
Sample Mean:                                    2.32074
Sample SD:                                      1.18286

H0: There are no outliers
Ha: There is exactly     1 outlier
Potential Outlier Value Tested at This Step:              6.00999

Extreme Studentized Deviate Test Statistic Value:         3.11890

Percent Points of the Reference Distribution
-----------------------------------
Percent Point               Value
-----------------------------------
0.0    =          0.000
50.0    =          2.532
75.0    =          2.738
90.0    =          2.987
95.0    =          3.158
97.5    =          3.318
99.0    =          3.516

Conclusions (2-Tailed Test)
----------------------------------------------
Alpha    CDF   Critical Value     Conclusion
----------------------------------------------
10%    90%            2.987      Reject H0
5%    95%            3.158      Accept H0
2.5%  97.5%            3.318      Accept H0
1%    99%            3.516      Accept H0

H0: There are no outliers
Ha: There are exactly     2 outliers
Potential Outlier Value Tested at This Step:              5.41999

Extreme Studentized Deviate Test Statistic Value:         2.94297

Percent Points of the Reference Distribution
-----------------------------------
Percent Point               Value
-----------------------------------
0.0    =          0.000
50.0    =          2.524
75.0    =          2.730
90.0    =          2.980
95.0    =          3.150
97.5    =          3.311
99.0    =          3.508

Conclusions (2-Tailed Test)
----------------------------------------------
Alpha    CDF   Critical Value     Conclusion
----------------------------------------------
10%    90%            2.980      Accept H0
5%    95%            3.150      Accept H0
2.5%  97.5%            3.311      Accept H0
1%    99%            3.508      Accept H0

H0: There are no outliers
Ha: There are exactly     3 outliers
Potential Outlier Value Tested at This Step:              5.33999

Extreme Studentized Deviate Test Statistic Value:         3.17942

Percent Points of the Reference Distribution
-----------------------------------
Percent Point               Value
-----------------------------------
0.0    =          0.000
50.0    =          2.516
75.0    =          2.724
90.0    =          2.972
95.0    =          3.144
97.5    =          3.303
99.0    =          3.500

Conclusions (2-Tailed Test)
----------------------------------------------
Alpha    CDF   Critical Value     Conclusion
----------------------------------------------
10%    90%            2.972      Reject H0
5%    95%            3.144      Reject H0
2.5%  97.5%            3.303      Accept H0
1%    99%            3.500      Accept H0

H0: There are no outliers
Ha: There are exactly     4 outliers
Potential Outlier Value Tested at This Step:              4.63999

Extreme Studentized Deviate Test Statistic Value:         2.81018

Percent Points of the Reference Distribution
-----------------------------------
Percent Point               Value
-----------------------------------
0.0    =          0.000
50.0    =          2.509
75.0    =          2.717
90.0    =          2.964
95.0    =          3.136
97.5    =          3.295
99.0    =          3.491

Conclusions (2-Tailed Test)
----------------------------------------------
Alpha    CDF   Critical Value     Conclusion
----------------------------------------------
10%    90%            2.964      Accept H0
5%    95%            3.136      Accept H0
2.5%  97.5%            3.295      Accept H0
1%    99%            3.491      Accept H0

H0: There are no outliers
Ha: There are exactly     5 outliers
Potential Outlier Value Tested at This Step:             -0.25000

Extreme Studentized Deviate Test Statistic Value:         2.81557

Percent Points of the Reference Distribution
-----------------------------------
Percent Point               Value
-----------------------------------
0.0    =          0.000
50.0    =          2.501
75.0    =          2.709
90.0    =          2.956
95.0    =          3.128
97.5    =          3.287
99.0    =          3.482

Conclusions (2-Tailed Test)
----------------------------------------------
Alpha    CDF   Critical Value     Conclusion
----------------------------------------------
10%    90%            2.956      Accept H0
5%    95%            3.128      Accept H0
2.5%  97.5%            3.287      Accept H0
1%    99%            3.482      Accept H0

H0: There are no outliers
Ha: There are exactly     6 outliers
Potential Outlier Value Tested at This Step:              4.29999

Extreme Studentized Deviate Test Statistic Value:         2.84817

Percent Points of the Reference Distribution
-----------------------------------
Percent Point               Value
-----------------------------------
0.0    =          0.000
50.0    =          2.494
75.0    =          2.701
90.0    =          2.948
95.0    =          3.120
97.5    =          3.278
99.0    =          3.474

Conclusions (2-Tailed Test)
----------------------------------------------
Alpha    CDF   Critical Value     Conclusion
----------------------------------------------
10%    90%            2.948      Accept H0
5%    95%            3.120      Accept H0
2.5%  97.5%            3.278      Accept H0
1%    99%            3.474      Accept H0

H0: There are no outliers
Ha: There are exactly     7 outliers
Potential Outlier Value Tested at This Step:              3.67999

Extreme Studentized Deviate Test Statistic Value:         2.27932

Percent Points of the Reference Distribution
-----------------------------------
Percent Point               Value
-----------------------------------
0.0    =          0.000
50.0    =          2.486
75.0    =          2.693
90.0    =          2.940
95.0    =          3.112
97.5    =          3.270
99.0    =          3.463

Conclusions (2-Tailed Test)
----------------------------------------------
Alpha    CDF   Critical Value     Conclusion
----------------------------------------------
10%    90%            2.940      Accept H0
5%    95%            3.112      Accept H0
2.5%  97.5%            3.270      Accept H0
1%    99%            3.463      Accept H0

H0: There are no outliers
Ha: There are exactly     8 outliers
Potential Outlier Value Tested at This Step:              3.58999

Extreme Studentized Deviate Test Statistic Value:         2.31036

Percent Points of the Reference Distribution
-----------------------------------
Percent Point               Value
-----------------------------------
0.0    =          0.000
50.0    =          2.478
75.0    =          2.685
90.0    =          2.932
95.0    =          3.103
97.5    =          3.262
99.0    =          3.455

Conclusions (2-Tailed Test)
----------------------------------------------
Alpha    CDF   Critical Value     Conclusion
----------------------------------------------
10%    90%            2.932      Accept H0
5%    95%            3.103      Accept H0
2.5%  97.5%            3.262      Accept H0
1%    99%            3.455      Accept H0

H0: There are no outliers
Ha: There are exactly     9 outliers
Potential Outlier Value Tested at This Step:              0.68000

Extreme Studentized Deviate Test Statistic Value:         2.10158

Percent Points of the Reference Distribution
-----------------------------------
Percent Point               Value
-----------------------------------
0.0    =          0.000
50.0    =          2.468
75.0    =          2.677
90.0    =          2.923
95.0    =          3.093
97.5    =          3.253
99.0    =          3.444

Conclusions (2-Tailed Test)
----------------------------------------------
Alpha    CDF   Critical Value     Conclusion
----------------------------------------------
10%    90%            2.923      Accept H0
5%    95%            3.093      Accept H0
2.5%  97.5%            3.253      Accept H0
1%    99%            3.444      Accept H0

H0: There are no outliers
Ha: There are exactly    10 outliers
Potential Outlier Value Tested at This Step:              3.29999

Extreme Studentized Deviate Test Statistic Value:         2.06717

Percent Points of the Reference Distribution
-----------------------------------
Percent Point               Value
-----------------------------------
0.0    =          0.000
50.0    =          2.460
75.0    =          2.668
90.0    =          2.915
95.0    =          3.084
97.5    =          3.242
99.0    =          3.435

Conclusions (2-Tailed Test)
----------------------------------------------
Alpha    CDF   Critical Value     Conclusion
----------------------------------------------
10%    90%            2.915      Accept H0
5%    95%            3.084      Accept H0
2.5%  97.5%            3.242      Accept H0
1%    99%            3.435      Accept H0

Summary Table
----------------------------------------------------------------------
Exact           Test       Critical       Critical       Critical
Number of      Statistic          Value          Value          Value
Outliers          Value            10%             5%             1%
----------------------------------------------------------------------
1        3.11890        2.98680        3.15879        3.51571
2        2.94297        2.97960        3.15142        3.50772
3        3.17942        2.97224        3.14388        3.49952
4        2.81018        2.96469        3.13616        3.49110
5        2.81557        2.95697        3.12824        3.48246
6        2.84817        2.94906        3.12012        3.47358
7        2.27932        2.94094        3.11179        3.46445
8        2.31036        2.93262        3.10324        3.45506
9        2.10158        2.92408        3.09445        3.44539
10        2.06717        2.91530        3.08542        3.43543



Date created: 09/09/2010
Last updated: 11/03/2015