7. Product and Process Comparisons
7.1. Introduction
|
|||
Definition of outliers | An outlier is an observation that lies an abnormal distance from other values in a random sample from a population. In a sense, this definition leaves it up to the analyst (or a consensus process) to decide what will be considered abnormal. Before abnormal observations can be singled out, it is necessary to characterize normal observations. | ||
Ways to describe data |
Two activities are essential for characterizing a set of data:
|
||
Box plot construction | The box plot is a useful graphical display for describing the behavior of the data in the middle as well as at the ends of the distribution. The box plot uses the median and the lower and upper quartiles (defined as the 25th and 75th percentiles). If the lower quartile is Q1 and the upper quartile is Q3, then the difference (Q3 - Q1) is called the interquartile range, or IQ. | ||
Box plots with fences |
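A box plot is constructed by drawing a box between the upper and
lower quartiles, with a solid line drawn across the box to locate
the median. The following quantities (called fences) are
needed for identifying extreme values in the tails of the
distribution:
1. lower inner fence: Q1 - 1.5*IQ
2. upper inner fence: Q3 + 1.5*IQ
3. lower outer fence: Q1 - 3*IQ
4. upper outer fence: Q3 + 3*IQ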
|
||
Outlier detection criteria | A point beyond an inner fence on either side is considered a mild outlier. A point beyond an outer fence is considered an extreme outlier. | ||
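The criteria can be expressed as a small function. This is a minimal Python sketch, assuming the conventional fence multipliers of 1.5*IQ (inner) and 3*IQ (outer); the function name `classify` and its parameters are hypothetical and chosen only for illustration.

```python
def classify(value, q1, q3, inner=1.5, outer=3.0):
    """Label a value as 'extreme', 'mild', or 'none' relative to the fences.

    q1 and q3 are the lower and upper quartiles. The multipliers 1.5 and
    3.0 are the conventional choices (an assumption of this sketch).
    """
    iq = q3 - q1  # interquartile range
    # Check the outer fences first, because a point beyond an outer
    # fence is necessarily beyond the inner fence on the same side.
    if value < q1 - outer * iq or value > q3 + outer * iq:
        return "extreme"
    if value < q1 - inner * iq or value > q3 + inner * iq:
        return "mild"
    return "none"
```

For instance, with hypothetical quartiles Q1 = 10 and Q3 = 20 (so IQ = 10), the inner fences fall at -5 and 35 and the outer fences at -20 and 50; a value of 40 is then a mild outlier and a value of 60 is an extreme outlier.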
Example of an outlier box plot |
The data set of N = 90 ordered observations as shown below
is examined for outliers:
30, 171, 184, 201, 212, 250, 265, 270, 272, 289, 305, 306, 322, 322, 336, 346, 351, 370, 390, 404, 409, 411, 436, 437, 439, 441, 444, 448, 451, 453, 470, 480, 482, 487, 494, 495, 499, 503, 514, 521, 522, 527, 548, 550, 559, 560, 570, 572, 574, 578, 585, 592, 592, 607, 616, 618, 621, 629, 637, 638, 640, 656, 668, 707, 709, 719, 737, 739, 752, 758, 766, 792, 792, 794, 802, 818, 830, 832, 843, 858, 860, 869, 918, 925, 953, 991, 1000, 1005, 1068, 1441
The above data is available as a text file.
The computations are as follows:
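A minimal Python sketch of these computations follows. It assumes that the quartiles are located at the 0.25(N+1)th and 0.75(N+1)th ordered points with linear interpolation between neighbouring points, and that the fences use the conventional multipliers 1.5 and 3. Both are assumptions of this sketch, and other percentile conventions give slightly different quartiles. The helper name `ordered_point` is hypothetical.

```python
# The 90 ordered observations from the example.
data = [
    30, 171, 184, 201, 212, 250, 265, 270, 272, 289, 305, 306, 322, 322,
    336, 346, 351, 370, 390, 404, 409, 411, 436, 437, 439, 441, 444, 448,
    451, 453, 470, 480, 482, 487, 494, 495, 499, 503, 514, 521, 522, 527,
    548, 550, 559, 560, 570, 572, 574, 578, 585, 592, 592, 607, 616, 618,
    621, 629, 637, 638, 640, 656, 668, 707, 709, 719, 737, 739, 752, 758,
    766, 792, 792, 794, 802, 818, 830, 832, 843, 858, 860, 869, 918, 925,
    953, 991, 1000, 1005, 1068, 1441,
]


def ordered_point(sorted_values, position):
    """Return the value at a 1-based, possibly fractional, rank.

    Fractional ranks are resolved by linear interpolation between the two
    neighbouring ordered points (an assumed convention). The rank must lie
    between 1 and len(sorted_values).
    """
    whole = int(position)
    frac = position - whole
    value = float(sorted_values[whole - 1])
    if frac > 0:
        value += frac * (sorted_values[whole] - sorted_values[whole - 1])
    return value


values = sorted(data)
n = len(values)

median = ordered_point(values, 0.50 * (n + 1))
q1 = ordered_point(values, 0.25 * (n + 1))
q3 = ordered_point(values, 0.75 * (n + 1))
iq = q3 - q1  # interquartile range

fences = {
    "lower inner": q1 - 1.5 * iq,
    "upper inner": q3 + 1.5 * iq,
    "lower outer": q1 - 3.0 * iq,
    "upper outer": q3 + 3.0 * iq,
}

# Mild outliers lie beyond an inner fence but not beyond an outer fence.
mild = [
    x for x in values
    if fences["lower outer"] <= x < fences["lower inner"]
    or fences["upper inner"] < x <= fences["upper outer"]
]
# Extreme outliers lie beyond an outer fence.
extreme = [
    x for x in values
    if x < fences["lower outer"] or x > fences["upper outer"]
]

print(f"median = {median}, Q1 = {q1}, Q3 = {q3}, IQ = {iq}")
for name, value in fences.items():
    print(f"{name} fence = {value}")
print("mild outliers:", mild)
print("extreme outliers:", extreme)
```

Under these assumptions the sketch gives a median of 559.5, Q1 = 429.75, Q3 = 742.25 and IQ = 312.5, with inner fences at -39.0 and 1211.0 and outer fences at -507.75 and 1679.75.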
From an examination of the fence points and the data, one point (1441) exceeds the upper inner fence and stands out as a mild outlier; there are no extreme outliers. |
||
Histogram with box plot |
A histogram with an overlaid box plot is shown below.
|
||
Outliers may contain important information | Outliers should be investigated carefully. Often they contain valuable information about the process under investigation or the data gathering and recording process. Before considering the possible elimination of these points from the data, one should try to understand why they appeared and whether it is likely similar values will continue to appear. Of course, outliers are often bad data points. |