7. Product and Process Comparisons
7.2. Comparisons based on data from one process
7.2.6. What intervals contain a fixed percentage of the population values?
|
|||
Definitions of order statistics and ranks | For a series of measurements \(Y_1, \, \ldots, \, Y_N\), denote the data ordered in increasing order of magnitude by \(Y_{[1]}, \, \ldots, \, Y_{[N]}\). These ordered data are called order statistics. If \(Y_{[j]}\) is the order statistic that corresponds to the measurement \(Y_i\), then the rank for \(Y_i\) is \(j\); i.e., $$ Y_{[j]} \sim Y_i \,\, \Longrightarrow \,\, r_i = j \, .$$ | ||
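As a minimal sketch of these definitions, the following Python computes the order statistics and ranks for the twelve resistivity measurements used in the example later in this section. The variable names are illustrative, and the rank lookup assumes there are no tied values.

```python
# Resistivity measurements (ohm.cm) from the example later in this section.
y = [95.1772, 95.1567, 95.1937, 95.1959, 95.1442, 95.0610,
     95.1591, 95.1195, 95.1065, 95.0925, 95.1990, 95.1682]

# Order statistics: the data sorted in increasing order, Y[1] <= ... <= Y[N].
order_stats = sorted(y)

# Rank r_i of each measurement Y_i: its 1-based position j among the
# order statistics. This simple lookup assumes there are no tied values.
position = {value: j for j, value in enumerate(order_stats, start=1)}
ranks = [position[value] for value in y]

print(order_stats)
print(ranks)
```

The printed ranks can be checked against the "Ranks" column of the example table.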
Definition of percentiles |
Order statistics provide a way of estimating proportions of the
data that should fall above and below a given value, called a
percentile. The \(p\)th
percentile is a value, \(Y_{(p)}\),
such that at most \((100 p)\) %
of the measurements are less than this value and at most \(100(1-p)\) %
are greater. The 50th percentile is called the median.
Percentiles split a set of ordered data into hundredths. (Deciles split ordered data into tenths.) For example, 70 % of the data should fall below the 70th percentile. Given \(N\) points, the proportion corresponding to the \(i\)th ordered point, \(Y_{[i]}\), is \(p = i/(N+1)\) under the estimation rule described below.
|
||
Estimation of percentiles |
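Percentiles can be estimated from \(N\) measurements as follows: for the \(p\)th percentile, set \(p(N+1)\) equal to \(k + d\), where \(k\) is an integer and \(d\) is a fraction greater than or equal to 0 and less than 1.
1. For \(0 < k < N\), \( Y_{(p)} = Y_{[k]} + d \left( Y_{[k+1]} - Y_{[k]} \right) \)
2. For \(k = 0\), \( Y_{(p)} = Y_{[1]} \)
3. For \(k \ge N\), \( Y_{(p)} = Y_{[N]} \)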
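The rule can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `percentile_r6` is hypothetical, and the sketch assumes that the estimate interpolates linearly between adjacent order statistics and falls back to the smallest or largest observation when \(k\) lies outside the range 1 to \(N-1\).

```python
import math


def percentile_r6(data, p):
    """Estimate the p-th percentile (p a proportion, 0 < p < 1)."""
    ordered = sorted(data)            # order statistics Y[1], ..., Y[N]
    n = len(ordered)
    position = p * (n + 1)            # p(N+1) = k + d
    k = math.floor(position)          # integer part
    d = position - k                  # fractional part, 0 <= d < 1
    if k <= 0:                        # below the first order statistic
        return ordered[0]
    if k >= n:                        # at or beyond the last order statistic
        return ordered[-1]
    # Linear interpolation between Y[k] and Y[k+1] (0-based indexing below).
    return ordered[k - 1] + d * (ordered[k] - ordered[k - 1])


# Illustrative values only (not taken from the handbook).
sample = [10, 20, 30, 40]
print(percentile_r6(sample, 0.5))    # k = 2, d = 0.5
print(percentile_r6(sample, 0.1))    # k = 0, so the minimum is returned
print(percentile_r6(sample, 0.9))    # k = 4 = N, so the maximum is returned
```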
|
||
Example and interpretation |
For the purpose of illustration, twelve measurements from a gage study are shown below. The measurements are resistivities of silicon wafers measured in ohm.cm.
  i   Measurements   Order stats   Ranks
  1     95.1772        95.0610        9
  2     95.1567        95.0925        6
  3     95.1937        95.1065       10
  4     95.1959        95.1195       11
  5     95.1442        95.1442        5
  6     95.0610        95.1567        1
  7     95.1591        95.1591        7
  8     95.1195        95.1682        4
  9     95.1065        95.1772        3
 10     95.0925        95.1937        2
 11     95.1990        95.1959       12
 12     95.1682        95.1990        8
To find the 90th percentile, \(p(N+1)\) = 0.9(13) = 11.7; \(k\) = 11, and \(d\) = 0.7. From condition (1) above, \(Y_{(0.90)}\) is estimated to be 95.1981 ohm.cm. This percentile, although it is an estimate from a small sample of resistivity measurements, gives an indication of the percentile for a population of resistivity measurements. |
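The arithmetic behind this estimate can be written out as follows, assuming that condition (1) is linear interpolation between the 11th and 12th order statistics (an assumption that reproduces the quoted value). With \(p = 0.9\) and \(N = 12\):
$$ p(N+1) = 0.9 \times 13 = 11.7 \,\, \Longrightarrow \,\, k = 11, \,\, d = 0.7 $$
$$ Y_{(p)} \approx Y_{[11]} + 0.7 \left( Y_{[12]} - Y_{[11]} \right) = 95.1959 + 0.7 \, (95.1990 - 95.1959) = 95.1959 + 0.00217 \approx 95.1981 $$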
||
Note that there are other ways of calculating percentiles in common use |
Hyndman and Fan (1996) in an
American Statistician article evaluated nine different methods (we
will refer to these as R1 through R9) for computing percentiles relative
to six desirable properties. Their goal was to advocate a "standard"
definition for percentiles that would be implemented in statistical
software. Although this has not in fact happened, the article does
provide a useful summary and evaluation of various methods for
computing percentiles. Most statistical and spreadsheet software
uses one of the methods described in Hyndman and Fan.
The method described above corresponds to method R6 of Hyndman and Fan. This is the default method used by Dataplot.
The method advocated by Hyndman and Fan is R8. For the R8 method, set \( p(N+(1/3)) + (1/3) \) equal to \(k + d\) and proceed as above. Note that any \(p \le (2/3)/(N+(1/3))\) will be set to the minimum value and any \(p \ge (N-(1/3))/(N+(1/3))\) will be set to the maximum value. Both R and Dataplot can optionally use this method. For the example given above, R8 gives 95.1972 (compared to 95.1981 for R6) for the 90th percentile.
Some software packages set \(1 + p(N-1)\) equal to \(k + d\) and then proceed as above. This is method R7 of Hyndman and Fan. It is the method used by Excel and is the default method for R (the R quantile function can optionally use any of the nine methods discussed in Hyndman and Fan). For the example given above, R7 gives 95.1957.
The R6, R7, and R8 methods give similar, but not identical, results; the differences are most noticeable for small samples. For most purposes, any of these three methods should be acceptable.
Another method of calculating percentiles (given in some elementary textbooks) starts by calculating \(p N\). If that is not an integer, round up to the next higher integer \(k\) and use \(Y_{[k]}\) as the percentile estimate. If \(p N\) is an integer \(k\), use \( 0.5 \left( Y_{[k]} + Y_{[k+1]} \right) \). One of R6, R7, or R8 would typically be preferred to this method. |
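To make the comparison concrete, here is a minimal Python sketch that applies the R6, R7, R8, and elementary-textbook rules to the twelve resistivity measurements. The function names `percentile` and `percentile_textbook` are hypothetical and are not the interfaces of Dataplot, R, or Excel. The sketch assumes \(0 < p < 1\), linear interpolation between adjacent order statistics, and clamping to the minimum or maximum when \(k\) falls outside the range 1 to \(N-1\).

```python
import math

# Resistivity measurements (ohm.cm) from the example in this section.
resistivity = [95.1772, 95.1567, 95.1937, 95.1959, 95.1442, 95.0610,
               95.1591, 95.1195, 95.1065, 95.0925, 95.1990, 95.1682]


def percentile(data, p, method="R6"):
    """Interpolated percentile estimate; p is a proportion, 0 < p < 1."""
    ordered = sorted(data)
    n = len(ordered)
    if method == "R6":
        position = p * (n + 1)
    elif method == "R7":
        position = 1 + p * (n - 1)
    elif method == "R8":
        position = p * (n + 1 / 3) + 1 / 3
    else:
        raise ValueError("method must be 'R6', 'R7', or 'R8'")
    k = math.floor(position)          # integer part of k + d
    d = position - k                  # fractional part of k + d
    if k <= 0:
        return ordered[0]             # clamp to the minimum
    if k >= n:
        return ordered[-1]            # clamp to the maximum
    return ordered[k - 1] + d * (ordered[k] - ordered[k - 1])


def percentile_textbook(data, p):
    """Elementary-textbook rule based on pN (no interpolation)."""
    ordered = sorted(data)
    position = p * len(ordered)
    if math.isclose(position, round(position)):
        k = round(position)           # pN is an integer k
        return 0.5 * (ordered[k - 1] + ordered[k])
    k = math.ceil(position)           # otherwise round up to k
    return ordered[k - 1]


for name in ("R6", "R7", "R8"):
    print(name, round(percentile(resistivity, 0.9, name), 4))
print("textbook", percentile_textbook(resistivity, 0.9))
```

For \(p = 0.9\) the three interpolating methods should reproduce the values quoted above (95.1981, 95.1957, and 95.1972), while the textbook rule simply returns the 11th order statistic because \(pN = 10.8\) rounds up to 11.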
||
Definition of Tolerance Interval | An interval covering population percentiles can be interpreted as "covering a proportion \(p\) of the population with a level of confidence, say, 90 %." This is known as a tolerance interval. |