4. Process Modeling
4.4. Data Analysis for Process Modeling
4.4.4. How can I tell if a model fits my data?
|
\(R^2\) Is Not Enough!
|
Model validation is possibly the most important step in the model building
sequence. It is also one of the most overlooked. Often the validation of
a model seems to consist of nothing more than quoting the \(R^2\)
statistic from the fit (which measures the fraction
of the total variability in the response that is accounted for by the model).
Unfortunately, a high \(R^2\)
value does not guarantee
that the model fits the data well. A model that does not fit the
data well cannot provide good answers to the underlying engineering or
scientific questions under investigation.
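As a minimal illustration of this point, the Python sketch below fits a straight line by ordinary least squares to ten made-up points that lie exactly on the curve \(y = x^2\). The data are illustrative only and are not taken from this handbook, and `fit_line` and `r_squared` are hypothetical helper names introduced for the sketch.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = b0 + b1*x; returns (b0, b1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def r_squared(ys, fitted):
    """Fraction of the total variability in ys accounted for by the fit."""
    y_bar = sum(ys) / len(ys)
    ss_total = sum((y - y_bar) ** 2 for y in ys)
    ss_resid = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    return 1.0 - ss_resid / ss_total


# Illustrative (made-up) data: the true relationship is curved, y = x**2.
xs = list(range(1, 11))
ys = [x ** 2 for x in xs]

b0, b1 = fit_line(xs, ys)
fitted = [b0 + b1 * x for x in xs]
residuals = [y - f for y, f in zip(ys, fitted)]
signs = "".join("+" if e > 0 else "-" for e in residuals)

print(f"R^2 = {r_squared(ys, fitted):.4f}")
print("residual signs:", signs)
```

For these data \(R^2\) is about 0.95, yet the residuals are positive at both ends of the range and negative in the middle. A plot of the residuals against \(x\) would expose this curvature at once, while the single number \(R^2\) hides it.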
|
Main Tool: Graphical Residual Analysis
|
There are many statistical tools for model validation, but the primary tool
for most process modeling applications is graphical residual analysis.
Different types of plots of the residuals (see definition
below) from a fitted
model provide information on the adequacy of different aspects of the model.
Numerical methods for model validation, such as the \(R^2\)
statistic, are also useful, but usually to a lesser degree than graphical
methods. Graphical methods have an advantage over numerical methods for
model validation because they readily illustrate a broad range of complex
aspects of the relationship between the model and the data. Numerical methods
for model validation tend to be narrowly focused on a particular aspect of the
relationship between the model and the data and often try to compress that
information into a single descriptive number or test result.
|
Numerical Methods' Forte
|
Numerical methods do play an important role as confirmatory methods for
graphical techniques, however. For example, the
lack-of-fit test
for assessing the correctness of the functional part of the model can aid in
interpreting a borderline residual plot. There are also a few
modeling situations in which graphical methods cannot easily be used. In
these cases, numerical methods provide a fallback position for model
validation. One common situation in which numerical validation methods take
precedence over graphical methods is when the number of parameters being
estimated is relatively close to the size of the data set. In this situation
residual plots are often difficult to interpret due to constraints on the
residuals imposed by the estimation of the unknown parameters. One area
in which this typically happens is in optimization applications using designed
experiments. Logistic regression with binary data is another area in which
graphical residual analysis can be difficult.
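The constraints that parameter estimation places on the residuals can be seen in a very small case. For a straight line with an intercept fit by ordinary least squares, the residuals always satisfy \(\sum e_i = 0\) and \(\sum x_i e_i = 0\), one constraint for each estimated parameter. The Python sketch below uses made-up points and a hypothetical helper, `line_residuals`, to show that with three observations only one degree of freedom remains for the residuals, and with two observations none remains.

```python
def line_residuals(xs, ys):
    """Residuals from an ordinary least-squares straight-line fit."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]


# Made-up data: 3 observations, 2 estimated parameters.
xs3, ys3 = [1.0, 2.0, 3.0], [2.0, 1.0, 4.0]
e3 = line_residuals(xs3, ys3)
sum_e = sum(e3)
sum_xe = sum(x * e for x, e in zip(xs3, e3))
print("residuals (n=3):", e3)
print("sum of residuals and sum of x*residual:", sum_e, sum_xe)

# Made-up data: 2 observations, 2 estimated parameters.
# The fitted line passes through both points, so every residual is zero.
xs2, ys2 = [1.0, 3.0], [2.0, 5.0]
e2 = line_residuals(xs2, ys2)
print("residuals (n=2):", e2)
```

When the residuals are this tightly constrained, their pattern reflects the fitting procedure as much as the data, which is why a residual plot carries little information in such cases.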
|
Residuals
|
The residuals from a fitted model are the differences between the responses
observed at each combination of values of the explanatory variables and the corresponding
prediction of the response computed using the regression function.
Mathematically, the definition of the residual for the \(i\)th
observation in the data set is written
$$ e_i = y_i - f(\vec{x}_i;\hat{\vec{\beta}}) $$
with \(y_i\) denoting the \(i\)th response in the data set and \(\vec{x}_i\)
denoting the list of explanatory variables, each set at the corresponding
values found in the \(i\)th observation in the data set.
|
Example
|
The data listed below are from the
Pressure/Temperature example introduced
in Section 4.1.1. The first column
shows the order in which the observations were made, the second column indicates the
day on which each observation was made, and the third column gives the ambient
temperature recorded when each measurement was made. The fourth column lists the
temperature of the gas itself (the explanatory variable) and the fifth column
contains the observed pressure of the gas (the response variable). Finally, the sixth
column gives the corresponding values from the fitted straight-line regression function,
$$ \hat{P} = 7.749695 + 3.930123T $$
and the last
column lists the residuals, the differences between columns five and six.
(The reader can download the pressure/temperature data as a
text file.)
|
Data, Fitted Values & Residuals
|
Run Order  Day  Ambient Temperature  Temperature  Pressure  Fitted Value  Residual
1 1 23.820 54.749 225.066 222.920 2.146
2 1 24.120 23.323 100.331 99.411 0.920
3 1 23.434 58.775 230.863 238.744 -7.881
4 1 23.993 25.854 106.160 109.359 -3.199
5 1 23.375 68.297 277.502 276.165 1.336
6 1 23.233 37.481 148.314 155.056 -6.741
7 1 24.162 49.542 197.562 202.456 -4.895
8 1 23.667 34.101 138.537 141.770 -3.232
9 1 24.056 33.901 137.969 140.983 -3.014
10 1 22.786 29.242 117.410 122.674 -5.263
11 2 23.785 39.506 164.442 163.013 1.429
12 2 22.987 43.004 181.044 176.759 4.285
13 2 23.799 53.226 222.179 216.933 5.246
14 2 23.661 54.467 227.010 221.813 5.198
15 2 23.852 57.549 232.496 233.925 -1.429
16 2 23.379 61.204 253.557 248.288 5.269
17 2 24.146 31.489 139.894 131.506 8.388
18 2 24.187 68.476 273.931 276.871 -2.940
19 2 24.159 51.144 207.969 208.753 -0.784
20 2 23.803 68.774 280.205 278.040 2.165
21 3 24.381 55.350 227.060 225.282 1.779
22 3 24.027 44.692 180.605 183.396 -2.791
23 3 24.342 50.995 206.229 208.167 -1.938
24 3 23.670 21.602 91.464 92.649 -1.186
25 3 24.246 54.673 223.869 222.622 1.247
26 3 25.082 41.449 172.910 170.651 2.259
27 3 24.575 35.451 152.073 147.075 4.998
28 3 23.803 42.989 169.427 176.703 -7.276
29 3 24.660 48.599 192.561 198.748 -6.188
30 3 24.097 21.448 94.448 92.042 2.406
31 4 22.816 56.982 222.794 231.697 -8.902
32 4 24.167 47.901 199.003 196.008 2.996
33 4 22.712 40.285 168.668 166.077 2.592
34 4 23.611 25.609 109.387 108.397 0.990
35 4 23.354 22.971 98.445 98.029 0.416
36 4 23.669 25.838 110.987 109.295 1.692
37 4 23.965 49.127 202.662 200.826 1.835
38 4 22.917 54.936 224.773 223.653 1.120
39 4 23.546 50.917 216.058 207.859 8.199
40 4 24.450 41.976 171.469 172.720 -1.251
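
The last two columns of the table can be reproduced, up to rounding, from the fitted straight line quoted above. The following Python sketch does this for the first three rows. The function name `fitted_pressure` is a hypothetical helper, and because the printed coefficients and data are rounded, the computed values may differ from the tabulated ones in the third decimal place.

```python
# Coefficients of the fitted straight line quoted in the text.
INTERCEPT = 7.749695
SLOPE = 3.930123


def fitted_pressure(temperature):
    """Predicted pressure from the fitted straight-line regression function."""
    return INTERCEPT + SLOPE * temperature


# First three rows of the table:
# (temperature, pressure, tabulated fitted value, tabulated residual)
rows = [
    (54.749, 225.066, 222.920, 2.146),
    (23.323, 100.331, 99.411, 0.920),
    (58.775, 230.863, 238.744, -7.881),
]

results = []
for temp, pressure, table_fit, table_resid in rows:
    fit = fitted_pressure(temp)
    resid = pressure - fit  # residual = observed response - fitted value
    results.append((fit, resid))
    print(f"T={temp:7.3f}  fitted={fit:8.3f} (table {table_fit:8.3f})  "
          f"residual={resid:7.3f} (table {table_resid:7.3f})")
```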
|
Why Use Residuals?
|
If the model fit to the data were correct, the residuals would approximate the
random errors that make the relationship between the explanatory variables and
the response variable a statistical
relationship. Therefore, if the residuals appear to behave randomly, it
suggests that the model fits the data well. On the other hand, if non-random
structure is evident in the residuals, it is a clear sign that the model fits
the data poorly. The subsections listed below detail the types of plots to use
to test different aspects of a model and give guidance on the correct
interpretations of different results that could be observed for each type of
plot.
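One simple numerical companion to a plot of the residuals in run order is to count runs of consecutive residuals that share the same sign. The Python sketch below applies a Wald-Wolfowitz style runs count to the Residual column of the table above. This check is an illustrative choice made for this sketch, not a procedure prescribed in this section, and `count_runs` is a hypothetical helper name.

```python
import math

# Residuals in run order, copied from the Residual column of the table.
residuals = [
    2.146, 0.920, -7.881, -3.199, 1.336, -6.741, -4.895, -3.232, -3.014, -5.263,
    1.429, 4.285, 5.246, 5.198, -1.429, 5.269, 8.388, -2.940, -0.784, 2.165,
    1.779, -2.791, -1.938, -1.186, 1.247, 2.259, 4.998, -7.276, -6.188, 2.406,
    -8.902, 2.996, 2.592, 0.990, 0.416, 1.692, 1.835, 1.120, 8.199, -1.251,
]


def count_runs(values):
    """Return (number of sign runs, count of positives, count of negatives)."""
    signs = [v > 0 for v in values if v != 0]  # exact zeros are ignored
    runs = 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)
    n_pos = sum(signs)
    n_neg = len(signs) - n_pos
    return runs, n_pos, n_neg


runs, n_pos, n_neg = count_runs(residuals)
n = n_pos + n_neg

# Mean and variance of the number of runs if the signs were in random order.
expected = 2 * n_pos * n_neg / n + 1
variance = 2 * n_pos * n_neg * (2 * n_pos * n_neg - n) / (n ** 2 * (n - 1))
z = (runs - expected) / math.sqrt(variance)

print(f"runs = {runs}, positives = {n_pos}, negatives = {n_neg}")
print(f"expected runs = {expected:.2f}, z = {z:.2f}")
```

Far fewer runs than expected would hint at drift or dependence between successive errors, and far more would hint at alternation. A single statistic like this supports the residual plots but does not replace them.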
|
Model Validation Specifics
|
- How can I assess the sufficiency of the functional part of the model?
- How can I detect non-constant variation across the data?
- How can I tell if there was drift in the process?
- How can I assess whether the random errors are independent from one to the next?
- How can I test whether or not the random errors are distributed normally?
- How can I test whether any significant terms are missing or misspecified in the functional part of the model?
- How can I test whether all of the terms in the functional part of the model are necessary?
|