4.4.4. How can I tell if a model fits my data?

$R^2$ Is Not Enough!

Model validation is possibly the most important step in the model building sequence. It is also one of the most overlooked. Often the validation of a model seems to consist of nothing more than quoting the $R^2$ statistic from the fit (which measures the fraction of the total variability in the response that is accounted for by the model). Unfortunately, a high $R^2$ value does not guarantee that the model fits the data well. A model that does not fit the data well cannot provide good answers to the underlying engineering or scientific questions under investigation.
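To see how a high $R^2$ can coexist with a poor fit, consider the following minimal sketch. The data are made up for illustration (a response that is exactly quadratic, $y = x^2$), the helper name `fit_line` is hypothetical, and the straight line is fit by ordinary least squares.

```python
def fit_line(xs, ys):
    """Ordinary least-squares straight line; returns (intercept, slope)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Illustrative data (not from the handbook): a purely quadratic response.
xs = list(range(1, 11))
ys = [x ** 2 for x in xs]

intercept, slope = fit_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
y_bar = sum(ys) / len(ys)
ss_res = sum(e ** 2 for e in residuals)
ss_tot = sum((y - y_bar) ** 2 for y in ys)
r_squared = 1 - ss_res / ss_tot

print(f"R^2 = {r_squared:.3f}")
print("residuals:", [round(e, 1) for e in residuals])
```

For these data $R^2$ is about 0.95, yet the residuals are positive at both ends of the range and negative in the middle, a systematic curve that the straight line cannot capture.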

Main Tool: Graphical Residual Analysis

There are many statistical tools for model validation, but the primary tool for most process modeling applications is graphical residual analysis. Different types of plots of the residuals (see definition below) from a fitted model provide information on the adequacy of different aspects of the model. Numerical methods for model validation, such as the $R^2$ statistic, are also useful, but usually to a lesser degree than graphical methods. Graphical methods have an advantage over numerical methods for model validation because they readily illustrate a broad range of complex aspects of the relationship between the model and the data. Numerical methods for model validation tend to be narrowly focused on a particular aspect of the relationship between the model and the data and often try to compress that information into a single descriptive number or test result.

Numerical Methods' Forte

Numerical methods do, however, play an important role in confirming what graphical techniques suggest. For example, the lack-of-fit test for assessing the correctness of the functional part of the model can aid in interpreting a borderline residual plot. There are also a few modeling situations in which graphical methods cannot easily be used. In these cases, numerical methods provide a fallback position for model validation. One common situation in which numerical validation methods take precedence over graphical methods is when the number of parameters being estimated is relatively close to the size of the data set. In this situation residual plots are often difficult to interpret because estimating the unknown parameters imposes constraints on the residuals. This typically happens in optimization applications using designed experiments. Logistic regression with binary data is another area in which graphical residual analysis can be difficult.
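The constraints on the residuals can be made concrete. For a least-squares straight line with an intercept, the residuals always satisfy $\sum e_i = 0$ and $\sum x_i e_i = 0$, so two parameters estimated from three points leave only one free residual. The sketch below uses made-up data and a hypothetical helper, `line_residuals`, to show this.

```python
def line_residuals(xs, ys):
    """Residuals from an ordinary least-squares straight-line fit."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Illustrative data (not from the handbook): three points, two parameters.
xs = [1.0, 2.0, 3.0]
for ys in ([1.0, 3.0, 2.0], [5.0, 0.0, 4.0], [2.0, 2.5, 9.0]):
    e = line_residuals(xs, ys)
    sum_e = sum(e)
    sum_xe = sum(x * ei for x, ei in zip(xs, e))
    # Both constraints hold (up to floating-point error) for every data set.
    print([round(ei, 3) for ei in e], abs(sum_e) < 1e-9, abs(sum_xe) < 1e-9)
```

Whatever the responses, the three residuals come out proportional to $(1, -2, 1)$ for these equally spaced $x$ values, so a residual plot carries almost no independent information about the adequacy of the model.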

Residuals

The residuals from a fitted model are the differences between the responses observed at each combination of values of the explanatory variables and the corresponding predictions of the response computed using the regression function. Mathematically, the residual for the $i^{\text{th}}$ observation in the data set is written $$ e_i = y_i - f(\vec{x}_i;\hat{\vec{\beta}}) $$ with $y_i$ denoting the $i^{\text{th}}$ response in the data set and $\vec{x}_i$ denoting the list of explanatory variables, each set at the corresponding value found in the $i^{\text{th}}$ observation in the data set.
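The definition translates directly into code. The following is a minimal sketch in which the function names (`residuals`, `plane`), the regression function, and all numerical values are hypothetical and chosen only to illustrate the formula with more than one explanatory variable.

```python
def residuals(f, xs, ys, beta_hat):
    """Return e_i = y_i - f(x_i; beta_hat) for each observation."""
    return [y - f(x, beta_hat) for x, y in zip(xs, ys)]

# Hypothetical regression function with two explanatory variables.
def plane(x, beta):
    x1, x2 = x
    b0, b1, b2 = beta
    return b0 + b1 * x1 + b2 * x2

# Illustrative values only (not from the handbook).
xs = [(1.0, 2.0), (0.0, 1.0)]   # each x_i is a list of explanatory values
ys = [10.0, 3.5]                # observed responses
beta_hat = (1.0, 2.0, 3.0)      # estimated parameters

print(residuals(plane, xs, ys, beta_hat))  # [1.0, -0.5]
```

A positive residual means the observed response lies above the prediction, and a negative residual means it lies below.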

Example

The data listed below are from the Pressure/Temperature example introduced in Section 4.1.1. The first column shows the order in which the observations were made, the second column indicates the day on which each observation was made, and the third column gives the ambient temperature recorded when each measurement was made. The fourth column lists the temperature of the gas itself (the explanatory variable) and the fifth column contains the observed pressure of the gas (the response variable). The sixth column gives the corresponding values from the fitted straight-line regression function, $$ \hat{P} = 7.749695 + 3.930123T, $$ and the last column lists the residuals, each being the observed pressure (column five) minus the fitted value (column six). (The reader can download the pressure/temperature data as a text file.)

Data, Fitted Values & Residuals

 Run          Ambient                            Fitted
Order  Day  Temperature  Temperature  Pressure    Value    Residual
 1      1      23.820      54.749      225.066   222.920     2.146
 2      1      24.120      23.323      100.331    99.411     0.920
 3      1      23.434      58.775      230.863   238.744    -7.881
 4      1      23.993      25.854      106.160   109.359    -3.199
 5      1      23.375      68.297      277.502   276.165     1.336
 6      1      23.233      37.481      148.314   155.056    -6.741
 7      1      24.162      49.542      197.562   202.456    -4.895
 8      1      23.667      34.101      138.537   141.770    -3.232
 9      1      24.056      33.901      137.969   140.983    -3.014
10      1      22.786      29.242      117.410   122.674    -5.263
11      2      23.785      39.506      164.442   163.013     1.429
12      2      22.987      43.004      181.044   176.759     4.285
13      2      23.799      53.226      222.179   216.933     5.246
14      2      23.661      54.467      227.010   221.813     5.198
15      2      23.852      57.549      232.496   233.925    -1.429
16      2      23.379      61.204      253.557   248.288     5.269
17      2      24.146      31.489      139.894   131.506     8.388
18      2      24.187      68.476      273.931   276.871    -2.940
19      2      24.159      51.144      207.969   208.753    -0.784
20      2      23.803      68.774      280.205   278.040     2.165
21      3      24.381      55.350      227.060   225.282     1.779
22      3      24.027      44.692      180.605   183.396    -2.791
23      3      24.342      50.995      206.229   208.167    -1.938
24      3      23.670      21.602       91.464    92.649    -1.186
25      3      24.246      54.673      223.869   222.622     1.247
26      3      25.082      41.449      172.910   170.651     2.259
27      3      24.575      35.451      152.073   147.075     4.998
28      3      23.803      42.989      169.427   176.703    -7.276
29      3      24.660      48.599      192.561   198.748    -6.188
30      3      24.097      21.448       94.448    92.042     2.406
31      4      22.816      56.982      222.794   231.697    -8.902
32      4      24.167      47.901      199.003   196.008     2.996
33      4      22.712      40.285      168.668   166.077     2.592
34      4      23.611      25.609      109.387   108.397     0.990
35      4      23.354      22.971       98.445    98.029     0.416
36      4      23.669      25.838      110.987   109.295     1.692
37      4      23.965      49.127      202.662   200.826     1.835
38      4      22.917      54.936      224.773   223.653     1.120
39      4      23.546      50.917      216.058   207.859     8.199
40      4      24.450      41.976      171.469   172.720    -1.251
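
The fitted values and residuals in the table can be recomputed from the fitted line quoted in the text. The sketch below does so for run orders 1, 2, 3 and 5; the variable and function names are hypothetical. The recomputed values agree with the table to within a few thousandths, with the small differences presumably due to rounding in the tabulated values.

```python
# Coefficients of the fitted straight line quoted in the text.
INTERCEPT = 7.749695
SLOPE = 3.930123

# (temperature, pressure, tabulated fitted value, tabulated residual)
# for run orders 1, 2, 3 and 5, copied from the table.
rows = [
    (54.749, 225.066, 222.920, 2.146),
    (23.323, 100.331, 99.411, 0.920),
    (58.775, 230.863, 238.744, -7.881),
    (68.297, 277.502, 276.165, 1.336),
]

def fitted_pressure(temperature):
    """Predicted pressure from the fitted straight line."""
    return INTERCEPT + SLOPE * temperature

for temperature, pressure, tab_fit, tab_resid in rows:
    fit = fitted_pressure(temperature)
    resid = pressure - fit  # residual = observed - fitted
    print(f"T = {temperature:7.3f}  fitted {fit:8.3f} (table {tab_fit:8.3f})  "
          f"residual {resid:7.3f} (table {tab_resid:7.3f})")
```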

Why Use Residuals?

If the model fit to the data were correct, the residuals would approximate the random errors that make the relationship between the explanatory variables and the response variable a statistical relationship. Therefore, if the residuals appear to behave randomly, it suggests that the model fits the data well. On the other hand, if non-random structure is evident in the residuals, it is a clear sign that the model fits the data poorly. The subsections listed below detail the types of plots to use to test different aspects of a model and give guidance on the correct interpretations of different results that could be observed for each type of plot.
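
As a rough numerical companion to looking at a plot, one can count runs of same-signed residuals in run order and compare the count with what random ordering would give (the Wald-Wolfowitz runs test). This is a sketch under stated assumptions, not the graphical method the handbook recommends; the function names are hypothetical, and the residuals are copied from the table above.

```python
import math

# Residuals in run order, copied from the table above.
residuals = [
     2.146,  0.920, -7.881, -3.199,  1.336, -6.741, -4.895, -3.232,
    -3.014, -5.263,  1.429,  4.285,  5.246,  5.198, -1.429,  5.269,
     8.388, -2.940, -0.784,  2.165,  1.779, -2.791, -1.938, -1.186,
     1.247,  2.259,  4.998, -7.276, -6.188,  2.406, -8.902,  2.996,
     2.592,  0.990,  0.416,  1.692,  1.835,  1.120,  8.199, -1.251,
]

def count_sign_runs(values):
    """Number of maximal runs of same-signed values (zeros are dropped)."""
    signs = [v > 0 for v in values if v != 0]
    return sum(1 for i, s in enumerate(signs) if i == 0 or s != signs[i - 1])

def runs_summary(values):
    """Observed runs, expected runs, and z-score under random sign ordering.

    Assumes at least one positive and one negative value.
    """
    n_pos = sum(1 for v in values if v > 0)
    n_neg = sum(1 for v in values if v < 0)
    n = n_pos + n_neg
    runs = count_sign_runs(values)
    expected = 2 * n_pos * n_neg / n + 1
    variance = 2 * n_pos * n_neg * (2 * n_pos * n_neg - n) / (n ** 2 * (n - 1))
    return runs, expected, (runs - expected) / math.sqrt(variance)

runs, expected, z = runs_summary(residuals)
print(f"runs = {runs}, expected = {expected:.2f}, z = {z:.2f}")
```

Far fewer runs than expected would suggest drift or another systematic pattern over time. For these residuals the count falls within about two standard deviations of its expected value, so this particular check does not flag non-random ordering. It says nothing about other aspects of the fit, which is why the plots remain the primary tool.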

Model Validation Specifics