4. Process Modeling
4.3. Data Collection for Process Modeling
Output from Process Model is Fitted Mathematical Function | The output from process modeling is a fitted mathematical function with estimated coefficients. For example, in modeling resistivity, \(y\), as a function of dopant density, \(x\), an analyst may suggest the function $$ y = \beta_{0} + \beta_{1}x + \beta_{11}x^{2} + \varepsilon $$ in which the coefficients to be estimated are \(\beta_0\), \(\beta_1\), and \(\beta_{11}\). Even for a given functional form, there are infinitely many potential coefficient values that may be used, and each set of coefficient values will in turn yield its own set of predicted values.
What are Good Coefficient Values? | Poor values of the coefficients are those for which the resulting predicted values are considerably different from the observed raw data \(y\). Good values of the coefficients are those for which the resulting predicted values are close to the observed raw data \(y\). The best values of the coefficients are those for which the predicted values are close to the observed raw data \(y\) and the statistical uncertainty associated with each coefficient is small.
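To make the distinction between "good" and "poor" coefficient values concrete, here is a minimal sketch in Python. The data, the generating coefficients, and the "poor" candidate values are all illustrative assumptions, not measurements from the text:

```python
import numpy as np

# Made-up resistivity-vs-dopant-density data (illustrative values only,
# not measurements from the text).
rng = np.random.default_rng(0)
x = np.linspace(1.0, 5.0, 10)
y = 2.0 + 1.5 * x + 0.3 * x**2 + rng.normal(scale=0.2, size=x.size)

def sse(b0, b1, b11):
    """Sum of squared differences between raw data and predicted values."""
    pred = b0 + b1 * x + b11 * x**2
    return np.sum((y - pred) ** 2)

good = sse(2.0, 1.5, 0.3)   # coefficients near those used to generate the data
poor = sse(0.0, 0.0, 1.0)   # arbitrary poor coefficients
assert good < poor           # good coefficients -> predictions close to the data
```

The sum of squared differences gives one numerical measure of how close the predicted values are to the raw data, which is exactly the measure the least squares criterion below minimizes.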
There are two considerations that are useful for the generation of "best" coefficients:
Least Squares Criterion |
For a given data set (e.g., 10 \((x,y)\) pairs), the most common procedure for obtaining the coefficients for
$$ y = f(x;\vec{\beta}) + \varepsilon $$
is the least squares estimation criterion. This criterion yields coefficients with predicted values that are closest to the raw data \(y\) in the sense that the sum of the squared differences between the raw data and the predicted values is as small as possible. The overwhelming majority of regression programs today use the least squares criterion for estimating the model coefficients. Least squares estimates are popular because they are simple to compute and their statistical properties are well understood.
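A short sketch of the least squares criterion in Python, using made-up \((x,y)\) data for the straight-line case (the data values are assumptions for illustration):

```python
import numpy as np

# Made-up (x, y) data, roughly linear with noise.
x = np.array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1, 6.0, 6.8, 8.1, 8.9, 10.2])

# Least squares: choose the coefficients minimizing the sum of squared
# differences between the raw data and the predicted values.
X = np.column_stack([np.ones_like(x), x])      # design matrix [1, x]
beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # [b0_hat, b1_hat]

def sse(b):
    return np.sum((y - X @ b) ** 2)

# Perturbing the least squares solution in any direction increases the
# sum of squared residuals.
for d in [np.array([0.1, 0.0]), np.array([-0.1, 0.0]),
          np.array([0.0, 0.05]), np.array([0.0, -0.05])]:
    assert sse(beta) <= sse(beta + d)
print(np.round(beta, 3))
```

Because the sum of squared residuals is a convex function of the coefficients, the least squares solution is the unique minimizer here, which is what the perturbation check illustrates.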
Design of Experiment Principles | As to what values should be used for the \(x\)'s, we look to established experimental design principles for guidance.
Principle 1: Minimize Coefficient Estimation Variation |
The first principle of experimental design is to control the values within the \(x\) vector such that, after the \(y\) data are collected, the subsequent model coefficients are as good as possible, in the sense of having the smallest variation.

The key underlying point with respect to design of experiments and process modeling is that even though (for simple \((x,y)\) fitting, for example) the least squares criterion may yield optimal (minimal variation) estimators for a given distribution of \(x\) values, some distributions of data in the \(x\) vector may yield better (smaller variation) coefficient estimates than others. If the analyst can specify the values in the \(x\) vector, then he or she may be able to drastically reduce the noisiness of the subsequent least squares coefficient estimates.
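This principle can be seen in a small simulation, sketched below under assumed values for the true line, the noise level, and two candidate \(x\) layouts (all assumptions for illustration): the same model and the same noise give a noisier slope estimate when the \(x\)'s are bunched together.

```python
import numpy as np

rng = np.random.default_rng(1)
b0, b1, sigma = 2.0, 1.5, 1.0   # assumed "true" line and noise level

def slope_sd(x, n_sim=1000):
    """Standard deviation of the fitted slope over repeated simulated data sets."""
    X = np.column_stack([np.ones_like(x), x])
    slopes = [
        np.linalg.lstsq(X, b0 + b1 * x + rng.normal(scale=sigma, size=x.size),
                        rcond=None)[0][1]
        for _ in range(n_sim)
    ]
    return np.std(slopes)

x_clustered = np.array([4.0, 4.5, 5.0, 5.0, 5.5, 5.5, 6.0, 6.0, 6.5, 7.0])
x_spread = np.array([0.0] * 5 + [10.0] * 5)   # half low, half high

# Same model, same noise: only the placement of the x's differs.
assert slope_sd(x_spread) < slope_sd(x_clustered)
```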
Five Designs |
To see the effect of experimental design on process modeling,
consider the following simplest case of fitting a line:
$$ y = \beta_{0} + \beta_{1}x + \varepsilon $$
Suppose the analyst can afford 10 observations (that is, 10
\( (x,y) \)
pairs) for the purpose of
determining optimal (that is, minimal variation) estimators of
\( \beta_0\)
and \(\beta_1\).
What 10 \(x\)
values should be used for the purpose of
collecting the corresponding 10 \(y\)
values? Colloquially, where should the 10 \(x\)
values be sprinkled along the horizontal axis so as to
minimize the variation of the least squares estimated coefficients for
\(\beta_0\)
and \(\beta_1\)?
Should the 10 \(x\)
values be:
1. ten equi-spaced values across the range of interest?
2. five replicated equi-spaced values across the range of interest?
3. five values at the minimum of the \(x\) range and five values at the maximum?
4. one value at the minimum, eight values at the mid-range, and one value at the maximum?
5. four values at the minimum, two values at the mid-range, and four values at the maximum?
For each of the above five experimental designs, there will of course be \(y\) data collected, followed by the generation of least squares estimates for \(\beta_0\) and \(\beta_1\), and so each design will in turn yield a fitted line.
Are the Fitted Lines Better for Some Designs? |
But are the fitted lines, i.e., the fitted process models, better
for some designs than for others? Are the coefficient estimator
variances smaller for some designs than for others? For given
estimates, are the resulting predicted values better (that is,
closer to the observed \(y\)
values)
than for other designs? The
answer to all of the above is YES. It DOES make a difference.
The most popular answer to the above question about which design to use for linear modeling is design #1, with ten equi-spaced points. It can be shown, however, that the variance of the estimated slope parameter depends on the design according to the relationship $$ \mbox{Var}(\hat{\beta}_1) \propto \frac{1}{\sum_{i=1}^{n}(x_i-\bar{x})^2} $$ Therefore, to obtain minimum variance estimators, one maximizes the denominator on the right. To maximize the denominator, it is best (for an arbitrarily fixed \(\bar{x}\)) to position the \(x\)'s as far away from \(\bar{x}\) as possible. This is done by positioning half of the \(x\)'s at the lower extreme and the other half at the upper extreme. This is design #3 above, and this "dumbbell" design (half low and half high) is in fact the best possible design for fitting a line. Upon reflection, this agrees with the adage that "2 points define a line": it makes the most sense to determine those 2 points as far apart as possible (at the extremes) and as well as possible (with half the data at each extreme). Hence the design of experiments solution for process modeling when the model is a line is the "dumbbell" design, with half the \(x\)'s at each extreme.
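The denominator \(\sum_{i=1}^{n}(x_i-\bar{x})^2\) can be computed directly for candidate designs. The sketch below does this on an assumed \([0, 1]\) range for three of the layouts (the exact point placements are assumptions for illustration); a larger sum means a smaller slope variance:

```python
import numpy as np

# Candidate 10-point designs on [0, 1] (assumed layouts for illustration).
designs = {
    "equi-spaced":             np.linspace(0.0, 1.0, 10),
    "dumbbell (5 low/5 high)": np.array([0.0] * 5 + [1.0] * 5),
    "mid-range heavy":         np.array([0.0] + [0.5] * 8 + [1.0]),
}

def ssx(x):
    """The slope-variance denominator: sum of squared deviations from the mean."""
    return np.sum((x - x.mean()) ** 2)

vals = {name: ssx(x) for name, x in designs.items()}
for name, v in vals.items():
    print(name, round(v, 3))

# Larger ssx -> smaller Var(beta1_hat): the dumbbell design is best,
# the mid-range-heavy design is worst.
assert max(vals, key=vals.get) == "dumbbell (5 low/5 high)"
assert min(vals, key=vals.get) == "mid-range heavy"
```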
What is the Worst Design? | What is the worst design in the above case? Of the five designs, the worst design is the one that yields maximum variation in the estimated slope. In the mathematical expression above, that is the one that minimizes the denominator, and so it is design #4, for which almost all of the data are located at the mid-range. Clearly the estimated line in this case is going to chase the solitary point at each end, and so the resulting linear fit is intuitively inferior.
Designs 1, 2, and 5 |
What about the other three designs? Designs 1, 2, and 5 are useful only for the case when we think the model may be linear, but we are not sure, and so we allow additional points that permit fitting a line if appropriate, but build into the design the "capacity" to fit beyond a line (e.g., quadratic, cubic, etc.) if necessary. In this regard, the preferred ordering among designs 1, 2, and 5 depends on the degree of departure from linearity that we wish to be able to detect.
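The "capacity" to fit beyond a line can be checked directly from the rank of the quadratic design matrix. The sketch below uses an assumed 4-2-4 three-level layout for comparison; with the dumbbell design, \(x\) takes only two distinct values, so the quadratic model is not even estimable:

```python
import numpy as np

x_dumbbell = np.array([0.0] * 5 + [1.0] * 5)            # half low, half high
x_three_level = np.array([0.0] * 4 + [0.5] * 2 + [1.0] * 4)  # assumed 4-2-4 layout

def quad_rank(x):
    """Rank of the design matrix [1, x, x^2] for a quadratic model."""
    X = np.column_stack([np.ones_like(x), x, x**2])
    return np.linalg.matrix_rank(X)

# Two distinct x values -> rank 2: the quadratic term cannot be identified.
assert quad_rank(x_dumbbell) == 2
# Three distinct x values -> full rank 3: a quadratic is estimable.
assert quad_rank(x_three_level) == 3
```

This is the trade-off the text describes: the dumbbell design is optimal when the model is known to be a line, but it gives up all ability to detect curvature.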