LINEAR FIT
Name:
Type:
Purpose:
Estimate the parameters for a linear, polynomial, or multi-linear
least squares fit.
Description:
The Dataplot FIT command can fit either non-linear models or
linear (including polynomial and multi-linear) models.
Non-linear models are specified by entering an equation (e.g.,
FIT Y = A + B*X). For non-linear fits, Dataplot uses an iterative
modified Levenberg-Marquardt algorithm. Although this algorithm can
handle linear and polynomial models, non-iterative methods
specifically designed for linear models are more efficient and
allow additional diagnostics to be computed. The non-iterative
fit method is described here (the help for the non-linear fit can
be accessed with the HELP FIT command).
When the FIT command gives a list of variables without a functional
equation, the non-iterative (linear) algorithm is used.
For linear fits, Dataplot adopted the fitting code from the
OMNITAB II statistical program. This code uses a modified Gram-Schmidt
algorithm with iterative refinement. The Gram-Schmidt algorithm
is based on the QR decomposition and is intended for full rank
models. Since Gram-Schmidt algorithms and QR decompositions are
well documented in the literature, we do not give the mathematical
details here.
For linear fits, the FIT command generates the following output.
- A table containing the parameter estimates, the parameter
standard deviations, and the parameter t-values is
printed. The t-value is used to determine whether a given
parameter is statistically significant.
These values are also written to the file dpst1f.dat. In
addition, lower and upper Bonferroni joint confidence limits
for the parameters are written to dpst1f.dat with a 5E15.7
format. By default, 95% intervals are used for the Bonferroni
intervals. You can define the parameter ALPHA to change the
significance level. For example, to use 90% intervals, enter
the command:
LET ALPHA = 0.10
To read these values into Dataplot variables, enter the command
SKIP 1
READ DPST1F.DAT COEF COEFSD TVAL BONL BONU
- The following are written to the file dpst2f.dat:
Column 1: standard deviations of the predicted values
Column 2: 95% lower confidence limit for the predicted values
Column 3: 95% upper confidence limit for the predicted values
Column 4: 99% lower confidence limit for the predicted values
Column 5: 99% upper confidence limit for the predicted values
Column 6: 95% lower joint Bonferroni confidence limit for the predicted values
Column 7: 95% upper joint Bonferroni confidence limit for the predicted values
Column 8: 95% lower joint Hotelling confidence limit for the predicted values
Column 9: 95% upper joint Hotelling confidence limit for the predicted values
These values are written with a 9E15.7 format. By default, 95%
intervals are used for the Bonferroni and Hotelling intervals.
You can define the parameter ALPHA to change the significance
level. For example, to use 90% intervals, enter the command:
LET ALPHA = 0.10
To read these values into Dataplot after the FIT, enter the
command
SKIP 1
SET READ FORMAT 9E15.7
READ DPST2F.DAT PREDSD PRED95LL PRED95UL PRED99LL PRED99UL ...
PREDBLL PREDBUL PREDHLL PREDHUL
- The following are written to the file dpst3f.dat
The variables written to this file are used in "regression
diagnostics". More will be said about this later.
Column 1: the diagonals of the hat matrix (the hat matrix is
\( X(X'X)^{-1}X' \) where \( X' \) is the transpose of the
\( X \) matrix). In themselves, the diagonal elements
are measures of the leverage of a given point. The
minimum leverage is \( \frac{1}{n} \), the maximum
leverage is 1.0 and the average leverage is
\( \frac{p}{n} \) where \( p \) is the number of
parameters in the fit (including the constant term). These
elements are also used to calculate many other diagnostic
statistics. Note that
\( H_{ii} = \frac{\mbox{VAR(Predicted Value)}}
{\mbox{Residual Variance}} \)
Column 2: the variance of the residuals
\( \mbox{VAR(res)} = \mbox{MSE} (1 - H_{ii}) \)
Column 3: the standardized residuals. These are the residuals
divided by the square root of the mean square error.
\( \mbox{STRES} = \frac{\mbox{residual}}
{\sqrt{\mbox{MSE}}} \)
Column 4: the internally studentized residuals. These are
the residuals divided by their standard deviations.
Column 5: the deleted residuals. These are the residuals obtained
by subtracting the predicted value computed with the i-th
case omitted from the observed value.
Column 6: the externally studentized residuals. These are the
deleted residuals divided by their standard deviations.
Column 7: Cook's distance. This is a measure of the impact of
the i-th case on all of the estimated
regression coefficients.
\( \mbox{Cook} = \frac{\mbox{res}^2}{p \mbox{MSE}}
\frac{H_{ii}} {(1 - H_{ii})^2} \)
Column 8: DFFITS. This is a measure of the influence of the
i-th case on its own fitted value.
\( \mbox{DFFITS} = \mbox{EXTSRES}
\sqrt{\frac{H_{ii}} {1 - H_{ii}}} \)
Additional diagnostic statistics can be computed from these
values. Several of the texts in the REFERENCE section
below discuss the use and interpretation of these statistics
in more detail. These variables can be read in as follows:
SKIP 1
SET READ FORMAT 8E15.7
READ DPST3F.DAT HII VARRES STDRES ISTUDRES DELRES ...
ESTUDRES COOK DFFITS
SKIP 0
For more discussion of how these variables can be used, enter
HELP REGRESSION DIAGNOSTICS
- The variance-covariance matrix of the parameters and the
inverse of the \( X'X \) matrix are written to the file
dpst4f.dat. These values can be used in deriving additional
statistics, intervals and tests. The use of these matrices
is demonstrated in the Program example given in the
HELP REGRESSION DIAGNOSTICS section.
To read these, you can do the following
SKIP 1
READ DPST4F.DAT TEMP1 TEMP2
LET P = 2; . P denotes the number of parameters
LET S2B = VARIABLE TO MATRIX TEMP1 P
LET XTXINV = VARIABLE TO MATRIX TEMP2 P
- A regression ANOVA table is written to dpst5f.dat. In addition
to the ANOVA table, the \( R^2 \), adjusted \( R^2 \), and
PRESS P statistics are printed. These three statistics
are also saved as the internal parameters RSQUARE, ADJRSQUA,
and PRESSP, respectively.
To view the ANOVA table, enter
LIST dpst5f.dat
Starting with the August 2021 version, the following values
printed in the ANOVA table are now saved as internal
parameters:
RESSS   - the residual sum of squares
SSREG   - the regression sum of squares
SSTOTAL - the total sum of squares
MSE     - the mean square error
MSR     - the mean square of the regression
FSTAT   - the value of the F statistic
FCV95   - the 95% critical value for the F statistic
FCV99   - the 99% critical value for the F statistic
- The residual standard deviation and its corresponding degrees
of freedom are stored in the parameters RESSD and RESDF,
respectively. RESDF is the number of observations minus the
number of independent variables in the fit (including the
constant term). The formula for RESSD is:
\( \mbox{RESSD} = \sqrt{\frac{\sum_{i=1}^{n}
{(Y_{i} - \hat{Y}_{i})^{2}}} {\mbox{RESDF}}} \)
- If there is replication in the independent variables, the
replication standard deviation and corresponding degrees of
freedom are printed. In addition, a lack of fit F test is
performed. These are stored in the parameters REPSD, REPDF,
and LOFCDF, respectively. The formulas are:
\( \mbox{REPDF} = \sum_{k=1}^{nrep}{(n_{k} - 1)} \)
\( \mbox{REPSD} = \sqrt{\frac{\sum_{k=1}^{nrep} \sum_{j=1}^{n_{k}}
{(Y_{kj} - \bar{Y}_{k})^{2}}}
{\mbox{REPDF}}} \)
with \( nrep \), \( n_{k} \), \( Y_{kj} \) and \( \bar{Y}_{k} \) denoting
the number of replication sets (distinct combinations of the
independent variables), the number of observations in the
k-th set, the j-th response in the k-th set, and the mean of
the k-th set, respectively.
- Dataplot saves the predicted values from a fit in the variable
PRED and the residual values in the variable RES. These
variables can be used in subsequent LET and PLOT commands to
generate diagnostic plots of residuals and predicted values.
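As a minimal sketch of how PRED and RES might be used after a fit, the commands below plot the residuals against the predicted values and then generate a normal probability plot of the residuals. The variable names Y and X are placeholders for data that is assumed to have been read already; they are not taken from a specific Dataplot example.

```dataplot
. Assumes Y and X have already been read (placeholder names)
FIT Y X
.
. Draw points only, no connecting lines
CHARACTERS X
LINES BLANK
. Residuals versus predicted values
PLOT RES VS PRED
. Normal probability plot of the residuals
NORMAL PROBABILITY PLOT RES
```

A pattern in the first plot (for example a funnel shape or curvature) suggests that the constant-variance or linearity assumption may not hold.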
Syntax:
<d> FIT <y> <x1> ... <xk>
<SUBSET/EXCEPT/FOR qualification>
where <d> is the optional specification of the desired
degree:
LINEAR or FIRST-DEGREE (the default)
QUADRATIC or SECOND-DEGREE
CUBIC or THIRD-DEGREE
QUARTIC or FOURTH-DEGREE
QUINTIC or FIFTH-DEGREE
SEXTIC or SIXTH-DEGREE
SEPTIC or SEVENTH-DEGREE
OCTIC or EIGHT-DEGREE
NONIC or NINTH-DEGREE
DEXIC or TENTH-DEGREE;
<y> is the response (= dependent) variable;
<x1> ... <xk> is a list of 1 to 35 independent
variables;
and where the <SUBSET/EXCEPT/FOR qualification> is optional.
The estimated parameters are stored in A0, A1, ... , Ak.
If <d> is omitted, a linear fit is performed. In practice, the
linear and quadratic fits receive heavy use while the other degrees
are rarely used.
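The saved coefficients can be used directly in LET commands. The following is a sketch only, assuming a single independent variable X and response Y (placeholder names) and assuming that a quadratic fit stores the constant, linear, and quadratic coefficients in A0, A1, and A2. YHAT and DIFF are names introduced here for illustration.

```dataplot
. Assumes Y and X have already been read (placeholder names)
QUADRATIC FIT Y X
. Rebuild the fitted values from the saved coefficients
LET YHAT = A0 + A1*X + A2*X**2
. DIFF should be essentially zero if YHAT reproduces PRED
LET DIFF = YHAT - PRED
PRINT X Y PRED YHAT DIFF
```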
Examples:
FIT Y X
LINEAR FIT Y X
FIT Y X1 X2 X5
FIT Y X1 X2 X5 SUBSET TAG > 1
QUADRATIC FIT PRESSURE TEMP
CUBIC FIT V R
Note:
Weighted fits are typically used in the following two situations.
- Weighting is one approach for dealing with non-constant
variation in the residuals. It is not uncommon for the
variance of the residuals to increase for the largest (or
smallest) values of the independent variable. In this case,
weights can be used to give less weight to the less precise
measurements. The NIST/SEMATECH e-Handbook contains a
discussion of weighted fits and an example of using weights
to address non-constant variation.
- Weights can also be used to implement certain types of robust
fitting. In this case, weights are used to downweight
observations based on the size of the associated residual.
Outlier observations can sometimes distort a fit (i.e., in
trying to fit the outlier point(s), the bulk of the data
is poorly fit). Weighting based on the residuals can often
provide a good fit to the bulk of the data without eliminating
the outlier observations from the analysis.
Enter HELP WEIGHTS and
HELP BIWEIGHT
for examples of this use of weighted fits in Dataplot.
To specify weights for a least squares fit, enter the command
WEIGHTS <var>
where <var> is a variable containing the weights.
Note that the RES variable contains the absolute value of the
residuals after the fit. For residual plots and analysis, it
may be preferable to work with the weighted residuals. You can
create this with the command
where W contains the weight variable.
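A weighted fit might look like the following sketch. The weights 1/X**2 are purely illustrative (they assume the residual standard deviation grows roughly in proportion to X and that X is never zero), and Y, X, W, and RESW are placeholder names. The weighted residuals are taken here to be the residuals multiplied by the square root of the weights, which is the usual definition; this is an assumption and should be checked against HELP WEIGHTS.

```dataplot
. Assumes Y and X have already been read and that X is non-zero
. The weights 1/X**2 are an illustrative choice only
LET W = 1/(X**2)
WEIGHTS W
FIT Y X
. Weighted residuals (assumed definition: SQRT(weight) times residual)
LET RESW = SQRT(W)*RES
CHARACTERS X
LINES BLANK
PLOT RESW VS PRED
```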
Note:
When there are a large number of independent variables, subset
selection procedures are often employed to identify the best
candidate models. The BEST CP command can
be used to perform a "best subsets" analysis based on Mallows
Cp. Enter HELP BEST CP for details.
Another approach is to generate principal components of the
independent variables and to perform the fit based on the
first several principal components. Although this approach can
reduce problems introduced by multicollinearity, the downside is
that the model may be less interpretable.
Note:
The following matrix commands can be useful in regression
diagnostics:
LET VIF = VARIANCE INFLATION FACTORS
LET C = CONDITION INDICES X
LET XTXINV = XTXINV MATRIX X
LET C = CATCHER MATRIX X
The Program example in HELP REGRESSION
DIAGNOSTICS also gives an example of using these commands.
Note:
For multi-linear fits, enter the following command to omit the
constant term from the model
SET FIT ADDITIVE CONSTANT OFF
To restore the default of including the constant term, enter
SET FIT ADDITIVE CONSTANT ON
Note:
Data transformations are often used to improve the quality of the
fit. For example, some types of non-linear fits can be restated as
linear fits with an appropriate transformation. Also,
transformations are often applied to address non-homogeneous
variation in the fit. The NIST/SEMATECH e-Handbook contains a
discussion of this issue.
Data transformations can be generated easily if needed via the
LET command. The BOX-COX LINEARITY PLOT can be a useful
command for determining an appropriate transformation.
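As one illustration of restating a non-linear fit as a linear one, an exponential model of the form Y = a*EXP(b*X) becomes linear in X after taking the natural logarithm of Y. The sketch below assumes that Y is strictly positive and that X is sorted so that the fitted curve draws cleanly; Y, X, LOGY, and YHAT are placeholder names.

```dataplot
. Assumes Y (all positive) and X (sorted) have already been read
LET LOGY = LOG(Y)
FIT LOGY X
. Back-transform the predicted values to the original scale
LET YHAT = EXP(PRED)
. Raw data as points, back-transformed fit as a solid line
CHARACTERS X BLANK
LINES BLANK SOLID
PLOT Y YHAT VS X
```

Note that the fit minimizes the squared error on the log scale, so the back-transformed curve is not the least squares fit on the original scale.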
Some analysts prefer to standardize the independent variables
and the dependent variable by subtracting the mean and dividing
by the standard deviation. This is done to provide numerical
stability (note that Dataplot scales the data internally before
performing the regression calculations) and also so that the
data and regression coefficients are on a common scale. The
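original regression and standardized model are related as follows
\( Y = A_0 + A_1 X \)
\( \frac{Y - \bar{y}}{s_y} = A_1^{*} \frac{X - \bar{x}}{s_x} \)
with \( \bar{x} \) and \( s_x \) denoting the mean and standard deviation
of the independent variable and \( \bar{y} \) and \( s_y \) denoting
the mean and standard deviation of the dependent variable.
The parameters are related by
\( A_1 = \frac{s_y}{s_x} A_1^{*} \)
\( A_0 = \bar{y} - A_1 \bar{x} \)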
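A variation on this is the correlation transformation (also called
the standardized regression model). Specifically
\( Y^{*} = \frac{1}{\sqrt{n-1}} \frac{Y - \bar{y}}{s_y} \)
\( X^{*} = \frac{1}{\sqrt{n-1}} \frac{X - \bar{x}}{s_x} \)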
With this transformation, the \( X'X \) matrix reduces to a
correlation matrix of the independent variables. If there are \( p \)
independent variables, these transformations can be generated with the
commands
LET N = SIZE Y
LET AFACT = 1/SQRT(N-1)
LOOP FOR K = 1 1 P
LET Z^K = STANDARDIZE X^K
LET Z^K = AFACT*Z^K
END OF LOOP
LET YT = STANDARDIZE Y
LET YT = AFACT*YT
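The statement that the \( X'X \) matrix reduces to a correlation matrix can be checked with a short derivation. The symbols below are introduced here for illustration and do not come from the Dataplot documentation: \( z_{ij} \) is the transformed value of the j-th independent variable for the i-th observation, \( Z \) is the matrix of these values, and \( r_{jk} \) is the sample correlation between the j-th and k-th independent variables. The derivation assumes that \( s_{x_j} \) is the sample standard deviation computed with divisor \( n-1 \).

```latex
\[
z_{ij} = \frac{1}{\sqrt{n-1}} \, \frac{x_{ij} - \bar{x}_j}{s_{x_j}}
\]
\[
(Z'Z)_{jk} = \sum_{i=1}^{n} z_{ij} z_{ik}
           = \frac{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)}
                  {(n-1) \, s_{x_j} s_{x_k}}
           = r_{jk}
\]
```

In particular, each diagonal element is \( r_{jj} = 1 \).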
Note:
It is recommended that a FIT be followed by a residual analysis to
assess the model adequacy. Specifically, the typical assumptions for
the residuals are that they are independent with a common
distribution having fixed location and variation. It is usually
assumed that the common distribution is a normal distribution.
The 4-PLOT command generates 4 plots that are useful in testing
these assumptions. The NIST/SEMATECH e-Handbook contains a
more detailed discussion of this issue.
In addition, if there is a single independent variable in the model,
it can be useful to plot the data with the fitted values overlaid.
Linear fits allow a much richer set of diagnostics. For a fuller
description and an example demonstrating these, enter
HELP REGRESSION DIAGNOSTICS
Note:
If you want to suppress the output to files dpst1f.dat, dpst2f.dat,
dpst3f.dat, dpst4f.dat and dpst5f.dat, enter the command
SET FIT AUXILLARY FILES OFF
Note:
By default, the values written to dpst1f.dat, dpst2f.dat, dpst3f.dat
and dpst4f.dat are written using a Fortran E15.7 format (that is,
exponential format with 7 significant digits). You can specify
the number of significant digits with the command
SET AUXILLARY FILES DECIMAL POINTS <value>
where the default is 7.
Default:
Synonyms:
Related Commands:
FIT = Generate a non-linear fit.
PRED = A variable where predicted values are stored.
RES = A variable where residuals are stored.
RESSD = A parameter where the residual standard deviation is stored.
RESDF = A parameter where the residual degrees of freedom is stored.
REPSD = A parameter where the replication standard deviation is stored.
REPDF = A parameter where the replication degrees of freedom is stored.
LOFCDF = A parameter where the lack of fit cdf is stored.
WEIGHTS = Sets the weights for the fit command.
BIWEIGHT = Perform a biweight transformation.
EXACT RATIONAL FIT = Perform an exact rational fit.
CALIBRATION = Perform a linear or quadratic calibration fit.
LOWESS = Perform a locally weighted least squares smoothing.
BOOTSTRAP FIT = Perform a linear or multi-linear fit based on the bootstrap.
ORTHOGONAL DISTANCE FIT = Perform an orthogonal distance fit (useful for errors-in-variables models).
SPLINE FIT = Perform a spline fit.
SMOOTH = Perform a smoothing.
ANOVA = Perform a fixed effects analysis of variance.
MEDIAN POLISH = Perform a median polish.
PLOT = Generate a data/function plot.
4-PLOT = Generate a 4-plot.
References:
Draper and Smith (1998), "Applied Regression Analysis", Third ed.,
John Wiley.
Mosteller and Tukey (1977), "Data Analysis and Regression",
Addison-Wesley.
Cook and Weisberg (1982), "Residuals and Influence in Regression",
Chapman and Hall.
Belsley, Kuh, and Welsch (1980), "Regression Diagnostics",
John Wiley.
Neter, Wasserman, and Kutner (1990), "Applied Linear Statistical
Models", 3rd ed., Irwin.
Note that linear regression is covered in great detail in many
statistics textbooks.
Applications:
Implementation Date:
1987/06
1988/09: Support for constant fit
1992/03: Write COEF, SDCOEF, TCDF to dpst1f.dat
1993/07: Write diagonal of hat matrix and parameter covariance
matrix to file
1994/01: Write SDPRED and limits to file
1994/06: Fix bug in dpst4f.dat file for polynomial models
1996/01: Fix bomb with constant fit
2002/04: Support for no constant term
2002/04: Print error message if singularity detected
2002/06: Additional variables to dpst2f.dat and dpst3f.dat file
2002/06: Write ANOVA table to dpst5f.dat
2003/10: Support for HTML and LaTeX output
2013/10: Support for BIC statistic
2014/06: User option to suppress writing to auxiliary files
2019/04: User option to specify number of decimal points for
auxiliary files
2021/08: Save RESSS, SSREG, SSTOTAL, MSE, MSR, FSTAT, FCV95, and
FCV99 as internal parameters
Program:
. ALASKA PIPELINE RADIOGRAPHIC DEFECT BIAS CURVE
. PERFORM A LINEAR REGRESSION
SKIP 25
READ BERGER1.DAT TRUE MEAS BATCH
FIT MEAS TRUE
.
TITLE OFFSET 2
TITLE CASE ASIS
LABEL CASE ASIS
CASE ASIS
.
TITLE Original Data with Predicted Values
X1LABEL True Depth (in .001 inch)
Y1LABEL Measured Depth
CHARACTERS X
LINES BLANK
.
PLOT MEAS PRED VS TRUE
.
LABEL
TITLE
MULTIPLOT CORNER COORDINATES 0 0 100 100
SET 4-PLOT MULTIPLOT ON
TIC MARK LABEL SIZE 4
CHARACTER SIZE 4
.
4-PLOT RES
.
END OF MULTIPLOT
JUSTIFICATION CENTER
MOVE 50 97
TEXT 4-Plot of Residuals (BERGER1.DAT)
The following output is generated
Least Squares Multilinear Fit
Sample Size: 107
Number of Variables: 1
Residual Standard Deviation: 7.86476
Residual Degrees of Freedom: 105
BIC: 448.67856
Replication Case:
Replication Standard Deviation: 6.47902
Replication Degrees of Freedom: 68
Number of Distinct Subsets: 39
Lack of Fit F Ratio: 2.34374
Lack of Fit F CDF (%): 99.88354
Lack of Fit Degrees of Freedom 1: 37
Lack of Fit Degrees of Freedom 2: 68
--------------------------------------------------------------------
                                  Approximate
Parameter       Estimates     Standard Deviation      t-Value
--------------------------------------------------------------------
1  A0            -1.96750           1.57479            -1.2494
2  A1  TRUE       1.22297           0.04107            29.7781
Date created: 09/02/2021
Last updated: 12/04/2023
Please email comments on this WWW page to
alan.heckert@nist.gov.