LINEAR FIT

Name:

... FIT

Type:

Analysis Command

Purpose:

Estimate the parameters for a linear, polynomial, or multi-linear least squares fit.

Description:

Models entered as an equation (e.g., FIT Y = A + B*X) are treated as non-linear fits, for which Dataplot uses an iterative modified Levenberg-Marquardt algorithm. Although this algorithm can handle linear and polynomial models, non-iterative methods designed specifically for linear models are more efficient and allow additional diagnostics to be computed. The non-iterative fit method is described here (the help for the non-linear fit can be accessed with the HELP FIT command).

When the FIT command gives a list of variables without a functional equation, the non-iterative (linear) algorithm is used.

For linear fits, Dataplot adopted the fitting code from the OMNITAB II statistical program. This code uses a modified Gram-Schmidt algorithm with iterative refinement. The Gram-Schmidt algorithm is based on the QR decomposition and is intended for full rank models. Since Gram-Schmidt algorithms and QR decompositions are well documented in the literature, we do not give the mathematical details here.
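As a rough illustration of the idea only, the following Python sketch solves a small full-rank least squares problem with a modified Gram-Schmidt QR factorization. It is not the OMNITAB II or Dataplot code, it omits the iterative refinement step, and the function name `mgs_least_squares` is hypothetical.

```python
# Illustrative sketch only: NOT the OMNITAB II / Dataplot implementation.
# Modified Gram-Schmidt QR for a full-rank problem, no iterative refinement.

def mgs_least_squares(X, y):
    """Return coefficients b minimizing ||y - X b|| for a full-rank X.

    X is a list of n rows with p entries each; y is a list of n values.
    """
    n, p = len(X), len(X[0])
    # V[j] holds column j of X and is orthogonalized as the loop proceeds.
    V = [[X[i][j] for i in range(n)] for j in range(p)]
    R = [[0.0] * p for _ in range(p)]
    Q = []
    for j in range(p):
        R[j][j] = sum(v * v for v in V[j]) ** 0.5
        q = [v / R[j][j] for v in V[j]]
        Q.append(q)
        # "Modified" step: remove q from every remaining column immediately.
        for k in range(j + 1, p):
            R[j][k] = sum(qi * vi for qi, vi in zip(q, V[k]))
            V[k] = [vi - R[j][k] * qi for qi, vi in zip(q, V[k])]
    # Solve R b = Q'y by back substitution.
    qty = [sum(qi * yi for qi, yi in zip(Q[j], y)) for j in range(p)]
    b = [0.0] * p
    for j in range(p - 1, -1, -1):
        s = sum(R[j][k] * b[k] for k in range(j + 1, p))
        b[j] = (qty[j] - s) / R[j][j]
    return b


# Usage: fit y = a0 + a1*x to points lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
X = [[1.0, x] for x in xs]       # first column is the constant term
y = [1.0 + 2.0 * x for x in xs]
coef = mgs_least_squares(X, y)   # coef[0] is the constant, coef[1] the slope
```

Orthogonalizing the remaining columns as soon as each new direction is found is what distinguishes the modified variant from classical Gram-Schmidt; it tends to lose less accuracy when columns are nearly dependent.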

For linear fits, the FIT command generates the following output.

A table containing the parameter estimates, the parameter standard deviations, and the parameter t-values is printed. The t-value is used to determine if a given parameter is statistically significant.

These values are also written to the file dpst1f.dat. In addition, lower and upper Bonferroni joint confidence limits for the parameters are written to dpst1f.dat with a 5E15.7 format. By default, 95% intervals are used for the Bonferroni intervals. You can define the parameter ALPHA to change the significance level. For example, to use 90% intervals, enter the command:

LET ALPHA = 0.9

To read these values into Dataplot variables, enter the command

The following are written to the file dpst2f.dat

Column 1:	standard deviations of the predicted values
Column 2:	95% lower confidence limit for the predicted values
Column 3:	95% upper confidence limit for the predicted values
Column 4:	99% lower confidence limit for the predicted values
Column 5:	99% upper confidence limit for the predicted values
Column 6:	95% lower joint Bonferroni confidence limit for the predicted values
Column 7:	95% upper joint Bonferroni confidence limit for the predicted values
Column 8:	95% lower joint Hotelling confidence limit for the predicted values
Column 9:	95% upper joint Hotelling confidence limit for the predicted values

These values are written with a 9E15.7 format. By default, 95% intervals are used for the Bonferroni and Hotelling intervals. You can define the parameter ALPHA to change the significance level. For example, to use 90% intervals, enter the command:

LET ALPHA = 0.9

To read these values into Dataplot after the FIT, enter the command

The following are written to the file dpst3f.dat

The variables written to this file are used in "regression diagnostics". More will be said about this later.

Column 1:	the diagonals of the hat matrix (the hat matrix is \( X(X'X)^{-1}X' \) where \( X' \) is the transpose of the \( X \) matrix). In themselves, the diagonal elements are measures of the leverage of a given point. The minimum leverage is \( \frac{1} {n} \), the maximum leverage is 1.0 and the average leverage is \( \frac{p} {n} \) where \( p \) is the number of parameters in the fit (including the constant term). These elements are also used to calculate many other diagnostic statistics. Note that \( H_{ii} = \frac{\mbox{VAR(Predicted Value)}} {\mbox{Residual Variance}} \)
Column 2:	the variance of the residuals \( \mbox{VAR(res)} = \mbox{MSE} (1 - H_{ii}) \)
Column 3:	the standardized residuals. These are the residuals divided by the square root of the mean square error. \( \mbox{STRES} = \frac{\mbox{residual}} {\sqrt{\mbox{MSE}}} \)
Column 4:	the internally studentized residuals. These are the residuals divided by their standard deviations.
Column 5:	the deleted residuals. These are the residuals obtained by subtracting the predicted value computed with the i-th case omitted from the observed value.
Column 6:	the externally studentized residuals. These are the deleted residuals divided by their standard deviation.
Column 7:	Cook's distance. This is a measure of the impact of the i-th case on all of the estimated regression coefficients. \( \mbox{Cook} = \frac{\mbox{res}^2}{p \mbox{MSE}} \frac{H_{ii}} {(1 - H_{ii})^2} \)
Column 8:	\( \mbox{DFFITS} = \mbox{EXTSRES} \sqrt{\frac{H_{ii}} {1 - H_{ii}}} \)
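The eight quantities above can be sketched in Python for the simplest case, a straight-line fit with a constant term (p = 2). This is an illustration under textbook definitions, not Dataplot's code: the function name `line_fit_diagnostics` and the data are hypothetical, the leverage uses the closed form for a single independent variable, and DFFITS is taken as EXTSRES*sqrt(H_ii/(1 - H_ii)).

```python
# Hypothetical helper (not a Dataplot routine): case diagnostics for the
# straight-line fit y = a0 + a1*x, which has p = 2 parameters.

def line_fit_diagnostics(x, y):
    n, p = len(x), 2
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    a1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a0 = ybar - a1 * xbar
    res = [yi - (a0 + a1 * xi) for xi, yi in zip(x, y)]
    mse = sum(r * r for r in res) / (n - p)
    table = []
    for xi, r in zip(x, res):
        h = 1.0 / n + (xi - xbar) ** 2 / sxx             # column 1: H_ii
        var_res = mse * (1.0 - h)                        # column 2
        stres = r / mse ** 0.5                           # column 3
        intsres = r / var_res ** 0.5                     # column 4
        delres = r / (1.0 - h)                           # column 5
        # Mean square error of the fit with this case left out.
        mse_del = ((n - p) * mse - r * r / (1.0 - h)) / (n - p - 1)
        extsres = r / (mse_del * (1.0 - h)) ** 0.5       # column 6
        cook = r * r / (p * mse) * h / (1.0 - h) ** 2    # column 7
        dffits = extsres * (h / (1.0 - h)) ** 0.5        # column 8
        table.append((h, var_res, stres, intsres,
                      delres, extsres, cook, dffits))
    return table


# Made-up data; the last point sits far from the others in x.
x = [1.0, 2.0, 3.0, 4.0, 10.0]
y = [1.1, 1.9, 3.2, 3.9, 12.0]
diag = line_fit_diagnostics(x, y)
```

In this sketch the leverages sum to p, and the last case, which is isolated in x, has the largest leverage. The deleted residual is obtained from the ordinary residual as res/(1 - H_ii), so no refitting is needed.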

Additional diagnostic statistics can be computed from these values. Several of the texts in the REFERENCE section below discuss the use and interpretation of these statistics in more detail. These variables can be read in as follows:

For more discussion of how these variables can be used, enter

HELP REGRESSION DIAGNOSTICS

The variance-covariance matrix of the parameters and the inverse of the \( X'X \) matrix are written to the file dpst4f.dat. These values can be used in deriving additional statistics, intervals and tests. The use of these matrices is demonstrated in the Program example given in the HELP REGRESSION DIAGNOSTICS section.

To read these, you can do the following

A regression ANOVA table is written to dpst5f.dat. In addition to the ANOVA table, the \( R^2 \), adjusted \( R^2 \), and PRESS P statistics are printed. These three statistics are also saved as the internal parameters RSQUARE, ADJRSQUA, and PRESSP, respectively.

To view the ANOVA table, enter

LIST dpst5f.dat

Starting with the August 2021 version, the following values printed in the ANOVA table are now saved as internal parameters

RESSS	-	the residual sum of squares
SSREG	-	the regression sum of squares
SSTOTAL	-	the total sum of squares
MSE	-	the mean square error
MSR	-	the mean square of the regression
FSTAT	-	the value of the F statistic
FCV95	-	the 95% critical value for the F statistic
FCV99	-	the 99% critical value for the F statistic
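The relationships among these quantities can be sketched in Python for a straight-line fit. This is only an illustration under the usual textbook definitions (total sum of squares corrected for the mean, p = 2 parameters); the dictionary keys reuse the parameter names above for readability, while the function name `anova_quantities` and the data are hypothetical. Dataplot's own computations may differ in detail.

```python
# Illustrative sketch, assuming textbook ANOVA definitions for the
# straight-line fit y = a0 + a1*x (p = 2 parameters, constant included).

def anova_quantities(x, y):
    n, p = len(x), 2
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    a1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a0 = ybar - a1 * xbar
    resss = sum((yi - (a0 + a1 * xi)) ** 2 for xi, yi in zip(x, y))
    sstotal = sum((yi - ybar) ** 2 for yi in y)   # corrected for the mean
    ssreg = sstotal - resss
    mse = resss / (n - p)        # residual mean square
    msr = ssreg / (p - 1)        # regression mean square
    rsquare = ssreg / sstotal
    adjrsqua = 1.0 - mse / (sstotal / (n - 1))
    return {"RESSS": resss, "SSREG": ssreg, "SSTOTAL": sstotal,
            "MSE": mse, "MSR": msr, "FSTAT": msr / mse,
            "RSQUARE": rsquare, "ADJRSQUA": adjrsqua}


# Made-up data for the example.
x = [1.0, 2.0, 3.0, 4.0, 10.0]
y = [1.1, 1.9, 3.2, 3.9, 12.0]
stats = anova_quantities(x, y)
```

The F statistic is the ratio MSR/MSE, and the adjusted R-squared replaces the sums of squares in R-squared with the corresponding mean squares, so it never exceeds R-squared.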

The residual standard deviation and its corresponding degrees of freedom are stored in the parameters RESSD and RESDF, respectively. RESDF is the number of observations minus the number of independent variables in the fit (including the constant term). The formula for RESSD is:

\( \mbox{RESSD} = \sqrt{\frac{\sum_{i=1}^{N} \mbox{RES}_{i}^{2}} {\mbox{RESDF}}} \)

where \( N \) is the number of observations.

If there is replication in the independent variables, the replication standard deviation and corresponding degrees of freedom are printed. In addition, a lack of fit F test is performed. These are stored in the parameters REPSD, REPDF, and LOFCDF, respectively. The formulas are:

Dataplot saves the predicted values from a fit in the variable PRED and the residual values in the variable RES. These variables can be used in subsequent LET and PLOT commands to generate diagnostic plots of residuals and predicted values.
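As a worked check of the replication and lack of fit quantities, the Python sketch below recomputes the lack of fit F ratio from the summary values printed in the sample output of the Program section (sample size 107, 39 distinct subsets, RESSD 7.86476, REPSD 6.47902). It assumes the standard decomposition of the residual sum of squares into pure error and lack of fit; the variable names are hypothetical, and this is not necessarily how Dataplot organizes the computation.

```python
# Worked check using the summary values from the sample output.
# Assumes the standard pure-error / lack-of-fit decomposition.

n_obs = 107          # Sample Size
n_distinct = 39      # Number of Distinct Subsets
n_params = 2         # A0 and A1
ressd = 7.86476      # Residual Standard Deviation
repsd = 6.47902      # Replication Standard Deviation

resdf = n_obs - n_params          # residual degrees of freedom
repdf = n_obs - n_distinct        # replication degrees of freedom
lof_df = resdf - repdf            # lack of fit degrees of freedom

res_ss = ressd ** 2 * resdf       # residual sum of squares
rep_ss = repsd ** 2 * repdf       # pure error (replication) sum of squares
lof_ss = res_ss - rep_ss          # lack of fit sum of squares

lof_f = (lof_ss / lof_df) / (rep_ss / repdf)
print(resdf, repdf, lof_df, round(lof_f, 4))
```

The degrees of freedom come out as 105, 68, and 37, matching the sample output, and the F ratio agrees with the printed value of 2.34374 to within the rounding of the inputs. LOFCDF is then the F cumulative distribution function evaluated at this ratio with 37 and 68 degrees of freedom, which is not computed here.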

Syntax:

The estimated parameters are stored in A0, A1, ... , Ak.

If <d> is omitted, a linear fit is performed. In practice, the linear and quadratic fits receive heavy use while the other degrees are rarely used.

Examples:

Note:

Weighting is one approach for dealing with non-constant variation in the residuals. It is not uncommon for the variance of the residuals to increase for the largest (or smallest) values of the independent variable. In this case, weights can be used to give less weight to the less precise measurements. The NIST/SEMATECH e-Handbook contains a discussion of weighted fits and an example of using weights to address non-constant variation.

Weights can also be used to implement certain types of robust fitting. In this case, weights are used to downweight observations based on the size of the associated residual. Outlier observations can sometimes distort a fit (i.e., in trying to fit the outlier point(s), the bulk of the data is poorly fit). Weighting based on the residuals can often provide a good fit to the bulk of the data without eliminating the outlier observations from the analysis.

Enter HELP WEIGHTS and HELP BIWEIGHT for examples of this use of weighted fits in Dataplot.

To specify weights for a least squares fit, enter the command

WEIGHTS <var>

where <var> is a variable containing the weights.

Note that the RES variable contains the absolute value of the residuals after the fit. For residual plots and analysis, it may be preferable to work with the weighted residuals. You can create these with the command

LET RESW = W*RES

where W contains the weight variable.
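To make the effect of weights concrete, here is a small Python sketch of a weighted straight-line fit that minimizes the weighted sum of squared residuals. It illustrates the general technique only and is not Dataplot's implementation; the function name `weighted_line_fit` and the data are hypothetical.

```python
# Illustrative weighted least squares for y = a0 + a1*x.
# Minimizes sum(w_i * (y_i - a0 - a1*x_i)**2); not Dataplot code.

def weighted_line_fit(x, y, w):
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw   # weighted means
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    a1 = sxy / sxx
    a0 = ybar - a1 * xbar
    return a0, a1


# Four points on y = 2x plus one gross outlier at x = 5.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 30.0]

equal = weighted_line_fit(x, y, [1.0] * 5)                 # outlier pulls the line
down = weighted_line_fit(x, y, [1.0, 1.0, 1.0, 1.0, 0.0])  # outlier ignored
```

With equal weights the outlier drags the slope well above 2. Giving the outlier zero weight recovers the line through the other four points, which is the limiting case of the downweighting that robust schemes apply more gradually.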

Note:

BEST CP

C_p

Another approach is to generate principal components of the independent variables and to perform the fit based on the first several principal components. Although this approach can reduce problems introduced by multicollinearity, the downside is that the model may be less interpretable.

Note:

Enter HELP PARTIAL RESIDUAL PLOT, HELP PARTIAL REGRESSION PLOT, HELP PARTIAL LEVERAGE PLOT, or HELP CCPR PLOT for details. The Program example in HELP REGRESSION DIAGNOSTICS also gives an example of using these plots.

Note:

The Program example in HELP REGRESSION DIAGNOSTICS also gives an example of using these commands.

Note:

SET FIT ADDITIVE CONSTANT OFF

To restore the default of including the constant term, enter

SET FIT ADDITIVE CONSTANT ON

Note:

https://www.itl.nist.gov/div898/handbook/pmd/section4/pmd452.htm

Data transformations can be generated easily if needed via the LET command. The BOX-COX LINEARITY PLOT can be a useful command for determining an appropriate transformation.

Some analysts prefer to standardize the independent variables and the dependent variable by subtracting the mean and dividing by the standard deviation. This is done to provide numerical stability (note that Dataplot scales the data internally before performing the regression calculations) and also so that the data and regression coefficients are on a common scale. The original and standardized variables are related as follows

\( x_{ik}^{'} = \frac{x_{ik} - \bar{x}_{k}} {s_{x_k}} \)

\( y_{i}^{'} = \frac{y_{i} - \bar{y}}{s_{y}} \)

with \( \bar{x}_k \) and \( s_{x_k} \) denoting the mean and standard deviation of the k-th independent variable and \( \bar{y} \) and \( s_y \) denoting the mean and standard deviation of the dependent variable.

The parameters of the original model are recovered from the parameters of the standardized model by

\( \beta_{k} = \frac{s_{y}} {s_{x_k}} \beta_{k}^{'} \), \( k = 1, \ldots, p \)

\( \beta_{0} = \bar{y} - \beta_{1} \bar{x}_1 - \ldots - \beta_{p} \bar{x}_p \)

A variation on this is the correlation transformation (also called the standardized regression model). Specifically

\( x_{ik}^{'} = \frac{1}{\sqrt{n-1}} \frac{x_{ik} - \bar{x}_{k}} {s_{x_k}} \)

\( y_{i}^{'} = \frac{1}{\sqrt{n-1}} \frac{y_{i} - \bar{y}} {s_{y}} \)

With this transformation, the \( X'X \) matrix reduces to a correlation matrix of the independent variables. If there are \( p \) independent variables, these transformations can be generated with the commands

 
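. P MUST ALREADY CONTAIN THE NUMBER OF INDEPENDENT VARIABLES (X1, X2, ...)
LET N = SIZE Y
LET AFACT = 1/SQRT(N-1)
. TRANSFORM EACH INDEPENDENT VARIABLE
LOOP FOR K = 1 1 P
    LET Z^K = STANDARDIZE X^K
    LET Z^K = AFACT*Z^K
END OF LOOP

. TRANSFORM THE DEPENDENT VARIABLE
LET YT = STANDARDIZE Y
LET YT = AFACT*YT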
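The claim that the correlation transformation turns \( X'X \) into a correlation matrix can be checked numerically. The Python sketch below applies the transformation to two made-up variables and forms the cross-product matrix; the helper names `correlation_transform` and `dot` are hypothetical and the data are for illustration only.

```python
# Illustrative check that the correlation transformation turns X'X into
# the correlation matrix of the independent variables.

def correlation_transform(v):
    n = len(v)
    mean = sum(v) / n
    sd = (sum((a - mean) ** 2 for a in v) / (n - 1)) ** 0.5
    # (v - mean) / sd, then scaled by 1/sqrt(n - 1)
    return [(a - mean) / (sd * (n - 1) ** 0.5) for a in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Made-up data for two independent variables.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0]

z1 = correlation_transform(x1)
z2 = correlation_transform(x2)

# Cross-product matrix of the transformed variables (no constant column).
xtx = [[dot(z1, z1), dot(z1, z2)],
       [dot(z2, z1), dot(z2, z2)]]
```

The diagonal entries equal 1 and the off-diagonal entry equals the Pearson correlation between x1 and x2, up to floating point error.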

Note:

https://www.itl.nist.gov/div898/handbook/eda/section2/eda2.htm

In addition, if there is a single independent variable in the model, it can be useful to plot the data with the fitted values overlaid.

Linear fits allow a much richer set of diagnostics. For a fuller description and an example demonstrating these, enter

HELP REGRESSION DIAGNOSTICS

Note:

SET FIT AUXILLARY FILES OFF

Note:

SET AUXILLARY FILES DECIMAL POINTS <value>

where the default is 7.

Default:

None

Synonyms:

None

Related Commands:

FIT	=	Generate a non-linear fit.
PRED	=	A variable where predicted values are stored.
RES	=	A variable where residuals are stored.
RESSD	=	A parameter where the residual standard deviation is stored.
RESDF	=	A parameter where the residual degrees of freedom is stored.
REPSD	=	A parameter where the replication standard deviation is stored.
REPDF	=	A parameter where the replication degrees of freedom is stored.
LOFCDF	=	A parameter where the lack of fit cdf is stored.
WEIGHTS	=	Sets the weights for the fit command.
BIWEIGHT	=	Perform a biweight transformation.
EXACT RATIONAL FIT	=	Perform an exact rational fit.
CALIBRATION	=	Perform a linear or quadratic calibration fit.
LOWESS	=	Perform a locally weighted least squares smoothing.
BOOTSTRAP FIT	=	Perform a linear or multi-linear fit based on the bootstrap.
ORTHOGONAL DISTANCE FIT	=	Perform an orthogonal distance fit (useful for errors-in-variables models).
SPLINE FIT	=	Perform a spline fit.
SMOOTH	=	Perform a smoothing.
ANOVA	=	Perform a fixed effects analysis of variance.
MEDIAN POLISH	=	Perform a median polish.
PLOT	=	Generate a data/function plot.
4-PLOT	=	Generate a 4-plot.

References:

John Wiley

Mosteller and Tukey (1977), "Data Analysis and Regression", Addison-Wesley.

Cook and Weisberg (1982), "Residuals and Influence in Regression", Chapman and Hall.

Belsley, Kuh, and Welsch (1980), "Regression Diagnostics", John Wiley.

Neter, Wasserman, and Kutner (1990), "Applied Linear Statistical Models", 3rd ed., Irwin.

Note that linear regression is covered in great detail in many statistics textbooks.

Applications:

Fitting

Implementation Date:

Program:

 
. ALASKA PIPELINE RADIOGRAPHIC DEFECT BIAS CURVE
. PERFORM A LINEAR REGRESSION
SKIP 25
READ BERGER1.DAT TRUE MEAS BATCH
FIT MEAS TRUE
.
TITLE OFFSET 2
TITLE CASE ASIS
LABEL CASE ASIS
CASE ASIS
.
TITLE Original Data with Predicted Values
X1LABEL True Depth (in .001 inch)
Y1LABEL Measured Depth
CHARACTERS X
LINES BLANK
.
PLOT MEAS PRED VS TRUE
.
LABEL
TITLE
MULTIPLOT CORNER COORDINATES 0 0 100 100
SET 4-PLOT MULTIPLOT ON
TIC MARK LABEL SIZE 4
CHARACTER SIZE 4
.
4-PLOT RES
.
END OF MULTIPLOT
JUSTIFICATION CENTER
MOVE 50 97
TEXT 4-Plot of Residuals (BERGER1.DAT)

             Least Squares Multilinear Fit
  
 Sample Size:                                        107
 Number of Variables:                                  1
 Residual Standard Deviation:                    7.86476
 Residual Degrees of Freedom:                        105
 BIC:                                          448.67856
  
 Replication Case:
 Replication Standard Deviation:                 6.47902
 Replication Degrees of Freedom:                      68
 Number of Distinct Subsets:                          39
 Lack of Fit F Ratio:                            2.34374
 Lack of Fit F CDF (%):                         99.88354
 Lack of Fit Degrees of Freedom 1:                    37
 Lack of Fit Degrees of Freedom 2:                    68
  
 --------------------------------------------------------------------
                                                Approximate
            Parameter Estimates          Standard Deviation   t-Value
 --------------------------------------------------------------------
   1  A0                       -1.96750             1.57479   -1.2494
   2  A1        TRUE            1.22297             0.04107   29.7781
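The t-values in this table are the parameter estimates divided by their approximate standard deviations. The short Python check below recomputes them from the printed values; the dictionary names are hypothetical, and small differences in the last digits are expected because the printed inputs are rounded.

```python
# Recompute the t-values from the printed estimates and standard deviations.
estimates = {"A0": -1.96750, "A1": 1.22297}
std_devs = {"A0": 1.57479, "A1": 0.04107}

t_values = {name: estimates[name] / std_devs[name] for name in estimates}

for name, t in t_values.items():
    print(name, round(t, 3))
```

The slope A1 has a t-value near 30 and is clearly significant, while the constant term A0 has a t-value near -1.25, which would not usually be judged significant at the 5% level with 105 residual degrees of freedom.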

plot generated by sample program