REGRESSION DIAGNOSTICS

The term "regression diagnostics" is typically used to denote diagnostic statistics and plots that go beyond the standard residual analysis. This is a brief discussion of regression diagnostics with respect to linear fits performed with a non-iterative algorithm. Regression diagnostics are used to identify outliers in the dependent variable, to identify influential points, to identify high leverage points, or, more generally, to uncover features of the data that could cause difficulties for the fitted regression model.

One purpose of many regression diagnostics is to identify observations with either high leverage or high influence. High leverage points are outliers with respect to the independent variables. Influential points are those that cause large changes in the parameter estimates when they are deleted. Although an influential point will typically have high leverage, a high leverage point is not necessarily an influential point. The leverage is typically defined as the diagonal of the hat matrix \( H = X(X'X)^{-1}X' \).

The books by Belsley, Kuh, and Welsch and by Cook and Weisberg listed in the References section discuss regression diagnostics in detail, including the mathematical derivations for the diagnostics covered here. Chapter 11 of the Neter, Wasserman, and Kutner book listed in the References section discusses regression diagnostics in a less theoretical way. We give only a brief overview of the various regression diagnostics. For more thorough guidance on when these diagnostics are appropriate and how to interpret them, see one of these texts (or another regression text that covers these topics).

At a minimum, diagnostic analysis of a linear fit should include the various plots of the residuals and predicted values described above. For more complete diagnostics, Dataplot writes a number of measures of influence and leverage to the file dpst3f.dat. Specifically, this file contains

  1. the diagonal of the hat matrix (the leverage values)
  2. the variance of the residuals
  3. the standardized residuals
  4. the internally studentized residuals
  5. the deleted residuals
  6. the externally studentized residuals
  7. Cook's distance
  8. the DFFITS values

Note that the program example below is meant to show the mechanics of the various plots and commands; it is not intended to be treated as a case study.
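For readers who want to cross-check these quantities outside of Dataplot, the sketch below computes the same leverage and influence measures from first principles with numpy. The function name and variables are mine, not Dataplot's; the formulas are the standard no-refit identities discussed in this section.

```python
import numpy as np

def regression_diagnostics(X, y):
    """Leverage and influence measures for an OLS fit.

    X should already contain the constant column.  Returns the hat
    diagonal, internally/externally studentized residuals, deleted
    residuals, Cook's distance, and DFFITS.
    """
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    h = np.sum((X @ XtX_inv) * X, axis=1)      # diagonal of H = X(X'X)^-1 X'
    beta = XtX_inv @ X.T @ y
    e = y - X @ beta                           # ordinary residuals
    mse = e @ e / (n - p)
    stud_int = e / np.sqrt(mse * (1 - h))      # internally studentized
    deleted = e / (1 - h)                      # deleted residuals
    # MSE with case i removed, via the standard no-refit identity
    mse_del = ((n - p) * mse - e**2 / (1 - h)) / (n - p - 1)
    stud_ext = e / np.sqrt(mse_del * (1 - h))  # externally studentized
    cook = stud_int**2 * h / (p * (1 - h))     # Cook's distance
    dffits = stud_ext * np.sqrt(h / (1 - h))   # DFFITS
    return h, stud_int, deleted, stud_ext, cook, dffits

# tiny illustration on made-up data
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(12), rng.normal(size=12)])
y = rng.normal(size=12)
h, stud_int, deleted, stud_ext, cook, dffits = regression_diagnostics(X, y)
```

A useful sanity check is that the leverages sum to the number of parameters and that the deleted residual for case i equals the prediction error from a fit with case i actually removed.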
Additional diagnostic statistics can be computed from these values. Several of the texts in the References section below discuss the use and interpretation of these statistics in more detail. These variables can be read in as follows:
SET READ FORMAT 8E15.7
READ DPST3F.DAT HII VARRES STDRES ISTUDRES DELRES ...
     ESTUDRES COOK DFFITS
SKIP 0

Many analysts prefer to use the standardized residuals or the internally studentized residuals for the basic residual analysis. Deleted residuals and externally studentized residuals are used to identify outlying Y observations that the ordinary residuals do not identify (cases with high leverage tend to generate small residuals even if they are outliers).

Many regression diagnostics depend on the hat matrix \( H = X(X'X)^{-1}X' \). Dataplot has limits on the maximum number of rows/columns for a matrix, which may prohibit the creation of the full hat matrix for some problems. Fortunately, most of the relevant statistics can be derived from the diagonal elements of this matrix (which can be read from the dpst3f.dat file). These diagonal elements are referred to as the leverage values. The minimum leverage is \( 1/n \), the maximum leverage is 1.0, and the average leverage is \( p/n \), where \( p \) is the number of parameters in the fit. As a rule of thumb, leverage values greater than twice the average leverage can be considered high leverage. High leverage points are outliers with respect to the \( X \) matrix and have an unduly large influence on the predicted values. High leverage points also tend to have small residuals, so they are not typically detected by standard residual analysis.

The DFFITS values measure the influence that observation i has on the i-th predicted value. As a rule of thumb, for small or moderate size data sets, absolute values greater than 1 indicate an influential case; for large data sets, absolute values greater than \( 2\sqrt{p/n} \) indicate influential cases. Cook's distance measures the combined impact of the i-th observation on all of the regression coefficients. It is typically compared to the F distribution; values near or above the 50th percentile imply that the observation has substantial influence on the regression parameters.
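The rules of thumb above are easy to apply once the leverage and DFFITS values are available. The small helper below is a sketch in numpy (names and example values are mine); the cutoffs are guidelines, not formal tests.

```python
import numpy as np

def flag_cases(h, dffits, p, n, large_sample=True):
    """Flag high-leverage and influential cases by rule of thumb.

    h       : leverage values (diagonal of the hat matrix)
    dffits  : DFFITS values
    p, n    : number of parameters and number of observations
    """
    high_leverage = h > 2.0 * p / n              # > twice the average leverage p/n
    cutoff = 2.0 * np.sqrt(p / n) if large_sample else 1.0
    influential = np.abs(dffits) > cutoff        # DFFITS rule of thumb
    return high_leverage, influential

# made-up leverages and DFFITS for 4 of n = 20 cases
h = np.array([0.08, 0.12, 0.55, 0.10])
dffits = np.array([0.2, -0.1, 1.9, 0.05])
lev, infl = flag_cases(h, dffits, p=2, n=20)
```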
The DFFITS values and Cook's distance are used to identify high leverage points that are also influential (in the sense that they have a large effect on the fitted model). Once these variables have been read in, they can be printed, plotted, and used in further calculations like any other variable, as demonstrated in the program example below. They can also be used to derive additional diagnostic statistics. For example, the program example shows how to compute the Mahalanobis distance and Cook's V statistic. The Mahalanobis distance measures the distance of an observation from the "center" of the observations and is essentially an alternative measure of leverage. Cook's V statistic is the ratio of the variance of the predicted values to the variance of the residuals.

Another popular diagnostic is the DFBETA statistic. Like Cook's distance, it tries to identify points that have a large influence on the estimated regression coefficients; the distinction is that DFBETA assesses the influence on each individual parameter rather than on the parameters as a whole. For easy computation, the DFBETA statistics require the catcher matrix, which is described in the multi-collinearity section below. The usual recommendation is that absolute DFBETA values larger than 1 for small and medium size data sets, and larger than \( 2/\sqrt{n} \) for large data sets, indicate an influential point.

The variables written to the file dpst3f.dat are calculated without performing any additional regressions. The statistics based on a case being deleted are computed from mathematically equivalent formulas that do not require additional fits. The Neter, Wasserman, and Kutner text gives the formulas that are actually used.

Robust regression is often recommended when there are significant outliers in the data. One common robust technique is iteratively reweighted least squares (IRLS), also known as M-estimation.
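The claim that the deleted-case statistics need no additional regressions can be checked directly. The numpy sketch below (my notation, not Dataplot's internal code) computes one DFBETA value from the catcher matrix and the externally studentized residual, and compares it to a brute-force fit with the case actually deleted.

```python
import numpy as np

# made-up data: constant plus two predictors
rng = np.random.default_rng(1)
n, p = 12, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
C = X @ XtX_inv                               # catcher matrix
b = XtX_inv @ X.T @ y
e = y - X @ b
h = np.sum(C * X, axis=1)                     # leverage values
mse = e @ e / (n - p)
# deleted MSE via the standard no-refit identity
mse_del = ((n - p) * mse - e**2 / (1 - h)) / (n - p - 1)
t = e / np.sqrt(mse_del * (1 - h))            # externally studentized residuals

i, k = 0, 1
# DFBETA from quantities of the full fit only
dfbeta = C[i, k] * t[i] / np.sqrt(XtX_inv[k, k] * (1 - h[i]))

# brute force: delete case i, refit, standardize the change in b_k
b_del = np.linalg.lstsq(np.delete(X, i, 0), np.delete(y, i), rcond=None)[0]
dfbeta_brute = (b[k] - b_del[k]) / np.sqrt(mse_del[i] * XtX_inv[k, k])
```

The two values agree because the change in the coefficient vector from deleting case i is \( (X'X)^{-1} x_i e_i / (1 - h_i) \), which involves only full-fit quantities.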
Note that these techniques protect against outlying Y values, but they do not protect against outliers in the X matrix. Also, they test for single outliers and are less sensitive to a group of similar outliers. See the documentation for WEIGHTS, BIWEIGHT, and TRICUBE for more information on performing IRLS regression in Dataplot. Techniques for protecting against outliers in the X matrix use alternatives to least squares. Two such methods are least median of squares regression (also called LMS regression) and least trimmed squares regression (also called LTS regression). Dataplot does not support these techniques at this time. The documentation for the WEIGHTS command in the Support chapter discusses one approach for dealing with outliers in the X matrix in the context of IRLS.
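The IRLS idea mentioned above can be sketched in a few lines. This is a minimal illustration with Tukey's biweight (one common weight function), not Dataplot's implementation; the data and tuning constant are my choices.

```python
import numpy as np

def irls_biweight(X, y, c=4.685, iters=25):
    """Iteratively reweighted least squares with Tukey's biweight.

    Cases with large scaled residuals are downweighted (to zero beyond
    the cutoff), protecting the fit against outlying y values.
    """
    w = np.ones(len(y))
    for _ in range(iters):
        WX = X * w[:, None]
        beta = np.linalg.solve(X.T @ WX, X.T @ (w * y))   # weighted normal equations
        e = y - X @ beta
        s = np.median(np.abs(e - np.median(e))) / 0.6745  # robust scale (MAD)
        if s < 1e-12:
            break
        u = e / (c * s)
        w = np.where(np.abs(u) < 1.0, (1.0 - u**2) ** 2, 0.0)  # biweight weights
    return beta

# straight line with one gross outlier in y (made-up data)
rng = np.random.default_rng(3)
x = np.linspace(0, 1, 30)
X = np.column_stack([np.ones(30), x])
y = 2.0 + 3.0 * x + rng.normal(0, 0.05, 30)
y[5] += 25.0                                  # outlying y value
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
beta_irls = irls_biweight(X, y)
```

On this example the ordinary least squares slope is pulled far from 3 by the single outlier, while the IRLS fit effectively ignores it, which is exactly the behavior (and the limitation: only outlying Y values are handled) described above.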
The VIF values are often given as their reciprocals (this is called the tolerance). Fortunately, these values can be computed without performing any additional regressions. The computing formulas are based on the catcher matrix, which is \( C = X(X'X)^{-1} \). The equation is:

\( \mathrm{VIF}_j = \left( \sum_{i=1}^{n} c_{ij}^2 \right) \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2 \)

where \( c_{ij} \) denotes the elements of the catcher matrix. Another measure of multi-collinearity is the set of condition indices. The condition indices are calculated as follows:

\( \eta_k = \mu_{\max} / \mu_k \)

where \( \mu_k \) denotes the k-th singular value of the column-scaled \( X \) matrix.
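The catcher-matrix formula for the VIF can be checked against the defining relation \( \mathrm{VIF}_j = 1/(1 - R_j^2) \), where \( R_j^2 \) comes from regressing the j-th predictor on the remaining predictors. The numpy sketch below (my notation; deliberately correlated made-up data) does this check and also computes the condition indices from the singular values of the column-scaled X matrix.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
x2 = rng.normal(size=n)
x3 = 0.8 * x2 + 0.6 * rng.normal(size=n)    # deliberately correlated with x2
X = np.column_stack([np.ones(n), x2, x3])

C = X @ np.linalg.inv(X.T @ X)              # catcher matrix
j = 1                                       # VIF for the x2 column
vif_catcher = np.sum(C[:, j] ** 2) * np.sum((X[:, j] - X[:, j].mean()) ** 2)

# defining relation: regress x_j on the remaining columns (constant included)
others = np.delete(X, j, axis=1)
fit = others @ np.linalg.lstsq(others, X[:, j], rcond=None)[0]
r2 = 1 - np.sum((X[:, j] - fit) ** 2) / np.sum((X[:, j] - X[:, j].mean()) ** 2)
vif_def = 1.0 / (1.0 - r2)

# condition indices: largest singular value of the scaled X over each one
svals = np.linalg.svd(X / np.sqrt(np.sum(X**2, axis=0)), compute_uv=False)
cond_indices = svals.max() / svals
```

The two VIF values agree because the column sums of squares of the catcher matrix reproduce the diagonal of \( (X'X)^{-1} \); no auxiliary regression is actually needed.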
REFERENCES

Belsley, Kuh, and Welsch (1980), "Regression Diagnostics," John Wiley.

Cook and Weisberg (1982), "Residuals and Influence in Regression," Chapman and Hall.

Neter, Wasserman, and Kutner (1990), "Applied Linear Statistical Models," 3rd ed., Irwin.

PROGRAM

. Note: this program example is meant simply to show how to create
. the various plots, intervals, and statistics.  It is not a
. case study.
.
. ZARTHAN COMPANY EXAMPLE FROM
. "APPLIED LINEAR STATISTICAL MODELS", BY NETER, WASSERMAN, KUTNER
.
. Y  = SALES
. X1 = CONSTANT TERM
. X2 = TARGET POPULATION (IN THOUSANDS)
. X3 = PER CAPITA DISCRETIONARY INCOME (DOLLARS)
.
. Step 1: Read the data
.
DIMENSION 500 COLUMNS
LET NVAR = 2
READ DISTRICT Y X2 X3
1 162 274 2450
2 120 180 3254
3 223 375 3802
4 131 205 2838
5 67 86 2347
6 169 265 3782
7 81 98 3008
8 192 330 2450
9 116 195 2137
10 55 53 2560
11 252 430 4020
12 232 372 4427
13 144 236 2660
14 103 157 2088
15 212 370 2605
END OF DATA
.
LET N = SIZE Y
LET X1 = 1 FOR I = 1 1 N
LET STRING SY = Sales
LET STRING SX2 = Population
LET STRING SX3 = Income
LET P = NVAR + 1
.
. Step 2: Basic preliminary plots
.         1) Independent against dependent
.         2) Independent against independent
.
TITLE OFFSET 2
TITLE AUTOMATIC
TITLE CASE ASIS
LABEL CASE ASIS
CASE ASIS
.
LINE BLANK
CHARACTER CIRCLE
CHARACTER HW 1 0.75
CHARACTER FILL ON
Y1LABEL DISPLACEMENT 12
X1LABEL DISPLACEMENT 10
X2LABEL DISPLACEMENT 15
.
MULTIPLOT CORNER COORDINATES 5 5 95 95
MULTIPLOT SCALE FACTOR 2
MULTIPLOT 2 2
.
Y1LABEL ^SY
LOOP FOR K = 2 1 P
LET RIJ = CORRELATION Y X^K
LET RIJ = ROUND(RIJ,3)
X1LABEL ^SX^K
X2LABEL Correlation: ^RIJ
PLOT Y VS X^K
END OF LOOP
.
LOOP FOR K = 2 1 P
LET IK1 = K + 1
Y1LABEL ^SX^K
LOOP FOR J = IK1 1 P
LET RIJ = CORRELATION X^K X^J
LET RIJ = ROUND(RIJ,3)
X1LABEL ^SX^J
X2LABEL Correlation: ^RIJ
PLOT X^K VS X^J
END OF LOOP
END OF LOOP
END OF MULTIPLOT
LABEL
.
JUSTIFICATION CENTER
MOVE 50 97
TEXT Basic Preliminary Plots
.
. Step 3: Generate the fit
.
SET WRITE DECIMALS 5
SET LIST NEW WINDOW OFF
FEEDBACK OFF
CAPTURE PROGRAM3.OUT
FIT Y X2 X3
WRITE " "
WRITE " "
WRITE " "
WRITE "     ANOVA Table"
WRITE " "
LIST dpst5f.dat
WRITE " "
WRITE " "
CAPTURE SUSPEND
.
. Step 3b: Generate the basic residual analysis
.
TITLE SIZE 4
TIC MARK LABEL SIZE 4
CHARACTER HW 2 1.5
SET 4PLOT MULTIPLOT ON
TITLE AUTOMATIC
4-PLOT RES
TITLE SIZE 2
TIC MARK LABEL SIZE 2
CHARACTER HW 1 0.75
.
MULTIPLOT 2 2
TITLE AUTOMATIC
Y1LABEL Predicted Values
X1LABEL Response Values
PLOT PRED VS Y
Y1LABEL Residuals
X1LABEL Predicted Values
PLOT RES VS PRED
LOOP FOR K = 2 1 P
Y1LABEL Residuals
X1LABEL ^SX^K
PLOT RES VS X^K
END OF LOOP
END OF MULTIPLOT
LABEL
TITLE
.
. Step 3c: Read the information written to auxiliary files.  Some of
.          these values will be used later in this macro.
.
SKIP 1
READ dpst1f.dat COEF COEFSD TVALUE PARBONLL PARBONUL
READ dpst2f.dat PREDSD PRED95LL PRED95UL PREDBLL PREDBUL PREDHLL PREDHUL
READ dpst3f.dat HII VARRES STDRES STUDRES DELRES ESTUDRES COOK DFFITS
READ dpst4f.dat TEMP1 TEMP2
LET S2B = VARIABLE TO MATRIX TEMP1 P
LET XTXINV = VARIABLE TO MATRIX TEMP2 P
DELETE TEMP1 TEMP2
SKIP 0
.
. Step 4: Partial Residual and Partial Regression Plots; CCPR and
.         Partial Leverage Plots not generated
.
MULTIPLOT 2 2
TITLE AUTOMATIC
Y1LABEL Residuals + A1*X2
X1LABEL X2
PARTIAL RESIDUAL PLOT Y X2 X3 X2
Y1LABEL Residuals + A2*X3
X1LABEL X3
PARTIAL RESIDUAL PLOT Y X2 X3 X3
.
Y1LABEL Residuals: X2 Removed
X1LABEL Residuals: X2 Versus X3
PARTIAL REGRESSION PLOT Y X2 X3 X2
Y1LABEL Residuals: X3 Removed
X1LABEL Residuals: X3 Versus X2
PARTIAL REGRESSION PLOT Y X2 X3 X3
END OF MULTIPLOT
TITLE
LABEL
JUSTIFICATION CENTER
MOVE 50 97
TEXT Partial Residual and Partial Regression Plots
.
. Step 5: Create a function to calculate the regression estimate
.         for new data (F)
.
LET A0 = COEF(1)
LET FUNCTION F = A0
LET DUMMY = PREDSD(1)
.
. Use (2*NVAR) rather than (2*P) if no constant term in
. joint interval
.
LOOP FOR K = 1 1 NVAR
LET INDX = K + 1
LET A^K = COEF(INDX)
LET FUNCTION F = F + (A^K)*(Z^K)
END OF LOOP
.
LET Z1 = DATA 220 375
LET Z2 = DATA 2500 3500
LET YNEW = F
.
CAPTURE RESUME
PRINT " "
PRINT " "
PRINT "NEW X VALUES, ESTIMATED NEW VALUE"
PRINT Z1 Z2 YNEW
CAPTURE SUSPEND
.
. Step 5b: Print the X'X inverse matrix and parameter
.          variance-covariance matrix
.
CAPTURE RESUME
PRINT " "
PRINT " "
PRINT "THE X'X INVERSE MATRIX"
PRINT XTXINV
.
PRINT " "
PRINT " "
PRINT "THE PARAMETER VARIANCE-COVARIANCE MATRIX"
PRINT S2B
CAPTURE SUSPEND
.
. Step 5c: Calculate:
.          1) The variance of a new point (S2YHAT)
.          2) A confidence interval for a new point
.          3) A joint confidence interval for more than one point
.          4) A prediction interval for a new point
.          5) A Scheffe joint prediction interval for more than one point
.
LET NPT = SIZE YNEW
LET NP = N - P
LOOP FOR IK = 1 1 NPT
LET XNEW(1) = 1
LET XNEW(2) = Z1(IK)
LET XNEW(3) = Z2(IK)
LOOP FOR K = 1 1 P
LET DUMMY2 = VECTOR DOT PRODUCT XNEW S2B^K
LET SUMZ(K) = DUMMY2
END OF LOOP
LET S2YHAT = VECTOR DOT PRODUCT SUMZ XNEW
LET S2YPRED = MSE + S2YHAT
LET YHATS2(IK) = S2YHAT
LET YPREDS2(IK) = S2YPRED
LET SYHAT = SQRT(S2YHAT)
LET YHATS(IK) = SYHAT
LET SYPRED = SQRT(S2YPRED)
LET YPREDS(IK) = SYPRED
LET YHAT = YNEW(IK)
CAPTURE RESUME
PRINT " "
PRINT " "
PRINT "THE PREDICTED VALUE FOR THE NEW POINT = ^YHAT"
PRINT "THE VARIANCE OF THE NEW VALUE = ^S2YHAT"
PRINT "THE VARIANCE FOR PREDICTION INTERVALS = ^S2YPRED"
CAPTURE SUSPEND
LET T = TPPF(.975,NP)
LET YHATU = YHAT + T*SYHAT
LET YHATL = YHAT - T*SYHAT
LET YPREDU = YHAT + T*SYPRED
LET YPREDL = YHAT - T*SYPRED
CAPTURE RESUME
PRINT " "
PRINT "95% CONFIDENCE INTERVAL FOR YHAT: ^YHATL <= YHAT <= ^YHATU"
PRINT "95% PREDICTION INTERVAL FOR YHAT: ^YPREDL <= YHAT <= ^YPREDU"
CAPTURE SUSPEND
END OF LOOP
.
LET ALPHA = 0.10
LET DUMMY = 1 - ALPHA/(2*NPT)
LET B = TPPF(DUMMY,NP)
LET JOINTBU = YNEW + B*YHATS
LET JOINTBL = YNEW - B*YHATS
CAPTURE RESUME
PRINT " "
PRINT "90% BONFERRONI JOINT CONFIDENCE INTERVALS FOR NEW VALUES"
PRINT JOINTBL YNEW JOINTBU
CAPTURE SUSPEND
LET W = P*FPPF(.90,P,NP)
LET W = SQRT(W)
LET JOINTWU = YNEW + W*YHATS
LET JOINTWL = YNEW - W*YHATS
CAPTURE RESUME
PRINT " "
PRINT "90% HOTELLING JOINT CONFIDENCE INTERVALS FOR NEW VALUES"
PRINT JOINTWL YNEW JOINTWU
CAPTURE SUSPEND
LET JOINTBU = YNEW + B*YPREDS
LET JOINTBL = YNEW - B*YPREDS
CAPTURE RESUME
PRINT " "
PRINT "90% BONFERRONI JOINT PREDICTION INTERVALS FOR NEW VALUES"
PRINT JOINTBL YNEW JOINTBU
CAPTURE SUSPEND
LET S = NPT*FPPF(.90,NPT,NP)
LET S = SQRT(S)
LET JOINTSU = YNEW + S*YPREDS
LET JOINTSL = YNEW - S*YPREDS
CAPTURE RESUME
PRINT " "
PRINT "90% SCHEFFE JOINT PREDICTION INTERVALS FOR NEW VALUES"
PRINT JOINTSL YNEW JOINTSU
CAPTURE SUSPEND
.
. Step 6: Plot various diagnostic statistics
.
. Derive Cook's V statistic and Mahalanobis distance
.
LET V = (PREDSD**2)/(RESSD**2)
LET MAHAL = ((HII-1/N)/(1-HII))*(N*(N-2)/(N-1))
LET HBAR = P/N
LET DUMMY = SUM HII
SET WRITE FORMAT 5F10.5
CAPTURE RESUME
PRINT " "
PRINT "      HII      COOK    DFFITS       V     MAHAL"
PRINT HII COOK DFFITS V MAHAL
CAPTURE SUSPEND
SET WRITE FORMAT
.
. Step 6b: Plot various diagnostic statistics
.
. Plot various residuals
.
. Compute the PRESS statistic (sum of squared deleted residuals)
. used in the plot label below
LET DELRES2 = DELRES**2
LET PRESSP = SUM DELRES2
.
X1LABEL
XLIMITS 0 15
MAJOR XTIC MARK NUMBER 4
XTIC OFFSET 0 1
MULTIPLOT 2 2
TITLE STANDARDIZED RESIDUALS
PLOT STDRES
TITLE INTERNALLY STUDENTIZED RESIDUALS
PLOT STUDRES
TITLE DELETED RESIDUALS
X1LABEL PRESS STATISTIC = ^PRESSP
PLOT DELRES
X1LABEL
TITLE EXTERNALLY STUDENTIZED RESIDUALS
PLOT ESTUDRES
END OF MULTIPLOT
.
. Plot several diagnostic statistics
.
MULTIPLOT 2 2
CHARACTER FILL ON OFF
CHARACTER CIRCLE BLANK
LINE BLANK SOLID DOTTED
TITLE PLOT OF LEVERAGE POINTS
Y1LABEL
X1LABEL DOTTED LINE AT 2*AVERAGE LEVERAGE
YTIC OFFSET 0 0.1
LET TEMP6 = DATA 1 N
LET DUMMY = 2*HBAR
LET TEMP4 = DATA DUMMY DUMMY
LET TEMP5 = DATA HBAR HBAR
SPIKE ON
SPIKE BASE HBAR
SPIKE LINE DOTTED
.
PLOT HII AND
PLOT TEMP4 TEMP5 VS TEMP6
SPIKE OFF
YTIC OFFSET 0 0
.
CHARACTER CIRCLE BLANK BLANK
LINE BLANK DOTTED DOTTED
Y1LABEL
X1LABEL DOTTED LINES AT 1 AND 2*SQRT(P/N)
LET TEMP4 = DATA 1 1
LET DUMMY = 2*SQRT(P/N)
LET TEMP5 = DATA DUMMY DUMMY
TITLE PLOT OF DFFITS POINTS
.
PLOT DFFITS AND
PLOT TEMP4 TEMP5 VS TEMP6
.
X1LABEL DOTTED LINES AT 20 AND 50TH PERCENTILES OF F DISTRIBUTION
LET DUMMY = FPPF(.20,P,NP)
LET TEMP4 = DATA DUMMY DUMMY
LET DUMMY = FPPF(.50,P,NP)
LET TEMP5 = DATA DUMMY DUMMY
TITLE PLOT OF COOK'S DISTANCE
.
PLOT COOK AND
PLOT TEMP4 TEMP5 VS TEMP6
.
X1LABEL
TITLE PLOT OF MAHALANOBIS DISTANCE
PLOT MAHAL
.
END OF MULTIPLOT
.
. Step 6c: Calculate:
.
.          1. Catcher matrix
.          2. Variance inflation factors
.          3. Condition numbers of X'X (based on singular
.             values of scaled X)
.          4. Partial regression plots (also called added variable plots)
.          5. Partial leverage plots
.          6. DFBETA's
.
LET XMAT = CREATE MATRIX X1 X2 X3
LET C = CATCHER MATRIX XMAT
LET VIF = VARIANCE INFLATION FACTORS XMAT
LET RJ = 0 FOR I = 1 1 NP
LET RJ = 1 - (1/VIF) SUBSET VIF > 0
LET TOL = 1/VIF SUBSET VIF > 0
LET DCOND = CONDITION INDICES XMAT
CAPTURE RESUME
PRINT " "
PRINT " "
PRINT "                                   CONDITION"
PRINT "     Rj-SQUARE      VIF   TOLERANCE  INDICES"
SET WRITE FORMAT 5X,F5.3,F10.5,F12.5,F10.5
PRINT RJ VIF TOL DCOND
CAPTURE SUSPEND
.
MULTIPLOT 2 2
CHARACTER CIRCLE
LINE BLANK
SPIKE BLANK
LIMITS DEFAULT
TIC OFFSET 0 0
MAJOR TIC MARK NUMBER DEFAULT
.
LOOP FOR K = 2 1 P
TITLE Partial Regression Plot for X^K
PARTIAL REGRESSION PLOT Y X2 X3 X^K
.
TITLE PARTIAL LEVERAGE FOR X^K
PARTIAL LEVERAGE PLOT Y X2 X3 X^K
.
LET DUMMY = XTXINV^K(K)
LET DFBETA^K = (C^K*ESTUDRES)/SQRT(DUMMY*(1-HII))
END OF LOOP
END OF MULTIPLOT
.
LET DUMMY = XTXINV1(1)
TITLE PLOT OF DFBETA'S (B0, B1, B2)
LINE BLANK ALL
CHARACTER B0 B1 B2
CHARACTER SIZE 2 ALL
LET TEMP4 = SEQUENCE 1 1 N
LET DFBETA1 = (C1*ESTUDRES)/SQRT(DUMMY*(1-HII))
PLOT DFBETA1 DFBETA2 DFBETA3 VS TEMP4
Date created: 09/02/2021
Last updated: 12/04/2023

Please email comments on this WWW page to alan.heckert@nist.gov.