1.4.2.1.3. Quantitative Output and Interpretation

As a first step in the analysis, common summary statistics are computed from the data.

      Sample size  = 500
      Mean         =  -0.2935997E-02 
      Median       =  -0.9300000E-01
      Minimum      =  -0.2647000E+01 
      Maximum      =   0.3436000E+01  
      Range        =   0.6083000E+01  
      Stan. Dev.   =   0.1021041E+01
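The summary above can be reproduced with a few lines of NumPy. The following is a minimal sketch, not the software that produced the output shown: the function name `summary_statistics` is illustrative, and because the original data file is not reproduced here, the example runs on simulated stand-in data, so its numbers will differ from those above.

```python
import numpy as np

def summary_statistics(y):
    """Return the common summary statistics for a 1-D sample."""
    y = np.asarray(y, dtype=float)
    return {
        "n": y.size,
        "mean": y.mean(),
        "median": np.median(y),
        "min": y.min(),
        "max": y.max(),
        "range": y.max() - y.min(),
        "sd": y.std(ddof=1),  # sample standard deviation (N - 1 divisor)
    }

# Stand-in data (assumption): 500 simulated standard normal values.
rng = np.random.default_rng(0)
y = rng.standard_normal(500)
for name, value in summary_statistics(y).items():
    print(f"{name:7s} = {value}")
```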

Location

One way to quantify a change in location over time is to fit a straight line to the data using an index variable as the independent variable in the regression. For our data, we assume that the data are in sequential run order and were collected at equally spaced time intervals. In our regression, we use the index variable X = 1, 2, ..., N, where N is the number of observations. If there is no significant drift in the location over time, the estimated slope should not differ significantly from zero.

      Coefficient     Estimate      Stan. Error   t-Value
          B₀        0.699127E-02     0.9155E-01    0.0764
          B₁       -0.396298E-04     0.3167E-03   -0.1251
 
      Residual Standard Deviation = 1.02205
      Residual Degrees of Freedom = 498

The absolute value of the t-value for the slope parameter, 0.1251, is smaller than the critical value of t_0.975,498 = 1.96. Thus, we conclude that the slope is not significantly different from zero at the 0.05 significance level.
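The drift check described above can be sketched with `scipy.stats.linregress`. This is an illustration under stated assumptions: the function name `location_drift_test` is hypothetical, and the data are simulated stand-ins rather than the case-study observations.

```python
import numpy as np
from scipy import stats

def location_drift_test(y, alpha=0.05):
    """Regress y on the index 1..N and test whether the slope is zero."""
    y = np.asarray(y, dtype=float)
    n = y.size
    x = np.arange(1, n + 1, dtype=float)      # index variable X = 1, 2, ..., N
    fit = stats.linregress(x, y)
    t_slope = fit.slope / fit.stderr          # t-value for the slope
    t_crit = stats.t.ppf(1 - alpha / 2, n - 2)  # two-sided critical value
    return {
        "slope": fit.slope,
        "slope_se": fit.stderr,
        "t": t_slope,
        "t_crit": t_crit,
        "drift": abs(t_slope) > t_crit,
    }

# Stand-in data (assumption): simulated values with no built-in drift.
rng = np.random.default_rng(0)
result = location_drift_test(rng.standard_normal(500))
print(result)
```

With N = 500 the residual degrees of freedom are N - 2 = 498, which is why the critical value is read from the t distribution with 498 degrees of freedom.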

Variation

One simple way to detect a change in variation is with Bartlett's test, after dividing the data set into several equal-sized intervals. The choice of the number of intervals is somewhat arbitrary, although values of four or eight are reasonable. We will divide our data into four intervals.

      H₀:  σ₁² = σ₂² = σ₃² = σ₄² 
      H_a:  At least one σ_i² is not equal to the others.

      Test statistic:  T = 2.373660
      Degrees of freedom:  k - 1 = 3
      Significance level:  α = 0.05
      Critical value:  χ²_1-α,k-1 = 7.814728
      Critical region:  Reject H₀ if T > 7.814728

In this case, the test statistic T = 2.37 is smaller than the critical value of 7.81, so Bartlett's test indicates that the variances are not significantly different in the four intervals.
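A minimal sketch of this procedure, assuming the data are split into consecutive, nearly equal-sized intervals in run order, uses `scipy.stats.bartlett`. The function name `bartlett_by_interval` is illustrative and the data are simulated stand-ins.

```python
import numpy as np
from scipy import stats

def bartlett_by_interval(y, k=4, alpha=0.05):
    """Split y into k consecutive intervals and apply Bartlett's test."""
    y = np.asarray(y, dtype=float)
    groups = np.array_split(y, k)             # consecutive blocks in run order
    statistic, p_value = stats.bartlett(*groups)
    critical = stats.chi2.ppf(1 - alpha, k - 1)  # chi-square, k - 1 d.f.
    return statistic, critical, statistic > critical

# Stand-in data (assumption): simulated values with constant variance.
rng = np.random.default_rng(0)
T, crit, reject = bartlett_by_interval(rng.standard_normal(500))
print(f"T = {T:.4f}, critical value = {crit:.6f}, reject H0: {reject}")
```

The critical value depends only on the significance level and the number of intervals, so for α = 0.05 and k = 4 it is the same chi-square quantile quoted in the output above.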

Randomness

There are many ways in which data can be non-random. However, most common forms of non-randomness can be detected with a few simple tests, including the lag plot shown on the previous page.

Another check is an autocorrelation plot that shows the autocorrelations for various lags. Confidence bands can be plotted at the 95 % and 99 % confidence levels. Points outside these bands indicate statistically significant autocorrelations. The lag 0 autocorrelation is always 1.

[Figure: autocorrelation plot]

The lag 1 autocorrelation, which is generally the one of most interest, is 0.045. The critical values at the 5 % significance level are -0.087 and 0.087. Since 0.045 lies between the critical values, and therefore outside the critical region, the lag 1 autocorrelation is not statistically significant, so there is no evidence of non-randomness.
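The critical values quoted above are consistent with the usual large-sample approximation ±z_1-α/2 / √N, which for N = 500 gives about ±0.0877. The sketch below computes a lag-k autocorrelation and that bound; the function names are illustrative, and the normal approximation for the bound is an assumption rather than a statement of how the original output was generated.

```python
import numpy as np
from scipy import stats

def autocorrelation(y, lag=1):
    """Sample autocorrelation at the given lag (lag must be >= 1)."""
    y = np.asarray(y, dtype=float)
    d = y - y.mean()
    return float(np.sum(d[:-lag] * d[lag:]) / np.sum(d * d))

def autocorrelation_bound(n, alpha=0.05):
    """Approximate two-sided critical value for a single autocorrelation."""
    return float(stats.norm.ppf(1 - alpha / 2) / np.sqrt(n))

# Stand-in data (assumption): simulated independent normal values.
rng = np.random.default_rng(0)
y = rng.standard_normal(500)
r1 = autocorrelation(y, lag=1)
bound = autocorrelation_bound(y.size)
print(f"r1 = {r1:.4f}, bounds = +/-{bound:.4f}, significant: {abs(r1) > bound}")
```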

A common test for randomness is the runs test.

      H₀:  the sequence was produced in a random manner
      H_a:  the sequence was not produced in a random manner  

      Test statistic:  Z = -1.0744
      Significance level:  α = 0.05
      Critical value:  Z_1-α/2 = 1.96 
      Critical region:  Reject H₀ if |Z| > 1.96

The runs test fails to reject the null hypothesis that the data were produced in a random manner.
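Several variants of the runs test exist. The sketch below implements one common form, based on runs above and below the median with a normal approximation; whether this is the exact variant behind the output above is an assumption. The function name `runs_test` is illustrative, and values equal to the median are dropped before counting runs.

```python
import numpy as np
from scipy import stats

def runs_test(y, alpha=0.05):
    """Runs test for randomness using runs above and below the median."""
    y = np.asarray(y, dtype=float)
    median = np.median(y)
    above = y[y != median] > median           # drop ties, then classify
    n1 = int(above.sum())                     # count above the median
    n2 = int(above.size - n1)                 # count below the median
    runs = 1 + int(np.count_nonzero(above[1:] != above[:-1]))
    expected = 2.0 * n1 * n2 / (n1 + n2) + 1.0
    variance = (2.0 * n1 * n2 * (2.0 * n1 * n2 - n1 - n2)
                / ((n1 + n2) ** 2 * (n1 + n2 - 1.0)))
    z = (runs - expected) / np.sqrt(variance)
    z_crit = stats.norm.ppf(1 - alpha / 2)
    return runs, float(z), float(z_crit), bool(abs(z) > z_crit)

# Stand-in data (assumption): simulated independent normal values.
rng = np.random.default_rng(0)
runs, z, z_crit, reject = runs_test(rng.standard_normal(500))
print(f"runs = {runs}, Z = {z:.4f}, reject H0: {reject}")
```

Too few runs (strongly negative Z) suggests clustering or drift, while too many runs (strongly positive Z) suggests alternation.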

Distributional Analysis

Probability plots are a graphical technique for assessing whether a particular distribution provides an adequate fit to a data set.

A quantitative enhancement to the probability plot is the correlation coefficient of the points on the probability plot, or PPCC. For this data set the PPCC based on a normal distribution is 0.996. Since the PPCC is greater than the critical value of 0.987 (this is a tabulated value), the normality assumption is not rejected.
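The PPCC can be obtained from `scipy.stats.probplot`, which returns the correlation between the ordered data and the theoretical quantiles along with the fitted line. This is a minimal sketch on simulated stand-in data; the function name `normal_ppcc` is illustrative, and the critical value must still be looked up in a table for the sample size at hand.

```python
import numpy as np
from scipy import stats

def normal_ppcc(y):
    """Correlation coefficient of the normal probability plot of y."""
    (_, _), (_, _, r) = stats.probplot(np.asarray(y, dtype=float), dist="norm")
    return float(r)

# Stand-in data (assumption): simulated standard normal values.
rng = np.random.default_rng(0)
print(f"PPCC = {normal_ppcc(rng.standard_normal(500)):.4f}")
```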

Chi-square and Kolmogorov-Smirnov goodness-of-fit tests are alternative methods for assessing distributional adequacy. The Shapiro-Wilk and Anderson-Darling tests can be used to test for normality. The results of the Anderson-Darling test follow.

      H₀:  the data are normally distributed
      H_a:  the data are not normally distributed

      Adjusted test statistic:  A² = 1.0612
      Significance level:  α = 0.05
      Critical value:  0.787
      Critical region:  Reject H₀ if A² > 0.787

The Anderson-Darling test rejects the normality assumption at the 0.05 significance level.
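The Anderson-Darling statistic for normality with estimated mean and standard deviation can be computed directly from the sorted data. In the sketch below, the small-sample adjustment factor (1 + 4/N - 25/N²) is an assumption: it is the adjustment conventionally paired with the 0.787 critical value at the 5 % level, but the source does not state which adjustment was applied. The function name is illustrative and the data are simulated stand-ins.

```python
import numpy as np
from scipy import stats

def anderson_darling_normal(y):
    """Return (A2, adjusted A2) for a normality test with estimated parameters."""
    y = np.sort(np.asarray(y, dtype=float))
    n = y.size
    z = (y - y.mean()) / y.std(ddof=1)        # standardize with sample estimates
    i = np.arange(1, n + 1)
    # log F(z_i) + log(1 - F(z_{n+1-i})), using log-scale functions for stability
    terms = stats.norm.logcdf(z) + stats.norm.logsf(z[::-1])
    a2 = -n - np.sum((2 * i - 1) * terms) / n
    a2_adjusted = a2 * (1 + 4.0 / n - 25.0 / n**2)  # assumed adjustment
    return float(a2), float(a2_adjusted)

# Stand-in data (assumption): simulated standard normal values.
rng = np.random.default_rng(0)
a2, a2_adj = anderson_darling_normal(rng.standard_normal(500))
print(f"A2 = {a2:.4f}, adjusted A2 = {a2_adj:.4f}, reject H0: {a2_adj > 0.787}")
```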

Outlier Analysis

A test for outliers is the Grubbs test.

      H₀:  there are no outliers in the data
      H_a:  the maximum value is an outlier

      Test statistic:  G = 3.368068
      Significance level:  α = 0.05
      Critical value for an upper one-tailed test:  3.863087         
      Critical region:  Reject H₀ if G > 3.863087

For this data set, Grubbs' test does not detect any outliers at the 0.05 significance level.
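The Grubbs statistic for the maximum is G = (Y_max - mean) / s, which can be checked against the summary statistics reported earlier. The sketch below also computes a critical value from the t distribution. Software differs on whether the significance level is divided by N (one-sided) or 2N (two-sided) in that formula, so the `two_sided` flag and both function names are illustrative assumptions, and no claim is made about which convention produced the value above.

```python
import numpy as np
from scipy import stats

def grubbs_statistic_max(y_max, mean, sd):
    """Grubbs statistic for testing whether the maximum is an outlier."""
    return (y_max - mean) / sd

def grubbs_critical_value(n, alpha=0.05, two_sided=False):
    """Critical value of the Grubbs test based on the t distribution."""
    tail = alpha / (2 * n) if two_sided else alpha / n
    t = stats.t.ppf(1 - tail, n - 2)
    return float((n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2)))

# Summary statistics taken from the output at the top of this page.
G = grubbs_statistic_max(3.436, -0.002935997, 1.021041)
print(f"G = {G:.6f}")
print(f"one-sided critical value = {grubbs_critical_value(500):.6f}")
print(f"two-sided critical value = {grubbs_critical_value(500, two_sided=True):.6f}")
```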

Model

Although the Anderson-Darling test rejected normality, the remaining assumptions were validated both graphically and analytically, so we conclude that a reasonable model for the data is:

Y_i = C + E_i

where C is the estimated value of the mean, -0.00294. We can express the uncertainty for C as a 95 % confidence interval (-0.09266, 0.08678).
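The interval follows from the usual t-based formula, mean ± t_0.975,N-1 · s / √N. A short sketch using the summary statistics reported earlier (the function name `mean_confidence_interval` is illustrative):

```python
import numpy as np
from scipy import stats

def mean_confidence_interval(mean, sd, n, level=0.95):
    """t-based confidence interval for the mean from summary statistics."""
    se = sd / np.sqrt(n)                       # standard deviation of the mean
    t = stats.t.ppf(0.5 + level / 2, n - 1)    # two-sided critical value
    return float(mean - t * se), float(mean + t * se), float(se)

# Summary statistics taken from the output at the top of this page.
lower, upper, se = mean_confidence_interval(-0.002935997, 1.021041, 500)
print(f"standard deviation of mean = {se:.6f}")
print(f"95% confidence interval = ({lower:.5f}, {upper:.5f})")
```

Because the interval comfortably contains zero, the estimate is consistent with data generated from a distribution with mean zero.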

Univariate Report

It is sometimes useful and convenient to summarize the above results in a report.

 Analysis of 500 normal random numbers
  
 1: Sample Size                           = 500
  
 2: Location
    Mean                                  = -0.00294
    Standard Deviation of Mean            = 0.045663
    95% Confidence Interval for Mean      = (-0.09266,0.086779)
    Drift with respect to location?       = NO
  
 3: Variation
    Standard Deviation                    = 1.021042
    95% Confidence Interval for SD        = (0.961437,1.088585)
    Drift with respect to variation?
    (based on Bartlett's test on quarters
    of the data)                          = NO
  
 4: Data are Normal?
      (as tested by Normal PPCC)         = YES
      (as tested by Anderson-Darling)    = NO
  
 5: Randomness
    Autocorrelation                       = 0.045059
    Data are Random?
      (as measured by autocorrelation)    = YES
  
 6: Statistical Control
    (i.e., no drift in location or scale,
    data are random, distribution is 
    fixed, here we are testing only for
    fixed normal)
    Data Set is in Statistical Control?   = YES
  
 7: Outliers?
    (as determined by Grubbs' test)       = NO
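The confidence interval for the standard deviation in item 3 of the report is consistent with the standard chi-square interval, which assumes normally distributed data. The sketch below applies that formula to the reported standard deviation; the function name `sd_confidence_interval` is illustrative, and the choice of formula is an assumption about how the report was produced.

```python
import numpy as np
from scipy import stats

def sd_confidence_interval(sd, n, level=0.95):
    """Chi-square confidence interval for a standard deviation."""
    a = 1 - level
    df = n - 1
    lower = sd * np.sqrt(df / stats.chi2.ppf(1 - a / 2, df))
    upper = sd * np.sqrt(df / stats.chi2.ppf(a / 2, df))
    return float(lower), float(upper)

# Standard deviation and sample size taken from the report above.
sd_lower, sd_upper = sd_confidence_interval(1.021042, 500)
print(f"95% confidence interval for SD = ({sd_lower:.6f}, {sd_upper:.6f})")
```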