1.3.5.12. Autocorrelation

Purpose:
Detect Non-Randomness, Time Series Modeling

The autocorrelation function (Box and Jenkins, 1976) can be used for the following two purposes:

To detect non-randomness in data.
To identify an appropriate time series model if the data are not random.

Definition

Given measurements Y₁, Y₂, ..., Y_N at times X₁, X₂, ..., X_N, the lag k autocorrelation function is defined as

\[ r_{k} = \frac{\sum_{i=1}^{N-k}(Y_{i} - \bar{Y})(Y_{i+k} - \bar{Y})} {\sum_{i=1}^{N}(Y_{i} - \bar{Y})^{2} } \]

Although the time variable, X, is not used in the formula for autocorrelation, the assumption is that the observations are equi-spaced.
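The definition above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function name `autocorrelation` and the five-point sample series are assumptions made for the example, and the input is assumed to be equi-spaced, as the definition requires.

```python
def autocorrelation(y, k):
    """Lag-k sample autocorrelation r_k of an equi-spaced series y.

    Numerator: sum over i = 1..N-k of (Y_i - Ybar)(Y_{i+k} - Ybar).
    Denominator: sum over i = 1..N of (Y_i - Ybar)^2.
    """
    n = len(y)
    if not 0 <= k < n:
        raise ValueError("lag k must satisfy 0 <= k < len(y)")
    mean = sum(y) / n
    dev = [v - mean for v in y]
    denom = sum(d * d for d in dev)
    if denom == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    numer = sum(dev[i] * dev[i + k] for i in range(n - k))
    return numer / denom


# Hypothetical sample series, used only to exercise the formula.
y = [1.0, 2.0, 3.0, 4.0, 5.0]
for k in range(3):
    print(k, round(autocorrelation(y, k), 2))
# prints:
# 0 1.0
# 1 0.4
# 2 -0.1
```

Note that the denominator always uses all N observations, so the lag 0 autocorrelation is exactly 1 and the time values X never enter the computation.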

Autocorrelation is a correlation coefficient. However, instead of measuring the correlation between two different variables, it measures the correlation between two values of the same variable at times X_i and X_{i+k}.

When the autocorrelation is used to detect non-randomness, it is usually only the first (lag 1) autocorrelation that is of interest. When the autocorrelation is used to identify an appropriate time series model, the autocorrelations are usually plotted for many lags.
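The two uses can be illustrated with a short Python sketch that computes the autocorrelations for lags 0 through a chosen maximum. The helper name `acf`, the synthetic sinusoidal series, and the comparison band of ±1.96/√N (a common large-sample approximation for random data, not stated in this section) are all assumptions made for the example.

```python
import math


def acf(y, max_lag):
    """Return the sample autocorrelations r_0, r_1, ..., r_max_lag of y."""
    n = len(y)
    mean = sum(y) / n
    dev = [v - mean for v in y]
    denom = sum(d * d for d in dev)
    return [
        sum(dev[i] * dev[i + k] for i in range(n - k)) / denom
        for k in range(max_lag + 1)
    ]


# Hypothetical illustration data: a noiseless sinusoid with period 12.
y = [math.sin(2 * math.pi * i / 12) for i in range(120)]
r = acf(y, 12)

# Assumed approximate 95 % band for the autocorrelations of random data.
band = 1.96 / math.sqrt(len(y))

# Detecting non-randomness: look at the lag 1 value alone.
print(f"lag 1 autocorrelation: {r[1]:.2f} (band: +/-{band:.2f})")

# Model identification: inspect the pattern across many lags.
for k, rk in enumerate(r):
    flag = "*" if k > 0 and abs(rk) > band else ""
    print(f"{k:3d}  {rk:6.2f} {flag}")
```

For this series the lag 1 value falls well outside the band, and the values across lags oscillate with the period of the sinusoid, which is the kind of pattern that guides the choice of a time series model.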

Autocorrelation Example

Autocorrelations for lags 0 through 49 were computed for the LEW.DAT data set.

 
 lag     autocorrelation
  0.      1.00
  1.     -0.31
  2.     -0.74
  3.      0.77
  4.      0.21
  5.     -0.90
  6.      0.38
  7.      0.63
  8.     -0.77
  9.     -0.12
 10.      0.82
 11.     -0.40
 12.     -0.55
 13.      0.73
 14.      0.07
 15.     -0.76
 16.      0.40
 17.      0.48
 18.     -0.70
 19.     -0.03
 20.      0.70
 21.     -0.41
 22.     -0.43
 23.      0.67
 24.      0.00
 25.     -0.66
 26.      0.42
 27.      0.39
 28.     -0.65
 29.      0.03
 30.      0.63
 31.     -0.42
 32.     -0.36
 33.      0.64
 34.     -0.05
 35.     -0.60
 36.      0.43
 37.      0.32
 38.     -0.64
 39.      0.08
 40.      0.58
 41.     -0.45
 42.     -0.28
 43.      0.62
 44.     -0.10
 45.     -0.55
 46.      0.45
 47.      0.25
 48.     -0.61
 49.      0.14

Questions

The autocorrelation function can be used to answer the following questions.

Was this sample data set generated from a random process?
Would a non-linear or time series model be a more appropriate model for these data than a simple constant plus error model?

Importance

Randomness is one of the key assumptions in determining if a univariate statistical process is in control. If the assumptions of constant location and scale, randomness, and fixed distribution are reasonable, then the univariate process can be modeled as:

\[ Y_{i} = A_0 + E_{i} \]

where E_i is an error term.

If the randomness assumption is not valid, then a different model needs to be used. This will typically be either a time series model or a non-linear model (with time as the independent variable).
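As a rough illustration, the Python sketch below simulates the constant plus error model with independent errors and compares its lag 1 autocorrelation with that of a series whose errors depend on their predecessors. The constant `a0`, the coefficient `phi`, the sample size, the seed, and the first-order autoregressive recursion used for the dependent errors are hypothetical choices made for the example, not values taken from this section.

```python
import random


def lag1_autocorrelation(y):
    """Lag 1 sample autocorrelation of an equi-spaced series y."""
    n = len(y)
    mean = sum(y) / n
    dev = [v - mean for v in y]
    denom = sum(d * d for d in dev)
    return sum(dev[i] * dev[i + 1] for i in range(n - 1)) / denom


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 500
a0 = 10.0  # hypothetical constant A_0

# Constant plus error model: Y_i = A_0 + E_i with independent errors.
random_series = [a0 + rng.gauss(0.0, 1.0) for _ in range(n)]

# Non-random alternative (assumed for illustration): each error carries
# over a fraction phi of the previous error.
phi = 0.8
e = 0.0
dependent_series = []
for _ in range(n):
    e = phi * e + rng.gauss(0.0, 1.0)
    dependent_series.append(a0 + e)

print(f"independent errors: r_1 = {lag1_autocorrelation(random_series):.2f}")
print(f"dependent errors:   r_1 = {lag1_autocorrelation(dependent_series):.2f}")
```

A lag 1 autocorrelation near zero is consistent with the constant plus error model, while a value far from zero suggests that a time series or non-linear model should be considered instead.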

Related Techniques

Autocorrelation Plot
Run Sequence Plot
Lag Plot
Runs Test

Case Study

The heat flow meter data demonstrate the use of autocorrelation in determining if the data are from a random process.

Software

The autocorrelation capability is available in most general-purpose statistical software programs. Both Dataplot code and R code can be used to generate the analyses in this section. These scripts use the LEW.DAT data file.