1. Exploratory Data Analysis
1.1. EDA Introduction
1.1.7. General Problem Categories
Problem Classification
The following table is a convenient way to classify EDA
problems.
Univariate and Control
UNIVARIATE
Data:
A single column of numbers, Y.
Model:
y = constant + error
Output:
- A number (the estimated constant in the model).
- An estimate of uncertainty for the constant.
- An estimate of the distribution for the error.
Techniques:
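As an illustration only, the three univariate outputs listed above can be sketched in Python, assuming the model is a constant plus random error (as those outputs suggest). The function name `fit_constant_model` and the sample data are hypothetical. The sample mean and its standard error are one common choice of estimate, not the only one.

```python
import math
import statistics


def fit_constant_model(y):
    """Fit y = constant + error to a single column of numbers.

    Returns the estimated constant, its standard error, and the residuals.
    """
    n = len(y)
    constant = statistics.mean(y)           # estimated constant
    sd = statistics.stdev(y)                # sample standard deviation of the errors
    std_error = sd / math.sqrt(n)           # uncertainty of the estimated constant
    residuals = [v - constant for v in y]   # estimated errors, for a distribution check
    return constant, std_error, residuals


# Hypothetical measurements, used only to exercise the function.
y = [9.8, 10.1, 10.0, 9.9, 10.2]
constant, std_error, residuals = fit_constant_model(y)
print(constant, std_error)
```

The residuals are returned so that their distribution can be examined separately, for example with a histogram or a probability plot.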
CONTROL
Data:
A single column of numbers, Y.
Model:
y = constant + error
Output:
A "yes" or "no" to the question "Is the
system out of control?".
Techniques:
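The "yes" or "no" output can be sketched as follows. This is a simplified check, not a full control chart: it assumes limits set at three sample standard deviations around the mean of an in-control baseline. The 3-sigma rule, the function name `is_out_of_control`, and the data are all assumptions made for illustration.

```python
import statistics


def is_out_of_control(baseline, new_values, n_sigma=3.0):
    """Return True if any new value falls outside center +/- n_sigma * sigma.

    The center and sigma are estimated from the baseline column, which is
    assumed to come from a period when the system was in control.
    """
    center = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    lower = center - n_sigma * sigma
    upper = center + n_sigma * sigma
    return any(v < lower or v > upper for v in new_values)


# Hypothetical baseline and new observations.
baseline = [10.0, 10.2, 9.9, 10.1, 9.8, 10.0, 10.1, 9.9]
answer = "yes" if is_out_of_control(baseline, [10.1, 11.5]) else "no"
print("Is the system out of control?", answer)
```

Practical control charts often estimate sigma differently (for example, from moving ranges) and apply additional run rules; the sketch omits both.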
Comparative and Screening
COMPARATIVE
Data:
A single response variable and k independent variables
(Y, X_1, X_2, ..., X_k). The primary focus is on one of
these independent variables (the primary factor).
Model:
y = f(x_1, x_2, ..., x_k) + error
Output:
A "yes" or "no" to the question "Is the primary factor
significant?".
Techniques:
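One way to answer the significance question, sketched here under strong simplifying assumptions, is an exact permutation test on the difference in mean response between two levels of the primary factor. The sketch ignores the other independent variables entirely, and the function name `permutation_p_value`, the data, and the 0.05 threshold are hypothetical choices rather than anything prescribed above.

```python
from itertools import combinations
import statistics


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation p-value for a difference in group means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    observed = abs(statistics.mean(group_a) - statistics.mean(group_b))
    extreme = 0
    total = 0
    # Enumerate every way of relabelling n_a of the pooled responses as group A.
    for chosen in combinations(range(len(pooled)), n_a):
        chosen_set = set(chosen)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen_set]
        diff = abs(statistics.mean(a) - statistics.mean(b))
        if diff >= observed - 1e-12:   # small tolerance for floating point
            extreme += 1
        total += 1
    return extreme / total


# Hypothetical responses at two levels of the primary factor.
level_1 = [10.1, 10.3, 9.9, 10.2, 10.0]
level_2 = [11.0, 11.2, 10.9, 11.1, 11.3]
p_value = permutation_p_value(level_1, level_2)
significant = p_value < 0.05   # threshold is an assumption
print("Is the primary factor significant?", "yes" if significant else "no")
```

Full enumeration is only practical for small samples; with larger samples a random subset of relabellings is usually drawn instead.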
SCREENING
Data:
A single response variable and k independent variables
(Y, X_1, X_2, ..., X_k).
Model:
y = f(x_1, x_2, ..., x_k) + error
Output:
- A ranked list (from most important to least
important) of factors.
- Best settings for the factors.
- A good model/prediction equation relating Y to
the factors.
Techniques:
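The ranked list of factors can be illustrated with a minimal sketch that assumes a two-level design coded as -1 and +1 and ranks factors by the absolute size of their main effects. The function name `rank_main_effects`, the factor names, and the eight-run data set are hypothetical.

```python
import statistics


def rank_main_effects(design, response, names):
    """Rank factors by |main effect| in a two-level (-1/+1) design.

    The main effect of a factor is the mean response at its high level
    minus the mean response at its low level.
    """
    effects = {}
    for j, name in enumerate(names):
        high = [y for row, y in zip(design, response) if row[j] == 1]
        low = [y for row, y in zip(design, response) if row[j] == -1]
        effects[name] = statistics.mean(high) - statistics.mean(low)
    return sorted(effects.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical full factorial design in three factors (8 runs).
design = [
    (-1, -1, -1), (1, -1, -1), (-1, 1, -1), (1, 1, -1),
    (-1, -1, 1), (1, -1, 1), (-1, 1, 1), (1, 1, 1),
]
response = [44, 54, 42, 52, 48, 58, 46, 56]
ranked = rank_main_effects(design, response, ["X1", "X2", "X3"])
for name, effect in ranked:
    print(name, effect)
```

The sign of each effect also suggests a setting: a positive effect favors the high level if a larger response is wanted. Interactions are not considered in this sketch.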
Optimization and Regression
OPTIMIZATION
Data:
A single response variable and k independent variables
(Y, X_1, X_2, ..., X_k).
Model:
y = f(x_1, x_2, ..., x_k) + error
Output:
Best settings for the factor variables.
Techniques:
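As a rough sketch of how best settings might be chosen once a model is available, the following evaluates a fitted model over a grid of candidate settings and keeps the one with the largest predicted response. The quadratic `predicted_response` is a made-up stand-in for a fitted model, and treating "best" as "largest" is an assumption; in some problems the goal is a minimum or a target value.

```python
from itertools import product


def predicted_response(x1, x2):
    """Hypothetical fitted model in two factors (illustration only)."""
    return 50 + 4 * x1 + 2 * x2 - 4 * x1 ** 2 - 2 * x2 ** 2


def best_settings(model, levels_per_factor):
    """Return the candidate setting with the largest predicted response."""
    candidates = product(*levels_per_factor)
    return max(candidates, key=lambda setting: model(*setting))


# Candidate levels for each factor, in coded units.
levels = [
    [-1, -0.5, 0, 0.5, 1],   # candidate settings for x1
    [-1, -0.5, 0, 0.5, 1],   # candidate settings for x2
]
best = best_settings(predicted_response, levels)
print(best, predicted_response(*best))
```

A grid search is easy to follow but grows quickly with the number of factors and levels.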
REGRESSION
Data:
A single response variable and k independent variables
(Y, X_1, X_2, ..., X_k).
The independent variables can be continuous.
Model:
y = f(x_1, x_2, ..., x_k) + error
Output:
A good model/prediction equation relating Y to
the factors.
Techniques:
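A prediction equation of this kind can be sketched for the simplest case, assuming a single continuous factor and a straight-line form for f fitted by ordinary least squares. The function name `fit_line` and the data are hypothetical, and the general model above allows k factors and other functional forms.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical factor settings and responses.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.0]
intercept, slope = fit_line(x, y)


def predict(x_new):
    """Prediction equation from the fitted line."""
    return intercept + slope * x_new


print(intercept, slope, predict(6.0))
```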
Time Series and Multivariate
TIME SERIES
Data:
A column of time-dependent numbers, Y. In addition,
time is an independent variable. The time variable
can be either explicit or implied. If the data
are not equi-spaced, the time variable should be
explicitly provided.
Model:
y_t = f(t) + error
The model can be either time-domain based or
frequency-domain based.
Output:
A good model/prediction equation relating Y to
previous values of Y.
Techniques:
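As one time-domain illustration, an equation relating Y to its previous value can be sketched by fitting a first-order autoregressive form by least squares. The sketch assumes equi-spaced data, so the time variable is implied. The function names `fit_ar1` and `predict_next` and the noise-free series are hypothetical.

```python
def fit_ar1(series):
    """Least squares fit of y[t] = const + phi * y[t-1]."""
    prev = series[:-1]   # y[t-1]
    curr = series[1:]    # y[t]
    n = len(prev)
    mean_prev = sum(prev) / n
    mean_curr = sum(curr) / n
    sxx = sum((p - mean_prev) ** 2 for p in prev)
    sxy = sum((p - mean_prev) * (c - mean_curr) for p, c in zip(prev, curr))
    phi = sxy / sxx
    const = mean_curr - phi * mean_prev
    return const, phi


def predict_next(series, const, phi):
    """One-step-ahead prediction from the last observed value."""
    return const + phi * series[-1]


# Hypothetical series generated without noise from y[t] = 2 + 0.5 * y[t-1].
series = [10.0, 7.0, 5.5, 4.75, 4.375, 4.1875]
const, phi = fit_ar1(series)
forecast = predict_next(series, const, phi)
print(const, phi, forecast)
```

Real series contain noise, so the fitted coefficients would only approximate the underlying values, and higher-order or frequency-domain models may be more appropriate.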
MULTIVARIATE
Data:
k factor variables (X_1, X_2, ..., X_k).
Model:
The model is not explicit.
Output:
Identify underlying correlation structure in the data.
Techniques:
Note that multivariate analysis is only covered lightly
in this Handbook.
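A first look at correlation structure can be sketched with a matrix of pairwise Pearson correlations among the factor variables. This is only one simple summary, and it captures linear association only. The function names `pearson` and `correlation_matrix` and the data are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length columns."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def correlation_matrix(columns):
    """k-by-k matrix of pairwise correlations among k columns."""
    return [[pearson(a, b) for b in columns] for a in columns]


# Hypothetical factor variables; x2 is exactly twice x1.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 4.0, 6.0, 8.0, 10.0]
x3 = [5.0, 3.0, 4.0, 1.0, 2.0]
matrix = correlation_matrix([x1, x2, x3])
for row in matrix:
    print([round(r, 3) for r in row])
```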