
## General Problem Categories

Problem Classification: The following table is a convenient way to classify EDA problems.
Univariate and Control

| | UNIVARIATE | CONTROL |
| --- | --- | --- |
| Data | A single column of numbers, Y. | A single column of numbers, Y. |
| Model | y = constant + error | y = constant + error |
| Output | A number (the estimated constant in the model). An estimate of uncertainty for the constant. An estimate of the distribution for the error. | A "yes" or "no" to the question "Is the system out of control?". |
| Techniques | | |

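
The univariate and control rows share the model y = constant + error, so both can be sketched together. The code below is a minimal illustration, not the Handbook's prescribed procedure: it assumes the sample mean as the estimate of the constant, the standard error of the mean as its uncertainty, and a plain 3-sigma rule as the control check. The function names and the data values are hypothetical.

```python
import math
import statistics


def fit_constant(y):
    """Fit y = constant + error; return (constant, std_error, residuals)."""
    constant = statistics.fmean(y)               # estimated constant
    sd = statistics.stdev(y)                     # sample standard deviation of the errors
    std_error = sd / math.sqrt(len(y))           # uncertainty of the estimated constant
    residuals = [v - constant for v in y]        # raw material for studying the error distribution
    return constant, std_error, residuals


def out_of_control(points, center, sigma, k=3.0):
    """Answer "Is the system out of control?" with an assumed k-sigma rule."""
    return any(abs(v - center) > k * sigma for v in points)


# Made-up baseline measurements, for illustration only.
baseline = [9.8, 10.1, 10.0, 9.9, 10.2]
constant, std_error, residuals = fit_constant(baseline)
sigma = statistics.stdev(baseline)

# Made-up new observations checked against the baseline limits.
new_points = [10.0, 10.1, 11.5]
print("constant:", constant, "std error:", std_error)
print("out of control?", out_of_control(new_points, constant, sigma))
```

The residuals are returned rather than summarized because the univariate output calls for an estimate of the error distribution, which would normally be judged graphically.
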
Comparative and Screening

| | COMPARATIVE | SCREENING |
| --- | --- | --- |
| Data | A single response variable and k independent variables (Y, X1, X2, ... , Xk), with primary focus on one of these independent variables (the primary factor). | A single response variable and k independent variables (Y, X1, X2, ... , Xk). |
| Model | y = f(x1, x2, ..., xk) + error | y = f(x1, x2, ..., xk) + error |
| Output | A "yes" or "no" to the question "Is the primary factor significant?". | A ranked list (from most important to least important) of factors. Best settings for the factors. A good model/prediction equation relating Y to the factors. |
| Techniques | | |

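
One way to produce the ranked list of factors named in the screening output is to compare main effects. The sketch below assumes a two-level design with factors coded as -1 and +1 and ranks factors by the absolute size of their main effect (mean response at the high level minus mean response at the low level). This ranking rule, the function names, and the data are all assumptions made for illustration.

```python
def main_effects(runs, response):
    """Main effect of each factor in a two-level design coded as -1/+1."""
    k = len(runs[0])
    effects = []
    for j in range(k):
        high = [y for x, y in zip(runs, response) if x[j] == 1]
        low = [y for x, y in zip(runs, response) if x[j] == -1]
        effects.append(sum(high) / len(high) - sum(low) / len(low))
    return effects


def rank_factors(names, effects):
    """Order factors from most to least important by absolute effect."""
    return sorted(zip(names, effects), key=lambda pair: abs(pair[1]), reverse=True)


# Made-up 2^3 full factorial design and responses, for illustration only.
runs = [
    (-1, -1, -1), (1, -1, -1), (-1, 1, -1), (1, 1, -1),
    (-1, -1, 1), (1, -1, 1), (-1, 1, 1), (1, 1, 1),
]
response = [45.5, 55.5, 43.5, 53.5, 46.5, 56.5, 44.5, 54.5]

effects = main_effects(runs, response)
ranked = rank_factors(["X1", "X2", "X3"], effects)
for name, effect in ranked:
    print(name, effect)
```

The sign of each effect also hints at the best setting for that factor: a positive effect favors the high level when a larger response is wanted.
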
Optimization and Regression

| | OPTIMIZATION | REGRESSION |
| --- | --- | --- |
| Data | A single response variable and k independent variables (Y, X1, X2, ... , Xk). | A single response variable and k independent variables (Y, X1, X2, ... , Xk). The independent variables can be continuous. |
| Model | y = f(x1, x2, ..., xk) + error | y = f(x1, x2, ..., xk) + error |
| Output | Best settings for the factor variables. | A good model/prediction equation relating Y to the factors. |
| Techniques | | |

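
For the regression row, the simplest instance of y = f(x1, ..., xk) + error is a straight line in one factor. The sketch below fits that line by ordinary least squares using the closed-form formulas, then uses the fitted equation to pick the best of a few candidate settings, which is a deliberately naive stand-in for the optimization output. The single-factor model, the function names, and the data are assumptions for illustration.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Made-up factor settings and responses, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.0]

intercept, slope = fit_line(x, y)


def predict(setting):
    """Prediction equation relating Y to the factor."""
    return intercept + slope * setting


# Naive optimization: choose the candidate setting with the largest prediction.
best_setting = max(x, key=predict)
print("intercept:", intercept, "slope:", slope, "best setting:", best_setting)
```

Candidates are restricted to settings that were actually run, since a fitted equation is not reliable outside the range of the data.
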
Time Series and Multivariate

| | TIME SERIES | MULTIVARIATE |
| --- | --- | --- |
| Data | A column of time-dependent numbers, Y. In addition, time is an independent variable. The time variable can be either explicit or implied. If the data are not equi-spaced, the time variable should be explicitly provided. | k factor variables (X1, X2, ... , Xk). |
| Model | y_t = f(t) + error. The model can be either time-domain based or frequency-domain based. | The model is not explicit. |
| Output | A good model/prediction equation relating Y to previous values of Y. | Identify underlying correlation structure in the data. |
| Techniques | | Star Plot, Scatter Plot Matrix, Conditioning Plot, Profile Plot, Principal Components, Clustering, Discrimination/Classification |

Note that multivariate analysis is only covered lightly in this Handbook.

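
Because the time series output relates Y to previous values of Y, a natural first check is how strongly the series correlates with a lagged copy of itself. The sketch below computes the sample autocorrelation at a given lag for equi-spaced data, using the overall mean and the overall sum of squares as the normalizer. The function name and the data are hypothetical, and this is one common definition rather than the only one.

```python
def autocorrelation(y, lag):
    """Sample autocorrelation of an equi-spaced series at the given lag."""
    n = len(y)
    if lag < 0 or lag >= n:
        raise ValueError("lag must satisfy 0 <= lag < len(y)")
    y_bar = sum(y) / n
    denom = sum((v - y_bar) ** 2 for v in y)
    numer = sum((y[t] - y_bar) * (y[t + lag] - y_bar) for t in range(n - lag))
    return numer / denom


# Made-up equi-spaced series with a steady trend, for illustration only.
series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

for lag in range(3):
    print(lag, autocorrelation(series, lag))
```

Values near zero at every nonzero lag suggest that past values carry little information about the next one. Values well away from zero suggest that a model in terms of previous values of Y is worth fitting.
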