1. Exploratory Data Analysis
1.3. EDA Techniques
1.3.5. Quantitative Techniques


Purpose: Detect significant factors 
The analysis of variance (ANOVA)
(Neter, Wasserman,
and Kutner, 1990) is used to detect significant
factors in a multifactor model. In the multifactor model,
there is a response (dependent) variable and one or more
factor (independent) variables. This is a common model
in designed experiments
where the experimenter sets the values for each of the factor
variables and then measures the response variable.
Each factor can take on a certain number of values. These are referred to as the levels of a factor. The number of levels can vary between factors. For designed experiments, the number of levels for a given factor tends to be small. Each combination of factor levels is a cell. Balanced designs are those in which the cells have an equal number of observations, and unbalanced designs are those in which the number of observations varies among cells. It is customary to use balanced designs in designed experiments.
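The ideas of cell and balance can be checked mechanically. The following is a minimal Python sketch, assuming each observation is recorded as a tuple of factor levels; the function names `cell_counts` and `is_balanced` and the toy layout are illustrative assumptions, not part of the handbook.

```python
from collections import Counter
from itertools import product


def cell_counts(observations, levels):
    """Count the observations in every cell (combination of factor levels).

    observations: iterable of tuples, one level per factor.
    levels: list of lists giving the allowed levels of each factor.
    Cells with no observations are reported with a count of 0.
    """
    counts = Counter(observations)
    return {cell: counts.get(cell, 0) for cell in product(*levels)}


def is_balanced(counts):
    """A design is balanced when every cell holds the same number of observations."""
    return len(set(counts.values())) == 1


# Hypothetical layout: factor 1 has 2 levels, factor 2 has 3 levels (6 cells).
levels = [[1, 2], [1, 2, 3]]
balanced_obs = [cell for cell in product(*levels) for _ in range(4)]
unbalanced_obs = balanced_obs[:-1]  # drop one observation from the last cell

balanced_counts = cell_counts(balanced_obs, levels)
unbalanced_counts = cell_counts(unbalanced_obs, levels)
print(len(balanced_counts), is_balanced(balanced_counts))
print(is_balanced(unbalanced_counts))
```

Listing every cell with `product`, rather than only the cells that appear in the data, means an empty cell is reported as unbalanced instead of being silently ignored.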

Definition 
The Product and Process
Comparisons chapter (chapter 7) contains a more extensive
discussion of two-factor
ANOVA, including the details for the mathematical
computations.
The model for the analysis of variance can be stated in two mathematically equivalent ways. We explain the model for a two-way ANOVA (the concepts are the same for additional factors). In the following discussion, each combination of factors and levels is called a cell. The subscript i refers to the level of factor 1, j refers to the level of factor 2, and the subscript k refers to the kth observation within the (i,j)th cell. For example, Y_{235} refers to the fifth observation in the second level of factor 1 and the third level of factor 2. The first model is
\( Y_{ijk} = \mu_{ij} + E_{ijk} \) This model decomposes the response into a mean for each cell and an error term. Its residuals are
\( R_{ijk} = Y_{ijk} - \hat{\mu}_{ij} \) The second model is
\( Y_{ijk} = \mu + \alpha_{i} + \beta_{j} + E_{ijk} \) This model decomposes the response into an overall mean, the factor effects \( \alpha_{i} \) and \( \beta_{j} \), and an error term. Its residuals are
\( R_{ijk} = Y_{ijk} - \hat{\mu} - \hat{\alpha}_{i} - \hat{\beta}_{j} \) The distinction between these models is that the second model divides the cell mean into an overall mean and factor effects. This second model makes the factor effect more explicit, so we will emphasize this approach.
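To make the two residual definitions concrete, here is a small Python sketch on made-up data. It assumes a balanced design, for which the usual estimates are the grand mean for the overall mean and the level means minus the grand mean for the factor effects; the data values and variable names are hypothetical.

```python
from statistics import mean

# Hypothetical balanced two-way data: data[(i, j)] lists the observations
# Y_ijk in the cell for level i of factor 1 and level j of factor 2.
data = {
    (1, 1): [10.0, 12.0],
    (1, 2): [14.0, 16.0],
    (2, 1): [20.0, 22.0],
    (2, 2): [21.0, 23.0],
}

# First model: mu_hat_ij is the ordinary mean of each cell.
cell_mean = {cell: mean(ys) for cell, ys in data.items()}
resid_cell = {
    cell: [y - cell_mean[cell] for y in ys] for cell, ys in data.items()
}

# Second model, balanced-design estimates: mu_hat is the grand mean, and
# alpha_hat_i / beta_hat_j are the level means minus the grand mean.
mu_hat = mean(y for ys in data.values() for y in ys)
levels_1 = sorted({i for i, _ in data})
levels_2 = sorted({j for _, j in data})
alpha_hat = {
    i: mean(y for (a, _), ys in data.items() if a == i for y in ys) - mu_hat
    for i in levels_1
}
beta_hat = {
    j: mean(y for (_, b), ys in data.items() if b == j for y in ys) - mu_hat
    for j in levels_2
}
resid_effects = {
    (i, j): [y - mu_hat - alpha_hat[i] - beta_hat[j] for y in ys]
    for (i, j), ys in data.items()
}

print("mu_hat =", mu_hat)
print("alpha_hat =", alpha_hat)
print("beta_hat =", beta_hat)
print("cell-means residuals:", resid_cell)
print("factor-effects residuals:", resid_effects)
```

In this toy data the cell means are not exactly additive in the two factors, so the two sets of residuals differ: the second model, as written here, has no interaction term to absorb the difference.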

Model Validation
Note that the ANOVA model assumes that the error term, E_{ijk}, should follow the assumptions for a univariate measurement process. That is, after performing an analysis of variance, the model should be validated by analyzing the residuals.

Multi-Factor ANOVA Example
An analysis of variance was performed for the
JAHANMI2.DAT data set.
The data contain four two-level factors: table speed,
down feed rate, wheel grit size, and batch. There
are 30 measurements of ceramic strength for each factor
combination for a total of 480 measurements.
SOURCE               DF     SUM OF SQUARES       MEAN SQUARE    F STATISTIC
---------------------------------------------------------------------------
TABLE SPEED           1       26672.726562      26672.726562         6.7080
DOWN FEED RATE        1       11524.053711      11524.053711         2.8982
WHEEL GRIT SIZE       1       14380.633789      14380.633789         3.6166
BATCH                 1      727143.125000     727143.125000       182.8703
RESIDUAL            475     1888731.500000       3976.276855
TOTAL (CORRECTED)   479     2668446.000000       5570.868652

RESIDUAL STANDARD DEVIATION = 63.05772781

FACTOR             LEVEL      N         MEAN    SD(MEAN)
--------------------------------------------------------
TABLE SPEED           -1    240    657.53168     2.87818
                       1    240    642.62286     2.87818
DOWN FEED RATE        -1    240    645.17755     2.87818
                       1    240    654.97723     2.87818
WHEEL GRIT SIZE       -1    240    655.55084     2.87818
                       1    240    644.60376     2.87818
BATCH                  1    240    688.99890     2.87818
                       2    240    611.15594     2.87818

The ANOVA decomposes the variance into the following component sums of squares: a sum of squares for each of the four factors, the residual sum of squares, and the total (corrected) sum of squares.
The ANOVA table also provides a formal F test for each factor effect. To test the batch effect, for example, the null hypothesis is
H_{0}: All individual batch means are equal.
The F statistic is the mean square for the factor divided by the residual mean square. This statistic follows an F distribution with (k-1) and (N-k) degrees of freedom where k is the number of levels for the given factor. Here, we see that the size of the "batch" effect dominates the size of the other effects. For our example, the critical F value (upper tail) for α = 0.05, (k-1) = 1, and (N-k) = 475 (the residual degrees of freedom in the table) is 3.86111. Thus, "table speed" and "batch" are significant at the 5 % level while "down feed rate" and "wheel grit size" are not significant at the 5 % level. In addition to the quantitative ANOVA output, it is recommended that any analysis of variance be complemented with model validation. At a minimum, this should include an analysis of the residuals.
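The arithmetic behind the F tests can be reproduced directly from the printed table. The sketch below is an illustration in Python, not the handbook's Dataplot or R code: the sums of squares and degrees of freedom are copied from the ANOVA output, and the critical value 3.86111 is taken from the text as given rather than recomputed.

```python
import math

# Degrees of freedom and sums of squares copied from the ANOVA table.
factors = {
    "TABLE SPEED": (1, 26672.726562),
    "DOWN FEED RATE": (1, 11524.053711),
    "WHEEL GRIT SIZE": (1, 14380.633789),
    "BATCH": (1, 727143.125000),
}
residual_df, residual_ss = 475, 1888731.500000
total_df, total_ss = 479, 2668446.000000

# Critical value quoted in the text for alpha = 0.05 with 1 and 475
# degrees of freedom (assumed correct, not recomputed here).
F_CRITICAL = 3.86111

# Mean square = sum of squares / degrees of freedom.
residual_ms = residual_ss / residual_df

# F statistic = factor mean square / residual mean square.
f_stats = {name: (ss / df) / residual_ms for name, (df, ss) in factors.items()}
significant = {name: f > F_CRITICAL for name, f in f_stats.items()}

# The factor and residual entries should add up to the total line,
# apart from rounding in the printed table.
ss_sum = sum(ss for _, ss in factors.values()) + residual_ss
df_sum = sum(df for df, _ in factors.values()) + residual_df

for name, f in f_stats.items():
    print(f"{name:16s} F = {f:9.4f}  significant at 5 %: {significant[name]}")
print(f"residual standard deviation = {math.sqrt(residual_ms):.4f}")
print(f"sum of component SS = {ss_sum:.3f} (table total {total_ss:.3f})")
```

The component sums of squares agree with the printed total only to about six significant digits, which is consistent with rounding in the original output.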


Questions 
The analysis of variance can be used to answer questions such as:
which of the factors have a significant effect on the response?


Related Techniques 
One-factor analysis of variance
Two-sample t-test
Box plot
Block plot
DOE mean plot

Case Study
The quantitative ANOVA approach can be contrasted with the more graphical EDA approach in the ceramic strength case study.

Software
Most general-purpose statistical software programs can perform multifactor analysis of variance. Both Dataplot code and R code can be used to generate the analyses in this section. These scripts use the JAHANMI2.DAT data file.