12. ANOVA I: Inference for Multiple Means via Variance Partition and the F Test

12.0 Notation Table

Symbol	Meaning
\(k\)	number of groups or treatments
\(i\)	group index, \(i=1,\dots,k\)
\(j\)	observation index within group \(i\)
\(n_i\)	sample size in group \(i\)
\(N\)	total sample size, \(N=\sum_{i=1}^{k} n_i\)
\(Y_{ij}\)	random response from group \(i\), observation \(j\)
\(y_{ij}\)	observed value from group \(i\), observation \(j\)
\(\mu_i\)	population mean response for group \(i\)
\(\bar{y}_{i\cdot}\)	sample mean of group \(i\)
\(\bar{y}_{\cdot\cdot}\)	grand mean of all observations
\(H_0\)	null hypothesis, \(\mu_1=\mu_2=\cdots=\mu_k\)
\(H_1\)	alternative hypothesis, at least one population mean is different
\(SST\)	total sum of squares
\(SSB\)	between-groups sum of squares
\(SSW\)	within-groups sum of squares
\(MSB\)	mean square between groups, \(MSB=SSB/(k-1)\)
\(MSW\)	mean square within groups, \(MSW=SSW/(N-k)\)
\(F\)	ANOVA test statistic, \(F=MSB/MSW\)
\(df_B\)	between-groups degrees of freedom, \(df_B=k-1\)
\(df_W\)	within-groups degrees of freedom, \(df_W=N-k\)
\(df_T\)	total degrees of freedom, \(df_T=N-1\)
\(df_1\)	numerator degrees of freedom for the \(F\) test, \(df_1=df_B=k-1\)
\(df_2\)	denominator degrees of freedom for the \(F\) test, \(df_2=df_W=N-k\)
\(\alpha\)	significance level
\(p\)	p-value for the \(F\) test
\(s\)	estimated common standard deviation, \(s=\sqrt{MSW}\)
\(e_{ij}\)	residual, \(e_{ij}=y_{ij}-\bar{y}_{i\cdot}\)
\(r_{ij}\)	standardized residual, \(r_{ij}=e_{ij}/\{s\sqrt{1-1/n_i}\}\)

Notation Note

Some textbooks use \(a\), \(SSA\), \(SSE\), \(MSA\), and \(MSE\) instead of \(k\), \(SSB\), \(SSW\), \(MSB\), and \(MSW\).

In this module, we use the between/within notation. The equivalent notation is:

\[a = k,\]

\[SSA = SSB,\]

\[SSE = SSW,\]

\[MSA = MSB,\]

\[MSE = MSW.\]

This notation is used because it directly matches the main ANOVA idea of between-group and within-group variation:

\[SST = SSB + SSW.\]

12.1 Introduction

In earlier modules, we studied inference for means in one-sample and two-sample settings. However, many industrial engineering and management problems involve more than two process conditions. For example, we may need to compare several suppliers, shift teams, machine settings, production lines, or service protocols.

In this situation, performing many two-sample tests separately is not a good default approach. Each test has a chance of making a Type I error. If we run many tests, the chance of finding at least one false “difference” becomes larger.

This module introduces one-way analysis of variance, or one-way ANOVA, as a unified method for comparing several population means. Suppose there are \(k\) groups with population means

\[\mu_1, \mu_2, \ldots, \mu_k.\]

The main question is whether these population means are all equal:

\[H_0: \mu_1 = \mu_2 = \cdots = \mu_k.\]

The alternative hypothesis is that the population means are not all equal:

\[H_1: \text{at least one population mean is different.}\]

This alternative should be interpreted carefully. It does not mean that every pair of means is different. It only means that at least one group mean differs from the others.

One-way ANOVA answers this question by separating the total variation in the data into two parts:

\[SST = SSB + SSW.\]

Here, \(SST\) is the total sum of squares, \(SSB\) is the sum of squares between groups, and \(SSW\) is the sum of squares within groups.

The between-group variation measures how far the group means are from the grand mean. The within-group variation measures how much observations vary inside their own groups.

The resulting test statistic is the \(F\)-ratio:

\[F = \frac{MSB}{MSW}.\]

This ratio compares average between-group variation with average within-group variation. A larger \(F\) value gives stronger evidence that the population means are not all equal.

12.2 Learning Outcomes

After completing this session, students should be able to:

Explain when one-way ANOVA is appropriate in industrial engineering and management problems.
State the one-way ANOVA model and describe its assumptions in operational terms.
Formulate \(H_0\) and \(H_1\) for comparing \(k\) population means.
Explain that “at least one mean differs” means the means are not all equal, not necessarily that every pair of means is different.
Compute and interpret \(SSB\), \(SSW\), and \(SST\).
Use the variance decomposition identity \(SST = SSB + SSW\).
Construct the one-way ANOVA table, including sums of squares, degrees of freedom, mean squares, and the \(F\) statistic.
Carry out the \(F\) test using either a critical value or a p-value.
Interpret the ANOVA decision in both statistical and practical terms.
Explain why many separate pairwise tests can increase the risk of false positive conclusions.
Describe the purpose of post-hoc comparisons after a significant ANOVA result.
Describe a basic diagnostic workflow using residuals, boxplots, and groupwise standard deviations.

12.3 Main Concepts

12.3.1 Model, assumptions, and hypotheses

One-way ANOVA is used when we want to compare the means of more than two groups. The response variable must be quantitative, and the groups are formed by one categorical factor.

In industrial engineering and management settings, the groups may represent different suppliers, machine settings, materials, shift teams, production lines, or service policies.

Let \(Y_{ij}\) be the response from observation \(j\) in group \(i\), where

\[i=1,\dots,k, \qquad j=1,\dots,n_i.\]

The population mean of group \(i\) is denoted by \(\mu_i\).

For example, if the factor is machine setting, then \(\mu_i\) represents the true mean response under machine setting \(i\).

The goal of one-way ANOVA is to test whether the group population means are all equal.

The null hypothesis is

\[H_0: \mu_1 = \mu_2 = \cdots = \mu_k.\]

The alternative hypothesis is

\[H_1: \text{at least one population mean is different.}\]

The alternative hypothesis should be interpreted carefully. It does not mean that all group means are different from each other. It only means that the means are not all equal. For example, it may be that only one group mean differs from the others.

To use the classical one-way ANOVA \(F\) test, we assume that:

observations are independent within and across groups;
the response within each group is approximately normally distributed;
the groups have approximately equal variances.

The independence assumption is usually supported by random sampling or random assignment. The equal-variance assumption means that the groups should have similar levels of spread. The normality assumption means that the observations within each group should be reasonably close to a normal distribution, especially when the sample size is small.

Thus, one-way ANOVA is an extension of hypothesis testing for means. Instead of comparing one mean or two means, it compares \(k\) population means using one overall \(F\) test.

12.3.2 Variance partition and sums of squares

A central idea in one-way ANOVA is variance partition. This means that the total variation in the response is separated into two parts:

variation between group means;
variation within groups.

Let \(\bar{y}_{i\cdot}\) be the sample mean of group \(i\), and let \(\bar{y}_{\cdot\cdot}\) be the grand mean across all \(N\) observations.

The total sum of squares is

\[SST=\sum_{i=1}^{k}\sum_{j=1}^{n_i} \left(y_{ij}-\bar{y}_{\cdot\cdot}\right)^2.\]

This measures the total variation of all observations around the grand mean.

The between-groups sum of squares is

\[SSB=\sum_{i=1}^{k} n_i \left(\bar{y}_{i\cdot}-\bar{y}_{\cdot\cdot}\right)^2.\]

This measures how far the group means are from the grand mean. The term \(n_i\) is included because each group mean represents \(n_i\) observations.

The within-groups sum of squares is

\[SSW=\sum_{i=1}^{k}\sum_{j=1}^{n_i} \left(y_{ij}-\bar{y}_{i\cdot}\right)^2.\]

This measures how far individual observations are from their own group mean. It represents random variation among units under the same treatment or condition.

These three quantities are connected by the ANOVA identity:

\[SST = SSB + SSW.\]

This identity means that total variation can be separated into variation explained by group membership and variation left unexplained within the groups.
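The identity follows from a short algebraic step, sketched here for completeness. Write each deviation from the grand mean as a within-group part plus a between-group part:

\[y_{ij}-\bar{y}_{\cdot\cdot} = \left(y_{ij}-\bar{y}_{i\cdot}\right) + \left(\bar{y}_{i\cdot}-\bar{y}_{\cdot\cdot}\right).\]

Squaring both sides and summing over all observations gives

\[SST = \sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(y_{ij}-\bar{y}_{i\cdot}\right)^2 + \sum_{i=1}^{k} n_i\left(\bar{y}_{i\cdot}-\bar{y}_{\cdot\cdot}\right)^2 + 2\sum_{i=1}^{k}\left(\bar{y}_{i\cdot}-\bar{y}_{\cdot\cdot}\right)\sum_{j=1}^{n_i}\left(y_{ij}-\bar{y}_{i\cdot}\right).\]

The last term is zero because deviations from a group's own mean always sum to zero, \(\sum_{j=1}^{n_i}(y_{ij}-\bar{y}_{i\cdot})=0\). The first two terms are \(SSW\) and \(SSB\), so \(SST = SSW + SSB\).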

In operational terms, \(SSB\) measures systematic differences between treatment averages, while \(SSW\) measures background noise within the same treatment condition.
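The three sums of squares can be computed directly from their definitions. The following is a minimal Python sketch using only the standard library; the function name `sums_of_squares` and the small unbalanced data set are hypothetical and chosen only for illustration.

```python
from statistics import fmean

def sums_of_squares(groups):
    """Return (SST, SSB, SSW) for a list of groups of numeric observations."""
    all_obs = [y for g in groups for y in g]
    grand_mean = fmean(all_obs)
    group_means = [fmean(g) for g in groups]
    # Total: every observation against the grand mean
    sst = sum((y - grand_mean) ** 2 for y in all_obs)
    # Between: each group mean against the grand mean, weighted by n_i
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Within: every observation against its own group mean
    ssw = sum((y - m) ** 2 for g, m in zip(groups, group_means) for y in g)
    return sst, ssb, ssw

# Hypothetical unbalanced data (n_1 = 3, n_2 = 4, n_3 = 2)
groups = [[20.0, 22.0, 21.0], [25.0, 27.0, 26.0, 24.0], [19.0, 18.0]]
sst, ssb, ssw = sums_of_squares(groups)
print(f"SST = {sst:.4f}, SSB = {ssb:.4f}, SSW = {ssw:.4f}")
print(f"SSB + SSW = {ssb + ssw:.4f}")
```

Up to floating-point rounding, the last line should reproduce \(SST\), which is a useful check on hand calculations.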

12.3.3 The ANOVA table and the F test

ANOVA uses both sums of squares and degrees of freedom. Similar to the sums of squares, the degrees of freedom are also separated into two parts:

\[N - 1 = (k - 1) + (N - k).\]

Here, \(k-1\) is the degrees of freedom for between-group variation, and \(N-k\) is the degrees of freedom for within-group variation.

The mean squares are obtained by dividing each sum of squares by its corresponding degrees of freedom:

\[MSB = \frac{SSB}{k-1}, \qquad MSW = \frac{SSW}{N-k}.\]

The quantity \(MSB\) measures the average variation between group means. The quantity \(MSW\) measures the average variation within groups. It is also the pooled estimate of the common error variance, \(\sigma^2\).

The one-way ANOVA test statistic is the \(F\)-ratio:

\[F = \frac{MSB}{MSW}.\]

This ratio compares between-group variation with within-group variation.

Under \(H_0\), the group means are assumed to be equal. In this case, \(MSB\) and \(MSW\) should measure the same basic error variation. Therefore, \(F\) tends to be close to 1.

When the group means are truly different, \(MSB\) tends to become larger than \(MSW\). Therefore, a large \(F\) value gives stronger evidence that the population means are not all equal.

Under the standard ANOVA assumptions and \(H_0\), the reference distribution is

\[F \sim F(k-1,\ N-k),\]

where \(k-1\) is the numerator degrees of freedom and \(N-k\) is the denominator degrees of freedom.

The p-value is the right-tail probability:

\[p = P(F_{k-1,N-k} \geq f_{\text{obs}}).\]

A small p-value means that the observed \(F\) value is unusually large if \(H_0\) is true. This suggests that the between-group variation is too large to be explained by within-group noise alone.
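One way to see what "right-tail probability under \(H_0\)" means is to simulate it. The sketch below is an illustration, not a replacement for the exact \(F\) distribution: it assumes balanced groups, standard normal errors, and equal population means, and the function names are hypothetical. It estimates the probability that \(F\) is at least 3.885, the tabulated 5% critical value of \(F(2,12)\), when \(k=3\) and \(n=5\).

```python
import random
from statistics import fmean

def f_statistic(groups):
    """One-way ANOVA F ratio, MSB / MSW."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand = fmean([y for g in groups for y in g])
    means = [fmean(g) for g in groups]
    ssb = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ssw = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    return (ssb / (k - 1)) / (ssw / (n_total - k))

def simulated_p_value(f_obs, k, n, reps=20000, seed=1):
    """Monte Carlo estimate of P(F >= f_obs) when all k means are equal."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(reps):
        # Every group is drawn from the same N(0, 1) population, so H0 holds
        groups = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(k)]
        if f_statistic(groups) >= f_obs:
            exceed += 1
    return exceed / reps

p_hat = simulated_p_value(f_obs=3.885, k=3, n=5)
print(f"Estimated right-tail probability: {p_hat:.3f}")
```

The estimate should land close to 0.05, with some Monte Carlo error. In practice the exact tail area is taken from an \(F\) table or a statistical library rather than simulated.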

Figure 12.1: Reading an F reference distribution for one-way ANOVA

This figure shows theoretical \(F\) distributions for one-way ANOVA. The goal is to show how the reference distribution changes with the degrees of freedom.

In one-way ANOVA, the degrees of freedom are

\[df_1 = k - 1\]

and

\[df_2 = N - k.\]

Here, \(df_1\) is the numerator degrees of freedom, and \(df_2\) is the denominator degrees of freedom. When the groups are balanced, \(n\) is the sample size per group, so \(N = kn\).

To read the figure, first check the selected values of \(df_1\) and \(df_2\). Then look at the vertical critical line. This line marks the upper-tail cutoff for the chosen significance level \(\alpha\).

The smooth curve is the theoretical \(F\) density under \(H_0\). The rejection region is the area to the right of the critical line. This is because large \(F\) values indicate that the between-group variation is large relative to the within-group variation.

The main idea is that the \(F\) reference distribution depends on the degrees of freedom. When \(df_2\) is small, the right tail is heavier, so large \(F\) values are less unusual. As a result, the critical value is larger. When \(df_2\) increases, the distribution becomes more concentrated near 1, and the critical value becomes smaller. This happens because \(MSW\) estimates the common error variance more precisely when more within-group information is available.

In practice, the critical value gives the decision rule:

\[\text{Reject } H_0 \text{ if } f_{\text{obs}} > F_{\alpha}(df_1,df_2).\]

If the observed \(F\) value is to the right of the critical line, then \(p < \alpha\). This gives evidence that the population means are not all equal.

When \(k=2\), one-way ANOVA is equivalent to the pooled-variance two-sample \(t\) test. The two statistics satisfy \(F = t^2\), so the \(F\) test and the two-sided \(t\) test give the same p-value and the same conclusion. The ANOVA framework simply expresses the evidence as an \(F\) ratio.
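A quick numerical check of the \(k=2\) case: with two groups, the ANOVA \(F\) ratio should equal the square of the pooled-variance two-sample \(t\) statistic. The sketch below uses hypothetical data and hypothetical function names.

```python
from math import sqrt
from statistics import fmean

def pooled_t(a, b):
    """Two-sample t statistic with a pooled variance estimate."""
    na, nb = len(a), len(b)
    ma, mb = fmean(a), fmean(b)
    ss_a = sum((y - ma) ** 2 for y in a)
    ss_b = sum((y - mb) ** 2 for y in b)
    sp2 = (ss_a + ss_b) / (na + nb - 2)  # pooled variance
    return (ma - mb) / sqrt(sp2 * (1 / na + 1 / nb))

def anova_f(groups):
    """One-way ANOVA F ratio, MSB / MSW."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand = fmean([y for g in groups for y in g])
    means = [fmean(g) for g in groups]
    ssb = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ssw = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    return (ssb / (k - 1)) / (ssw / (n_total - k))

# Hypothetical measurements from two machine settings
setting_1 = [5.1, 4.8, 5.6, 5.0, 5.3]
setting_2 = [5.9, 6.2, 5.7, 6.4]

t = pooled_t(setting_1, setting_2)
f = anova_f([setting_1, setting_2])
print(f"t^2 = {t ** 2:.6f}, F = {f:.6f}")
```

The two printed values should agree up to rounding, including for unequal group sizes as in this example.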

12.3.4 Estimation and interpretation beyond the global test

A significant \(F\) test gives evidence that the population means are not all equal. However, it does not tell us which groups are different, or how large the differences are.

For practical interpretation, we should also examine the group sample means:

\[\bar{y}_{1\cdot}, \bar{y}_{2\cdot}, \ldots, \bar{y}_{k\cdot}.\]

These means help us understand the direction and size of the observed differences. For example, in a process comparison, we may want to know which machine setting has the highest average output, or which supplier has the lowest average defect rate.

After a significant ANOVA result, we may also compare pairs of group means, such as

\[\bar{y}_{1\cdot} - \bar{y}_{2\cdot}, \qquad \bar{y}_{1\cdot} - \bar{y}_{3\cdot}, \qquad \bar{y}_{2\cdot} - \bar{y}_{3\cdot}.\]

These pairwise comparisons are useful because the global ANOVA test only tells us that at least one mean differs. It does not identify the specific groups responsible for the difference.

The pooled estimate of the common standard deviation is

\[s = \sqrt{MSW}.\]

This value summarizes the typical within-group variation. It is appropriate when the equal-variance assumption is reasonable. If group variances are very different, this pooled estimate may be misleading.

A useful effect-size summary is the proportion of total variation explained by group membership:

\[\eta^2 = \frac{SSB}{SST}.\]

In the one-way ANOVA model, this value equals the coefficient of determination \(R^2\).

The value of \(\eta^2\) is descriptive. It is not a decision rule. It tells us how much of the observed variation in the response is associated with the group factor.

For example, if

\[\eta^2 = 0.40,\]

then about 40% of the total variation in the response is explained by group membership.

A larger \(\eta^2\) means that the group factor explains more variation. A smaller \(\eta^2\) means that most of the variation remains within the groups.

12.3.5 Multiple comparisons: concept and caution

After rejecting \(H_0\), we know that the population means are not all equal. However, the global ANOVA test does not tell us which groups are different. Therefore, analysts often examine pairwise differences, such as

\[\mu_i - \mu_j.\]

A natural estimate of this difference is

\[\bar{y}_{i\cdot} - \bar{y}_{j\cdot}.\]

Under the equal-variance ANOVA model, the standard error for comparing two group means uses the pooled standard deviation estimate

\[s = \sqrt{MSW}.\]

Thus,

\[SE(\bar{Y}_{i\cdot}-\bar{Y}_{j\cdot}) = s\sqrt{\frac{1}{n_i}+\frac{1}{n_j}}.\]

A pairwise comparison can be written as

\[T = \frac{(\bar{y}_{i\cdot}-\bar{y}_{j\cdot})-(\mu_i-\mu_j)} {s\sqrt{\frac{1}{n_i}+\frac{1}{n_j}}}.\]

For testing whether two population means are equal, the null hypothesis is usually

\[H_0: \mu_i - \mu_j = 0.\]

Then the statistic becomes

\[T = \frac{\bar{y}_{i\cdot}-\bar{y}_{j\cdot}} {s\sqrt{\frac{1}{n_i}+\frac{1}{n_j}}}.\]

Under the classical ANOVA assumptions, this statistic is compared with a \(t\) distribution with \(N-k\) degrees of freedom.

The main caution is multiplicity. If there are \(k\) groups, the number of pairwise comparisons is

\[m = \frac{k(k-1)}{2}.\]

For example, if \(k=3\), then

\[m = \frac{3(3-1)}{2}=3.\]

The three comparisons are:

Group 1 vs. Group 2;
Group 1 vs. Group 3;
Group 2 vs. Group 3.

If many tests are performed separately, the chance of at least one false positive becomes larger. The probability of making at least one false positive across the whole family of tests is called the familywise error rate.

A simple adjustment method is the Bonferroni correction. If the desired overall significance level is \(\alpha\), then each pairwise test uses

\[\alpha^\star = \frac{\alpha}{m}.\]

This makes each individual test stricter, so the overall false-positive risk is controlled more carefully.

In practice, software often uses more efficient post-hoc methods, such as Tukey-type procedures for all pairwise comparisons. The main idea is the same: after a significant ANOVA result, pairwise conclusions should control the error rate across multiple comparisons.
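The pairwise statistics and the Bonferroni level can be computed from summary statistics alone. The sketch below is a minimal illustration: the function name `pairwise_t_statistics` and the summary values (three groups of eight observations with \(MSW = 4\)) are hypothetical.

```python
from itertools import combinations
from math import sqrt

def pairwise_t_statistics(means, sizes, msw):
    """T statistic for every pair of groups, using the pooled s = sqrt(MSW)."""
    s = sqrt(msw)
    stats = {}
    for i, j in combinations(range(len(means)), 2):
        se = s * sqrt(1 / sizes[i] + 1 / sizes[j])
        stats[(i + 1, j + 1)] = (means[i] - means[j]) / se
    return stats

# Hypothetical summary statistics for k = 3 groups
means = [50.0, 54.0, 49.0]
sizes = [8, 8, 8]
msw = 4.0

t_stats = pairwise_t_statistics(means, sizes, msw)
m = len(t_stats)          # number of pairwise comparisons, k(k-1)/2
alpha_star = 0.05 / m     # Bonferroni-adjusted level per comparison

for pair, t_value in t_stats.items():
    print(f"Group {pair[0]} vs. Group {pair[1]}: T = {t_value:.2f}")
print(f"m = {m}, alpha* = {alpha_star:.4f}")
```

Each \(|T|\) would then be compared with the two-sided critical value of the \(t\) distribution with \(N-k = 21\) degrees of freedom at level \(\alpha^\star\), taken from a table or a statistical library.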

Figure 12.2: How between-groups separation drives the F-ratio

This figure uses simulated data to help us see how one-way ANOVA works. The purpose is to separate two ideas clearly: variation within groups and variation between groups. In the figure, the sample size per group, \(n\), is fixed, while the group means and the within-group spread can change.

To read the figure, first look at the vertical spread of the points within each group. This spread shows the within-group variation. In ANOVA, this contributes to \(SSW\) and then to \(MSW\). If the points inside a group are widely spread out, then the within-group variation is large.

Next, compare the locations of the group mean markers. If the group means are close to one another, then the between-group variation is small. If the group means are far apart, then the between-group variation is large. In ANOVA, this contributes to \(SSB\) and then to \(MSB\).

The grand mean line gives a common reference point. It helps us see how far each group mean is from the overall center of the data.

The key idea is that the \(F\) statistic becomes large when the group means are well separated compared with the amount of variation inside the groups. If the group means are close together and the within-group variation is large, then \(MSB\) and \(MSW\) will be similar, so \(F\) will tend to be near 1. If the group means move farther apart while the within-group variation stays about the same, then \(MSB\) becomes larger than \(MSW\), and \(F\) increases.

This helps explain what ANOVA can detect. When process noise is large, even visible mean differences may not be strong enough to produce a large \(F\) value. When process noise is small, more moderate mean differences may be easier to detect. For this reason, good experimental control and careful measurement are important in ANOVA.

Figure 12.3: Familywise error inflation under many pairwise tests

This figure shows why we should be careful when making many pairwise comparisons after ANOVA. The figure uses simulation under a global null setting, where all population means are equal.

In this setting, there is actually no true difference among the groups. Therefore, any significant pairwise result is a false positive.

The simulation repeats the experiment many times. In each repetition, new samples are generated under \(H_0\). The value \(n\) represents the sample size per group.

To read the figure, first choose the individual test level \(\alpha\). Then look at how the error rate changes as the number of groups increases.

The curve shows the probability of getting at least one significant pairwise comparison when all group means are truly equal. This probability is called the familywise error rate.

The key point is that the number of pairwise comparisons increases quickly as the number of groups increases:

\[m = \frac{k(k-1)}{2}.\]

For example, if there are \(k=3\) groups, there are only three pairwise comparisons. However, if there are \(k=6\) groups, there are already

\[m = \frac{6(6-1)}{2} = 15\]

pairwise comparisons.

If each comparison uses \(\alpha=0.05\), then each individual test has a 5% chance of a false positive. But when many tests are performed, the chance of at least one false positive becomes much larger than 5%.

This is why we should not freely run many separate pairwise tests without adjustment.

A simple adjustment is the Bonferroni correction. If we want the overall familywise error rate to be controlled at \(\alpha\), then each pairwise test uses a smaller level:

\[\alpha^\star = \frac{\alpha}{m}.\]

This makes each pairwise test stricter.

In practice, software often provides post-hoc methods such as Tukey-type comparisons. These methods are designed to compare group means while controlling the error rate across the family of comparisons.

The practical message is:

ANOVA first tests whether there is evidence that not all means are equal.
If the ANOVA result is significant, post-hoc comparisons may be used to identify which groups differ.
However, post-hoc comparisons must control the increased false-positive risk caused by many pairwise tests.
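A rough calculation shows how quickly the problem grows. The sketch below treats the \(m\) pairwise tests as if they were independent, which is an assumption made only for illustration: real pairwise tests share data, so the true familywise error rate differs somewhat from \(1-(1-\alpha)^m\). The Bonferroni guarantee does not rely on independence. The function names are hypothetical.

```python
def n_pairs(k):
    """Number of pairwise comparisons among k groups."""
    return k * (k - 1) // 2

def fwer_if_independent(alpha, m):
    """Chance of at least one false positive in m independent tests at level alpha."""
    return 1 - (1 - alpha) ** m

def bonferroni_level(alpha, m):
    """Per-comparison level that keeps the familywise error rate at most alpha."""
    return alpha / m

alpha = 0.05
for k in range(2, 7):
    m = n_pairs(k)
    print(f"k = {k}: m = {m:2d}, "
          f"approx. FWER = {fwer_if_independent(alpha, m):.3f}, "
          f"Bonferroni alpha* = {bonferroni_level(alpha, m):.4f}")
```

Under this approximation, three comparisons already give a familywise rate of about 0.14, and fifteen comparisons give about 0.54.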

12.3.6 Diagnostics and a minimal post-hoc workflow

Diagnostics help us check whether the one-way ANOVA assumptions are reasonable enough for statistical inference. The main assumptions are independence, approximately normal errors, and similar error variance across groups.

For one-way ANOVA, the residual for observation \(j\) in group \(i\) is

\[e_{ij}=y_{ij}-\bar{y}_{i\cdot}.\]

This residual measures how far an observation is from its own group mean. Large residuals may indicate unusual observations or possible outliers.

A practical standardized residual is

\[r_{ij}= \frac{e_{ij}}{s\sqrt{1-1/n_i}},\]

where

\[s=\sqrt{MSW}.\]

The standardized residual adjusts the residual by the estimated within-group standard deviation. It also accounts for the fact that the group mean \(\bar{y}_{i\cdot}\) is estimated from the same data.
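The residual formulas translate directly into code. The following sketch uses hypothetical data in which one observation (the value 18 in the third group) is deliberately unusual; the function name is also hypothetical.

```python
from math import sqrt
from statistics import fmean

def standardized_residuals(groups):
    """Return r_ij = e_ij / (s * sqrt(1 - 1/n_i)) for every observation."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    means = [fmean(g) for g in groups]
    ssw = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    s = sqrt(ssw / (n_total - k))  # pooled standard deviation, sqrt(MSW)
    return [
        [(y - m) / (s * sqrt(1 - 1 / len(g))) for y in g]
        for g, m in zip(groups, means)
    ]

# Hypothetical data; the 18 in the last group is an intentionally unusual value
groups = [[10.0, 11.0, 12.0, 11.0],
          [14.0, 15.0, 13.0, 14.0],
          [9.0, 10.0, 18.0, 9.0]]

resid = standardized_residuals(groups)
for i, row in enumerate(resid, start=1):
    print(f"Group {i}:", [round(r, 2) for r in row])
```

The unusual observation stands out with the largest standardized residual in absolute value. Note that it also inflates \(s\), so a single extreme point can partly mask itself.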

In practice, diagnostics should include:

boxplots or groupwise displays to compare centers and spreads;
residual plots to check whether the variability looks similar across groups;
normal probability plots or histograms of residuals to check whether the normality assumption is reasonable;
inspection of unusual observations that may strongly affect the result.

The goal is not to prove that the assumptions are perfectly true. Instead, the goal is to check whether the assumptions are plausible enough for the ANOVA result to be trusted.
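A simple numerical companion to the plots is a table of groupwise sample sizes, means, and standard deviations, together with the ratio of the largest to the smallest standard deviation. The sketch below uses hypothetical data and hypothetical function names. Any cutoff for the ratio is a convention rather than a rule stated in this module; some introductory texts suggest treating a ratio above about 2 as a warning sign.

```python
from statistics import fmean, stdev

def group_summary(groups):
    """Return {label: (n, mean, sd)} for a dict of labelled groups."""
    return {label: (len(g), fmean(g), stdev(g)) for label, g in groups.items()}

def sd_ratio(groups):
    """Largest groupwise standard deviation divided by the smallest."""
    sds = [stdev(g) for g in groups.values()]
    return max(sds) / min(sds)

# Hypothetical data: the second group is much more variable than the others
groups = {
    "Setting 1": [10.0, 12.0, 11.0, 13.0],
    "Setting 2": [20.0, 26.0, 14.0, 24.0],
    "Setting 3": [15.0, 16.0, 14.0, 17.0],
}

for label, (n, mean, sd) in group_summary(groups).items():
    print(f"{label}: n = {n}, mean = {mean:.2f}, sd = {sd:.2f}")
ratio = sd_ratio(groups)
print(f"largest sd / smallest sd = {ratio:.2f}")
```

A ratio this large would suggest looking closely at the equal-variance assumption before relying on the pooled \(MSW\).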

A minimal post-hoc workflow is as follows.

First, fit the one-way ANOVA model and report the ANOVA table, including \(SSB\), \(SSW\), degrees of freedom, mean squares, the \(F\) statistic, and the p-value or critical-value decision.

Second, interpret the global test at a prespecified significance level \(\alpha\).

If the global test is not significant, we usually stop and conclude that there is not enough evidence that the group means differ.

If the global test is significant, we then examine the group means and their uncertainty. At this stage, the ANOVA result tells us that not all means are equal, but it does not tell us which groups are different.

Third, if specific group differences are important, perform a post-hoc comparison method that controls the error rate across multiple comparisons. Examples include Bonferroni-adjusted comparisons or Tukey-type comparisons.

Finally, if the diagnostics show strong unequal variance, severe outliers, or clear non-normal residual patterns, the ANOVA result should be interpreted carefully. In such cases, analysts may consider improving the design, transforming the response, increasing sample size, or using an alternative method.

Example 12.1: One-way ANOVA by hand with small balanced data

A distribution center is comparing mean pick-and-pack time, measured in minutes, across three picking policies. The policies are applied to independent orders sampled from similar operating days. Each policy is applied to five orders for a quick pilot comparison.

The goal is to decide whether there is evidence of a mean time difference among the three policies.

Question: Using one-way ANOVA at \(\alpha=0.05\), is there evidence that the mean pick-and-pack time differs among the three policies?

The observed times are:

Policy A: 12, 11, 10, 13, 12
Policy B: 14, 15, 13, 16, 14
Policy C: 11, 9, 10, 8, 10

The group means are:

\[\bar{y}_{A\cdot}=11.6,\qquad \bar{y}_{B\cdot}=14.4,\qquad \bar{y}_{C\cdot}=9.6.\]

Because the design is balanced, the grand mean can be computed as the average of the three group means:

\[\bar{y}_{\cdot\cdot} = \frac{11.6+14.4+9.6}{3} = 11.8667.\]

The between-groups sum of squares is

\[SSB = 5(11.6-11.8667)^2 + 5(14.4-11.8667)^2 + 5(9.6-11.8667)^2.\]

Thus,

\[SSB = 58.1333.\]

This value measures how far the policy means are from the grand mean.

Next, compute the within-groups sum of squares. This measures how far the observations are from their own policy mean.

For Policy A:

\[(12-11.6)^2+(11-11.6)^2+(10-11.6)^2+(13-11.6)^2+(12-11.6)^2 = 5.2.\]

For Policy B:

\[(14-14.4)^2+(15-14.4)^2+(13-14.4)^2+(16-14.4)^2+(14-14.4)^2 = 5.2.\]

For Policy C:

\[(11-9.6)^2+(9-9.6)^2+(10-9.6)^2+(8-9.6)^2+(10-9.6)^2 = 5.2.\]

Therefore,

\[SSW = 5.2+5.2+5.2 = 15.6.\]

The degrees of freedom are

\[df_B = k-1 = 3-1 = 2,\]

and

\[df_W = N-k = 15-3 = 12.\]

The mean squares are

\[MSB = \frac{SSB}{df_B} = \frac{58.1333}{2} = 29.0667,\]

and

\[MSW = \frac{SSW}{df_W} = \frac{15.6}{12} = 1.3000.\]

The \(F\) statistic is

\[F = \frac{MSB}{MSW} = \frac{29.0667}{1.3000} = 22.36.\]

At \(\alpha=0.05\), with \(df_B=2\) and \(df_W=12\), the critical value is approximately

\[F_{0.05}(2,12)=3.885.\]

Since

\[22.36 > 3.885,\]

we reject \(H_0\).

Answer: At the 5% significance level, there is statistically significant evidence that not all policy mean pick-and-pack times are equal. Based on the sample means, Policy B appears slower on average, while Policy C appears faster in this pilot study. However, ANOVA only tells us that at least one policy mean differs. To identify which policy pairs differ, a post-hoc comparison would be needed.
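The hand calculation can be checked with a short script. This is a sketch using only the Python standard library. The p-value and critical value use a closed form that is valid only because the numerator degrees of freedom equal 2 here, in which case \(P(F_{2,d} \geq f) = (1 + 2f/d)^{-d/2}\); for other values of \(k\), a table or a statistical library would be needed.

```python
from statistics import fmean

# Observed pick-and-pack times (minutes) from the example
policies = {
    "A": [12, 11, 10, 13, 12],
    "B": [14, 15, 13, 16, 14],
    "C": [11, 9, 10, 8, 10],
}
groups = list(policies.values())
k = len(groups)
n_total = sum(len(g) for g in groups)

grand_mean = fmean([y for g in groups for y in g])
means = [fmean(g) for g in groups]
ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
ssw = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)

df_b, df_w = k - 1, n_total - k
msb, msw = ssb / df_b, ssw / df_w
f_obs = msb / msw

# Closed forms for the F(2, d) distribution only (df_b must equal 2)
alpha = 0.05
p_value = (1 + 2 * f_obs / df_w) ** (-df_w / 2)
f_crit = (df_w / 2) * (alpha ** (-2 / df_w) - 1)

print(f"SSB = {ssb:.4f}, SSW = {ssw:.4f}")
print(f"MSB = {msb:.4f}, MSW = {msw:.4f}, F = {f_obs:.2f}")
print(f"critical value = {f_crit:.3f}, p-value = {p_value:.6f}")
print("Reject H0" if f_obs > f_crit else "Do not reject H0")
```

The script reproduces \(SSB = 58.1333\), \(SSW = 15.6\), \(F = 22.36\), and the critical value 3.885. The p-value is far below 0.05, which agrees with the critical-value decision.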

Example 12.2: Interpreting an ANOVA table and planning post-hoc comparisons

A call center is evaluating four training programs for new agents. The response variable is average customer handling time, measured in seconds, during the first week. Agents are randomly assigned to one of the four programs, and the analysis is performed using software.

Management wants to know whether the training programs have different mean handling times. If there is evidence of a difference, management also wants a principled follow-up to identify which programs may differ.

Question: Given the ANOVA output below, what is the correct global conclusion at \(\alpha=0.05\), and what should be the next step regarding group differences?

The ANOVA output is:

Source	df	SS	MS	F	p-value
Between groups	3	9600	3200	4.00	0.012
Within groups	76	60800	800
Total	79	70400

Here, the factor is training program. Since there are four programs,

\[k=4.\]

The degrees of freedom are consistent with one-way ANOVA:

\[df_B = k-1 = 4-1 = 3,\]

and

\[df_W = N-k = 80-4 = 76.\]

The test statistic is

\[F = \frac{MSB}{MSW} = \frac{3200}{800} = 4.00.\]

The p-value is

\[p = 0.012.\]

Since

\[p = 0.012 < 0.05,\]

we reject \(H_0\).

Therefore, at the 5% significance level, there is statistically significant evidence that the mean handling times are not all equal across the four training programs.

This conclusion must be interpreted carefully. It does not mean that all four programs are different from each other. It only means that at least one program mean differs from at least one other program mean.

The pooled estimate of the common standard deviation is

\[s = \sqrt{MSW} = \sqrt{800} \approx 28.3.\]

This value describes the typical within-program variation in handling time after accounting for differences among program means.

The proportion of total variation explained by the training program is

\[\eta^2 = \frac{SSB}{SST} = \frac{9600}{70400} \approx 0.136.\]

Thus, about 13.6% of the observed variation in handling time is associated with training program differences.

The next step depends on the management question. If management wants to compare all training programs with each other, then a post-hoc method for all pairwise comparisons, such as Tukey-type comparisons, should be used.

If management wants to compare each new program with a current standard program, then the comparison family should be defined around that baseline. In that case, a baseline-focused multiple-comparison method is more appropriate.

Answer: Reject \(H_0\) at \(\alpha=0.05\). There is evidence that not all training programs have the same mean handling time. However, the ANOVA table alone does not identify which programs differ. The next step is to conduct post-hoc comparisons with an explicitly stated multiple-comparison adjustment.
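Before interpreting software output, it is worth confirming that the table is internally consistent. The sketch below recomputes the derived quantities from the df and SS columns and sizes the pairwise comparison family. The Bonferroni level is shown only as one possible adjustment; the variable names are hypothetical.

```python
from math import sqrt

# df and SS columns copied from the ANOVA output
df_b, ss_b = 3, 9600.0
df_w, ss_w = 76, 60800.0

k = df_b + 1                 # number of training programs
n_total = df_b + df_w + 1    # number of agents
ms_b = ss_b / df_b
ms_w = ss_w / df_w
f_ratio = ms_b / ms_w
s_pooled = sqrt(ms_w)
eta_sq = ss_b / (ss_b + ss_w)

# Size of the all-pairs comparison family and a Bonferroni per-test level
m = k * (k - 1) // 2
alpha_star = 0.05 / m

print(f"k = {k}, N = {n_total}")
print(f"MSB = {ms_b:.0f}, MSW = {ms_w:.0f}, F = {f_ratio:.2f}")
print(f"s = {s_pooled:.1f}, eta^2 = {eta_sq:.3f}")
print(f"pairwise comparisons m = {m}, Bonferroni alpha* = {alpha_star:.4f}")
```

With four programs there are six pairwise comparisons, so a Bonferroni-adjusted comparison would test each pair at roughly the 0.0083 level.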

Example 12.3: When diagnostics threaten validity

A manufacturing engineer compares mean cure time across three oven temperature settings. The one-way ANOVA \(F\) test is significant. However, the residual plot by group shows two possible problems:

the highest temperature setting has much larger spread than the other groups;
two observations are extreme compared with the rest of the data.

The engineer needs to decide whether the ANOVA conclusion is stable enough for an operational decision.

Question: How should the ANOVA conclusion be qualified, and what analysis actions are appropriate before making a decision?

A significant \(F\) test suggests that not all group means are equal. However, this conclusion depends on whether the ANOVA assumptions are reasonable. In this example, the diagnostics raise concerns about equal variance and possible outliers.

The classical one-way ANOVA test assumes that the groups have a common error variance. If one temperature setting has much larger spread, then \(MSW\) becomes a pooled estimate that may not represent all groups well. This can make the \(F\) test and p-value less reliable, especially when sample sizes are unequal. Residual plots are commonly used to check whether variability is roughly constant across fitted values or groups.

The extreme observations should also be investigated. A few unusual points may strongly affect the group mean, \(SSB\), \(SSW\), and the final \(F\) statistic. In an industrial setting, these points may come from special causes, recording errors, equipment problems, material differences, or real but rare process behavior.

Before making a decision, the analyst should first examine the data context. If the extreme points are caused by measurement or recording errors, they should be corrected if possible. If they are valid observations, they should not be removed without justification.

Next, the analyst should check whether a transformation is appropriate. For time-like responses, a log transformation may help when larger means are associated with larger variances. If the unequal variance problem remains, the analyst should consider a method designed for unequal variances, such as Welch’s ANOVA, rather than relying only on the classical pooled-variance ANOVA. Welch’s ANOVA is commonly used when the equal-variance assumption is not appropriate.

The post-hoc stage should also be handled carefully. If the global ANOVA result is unstable because of unequal variance or outliers, then post-hoc comparisons based on the same assumptions may also be unstable. The analyst should only report post-hoc results after the diagnostic concerns have been addressed.

Answer: The conclusion that the mean cure times differ should be treated as provisional. The significant \(F\) test may reflect real mean differences, but it may also be affected by unequal variances or extreme observations. The recommended actions are to investigate the extreme points, check for special causes or measurement errors, consider a variance-stabilizing transformation, and rerun the analysis with diagnostics. If unequal variance remains, use an unequal-variance method such as Welch’s ANOVA before making a final operational decision.
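For readers who want to see what an unequal-variance analysis involves, the sketch below computes Welch's statistic and its approximate degrees of freedom. It is an illustration under stated assumptions, not the engineer's actual analysis: the cure-time values are hypothetical, and the formulas are the commonly quoted form of Welch's test, with weights \(w_i = n_i/s_i^2\) based on each group's own sample variance.

```python
from statistics import fmean, variance

def welch_anova(groups):
    """Return (F*, df1, df2) for Welch's unequal-variance one-way test."""
    k = len(groups)
    sizes = [len(g) for g in groups]
    means = [fmean(g) for g in groups]
    # Each group is weighted by n_i / s_i^2, so noisy groups count less
    weights = [n / variance(g) for n, g in zip(sizes, groups)]
    w_total = sum(weights)
    weighted_mean = sum(w * m for w, m in zip(weights, means)) / w_total
    lam = sum((1 - w / w_total) ** 2 / (n - 1) for w, n in zip(weights, sizes))
    numerator = sum(w * (m - weighted_mean) ** 2
                    for w, m in zip(weights, means)) / (k - 1)
    denominator = 1 + 2 * (k - 2) / (k ** 2 - 1) * lam
    df1 = k - 1
    df2 = (k ** 2 - 1) / (3 * lam)
    return numerator / denominator, df1, df2

# Hypothetical cure times (minutes); the highest setting has much larger spread
low = [30.1, 29.8, 30.5, 30.0, 29.6]
medium = [28.9, 29.4, 28.5, 29.1, 28.8]
high = [27.0, 31.5, 24.8, 29.9, 25.3]

f_star, df1, df2 = welch_anova([low, medium, high])
print(f"Welch F* = {f_star:.2f} on ({df1}, {df2:.1f}) degrees of freedom")
```

The statistic is referred to an \(F\) distribution with \(df_1 = k-1\) and the generally non-integer \(df_2\). With two groups, the same formulas reduce to the square of the unequal-variance two-sample \(t\) statistic.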

12.4 Discussion and Common Errors

A common error is to interpret a significant ANOVA result as:

All group means are different.

This interpretation is too strong. The correct interpretation is:

At least one group mean is different.

ANOVA is a global test. It tells us whether the population means are not all equal, but it does not identify which specific groups differ. To locate the differences, structured post-hoc comparisons are needed.

Another common error is to run many pairwise \(t\) tests, each at \(\alpha = 0.05\), without adjustment. Each test has a chance of producing a false positive. When many tests are performed, the chance of at least one false positive becomes larger than \(0.05\). This is why multiple-comparison control is needed when several pairwise comparisons are interpreted.

Misuse of assumptions is also common. Classical one-way ANOVA assumes that the groups have similar error variances. If the group variances are clearly different, the pooled estimate \(MSW\) may be misleading. In that case, the \(F\) test and the p-value may not be stable.

Small sample sizes also require caution. When each group has only a few observations, normality checks may not be very informative. A normal probability plot may not clearly show whether the normal-error assumption is reasonable. In this case, good design, careful measurement, and sufficient replication are more important than treating diagnostics as a simple pass-or-fail rule.

Another reporting error is to present only the p-value. A useful ANOVA report should include:

the factor and its levels;
the sample size in each group;
the group means and standard deviations;
the ANOVA table, including df, SS, MS, \(F\), and p-value;
an effect-size or practical magnitude summary.

If post-hoc comparisons are reported, the comparison family and adjustment method should also be stated. Conclusions should be written in terms of estimated mean differences and their uncertainty, not only in terms of significance.

12.5 Summary

One-way ANOVA is used to compare \(k\) population means under one categorical factor. It is useful in industrial engineering and management problems where several treatments, suppliers, machine settings, shift teams, or service policies must be compared.

The method separates total variation into two parts:

\[SST = SSB + SSW.\]

Here, \(SSB\) measures variation between group means, while \(SSW\) measures variation within groups.

The test statistic is the \(F\)-ratio:

\[F = \frac{MSB}{MSW}.\]

A large \(F\) value means that the group means are separated relative to the within-group variation. This gives stronger evidence that the population means are not all equal.

The global ANOVA test answers the question:

Is there evidence that at least one group mean is different?

It does not answer:

Which specific groups are different?

For that reason, post-hoc comparisons may be needed after a significant ANOVA result. These comparisons should control the false-positive risk caused by multiple testing.

Finally, ANOVA conclusions should be supported by diagnostics. Residual plots, boxplots, groupwise standard deviations, and normal probability plots help assess whether the assumptions of independence, similar variances, and approximately normal errors are reasonable for the intended inference.