12. ANOVA I: Inference for Multiple Means via Variance Partition and the F Test
===============================================================================

12.0 Notation Table
-------------------

.. list-table::
   :header-rows: 1
   :widths: 22 78

   * - Symbol
     - Meaning
   * - :math:`a`
     - number of groups (treatments)
   * - :math:`i`
     - group index, :math:`i=1,\dots,a`
   * - :math:`j`
     - observation index within group :math:`i`
   * - :math:`n_i`
     - sample size in group :math:`i`
   * - :math:`N`
     - total sample size, :math:`N=\sum_{i=1}^a n_i`
   * - :math:`Y_{ij}`
     - response from group :math:`i`, unit :math:`j`
   * - :math:`\mu_i`
     - mean response in group :math:`i`
   * - :math:`\mu`
     - overall mean response (grand mean)
   * - :math:`\varepsilon_{ij}`
     - random error term
   * - :math:`\bar{Y}_{i\cdot}`
     - sample mean in group :math:`i`
   * - :math:`\bar{Y}_{\cdot\cdot}`
     - overall sample mean (grand mean)
   * - :math:`SST`
     - total sum of squares
   * - :math:`SSA`
     - between-groups (treatments) sum of squares
   * - :math:`SSE`
     - within-groups (error) sum of squares
   * - :math:`MSA`
     - mean square for treatments, :math:`SSA/(a-1)`
   * - :math:`MSE`
     - mean square error, :math:`SSE/(N-a)`
   * - :math:`F`
     - ANOVA test statistic, :math:`MSA/MSE`
   * - :math:`df_1`
     - numerator degrees of freedom, :math:`a-1`
   * - :math:`df_2`
     - denominator degrees of freedom, :math:`N-a`
   * - :math:`\alpha`
     - significance level
   * - :math:`p`
     - p-value for the F test
   * - :math:`s`
     - pooled standard deviation estimate, :math:`s=\sqrt{MSE}`
   * - :math:`e_{ij}`
     - residual, :math:`e_{ij}=y_{ij}-\bar{y}_{i\cdot}`
   * - :math:`r_{ij}`
     - standardized residual (one-way), :math:`r_{ij}=e_{ij}/\{s\sqrt{1-1/n_i}\}`

12.1 Introduction
-----------------

In earlier modules, inference for means was developed for one sample and for two samples. Many operational and management decisions, however, require comparing more than two process conditions, such as several suppliers, shift teams, machine settings, or service protocols.
Performing many two-sample tests separately is not an acceptable default, because the chance of at least one false “difference” increases rapidly as the number of comparisons grows. This module introduces one-way analysis of variance (one-way ANOVA) as a unified method for testing whether multiple population means are equal. The central idea is variance partition: total variability in the data is decomposed into variability explained by group membership and residual variability within groups. The resulting test statistic is an F-ratio that compares “between-groups” variation to “within-groups” variation.

12.2 Learning Outcomes
----------------------

After completing this session, students should be able to:

- State the one-way ANOVA model and its sampling assumptions in operational terms.
- Formulate :math:`H_0` and :math:`H_1` for comparing :math:`a` means and explain what “at least one mean differs” means.
- Compute and interpret :math:`SSA`, :math:`SSE`, and :math:`SST`, and use the identity :math:`SST=SSA+SSE`.
- Construct the one-way ANOVA table, including degrees of freedom and mean squares.
- Carry out the F test, interpret the p-value, and connect the decision to practical meaning.
- Explain why multiple pairwise comparisons inflate the risk of false positives and apply a simple familywise control idea.
- Describe a basic diagnostic workflow using residuals and groupwise displays.

12.3 Main Concepts
------------------

12.3.1 Model, assumptions, and hypotheses
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

One-way ANOVA models a quantitative response :math:`Y` under a single categorical factor with :math:`a` levels. In quality and process settings, the factor levels are often “treatments” such as alternative materials, machine types, or service policies. The model is

.. math::

   Y_{ij}=\mu_i+\varepsilon_{ij}, \qquad i=1,\dots,a,\ \ j=1,\dots,n_i

The parameter :math:`\mu_i` is the mean response at level :math:`i`.
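To make the model concrete, the data-generating equation above can be simulated directly. The following Python sketch assumes illustrative group means and a common error standard deviation (hypothetical values chosen for the demonstration, not taken from the text):

```python
import random
import statistics

random.seed(1)

# Illustrative one-way layout: Y_ij = mu_i + eps_ij with a common error SD.
mu = {"A": 12.0, "B": 14.0, "C": 10.0}   # assumed group means (hypothetical)
sigma = 1.0                               # assumed common error SD (homoscedasticity)
n_per_group = 5

data = {
    g: [m + random.gauss(0.0, sigma) for _ in range(n_per_group)]
    for g, m in mu.items()
}

# Each sample mean estimates its group's mu_i.
for g, ys in data.items():
    print(g, round(statistics.mean(ys), 2))
```

The single shared ``sigma`` encodes the equal-variance assumption: group means may differ, but unit-to-unit noise has the same scale in every group.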
The error term :math:`\varepsilon_{ij}` captures unit-to-unit variation not explained by the factor. This model is a special case of a linear model with indicator (dummy) variables, so it can be fit with the same computational machinery used for regression.

The standard sampling assumptions for the classical one-way ANOVA F test are as follows. Observations are independent within and across groups, typically justified by randomized assignment or by independent sampling of units. Errors have mean zero and share a common variance :math:`\sigma^2` across groups (homoscedasticity). Under Normal sampling, the error distributions are Normal, which gives the exact F reference distribution for the test statistic.

The null and alternative hypotheses are

.. math::

   H_0:\ \mu_1=\mu_2=\cdots=\mu_a

.. math::

   H_1:\ \text{at least two of the means are not equal}

The alternative does not specify which means differ. It asserts only that the factor is associated with changes in the mean response.

12.3.2 Variance partition and sums of squares
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Let :math:`\bar{Y}_{i\cdot}` be the sample mean of group :math:`i`, and let :math:`\bar{Y}_{\cdot\cdot}` be the overall mean across all :math:`N` observations. The total variability around the grand mean is quantified by the total sum of squares

.. math::

   SST=\sum_{i=1}^a\sum_{j=1}^{n_i}\left(y_{ij}-\bar{y}_{\cdot\cdot}\right)^2

ANOVA decomposes this total variability into two components. The between-groups (treatment) sum of squares measures how far group means are from the grand mean, weighted by group sizes:

.. math::

   SSA=\sum_{i=1}^a n_i\left(\bar{y}_{i\cdot}-\bar{y}_{\cdot\cdot}\right)^2

The within-groups (error) sum of squares measures unit-to-unit variability around each group mean:

.. math::

   SSE=\sum_{i=1}^a\sum_{j=1}^{n_i}\left(y_{ij}-\bar{y}_{i\cdot}\right)^2

A fundamental identity links these quantities:

.. math::

   SST = SSA + SSE

This identity formalizes the idea that each observation’s deviation from the grand mean can be split into a part due to the group mean’s deviation from the grand mean (between-groups) and a part due to the observation’s deviation from its group mean (within-groups). In operations language, :math:`SSA` summarizes systematic shifts between treatment averages, while :math:`SSE` summarizes noise among units under the same treatment.

12.3.3 The ANOVA table and the F test
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

ANOVA uses both sums of squares and degrees of freedom. The degrees of freedom also partition:

.. math::

   N-1 = (a-1) + (N-a)

The mean squares are sums of squares divided by their corresponding degrees of freedom:

.. math::

   MSA = \frac{SSA}{a-1}, \qquad MSE = \frac{SSE}{N-a}

The quantity :math:`MSE` is the pooled estimate of the common variance :math:`\sigma^2` and is also called the within-groups variance estimate. The F statistic is the ratio

.. math::

   F=\frac{MSA}{MSE}

Under :math:`H_0`, both :math:`MSA` and :math:`MSE` estimate the same variance scale, so :math:`F` tends to be near 1. When the group means are genuinely different, :math:`MSA` tends to increase relative to :math:`MSE`, so :math:`F` tends to be larger than 1. Under Normal sampling and :math:`H_0`, the reference distribution is

.. math::

   F \sim F(df_1=a-1,\ df_2=N-a)

The p-value for the one-way ANOVA is the right-tail probability :math:`p=P(F_{df_1,df_2}\ge f_\text{obs})`. A small p-value indicates that the observed between-groups variability is too large to attribute to within-groups noise alone.

Figure 12.1: Reading an F reference distribution for one-way ANOVA
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This figure uses theoretical F distributions rather than real data, because the learning goal is to isolate how the reference distribution depends on degrees of freedom.
No repeated sampling is used in this visualization; instead, :math:`df_1=a-1` and :math:`df_2=N-a` are treated as design summaries implied by the number of groups and the total sample size. In this context, :math:`n` is the per-group sample size when groups are balanced, and it influences :math:`df_2` through :math:`N`.

To read the figure, first note the selected degrees of freedom shown in the dropdown. Then identify the vertical critical line marking the upper-tail cutoff at the chosen :math:`\alpha` level. The smooth curve is the theoretical F density, and the rejection region is the area to the right of the critical line, corresponding to unusually large F-ratios under :math:`H_0`.

The main statistical message is that the F reference distribution changes substantially with degrees of freedom. When :math:`df_2` is small, the right tail is heavier and large values of :math:`F` are less surprising, so the critical value is larger. As :math:`df_2` increases, the distribution concentrates more near 1 and the critical value decreases, which corresponds to improved precision in estimating :math:`\sigma^2` through :math:`MSE`.

In practice, the plotted critical value operationalizes the rule “reject :math:`H_0` for sufficiently large :math:`F`.” An observed test statistic beyond the critical line implies :math:`p<\alpha` and supports the claim that not all group means are equal. When :math:`a=2`, this ANOVA decision reduces to the familiar two-sample mean comparison, and the F-ratio perspective clarifies why one unified framework can handle both two-group and multi-group cases.

12.3.4 Estimation and interpretation beyond the global test
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A significant F test indicates evidence that at least one mean differs, but it does not specify which groups differ or by how much.
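Before turning to interpretation, it helps to see that the entire ANOVA table can be computed in a few lines. The following Python sketch uses made-up pilot data (any grouped data would work) and checks the partition identity along the way:

```python
import math

# Illustrative grouped responses (hypothetical pilot data).
groups = {
    "A": [20, 22, 19, 21],
    "B": [25, 24, 26, 23],
    "C": [18, 20, 19, 21],
}

a = len(groups)
N = sum(len(ys) for ys in groups.values())
grand = sum(sum(ys) for ys in groups.values()) / N
means = {g: sum(ys) / len(ys) for g, ys in groups.items()}

# Variance partition: SST = SSA + SSE.
SSA = sum(len(ys) * (means[g] - grand) ** 2 for g, ys in groups.items())
SSE = sum((y - means[g]) ** 2 for g, ys in groups.items() for y in ys)
SST = sum((y - grand) ** 2 for ys in groups.values() for y in ys)

MSA = SSA / (a - 1)   # between-groups mean square, df1 = a - 1
MSE = SSE / (N - a)   # pooled variance estimate, df2 = N - a
F = MSA / MSE

assert math.isclose(SST, SSA + SSE)   # the fundamental identity
print(f"SSA={SSA:.4f} SSE={SSE:.4f} F={F:.2f}")
```

The resulting F value would then be referenced to :math:`F(a-1,\ N-a)` for the p-value, which is exactly what statistical software reports in its ANOVA table.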
For practical interpretation, it is standard to report group means, their differences, and an uncertainty summary based on the pooled variability estimate :math:`s=\sqrt{MSE}`. The pooled nature of :math:`s` is appropriate only when the equal-variance assumption is plausible. A compact effect-size summary is the proportion of explained variability:

.. math::

   R^2 = \frac{SSA}{SST}

This value is sometimes reported as :math:`\eta^2` in one-way ANOVA contexts. It is descriptive rather than a decision rule: it quantifies how much of the total variation in the observed outcomes is associated with group membership in the fitted model.

12.3.5 Multiple comparisons: concept and caution
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

After rejecting :math:`H_0`, analysts often examine pairwise differences :math:`\mu_i-\mu_j`. A natural estimator is :math:`\bar{y}_{i\cdot}-\bar{y}_{j\cdot}`. Under the equal-variance model, the standard error for comparing two means uses the pooled variance estimate:

.. math::

   SE(\bar{Y}_{i\cdot}-\bar{Y}_{j\cdot}) = s\sqrt{\frac{1}{n_i}+\frac{1}{n_j}}

A t-based comparison can be formed using

.. math::

   T=\frac{(\bar{y}_{i\cdot}-\bar{y}_{j\cdot})-(\mu_i-\mu_j)}{s\sqrt{\frac{1}{n_i}+\frac{1}{n_j}}}

Under Normal sampling, this statistic is referenced to a t distribution with :math:`N-a` degrees of freedom. This connection explains why software can provide confidence intervals for group differences immediately after fitting the one-way model.

The key caution is multiplicity. If :math:`a` groups are compared pairwise, the number of pairwise tests is

.. math::

   m=\frac{a(a-1)}{2}

Even when :math:`H_0` is true, repeatedly applying an individual cutoff such as :math:`\alpha=0.05` across many tests makes it likely that at least one comparison will appear “significant” by chance.
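The size of this inflation can be sketched numerically under the simplifying, and only approximate, assumption that the pairwise tests are independent:

```python
# Familywise error under an independence approximation:
# P(at least one false positive) ≈ 1 - (1 - alpha)^m.
alpha = 0.05

for a in [3, 4, 5, 8, 10]:
    m = a * (a - 1) // 2                 # number of pairwise comparisons
    fwer = 1 - (1 - alpha) ** m
    print(f"a={a:2d}  m={m:2d}  approx FWER={fwer:.3f}")
```

Real pairwise tests share data, so the true familywise rate differs somewhat from this formula, but the qualitative explosion as :math:`m` grows is the same.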
The probability of at least one false positive among a family of tests is the familywise error rate, and it is typically much larger than :math:`\alpha` unless adjustments are used. A simple control idea is Bonferroni: use a smaller individual level :math:`\alpha^\star=\alpha/m` for each pairwise test so that the overall familywise error rate is controlled conservatively. In practice, software often provides more efficient procedures (for example, Tukey-type controls for all pairwise comparisons), but the conceptual point is the same: post-hoc conclusions require explicit error-rate management.

Figure 12.2: How between-groups separation drives the F-ratio
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This figure uses simulated data to isolate the mechanism of one-way ANOVA under controlled conditions. Simulation is pedagogically appropriate here because it allows the within-group standard deviation and the between-group mean separation to be varied independently while keeping sample sizes fixed. In this figure, :math:`n` denotes the per-group sample size, and each dropdown setting displays one simulated dataset for that design.

To read the figure, first compare the vertical spread of points within each group; this visualizes within-group variability, which contributes to :math:`SSE` and therefore to :math:`MSE`. Next compare the horizontal pattern of group mean markers; their separation from one another and from the grand mean visualizes between-groups variability, which contributes to :math:`SSA` and therefore to :math:`MSA`. The grand mean line is a reference for the total-variation baseline.

The main message is that the F statistic becomes large when group means are separated relative to the within-group noise level. When the dropdown is set to a “small effect” condition, the group means are close and :math:`MSA` is not much larger than :math:`MSE`, so :math:`F` tends to be near 1.
Under a “large effect” condition, mean separation increases :math:`SSA` while the within-group spread remains similar, so :math:`MSA/MSE` increases, making rejection of :math:`H_0` more plausible.

Operationally, the figure clarifies what ANOVA can and cannot detect. If process noise is large, substantial differences in group means are required for a strong signal, and increasing :math:`n` reduces uncertainty by stabilizing both group means and the pooled variance estimate. If noise is well controlled, even moderate mean shifts may be detectable, which motivates both good experimental design and careful measurement.

Figure 12.3: Familywise error inflation under many pairwise tests
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This figure uses repeated-sampling simulation under a global null scenario where all group means are equal. Simulation is appropriate because the familywise error rate is defined as a long-run frequency: it is the probability that at least one false positive occurs when an entire family of tests is performed. Here, “repetition” means independent Monte Carlo experiments, each generating new samples under :math:`H_0`, and :math:`n` is the per-group sample size used in each experiment.

To read the figure, choose an individual test level :math:`\alpha` from the dropdown, then scan along the x-axis as the number of groups increases. The empirical curve is computed from repeated experiments and represents the observed frequency of “at least one significant pairwise test” when all means are truly equal. The second curve is a simple independence-based approximation that is included as a reference, not as a guarantee, because pairwise tests are not strictly independent.

The statistical message is that using a fixed individual cutoff such as :math:`0.05` across many comparisons leads to a rapidly increasing familywise error rate.
When the number of groups increases, the number of pairwise comparisons grows quadratically, so the chance of at least one false discovery rises quickly. Reducing the individual cutoff reduces the familywise error rate, which is the logic behind Bonferroni-type corrections and other multiple-comparison procedures.

In practice, the figure explains why post-hoc analysis must be planned and reported carefully. A significant global F test does not justify unrestricted pairwise testing at :math:`0.05`, and a non-significant global test does not rule out practically relevant differences that may require more power or a refined design. The responsible workflow is to pair ANOVA with a clearly stated multiple-comparisons method and a decision criterion appropriate for the operational stakes.

12.3.6 Diagnostics and a minimal post-hoc workflow
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Diagnostics focus on whether the model assumptions are plausible enough for the inferential claims. The residuals for one-way ANOVA are

.. math::

   e_{ij}=y_{ij}-\bar{y}_{i\cdot}

A practical standardized residual for groupwise checking is

.. math::

   r_{ij}=\frac{e_{ij}}{s\sqrt{1-1/n_i}}

which accounts for the fact that group means are estimated from the same data used to compute the residuals. Residual plots should be used to check whether variability looks similar across groups and whether extreme outliers dominate the conclusions. A normal probability plot of residuals is commonly used to assess whether the Normal-error assumption is a plausible approximation for the error structure.

A minimal post-hoc workflow (when post-hoc analysis is included) is the following. Fit the one-way model and report the ANOVA table and the F test result at a prespecified :math:`\alpha`. If the global test is significant, examine group means and confidence intervals, then perform a multiple-comparison procedure that explicitly controls error rates for the set of comparisons you intend to interpret.
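The residual quantities used in this workflow are easy to script. A minimal Python sketch, with illustrative data assumed for the demonstration:

```python
import math

# Illustrative grouped data (hypothetical).
groups = {
    "low":  [5.1, 4.9, 5.3, 5.0],
    "mid":  [6.0, 6.2, 5.9, 6.3],
    "high": [7.1, 6.8, 7.4, 7.0],
}

a = len(groups)
N = sum(len(ys) for ys in groups.values())
means = {g: sum(ys) / len(ys) for g, ys in groups.items()}

# Raw residuals e_ij = y_ij - group mean, and pooled s = sqrt(MSE).
resid = {g: [y - means[g] for y in ys] for g, ys in groups.items()}
SSE = sum(e ** 2 for es in resid.values() for e in es)
s = math.sqrt(SSE / (N - a))

# Standardized residuals r_ij = e_ij / (s * sqrt(1 - 1/n_i)).
std_resid = {
    g: [e / (s * math.sqrt(1 - 1 / len(es))) for e in es]
    for g, es in resid.items()
}

for g, rs in std_resid.items():
    print(g, [round(r, 2) for r in rs])
```

In practice these standardized residuals would feed the groupwise spread checks and the normal probability plot described above.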
If diagnostics suggest strong heteroscedasticity or severe outliers, consider design changes, transformations, or alternative methods rather than relying on fragile p-values.

Example 12.1 (One-way ANOVA by hand with small balanced data)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A distribution center is comparing mean pick-and-pack time (minutes) across three picking policies. The policies are implemented on independent orders sampled from similar days, and each policy is applied to five orders to obtain a quick pilot comparison. The goal is to decide whether there is evidence of a mean time difference that justifies a larger experiment.

**Question:** Using one-way ANOVA at :math:`\alpha=0.05`, is there evidence that the mean pick-and-pack time differs among the three policies?

The observed times are:

- Policy A: 12, 11, 10, 13, 12
- Policy B: 14, 15, 13, 16, 14
- Policy C: 11, 9, 10, 8, 10

The group means are :math:`\bar{y}_{A\cdot}=11.6`, :math:`\bar{y}_{B\cdot}=14.4`, and :math:`\bar{y}_{C\cdot}=9.6`. The grand mean is :math:`\bar{y}_{\cdot\cdot}=(11.6+14.4+9.6)/3=11.8667` because the design is balanced. The between-groups sum of squares is

.. math::

   SSA = 5(11.6-11.8667)^2 + 5(14.4-11.8667)^2 + 5(9.6-11.8667)^2 = 58.1333

The within-groups sum of squares is computed from deviations around each group mean:

.. math::

   SSE = \sum_{j=1}^5(y_{Aj}-11.6)^2 + \sum_{j=1}^5(y_{Bj}-14.4)^2 + \sum_{j=1}^5(y_{Cj}-9.6)^2 = 15.6

The degrees of freedom are :math:`df_1=a-1=2` and :math:`df_2=N-a=15-3=12`. Therefore :math:`MSA=58.1333/2=29.0667` and :math:`MSE=15.6/12=1.3`, so

.. math::

   F = \frac{29.0667}{1.3}=22.36

This value is far into the right tail of :math:`F(2,12)`, so the p-value is very small.

**Answer:** Reject :math:`H_0` at :math:`\alpha=0.05`.
The data provide strong evidence that at least one policy has a different mean pick-and-pack time, and the observed group means suggest that Policy B is slower on average while Policy C is faster in this pilot.

Example 12.2 (Interpreting an ANOVA table and planning post-hoc comparisons)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A call center is evaluating four training programs for new agents, using average customer handling time (seconds) over the first week as the response. Agents are randomly assigned to one of the four programs, and the analysis is performed with software that outputs the ANOVA table. Management wants a conclusion and a principled follow-up on which programs differ.

**Question:** Given the ANOVA output below, what is the correct global conclusion at :math:`\alpha=0.05`, and what should be the next step regarding group differences?

The ANOVA output is:

- Treatments: :math:`df=3`, :math:`SS=9600`, :math:`MS=3200`, :math:`F=4.00`, :math:`p=0.012`
- Error: :math:`df=76`, :math:`SS=60800`, :math:`MS=800`
- Total: :math:`df=79`, :math:`SS=70400`

The p-value :math:`p=0.012` is below :math:`0.05`, so the correct global conclusion is that not all mean handling times are equal across the four programs. This conclusion is about the existence of at least one mean difference; it does not identify a best program or specify which pairs differ. The pooled standard deviation estimate is :math:`s=\sqrt{800}\approx 28.3` seconds, which describes residual variation after accounting for program-level mean differences.

The appropriate next step depends on the decision objective. If the purpose is to compare all programs pairwise, a multiple-comparisons method that controls the familywise error rate for the set of pairwise comparisons should be used, and results should be reported with adjusted confidence intervals or adjusted p-values.
If the purpose is to compare each program to a designated baseline, then the family of comparisons should be defined accordingly, and the adjustment should match that comparison family rather than all pairs.

**Answer:** Reject :math:`H_0` at :math:`\alpha=0.05` and conclude that at least one training program has a different mean handling time. Proceed to post-hoc comparisons only with an explicitly stated multiplicity control aligned with the intended comparison family.

Example 12.3 (When diagnostics threaten validity)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A manufacturing engineer compares mean cure time across three oven temperature settings. The ANOVA F test is significant, but the residual plot by group shows that the highest temperature setting has much larger spread than the others, and two points are extreme relative to the rest. The engineer needs guidance on whether the ANOVA conclusion is stable.

**Question:** How should the ANOVA conclusion be qualified, and what analysis actions are appropriate before making a decision?

A significant F test can be driven either by genuine mean shifts or by violations that distort the pooled variance estimate. If one group has much larger variance, :math:`MSE` may represent an average that does not reflect the variability structure relevant for mean comparisons, especially when sample sizes are unequal. If a small number of extreme points dominate :math:`SSA` or :math:`SSE`, the p-value may reflect outliers more than a stable process shift.

Before deciding, the analyst should examine the data context for special causes and measurement issues, because operational explanations can justify excluding or separately modeling anomalous runs. A transformation of the response (for example, a log transform for time-like outcomes) may stabilize variance if variability scales with the mean.
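The variance-stabilization idea can be illustrated with synthetic multiplicative noise, where spread grows with the mean on the raw scale but is roughly constant after a log transform. The sketch below assumes lognormal-style errors for illustration, not the engineer's actual data:

```python
import math
import random
import statistics

random.seed(7)

# Synthetic cure times with multiplicative noise: spread scales with the mean.
group_means = {"low": 10.0, "mid": 40.0, "high": 160.0}
times = {
    g: [m * math.exp(random.gauss(0.0, 0.25)) for _ in range(30)]
    for g, m in group_means.items()
}

raw_sd = {g: statistics.stdev(ys) for g, ys in times.items()}
log_sd = {g: statistics.stdev([math.log(y) for y in ys]) for g, ys in times.items()}

# Raw-scale SDs grow with the group mean; log-scale SDs are all similar.
print({g: round(v, 2) for g, v in raw_sd.items()})
print({g: round(v, 3) for g, v in log_sd.items()})
```

When residual spread is roughly proportional to the mean, analyzing the log-transformed response brings the data closer to the equal-variance model that the pooled :math:`MSE` assumes.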
If heteroscedasticity persists, a method designed for unequal variances or a more robust design may be required, and any post-hoc comparison should be interpreted as conditional on the diagnostic adequacy.

**Answer:** The conclusion “means differ” should be treated as provisional until variance heterogeneity and outliers are addressed. The recommended actions are to investigate special causes, consider a variance-stabilizing transformation, and re-run the analysis with diagnostics to ensure that the inference is not an artifact of unequal variances or extreme points.

12.4 Discussion and Common Errors
---------------------------------

A frequent error is to interpret a significant ANOVA result as “all means are different.” The correct interpretation is that at least one mean differs, and additional structured comparisons are needed to locate the differences. Another frequent error is to run many pairwise t tests at :math:`0.05` without acknowledging that the familywise error rate is then much larger than :math:`0.05`.

Misuse of assumptions is also common. When groups have clearly different variances, the pooled estimate :math:`MSE` may be misleading, and conclusions can change under alternative methods. When sample sizes are very small, Normality checks have limited sensitivity, so the better practice is to emphasize design improvements, replication, and measurement control rather than to rely on a normality plot as a “pass/fail” criterion.

A reporting error is to present only the p-value. A complete operational report includes the factor levels, sample sizes, group means, an ANOVA table (df, SS, MS, F, p), and an effect-size or practical-magnitude summary. If post-hoc comparisons are made, the comparison family and the adjustment method must be stated, and conclusions should be phrased in terms of estimated mean differences with uncertainty.
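As a concrete instance of stating the comparison family, a Bonferroni-style all-pairs summary can be scripted. The sketch below uses illustrative data and reports each pairwise difference, its pooled standard error, and the Bonferroni per-comparison level; the critical t value for the final accept/reject call would come from tables or software at that adjusted level:

```python
import math
from itertools import combinations

# Illustrative group data (hypothetical), equal-variance one-way model.
groups = {
    "A": [30, 28, 31, 29],
    "B": [34, 33, 35, 32],
    "C": [27, 29, 28, 30],
}

a = len(groups)
N = sum(len(ys) for ys in groups.values())
means = {g: sum(ys) / len(ys) for g, ys in groups.items()}
SSE = sum((y - means[g]) ** 2 for g, ys in groups.items() for y in ys)
s = math.sqrt(SSE / (N - a))            # pooled SD from MSE

alpha = 0.05
m = a * (a - 1) // 2                    # all-pairs family size
alpha_star = alpha / m                  # Bonferroni per-comparison level

print(f"df={N - a}, pooled s={s:.3f}, per-test level alpha*={alpha_star:.4f}")
for g, h in combinations(groups, 2):
    diff = means[g] - means[h]
    se = s * math.sqrt(1 / len(groups[g]) + 1 / len(groups[h]))
    # Each diff/se is referenced to t with N - a df at level alpha_star.
    print(f"{g}-{h}: diff={diff:+.2f}, SE={se:.3f}, t={diff / se:+.2f}")
```

Stating ``m`` and ``alpha_star`` explicitly in the report is exactly the "comparison family and adjustment method" disclosure described above.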
12.5 Summary
------------

One-way ANOVA is a linear-model framework for comparing :math:`a` population means under a single categorical factor. The method relies on a variance partition of total variability into between-groups and within-groups components, producing an F-ratio that is large when group means are separated relative to within-group noise. The global F test addresses whether any mean difference exists, while post-hoc comparisons require explicit control of multiplicity to avoid inflated false positives. Diagnostics based on residuals and groupwise displays are essential for assessing whether the equal-variance and Normal-error assumptions are plausible for the intended inference.