13. ANOVA II: Mean Comparisons with Blocks and Two-Factor Designs (Randomized Blocks and Two-Way ANOVA)
=========================================================================================================
13.0 Notation Table
-------------------
.. list-table::
   :header-rows: 1
   :widths: 22 78

   * - Symbol
     - Meaning
   * - :math:`Y`
     - Quantitative response
   * - :math:`A,\ B`
     - Categorical factors (or :math:`B` as a blocking factor)
   * - :math:`a,\ b`
     - Number of levels of :math:`A` and :math:`B`
   * - :math:`i,\ j,\ k`
     - Indices for :math:`A` level, :math:`B` level, replicate
   * - :math:`n_{ij}`
     - Replicates in cell :math:`(i,j)`
   * - :math:`N=\sum_{i=1}^a\sum_{j=1}^b n_{ij}`
     - Total sample size
   * - :math:`\mu`
     - Overall (grand) mean
   * - :math:`\alpha_i`
     - Main effect of :math:`A` at level :math:`i`
   * - :math:`\beta_j`
     - Main effect of :math:`B` (or block effect) at level :math:`j`
   * - :math:`(\alpha\beta)_{ij}`
     - Interaction effect for cell :math:`(i,j)`
   * - :math:`\varepsilon_{ijk}`
     - Random error term
   * - :math:`\bar{Y}_{ij}`
     - Sample mean within cell :math:`(i,j)`
   * - :math:`\bar{Y}_{i\cdot},\ \bar{Y}_{\cdot j},\ \bar{Y}_{\cdot\cdot}`
     - Marginal means and grand mean
   * - :math:`SS,\ MS`
     - Sum of squares, mean square
   * - :math:`df_E,\ MS_E`
     - Error degrees of freedom, error mean square
   * - :math:`F`
     - ANOVA test statistic (ratio of mean squares)
13.1 Introduction
-----------------
In the previous ANOVA module, the main goal was to compare multiple population means when a single categorical factor explains the differences among groups. That setting is appropriate when the experimental units are comparable and when the only systematic source of variation is the treatment factor.
In many operations and management studies, there is a second categorical source of variation that is not the treatment of interest. Typical examples include operators, days, machines, stores, or subjects, which create baseline differences that inflate the error variance if ignored. This module extends one-way ANOVA to (i) block designs that reduce extraneous variation and (ii) two-factor designs where interaction can change how main effects should be interpreted.
13.2 Learning Outcomes
----------------------
By the end of this session, students should be able to:
- Explain why blocking (including repeated measures) can reduce error variation and improve precision.
- Write down the two-way ANOVA mean model, identify main effects and interaction, and state the required conditions.
- Interpret an interaction plot and describe interaction in applied terms without relying on derivations.
- Recognize the special case of a randomized complete block design and its key limitation when there is no replication.
- Perform basic model checking using residual-based diagnostics and write a practical interpretation paragraph for a results report.
13.3 Main Concepts
------------------
13.3.1 Blocking and repeated measures as variance reduction
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Blocking is a design strategy that groups experimental units into relatively homogeneous sets and randomizes treatments within each set. The block factor is not the primary business question, but it represents a known and substantial source of variation that would otherwise appear in the error term. The operational goal is improved sensitivity for detecting treatment differences.
Repeated measures is a common form of blocking in which the same unit is measured under multiple conditions. The key advantage is that comparisons can be made within the same unit, which often reduces the variation of differences because measurements on the same unit tend to be positively correlated. A variance identity summarizes this idea for two measurements :math:`Y_i` and :math:`Y_j` on the same unit:
.. math::

   \mathrm{Var}(Y_i - Y_j) = \mathrm{Var}(Y_i) + \mathrm{Var}(Y_j) - 2\,\mathrm{Cov}(Y_i, Y_j)
When :math:`\mathrm{Cov}(Y_i, Y_j) > 0`, the variance of the within-unit difference is smaller than it would be under independent sampling. This is the same logic that explains why paired comparisons can be more precise than unpaired comparisons.
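A minimal simulation sketch illustrates the identity numerically; the shared unit effect, means, and variances below are hypothetical choices, not values from any study in this module:

.. code-block:: python

   import numpy as np

   # Two measurements on the same unit share a baseline "unit effect",
   # which induces positive correlation between Y_i and Y_j.
   rng = np.random.default_rng(0)
   n_units = 100_000
   unit_effect = rng.normal(0.0, 2.0, n_units)             # shared baseline per unit
   y_i = 10 + unit_effect + rng.normal(0.0, 1.0, n_units)
   y_j = 12 + unit_effect + rng.normal(0.0, 1.0, n_units)

   var_if_independent = y_i.var() + y_j.var()              # roughly 10 here
   var_of_difference = (y_i - y_j).var()                   # roughly 2 here
   cov_ij = np.cov(y_i, y_j)[0, 1]

   print(var_of_difference)                                # small: covariance cancels
   print(var_if_independent - 2 * cov_ij)                  # matches, up to sampling noise

Because the covariance term is positive, the within-unit difference is far less variable than a difference between measurements taken on two independent units.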
13.3.2 Two-way ANOVA mean model and the meaning of interaction
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Two-way ANOVA studies a quantitative response :math:`Y` under two categorical predictors :math:`A` and :math:`B`. The standard mean model with interaction is:
.. math::

   Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}
The usual sampling conditions are that observations are independent within and across cells (given the design), and the errors satisfy :math:`\mathbb{E}(\varepsilon_{ijk})=0` and :math:`\mathrm{Var}(\varepsilon_{ijk})=\sigma^2`. Under Normal sampling, we additionally assume :math:`\varepsilon_{ijk}\sim N(0,\sigma^2)`. The equal-variance condition is a statement about the conditional distributions across all :math:`(i,j)` combinations.
Interaction means that the effect of one factor depends on the level of the other factor. In mean-model terms, interaction is present when the cell means cannot be expressed additively as :math:`\mu_{ij}=\mu+\alpha_i+\beta_j`. In applied terms, interaction occurs when a policy, treatment, or design choice performs differently across operating environments.
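The additive decomposition can be checked directly from a table of cell means. The sketch below uses hypothetical cell means (not data from this module) and computes the interaction effects :math:`(\alpha\beta)_{ij}` as departures of the cell means from the additive fit:

.. code-block:: python

   import numpy as np

   # Hypothetical true cell means mu_ij for a 3 x 2 layout (rows = A, columns = B).
   mu = np.array([[10.0, 14.0],
                  [12.0, 16.0],
                  [11.0, 20.0]])          # the last cell breaks additivity

   grand = mu.mean()
   alpha = mu.mean(axis=1) - grand        # main effects of A
   beta = mu.mean(axis=0) - grand         # main effects of B
   additive_fit = grand + alpha[:, None] + beta[None, :]

   # Interaction effects: zero in every cell if and only if the means are additive.
   interaction = mu - additive_fit
   print(np.round(interaction, 2))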
13.3.3 Interaction plots: how to read and how to interpret
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Figure 13.1 is designed to connect the definition of interaction to an operational reading rule. The data are simulated to control the presence of interaction and to show how sampling noise changes when the per-cell sample size :math:`n` increases. In this figure, a “repetition” refers to repeating the entire two-factor experiment under the same underlying means but with new random error draws, and the displayed dataset for each :math:`n` is one such realization. The symbol :math:`n` denotes the number of observations collected in each :math:`(A=i,\ B=j)` cell.
To read the figure, start by fixing one level of :math:`B` and scanning across the levels of :math:`A`, following one line at a time. Then compare the two lines corresponding to different :math:`B` levels and check whether their vertical separation stays roughly constant as :math:`A` changes. The points represent empirical sample means :math:`\bar{Y}_{ij}` from the simulated dataset, while the connecting lines are visual guides that make “parallel versus non-parallel” easier to assess.
The main message is that interaction corresponds to non-parallel response curves: the difference between :math:`B` levels changes as :math:`A` changes. When :math:`n` is small, sample means are noisy and lines may appear slightly non-parallel by chance, so interpretation should be cautious. When :math:`n` is larger, the sample means stabilize, and the visual pattern (parallel or non-parallel) more reliably reflects the underlying interaction structure.
In reporting terms, the figure supports a disciplined writing order: first describe whether interaction is present (and what “depends on what”), and only then discuss main effects. If interaction is visually and statistically important, marginal main-effect summaries can be misleading because they average over different response patterns across the other factor. Practical interpretation should be written in terms of specific combinations (e.g., “under high temperature, supplier 3 improves performance, but under low temperature the improvement is smaller”).
Figure 13.1: Interaction plot of simulated cell means :math:`\bar{Y}_{ij}` across levels of :math:`A`, with one line per level of :math:`B`, shown for several per-cell sample sizes :math:`n`.
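A rough sketch of how such an interaction plot can be produced from simulated data is shown below; the cell means, error standard deviation, and per-cell :math:`n` are hypothetical and only approximate the kind of display in Figure 13.1 (matplotlib is assumed to be available):

.. code-block:: python

   import numpy as np
   import matplotlib.pyplot as plt

   # Hypothetical setup: a = 3 levels of A, b = 2 levels of B, n observations per cell.
   rng = np.random.default_rng(1)
   true_means = np.array([[50.0, 55.0],
                          [52.0, 54.0],
                          [51.0, 62.0]])   # non-additive pattern (interaction present)
   n = 6

   # One realization: simulate each cell and keep its sample mean Y-bar_ij.
   cell_means = np.array([[rng.normal(true_means[i, j], 3.0, n).mean()
                           for j in range(2)] for i in range(3)])

   levels_a = [1, 2, 3]
   for j, label in enumerate(["B level 1", "B level 2"]):
       plt.plot(levels_a, cell_means[:, j], marker="o", label=label)
   plt.xlabel("Level of A")
   plt.ylabel("Sample cell mean of Y")
   plt.title("Interaction plot: non-parallel lines suggest interaction")
   plt.legend()
   plt.show()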
13.3.4 Randomized complete block designs as a special two-factor case
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
A randomized complete block (RCB) design is a two-factor layout in which :math:`A` is the treatment factor of interest and :math:`B` is a blocking factor, with each treatment appearing exactly once in each block. In many operational settings, the block factor is an unavoidable baseline source of variation (operators, days, machines, batches), and the RCB design removes much of that variation from the error term.
The additive RCB mean model is:
.. math::

   Y_{ij} = \mu + \alpha_i + \beta_j + \varepsilon_{ij}
This model assumes there is no treatment-by-block interaction. That assumption is not a minor technicality: it states that treatment differences are consistent across blocks. If treatment differences change substantially across blocks, then the additive RCB analysis can be misleading, and the “extra” variation appears inside the error term.
Figure 13.2 explains why blocking can improve precision even when the treatment means are unchanged. The data are simulated so that we can repeat many comparable experiments and quantify variability of an estimator under two designs. Here, a “repetition” is one entire experiment, and the figure aggregates many repetitions to form empirical sampling distributions. In this figure, :math:`n` denotes the number of blocks (paired units) used in the blocked design, and the same total sample size is used for the completely randomized comparison.
To read the figure, compare the spread of the two sampling distributions for the same estimand (a difference of treatment means). The histograms are empirical, obtained from repeated simulation, and the vertical markers summarize the typical center and variability. A narrower sampling distribution indicates a smaller standard error, which translates into tighter confidence intervals and higher power for the same nominal significance level.
The statistical message is that blocking is most beneficial when within-block measurements are positively correlated or when blocks capture a large portion of baseline variability. As :math:`n` increases, both designs become more precise, but the blocked design retains an advantage when correlation is present because the within-block differencing cancels shared baseline effects. When the correlation is near zero (or blocks are not meaningful), the gain from blocking can shrink and may not justify the additional design constraints.
In practice, this figure supports an explicit planning statement: “We block on :math:`B` because it is expected to explain substantial variability in :math:`Y` that is not of primary interest.” It also motivates careful block selection: blocks should be formed using variables that plausibly drive response variability but are not themselves manipulated treatments. If blocking is chosen poorly, the design can lose flexibility without delivering a meaningful reduction in error variance.
Figure 13.2: Empirical sampling distributions of the estimated treatment-mean difference under a blocked design and a completely randomized design with the same total sample size, based on many repeated simulated experiments.
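The logic behind Figure 13.2 can be reproduced with a short simulation sketch. The block standard deviation, error standard deviation, treatment difference, and number of blocks below are hypothetical values chosen only to make the contrast visible:

.. code-block:: python

   import numpy as np

   rng = np.random.default_rng(2)
   reps, n = 5_000, 10              # repeated experiments; n blocks (pairs) per experiment
   sigma_block, sigma_eps = 3.0, 1.0
   delta = 2.0                      # true treatment difference

   diff_blocked = np.empty(reps)
   diff_crd = np.empty(reps)
   for r in range(reps):
       # Blocked design: both treatments observed within the same n blocks.
       block = rng.normal(0.0, sigma_block, n)
       y1 = 50 + block + rng.normal(0.0, sigma_eps, n)
       y2 = 50 + delta + block + rng.normal(0.0, sigma_eps, n)
       diff_blocked[r] = (y2 - y1).mean()

       # Completely randomized design: 2n independent units, same total size.
       z1 = 50 + rng.normal(0.0, sigma_block, n) + rng.normal(0.0, sigma_eps, n)
       z2 = 50 + delta + rng.normal(0.0, sigma_block, n) + rng.normal(0.0, sigma_eps, n)
       diff_crd[r] = z2.mean() - z1.mean()

   print(diff_blocked.std())        # smaller spread: block effects cancel within blocks
   print(diff_crd.std())            # larger spread: block-to-block variation stays in the error

Setting ``sigma_block`` near zero shrinks the advantage, consistent with the point above that the gain from blocking depends on how much baseline variability the blocks capture.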
13.3.5 Two-way ANOVA testing logic and what to report
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
With replication in each cell, two-way ANOVA typically tests three hypotheses: interaction, main effect of :math:`A`, and main effect of :math:`B`. The reporting logic is to assess interaction first because strong interaction changes the interpretation of main effects. In a fixed-effects two-way ANOVA with replication, the test statistics are ratios of mean squares:
.. math::

   F_{AB}=\frac{MS_{AB}}{MS_E},\quad F_A=\frac{MS_A}{MS_E},\quad F_B=\frac{MS_B}{MS_E}
The precise degrees of freedom depend on :math:`a`, :math:`b`, and the replication structure, but the key point is that :math:`MS_E` estimates :math:`\sigma^2` under the model conditions. When there is no replication (one observation per cell), an interaction test is not available, and using an RCB analysis requires the additivity (no-interaction) assumption.
In writing results, it is usually not sufficient to state “significant” or “not significant.” A complete interpretation paragraph should include (i) the direction and size of relevant differences, (ii) the operational meaning of those differences, and (iii) whether conclusions are robust under diagnostic checks. If interaction is present, the direction and size of differences must be described at specific levels of the other factor.
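A minimal sketch of this testing workflow on simulated data is given below; the factor names, effect sizes, and replication are hypothetical, and statsmodels is assumed to be available:

.. code-block:: python

   import numpy as np
   import pandas as pd
   from statsmodels.formula.api import ols
   from statsmodels.stats.anova import anova_lm

   # Hypothetical balanced design: a = 3, b = 2, six replicates per cell,
   # with an extra response in one cell so that interaction is present.
   rng = np.random.default_rng(3)
   rows = []
   for i in (1, 2, 3):
       for j in (1, 2):
           mu_ij = 50 + 2 * i + 3 * j + (4 if (i, j) == (3, 2) else 0)
           for y in rng.normal(mu_ij, 2.0, 6):
               rows.append({"A": f"a{i}", "B": f"b{j}", "y": y})
   data = pd.DataFrame(rows)

   # Full two-way model with interaction; read the C(A):C(B) (interaction) row first.
   model = ols("y ~ C(A) * C(B)", data=data).fit()
   print(anova_lm(model, typ=2))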
13.3.6 Residual diagnostics for two-way and block models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Model checking in ANOVA relies on residuals, which summarize what the model does not explain. For a full two-way model with replication, a common within-cell standardized residual uses the fitted cell mean and accounts for the fact that :math:`\bar{Y}_{ij}` is estimated from :math:`n_{ij}` observations:
.. math::

   r_{ijk}=\frac{Y_{ijk}-\bar{Y}_{ij}}{s\sqrt{1-\frac{1}{n_{ij}}}}
Here, :math:`s^2` is the pooled error variance estimate from the ANOVA fit, and the factor :math:`1-\frac{1}{n_{ij}}` reflects the variance reduction from using a sample mean as the fitted value. Standardized residuals should not show systematic structure if the mean model is appropriate and if variance is roughly constant.
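A minimal helper that computes these standardized residuals from a long-format data table is sketched below; the column names ``A``, ``B``, and ``y`` are hypothetical and would need to match the actual dataset:

.. code-block:: python

   import numpy as np
   import pandas as pd

   def standardized_residuals(data, factors=("A", "B"), response="y"):
       """Within-cell standardized residuals r_ijk for a full two-way fit."""
       grouped = data.groupby(list(factors))[response]
       fitted = grouped.transform("mean")        # fitted value = cell mean Y-bar_ij
       n_cell = grouped.transform("size")        # n_ij for each observation's cell
       resid = data[response] - fitted

       # Pooled error variance s^2 = SS_E / (N - a*b), the full-model MS_E.
       s2 = (resid ** 2).sum() / (len(data) - grouped.ngroups)
       return resid / np.sqrt(s2 * (1 - 1 / n_cell))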
Figure 13.3 uses simulated data to show what diagnostics look like when the fitted model matches the data-generating mechanism versus when a key structure is omitted. The data are simulated so that “truth” is known and patterns can be attributed to model mismatch rather than unknown field confounders. A “repetition” would mean re-running the same two-way design with new random errors; the figure shows one dataset because residual patterns can be read from a single realized study. In this figure, :math:`n` is the number of observations in each :math:`(i,j)` cell.
To read the diagnostic panels, first examine the residuals-versus-fitted plot and check for structure around the zero line. A random cloud with stable spread is consistent with a reasonable mean model and roughly constant variance, while curvature or changing spread suggests misspecification or heteroscedasticity. Next examine the normal probability (Q–Q) plot, where the empirical ordered residuals are compared to theoretical normal quantiles; strong S-shaped deviations or extreme tail departures suggest non-normality or outliers.
The main message is that omitted interaction can create systematic residual patterns even when the overall mean levels look plausible. When the fitted model is correct, increasing :math:`n` typically tightens the residual cloud (because :math:`s` is estimated more precisely) and reduces ambiguity in diagnosing structure. When the model is incorrect, increasing :math:`n` often makes the diagnostic problem more visible because the systematic pattern is less likely to be masked by noise.
In practice, this figure supports a two-part conclusion: “The ANOVA indicates (interaction / no interaction), and the residual diagnostics do (or do not) show violations serious enough to question the model-based inference.” If diagnostics suggest non-constant variance, a transformation or a variance-stabilizing approach may be appropriate. If diagnostics suggest strong non-normality driven by outliers, operational investigation of special-cause variation is necessary before final decisions.
Figure 13.3: Residual diagnostics (residuals versus fitted values and normal Q-Q plot) for a correctly specified two-way model and for a model with the interaction term omitted.
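A sketch of the two diagnostic panels is given below, using simulated data in which interaction is present but deliberately omitted from the fitted model; all names, effect sizes, and the replication level are hypothetical:

.. code-block:: python

   import numpy as np
   import pandas as pd
   import matplotlib.pyplot as plt
   import scipy.stats as stats
   from statsmodels.formula.api import ols

   # Hypothetical data generated WITH interaction, then fitted WITHOUT it,
   # to illustrate how an omitted interaction can show up in the residuals.
   rng = np.random.default_rng(4)
   rows = []
   for i in (1, 2, 3):
       for j in (1, 2):
           mu_ij = 50 + 2 * i + 3 * j + (6 if (i, j) == (3, 2) else 0)
           for y in rng.normal(mu_ij, 1.5, 8):
               rows.append({"A": f"a{i}", "B": f"b{j}", "y": y})
   data = pd.DataFrame(rows)

   additive = ols("y ~ C(A) + C(B)", data=data).fit()   # interaction omitted on purpose

   fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
   ax1.scatter(additive.fittedvalues, additive.resid)
   ax1.axhline(0.0, linestyle="--")
   ax1.set_xlabel("Fitted values")
   ax1.set_ylabel("Residuals")
   ax1.set_title("Residuals vs fitted (additive fit)")
   stats.probplot(additive.resid, dist="norm", plot=ax2)
   ax2.set_title("Normal Q-Q plot")
   plt.tight_layout()
   plt.show()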
13.4 Discussion and Common Errors
---------------------------------
A frequent error is interpreting main effects when there is clear interaction. Main effects are averages over the other factor, and averaging can hide reversals or performance trade-offs that matter operationally. When interaction is present, interpretation should focus on cell means and contrasts at specific factor combinations.
A second error is treating blocks as “another treatment factor” in interpretation. The block factor is introduced to reduce noise, not to create a managerial recommendation about blocks, unless the study explicitly targets those differences. Reporting should emphasize that the blocking variable was controlled to improve precision for the treatment comparison.
A third error is ignoring the additivity assumption in an RCB design with no replication. With one observation per :math:`(i,j)` cell, there is no direct way to estimate or test interaction, so the analysis relies on design knowledge and subject-matter judgment. If treatment-by-block interaction is plausible, replication or a different design is required.
A final error is treating residual checks as optional cosmetics. Residual patterns can indicate that the fitted mean structure is missing a key term (often interaction) or that the equal-variance condition fails. When diagnostics reveal issues, the correct response is to revise the model or redesign the study, not to “force” a conclusion from the original ANOVA table.
13.5 Summary
------------
This module extended one-way ANOVA to two-factor thinking in two forms: blocking designs and two-way ANOVA with interaction. Blocking (including repeated measures) is motivated by variance reduction and higher sensitivity for treatment comparisons. Two-way ANOVA introduces the interaction concept and a reporting order that prioritizes interaction assessment before main effects. Finally, residual diagnostics were presented as an essential check of mean-model adequacy and distributional conditions for ANOVA-based inference.
Example 13.1 (Randomized complete block: machines across operators)
-------------------------------------------------------------------
A manufacturing line is considering three assembly machines, and cycle time (seconds) is the response. Management expects operators to differ in baseline speed, so operators are used as blocks, and each operator runs each machine once in a randomized order. The goal is to compare machines after controlling for operator-to-operator variability.
**Question:** Is there evidence at the 5% level that mean cycle time differs among machines after accounting for operator blocks?
The data (cycle time in seconds) are summarized below, with one observation per machine within each operator block:
.. list-table::
   :header-rows: 1
   :widths: 18 16 16 16 16 16

   * - Machine
     - Op 1
     - Op 2
     - Op 3
     - Op 4
     - Op 5
   * - :math:`A_1`
     - 42.93
     - 48.13
     - 47.93
     - 50.68
     - 53.02
   * - :math:`A_2`
     - 46.80
     - 51.03
     - 51.29
     - 54.22
     - 51.43
   * - :math:`A_3`
     - 47.82
     - 50.11
     - 53.84
     - 54.64
     - 57.37
Because there is no replication within each machine-by-operator cell, the analysis uses the additive block model :math:`Y_{ij}=\mu+\alpha_i+\beta_j+\varepsilon_{ij}` and relies on the assumption that machine differences are consistent across operators. The treatment (machine) test compares :math:`MS_A` to :math:`MS_E`, where :math:`df_A=a-1=2` and :math:`df_E=(a-1)(b-1)=8`. For these data, the machine test statistic is :math:`F=9.50` with :math:`p\approx 0.0077`, indicating that the between-machine variation is large relative to the residual variation after blocking on operators.
**Answer:** Yes. At the 5% level, there is evidence that mean cycle time differs among machines after accounting for operator blocks, so management should compare machine means and practical effect sizes rather than treating machines as equivalent.
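A sketch of the corresponding computation, entering the table above and fitting the additive block model (statsmodels is assumed to be available); the machine row of the ANOVA table should reproduce the :math:`F=9.50` reported above up to rounding:

.. code-block:: python

   import pandas as pd
   from statsmodels.formula.api import ols
   from statsmodels.stats.anova import anova_lm

   # Cycle times from the table: 3 machines (rows) x 5 operator blocks (columns).
   times = {
       "A1": [42.93, 48.13, 47.93, 50.68, 53.02],
       "A2": [46.80, 51.03, 51.29, 54.22, 51.43],
       "A3": [47.82, 50.11, 53.84, 54.64, 57.37],
   }
   rows = [{"machine": m, "operator": f"Op{j + 1}", "time": y}
           for m, ys in times.items() for j, y in enumerate(ys)]
   data = pd.DataFrame(rows)

   # Additive RCB model: treatment (machine) + block (operator), no interaction term.
   model = ols("time ~ C(machine) + C(operator)", data=data).fit()
   print(anova_lm(model, typ=2))   # machine row: F on (2, 8) degrees of freedom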
Example 13.2 (Two-way ANOVA with interaction: supplier by temperature)
----------------------------------------------------------------------
A logistics team monitors package compression strength and suspects that both supplier (three suppliers) and storage temperature (two levels: low vs high) affect performance. Because temperature conditions vary across routes, management needs to know whether supplier differences are stable across temperatures. A balanced study is run with :math:`n=6` packages tested in each supplier-by-temperature cell.
**Question:** Is there evidence that the supplier effect depends on temperature, and how should the result be interpreted operationally?
The estimated cell means (strength units) are:
.. list-table::
   :header-rows: 1
   :widths: 22 20 20

   * - Supplier (:math:`A`)
     - Low temp (:math:`B_1`)
     - High temp (:math:`B_2`)
   * - :math:`A_1`
     - 91.35
     - 106.31
   * - :math:`A_2`
     - 100.16
     - 96.33
   * - :math:`A_3`
     - 100.79
     - 119.30
The appropriate first test is the interaction test :math:`H_0:(\alpha\beta)_{ij}=0` for all :math:`i,j`, using :math:`F_{AB}=MS_{AB}/MS_E`. For this dataset, :math:`F_{AB}\approx 27.03` with a very small p-value, which supports a practically important interaction. The mean pattern also shows non-parallel behavior: supplier :math:`A_2` weakens under high temperature while :math:`A_1` and :math:`A_3` strengthen, so a single “best supplier” statement without specifying temperature would be incomplete.
**Answer:** Yes. There is strong evidence of interaction, so supplier recommendations must be temperature-specific; reporting should compare suppliers separately at low and high temperature rather than relying on marginal main effects.
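The interaction pattern can be visualized directly from the reported cell means. The sketch below only draws the interaction plot, since the per-package raw data needed to recompute :math:`F_{AB}` are not reproduced in the table (matplotlib is assumed to be available):

.. code-block:: python

   import matplotlib.pyplot as plt

   # Estimated cell means from the table (strength units), n = 6 packages per cell.
   suppliers = ["A1", "A2", "A3"]
   low_temp = [91.35, 100.16, 100.79]
   high_temp = [106.31, 96.33, 119.30]

   plt.plot(suppliers, low_temp, marker="o", label="Low temp (B1)")
   plt.plot(suppliers, high_temp, marker="o", label="High temp (B2)")
   plt.xlabel("Supplier")
   plt.ylabel("Mean compression strength")
   plt.title("Non-parallel lines: supplier effect depends on temperature")
   plt.legend()
   plt.show()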
Example 13.3 (Repeated measures as blocking: same units under all treatments)
------------------------------------------------------------------------------
A service operation evaluates three interface designs for a picking application, and the response is task completion time. Because workers differ substantially in baseline speed, the study measures the same workers under all three interfaces, with randomized order to reduce learning effects. This design is repeated measures and can be analyzed as a randomized complete block design with workers as blocks.
**Question:** Why can this repeated-measures design be more precise than assigning different workers to different interfaces?
The repeated-measures comparison focuses on within-worker differences, which remove a large portion of worker-to-worker baseline variability. If measurements under two interfaces on the same worker are positively correlated, then the variance of the difference is reduced by the covariance term in :math:`\mathrm{Var}(Y_i-Y_j)=\mathrm{Var}(Y_i)+\mathrm{Var}(Y_j)-2\mathrm{Cov}(Y_i,Y_j)`. In operational terms, each worker acts as their own control, so the interface comparison is less contaminated by differences in experience, dexterity, or speed.
**Answer:** The repeated-measures (blocked) design can be more precise because it subtracts out stable worker-specific effects and often yields smaller standard errors for interface differences than an unblocked design with independent groups.
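A minimal sketch of this precision argument compares the paired (repeated-measures) standard error with the standard error obtained by treating the same numbers as two independent groups; for simplicity only two of the three interfaces are simulated, and the worker baseline spread and interface effect are hypothetical:

.. code-block:: python

   import numpy as np

   # Hypothetical workers with a strong shared baseline speed effect.
   rng = np.random.default_rng(5)
   n_workers = 12
   baseline = rng.normal(60.0, 8.0, n_workers)                   # worker-to-worker variation
   time_ui1 = baseline + rng.normal(0.0, 2.0, n_workers)
   time_ui2 = baseline - 3.0 + rng.normal(0.0, 2.0, n_workers)   # interface 2 is faster

   # Within-worker (blocked / repeated-measures) comparison.
   diffs = time_ui1 - time_ui2
   se_paired = diffs.std(ddof=1) / np.sqrt(n_workers)

   # Unblocked comparison: as if the two columns came from different workers.
   se_unpaired = np.sqrt(time_ui1.var(ddof=1) / n_workers
                         + time_ui2.var(ddof=1) / n_workers)

   print(se_paired)     # typically much smaller: baseline cancels within workers
   print(se_unpaired)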