10. Hypothesis Testing III: Two-Sample Tests for Differences in Means, Proportions, and Variability
====================================================================================================

10.0 Notation Table
-------------------

.. list-table::
   :header-rows: 1
   :widths: 22 78

   * - Symbol
     - Meaning
   * - :math:`X_{1i},\,X_{2j}`
     - Observations from population 1 and 2
   * - :math:`n_1,\,n_2`
     - Sample sizes (group 1, group 2)
   * - :math:`\mu_1,\,\mu_2`
     - Population means
   * - :math:`\sigma_1^2,\,\sigma_2^2`
     - Population variances
   * - :math:`\bar{X}_1,\,\bar{X}_2`
     - Sample means
   * - :math:`S_1^2,\,S_2^2`
     - Sample variances
   * - :math:`\Delta=\mu_1-\mu_2`
     - Mean difference (parameter of interest)
   * - :math:`\Delta_0`
     - Null value for mean difference
   * - :math:`S_p^2`
     - Pooled variance estimator (equal-variance model)
   * - :math:`t`
     - Student :math:`t` distribution
   * - :math:`\nu`
     - Degrees of freedom
   * - :math:`(D_1,\dots,D_n)`
     - Paired differences (within-pair)
   * - :math:`\bar{D},\,S_D`
     - Mean and SD of paired differences
   * - :math:`p_1,\,p_2`
     - Population proportions
   * - :math:`\hat{p}_1,\,\hat{p}_2`
     - Sample proportions
   * - :math:`\hat{p}`
     - Pooled proportion under :math:`H_0: p_1=p_2`
   * - :math:`z`
     - Standard Normal test statistic
   * - :math:`F`
     - :math:`F` distribution
   * - :math:`\alpha`
     - Significance level (Type I error bound)
   * - :math:`\beta`
     - Type II error probability at a specified alternative
   * - :math:`1-\beta`
     - Power at a specified alternative

10.1 Introduction
-----------------

In the previous hypothesis testing modules, the central workflow was emphasized: state :math:`H_0` and :math:`H_a`, choose a test statistic, compute a P-value under :math:`H_0`, and compare it to :math:`\alpha`. That workflow remains unchanged in two-sample problems, but the modeling choices become more important because the data can be independent samples, paired measurements, or binary outcomes.
This module extends one-sample testing to the two-sample setting. The main challenge is selecting a procedure that matches the data collection design and the inferential target, such as a mean difference, a proportion difference, or a variance ratio.

10.2 Learning Outcomes
----------------------

After completing this session, students should be able to:

- Identify whether a two-sample problem is *independent-samples* or *paired* based on the sampling design.
- Test :math:`H_0:\mu_1-\mu_2=\Delta_0` using either a pooled two-sample :math:`t` test (equal variances) or Welch’s :math:`t` test (unequal variances).
- Conduct a paired :math:`t` test by transforming paired data into one-sample differences.
- Test :math:`H_0:p_1-p_2=0` using a two-sample proportion :math:`z` test under large-sample conditions.
- Test :math:`H_0:\sigma_1^2/\sigma_2^2=1` using an :math:`F` test, and state its sensitivity to non-Normal data.
- Justify a procedure choice using assumptions, sample size, and the operational meaning of Type I error and power.

10.3 Main Concepts
------------------

10.3.1 Two-Sample Mean Testing (Independent Samples)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In independent-samples problems, the two groups represent different experimental units. Typical examples include two suppliers, two production lines, or two shift teams, where no unit is measured twice. The target parameter is the mean difference :math:`\Delta=\mu_1-\mu_2`, and a common null hypothesis is :math:`H_0:\Delta=\Delta_0` (often :math:`\Delta_0=0`).
Two practical versions are used when population variances are unknown. The pooled two-sample :math:`t` test assumes equal population variances, while Welch’s test does not. Both procedures use the same numerator :math:`\bar{X}_1-\bar{X}_2-\Delta_0`, but they use different standard errors and degrees of freedom.
**Pooled two-sample** :math:`t` **test (equal variances)**
Assume independent samples from approximately Normal populations, and :math:`\sigma_1^2=\sigma_2^2=\sigma^2`. The pooled variance estimator is

.. math::

   S_p^2=\frac{(n_1-1)S_1^2+(n_2-1)S_2^2}{n_1+n_2-2}

The test statistic for :math:`H_0:\Delta=\Delta_0` is

.. math::

   T_0=\frac{(\bar{X}_1-\bar{X}_2)-\Delta_0}{S_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}}

Under :math:`H_0`, :math:`T_0` follows a :math:`t` distribution with :math:`\nu=n_1+n_2-2` degrees of freedom. For small :math:`\nu`, this distribution has heavier tails than the standard Normal, which increases the critical value needed to control the Type I error.
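The pooled computation can be sketched in a few lines of Python using only the standard library; the sample values below are hypothetical, chosen purely to illustrate the arithmetic:

```python
import math
from statistics import mean, variance  # variance() is the sample variance S^2

def pooled_t(sample1, sample2, delta0=0.0):
    """Pooled two-sample t statistic and its degrees of freedom."""
    n1, n2 = len(sample1), len(sample2)
    sp2 = ((n1 - 1) * variance(sample1) + (n2 - 1) * variance(sample2)) / (n1 + n2 - 2)
    se = math.sqrt(sp2) * math.sqrt(1 / n1 + 1 / n2)
    t0 = (mean(sample1) - mean(sample2) - delta0) / se
    return t0, n1 + n2 - 2

# Hypothetical measurements from two production lines (illustration only)
t0, dof = pooled_t([10.2, 9.8, 10.5, 10.1], [9.5, 9.9, 9.7, 9.4])
```

The returned :math:`T_0` would then be compared against a :math:`t` distribution with ``dof`` degrees of freedom, exactly as in the formula above.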
**Welch** :math:`t` **test (unequal variances)**
Assume independent samples, with approximate Normality, but do not impose :math:`\sigma_1^2=\sigma_2^2`. The Welch test statistic is

.. math::

   T_0^*=\frac{(\bar{X}_1-\bar{X}_2)-\Delta_0}{\sqrt{\frac{S_1^2}{n_1}+\frac{S_2^2}{n_2}}}

An approximate degrees of freedom is

.. math::

   \nu \approx \frac{\left(\frac{S_1^2}{n_1}+\frac{S_2^2}{n_2}\right)^2}{\frac{\left(\frac{S_1^2}{n_1}\right)^2}{n_1-1}+\frac{\left(\frac{S_2^2}{n_2}\right)^2}{n_2-1}}

This approximation adapts to both sample sizes and sample variances. When one group has much higher variability or a smaller sample size, :math:`\nu` can be much smaller than :math:`n_1+n_2-2`, leading to more conservative critical values for the same :math:`\alpha`.
**Figure 10.1 narrative (m10_two_sample_t)**
This figure uses simulated data to visualize why a :math:`t` reference distribution is used when variability is estimated from the sample. The data are simulated because the goal is to repeat the same sampling-and-testing process many times under a known truth, which is required to study long-run behavior. In this figure, a “repetition” means generating two new independent samples, computing the pooled two-sample :math:`t` statistic, and storing the resulting value; :math:`n` denotes the per-group sample size in each repetition.
To read the figure, first select a value of :math:`n` using the dropdown and then focus on the histogram of simulated :math:`t` values. Next compare the histogram to the smooth reference curve, which represents the theoretical :math:`t` density under :math:`H_0`. The histogram is empirical (simulation output), while the smooth curve is theoretical (the model-based reference distribution).
The main message is that the empirical distribution approaches the theoretical :math:`t` curve when the model assumptions hold, and that the distribution becomes more concentrated as :math:`n` increases. When :math:`n` is small, the curve has heavier tails because the degrees of freedom :math:`\nu=n_1+n_2-2` are small, and the estimated standard error is more variable. When :math:`n` is large, the tails shrink and the distribution becomes closer to the standard Normal shape, which explains why :math:`t`-based procedures and :math:`z`-based procedures become similar in large samples.
The purpose of the figure is to connect the formula for :math:`T_0` to the idea of repeated sampling under :math:`H_0`. The point to take from the plot is obtained by matching the histogram shape to the reference curve and by observing the change in spread across :math:`n`. This reading supports correct interpretation of a P-value as a tail probability under the reference distribution for the chosen test statistic.

Example 10.1 (Independent samples, Welch two-sample :math:`t` test)
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

A factory compares average cycle time (minutes) between Line A and Line B for the same product family. The lines are run by different operators, so the samples are independent, and the observed variability differs across lines. Management wants evidence that Line A is faster on average.
**Question:** Using :math:`\alpha=0.05`, test :math:`H_0:\mu_A-\mu_B=0` versus :math:`H_a:\mu_A-\mu_B<0` given :math:`n_1=12`, :math:`\bar{x}_1=52.1`, :math:`s_1=4.8` for Line A and :math:`n_2=10`, :math:`\bar{x}_2=55.4`, :math:`s_2=6.1` for Line B.
Because the standard deviations are noticeably different and the sample sizes are moderate, Welch’s test is appropriate. The test statistic is

.. math::

   t^*=\frac{52.1-55.4}{\sqrt{\frac{4.8^2}{12}+\frac{6.1^2}{10}}}\approx -1.39

The degrees of freedom are computed by the Welch approximation, giving :math:`\nu\approx 17`. The one-sided P-value is :math:`P(T_{\nu}\le -1.39)\approx 0.092`, which is larger than :math:`0.05`.
**Answer:** Do not reject :math:`H_0` at :math:`\alpha=0.05`. The sample suggests Line A may be faster, but the evidence is not strong enough to conclude a mean reduction in cycle time under the chosen Type I error bound.
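The Welch statistic and degrees of freedom in Example 10.1 can be reproduced directly from the summary statistics with a short standard-library script:

```python
import math

def welch_from_summary(n1, xbar1, s1, n2, xbar2, s2, delta0=0.0):
    """Welch t statistic and Welch-Satterthwaite approximate degrees of freedom."""
    v1, v2 = s1**2 / n1, s2**2 / n2          # per-group variance of the sample mean
    t = (xbar1 - xbar2 - delta0) / math.sqrt(v1 + v2)
    nu = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, nu

# Summary statistics from Example 10.1 (Line A vs Line B)
t, nu = welch_from_summary(12, 52.1, 4.8, 10, 55.4, 6.1)
# t is about -1.39 and nu about 17, matching the worked example
```

The P-value would then be the lower tail of :math:`t_{\nu}` at the observed statistic, which in practice is read from a table or a statistics library.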

10.3.2 Paired Mean Testing (Paired :math:`t` Test)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A paired design occurs when the same unit is measured twice (before/after), or when units are matched into natural pairs (same batch, same machine, same operator). The key feature is that within-pair measurements tend to move together because they share a common baseline. In that setting, analyzing differences within each pair removes part of the baseline variation and can increase power.
Let :math:`D_i=X_{1i}-X_{2i}` be the difference for pair :math:`i`, for :math:`i=1,\dots,n`. Testing :math:`H_0:\mu_D=\Delta_0` is then a one-sample mean test on :math:`D_1,\dots,D_n`, with test statistic

.. math::

   T_0=\frac{\bar{D}-\Delta_0}{S_D/\sqrt{n}}

Under approximate Normality of the differences, :math:`T_0\sim t_{\nu}` with :math:`\nu=n-1` under :math:`H_0`. The interpretation is that all inference is about the mean of the within-pair differences, not about two unrelated population means.
**Figure 10.2 narrative (m10_paired_vs_independent)**
This figure is based on simulated data because the learning goal is to compare the long-run variability of two estimators under controlled conditions. A “repetition” means generating a fresh dataset, computing a mean difference for an independent-samples design and a mean difference for a paired design, and recording the resulting estimate. In this figure, :math:`n` denotes the number of units per group for the independent design and also the number of pairs for the paired design.
To read the figure, select the within-pair similarity setting using the dropdown and then compare the spreads of the two density curves. The “Independent samples” curve represents the sampling distribution of :math:`\bar{X}_1-\bar{X}_2` when the groups are unrelated, while the “Paired samples” curve represents the sampling distribution of :math:`\bar{D}` when two measurements are taken on the same units. Both curves are empirical because they are constructed from many repetitions, and their smoothness is due to plotting a density estimate of simulated outcomes.
The main message is that pairing can substantially reduce the spread of the sampling distribution when the two measurements in each pair are similar. Under stronger within-pair similarity, the paired curve becomes narrower, which corresponds to a smaller standard error :math:`S_D/\sqrt{n}` and typically higher power for detecting a nonzero mean difference. Under weak similarity, the paired and independent curves are closer, which indicates less benefit from pairing.
The purpose of the figure is to operationalize the phrase “pairing removes baseline unit-to-unit variation.” The point is obtained by comparing how concentrated the paired sampling distribution is relative to the independent one under the same :math:`n`. This supports a correct procedure choice: if the study design truly creates pairs, then a paired analysis targets the correct parameter and can improve sensitivity.
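The message of Figure 10.2 can be checked with a small simulation sketch (standard library only; the correlation strength, group size, and repetition count below are illustrative assumptions, not values from the figure):

```python
import random
import statistics

random.seed(1)
n, reps = 20, 2000
rho = 0.9  # shared-baseline strength: higher rho means stronger within-pair similarity

indep_diffs, paired_diffs = [], []
for _ in range(reps):
    # Paired design: both measurements on a unit share that unit's baseline
    base = [random.gauss(0, 1) for _ in range(n)]
    x1 = [b * rho + random.gauss(0, (1 - rho**2) ** 0.5) for b in base]
    x2 = [b * rho + random.gauss(0, (1 - rho**2) ** 0.5) for b in base]
    paired_diffs.append(statistics.mean(a - b for a, b in zip(x1, x2)))
    # Independent design: two unrelated samples of the same size
    y1 = [random.gauss(0, 1) for _ in range(n)]
    y2 = [random.gauss(0, 1) for _ in range(n)]
    indep_diffs.append(statistics.mean(y1) - statistics.mean(y2))

# Spread of the two sampling distributions: pairing should be narrower here
sd_paired = statistics.stdev(paired_diffs)
sd_indep = statistics.stdev(indep_diffs)
```

With a strong shared baseline, differencing cancels the baseline term, so the paired estimator's spread is visibly smaller than the independent one; setting ``rho`` near 0 makes the two spreads comparable, mirroring the dropdown in the figure.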

Example 10.2 (Paired :math:`t` test for a process improvement)
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

A maintenance team evaluates a new calibration routine on the same eight machines. For each machine, vibration level is measured before and after calibration, and the analysis focuses on the within-machine change. The pairing is essential because machines differ substantially in their baseline vibration.
**Question:** Let :math:`D=\text{after}-\text{before}` (negative means improvement). Using :math:`\alpha=0.05`, test :math:`H_0:\mu_D=0` versus :math:`H_a:\mu_D<0` for differences (in mm/s):
:math:`-2.1,\,-1.5,\,-0.8,\,-3.0,\,-1.2,\,-2.4,\,-0.5,\,-1.9`.
The sample mean and standard deviation of differences are :math:`\bar{d}\approx -1.675` and :math:`s_D\approx 0.838`. The test statistic is

.. math::

   t=\frac{-1.675}{0.838/\sqrt{8}}\approx -5.65

Under :math:`H_0`, the reference distribution is :math:`t_{\nu}` with :math:`\nu=7`. The one-sided P-value is far below :math:`0.05`, so the observed result is very unlikely if the mean change were truly zero.
**Answer:** Reject :math:`H_0`. There is strong evidence that the calibration reduces vibration on average, and the conclusion is specifically about the mean within-machine change.
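Because the paired test is just a one-sample test on the differences, the arithmetic of Example 10.2 takes only a few lines:

```python
import math
from statistics import mean, stdev

# Within-machine differences (after - before, mm/s) from Example 10.2
d = [-2.1, -1.5, -0.8, -3.0, -1.2, -2.4, -0.5, -1.9]

dbar, sd = mean(d), stdev(d)            # sample mean and SD of the differences
t = dbar / (sd / math.sqrt(len(d)))     # one-sample t statistic on D_1, ..., D_n
dof = len(d) - 1
# dbar is about -1.675, sd about 0.838, t about -5.65 with 7 degrees of freedom
```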

10.3.3 Two-Sample Proportion Testing (Binary Outcomes)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For binary outcomes, each observation is a success/failure indicator such as “defective / not defective” or “on-time / late.” The inferential target is the difference in population proportions :math:`p_1-p_2`. A standard null hypothesis is :math:`H_0:p_1-p_2=0` (equivalently :math:`p_1=p_2`).
Under :math:`H_0:p_1=p_2`, the pooled proportion estimator is

.. math::

   \hat{p}=\frac{x_1+x_2}{n_1+n_2}

where :math:`x_1` and :math:`x_2` are the observed counts of successes in the two samples. The large-sample :math:`z` statistic is

.. math::

   Z_0=\frac{\hat{p}_1-\hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}}

This approximation is justified when each sample has enough expected successes and failures. A common operational rule is that :math:`n_i\hat{p}_i` and :math:`n_i(1-\hat{p}_i)` should each be at least about 5 to 10 in *each* group, so that the Normal approximation to the sampling distribution of :math:`\hat{p}_1-\hat{p}_2` is reasonable.
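A minimal sketch of this operational check follows; the threshold of 10 is one common textbook choice (some texts use 5), not a universal standard:

```python
def large_sample_ok(x1, n1, x2, n2, threshold=10):
    """Check that observed successes and failures are adequate in each group."""
    for x, n in ((x1, n1), (x2, n2)):
        p_hat = x / n
        if n * p_hat < threshold or n * (1 - p_hat) < threshold:
            return False
    return True

# Counts from Example 10.3: 18 defects in 120 items vs 12 defects in 150 items
ok = large_sample_ok(18, 120, 12, 150)
```

When the check fails, the :math:`z` P-value may not match the intended Type I error bound, and an exact or simulation-based procedure would be preferable.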
**Figure 10.3 narrative (m10_two_prop_z)**
This figure uses simulated binomial data because the goal is to verify when the :math:`z` reference distribution is a good approximation for a two-proportion test. A “repetition” means generating new counts of successes for both groups under the same underlying proportion, computing the pooled :math:`z` statistic, and storing the value. In this figure, :math:`n` denotes the per-group sample size used in each repetition.
To read the figure, select :math:`n` and then inspect the histogram of simulated :math:`z` values. Compare the histogram to the smooth standard Normal reference curve, which represents the theoretical approximation used to compute P-values in practice. The histogram is empirical (from repeated sampling), while the curve is theoretical (model-based reference).
The main message is that the approximation improves as :math:`n` increases because the sampling distribution of :math:`\hat{p}_1-\hat{p}_2` becomes more Normal and the pooled standardization becomes more stable. When :math:`n` is small, discreteness and skewness can cause visible deviations from the Normal curve, which can distort the true Type I error rate. When :math:`n` is large, the histogram aligns closely with the reference curve, supporting the usual :math:`z`-based P-value calculation.
The purpose of the figure is to justify the large-sample conditions as a *distributional* requirement rather than a memorized rule. The point is obtained by observing whether the empirical histogram matches the reference curve well enough for the chosen :math:`n`. This supports correct decision-making about whether a two-proportion :math:`z` test is appropriate for a given dataset.

Example 10.3 (Two-sample proportion :math:`z` test for defect rates)
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

A quality engineer compares defect rates from Supplier 1 and Supplier 2 for the same component specification. From Supplier 1, :math:`18` defects are observed in :math:`120` inspected items, and from Supplier 2, :math:`12` defects are observed in :math:`150` items. The operational question is whether Supplier 1 is worse.
**Question:** Using :math:`\alpha=0.05`, test :math:`H_0:p_1-p_2=0` versus :math:`H_a:p_1-p_2>0`.
The sample proportions are :math:`\hat{p}_1=18/120=0.15` and :math:`\hat{p}_2=12/150=0.08`. Under :math:`H_0`, the pooled proportion is :math:`\hat{p}=(18+12)/(120+150)=0.1111`, so the standard error is

.. math::

   \sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{120}+\frac{1}{150}\right)}\approx 0.0385

The test statistic is :math:`z\approx (0.15-0.08)/0.0385\approx 1.82`. The one-sided P-value :math:`P(Z\ge 1.82)\approx 0.034`, which is below :math:`0.05`.
**Answer:** Reject :math:`H_0`. There is statistically significant evidence that Supplier 1 has a higher defect proportion than Supplier 2 at the 5% level.
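The full calculation for Example 10.3, including the one-sided P-value from the standard Normal CDF (available in the standard library via ``math.erf``), can be sketched as:

```python
import math

# Counts from Example 10.3 (Supplier 1 vs Supplier 2)
x1, n1, x2, n2 = 18, 120, 12, 150

p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)                       # pooled proportion under H0
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se                                   # about 1.82

# One-sided upper-tail P-value, P(Z >= z), via the standard Normal CDF
p_value = 0.5 * (1 - math.erf(z / math.sqrt(2)))     # about 0.034
```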

10.3.4 Testing the Ratio of Variances (Variability Comparison)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Sometimes the operational target is variability rather than the mean, such as consistency of fill weights, stability of throughput, or dispersion of lead times. A classical model-based approach compares two Normal population variances using the ratio :math:`\sigma_1^2/\sigma_2^2`. The null hypothesis is often :math:`H_0:\sigma_1^2/\sigma_2^2=1`, with two-sided or one-sided alternatives depending on whether “more variable” is directional.
If both populations are Normal and samples are independent, the test statistic

.. math::

   F_0=\frac{S_1^2}{S_2^2}

has an :math:`F` distribution under :math:`H_0` with degrees of freedom :math:`(n_1-1,\ n_2-1)`. The :math:`F` distribution is supported on :math:`F\ge 0` and is typically right-skewed, so two-sided testing uses both a lower-tail and an upper-tail critical region.
For a two-sided test at level :math:`\alpha`, reject :math:`H_0` if

- :math:`F_0 \le f_{\alpha/2}(n_1-1,\ n_2-1)` (lower tail), or
- :math:`F_0 \ge f_{1-\alpha/2}(n_1-1,\ n_2-1)` (upper tail)

where :math:`f_q(u,v)` denotes the :math:`q`-quantile of an :math:`F(u,v)` distribution. Because the distribution is asymmetric, practical computation often uses the identity that the lower critical value can be written as :math:`1/f_{1-\alpha/2}(n_2-1,\ n_1-1)`.
**Figure 10.4 narrative (m10_f_distribution_variance_ratio)**
This figure uses theoretical :math:`F` densities rather than simulated histograms because the main learning goal is geometric: locating critical regions on an asymmetric reference distribution. A “repetition” is not required here, since the curve itself represents the reference distribution used to compute tail probabilities under :math:`H_0`. In this figure, :math:`n` is implicit through the degrees of freedom :math:`(n_1-1,\ n_2-1)`, which are selected by the dropdown.
To read the figure, first select the degrees of freedom and then identify the two vertical lines marking the lower and upper critical values for a two-sided test at a fixed :math:`\alpha`. Next interpret the shaded tails as the rejection regions that together have total area :math:`\alpha`. The density curve is theoretical, and the shaded regions are also theoretical because they represent reference tail probabilities under :math:`H_0`.
The main message is that asymmetry matters: the upper critical value can be far from 1 even when the lower critical value is close to 0. As the degrees of freedom increase, the distribution becomes less skewed and concentrates more around 1, which makes variance-ratio evidence easier to interpret. With small degrees of freedom, the distribution is highly skewed, and modest changes in the variance ratio can occur by chance, which reduces power unless the variance difference is large.
The purpose of the figure is to connect a computed statistic :math:`F_0=S_1^2/S_2^2` to the correct two-sided rejection rule. The point is obtained by checking whether an observed :math:`F_0` would fall in either shaded tail for the chosen degrees of freedom. This also reinforces an important limitation: if Normality is doubtful, the :math:`F` test can be unreliable, so the procedure choice must be justified by the data-generating process.

Example 10.4 (:math:`F` test for comparing process variability)
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

Two packaging machines fill the same product, and variability in fill weight affects downstream rework. A small audit collects independent measurements from both machines under stable operating conditions. The engineer wants to know whether the variability differs, under an approximate Normal model for fill weights.
**Question:** Using :math:`\alpha=0.05`, test :math:`H_0:\sigma_1^2/\sigma_2^2=1` versus :math:`H_a:\sigma_1^2/\sigma_2^2\ne 1` given :math:`n_1=15`, :math:`s_1=2.6` and :math:`n_2=12`, :math:`s_2=1.8`.
The test statistic is :math:`F_0=s_1^2/s_2^2=(2.6^2)/(1.8^2)\approx 2.09`, with degrees of freedom :math:`(14,11)` under :math:`H_0`. For a two-sided test, :math:`F_0` must fall in either a very small lower tail or a large upper tail, because the :math:`F` distribution is not symmetric around 1. The observed value :math:`2.09` is not extreme relative to typical two-sided critical values for these degrees of freedom.
**Answer:** Do not reject :math:`H_0` at :math:`\alpha=0.05`. The data do not provide statistically significant evidence of different variances, although the sample variance ratio is above 1.
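The standard library has no :math:`F` quantile function, but the two-sided rule in Example 10.4 can be sketched by simulating the :math:`F(14, 11)` reference distribution, using the fact that a chi-square variable is a Gamma variable with shape :math:`\text{df}/2` and scale 2; the simulation size and seed are arbitrary choices:

```python
import random

random.seed(2)
n1, n2, s1, s2, alpha = 15, 12, 2.6, 1.8, 0.05
f0 = s1**2 / s2**2                  # observed variance ratio, about 2.09
u, v = n1 - 1, n2 - 1               # degrees of freedom (14, 11)

def chi2(df):
    """One chi-square draw via Gamma(df/2, scale=2)."""
    return random.gammavariate(df / 2, 2.0)

# Monte Carlo approximation of the F(u, v) reference distribution under H0
draws = sorted((chi2(u) / u) / (chi2(v) / v) for _ in range(100_000))
lower = draws[int(len(draws) * alpha / 2)]         # approx f_{alpha/2}(14, 11)
upper = draws[int(len(draws) * (1 - alpha / 2))]   # approx f_{1-alpha/2}(14, 11)

reject = f0 <= lower or f0 >= upper                # two-sided decision rule
```

The simulated upper critical value sits well above the observed ratio of about 2.09, which is consistent with the "do not reject" conclusion in the example.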

10.3.5 Choosing Procedures (A Decision Framework)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Procedure choice should begin with the data structure, not with a formula. The same numeric outcome can lead to a paired test or an independent two-sample test depending on whether the same units are measured twice. The inferential target must match the design, and the Type I error bound :math:`\alpha` must be interpreted as a long-run error rate for the chosen procedure.
Use the following selection rules:

- If the outcome is quantitative and the two groups are different units, use an independent-samples mean test for :math:`\mu_1-\mu_2`. Prefer Welch’s test unless a strong equal-variance justification exists.
- If the outcome is quantitative and each unit provides two measurements (or matched pairs exist), analyze differences :math:`D_i` and use a paired :math:`t` test for :math:`\mu_D`.
- If the outcome is binary, test :math:`p_1-p_2` with a two-sample proportion :math:`z` test only when large-sample conditions support a Normal approximation.
- If the outcome target is variability, use an :math:`F` test only under plausible Normal sampling; otherwise, treat the result as model-sensitive and consider more robust variability assessments in later study.
Power considerations clarify why design and :math:`n` matter. For fixed :math:`\alpha`, larger :math:`n` generally reduces standard errors and increases the probability of rejecting :math:`H_0` when the alternative is true. Pairing can also increase power by reducing the variance of the within-pair difference, but only when the pairing reflects real shared baseline factors.

10.4 Discussion and Common Errors
---------------------------------

A frequent error is assuming that having *two groups* automatically means having *two independent samples*. Two groups can still be paired if each unit appears in both conditions, and treating paired data as independent typically inflates the standard error and reduces power. A related error is interpreting a paired result as if it compared two unrelated population means rather than the mean of the within-pair differences.
Equal-variance pooling is another common failure point. Using the pooled two-sample :math:`t` test without justification can distort the Type I error rate when group variances differ and sample sizes are unbalanced. Welch’s test is designed to be stable under unequal variances and is often the default choice in applied work.
For two-proportion tests, the Normal approximation can be misleading when counts of successes or failures are small. In that case, the P-value from a :math:`z` test may not match the intended long-run Type I error bound, so the conclusion should be treated cautiously.
For :math:`F` tests, non-Normality is a major concern because the sampling distribution of :math:`S^2` changes under heavy tails or skewness. Even when the mean is well-behaved, the variance ratio test can become overly sensitive or overly conservative, so the modeling assumption should be stated explicitly.

10.5 Summary
------------

This module introduced two-sample hypothesis tests for mean differences, paired mean differences, proportion differences, and variance ratios. The central workflow remains hypothesis specification, test statistic construction, and P-value interpretation under :math:`H_0`. The main new skill is selecting a procedure that matches the sampling design and the target parameter, while stating assumptions and understanding how :math:`n` and design choices affect power.