5. Estimation II: Confidence Intervals for Mean Differences \(\Delta=\mu_1-\mu_2\) and \(\mu_D\) (Pooled-\(t\), Welch-\(t\), and Paired-\(t\))

5.0 Notation Table

Symbol: Meaning

\(\mu_1,\ \mu_2\): Population means (group 1, group 2)

\(\sigma_1,\ \sigma_2\): Population standard deviations

\(n_1,\ n_2\): Sample sizes (independent samples)

\(\bar{X}_1,\ \bar{X}_2\): Sample means

\(S_1^2,\ S_2^2\): Sample variances

\(\Delta=\mu_1-\mu_2\): Difference of population means

\(\widehat{\Delta}=\bar{X}_1-\bar{X}_2\): Estimator of \(\Delta\)

\(S_p^2\): Pooled variance estimator (equal-variance case)

\(\nu\): Degrees of freedom (df)

\(t_{\alpha/2,\nu}\): Upper \(\alpha/2\) critical value of \(t(\nu)\)

\(D_i\): Paired difference for pair \(i\)

\(\mu_D\): Mean of paired differences

\(\bar{D},\ S_D\): Mean and standard deviation of \(D_i\)

\(\rho\): Correlation within pairs (paired design)

5.1 Introduction

In the previous module, the inference target was a single population mean \(\mu\), and the main method was a confidence interval based on a standard error and a critical value. In practice, operational decisions often compare two conditions, such as two suppliers, two machines, or two process settings.

This module focuses on confidence intervals for a difference in means, with target \(\Delta=\mu_1-\mu_2\). Two experimental structures are emphasized: independent samples (two separate groups) and paired observations (two measurements on the same experimental unit).

5.2 Learning Outcomes

After this session, you should be able to:

  • State the parameter of interest \(\Delta=\mu_1-\mu_2\) for independent samples and \(\mu_D\) for paired data

  • Construct a confidence interval for \(\mu_1-\mu_2\) when variances are unknown, under equal-variance and unequal-variance assumptions

  • Construct a confidence interval for \(\mu_D\) using paired differences

  • Explain how design choice (independent vs paired) changes the standard error and the interpretation

  • Identify common design and interpretation errors when reporting a two-mean confidence interval

5.3 Main Concepts

5.3.0 The common structure of a confidence interval

A confidence interval for a mean difference follows the same template used for one-mean estimation: estimate \(\pm\) (critical value) \(\times\) (standard error). The interpretation is also parallel: under repeated sampling from the same data-generating process, \(100(1-\alpha)\%\) of such intervals will contain the true parameter.

For mean differences, the standard error is the main object that changes across methods and designs. It depends on whether samples are independent or paired, and on whether the two variances are treated as equal or not.

5.3.1 Independent samples: estimating \(\Delta=\mu_1-\mu_2\)

Assume two independent random samples are drawn from two populations, and the goal is to estimate \(\Delta=\mu_1-\mu_2\). The natural point estimator is the difference of sample means, \(\widehat{\Delta}=\bar{X}_1-\bar{X}_2\), and its variability comes from sampling variation in both groups.

When population standard deviations are treated as known (or when large-sample approximations are explicitly justified), the standard error is

\[\mathrm{SE}(\bar{X}_1-\bar{X}_2)=\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}\]

This expression shows how \(n_1\) and \(n_2\) reduce uncertainty through averaging. It also shows why the two groups both matter: even if one group is measured precisely, the other group can dominate the standard error if it is small or highly variable.

Figure 5.1 visualizes how the sampling distribution of \(\bar{X}_1-\bar{X}_2\) changes with sample size. It uses simulated data because repeated sampling from real operations is usually costly, and simulation allows us to isolate the role of \(n\) while holding the population features fixed. In this figure, “repetition” means repeating the full two-sample experiment many times under the same population means and standard deviations, and \(n\) means the per-group sample size with \(n_1=n_2=n\).

To read the figure, first use the dropdown to select a sample size and then look at the histogram as an empirical distribution of \(\bar{X}_1-\bar{X}_2\). Next compare the histogram to the smooth reference curve, which represents the theoretical Normal approximation for the same mean difference and standard error. The histogram is the empirical object produced by simulation, while the curve is the theoretical reference for comparison.

The main message is that the distribution becomes more concentrated as \(n\) increases, which corresponds to a decreasing standard error. A narrower sampling distribution implies that confidence intervals based on that standard error will also become shorter, holding the confidence level fixed. The operational implication is that increasing sample size improves precision, even when the estimated difference in means stays similar.
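The repetition loop behind Figure 5.1 can be sketched in a few lines of Python. This is a minimal illustration, assuming NumPy is available; `simulate_diff_means` is a hypothetical helper name, not part of the module's materials.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_diff_means(mu1, mu2, sigma1, sigma2, n, reps=20_000):
    """Empirical sampling distribution of X̄1 - X̄2 with n1 = n2 = n."""
    # Each row is one full two-sample experiment; average within rows.
    x1 = rng.normal(mu1, sigma1, size=(reps, n)).mean(axis=1)
    x2 = rng.normal(mu2, sigma2, size=(reps, n)).mean(axis=1)
    return x1 - x2

for n in (5, 20, 80):
    diffs = simulate_diff_means(100.0, 95.0, 4.0, 4.0, n)
    theory_se = np.sqrt(4.0**2 / n + 4.0**2 / n)
    print(f"n={n:3d}  empirical SD={diffs.std(ddof=1):.3f}  theoretical SE={theory_se:.3f}")
```

The empirical standard deviation of the simulated differences should track the theoretical standard error \(\sqrt{\sigma_1^2/n+\sigma_2^2/n}\), shrinking as \(n\) grows, which is exactly the concentration the figure displays.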

5.3.2 Unknown but equal variances: pooled-\(t\) interval for \(\Delta\)

In many applications, \(\sigma_1\) and \(\sigma_2\) are unknown, and uncertainty must be estimated from the data. If it is reasonable to assume approximately Normal sampling and equal population variances, \(\sigma_1^2=\sigma_2^2=\sigma^2\), then a pooled estimate of the common variance is used.

The pooled variance estimator is a degrees-of-freedom weighted average of the two sample variances:

\[S_p^2=\frac{(n_1-1)S_1^2+(n_2-1)S_2^2}{n_1+n_2-2}\]

Using \(S_p\), the standard error becomes \(S_p\sqrt{1/n_1+1/n_2}\), and the confidence interval uses a \(t\) critical value with \(\nu=n_1+n_2-2\) degrees of freedom. Under approximately Normal sampling and independent samples, a \(100(1-\alpha)\%\) confidence interval for \(\Delta=\mu_1-\mu_2\) is

\[(\bar{x}_1-\bar{x}_2)\ \pm\ t_{\alpha/2,\nu}\,S_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}\]

This method is efficient when the equal-variance assumption is appropriate. It can be misleading when the variances are substantially different, especially when the sample sizes are also quite different.

Example 5.1 (Independent samples, equal-variance pooled \(t\))

A manufacturing team compares mean fill weight under two calibrated machine settings to decide whether the settings produce meaningfully different average output. Two independent samples are collected under stable production, and the measurement unit is grams. Historical engineering knowledge suggests variability is similar under the two settings because the same feeder and scale are used.

Question: Construct a 95% confidence interval for \(\mu_1-\mu_2\) using a pooled-\(t\) method, given \(n_1=10,\ \bar{x}_1=101.2,\ s_1=4.1\) and \(n_2=12,\ \bar{x}_2=96.8,\ s_2=3.8\).

The point estimate is \(\bar{x}_1-\bar{x}_2=4.4\). The pooled variance is \(S_p^2=((n_1-1)s_1^2+(n_2-1)s_2^2)/(n_1+n_2-2)\), so \(S_p\approx 3.94\), and the standard error is \(S_p\sqrt{1/n_1+1/n_2}\approx 1.69\). With \(\nu=n_1+n_2-2=20\), the 95% critical value is \(t_{0.025,20}\approx 2.09\), which sets the margin of error.

Answer: The 95% confidence interval is approximately \((0.88,\ 7.92)\) grams. This interval suggests the mean fill weight under setting 1 is higher than under setting 2 by about 1 to 8 grams, under the stated sampling and equal-variance conditions.
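The pooled-\(t\) computation in Example 5.1 can be reproduced directly from the formulas above. The sketch below assumes SciPy is available; `pooled_t_ci` is an illustrative helper name.

```python
import math
from scipy import stats

def pooled_t_ci(xbar1, s1, n1, xbar2, s2, n2, conf=0.95):
    """Pooled-t CI for mu1 - mu2 under the equal-variance assumption."""
    nu = n1 + n2 - 2
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / nu  # pooled variance
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))           # standard error
    tcrit = stats.t.ppf(1 - (1 - conf) / 2, nu)       # t_{alpha/2, nu}
    d = xbar1 - xbar2
    return d - tcrit * se, d + tcrit * se

# Example 5.1 inputs:
lo, hi = pooled_t_ci(101.2, 4.1, 10, 96.8, 3.8, 12)
print(f"95% CI for mu1 - mu2: ({lo:.2f}, {hi:.2f})")  # ≈ (0.88, 7.92)
```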

5.3.3 Unknown and unequal variances: Welch (Satterthwaite) \(t\) interval

When the equal-variance assumption is not credible, a widely used alternative replaces the pooled standard error with an unpooled standard error. The interval is still centered at \(\bar{X}_1-\bar{X}_2\), but the standard error uses \(S_1^2\) and \(S_2^2\) separately:

\[\mathrm{SE}(\bar{X}_1-\bar{X}_2)=\sqrt{\frac{S_1^2}{n_1}+\frac{S_2^2}{n_2}}\]

The corresponding degrees of freedom \(\nu\) is approximated by the Satterthwaite formula, which may not be an integer. A common practice is to round \(\nu\) down to the nearest whole number to remain conservative:

\[\nu \approx \frac{\left(\frac{S_1^2}{n_1}+\frac{S_2^2}{n_2}\right)^2}{ \frac{\left(\frac{S_1^2}{n_1}\right)^2}{n_1-1}+\frac{\left(\frac{S_2^2}{n_2}\right)^2}{n_2-1} }\]

Under independent samples and approximately Normal sampling (or a careful large-sample justification), the approximate \(100(1-\alpha)\%\) interval is

\[(\bar{x}_1-\bar{x}_2)\ \pm\ t_{\alpha/2,\nu}\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}\]

Figure 5.2 uses simulation to evaluate the coverage of two confidence-interval procedures when the two populations have unequal variances. Coverage is the long-run proportion of intervals that contain the true parameter \(\Delta=\mu_1-\mu_2\) when the same study is repeated under identical conditions. The data are simulated because the true value of \(\Delta\) must be known to count “contains” versus “does not contain,” and simulation allows that truth to be fixed while thousands of repeated studies are generated.

In this figure, “repetition” means the following loop: generate two independent samples of sizes \(n_1\) and \(n_2\), compute a 95% confidence interval for \(\Delta\) using a specified method, and record whether the interval contains the true difference. This loop is repeated many times under the same population model, and the empirical coverage is the fraction of repetitions that succeed. For the version shown here, \(n_1=n_2=10\) is held fixed to isolate the effect of variance inequality, and the variance ratio \(\sigma_2^2/\sigma_1^2\) changes across dropdown options while the true mean difference is kept constant.

To read the figure, first use the dropdown to select a variance ratio and then compare the two method bars in that setting. Each bar is an empirical estimate of the probability that the method’s 95% confidence interval contains \(\Delta\), based on the simulated repetitions. The horizontal reference line at 0.95 represents the nominal target; bars near this line indicate that the procedure is delivering the advertised confidence level in this scenario. Small deviations from 0.95 can be difficult to see because coverage is a probability near 1 and simulation introduces random error, so the primary visual task is to compare whether one method is systematically closer to the 0.95 line across variance ratios.

The statistical message depends on the assumptions behind each method and on the design choice embedded in \(n_1\) and \(n_2\). The pooled-\(t\) interval assumes equal population variances, so its coverage can fall below 0.95 when that assumption is violated, meaning it can be overconfident in repeated use. The Welch interval adjusts the standard error and degrees of freedom for unequal variances and is intended to maintain coverage closer to 0.95 across a broader range of variance ratios. A critical nuance is that variance inequality alone may produce only modest visible changes when the design is balanced (\(n_1=n_2\)), whereas larger coverage distortions for the pooled method are most likely when variance inequality is combined with an unbalanced design (especially when the smaller sample comes from the higher-variance population). For decision-making in operations and quality settings, this figure motivates a practical rule: when equal variances cannot be justified from process knowledge or diagnostics, Welch’s interval is generally the safer default for reporting uncertainty in \(\mu_1-\mu_2\).
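The coverage loop described for Figure 5.2 can be sketched as follows, assuming SciPy is available (`coverage` is an illustrative helper name). Rather than the balanced \(n_1=n_2=10\) design of the figure, this sketch uses the unbalanced setting the paragraph flags as the worst case for the pooled method: the smaller sample comes from the higher-variance population.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def coverage(method, n1, n2, sigma1, sigma2, delta=0.0, reps=4000, conf=0.95):
    """Fraction of simulated CIs that contain the true difference delta."""
    hits = 0
    for _ in range(reps):
        x1 = rng.normal(delta, sigma1, n1)
        x2 = rng.normal(0.0, sigma2, n2)
        v1, v2 = x1.var(ddof=1), x2.var(ddof=1)
        d = x1.mean() - x2.mean()
        if method == "pooled":
            nu = n1 + n2 - 2
            sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / nu
            se = np.sqrt(sp2 * (1 / n1 + 1 / n2))
        else:  # Welch: unpooled SE with Satterthwaite df
            se = np.sqrt(v1 / n1 + v2 / n2)
            nu = se**4 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
        t = stats.t.ppf(1 - (1 - conf) / 2, nu)
        hits += (d - t * se <= delta <= d + t * se)
    return hits / reps

# Smaller sample (n2 = 6) from the higher-variance population (sigma2 = 4):
for m in ("pooled", "welch"):
    print(m, coverage(m, n1=30, n2=6, sigma1=1.0, sigma2=4.0))
```

In this scenario the pooled interval's empirical coverage falls well below the nominal 0.95, while Welch stays close to it, which is the pattern the figure is designed to surface.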

Example 5.2 (Independent samples, Welch \(t\))

A service operation compares mean order fulfillment time between two warehouse layouts to decide whether a redesign reduces average time. Two independent samples are collected on different days and with different worker mixes, so variability is expected to differ across layouts. The response is minutes per order, and the management target is a credible interval estimate of the mean difference.

Question: Construct a 95% confidence interval for \(\mu_1-\mu_2\) using Welch’s method, given \(n_1=18,\ \bar{x}_1=42.6,\ s_1=6.8\) and \(n_2=12,\ \bar{x}_2=38.9,\ s_2=3.1\).

The point estimate is \(\bar{x}_1-\bar{x}_2=3.7\). The standard error is \(\sqrt{s_1^2/n_1+s_2^2/n_2}\approx 1.84\), and the Satterthwaite approximation gives \(\nu\approx 25.4\), which is commonly rounded down to \(\nu=25\). With \(t_{0.025,25}\approx 2.06\), the margin of error is about \(2.06\times 1.84\).

Answer: The 95% confidence interval is approximately \((-0.08,\ 7.48)\) minutes. This interval indicates the data are consistent with layout 1 being slightly faster, slightly slower, or up to about 7.5 minutes slower than layout 2 on average, so the evidence for a nonzero mean difference is not strong at the 95% level.
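Welch's interval from Example 5.2, including the Satterthwaite degrees of freedom and the conservative round-down, can be reproduced as follows (assuming SciPy; `welch_t_ci` is an illustrative helper name).

```python
import math
from scipy import stats

def welch_t_ci(xbar1, s1, n1, xbar2, s2, n2, conf=0.95):
    """Welch (Satterthwaite) CI for mu1 - mu2; variances not pooled."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    se = math.sqrt(v1 + v2)
    nu = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    nu = math.floor(nu)  # round down for a conservative critical value
    tcrit = stats.t.ppf(1 - (1 - conf) / 2, nu)
    d = xbar1 - xbar2
    return d - tcrit * se, d + tcrit * se

# Example 5.2 inputs:
lo, hi = welch_t_ci(42.6, 6.8, 18, 38.9, 3.1, 12)
print(f"95% CI for mu1 - mu2: ({lo:.2f}, {hi:.2f})")  # ≈ (-0.08, 7.48)
```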

5.3.4 Paired observations: estimating \(\mu_D\)

In a paired design, each experimental unit is observed under both conditions, producing a pair \((X_{1i},X_{2i})\) for unit \(i\). The target is still a mean difference, but the natural parameter is the mean of the within-unit differences \(D_i=X_{1i}-X_{2i}\), denoted \(\mu_D\).

The analysis reduces to a one-sample problem on the differences. Under approximately Normal differences (or a careful large-sample argument), a \(100(1-\alpha)\%\) confidence interval for \(\mu_D\) is

\[\bar{d}\ \pm\ t_{\alpha/2,\nu}\frac{s_D}{\sqrt{n}}\]

where \(\nu=n-1\) and \(n\) is the number of pairs. The paired structure matters because the variability of \(D_i\) can be much smaller than the variability of the original measurements when the two measurements on the same unit are positively correlated.

A paired design is most effective when the same unit can be measured under both conditions and units differ substantially in their baseline level. Many operations settings have stable unit effects, such as a worker’s typical speed, a machine’s typical output level, or a batch’s typical quality level. If these unit effects are present in both conditions, pairing allows the comparison to focus on the change within the unit rather than on differences between units.

One simple representation is \(X_{1i}=\mu_1+U_i+\varepsilon_{1i}\) and \(X_{2i}=\mu_2+U_i+\varepsilon_{2i}\), where \(U_i\) is a unit-specific baseline and \(\varepsilon\) is within-unit noise. The paired difference is \(D_i=X_{1i}-X_{2i}=(\mu_1-\mu_2)+(\varepsilon_{1i}-\varepsilon_{2i})\), so the baseline term \(U_i\) cancels. This cancellation is the practical reason paired intervals can be noticeably shorter than independent-sample intervals when baseline differences across units are large.

A useful identity for design discussion is the variance of a difference:

\[\mathrm{Var}(X_1-X_2)=\mathrm{Var}(X_1)+\mathrm{Var}(X_2)-2\,\mathrm{Cov}(X_1,X_2)\]

This identity implies that positive correlation within pairs tends to reduce the variance of differences. Reduced variance of differences translates into a smaller standard error and therefore a shorter confidence interval, holding the confidence level fixed.
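Both the variance identity and the baseline-cancellation mechanism can be checked numerically. The sketch below (assuming NumPy) simulates pairs sharing a baseline \(U_i\) with \(\tau=3\) and within-unit noise \(\sigma=1\), so the theoretical variance of the differences is \(2\sigma^2=2\) even though each marginal variance is \(\tau^2+\sigma^2=10\).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

tau, sigma = 3.0, 1.0                      # baseline SD and within-unit noise SD
u = rng.normal(0.0, tau, n)                # shared unit baseline U_i
x1 = 10.0 + u + rng.normal(0.0, sigma, n)  # condition 1 measurement
x2 = 8.0 + u + rng.normal(0.0, sigma, n)   # condition 2 measurement

var_x1, var_x2 = x1.var(ddof=1), x2.var(ddof=1)
cov = np.cov(x1, x2, ddof=1)[0, 1]
d = x1 - x2

print(d.var(ddof=1))              # direct variance of the differences
print(var_x1 + var_x2 - 2 * cov)  # identity: same value
print(2 * sigma**2)               # theory: U_i cancels, leaving 2 sigma^2
```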

5.3.5 Choosing between independent and paired designs

The main advantage of pairing is precision when the pairing is meaningful. A meaningful pairing matches units so that shared unit-to-unit variation is removed by differencing, such as before–after on the same machine, the same worker under two tools, or matched items from the same batch.

Pairing is not automatically better, and it can be inappropriate if pairs are artificial or if the second measurement is influenced by the first in a way that changes the meaning of the comparison. In addition, paired studies require both measurements on each unit, so missing data can reduce the effective sample size.

Figure 5.3 is included for one purpose: to justify when a paired design is worth using before computing a confidence interval for a mean difference. The figure does not display raw observations, because the main question is a design question about precision rather than a descriptive question about a particular dataset. For this reason, the curve is computed directly from the standard error formulas and \(t\) critical values, which isolates how pairing changes uncertainty.

The figure is based on the unit-baseline model introduced in Section 5.3.4. Let unit \(i\) be a worker, a machine, or a batch measured under both conditions, with \(X_{1i}=\mu_1+U_i+\varepsilon_{1i}\) and \(X_{2i}=\mu_2+U_i+\varepsilon_{2i}\): the term \(U_i\) represents unit-to-unit baseline differences (some workers are faster, some machines run higher), and \(\varepsilon\) represents within-unit noise. Because \(U_i\) cancels in the paired difference \(D_i\), pairing removes the baseline variation, which is the mechanism that can make paired confidence intervals shorter.

In this figure, \(n\) means the number of pairs (the number of units measured twice), and the independent comparator uses \(n_1=n_2=n\) to keep the per-condition sample size aligned. The x-axis is the shared-variation ratio \(k=\tau^2/\sigma^2\), where \(\tau^2=\mathrm{Var}(U_i)\) measures how different units are from each other and \(\sigma^2=\mathrm{Var}(\varepsilon)\) measures within-unit noise. The y-axis is the percent reduction in confidence-interval half-width from using pairing, defined as \(100\times(1-\text{ratio})\) where \(\text{ratio}=(\text{paired half-width})/(\text{independent half-width})\). The dashed line at 0% is the break-even point: values above 0% mean the paired interval is shorter, and values below 0% mean the paired interval is longer.

To get the point from the curve, first select \(n\) in the dropdown and then compare small versus large \(k\). When \(k\) is near 0, unit baselines are weak relative to noise, so there is little to cancel and the reduction stays near 0%. When \(k\) is large, unit baselines dominate variability, so pairing removes a major source of variation and the percent reduction increases, meaning the paired interval can be substantially shorter. The practical design implication is: pairing is most valuable when the same unit can be measured twice and baseline heterogeneity across units is large and stable across both conditions; in that setting, pairing targets the mean within-unit change and delivers higher precision for the same sample size.
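The curve in Figure 5.3 can be reproduced from the standard-error formulas and \(t\) critical values alone, with no simulation. The sketch below assumes SciPy; `pct_reduction` is an illustrative name, and the formulas follow the model above with \(k=\tau^2/\sigma^2\).

```python
import math
from scipy import stats

def pct_reduction(k, n, conf=0.95):
    """Percent reduction in CI half-width from pairing, k = tau^2 / sigma^2.

    Paired:      SE = sqrt(2 sigma^2 / n),           df = n - 1
    Independent: SE = sqrt(2 (tau^2 + sigma^2) / n), df = 2n - 2  (n1 = n2 = n)
    """
    a = 1 - (1 - conf) / 2
    ratio = (stats.t.ppf(a, n - 1) / stats.t.ppf(a, 2 * n - 2)) * math.sqrt(1 / (1 + k))
    return 100 * (1 - ratio)

for k in (0.0, 0.5, 2.0, 10.0):
    print(f"k={k:4.1f}  half-width reduction at n=10: {pct_reduction(k, 10):5.1f}%")
```

Note that at \(k=0\) the reduction is slightly negative: pairing halves the degrees of freedom (from \(2n-2\) to \(n-1\)), so with nothing to cancel, the paired interval is actually a bit longer. This is the below-0% region of the figure.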

In practice, the design choice can be guided by whether baseline heterogeneity is expected to dominate variation. If units are highly different from each other but reasonably stable across the two measurements, pairing can remove much of that between-unit variation and improve precision for the mean difference.

  • Pair when the same unit can be measured twice and unit baselines are a major source of variation (workers, machines, batches, stores)

  • Avoid pairing when the second measurement is not comparable to the first due to carryover, learning, fatigue, or order effects

  • When in doubt, a small pilot with repeated measures can indicate whether differences \(D_i\) are substantially less variable than the original measurements

Example 5.3 (Paired design, mean improvement)

A distribution center evaluates a new scanning interface intended to reduce picking time. The same 12 workers complete a standardized picking task using the old interface and then the new interface, so each worker provides a paired comparison. Define \(D_i\) as (old time) minus (new time), so positive \(D_i\) indicates an improvement.

Question: Construct a 95% confidence interval for \(\mu_D\), given \(n=12,\ \bar{d}=3.4\) seconds, and \(s_D=3.6\) seconds.

The standard error is \(s_D/\sqrt{n}\approx 1.04\) seconds, and the degrees of freedom is \(\nu=n-1=11\). Using \(t_{0.025,11}\approx 2.20\), the margin of error is about \(2.20\times 1.04\), which is added and subtracted from \(\bar{d}\).

Answer: The 95% confidence interval is approximately \((1.11,\ 5.69)\) seconds. This interval suggests the new interface reduces mean picking time by about 1 to 6 seconds per task for this worker population, under the paired sampling assumptions.
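Because the paired analysis reduces to a one-sample \(t\) interval on the differences, Example 5.3 takes only a few lines to reproduce (assuming SciPy; `paired_t_ci` is an illustrative helper name).

```python
import math
from scipy import stats

def paired_t_ci(dbar, s_d, n, conf=0.95):
    """One-sample t CI for mu_D based on the paired differences D_i."""
    se = s_d / math.sqrt(n)                           # SE of the mean difference
    tcrit = stats.t.ppf(1 - (1 - conf) / 2, n - 1)    # t_{alpha/2, n-1}
    return dbar - tcrit * se, dbar + tcrit * se

# Example 5.3 inputs: 12 workers, mean difference 3.4 s, SD of differences 3.6 s
lo, hi = paired_t_ci(3.4, 3.6, 12)
print(f"95% CI for mu_D: ({lo:.2f}, {hi:.2f})")  # ≈ (1.11, 5.69)
```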

5.4 Discussion and Common Errors

A frequent reporting error is reversing the order of subtraction and then interpreting the sign incorrectly. The parameter \(\Delta=\mu_1-\mu_2\) must be defined in words before the interval is computed, and the interpretation must match that definition.

A second common issue is using a pooled-\(t\) interval without checking whether equal variances are plausible. When variances differ and sample sizes are unbalanced, pooled intervals can have poor coverage, so the confidence level stated in the report may not be achieved in practice.

A design error occurs when paired data are analyzed as if they were independent samples. This mistake discards the within-unit correlation structure and typically inflates the standard error, which yields an interval that is longer than necessary and answers a different question.

Another design error is forcing pairing when the pairs do not share meaningful common factors. If the paired units are not truly comparable, the differences \(D_i\) can be dominated by noise, and the paired interval may not represent the intended operational contrast.

5.5 Summary

This module extended one-mean confidence intervals to two-mean comparisons by targeting \(\Delta=\mu_1-\mu_2\) for independent samples and \(\mu_D\) for paired designs. The method choice is driven by the standard error model, which depends on whether variances are treated as equal and whether data are paired.

For independent samples with unknown variances, two primary intervals were presented: the pooled-\(t\) interval for approximately equal variances and the Welch interval for unequal variances. For paired observations, the problem reduces to a one-sample \(t\) interval on the differences, and pairing can substantially improve precision when within-pair correlation is positive.