8. Hypothesis Testing I: Decision Framework, Type I/II Errors, P-Values, CI–Test Equivalence, and Power
8.0 Notation Table
| Symbol | Meaning |
|---|---|
| \(H_0\) | Null hypothesis (baseline claim) |
| \(H_1\) | Alternative hypothesis (competing claim) |
| \(\theta\) | Population parameter of interest |
| \(\theta_0\) | Parameter value stated in \(H_0\) |
| \(\alpha\) | Type I error probability (significance level) |
| \(\beta(\theta_1)\) | Type II error probability at \(\theta=\theta_1\) |
| \(1-\beta(\theta_1)\) | Power at \(\theta=\theta_1\) |
| \(p\) | P-value (data-based evidence measure) |
| \(n\) | Sample size |
| \(X_1,\dots,X_n\) | Sample observations |
| \(\bar{X}\) | Sample mean |
| \(S\) | Sample standard deviation |
| \(\mu,\ \mu_0\) | Population mean and its null value |
| \(\sigma\) | Population standard deviation (if treated as known) |
| \(Z\) | Standard normal test statistic |
| \(T\) | Student \(t\) test statistic |
| \(\nu\) | Degrees of freedom (often \(\nu=n-1\)) |
| \(z_{q}\) | Normal quantile with \(P(Z\le z_q)=q\) |
| \(t_{q,\nu}\) | \(t\) quantile with \(P(T\le t_{q,\nu})=q\) |
| \([L,U]\) | Confidence interval endpoints |
| \(\delta\) | Practically meaningful effect size (e.g., \(\delta=|\mu_1-\mu_0|\)) |
| \(d\) | Standardized effect size \(d=\delta/\sigma\) |
8.1 Introduction
In the previous module, statistical inference was framed as estimation, especially confidence intervals that quantify plausible values of an unknown parameter. That approach answers questions such as “What range of means is consistent with the observed data under stated conditions?”
This module introduces hypothesis testing as a complementary decision framework. The goal is to compare two competing claims about a parameter, quantify the risk of wrong decisions, and interpret evidence using P-values while keeping a direct connection to confidence intervals.
8.2 Learning Outcomes
By the end of this module, students should be able to do the following.

- Translate an operations, quality, or management question into a parameter, a null hypothesis, and an alternative hypothesis.
- Distinguish Type I error from Type II error and explain why \(\alpha\) is controlled while \(\beta\) depends on the true parameter value.
- Interpret a P-value correctly as a conditional probability under \(H_0\), and reject common misinterpretations.
- Use the confidence-interval link to explain when a two-sided test rejects at level \(\alpha\).
- Explain power as a planning tool and carry out basic sample-size calculations using a chosen effect size \(\delta\).
8.3 Main Concepts
8.3.1 Statistical hypotheses and inference targets
A statistical hypothesis is a statement about a population parameter (or equivalently about a probability model indexed by parameters). In this module, the parameter is typically a single quantity such as a mean \(\mu\) or a proportion \(p\), chosen because it connects directly to a performance target or specification.
The null hypothesis \(H_0\) is the baseline claim that is treated as the reference point for measuring surprise in the data. The alternative hypothesis \(H_1\) represents departures from the baseline that matter for the decision context, and it may be two-sided (\(\ne\)) or one-sided (\(<\) or \(>\)).
A two-sided alternative is appropriate when deviations in either direction are relevant to performance or risk. A one-sided alternative is appropriate when only one direction represents a meaningful or actionable departure, such as “mean strength is below the minimum requirement.”
8.3.2 Test statistics and rejection regions
A hypothesis test uses a test statistic, which is a function of the sample designed to be sensitive to departures from \(H_0\). Under stated conditions (for example, independent sampling and approximate normality of a standardized statistic), the test statistic has a known sampling distribution when \(H_0\) is true.
For a one-sample mean test, two common statistics are the following, depending on whether \(\sigma\) is treated as known or unknown:

\[
Z=\frac{\bar{X}-\mu_0}{\sigma/\sqrt{n}} \quad (\sigma \text{ known}),
\qquad
T=\frac{\bar{X}-\mu_0}{S/\sqrt{n}} \quad (\sigma \text{ unknown}).
\]

The selection is not a software choice; it is a modeling choice about whether process variation is known in advance.
A rejection region is defined so that the probability of rejecting \(H_0\) when it is true equals a chosen level \(\alpha\). For a two-sided \(Z\) test, the usual rejection rule is

\[
\text{reject } H_0 \quad \text{if } |Z| \ge z_{1-\alpha/2}.
\]
This rule makes the meaning of \(\alpha\) explicit: it controls long-run false rejections under the assumptions of the test. The next figure visualizes how \(\alpha\) determines the critical values and tail areas.
Figure 8.1 (Two-sided rejection regions under a standard normal model) is theoretical rather than based on a specific dataset, because the objective is to display how the significance level defines the decision rule. There is no repeated sampling in the plot itself, but the interpretation of \(\alpha\) is long-run and refers to what would happen across many hypothetical repetitions of the same sampling procedure. For this figure, \(n\) does not appear directly because the statistic is shown on the standardized \(Z\) scale, where the null distribution is standard normal.
To read the figure, first identify the two shaded tail regions, which together represent the probability mass allocated to rejection when \(H_0\) is true. Next, read the vertical cutoffs at \(\pm z_{1-\alpha/2}\), which are the critical values on the standardized scale. The smooth curve is a theoretical density, not an empirical histogram, so the areas correspond to probabilities under the assumed model.
The main message is that smaller \(\alpha\) pushes the critical values farther from zero and makes rejection harder under fixed assumptions. If the dropdown compares multiple \(\alpha\) values, the “small” setting produces wider non-rejection regions and stricter evidence requirements than the “large” setting. This matters because, with all else fixed, stricter thresholds tend to reduce false rejections but also tend to reduce sensitivity to moderate departures.
In practice, the standardized scale is created from \((\bar{X}-\mu_0)/(\sigma/\sqrt{n})\) or \((\bar{X}-\mu_0)/(S/\sqrt{n})\), so sample size influences how easily data move into the tails. The figure should therefore be read as a rule template: the same tail areas apply in every application, but the mapping from data to the standardized axis changes with \(n\) and with the standard error.
Example 8.1 (Reject or fail to reject, and interpret the result)
A packaging line targets a mean fill weight of 500 g, and historical process monitoring supports treating the standard deviation as stable at \(\sigma=8\) g. A supervisor takes a simple random sample of 25 items from the current shift to check whether the process mean has drifted.
Question: Using a two-sided test at \(\alpha=0.05\), do the data provide strong evidence that the mean differs from 500 g if \(\bar{x}=503\) g?
The hypotheses are \(H_0:\mu=500\) and \(H_1:\mu\ne 500\). The standardized statistic is \(z=(503-500)/(8/\sqrt{25})=1.875\), which is compared to \(z_{0.975}\approx 1.96\). Because \(|z|<1.96\), the test does not reject at the 5% level, meaning the observed deviation is not extreme enough on the standardized scale to cross the rejection threshold.
The corresponding two-sided P-value is approximately \(p\approx 0.061\), which indicates that values at least as far from 500 g as the observed mean would occur about 6.1% of the time if \(\mu=500\) were true under the assumed model. This is not small relative to 0.05, so the evidence is not strong enough to justify rejection at that threshold.
Answer: Fail to reject \(H_0\) at \(\alpha=0.05\); the data do not provide strong evidence that the mean fill weight differs from 500 g under the stated assumptions.
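The arithmetic in Example 8.1 can be verified with a short calculation. The sketch below uses only the Python standard library (the normal CDF is built from `math.erf`; variable names are illustrative) to compute the standardized statistic, the two-sided P-value, and the decision at \(\alpha=0.05\).

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF, expressed via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

xbar, mu0, sigma, n = 503.0, 500.0, 8.0, 25
z = (xbar - mu0) / (sigma / sqrt(n))   # standardized statistic: 1.875
p = 2.0 * (1.0 - phi(abs(z)))          # two-sided P-value, about 0.061
reject = abs(z) >= 1.96                # compare to z_{0.975} at alpha = 0.05
print(z, round(p, 3), reject)          # 1.875 0.061 False
```

Because \(|z| = 1.875 < 1.96\), the decision agrees with the worked example: fail to reject at the 5% level even though the P-value is only slightly above 0.05.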
8.3.3 Type I error, Type II error, and power
A Type I error is rejecting \(H_0\) when \(H_0\) is true, and its probability is controlled at \(\alpha\). In quality and operations settings, this corresponds to concluding that a process has changed when it has not, which can trigger unnecessary stoppages, investigations, or supplier disputes.
A Type II error is failing to reject \(H_0\) when \(H_0\) is false, and its probability depends on the true parameter value, the test rule, and the sample size. It is written as \(\beta(\theta_1)\) to emphasize that it is evaluated at a specific alternative value \(\theta=\theta_1\), not as a single constant.
Power is the probability of correctly rejecting a false null at a specific alternative value. Power is therefore the complement of Type II error at that value.
Power supports planning because “no rejection” can mean either “no meaningful change” or “insufficient information.” When a decision is costly, it is often better to plan a sample size so that power is high for departures that are practically important.
8.3.4 P-values as evidence measures
A P-value is computed from the observed test statistic and is defined as the probability, under \(H_0\), of observing a statistic at least as extreme as what was observed. The meaning of “extreme” depends on whether the test is two-sided or one-sided, because the alternative hypothesis defines which tail behavior counts as evidence.
For a two-sided \(Z\) test, if \(z_{\text{obs}}\) is the observed value, the P-value is

\[
p = 2\,P\!\left(Z \ge |z_{\text{obs}}|\right),
\]

where the probability is computed under the standard normal null distribution.
The P-value is not the probability that \(H_0\) is true, and it is not the probability that the decision is correct. It is a conditional probability computed under the assumption that \(H_0\) is true, and it quantifies how surprising the observed statistic would be under that assumption.
The next figure uses repeated simulation to show how P-values behave when \(H_0\) is true versus when \(H_0\) is false. This is a pedagogically appropriate choice because P-values are random across repetitions, and their distribution is best understood by explicitly repeating the sampling process many times under known truth.
Figure 8.2 (Simulated P-value distributions under different true means) is based on simulated data because the objective is to study long-run behavior under controlled truth. In this figure, one “repetition” means drawing a new sample of size \(n\) from the specified distribution, computing the test statistic for \(H_0:\mu=0\), and recording the resulting P-value. For this figure, \(n\) is the number of observations per repetition (for example, \(n=10\)), while the number of repetitions is the simulation count (for example, 10,000).
To read the figure, first look at the horizontal axis, which is the P-value scale from 0 to 1, and then compare the histogram shape across dropdown settings for the true mean. The bars are empirical results from simulation, while any reference line is theoretical (for example, a flat reference for uniformity when \(H_0\) is true). Because the figure is a histogram, variability in bar heights is expected even when the underlying distribution is exactly uniform.
The main message is that when \(H_0\) is true, P-values are approximately uniform on \([0,1]\), so small P-values occur with frequency about equal to their size (for example, about 5% below 0.05). When the true mean moves away from the null value, the distribution shifts toward 0, meaning small P-values become more common and the test rejects more often. Comparing a “small departure” to a “large departure” shows that strong effects concentrate P-values near zero, which corresponds to higher power.
This figure also explains why a fixed cutoff such as 0.05 is a convention rather than a universal rule. Under \(H_0\), the cutoff controls the false rejection rate at \(\alpha\), but under alternatives it determines sensitivity, and that sensitivity depends strongly on effect size and sample size. Therefore, reporting the P-value provides more information than only reporting “reject” or “fail to reject,” especially when decisions have different risk tolerances.
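The long-run behavior described for Figure 8.2 can be reproduced directly. The sketch below (standard-library Python; the sample size, repetition count, and seed are illustrative choices, not values from the figure) repeatedly draws samples under a known true mean, computes the two-sided \(Z\)-test P-value for \(H_0:\mu=0\) with \(\sigma=1\), and records how often \(p<\alpha\).

```python
import random
from math import erf, sqrt

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def p_value(sample, mu0=0.0, sigma=1.0):
    """Two-sided Z-test P-value for H0: mu = mu0 with known sigma."""
    n = len(sample)
    z = (sum(sample) / n - mu0) / (sigma / sqrt(n))
    return 2.0 * (1.0 - phi(abs(z)))

def rejection_rate(true_mu, reps=5000, n=10, alpha=0.05):
    """Fraction of repetitions with p < alpha when data come from N(true_mu, 1)."""
    hits = sum(
        p_value([random.gauss(true_mu, 1.0) for _ in range(n)]) < alpha
        for _ in range(reps)
    )
    return hits / reps

random.seed(8)
null_rate = rejection_rate(0.0)  # near alpha: P-values are roughly uniform under H0
alt_rate = rejection_rate(1.0)   # much larger: P-values concentrate near zero
print(null_rate, alt_rate)
```

Under \(H_0\) the rejection rate lands near 0.05, while under a one-standard-deviation shift it is close to the theoretical power, which is the pattern the figure displays.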
8.3.5 Connection between confidence intervals and tests
Confidence intervals and two-sided hypothesis tests are logically linked when they are built from the same sampling model and standard error. The confidence interval provides a set of parameter values that remain plausible at a stated confidence level, while the test provides a rule for rejecting a specific value \(\theta_0\).
If \([L,U]\) is a \(100(1-\alpha)\%\) two-sided confidence interval for \(\theta\), then the two-sided level-\(\alpha\) test of \(H_0:\theta=\theta_0\) rejects if and only if \(\theta_0\) is not contained in \([L,U]\). This equivalence supports interpretation because it translates a binary test decision into a range-based uncertainty statement.
Example 8.2 (Use a confidence interval to answer a testing question)
A call center aims to keep average handle time at 6 minutes, and management wants to know whether the current month’s mean handle time differs from this target. A sample of 16 calls is taken, and the sample mean and standard deviation are \(\bar{x}=6.4\) minutes and \(s=1.2\) minutes.
Question: Using the confidence-interval link, decide whether to reject \(H_0:\mu=6\) at \(\alpha=0.05\) under normal sampling assumptions.
A 95% confidence interval for \(\mu\) with unknown \(\sigma\) is \(\bar{x}\pm t_{0.975,15}(s/\sqrt{n})\). Using \(t_{0.975,15}\approx 2.13\), the margin of error is approximately \(2.13(1.2/4)\approx 0.64\), giving an interval of about \([5.76,\,7.04]\) minutes.
Because the hypothesized value 6 minutes lies inside the interval, the two-sided level-0.05 test does not reject \(H_0\). The interval interpretation is that values near 6 remain plausible given the sampling variability, even though the point estimate 6.4 is above the target.
Answer: Fail to reject \(H_0\) at \(\alpha=0.05\) because 6 is contained in the 95% confidence interval for \(\mu\).
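Example 8.2 can be sketched in a few lines. Because the Python standard library has no Student-\(t\) quantile function, the critical value \(t_{0.975,15}\approx 2.131\) is entered as a table constant (an assumption of this sketch, matching the value used above).

```python
from math import sqrt

xbar, s, n, mu0 = 6.4, 1.2, 16, 6.0
t_crit = 2.131                        # t_{0.975,15} from a t table (stdlib has no t quantiles)
margin = t_crit * s / sqrt(n)         # about 0.64 minutes
L, U = xbar - margin, xbar + margin   # about [5.76, 7.04]
reject = not (L <= mu0 <= U)          # CI-test link: reject iff mu0 falls outside [L, U]
print(round(L, 2), round(U, 2), reject)
```

The final line encodes the equivalence from Section 8.3.5: the binary test decision is read directly off the interval.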
8.3.6 Power as a planning tool and sample size selection
Power becomes operational when it is anchored to a practically meaningful effect size. Instead of planning to detect any nonzero difference, analysts specify a minimum departure \(\delta\) that would matter in cost, safety, or performance, and they design the study so that power is high at that departure.
For a two-sided one-sample \(Z\) test for a mean with known \(\sigma\), a standard planning approximation solves for \(n\) so that the test has power \(1-\beta\) at \(\mu_1=\mu_0\pm \delta\). The resulting sample-size formula is

\[
n \ \ge\ \left(\frac{\left(z_{1-\alpha/2}+z_{1-\beta}\right)\sigma}{\delta}\right)^{2}.
\]
This formula shows the main design levers clearly. Smaller \(\alpha\) and larger target power both increase the required sample size, while larger effect sizes reduce it. The square relationship implies that halving \(\delta\) multiplies the required \(n\) by about four under comparable settings.
Figure 8.3 (Theoretical power curves for a two-sided \(Z\) test) is theoretical rather than simulated because the purpose is to display the functional dependence of power on effect size and sample size under the model assumptions. Repetition is not used to generate the curves, but the interpretation remains long-run because power is a probability across hypothetical repeated studies. In this figure, \(n\) is the per-study sample size, and the horizontal axis uses the standardized effect size \(d=\delta/\sigma\).
To read the figure, first select a curve corresponding to a particular \(n\) and then locate the desired effect size \(d\) on the horizontal axis. Next, read vertically to the curve and then horizontally to the power value on the vertical axis. The curves are theoretical calculations under a normal approximation, not fitted lines, so their shapes reflect probability calculations rather than empirical smoothing.
The main message is that power increases as either the effect size or the sample size increases, and the increase is steep once the effect size is large relative to the standard error. Comparing a “small \(n\)” curve to a “large \(n\)” curve shows that larger samples shift the entire curve upward, meaning that the same effect size becomes easier to detect. This matters because “no rejection” can occur frequently at small \(n\) even when the true departure is practically important.
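The curves in Figure 8.3 come from a closed-form normal calculation rather than simulation. A minimal sketch of that calculation, using `statistics.NormalDist` and the standardized effect size \(d=\delta/\sigma\) (the function name is illustrative):

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()

def power_two_sided_z(d, n, alpha=0.05):
    """Power of the two-sided Z test under the normal model,
    at standardized effect size d = delta / sigma."""
    zc = std_normal.inv_cdf(1 - alpha / 2)  # critical value z_{1-alpha/2}
    shift = d * sqrt(n)                     # mean of the Z statistic under the alternative
    return std_normal.cdf(shift - zc) + std_normal.cdf(-shift - zc)

# Larger n lifts the whole curve: the same effect becomes easier to detect.
low = power_two_sided_z(0.5, 10)    # modest power at small n
high = power_two_sided_z(0.5, 50)   # much higher power at larger n
print(round(low, 3), round(high, 3))
```

Note that at \(d=0\) the function returns exactly \(\alpha\), which is the false rejection rate the test is calibrated to control.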
This figure should be used as a planning device rather than as a post-hoc justification. After data are collected, the P-value answers an evidence question under \(H_0\), while the power curve answers a design question about sensitivity under alternatives. Mixing these roles can lead to incorrect conclusions about what a non-rejection implies.
Example 8.3 (Choose a sample size using power and a practical effect)
A warehouse process improvement team monitors mean order-picking time and wants to be able to detect a reduction of 2 minutes, because smaller changes do not justify retraining costs. Historical data suggest the standard deviation is approximately \(\sigma=5\) minutes under stable conditions.
Question: Approximately how large should \(n\) be to test \(H_0:\mu=\mu_0\) versus \(H_1:\mu\ne \mu_0\) at \(\alpha=0.05\) with 80% power to detect \(\delta=2\) minutes?
Using the planning formula \(n\ge ((z_{0.975}+z_{0.8})\sigma/\delta)^2\) with \(z_{0.975}\approx 1.96\) and \(z_{0.8}\approx 0.84\), we obtain \(n\ge ((1.96+0.84)\cdot 5/2)^2\). Numerically, this is approximately \(n\ge 49.1\), and rounding up is appropriate because \(n\) must be an integer and the calculation is approximate.
The operational interpretation is that, if the true mean has shifted by 2 minutes from \(\mu_0\), the designed test will reject about 80% of the time under the assumed model. If the team needs higher sensitivity (for example, 90% power), the required sample size increases, often substantially.
Answer: Use about \(n=50\) observations to target 80% power for detecting a 2-minute mean shift at \(\alpha=0.05\) under the stated assumptions.
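The arithmetic in Example 8.3 can be wrapped in a small planning helper (a sketch using only the standard library; the function name is illustrative):

```python
from math import ceil
from statistics import NormalDist

def sample_size_two_sided_z(delta, sigma, alpha=0.05, power=0.80):
    """Approximate n for a two-sided one-sample Z test with known sigma."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # z_{1-alpha/2}, about 1.96
    z_b = NormalDist().inv_cdf(power)          # z_{power}, about 0.84 for 80% power
    return ceil(((z_a + z_b) * sigma / delta) ** 2)  # round up: n must be an integer

n_80 = sample_size_two_sided_z(2.0, 5.0)              # about 50, matching the example
n_90 = sample_size_two_sided_z(2.0, 5.0, power=0.90)  # noticeably larger
print(n_80, n_90)
```

Comparing `n_80` to `n_90` makes the closing point of the example concrete: raising the power target from 80% to 90% increases the required sample size substantially.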
8.4 Discussion and Common Errors
A frequent error is interpreting the P-value as the probability that \(H_0\) is true. The P-value is computed under the assumption that \(H_0\) is true, and it quantifies how unusual the observed statistic would be under that assumption, not the truth probability of \(H_0\).
Another frequent error is treating “fail to reject \(H_0\)” as “accept \(H_0\).” A non-rejection may reflect limited information, meaning that \(\beta(\theta_1)\) could be large for practically important alternatives. This is why planning with power and reporting effect sizes are important in management decisions.
Statistical significance should not be equated with operational importance. With large \(n\), very small departures can yield small P-values, even when the effect size is too small to matter in cost, safety, or service outcomes. Practical significance should be addressed by specifying \(\delta\), reporting estimates and intervals, and using domain thresholds.
The direction of the alternative hypothesis should reflect which conclusion must be strong. Because rejection is the strong conclusion controlled by \(\alpha\), the alternative should capture the claim that must be supported by strong evidence, such as “mean is below the safety limit” rather than “mean meets the safety limit.”
8.5 Summary
Hypothesis testing provides a structured way to make decisions between competing claims about a parameter using sample data. The framework separates two error types, controls the false rejection rate using \(\alpha\), and uses the P-value as an evidence measure defined under \(H_0\).
Confidence intervals and two-sided tests are linked: a level-\(\alpha\) rejection corresponds to the null value lying outside a \(100(1-\alpha)\%\) confidence interval built from the same model. Power connects inference to design by quantifying sensitivity to practically meaningful departures and supporting sample-size selection before data collection.