9. Hypothesis Testing II: One-Sample Tests for \(\mu\), \(p\), and \(\sigma^2\)
9.0 Notation Table
| Symbol | Meaning |
|---|---|
| \(X_1,\dots,X_n\) | Random sample observations |
| \(n\) | Sample size |
| \(\bar{X}\) | Sample mean |
| \(S,\ S^2\) | Sample standard deviation, variance |
| \(\mu,\ \sigma,\ \sigma^2\) | Population mean, standard deviation, variance |
| \(\mu_0,\ \sigma_0,\ \sigma_0^2\) | Hypothesized parameter values under \(H_0\) |
| \(\alpha\) | Significance level (Type I error probability) |
| \(p\) | Population proportion (“success” probability) |
| \(p_0\) | Hypothesized proportion under \(H_0\) |
| \(X\) | Number of successes in \(n\) trials (binomial count) |
| \(\hat{p}=X/n\) | Sample proportion |
| \(\nu\) | Degrees of freedom (typically \(\nu=n-1\)) |
| \(Z_0\) | z test statistic (standard normal reference) |
| \(T_0\) | t test statistic (Student t reference) |
| \(\chi_0^2\) | Chi-square test statistic for a variance |
9.1 Introduction
In the previous module, hypothesis testing was developed as a decision framework based on a sampling model, a null hypothesis, and a controlled Type I error probability \(\alpha\). That framework emphasized how a P-value or a critical region is derived from a reference distribution when \(H_0\) is assumed true.
This module applies the framework to three common one-sample inference targets in operations and quality settings: a process mean, a defect proportion, and a process variance. The key technical bridge is that each test is built from a statistic whose distribution is known (exactly or approximately) under stated conditions, and those conditions must be checked and defended.
9.2 Learning Outcomes
After completing this session, students should be able to:
Formulate \(H_0\) and \(H_1\) for one-sample problems involving \(\mu\), \(p\), and \(\sigma^2\).
Select an appropriate one-sample test (z, t, binomial/normal-proportion, or chi-square) based on the sampling model and available information.
Compute the test statistic, interpret the P-value, and state a conclusion in the original applied context.
Explain the meaning of “fail to reject \(H_0\)” and distinguish it from “accepting \(H_0\)”.
Check and discuss assumptions (independence, approximate normality, and adequacy of large-sample approximations).
Connect two-sided tests at level \(\alpha\) with \(100(1-\alpha)\%\) confidence intervals for the same parameter.
9.3 Main Concepts
9.3.1 One-sample testing workflow and the reference distribution
A one-sample hypothesis test compares a hypothesized parameter value to information extracted from a sample. The conclusion is driven by how unusual the observed statistic would be if \(H_0\) were true, where “unusual” is quantified relative to a reference distribution.
A standard workflow is as follows. The null hypothesis \(H_0\) states a specific value such as \(\mu=\mu_0\), \(p=p_0\), or \(\sigma^2=\sigma_0^2\), and the alternative \(H_1\) specifies the direction (lower-tailed, upper-tailed, or two-sided). The analyst chooses \(\alpha\), computes an appropriate test statistic, and then uses either a P-value or a critical region to make a decision that controls Type I error.
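The decision step of this workflow is mechanical once the P-value is in hand. A minimal stdlib-only Python sketch (the helper names `z_two_sided_p_value` and `decide` are illustrative, not from any standard testing library):

```python
from statistics import NormalDist

def z_two_sided_p_value(z0: float) -> float:
    """Two-sided P-value for a z statistic under the N(0, 1) reference."""
    return 2 * (1 - NormalDist().cdf(abs(z0)))

def decide(p_value: float, alpha: float) -> str:
    """Map a P-value and significance level to the test decision."""
    return "reject H0" if p_value < alpha else "fail to reject H0"

p = z_two_sided_p_value(1.2)      # |z| = 1.2 is not extreme under N(0, 1)
print(decide(p, 0.05))            # prints "fail to reject H0"
```

The same `decide` step applies unchanged to t, binomial, and chi-square P-values; only the reference distribution used to compute the P-value changes.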
9.3.2 Tests for a single mean: z-test and t-test
When the inference target is a population mean \(\mu\), the natural estimator is the sample mean \(\bar{X}\). The distribution used for testing depends on whether the population standard deviation \(\sigma\) is treated as known or must be estimated from the data.
If \(\sigma\) is known (or treated as known from stable historical data), then under the sampling model the standardized statistic has a standard normal reference distribution:
\[
Z_0=\frac{\bar{X}-\mu_0}{\sigma/\sqrt{n}}.
\]
For a two-sided test at level \(\alpha\), reject \(H_0\) if \(|Z_0|>z_{\alpha/2}\). For a one-sided test, reject in the tail indicated by \(H_1\), using \(z_{\alpha}\).
If \(\sigma\) is unknown, it is replaced by \(S\), and the reference distribution is Student’s t with \(\nu=n-1\) degrees of freedom under approximate normal sampling:
\[
T_0=\frac{\bar{X}-\mu_0}{S/\sqrt{n}}.
\]
For a two-sided test at level \(\alpha\), reject \(H_0\) if \(|T_0|>t_{\alpha/2,\nu}\). For a one-sided test, reject in the appropriate tail using \(t_{\alpha,\nu}\).
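Both statistics are one-line computations from the sample summaries. A stdlib-only sketch (the function names and demo data are hypothetical):

```python
from math import sqrt
from statistics import mean, stdev

def z_statistic(xs, mu0, sigma):
    """Z0 = (xbar - mu0) / (sigma / sqrt(n)), with sigma treated as known."""
    return (mean(xs) - mu0) / (sigma / sqrt(len(xs)))

def t_statistic(xs, mu0):
    """T0 = (xbar - mu0) / (S / sqrt(n)), with sigma estimated by S."""
    return (mean(xs) - mu0) / (stdev(xs) / sqrt(len(xs)))

sample = [49.2, 50.8, 51.5, 50.1, 52.3, 49.9]   # hypothetical fill weights
print(round(z_statistic(sample, 50.0, 1.2), 3))  # uses an assumed sigma
print(round(t_statistic(sample, 50.0), 3))       # estimates sigma from the data
```

The two functions differ only in where the scale comes from, which is exactly the distinction that determines the reference distribution.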
The connection to estimation is operationally important. For two-sided alternatives, a level-\(\alpha\) test of \(\mu=\mu_0\) rejects exactly when \(\mu_0\) lies outside the corresponding \(100(1-\alpha)\%\) confidence interval for \(\mu\), constructed with the same reference distribution.
Figure 9.1 develops the practical meaning of the two reference distributions through their critical values.
Figure 9.1 (Critical values for z and t)
The figure is model-based rather than data-based, because its goal is to compare reference cutoffs implied by two different sampling models. No real dataset is needed, and “repetition” is not used because we are not approximating probabilities by simulation in this display. The horizontal axis is degrees of freedom \(\nu\), and for the t procedure \(\nu=n-1\) so each value of \(\nu\) corresponds to a particular sample size.
To read the figure, first select whether you are viewing one-sided or two-sided cutoffs at \(\alpha=0.05\). Then compare the constant z critical value (a flat line) to the t critical value (a curve) at the same tail probability. The z line is theoretical for \(N(0,1)\), while the t curve is theoretical for \(t_{\nu}\) and therefore changes with \(\nu\).
The main message is that t cutoffs are larger in magnitude when \(\nu\) is small, and they approach the corresponding z cutoff as \(\nu\) increases. This happens because replacing \(\sigma\) with \(S\) introduces extra uncertainty when the sample is small, which is reflected by heavier tails in the t distribution. As \(n\) increases, \(S\) stabilizes and the t distribution becomes close to the standard normal, making the two procedures nearly identical in large samples.
In applications, this comparison explains why small-sample mean tests should not use z cutoffs unless \(\sigma\) is credibly known. Using a z cutoff when \(\sigma\) is actually estimated can make rejection too easy and inflate the actual Type I error. Using t cutoffs appropriately protects the advertised significance level under the normal sampling model.
9.3.3 Reading a P-value for mean tests
A P-value is the probability, computed under \(H_0\), of obtaining a statistic at least as extreme as the observed one in the direction specified by \(H_1\). For two-sided alternatives, “at least as extreme” means in both tails beyond \(\pm|z_0|\) or \(\pm|t_0|\), while for one-sided alternatives it means in a single tail.
For a z-test on a mean, typical P-value forms are:
\[
P=
\begin{cases}
2\,[1-\Phi(|z_0|)], & H_1:\mu\neq\mu_0,\\
1-\Phi(z_0), & H_1:\mu>\mu_0,\\
\Phi(z_0), & H_1:\mu<\mu_0.
\end{cases}
\]
For a t-test with \(\nu=n-1\) degrees of freedom, the same logic applies using the t distribution:
\[
P=
\begin{cases}
2\,P(T_{\nu}>|t_0|), & H_1:\mu\neq\mu_0,\\
P(T_{\nu}>t_0), & H_1:\mu>\mu_0,\\
P(T_{\nu}<t_0), & H_1:\mu<\mu_0.
\end{cases}
\]
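These tail areas can be computed with the standard library alone. In the sketch below, the t tail is obtained by numerically integrating the Student t density, which is an illustrative shortcut rather than how a statistics library would compute it:

```python
from math import gamma, sqrt, pi
from statistics import NormalDist

def z_upper_tail(z0: float) -> float:
    """P(Z >= z0) under the standard normal reference."""
    return 1 - NormalDist().cdf(z0)

def t_upper_tail(t0: float, nu: int, grid: int = 20000, upper: float = 60.0) -> float:
    """P(T >= t0) for Student's t with nu degrees of freedom, computed by
    trapezoid integration of the density (a teaching sketch, not a
    production routine)."""
    c = gamma((nu + 1) / 2) / (sqrt(nu * pi) * gamma(nu / 2))
    f = lambda t: c * (1 + t * t / nu) ** (-(nu + 1) / 2)
    h = (upper - t0) / grid
    area = 0.5 * (f(t0) + f(upper))
    for i in range(1, grid):
        area += f(t0 + i * h)
    return area * h

# Same statistic magnitude, different references: the heavier t tails
# give a larger P-value, most visibly at small nu.
print(round(z_upper_tail(2.0), 3))     # 0.023
print(round(t_upper_tail(2.0, 5), 3))  # about 0.051
```

This is the numerical counterpart of Figure 9.2: at \(\nu=5\) the t-based P-value is roughly twice the z-based one for the same observed magnitude.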
Figure 9.2 illustrates how the same observed statistic can produce different P-values under different reference distributions.
Figure 9.2 (P-value areas under z and t references)
The figure is theoretical and is not constructed from a real dataset, because its purpose is to visualize how P-values are defined as tail areas under a reference distribution. No simulation is required, so “repetition” does not apply in this display. The sample size enters indirectly through \(\nu=n-1\) when the t reference is used, so choosing a smaller \(\nu\) corresponds to a smaller \(n\).
To read the figure, select a scenario from the dropdown and locate the marked test statistic on the horizontal axis. Then interpret the shaded region as the probability mass in the relevant tail(s) beyond the observed statistic. The smooth curve is always the theoretical reference density, and the shaded region is the P-value area under that curve.
The main message is that heavier-tailed references yield larger tail areas for the same statistic magnitude. In particular, with a small \(\nu\), the t distribution assigns more probability to extreme values, so the P-value tends to be larger than the z-based P-value at the same \(|t_0|\). As \(\nu\) increases, the t reference becomes close to the normal reference and the P-values become similar.
In practice, this display reinforces that a P-value is not only a property of the data but also of the chosen model and reference distribution. A correct conclusion requires pairing the statistic with the correct reference distribution implied by assumptions. A correct report states the direction of \(H_1\), the value of the statistic, and whether the resulting P-value is below the chosen \(\alpha\).
9.3.4 Tests for a single proportion: exact binomial and normal approximation
In quality and service operations, binary outcomes are common. Examples include “defective vs acceptable,” “late vs on-time,” or “customer churn vs retained.” If \(X\) is the number of successes in \(n\) independent Bernoulli trials with success probability \(p\), then \(X\sim\text{Binomial}(n,p)\) and \(\hat{p}=X/n\) estimates \(p\).
For small samples, testing \(H_0:p=p_0\) should use the exact binomial distribution for the P-value. The direction of \(H_1\) determines whether the tail probability is computed as \(P(X\le x)\), \(P(X\ge x)\), or a two-sided tail rule that reflects “as extreme as observed” relative to \(np_0\).
For large samples, a normal approximation gives a convenient z statistic:
\[
Z_0=\frac{\hat{p}-p_0}{\sqrt{p_0(1-p_0)/n}}=\frac{X-np_0}{\sqrt{np_0(1-p_0)}}.
\]
This approximation is typically defended when both \(np_0\) and \(n(1-p_0)\) are not small, so that the binomial distribution is not too discrete and not too skewed. When \(p_0\) is close to 0 or 1, much larger \(n\) may be required for the approximation to behave well.
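Both the exact and approximate procedures are short to implement with the standard library. A sketch with hypothetical counts (1 defective in 20 trials tested against \(p_0=0.10\)):

```python
from math import comb, sqrt
from statistics import NormalDist

def binom_lower_p(x: int, n: int, p0: float) -> float:
    """Exact lower-tailed P-value: P(X <= x) under Binomial(n, p0)."""
    return sum(comb(n, k) * p0 ** k * (1 - p0) ** (n - k) for k in range(x + 1))

def z_prop(x: int, n: int, p0: float) -> float:
    """Large-sample z statistic for H0: p = p0."""
    return (x / n - p0) / sqrt(p0 * (1 - p0) / n)

# Small-sample, small-p0 case: the two P-values can disagree noticeably,
# which is exactly when the exact binomial computation is preferred.
print(round(binom_lower_p(1, 20, 0.10), 3))              # exact tail probability
print(round(NormalDist().cdf(z_prop(1, 20, 0.10)), 3))   # normal approximation
```

With \(np_0=2\) here, the large-sample conditions clearly fail, and the printed values illustrate how far the approximation can drift from the exact tail.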
Figure 9.3 compares the exact sampling behavior under \(H_0\) to the normal approximation as \(n\) changes.
Figure 9.3 (Sampling distribution of \(\hat{p}\) under \(H_0\))
The figure uses simulated data because it aims to show an entire sampling distribution, which is defined by repeated sampling under the same model. Here “repetition” means repeatedly drawing new samples of size \(n\) from a binomial model with the same hypothesized value \(p_0\), and then recomputing \(\hat{p}\) each time. In this figure, \(n\) is the number of Bernoulli trials in each repeated sample.
To read the figure, focus first on the histogram, which is an empirical approximation to the distribution of \(\hat{p}\) under \(H_0\). Then compare the histogram to the smooth curve, which is the theoretical normal approximation with mean \(p_0\) and variance \(p_0(1-p_0)/n\). The histogram is empirical (simulation-based), while the smooth curve is theoretical (model-based).
The main message is that discreteness and skewness are prominent when \(n\) is small, especially when \(p_0\) is near 0. As \(n\) increases, the histogram becomes more concentrated and more bell-shaped, and the normal curve tracks the empirical shape more closely. This matters because the accuracy of z-based P-values depends on how well the normal approximation matches the true binomial behavior in the tails.
In applications, the figure motivates a defensible procedure choice. If \(n\) is small or \(p_0\) is extreme, an exact binomial P-value is preferred because it controls Type I error under the binomial model without relying on approximation. If the large-sample conditions are satisfied, the z test is usually adequate and much simpler to compute and communicate.
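The repetition described for Figure 9.3 can be reproduced with a short simulation (the sample size, \(p_0\), and repetition count below are illustrative choices):

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(42)  # fixed seed so the run is reproducible

def simulate_phat(n: int, p0: float, reps: int = 5000):
    """Repeatedly draw a Binomial(n, p0) sample and record phat = X / n."""
    phats = []
    for _ in range(reps):
        x = sum(random.random() < p0 for _ in range(n))  # count of successes
        phats.append(x / n)
    return phats

phats = simulate_phat(n=50, p0=0.10)
# Under H0 the empirical mean should sit near p0 and the empirical spread
# near the theoretical sqrt(p0 * (1 - p0) / n) used by the normal curve.
print(round(mean(phats), 3), round(stdev(phats), 3))
print(round(sqrt(0.10 * 0.90 / 50), 3))  # theoretical standard deviation
```

Plotting a histogram of `phats` against the \(N(p_0,\ p_0(1-p_0)/n)\) density reproduces the empirical-versus-theoretical comparison in the figure.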
9.3.5 Tests for a single variance: chi-square test and robustness warning
When the inference target is process variability, the parameter of interest is \(\sigma^2\) (or \(\sigma\)). Variance tests appear in gauge studies, process capability discussions, and settings where meeting a specification requires controlling dispersion rather than shifting the mean.
Under normal sampling, the test statistic
\[
\chi_0^2=\frac{(n-1)S^2}{\sigma_0^2}
\]
has a chi-square reference distribution with \(\nu=n-1\) degrees of freedom when \(H_0:\sigma^2=\sigma_0^2\) is true. The support of the chi-square distribution is \(\chi^2\ge 0\), so the rejection region is always in one or both tails on the nonnegative axis.
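A sketch of the statistic and its upper-tail P-value. The chi-square tail is computed here by numeric integration of the density, an illustrative stdlib-only shortcut rather than a library routine:

```python
from math import gamma, exp, sqrt

def chi2_statistic(s2: float, n: int, sigma0_sq: float) -> float:
    """chi2_0 = (n - 1) * S^2 / sigma0^2."""
    return (n - 1) * s2 / sigma0_sq

def chi2_upper_tail(x0: float, nu: int, grid: int = 20000) -> float:
    """P(X >= x0) for the chi-square with nu df (x0 > 0), by trapezoid
    integration of the density (a teaching sketch, not production code)."""
    c = 1 / (2 ** (nu / 2) * gamma(nu / 2))
    f = lambda x: c * x ** (nu / 2 - 1) * exp(-x / 2)
    b = x0 + 25 * sqrt(2 * nu)  # integrate far enough into the tail
    h = (b - x0) / grid
    area = 0.5 * (f(x0) + f(b))
    for i in range(1, grid):
        area += f(x0 + i * h)
    return area * h

# Sanity check against a tabled value: the upper 5% point of chi-square
# with 19 df is about 30.144, so the tail area there should be near 0.05.
print(round(chi2_upper_tail(30.144, 19), 3))
```

For an upper-tailed variance test, `chi2_upper_tail(chi2_statistic(s2, n, sigma0_sq), n - 1)` gives the P-value directly.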
A critical practical point is robustness. The chi-square variance test is highly sensitive to departures from normality, and moderate skewness or heavy tails can distort the true sampling distribution of \(\chi_0^2\). For this reason, a variance test should be paired with explicit model checking and careful interpretation.
Figure 9.4 demonstrates why this warning is operationally important.
Figure 9.4 (Why the chi-square variance test is nonrobust)
The figure uses simulation because it compares the actual behavior of the variance statistic under different underlying distributions that share the same variance. Here “repetition” means repeatedly drawing new samples of size \(n\) from the same population distribution, computing \(S^2\), and then transforming it into \(\chi_0^2=(n-1)S^2/\sigma_0^2\) each time. In this figure, \(n\) is the sample size used to compute each repeated sample variance.
To read the figure, compare the histogram to the smooth chi-square curve for the same degrees of freedom \(\nu=n-1\). The histogram is empirical and shows the simulated distribution of \(\chi_0^2\) under the stated population shape. The smooth curve is theoretical and shows the chi-square reference distribution that would be correct if the population were normal.
The main message is that the match is good when the population is normal, but it can degrade substantially under skewness even if the variance is the same. Increasing \(n\) reduces sampling noise and makes the mismatch easier to see rather than making it disappear. This matters because a distorted reference distribution can produce misleading P-values and rejection decisions that reflect shape violations rather than true differences in \(\sigma^2\).
In applications, the correct conclusion is conditional on model adequacy. Before interpreting a significant chi-square test as “variance changed,” the analyst should examine distributional diagnostics and consider whether a transformation, a more appropriate model, or a robust alternative is needed. At minimum, the result should be reported with an explicit statement about the normality check and any visible departures.
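The nonrobustness warning can be checked directly by simulation. In the sketch below, both populations have true variance 1, so every rejection is a Type I error; the cutoff 30.144 is the tabled upper 5% point of \(\chi^2_{19}\), and the sample size and repetition count are illustrative choices:

```python
import random
from statistics import variance

random.seed(7)  # fixed seed for reproducibility
N = 20
CUTOFF = 30.144       # tabled upper 5% point of chi-square with 19 df
SIGMA0_SQ = 1.0       # H0 value; both populations below truly have variance 1

def reject_rate(draw, reps=4000):
    """Fraction of repeated samples whose chi2_0 = (N-1)S^2/sigma0^2
    exceeds the nominal 5% cutoff."""
    hits = 0
    for _ in range(reps):
        xs = [draw() for _ in range(N)]
        hits += (N - 1) * variance(xs) / SIGMA0_SQ > CUTOFF
    return hits / reps

normal_rate = reject_rate(lambda: random.gauss(0.0, 1.0))
skewed_rate = reject_rate(lambda: random.expovariate(1.0))  # variance 1, right-skewed
print(normal_rate, skewed_rate)  # skewed_rate typically far exceeds 0.05
```

Under the normal population the empirical rejection rate sits near the advertised 5%, while under the equally-variable but skewed population it is inflated well above it, which is the practical content of Figure 9.4.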
9.3.6 Assumption checks: independence, normality, and approximation adequacy
All one-sample tests require a defensible sampling story. In operations settings, this usually means the data come from a stable process during the sampling window and the observations are not mechanically linked (for example, repeated measurements on the same unit without accounting for pairing).
Independence and randomness are primarily defended by design and data collection practice. Time order plots, run charts, and knowledge of batching or shift effects are often needed to argue that the sample is not a short burst from a drifting process. If dependence is suspected, the nominal Type I error control of standard tests is not guaranteed.
Normality enters in different ways. The one-sample t-test is reasonably robust to mild nonnormality when the distribution is roughly symmetric and unimodal, especially as \(n\) grows, while the chi-square variance test is not robust and should be used with extra caution. Normal probability plots (Q–Q plots) and plots of standardized residual-like quantities \((x_i-\bar{x})/S\) can provide evidence of gross skewness, heavy tails, or outliers that could dominate variance-based decisions.
For proportion tests, the key check is adequacy of the normal approximation when using the z test. If \(np_0\) or \(n(1-p_0)\) is small, exact binomial P-values are preferred because the sampling distribution is discrete and tail behavior is sensitive to \(n\) and \(p_0\).
9.3.7 Examples
Example 9.1 (z-test for a mean with stable historical variance)
A packaging line targets a mean fill weight of 50 grams, and compliance audits treat the short-term process standard deviation as stable at \(\sigma=5\) grams based on a long history of calibration. A quality engineer collects a fresh random sample of \(n=40\) units after a maintenance action to check whether the mean has increased. The sample mean is \(\bar{x}=52.1\) grams.
Question: At \(\alpha=0.05\), is there evidence that the mean fill weight exceeds 50 grams?
Under \(H_0:\mu=50\) versus \(H_1:\mu>50\), the test statistic is computed using the known \(\sigma\). The z statistic is
\[
z_0=\frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}}=\frac{52.1-50}{5/\sqrt{40}}\approx 2.66.
\]
An upper-tailed P-value is \(P=1-\Phi(z_0)\approx 0.004\). This probability is small under the null model, meaning the observed sample mean would be rare if the true mean were 50 grams.
Answer: Reject \(H_0\) at \(\alpha=0.05\), and conclude that the mean fill weight is higher than 50 grams under the assumed stability of \(\sigma\).
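The arithmetic of Example 9.1 can be verified in a few lines of stdlib Python:

```python
from math import sqrt
from statistics import NormalDist

# Numbers from Example 9.1.
xbar, mu0, sigma, n = 52.1, 50.0, 5.0, 40
z0 = (xbar - mu0) / (sigma / sqrt(n))
p_value = 1 - NormalDist().cdf(z0)        # upper-tailed P-value
print(round(z0, 2), round(p_value, 3))    # prints 2.66 0.004
```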
Example 9.2 (t-test for a mean when variance is unknown)
A service operations team evaluates whether a new triage protocol reduces mean waiting time below 4.0 minutes. They record a random sample of \(n=12\) waiting times from a stable shift, obtaining \(\bar{x}=3.8\) minutes and \(s=0.9\) minutes. The distribution is considered approximately symmetric with no extreme outliers on a quick Q–Q plot.
Question: At \(\alpha=0.05\), is there evidence that the mean waiting time is less than 4.0 minutes?
Under \(H_0:\mu=4.0\) versus \(H_1:\mu<4.0\), the t statistic uses \(s\) and has \(\nu=n-1=11\) degrees of freedom under approximate normal sampling. The computed statistic is
\[
t_0=\frac{\bar{x}-\mu_0}{s/\sqrt{n}}=\frac{3.8-4.0}{0.9/\sqrt{12}}\approx -0.77.
\]
The lower-tailed P-value from \(t_{11}\) is approximately \(0.229\). This is not small relative to \(\alpha=0.05\), so the sample does not provide strong evidence against the null at the stated level.
Answer: Fail to reject \(H_0\) at \(\alpha=0.05\); the data do not support a claim that the mean waiting time is below 4.0 minutes, although the sample estimate is below 4.0.
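A check of Example 9.2's numbers, with the t tail obtained by numeric integration of the \(t_{11}\) density (an illustrative stdlib-only shortcut; a statistics library would use its own t distribution routine):

```python
from math import gamma, sqrt, pi

# Numbers from Example 9.2.
xbar, mu0, s, n = 3.8, 4.0, 0.9, 12
nu = n - 1
t0 = (xbar - mu0) / (s / sqrt(n))

def t_upper_tail(t_val, nu, grid=20000, upper=60.0):
    """P(T >= t_val) for Student's t, via trapezoid integration."""
    c = gamma((nu + 1) / 2) / (sqrt(nu * pi) * gamma(nu / 2))
    f = lambda t: c * (1 + t * t / nu) ** (-(nu + 1) / 2)
    h = (upper - t_val) / grid
    area = 0.5 * (f(t_val) + f(upper))
    for i in range(1, grid):
        area += f(t_val + i * h)
    return area * h

p_value = t_upper_tail(-t0, nu)  # lower-tail area equals the upper tail at |t0|
print(round(t0, 2), round(p_value, 2))  # prints -0.77 0.23
```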
Example 9.3 (one-sample test for a defect proportion)
A supplier agreement requires that the fraction defective at incoming inspection be at most 5%. An inspector samples \(n=200\) items from a large shipment and observes \(x=6\) defectives, giving \(\hat{p}=0.03\). The goal is to determine whether the process appears capable relative to the 5% requirement.
Question: Using \(\alpha=0.05\), can we conclude that the defect rate is below 0.05?
A suitable formulation is \(H_0:p=0.05\) versus \(H_1:p<0.05\), because rejecting the null supports capability. The large-sample z statistic is
\[
z_0=\frac{\hat{p}-p_0}{\sqrt{p_0(1-p_0)/n}}=\frac{0.03-0.05}{\sqrt{0.05(0.95)/200}}\approx -1.30.
\]
The lower-tailed P-value is \(P=\Phi(z_0)\approx 0.097\). This exceeds 0.05, so the evidence is not strong enough to reject at the 5% level, even though the point estimate is favorable.
Answer: Fail to reject \(H_0\) at \(\alpha=0.05\); there is not enough evidence to claim \(p<0.05\) at the 5% level, though the result may be considered suggestive at a 10% level.
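A quick check of Example 9.3's numbers:

```python
from math import sqrt
from statistics import NormalDist

# Numbers from Example 9.3.
x, n, p0 = 6, 200, 0.05
phat = x / n
z0 = (phat - p0) / sqrt(p0 * (1 - p0) / n)
p_value = NormalDist().cdf(z0)            # lower-tailed P-value
print(round(z0, 2), round(p_value, 3))    # prints -1.3 0.097
```

Note that \(np_0=10\) and \(n(1-p_0)=190\), so the normal approximation is reasonably defended here; with a rarer defect rate an exact binomial tail would be the safer choice.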
Example 9.4 (chi-square test for a process variance)
An automated dispenser is considered out of control if its fill variance exceeds \(\sigma_0^2=0.01\) (volume units squared). A random sample of \(n=20\) fills yields a sample variance of \(s^2=0.0153\). The engineering team assumes the fill distribution is approximately normal during stable operation.
Question: At \(\alpha=0.05\), is there evidence that the variance exceeds 0.01?
Use \(H_0:\sigma^2=0.01\) versus \(H_1:\sigma^2>0.01\) with \(\nu=n-1=19\). The chi-square statistic is
\[
\chi_0^2=\frac{(n-1)s^2}{\sigma_0^2}=\frac{19(0.0153)}{0.01}=29.07.
\]
Under \(H_0\), compare \(\chi_0^2\) to the upper-tail cutoff for \(\chi^2_{19}\) at \(\alpha=0.05\). Since the observed value is below that cutoff, the upper-tail P-value is modest (approximately \(0.065\)) rather than small.
Answer: Fail to reject \(H_0\) at \(\alpha=0.05\); there is insufficient evidence that the variance exceeds 0.01, and the conclusion should be paired with a careful normality check.
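A check of Example 9.4's numbers, with the chi-square tail obtained by numeric integration of the density (an illustrative stdlib-only shortcut):

```python
from math import gamma, exp, sqrt

# Numbers from Example 9.4.
s2, n, sigma0_sq = 0.0153, 20, 0.01
nu = n - 1
chi2_0 = (n - 1) * s2 / sigma0_sq  # 29.07

def chi2_upper_tail(x0, nu, grid=20000):
    """P(X >= x0) for chi-square with nu df (x0 > 0), via trapezoid integration."""
    c = 1 / (2 ** (nu / 2) * gamma(nu / 2))
    f = lambda x: c * x ** (nu / 2 - 1) * exp(-x / 2)
    b = x0 + 25 * sqrt(2 * nu)  # integrate far enough into the tail
    h = (b - x0) / grid
    area = 0.5 * (f(x0) + f(b))
    for i in range(1, grid):
        area += f(x0 + i * h)
    return area * h

p_value = chi2_upper_tail(chi2_0, nu)
print(round(chi2_0, 2), round(p_value, 3))  # prints 29.07 0.065
```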
9.4 Discussion and Common Errors
A frequent reporting error is treating “fail to reject \(H_0\)” as proof that \(H_0\) is true. This statement only means the sample does not provide strong evidence against \(H_0\) at the chosen \(\alpha\), and low power can produce the same outcome even when \(H_0\) is false.
Another error is mixing a test statistic with the wrong reference distribution. Using a z cutoff when \(\sigma\) is unknown and \(n\) is small is a common mistake, and it can lead to overstated evidence against \(H_0\). Conversely, using the chi-square variance test without checking distribution shape can produce rejections driven by skewness or heavy tails rather than true variance changes.
For proportion tests, the normal approximation is often applied automatically without verifying that the binomial distribution is not too discrete or skewed. When \(p_0\) is near 0 or 1, even moderately large \(n\) can still yield poor tail accuracy, and exact binomial calculations are preferred.
Finally, conclusions should be written in the language of the process. A statistically significant mean shift may still be operationally negligible, while a nonsignificant result may still motivate data collection if the decision stakes are high and power is low.
9.5 Summary
This module presented one-sample hypothesis tests for the mean, proportion, and variance using z, t, binomial/normal-proportion, and chi-square reference distributions. Each procedure depends on explicit conditions, and the same numerical statistic can imply different conclusions under different reference models.
Operationally correct inference requires three linked components: a defensible sampling story, a statistic consistent with the parameter of interest, and a reference distribution justified by assumptions. The final decision is then communicated as a conclusion about the process parameter with an explicit significance level and a clear interpretation of the P-value.