11. Hypothesis Testing IV: Chi-Square Tests for Goodness-of-Fit and Independence in Contingency Tables

11.0 Notation Table

Symbol: Meaning

\(n\): Total sample size (grand total)
\(k\): Number of categories (cells) in GOF
\(r,\ c\): Rows and columns in a contingency table
\(O_i\): Observed count in cell \(i\) (GOF)
\(E_i\): Expected count in cell \(i\) (GOF)
\(p_i\): Hypothesized cell probability (GOF)
\(O_{ij}\): Observed count in row \(i\), column \(j\)
\(E_{ij}\): Expected count in row \(i\), column \(j\)
\(R_i\): Row total for row \(i\)
\(C_j\): Column total for column \(j\)
\(\chi^2\): Chi-square test statistic
\(\nu\): Degrees of freedom
\(\alpha\): Significance level
\(P\text{-value}\): Right-tail probability under \(\chi^2_\nu\)
\(r_i\): Standardized residual (GOF)
\(r_{ij}\): Standardized residual (table cell)
\(V\): Cramér’s \(V\) (association strength)

11.1 Introduction

Earlier modules developed hypothesis tests for numerical parameters such as means and proportions, where the evidence is summarized by a test statistic and a right-tail or two-tail probability under a reference distribution. In many operations and management settings, the response is categorical, and the parameter of interest is not a single mean but a pattern of counts across categories or across a two-way table.

This module introduces chi-square methods for categorical inference. The core idea is to compare observed counts with expected counts under a null model, and to use a chi-square reference distribution to quantify how surprising the observed discrepancies are under repeated sampling.

11.2 Learning Outcomes

By the end of this session, students should be able to:

  • Construct and interpret the chi-square statistic as a standardized measure of discrepancy between observed and expected counts

  • Perform a chi-square goodness-of-fit (GOF) test for a specified categorical model

  • Build a contingency table and compute expected counts under an independence or homogeneity assumption

  • Distinguish “independence” from “homogeneity” by the sampling design and the inferential target

  • Diagnose which cells drive a significant result using standardized residuals

  • State and check practical assumptions (especially expected counts) and select remedies when assumptions fail

11.3 Main Concepts

11.3.1 The chi-square statistic as a discrepancy measure

For categorical data, the null model provides an expected count for each cell. Evidence against the null model arises when the observed counts differ from these expected counts by more than can be explained by sampling variability. A standard way to summarize the overall discrepancy is the Pearson chi-square statistic.

For a one-way table with \(k\) categories, the chi-square statistic is

\[\chi^2 = \sum_{i=1}^{k}\frac{(O_i - E_i)^2}{E_i}\]

Each term compares “observed minus expected” to the scale of the expected count. A large value of \(\chi^2\) indicates that the observed pattern is far from what the null model predicts.
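As a minimal sketch in Python, the statistic is a direct sum over cells (the helper name `chi_square_stat` is illustrative, not from the text):

```python
# A minimal sketch of the Pearson chi-square statistic for a one-way table.
def chi_square_stat(observed, expected):
    """Sum of (O_i - E_i)^2 / E_i over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Balanced six-category example with n = 60, so every expected count is 10.
print(round(chi_square_stat([8, 12, 9, 11, 10, 10], [10] * 6), 2))  # 1.0
```

Each term in the sum is small when the observed count sits close to its expected count, so a large total flags an overall departure from the null model.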

11.3.2 Chi-square goodness-of-fit (GOF)

A GOF problem asks whether the distribution of a categorical outcome matches a specified model. The null hypothesis states the category probabilities, and the alternative is that at least one probability differs.

Typical hypotheses are:

  • \(H_0: (p_1,\dots,p_k)\) equals a specified vector of probabilities

  • \(H_1:\) the distribution is not the specified one

Expected counts come from \(E_i = n p_i\). The chi-square reference distribution is an approximation that improves as the expected counts increase, and it is not recommended when expected counts are too small. A practical rule is to ensure all (or nearly all) expected counts meet a minimum such as 5, and to combine categories if necessary.

Degrees of freedom for GOF depend on whether parameters were estimated from the same data. If the probabilities are fully specified in advance, then

\[\nu = k - 1\]

If \(m\) parameters are estimated from the same data in order to obtain the \(p_i\) values, then the degrees of freedom reduce to

\[\nu = k - 1 - m\]

This reduction reflects that the fitted model used information from the sample and therefore has fewer independent components left to assess.

Example 11.1 (GOF with category combining). A warehouse classifies shipping outcomes into four categories: no damage, minor damage, major damage, and total loss. Historical monitoring suggests proportions \((0.80, 0.15, 0.04, 0.01)\) for these categories, and a quality manager wants to check whether the current week matches this profile. In a sample of \(n=200\) shipments, the observed counts are 150, 35, 12, and 3.

Question: At \(\alpha=0.05\), does the weekly distribution differ from the historical profile, after addressing any expected-count issue?

Under \(H_0\), expected counts are \(E=(160, 30, 8, 2)\). The last expected count equals 2, which is below a common minimum and can weaken the chi-square approximation. A practical remedy is to combine adjacent “rare-event” categories into a single category, here “major or total loss,” yielding three categories with observed counts \((150, 35, 15)\) and expected counts \((160, 30, 10)\).

The test statistic is

\[\chi^2 = \frac{(150-160)^2}{160} + \frac{(35-30)^2}{30} + \frac{(15-10)^2}{10}\]

This gives \(\chi^2 \approx 3.96\). With \(k=3\) and no estimated parameters, \(\nu = k-1 = 2\), and the right-tail probability is \(P(\chi^2_2 \ge 3.96) \approx 0.138\). Answer: The result is not significant at \(\alpha=0.05\), so the data do not provide strong evidence of a shift from the historical damage profile, given the combined-category analysis.
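The arithmetic in this example can be checked with a short script. For two degrees of freedom the chi-square survival function has the exact closed form \(P(\chi^2_2 \ge x) = e^{-x/2}\), so no statistics library is needed:

```python
import math

def chi_square_stat(observed, expected):
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Example 11.1 after combining the rare categories.
obs = [150, 35, 15]
exp = [160, 30, 10]
stat = chi_square_stat(obs, exp)

# For df = 2 the chi-square survival function is exactly exp(-x / 2).
p_value = math.exp(-stat / 2)
print(round(stat, 2), round(p_value, 3))  # 3.96 0.138
```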

Figure 11.1. Sampling distribution of the GOF chi-square statistic under the null

The main message of this figure is that the chi-square reference distribution is an approximation to the sampling distribution of \(\chi^2\), and the approximation improves when expected counts are not small. The purpose is to connect the formula for \(\chi^2\) to repeated sampling and to show why minimum expected counts matter for validity.

The data are simulated under a null GOF model (a balanced six-category process) because the goal is to visualize the distribution of \(\chi^2\) over many repetitions. A “repetition” means generating a new random sample of size \(n\) from the same null probabilities, recomputing the counts, and then recomputing \(\chi^2\). In this figure, \(n\) is the sample size per repetition, and changing \(n\) changes the expected count in each cell.

To read the figure, first select a sample size using the dropdown and look at the histogram, which is the empirical distribution of \(\chi^2\) across many repetitions. Then compare the histogram to the smooth reference curve, which is the theoretical \(\chi^2_\nu\) density with fixed degrees of freedom. The vertical line is a single realized sample’s \(\chi^2\) value and illustrates how one observed statistic sits within the long-run distribution.

The figure shows that for smaller \(n\), the empirical distribution can deviate from the theoretical curve because the cell counts are more discrete and the approximation is less accurate, especially when expected counts are near the minimum threshold. As \(n\) increases, the histogram aligns more closely with the theoretical curve, and tail areas used for \(P\text{-values}\) become more reliable. This matters because chi-square inference is calibrated by right-tail probabilities, so poor approximation can lead to misleading conclusions.

Finally, the figure reinforces a practical workflow: check expected counts first, then compute \(\chi^2\), then interpret the result using a chi-square reference only when the assumptions are reasonable. When expected counts are borderline, combining categories is a design-level correction that targets the validity of the approximation rather than “improving significance.”

11.3.3 Contingency tables and expected counts

A two-way contingency table summarizes counts for two categorical variables. Let rows index one classification with \(r\) levels and columns index another classification with \(c\) levels. The observed table is \(\{O_{ij}\}\) with row totals \(R_i\) and column totals \(C_j\), and grand total \(n\).

Under a “no association” structure, expected counts are computed from marginal totals. The standard formula is

\[E_{ij} = \frac{R_i C_j}{n}\]

This formula ensures that expected counts add to the correct row and column totals. In practice, it is a fast consistency check: any error in table construction often appears as expected counts that do not reproduce the marginals.
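The expected-count computation and the marginal consistency check described above can be sketched as follows (the function name is illustrative; the sample table is the machine-by-status data used later in Example 11.2):

```python
def expected_counts(table):
    """E_ij = R_i * C_j / n, computed from the observed table's margins."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

obs = [[120, 30], [140, 10], [110, 40]]
exp = expected_counts(obs)

# Consistency check: expected counts must reproduce the observed row totals.
for exp_row, obs_row in zip(exp, obs):
    assert abs(sum(exp_row) - sum(obs_row)) < 1e-9
print(round(exp[0][0], 2))  # 123.33
```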

11.3.4 Chi-square test for independence

The test for independence is used when a single random sample is drawn from one population, and each sampled unit is classified by two categorical variables. In that design, both sets of marginal totals are random outcomes of sampling.

The hypotheses are:

  • \(H_0:\) the two classification variables are independent in the population

  • \(H_1:\) the variables are not independent (there is association)

The test statistic is

\[\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij}-E_{ij})^2}{E_{ij}}\]

If the approximation conditions are adequate, the reference distribution is \(\chi^2_\nu\) with

\[\nu = (r-1)(c-1)\]

A significant result implies evidence of association, but it does not by itself specify which cells contribute most to the association. That diagnosis is handled by residual analysis.

Example 11.2 (Independence: machine type and defect status). A plant monitors three machine types that produce the same component, and quality engineers classify each inspected unit as pass or fail. A random sample of \(n=450\) units is collected over a week, and the observed counts are: Machine A (120 pass, 30 fail), Machine B (140 pass, 10 fail), Machine C (110 pass, 40 fail). The goal is to assess whether defect status is independent of machine type.

Question: At \(\alpha=0.05\), is defect status independent of machine type, and which cells appear most responsible for any departure?

Under \(H_0\), expected counts are computed from marginals. The overall pass rate is \(370/450\), so each machine with 150 observations has expected counts \(E_{\text{pass}}=150(370/450)=123.33\) and \(E_{\text{fail}}=26.67\). Substituting into the chi-square formula gives \(\chi^2 \approx 21.28\).

Degrees of freedom are \(\nu=(3-1)(2-1)=2\), and the right-tail probability is extremely small (\(P\text{-value} \approx 2.4\times 10^{-5}\)). Standardized residuals help interpret direction and location, using \(r_{ij}=(O_{ij}-E_{ij})/\sqrt{E_{ij}}\). Here, the largest-magnitude residual occurs in Machine B’s fail cell (far fewer failures than expected), while Machine C’s fail cell has a large positive residual (more failures than expected). Answer: Independence is rejected at \(\alpha=0.05\), and the pattern suggests Machine C has an elevated fail count while Machine B has a reduced fail count relative to the marginal expectation.
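The full computation for this example, including the Pearson residuals, can be sketched as:

```python
import math

# Example 11.2: rows are machines A, B, C; columns are pass, fail.
table = [[120, 30], [140, 10], [110, 40]]
row_totals = [sum(r) for r in table]
col_totals = [sum(c) for c in zip(*table)]
n = sum(row_totals)  # 450

expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]
stat = sum((table[i][j] - expected[i][j]) ** 2 / expected[i][j]
           for i in range(3) for j in range(2))
residuals = [[(table[i][j] - expected[i][j]) / math.sqrt(expected[i][j])
              for j in range(2)] for i in range(3)]

# df = (3 - 1)(2 - 1) = 2, so the survival function is exactly exp(-x / 2).
p_value = math.exp(-stat / 2)
print(round(stat, 2), round(residuals[1][1], 2), round(residuals[2][1], 2))
# 21.28 -3.23 2.58
```

The printed residuals confirm the diagnosis in the text: Machine B’s fail cell is strongly negative and Machine C’s fail cell is positive.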

Figure 11.2. Reading a contingency table via observed counts, expected counts, and residuals

The main message of this figure is that chi-square evidence is produced by specific cells where observed counts differ from expected counts under a null “no association” model. The purpose is to teach a disciplined way to read a contingency table: observed counts show the operational reality, expected counts define the null benchmark, and residuals show where the benchmark fails.

The tables are simulated because the goal is to control the underlying association and to compare small and large sample sizes without changing the underlying process. A “repetition” would mean resampling a new table from the same joint probabilities, but this figure shows one realized table for each displayed \(n\) to focus attention on interpretation rather than variability. In this figure, \(n\) is the grand total across all cells, and increasing \(n\) scales counts while also stabilizing residual patterns.

To read the figure, start by selecting one option in the dropdown that specifies both \(n\) and the layer to display. In the “Observed” layer, look for operational differences across rows and columns but avoid causal claims. In the “Expected” layer, confirm that expected counts follow the row and column totals, and then move to “Residuals,” where large positive values indicate “more than expected” and large negative values indicate “less than expected,” cell by cell.

The figure shows that when \(n\) is small, residuals can appear noisy because random fluctuation is large relative to expected counts, and borderline expected counts can occur more often. When \(n\) is large, residuals become more stable and the strongest departures are easier to localize, which supports clearer operational diagnosis. This matters because a statistically significant \(\chi^2\) can be driven by a few cells, and residuals provide a map of where process investigation should begin.

Finally, the figure discourages a common pitfall: interpreting only the \(P\text{-value}\) and ignoring the table structure. A correct workflow is to report the significance decision, then summarize which cells drive the discrepancy and in which direction, and then translate those cells into actionable process hypotheses (for example, lane-specific delays or segment-specific complaint types).

11.3.5 Chi-square test for homogeneity

The test for homogeneity compares distributions across multiple populations or groups. The key feature is the sampling design: group sample sizes are often fixed in advance (for example, auditing a predetermined number of units from each supplier), and the inferential target is whether the within-group category proportions are the same across groups.

In a homogeneity setting, the hypotheses are:

  • \(H_0:\) the category proportions are the same across groups

  • \(H_1:\) at least one group has a different set of proportions

The computation uses the same expected-count formula \(E_{ij}=R_iC_j/n\), the same chi-square statistic, and the same degrees of freedom \(\nu=(r-1)(c-1)\). The difference from independence is conceptual and design-based rather than computational.

Example 11.3 (Homogeneity: supplier comparison). A company audits incoming lots from three suppliers and records whether each inspected unit is accepted or rejected. The audit plan fixes sample sizes in advance: 100 units from Supplier 1, 120 from Supplier 2, and 80 from Supplier 3. The observed counts are: Supplier 1 (92 accept, 8 reject), Supplier 2 (102 accept, 18 reject), Supplier 3 (70 accept, 10 reject).

Question: At \(\alpha=0.05\), are the reject proportions homogeneous across suppliers, and what is an appropriate practical interpretation?

The table is a 2-by-3 layout, so \(\nu=(2-1)(3-1)=2\). Expected counts are computed from the overall accept and reject totals and each supplier’s column total. The resulting chi-square statistic is \(\chi^2 \approx 2.56\), leading to \(P\text{-value} \approx 0.278\).

A non-significant result indicates that the observed differences in reject counts are consistent with sampling variability under equal reject proportions, given the sample sizes used. A useful complement is an association magnitude such as Cramér’s \(V\), which for a 2-by-3 table reduces to \(V=\sqrt{\chi^2/n}\) and here is small. Answer: The homogeneity hypothesis is not rejected at \(\alpha=0.05\), so the audit does not provide strong evidence that reject rates differ across suppliers, and the apparent differences should be treated as operational signals requiring more data rather than as confirmed performance gaps.
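A sketch reproducing these numbers follows; for a 2-by-3 table \(\min(r-1,\ c-1)=1\), so Cramér’s \(V\) reduces to \(\sqrt{\chi^2/n}\) as stated above:

```python
import math

# Example 11.3: rows are accept/reject; columns are suppliers 1, 2, 3.
table = [[92, 102, 70], [8, 18, 10]]
row_totals = [sum(r) for r in table]         # 264 accept, 36 reject
col_totals = [sum(c) for c in zip(*table)]   # 100, 120, 80 (fixed by audit plan)
n = sum(row_totals)                          # 300

expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]
stat = sum((table[i][j] - expected[i][j]) ** 2 / expected[i][j]
           for i in range(2) for j in range(3))
p_value = math.exp(-stat / 2)   # df = (2 - 1)(3 - 1) = 2, exact closed form
v = math.sqrt(stat / n)         # min(r - 1, c - 1) = 1 for a 2-by-3 table
print(round(stat, 2), round(p_value, 3), round(v, 3))  # 2.56 0.278 0.092
```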

11.3.6 Standardized residuals and effect size for interpretation

A chi-square test is global: it detects that “some difference exists,” not which difference matters. To localize discrepancies, use standardized (Pearson) residuals.

For GOF, a common diagnostic is

\[r_i = \frac{O_i - E_i}{\sqrt{E_i}}\]

For contingency tables, apply the same idea cell-wise:

\[r_{ij} = \frac{O_{ij} - E_{ij}}{\sqrt{E_{ij}}}\]

Large positive residuals indicate more observations than expected under the null model, and large negative residuals indicate fewer. In practice, residual magnitudes around 2 or larger are often treated as practically notable signals, but this rule should be used cautiously because many cells create multiple-comparison pressure.

For a compact measure of association strength, Cramér’s \(V\) is often reported:

\[V = \sqrt{\frac{\chi^2}{n\min(r-1,\ c-1)}}\]

This number is not a probability, and it does not describe direction. It is useful for comparing association strength across studies with different sample sizes and table dimensions.
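The general formula can be wrapped in a small helper (the function name is illustrative), applied here to the Example 11.2 statistic:

```python
import math

def cramers_v(chi2, n, r, c):
    """Cramér's V: chi-square normalized by n and the smaller table dimension."""
    return math.sqrt(chi2 / (n * min(r - 1, c - 1)))

# Example 11.2 values: chi-square of about 21.28 on a 3-by-2 table with n = 450.
print(round(cramers_v(21.28, 450, 3, 2), 3))  # 0.217
```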

11.3.7 Assumptions, expected counts, and remedies when assumptions fail

Chi-square methods rely on an approximation that is most accurate when expected counts are not small and when observations are independent. In table settings, independence refers to the sampling unit, meaning each unit contributes to exactly one cell and units do not influence each other’s category membership.

Expected-count issues can arise from rare categories, from too many categories, or from small samples. When expected counts are too small, the chi-square statistic can become dominated by sampling discreteness rather than meaningful model departure, and the chi-square reference distribution can mis-calibrate the \(P\text{-value}\).

Practical remedies include:

  • Combine categories in a defensible way, preferably using operational logic (for example, merging adjacent severity levels) rather than post-hoc significance chasing.

  • Use an exact method for small tables when appropriate, especially in 2-by-2 settings with small expected counts.

  • Use a simulation-based \(P\text{-value}\) by generating tables under the null model and comparing the observed statistic to the simulated distribution.

  • Consider alternative modeling when the objective is prediction or adjustment for covariates, such as logistic regression for binary outcomes or multinomial models for multi-category outcomes.
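One way to implement the simulation-based remedy for a GOF test is sketched below; the function name, the sampling loop, and the add-one correction are choices of this sketch, not prescribed by the text:

```python
import random

def simulated_gof_pvalue(observed, probs, reps=2000, seed=0):
    """Monte Carlo P-value for a GOF test: resample tables under H0 and
    count how often the simulated statistic is at least the observed one."""
    rng = random.Random(seed)
    n = sum(observed)
    expected = [n * p for p in probs]

    def stat(counts):
        return sum((o - e) ** 2 / e for o, e in zip(counts, expected))

    obs_stat = stat(observed)
    hits = 0
    for _ in range(reps):
        counts = [0] * len(probs)
        for _ in range(n):  # draw each unit's category from the null model
            u = rng.random()
            cum = 0.0
            for i, p in enumerate(probs):
                cum += p
                if u < cum:
                    counts[i] += 1
                    break
            else:
                counts[-1] += 1  # guard against floating-point leftover
        if stat(counts) >= obs_stat:
            hits += 1
    return (hits + 1) / (reps + 1)  # add-one correction avoids P = 0

# Combined-category data from Example 11.1; the chi-square approximation
# gave about 0.138, and the simulated value should land nearby.
p = simulated_gof_pvalue([150, 35, 15], [0.80, 0.15, 0.05])
```

Because the reference distribution is built from the null model itself, this approach remains calibrated even when expected counts are borderline for the chi-square approximation.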

Figure 11.3. Combining categories to satisfy minimum expected counts

The main message of this figure is that small expected counts create a validity problem for chi-square inference, and combining categories is a principled corrective action. The purpose is to show how a table that looks “detailed” can be statistically fragile, and how a slightly coarser table can be more defensible for inference.

The data are simulated from a fixed GOF probability model because the focus is on expected frequencies rather than on a specific real dataset. A “repetition” would mean drawing a new sample of size \(n\) from the same category probabilities and recomputing observed counts, but the figure presents one realized sample to highlight the expected-count check. In this figure, \(n\) is the sample size that generates the one-way frequency table.

To read the figure, first examine the “Original categories” view and compare the observed bars to the expected bars for each category. Then locate the horizontal reference level at 5, which represents a commonly used minimum expected count rule of thumb. Categories whose expected bars fall below that line are candidates for combining with adjacent or operationally similar categories.

The figure shows that when categories are rare, the expected counts can easily fall below the minimum threshold, especially for small or moderate \(n\). Switching to “Combined categories” increases expected counts by pooling rare categories, and it also changes the degrees of freedom because the number of categories \(k\) decreases. This matters because the validity of the chi-square approximation and the correct reference distribution both depend on the final table structure.

Finally, the figure emphasizes that combining categories is not only a technical fix but also an interpretation choice. After combining, the test answers a slightly different operational question because distinctions among rare categories are no longer separately assessed. The recommended practice is to document the combining rule before looking at the test result and to explain the new category meanings in the final report.

11.4 Discussion and Common Errors

A common error is to confuse independence and homogeneity as different formulas. The formulas are the same, but the inferential meaning depends on the sampling design: independence is a single-sample “two variables in one population” question, while homogeneity is a “compare distributions across groups” question.

Another frequent pitfall is to interpret a significant chi-square test as a statement about causality. Chi-square tests detect association or distributional mismatch, but they do not establish a causal mechanism. Process conclusions should be framed as hypotheses for investigation, supported by residual patterns and operational knowledge.

Expected-count conditions are often overlooked, especially when analysts create many categories to gain detail. If expected counts are too small, chi-square inference can be unreliable, and combining categories or using an exact or simulation-based method is more defensible.

Finally, analysts sometimes report only the \(P\text{-value}\) and omit interpretation of which cells differ. A useful report includes the observed table, the expected table, a residual summary, and a short operational explanation of the dominant positive and negative discrepancies.

11.5 Summary

  • Chi-square methods compare observed counts to expected counts under a null categorical model

  • GOF tests assess whether a one-way distribution matches specified probabilities, using \(E_i=np_i\) and an upper-tail chi-square reference

  • Independence tests assess association between two categorical variables in a single random sample, using \(E_{ij}=R_iC_j/n\)

  • Homogeneity tests compare category proportions across groups, with the same computation but a different sampling design interpretation

  • Standardized residuals localize which cells drive the global chi-square discrepancy and provide directional insight

  • Expected-count checks are essential, and combining categories or using exact/simulation-based methods are recommended when assumptions fail