15. Regression and Correlation II: Simple Linear Regression (SLR) for Mean Response, Inference, and Prediction
15.0 Notation Table
| Symbol | Meaning |
|---|---|
| \((x_i, y_i)\) | observed predictor/response pair, \(i=1,\dots,n\) |
| \(n\) | sample size (number of observed pairs) |
| \(Y\) | response random variable |
| \(X\) | predictor random variable (or observed design values) |
| \(\mu_{Y\mid x}\) | mean response at predictor value \(x\) |
| \(\beta_0,\ \beta_1\) | intercept and slope parameters |
| \(\varepsilon_i\) | random error term |
| \(\sigma^2\) | error variance |
| \(\bar{x},\ \bar{y}\) | sample means of \(x_i\) and \(y_i\) |
| \(S_{xx}\) | \(\sum_{i=1}^n (x_i-\bar{x})^2\) |
| \(S_{xy}\) | \(\sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})\) |
| \(b_0,\ b_1\) | least squares estimates of \(\beta_0,\beta_1\) |
| \(\hat{y}_i\) | fitted value at \(x_i\), \(\hat{y}_i=b_0+b_1x_i\) |
| \(e_i\) | residual, \(e_i=y_i-\hat{y}_i\) |
| \(\mathrm{SSE}\) | error sum of squares, \(\sum e_i^2\) |
| \(s^2,\ s\) | \(s^2=\mathrm{SSE}/(n-2)\), and \(s=\sqrt{s^2}\) |
| \(\mathrm{se}(b_1)\) | standard error of \(b_1\) |
| \(t_{\alpha/2,\ n-2}\) | t critical value with \(n-2\) df |
| \(\hat{y}(x_0)\) | fitted mean response at \(x_0\), \(b_0+b_1x_0\) |
| \(Y_0\) | a future response at predictor value \(x_0\) |
| \(r_i\) | standardized residual, \(r_i=e_i/s\) |
15.1 Introduction
The previous module used scatterplots and covariance-based summaries to describe linear association between two quantitative variables. Description alone is not sufficient when the objective is prediction, process planning, or quantifying how the mean response changes with a predictor.
This module introduces simple linear regression (SLR) as a model for the conditional mean response \(\mu_{Y\mid x}\) and shows how least squares produces a fitted line. The second objective is inference: confidence intervals, prediction intervals, and hypothesis tests for the slope, all grounded in sampling distributions under stated assumptions.
15.2 Learning Outcomes
By the end of this module, students should be able to:
State the SLR model and its core assumptions (linearity, independence, constant variance, and Normality for t-based inference).
Compute and interpret the least squares estimates \(b_0\) and \(b_1\) using \(S_{xx}\) and \(S_{xy}\).
Use residuals and \(\mathrm{SSE}\) to estimate \(\sigma^2\) via \(s^2=\mathrm{SSE}/(n-2)\).
Conduct t inference for the slope \(\beta_1\), including confidence intervals and hypothesis tests.
Construct and interpret a confidence interval for \(\mu_{Y\mid x_0}\) and a prediction interval for a new observation \(Y_0\) at \(x_0\).
Read basic residual diagnostics and connect visible patterns to assumption failures.
15.3 Main Concepts
15.3.1 The Simple Linear Regression Model
SLR models the conditional mean of \(Y\) as a linear function of \(x\):
\[
\mu_{Y\mid x} = \beta_0 + \beta_1 x.
\]
For observed predictor values \(x_1,\dots,x_n\), the sampling model is
\[
Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i=1,\dots,n.
\]
The standard assumptions are:
\(E(\varepsilon_i)=0\) (no systematic error),
\(\mathrm{Var}(\varepsilon_i)=\sigma^2\) (constant variance),
errors are independent across observations.
For t-based confidence intervals and hypothesis tests, a further condition is commonly used:
\(\varepsilon_i \sim N(0,\sigma^2)\) (Normality), so \(Y_i\) is Normal conditional on \(x_i\).
15.3.2 Least Squares Estimates and Fitted Values
Least squares chooses \(b_0\) and \(b_1\) to minimize the sum of squared vertical deviations from the line:
\[
\mathrm{SSE}(b_0,b_1) = \sum_{i=1}^n \bigl(y_i - b_0 - b_1 x_i\bigr)^2.
\]
Define centered sums:
\[
S_{xx} = \sum_{i=1}^n (x_i-\bar{x})^2, \qquad S_{xy} = \sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y}).
\]
The least squares estimates are
\[
b_1 = \frac{S_{xy}}{S_{xx}}, \qquad b_0 = \bar{y} - b_1\bar{x}.
\]
The fitted value at \(x_i\) is
\[
\hat{y}_i = b_0 + b_1 x_i,
\]
and the residual is
\[
e_i = y_i - \hat{y}_i.
\]
Figure 15.1 narrative (m15_fig01_scatter_fit). This figure shows a scatterplot with a fitted regression line and a known true mean line (from simulation). Increasing \(n\) typically makes the fitted line more stable because the slope estimate averages over more information. The plot also reinforces that visual fit alone does not validate assumptions; residual diagnostics are required.
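The estimates above can be sketched in a few lines of code. This is a minimal sketch with illustrative data (the numbers are hypothetical, not taken from the module):

```python
# Minimal sketch: least squares estimates for SLR via the centered
# sums S_xx and S_xy. The data below are illustrative only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

s_xx = sum((x - x_bar) ** 2 for x in xs)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

b1 = s_xy / s_xx          # slope estimate: S_xy / S_xx
b0 = y_bar - b1 * x_bar   # intercept estimate: y_bar - b1 * x_bar

fitted = [b0 + b1 * x for x in xs]               # fitted values y_hat_i
residuals = [y - f for y, f in zip(ys, fitted)]  # residuals e_i

print(round(b1, 3), round(b0, 3))
```

With an intercept in the model, the residuals sum to zero, which is a quick sanity check on any hand computation.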
15.3.3 Least Squares as an Optimization Problem
SLR uses closed-form estimates, but the optimization view clarifies the method. The criterion
\[
\mathrm{SSE}(b_0,b_1) = \sum_{i=1}^n (y_i - b_0 - b_1 x_i)^2
\]
is quadratic in \((b_0,b_1)\) and has a unique global minimum whenever
\[
S_{xx} = \sum_{i=1}^n (x_i-\bar{x})^2 > 0,
\]
meaning not all \(x_i\) are identical. If all \(x_i\) are the same, a slope cannot be estimated.
Figure 15.2 narrative (m15_fig02_sse_contour). This figure plots SSE contours over a grid of \((b_0,b_1)\) values and marks the minimizer. The nested elliptical contours reflect a smooth quadratic surface. Points with high leverage or large residuals can reshape the SSE surface and pull the minimizer, which motivates influence awareness.
15.3.4 Estimating Error Variance and Fit Summaries
The residual sum of squares is
\[
\mathrm{SSE} = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2.
\]
Under the SLR model, \(\sigma^2\) is estimated by
\[
s^2 = \frac{\mathrm{SSE}}{n-2}, \qquad s = \sqrt{s^2}.
\]
The degrees of freedom \(n-2\) reflect estimation of \(b_0\) and \(b_1\).
A related descriptive summary is
\[
R^2 = 1 - \frac{\mathrm{SSE}}{S_{yy}}, \qquad S_{yy} = \sum_{i=1}^n (y_i-\bar{y})^2.
\]
In SLR with an intercept, \(R^2\) equals \(r^2\), the squared sample correlation between \(x\) and \(y\). This is a descriptive fit measure; it does not imply causality.
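These fit summaries, including the identity \(R^2 = r^2\), can be verified numerically. A minimal sketch with illustrative (hypothetical) data:

```python
# Minimal sketch: SSE, s^2, and R^2 for an SLR fit, plus a numerical
# check of the SLR identity R^2 = r^2. Data are illustrative only.
import math

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
s2 = sse / (n - 2)                  # estimate of sigma^2, n-2 df
r = s_xy / math.sqrt(s_xx * s_yy)   # sample correlation
r_squared = 1 - sse / s_yy          # coefficient of determination

print(round(s2, 4), round(r_squared, 4))
```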
15.3.5 t Inference for the Slope (and Intercept)
The slope \(\beta_1\) is typically the main inferential target. Under the stated assumptions,
\[
b_1 \sim N\!\left(\beta_1,\ \frac{\sigma^2}{S_{xx}}\right).
\]
Thus the standard error of \(b_1\) is
\[
\mathrm{se}(b_1) = \frac{s}{\sqrt{S_{xx}}}.
\]
A two-sided \(100(1-\alpha)\%\) confidence interval for \(\beta_1\) is
\[
b_1 \pm t_{\alpha/2,\ n-2}\,\mathrm{se}(b_1).
\]
For testing
\[
H_0:\beta_1=\beta_{1,0} \quad \text{vs} \quad H_1:\beta_1\neq\beta_{1,0},
\]
use
\[
t_0 = \frac{b_1 - \beta_{1,0}}{\mathrm{se}(b_1)}
\]
with \(n-2\) degrees of freedom. The common special case is \(H_0:\beta_1=0\).
The intercept standard error is
\[
\mathrm{se}(b_0) = s\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}.
\]
In many operational settings, \(\beta_0\) is not practically interpretable unless \(x=0\) is inside the observed operating range.
Figure 15.3 narrative (m15_fig03_slope_sampling). This figure shows the sampling distribution of \(b_1\) under repeated sampling from a known model. As \(n\) increases, the distribution tightens around the true \(\beta_1\), which corresponds to smaller standard error and narrower confidence intervals. The goal is to reinforce that a single observed \(b_1\) is one draw from a distribution.
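The slope test and confidence interval can be sketched as follows, using the t distribution from SciPy and illustrative (hypothetical) data:

```python
# Minimal sketch: t-based CI and two-sided test for the slope beta_1.
# Data are illustrative only; scipy.stats.t supplies the t distribution.
import math
from scipy import stats

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
s = math.sqrt(sse / (n - 2))
se_b1 = s / math.sqrt(s_xx)          # standard error of the slope

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)  # 95% CI for beta_1

t0 = b1 / se_b1                      # test statistic for H0: beta_1 = 0
p_value = 2 * stats.t.sf(abs(t0), df=n - 2)      # two-sided p-value

print(ci, round(t0, 2), p_value)
```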
15.3.6 Confidence Interval for Mean Response and Prediction Interval
Regression supports two distinct targets at a given predictor value \(x_0\):
Mean response: \(\mu_{Y\mid x_0}=\beta_0+\beta_1x_0\)
A new outcome: \(Y_0\) at \(x_0\)
Both use the point estimate
\[
\hat{y}(x_0) = b_0 + b_1 x_0.
\]
A \(100(1-\alpha)\%\) confidence interval for the mean response is
\[
\hat{y}(x_0) \pm t_{\alpha/2,\ n-2}\, s\sqrt{\frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}}}.
\]
A \(100(1-\alpha)\%\) prediction interval for a new observation is
\[
\hat{y}(x_0) \pm t_{\alpha/2,\ n-2}\, s\sqrt{1 + \frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}}}.
\]
Both intervals widen as \(|x_0-\bar{x}|\) increases. The prediction interval is always wider than the mean-response interval because it includes irreducible outcome variability (the leading \(1\) term).
Figure 15.4 narrative (m15_fig04_ci_pi_band). This figure overlays the fitted line with a confidence band for mean response and a wider prediction band for new observations. The key reading rule is that prediction uncertainty is dominated by individual-level variability, so the prediction band remains relatively wide even when the mean-response band is tight.
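The two interval half-widths differ only by the leading \(1\) inside the square root, which a short computation makes concrete. A minimal sketch with illustrative (hypothetical) data and an arbitrary \(x_0\):

```python
# Minimal sketch: half-widths of the mean-response CI and the prediction
# interval at a chosen x0. Data and x0 are illustrative only.
import math
from scipy import stats

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
s = math.sqrt(sse / (n - 2))

x0 = 3.5
y_hat0 = b0 + b1 * x0
t_crit = stats.t.ppf(0.975, df=n - 2)

# PI half-width adds the leading "1" for individual-level variability
half_ci = t_crit * s * math.sqrt(1 / n + (x0 - x_bar) ** 2 / s_xx)
half_pi = t_crit * s * math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / s_xx)

print(round(y_hat0, 3), round(half_ci, 3), round(half_pi, 3))
```

Re-running with \(x_0\) farther from \(\bar{x}\) shows both half-widths growing, while the prediction interval stays strictly wider at every \(x_0\).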
15.3.7 Residual Diagnostics: What Patterns Mean
Residual checks assess whether the modeling assumptions are plausible. Residuals are
\[
e_i = y_i - \hat{y}_i,
\]
and standardized residuals are
\[
r_i = \frac{e_i}{s}.
\]
Two common diagnostic plots are:
residuals vs fitted (or residuals vs \(x\)): checks linearity and constant variance,
Normal Q–Q plot: checks Normality (relevant for t-based inference in small samples).
Figure 15.5 narrative (m15_fig05_residual_patterns). This figure shows typical residual patterns: curvature (nonlinearity), funnel shape (heteroscedasticity), and isolated extreme points (outliers/high influence). The point is procedural: when patterns appear, revise the model, transform the response, or investigate special causes rather than treating the p-value as final.
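A simple numerical screen follows the same logic: standardize the residuals by \(s\) and flag any observation beyond a conventional cutoff such as \(|r_i|>2\). A minimal sketch with illustrative (hypothetical) data:

```python
# Minimal sketch: standardized residuals r_i = e_i / s, flagging
# observations with |r_i| > 2 for follow-up. Data are illustrative only.
import math

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
sse = sum(e ** 2 for e in residuals)
s = math.sqrt(sse / (n - 2))

std_resid = [e / s for e in residuals]                 # r_i = e_i / s
flagged = [i for i, r in enumerate(std_resid) if abs(r) > 2]

print([round(r, 2) for r in std_resid], flagged)
```

Flagged points are candidates for investigation, not automatic deletion; the module's advice to look for special causes applies here.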
15.3.8 Examples
Example 15.1 (Maintenance Effort and Downtime)
A plant records monthly preventive maintenance hours (\(x\)) and unplanned downtime hours (\(y\)) for a critical line. The goal is to estimate how the mean downtime changes with maintenance effort.
Question: Fit an SLR line predicting downtime from maintenance hours, interpret the slope, and predict downtime at \(x=6\).
Using least squares, suppose the fitted line is
\[
\hat{y} = 17.98 - 1.452x.
\]
Answer: The slope \(b_1\approx -1.452\) indicates that, within the observed range, the mean downtime is estimated to decrease by about 1.45 hours for each additional maintenance hour. At \(x=6\), the predicted downtime is \(\hat{y}(6)\approx 9.27\) hours.
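The arithmetic in the answer can be checked directly. This sketch uses only the two values reported in the answer (the slope and the prediction at \(x=6\)) and backs out the implied intercept:

```python
# Arithmetic check for Example 15.1, assuming only the reported values:
# slope b1 ≈ -1.452 and predicted downtime y_hat(6) ≈ 9.27.
b1 = -1.452
y_hat_at_6 = 9.27

# implied intercept: b0 = y_hat(6) - b1 * 6 = 9.27 + 8.712 = 17.982
b0 = y_hat_at_6 - b1 * 6

# predicted mean downtime at x = 6 maintenance hours
prediction = b0 + b1 * 6
print(round(b0, 3), round(prediction, 2))
```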
Example 15.2 (Inference for the Slope)
A quality engineer studies whether increasing conveyor speed increases defect rate. Let \(x\) be speed and \(y\) be defect rate.
Question: Test \(H_0:\beta_1=0\) vs \(H_1:\beta_1\neq 0\) at \(\alpha=0.05\) and report a 95% CI for \(\beta_1\).
Compute
\[
t_0 = \frac{b_1}{\mathrm{se}(b_1)}, \qquad \mathrm{se}(b_1) = \frac{s}{\sqrt{S_{xx}}},
\]
and form
\[
b_1 \pm t_{0.025,\ n-2}\,\mathrm{se}(b_1).
\]
Answer: Reject \(H_0\) if \(|t_0|\) is large (or p-value is small). The confidence interval quantifies the plausible range of the mean defect-rate change per unit increase in speed.
Example 15.3 (Mean Response vs Individual Prediction at \(x_0\))
A service team models weekly new sign-ups (\(y\)) using marketing spend (\(x\)). Management needs (i) a mean forecast and (ii) a plausible range for a single future week.
Question: At \(x_0=25\), compute a 95% CI for \(\mu_{Y\mid x_0}\) and a 95% PI for \(Y_0\).
Answer: Use the mean-response interval for planning expected workload, and the prediction interval for staffing risk. The prediction interval is wider because it includes week-to-week variability not removed by knowing \(x_0\).
15.4 Discussion and Common Errors
SLR models the mean response, not a deterministic law. The slope describes expected change in \(\mu_{Y\mid x}\), while individual outcomes vary around that mean with variance \(\sigma^2\).
A frequent error is confusing the confidence interval for \(\mu_{Y\mid x_0}\) with the prediction interval for \(Y_0\). The prediction interval must be wider and remains wide even when the fitted line is precise.
Extrapolation is risky even with high \(R^2\). Uncertainty increases away from \(\bar{x}\), and model validity outside the observed range is not guaranteed.
Residual patterns should drive action: curvature suggests a missing nonlinear term, a funnel suggests non-constant variance, and extreme points suggest outlier or influence concerns. When assumption failures appear, revise the model or the measurement/design rather than relying on fragile p-values.
Regression does not by itself establish causation. When \(x\) is not controlled experimentally, interpret the slope as association in a conditional-mean model, not as a causal effect.
15.5 Summary
Simple linear regression models the conditional mean response
\[
\mu_{Y\mid x} = \beta_0 + \beta_1 x
\]
and estimates parameters by least squares. Residuals and SSE yield an estimate of the error variance via
\[
s^2 = \frac{\mathrm{SSE}}{n-2}.
\]
Inference for the slope uses a t distribution with \(n-2\) degrees of freedom, producing confidence intervals and hypothesis tests. Regression also supports two distinct interval types: confidence intervals for the mean response and wider prediction intervals for individual future outcomes.
The module closes with residual diagnostics to assess whether linearity, constant variance, and approximate Normality are plausible enough for the intended inference and prediction.