15. Regression and Correlation II: Simple Linear Regression (SLR) for Mean Response, Inference, and Prediction
15.0 Notation Table
| Symbol | Meaning |
|---|---|
| \((x_i, y_i)\) | observed predictor/response pair, \(i=1,\dots,n\) |
| \(n\) | sample size (number of observed pairs) |
| \(Y\) | response random variable |
| \(X\) | predictor random variable (or observed design values) |
| \(\mu_{Y\mid x}\) | mean response at predictor value \(x\) |
| \(\beta_0,\ \beta_1\) | intercept and slope parameters |
| \(\varepsilon_i\) | random error term |
| \(\sigma^2\) | error variance |
| \(\bar{x},\ \bar{y}\) | sample means of \(x_i\) and \(y_i\) |
| \(S_{xx}\) | \(\sum_{i=1}^n (x_i-\bar{x})^2\) |
| \(S_{xy}\) | \(\sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})\) |
| \(b_0,\ b_1\) | least squares estimates of \(\beta_0,\beta_1\) |
| \(\hat{y}_i\) | fitted value at \(x_i\), \(\hat{y}_i=b_0+b_1x_i\) |
| \(e_i\) | residual, \(e_i=y_i-\hat{y}_i\) |
| \(\mathrm{SSE}\) | error sum of squares, \(\sum e_i^2\) |
| \(s^2,\ s\) | \(s^2=\mathrm{SSE}/(n-2)\), and \(s=\sqrt{s^2}\) |
| \(\mathrm{se}(b_1)\) | standard error of \(b_1\) |
| \(t_{\alpha/2,\ n-2}\) | t critical value with \(n-2\) df |
| \(\hat{y}(x_0)\) | fitted mean response at \(x_0\), \(b_0+b_1x_0\) |
| \(Y_0\) | a future response at predictor value \(x_0\) |
| \(r_i\) | standardized residual, \(r_i=e_i/s\) |
15.1 Introduction
The previous module used scatterplots and covariance-based summaries to describe linear association between two quantitative variables. Description alone is not sufficient when the objective is prediction, process planning, or quantifying how the mean response changes with a predictor.
This module introduces simple linear regression (SLR) as a model for the conditional mean response \(\mu_{Y\mid x}\) and shows how least squares produces a fitted line. The second objective is inference: confidence intervals, prediction intervals, and hypothesis tests for the slope, all grounded in sampling distributions under stated assumptions.
15.2 Learning Outcomes
By the end of this module, students should be able to:
State the SLR model and its core assumptions (linearity, independence, constant variance, and Normality for t-based inference).
Compute and interpret the least squares estimates \(b_0\) and \(b_1\) using \(S_{xx}\) and \(S_{xy}\).
Use residuals and \(\mathrm{SSE}\) to estimate \(\sigma^2\) via \(s^2=\mathrm{SSE}/(n-2)\).
Conduct t inference for the slope \(\beta_1\), including confidence intervals and hypothesis tests.
Construct and interpret a confidence interval for \(\mu_{Y\mid x_0}\) and a prediction interval for a new observation \(Y_0\) at \(x_0\).
Read basic residual diagnostics and connect visible patterns to assumption failures.
15.3 Main Concepts
15.3.1 The Simple Linear Regression Model
SLR models the conditional mean of \(Y\) as a linear function of \(x\):
\[
\mu_{Y\mid x} = \beta_0 + \beta_1 x.
\]
For observed predictor values \(x_1,\dots,x_n\), the sampling model is
\[
Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i=1,\dots,n.
\]
The standard assumptions are:
\(E(\varepsilon_i)=0\) (no systematic error),
\(\mathrm{Var}(\varepsilon_i)=\sigma^2\) (constant variance),
errors are independent across observations.
For t-based confidence intervals and hypothesis tests, a further condition is commonly used:
\(\varepsilon_i \sim N(0,\sigma^2)\) (Normality), so \(Y_i\) is Normal conditional on \(x_i\).
15.3.2 Least Squares Estimates and Fitted Values
Least squares chooses \(b_0\) and \(b_1\) to minimize the sum of squared vertical deviations from the line:
\[
\mathrm{SSE}(b_0,b_1) = \sum_{i=1}^n \bigl(y_i - b_0 - b_1 x_i\bigr)^2.
\]
Define centered sums:
\[
S_{xx} = \sum_{i=1}^n (x_i-\bar{x})^2, \qquad S_{xy} = \sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y}).
\]
The least squares estimates are
\[
b_1 = \frac{S_{xy}}{S_{xx}}, \qquad b_0 = \bar{y} - b_1\bar{x}.
\]
The fitted value at \(x_i\) is
\[
\hat{y}_i = b_0 + b_1 x_i,
\]
and the residual is
\[
e_i = y_i - \hat{y}_i.
\]
Figure 15.1 narrative (m15_fig01_scatter_fit). This figure shows a scatterplot with a fitted regression line and a known true mean line (from simulation). Increasing \(n\) typically makes the fitted line more stable because the slope estimate averages over more information. The plot also reinforces that visual fit alone does not validate assumptions; residual diagnostics are required.
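The estimates above can be sketched in a few lines of code. This is a minimal sketch with illustrative data (the numbers are hypothetical, not taken from the module):

```python
# Minimal sketch: least squares estimates for SLR via the centered
# sums S_xx and S_xy. The data below are illustrative only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

s_xx = sum((x - x_bar) ** 2 for x in xs)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

b1 = s_xy / s_xx          # slope estimate: S_xy / S_xx
b0 = y_bar - b1 * x_bar   # intercept estimate: y_bar - b1 * x_bar

fitted = [b0 + b1 * x for x in xs]               # fitted values y_hat_i
residuals = [y - f for y, f in zip(ys, fitted)]  # residuals e_i

print(round(b1, 3), round(b0, 3))
```

With an intercept in the model, the residuals sum to zero, which is a quick sanity check on any hand computation.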
15.3.3 Least Squares as an Optimization Problem
SLR uses closed-form estimates, but the optimization view clarifies the method. The criterion
\[
\mathrm{SSE}(b_0,b_1) = \sum_{i=1}^n (y_i - b_0 - b_1 x_i)^2
\]
is quadratic in \((b_0,b_1)\) and has a unique global minimum whenever
\[
S_{xx} = \sum_{i=1}^n (x_i-\bar{x})^2 > 0,
\]
meaning not all \(x_i\) are identical. If all \(x_i\) are the same, a slope cannot be estimated.
Figure 15.2 narrative (m15_fig02_sse_contour). This figure plots SSE contours over a grid of \((b_0,b_1)\) values and marks the minimizer. The nested elliptical contours reflect a smooth quadratic surface. Points with high leverage or large residuals can reshape the SSE surface and pull the minimizer, which motivates influence awareness.
15.3.4 Estimating Error Variance and Fit Summaries
The residual sum of squares is
\[
\mathrm{SSE} = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2.
\]
Under the SLR model, \(\sigma^2\) is estimated by
\[
s^2 = \frac{\mathrm{SSE}}{n-2}, \qquad s = \sqrt{s^2}.
\]
The degrees of freedom \(n-2\) reflect estimation of \(b_0\) and \(b_1\).
A related descriptive summary is
\[
R^2 = 1 - \frac{\mathrm{SSE}}{S_{yy}}, \qquad S_{yy} = \sum_{i=1}^n (y_i-\bar{y})^2.
\]
In SLR with an intercept, \(R^2\) equals \(r^2\), the squared sample correlation between \(x\) and \(y\). This is a descriptive fit measure; it does not imply causality.
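These fit summaries, including the identity \(R^2 = r^2\), can be verified numerically. A minimal sketch with illustrative (hypothetical) data:

```python
# Minimal sketch: SSE, s^2, and R^2 for an SLR fit, plus a numerical
# check of the SLR identity R^2 = r^2. Data are illustrative only.
import math

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
s2 = sse / (n - 2)                  # estimate of sigma^2, n-2 df
r = s_xy / math.sqrt(s_xx * s_yy)   # sample correlation
r_squared = 1 - sse / s_yy          # coefficient of determination

print(round(s2, 4), round(r_squared, 4))
```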
15.3.5 t Inference for the Slope (and Intercept)
The slope \(\beta_1\) is typically the main inferential target. Under the stated assumptions,
\[
b_1 \sim N\!\left(\beta_1,\ \frac{\sigma^2}{S_{xx}}\right).
\]
Thus the standard error of \(b_1\) is
\[
\mathrm{se}(b_1) = \frac{s}{\sqrt{S_{xx}}}.
\]
A two-sided \(100(1-\alpha)\%\) confidence interval for \(\beta_1\) is
\[
b_1 \pm t_{\alpha/2,\ n-2}\,\mathrm{se}(b_1).
\]
For testing
\[
H_0:\beta_1=\beta_{1,0} \quad \text{vs} \quad H_1:\beta_1\neq\beta_{1,0},
\]
use
\[
t_0 = \frac{b_1 - \beta_{1,0}}{\mathrm{se}(b_1)}
\]
with \(n-2\) degrees of freedom. The common special case is \(H_0:\beta_1=0\).
The intercept standard error is
\[
\mathrm{se}(b_0) = s\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}.
\]
In many operational settings, \(\beta_0\) is not practically interpretable unless \(x=0\) is inside the observed operating range.
Figure 15.3 narrative (m15_fig03_slope_sampling). This figure shows the sampling distribution of \(b_1\) under repeated sampling from a known model. As \(n\) increases, the distribution tightens around the true \(\beta_1\), which corresponds to smaller standard error and narrower confidence intervals. The goal is to reinforce that a single observed \(b_1\) is one draw from a distribution.
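The slope test and confidence interval can be sketched as follows, using the t distribution from SciPy and illustrative (hypothetical) data:

```python
# Minimal sketch: t-based CI and two-sided test for the slope beta_1.
# Data are illustrative only; scipy.stats.t supplies the t distribution.
import math
from scipy import stats

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
s = math.sqrt(sse / (n - 2))
se_b1 = s / math.sqrt(s_xx)          # standard error of the slope

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)  # 95% CI for beta_1

t0 = b1 / se_b1                      # test statistic for H0: beta_1 = 0
p_value = 2 * stats.t.sf(abs(t0), df=n - 2)      # two-sided p-value

print(ci, round(t0, 2), p_value)
```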
15.3.6 Confidence Interval for Mean Response and Prediction Interval
Regression supports two distinct targets at a given predictor value \(x_0\):
Mean response: \(\mu_{Y\mid x_0}=\beta_0+\beta_1x_0\)
A new outcome: \(Y_0\) at \(x_0\)
Both use the point estimate
\[
\hat{y}(x_0) = b_0 + b_1 x_0.
\]
A \(100(1-\alpha)\%\) confidence interval for the mean response is
\[
\hat{y}(x_0) \pm t_{\alpha/2,\ n-2}\, s\sqrt{\frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}}}.
\]
A \(100(1-\alpha)\%\) prediction interval for a new observation is
\[
\hat{y}(x_0) \pm t_{\alpha/2,\ n-2}\, s\sqrt{1 + \frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}}}.
\]
Both intervals widen as \(|x_0-\bar{x}|\) increases. The prediction interval is always wider than the mean-response interval because it includes irreducible outcome variability (the leading \(1\) term).
Figure 15.4 narrative (m15_fig04_ci_pi_band). This figure overlays the fitted line with a confidence band for mean response and a wider prediction band for new observations. The key reading rule is that prediction uncertainty is dominated by individual-level variability, so the prediction band remains relatively wide even when the mean-response band is tight.
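The two interval half-widths differ only by the leading \(1\) inside the square root, which a short computation makes concrete. A minimal sketch with illustrative (hypothetical) data and an arbitrary \(x_0\):

```python
# Minimal sketch: half-widths of the mean-response CI and the prediction
# interval at a chosen x0. Data and x0 are illustrative only.
import math
from scipy import stats

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
s = math.sqrt(sse / (n - 2))

x0 = 3.5
y_hat0 = b0 + b1 * x0
t_crit = stats.t.ppf(0.975, df=n - 2)

# PI half-width adds the leading "1" for individual-level variability
half_ci = t_crit * s * math.sqrt(1 / n + (x0 - x_bar) ** 2 / s_xx)
half_pi = t_crit * s * math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / s_xx)

print(round(y_hat0, 3), round(half_ci, 3), round(half_pi, 3))
```

Re-running with \(x_0\) farther from \(\bar{x}\) shows both half-widths growing, while the prediction interval stays strictly wider at every \(x_0\).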
15.3.7 Residual Diagnostics: What Patterns Mean
Residual checks assess whether the modeling assumptions are plausible. Residuals are
\[
e_i = y_i - \hat{y}_i,
\]
and standardized residuals are
\[
r_i = \frac{e_i}{s}.
\]
Two common diagnostic plots are:
residuals vs fitted (or residuals vs \(x\)): checks linearity and constant variance,
Normal Q–Q plot: checks Normality (relevant for t-based inference in small samples).
Figure 15.5 narrative (m15_fig05_residual_patterns). This figure shows typical residual patterns: curvature (nonlinearity), funnel shape (heteroscedasticity), and isolated extreme points (outliers/high influence). The point is procedural: when patterns appear, revise the model, transform the response, or investigate special causes rather than treating the p-value as final.
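A simple numerical screen follows the same logic: standardize the residuals by \(s\) and flag any observation beyond a conventional cutoff such as \(|r_i|>2\). A minimal sketch with illustrative (hypothetical) data:

```python
# Minimal sketch: standardized residuals r_i = e_i / s, flagging
# observations with |r_i| > 2 for follow-up. Data are illustrative only.
import math

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
sse = sum(e ** 2 for e in residuals)
s = math.sqrt(sse / (n - 2))

std_resid = [e / s for e in residuals]                 # r_i = e_i / s
flagged = [i for i, r in enumerate(std_resid) if abs(r) > 2]

print([round(r, 2) for r in std_resid], flagged)
```

Flagged points are candidates for investigation, not automatic deletion; the module's advice to look for special causes applies here.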
15.3.8 Examples
Example 15.1 (Maintenance Effort and Downtime)
A plant records monthly preventive maintenance hours (\(x\)) and unplanned downtime hours (\(y\)) for a critical line. The goal is to estimate how the mean downtime changes with maintenance effort.
Question: Fit an SLR line predicting downtime from maintenance hours, interpret the slope, and predict downtime at \(x=6\).
Using least squares, suppose the fitted line is
\[
\hat{y} = 17.98 - 1.452x.
\]
Answer: The slope \(b_1\approx -1.452\) indicates that, within the observed range, the mean downtime is estimated to decrease by about 1.45 hours for each additional maintenance hour. At \(x=6\), the predicted downtime is \(\hat{y}(6)\approx 9.27\) hours.
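The arithmetic in the answer can be checked directly. This sketch uses only the two values reported in the answer (the slope and the prediction at \(x=6\)) and backs out the implied intercept:

```python
# Arithmetic check for Example 15.1, assuming only the reported values:
# slope b1 ≈ -1.452 and predicted downtime y_hat(6) ≈ 9.27.
b1 = -1.452
y_hat_at_6 = 9.27

# implied intercept: b0 = y_hat(6) - b1 * 6 = 9.27 + 8.712 = 17.982
b0 = y_hat_at_6 - b1 * 6

# predicted mean downtime at x = 6 maintenance hours
prediction = b0 + b1 * 6
print(round(b0, 3), round(prediction, 2))
```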
Example 15.2 (Inference for the Slope)
A quality engineer studies whether increasing conveyor speed increases defect rate. Let \(x\) be speed and \(y\) be defect rate.
Question: Test \(H_0:\beta_1=0\) vs \(H_1:\beta_1\neq 0\) at \(\alpha=0.05\) and report a 95% CI for \(\beta_1\).
Compute
\[
t_0 = \frac{b_1}{\mathrm{se}(b_1)}, \qquad \mathrm{se}(b_1) = \frac{s}{\sqrt{S_{xx}}},
\]
and form
\[
b_1 \pm t_{0.025,\ n-2}\,\mathrm{se}(b_1).
\]
Answer: Reject \(H_0\) if \(|t_0|\) is large (or p-value is small). The confidence interval quantifies the plausible range of the mean defect-rate change per unit increase in speed.
Example 15.3 (Mean Response vs Individual Prediction at \(x_0\))
A service team models weekly new sign-ups (\(y\)) using marketing spend (\(x\)). Management needs (i) a mean forecast and (ii) a plausible range for a single future week.
Question: At \(x_0=25\), compute a 95% CI for \(\mu_{Y\mid x_0}\) and a 95% PI for \(Y_0\).
Answer: Use the mean-response interval for planning expected workload, and the prediction interval for staffing risk. The prediction interval is wider because it includes week-to-week variability not removed by knowing \(x_0\).
15.4 Discussion and Common Errors
SLR models the mean response, not a deterministic law. The slope describes expected change in \(\mu_{Y\mid x}\), while individual outcomes vary around that mean with variance \(\sigma^2\).
A frequent error is confusing the confidence interval for \(\mu_{Y\mid x_0}\) with the prediction interval for \(Y_0\). The prediction interval must be wider and remains wide even when the fitted line is precise.
Extrapolation is risky even with high \(R^2\). Uncertainty increases away from \(\bar{x}\), and model validity outside the observed range is not guaranteed.
Residual patterns should drive action: curvature suggests a missing nonlinear term, a funnel suggests non-constant variance, and extreme points suggest outlier or influence concerns. When assumption failures appear, revise the model or the measurement/design rather than relying on fragile p-values.
Regression does not by itself establish causation. When \(x\) is not controlled experimentally, interpret the slope as association in a conditional-mean model, not as a causal effect.
15.5 Summary
Simple linear regression models the conditional mean response
\[
\mu_{Y\mid x} = \beta_0 + \beta_1 x
\]
and estimates parameters by least squares. Residuals and SSE yield an estimate of the error variance via
\[
s^2 = \frac{\mathrm{SSE}}{n-2}.
\]
Inference for the slope uses a t distribution with \(n-2\) degrees of freedom, producing confidence intervals and hypothesis tests. Regression also supports two distinct interval types: confidence intervals for the mean response and wider prediction intervals for individual future outcomes.
The module closes with residual diagnostics to assess whether linearity, constant variance, and approximate Normality are plausible enough for the intended inference and prediction.