17. Regression and Correlation IV: Diagnostics, Influence, and Model Reporting

17.0 Notation Table

Notation (symbols used in this module)

  • \(y_i\): observed response for unit \(i\)
  • \(\hat{y}_i\): fitted value for unit \(i\)
  • \(e_i = y_i - \hat{y}_i\): ordinary residual
  • \(n\): number of observations
  • \(k\): number of predictors (excluding intercept)
  • \(p = k+1\): number of parameters (including intercept)
  • \(X\): design matrix (includes intercept column)
  • \(\beta\): regression coefficient vector (population)
  • \(b\): least squares estimate of \(\beta\)
  • \(s^2\): mean square error (MSE) from the fitted model, \(s^2 = SSE/(n-p)\)
  • \(H = X(X^\top X)^{-1}X^\top\): hat matrix
  • \(h_{ii}\): leverage (diagonal of \(H\))
  • \(r_i\): internally studentized residual
  • \(t_i\): externally studentized (R-student) residual
  • \(\delta_i\): PRESS residual (leave-one-out)
  • \(\mathrm{PRESS}\): prediction sum of squares
  • \(R^2\): in-sample coefficient of determination
  • \(R^2_{\mathrm{pred}}\): prediction \(R^2\) based on PRESS
  • \(C_p\): Mallows' \(C_p\) statistic
  • \(D_i\): Cook's distance (influence measure)

17.1 Introduction

In earlier regression modules, the emphasis was on fitting a model, interpreting coefficients, and performing inference for the mean response and for prediction. Those conclusions are reliable only when the model is adequate for the data and when no small set of problematic observations dominates the fit.

This module develops a practical workflow for checking model adequacy and reporting results responsibly. The focus is on residual analysis, detection of outliers and misspecification, identification of influential observations, and prediction-oriented assessment using leave-one-out logic and the PRESS statistic. The module closes with a reporting template that aligns diagnostics, purpose, and conclusions.

17.2 Learning Outcomes

After completing this module, students should be able to:

  • Explain why model checking is required before relying on coefficient tests and intervals.

  • Compute and interpret scaled residual diagnostics, including studentized and R-student residuals.

  • Use residual plots to detect outliers, nonconstant variance, and mean-function misspecification.

  • Explain leverage and influence, and describe how \(h_{ii}\) affects diagnostic scaling.

  • Describe prediction-focused assessment using leave-one-out logic and the PRESS statistic.

  • Compare candidate regression models using simple criteria and report a defensible final model.

17.3 Main Concepts

17.3.1 Model adequacy in regression

Regression inference is conditional on a working error model. Under independent sampling, a common set of conditions is:

  • Mean structure is adequate (the chosen predictors and functional form capture the systematic pattern in \(E(Y\mid X)\)).

  • Errors have mean zero given predictors: \(E(\varepsilon\mid X)=0\).

  • Errors have constant variance \(\sigma^2\) (or at least no severe heteroscedasticity for the intended procedures).

  • Errors are approximately Normal when using small-sample \(t\) and \(F\) reference distributions.

Diagnostics are part of the inferential contract. If diagnostics show strong violations, then standard errors, tests, and intervals may not describe the intended uncertainty, and predictions may be unstable.

17.3.2 Residuals and scale-free residual diagnostics

The ordinary residual is

\[e_i = y_i - \hat{y}_i.\]

Raw residual magnitudes are not directly comparable across observations because the residual variance depends on leverage. Leverage values are the diagonal elements of the hat matrix:

\[H = X(X^\top X)^{-1}X^\top,\qquad \hat{\mathbf{y}} = H\mathbf{y},\qquad h_{ii}=\text{diag}(H).\]

A standard scale-free residual is the internally studentized residual:

\[r_i = \frac{e_i}{s\sqrt{1-h_{ii}}},\]

where \(s^2=SSE/(n-p)\) is the fitted model MSE. Studentization adjusts for the fact that high-leverage observations have smaller residual variance under the fitted model.

A more sensitive outlier diagnostic is the externally studentized (R-student) residual:

\[t_i = \frac{e_i}{s_{-i}\sqrt{1-h_{ii}}},\]

where \(s_{-i}\) is computed from a refit that excludes observation \(i\). This makes \(t_i\) more responsive to a single extreme point because the scale is not inflated by the same point being assessed.

Practical note on interpretation: values with \(|r_i|\) or \(|t_i|\) exceeding about 2 are often flagged for review, but flags are not automatic deletion rules. They are prompts for provenance checks and sensitivity analysis.
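These quantities can be computed directly from the hat matrix. The sketch below uses synthetic data with illustrative variable names and assumes NumPy is available; it computes \(h_{ii}\), \(r_i\), and \(t_i\), using the standard identity \(s_{-i}^2 = \big[(n-p)s^2 - e_i^2/(1-h_{ii})\big]/(n-p-1)\) so that no refits are needed:

```python
import numpy as np

# Synthetic data for illustration only
rng = np.random.default_rng(0)
n, k = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # intercept + k predictors
p = k + 1
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)

b, *_ = np.linalg.lstsq(X, y, rcond=None)          # least squares estimate
e = y - X @ b                                      # ordinary residuals
H = X @ np.linalg.inv(X.T @ X) @ X.T               # hat matrix
h = np.diag(H)                                     # leverages h_ii
s2 = e @ e / (n - p)                               # MSE from the fitted model

r = e / np.sqrt(s2 * (1 - h))                      # internally studentized
# s_{-i}^2 without refitting: ((n-p) s^2 - e_i^2/(1-h_ii)) / (n-p-1)
s2_loo = ((n - p) * s2 - e**2 / (1 - h)) / (n - p - 1)
t = e / np.sqrt(s2_loo * (1 - h))                  # externally studentized (R-student)

flagged = np.flatnonzero(np.abs(t) > 2)            # review these; do not auto-delete
```

As a check on the shortcut, \(t_i\) satisfies \(t_i = r_i\sqrt{(n-p-1)/(n-p-r_i^2)}\), so the two studentized residuals can be verified against each other.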

17.3.3 Residual plots for adequacy checking

Residual plots are informal but powerful tools because they show whether residual behavior matches the assumptions.

A core plot is studentized residuals versus fitted values \(\hat{y}_i\). It supports three checks:

  • Outliers: isolated points with large \(|r_i|\) or \(|t_i|\).

  • Heteroscedasticity: spread that increases or decreases with \(\hat{y}_i\) (fan/funnel shapes).

  • Mean misspecification: systematic structure (curvature, waves, clustering) rather than random scatter around 0.

A second plot is a Normal Q–Q plot of studentized residuals. Points close to a straight line support Normality as a working approximation for inference. Strong tail departures suggest heavy tails, outliers, or a mixture of regimes. Q–Q plots do not validate the mean structure; they assess whether the residual distribution is compatible with the Normal reference used by \(t\) and \(F\) procedures.

Figure 17.1 (file: m17_residual_patterns) trains reading skills for residual-versus-fitted plots by showing controlled scenarios (well-specified, heteroscedastic, nonlinear mean, outlier). The operational rule is to diagnose patterns first and decide actions second (transformations, added terms, robust methods, or process investigation).
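Both core plots can be produced with a few lines of code. The sketch below assumes matplotlib is available; the data and the output file name `residual_checks.png` are illustrative:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")                              # non-interactive backend for scripts
import matplotlib.pyplot as plt
from statistics import NormalDist

# Synthetic data for illustration only
rng = np.random.default_rng(1)
n = 40
x = rng.uniform(0, 10, n)
X = np.column_stack([np.ones(n), x])
y = 3 + 0.8 * x + rng.normal(scale=1.0, size=n)

b, *_ = np.linalg.lstsq(X, y, rcond=None)
yhat = X @ b
e = y - yhat
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
s = np.sqrt(e @ e / (n - X.shape[1]))
r = e / (s * np.sqrt(1 - h))                       # internally studentized residuals

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
ax1.scatter(yhat, r)                               # residuals vs fitted values
ax1.axhline(0, linestyle="--")
ax1.set(xlabel="fitted value", ylabel="studentized residual")
# Normal Q-Q: sorted residuals against standard Normal plotting positions
q = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
ax2.scatter(q, np.sort(r))
ax2.set(xlabel="standard Normal quantile", ylabel="sorted studentized residual")
fig.savefig("residual_checks.png")
```

With well-behaved simulated errors, the left panel should show random scatter around zero and the right panel should be close to a straight line; deliberately injecting heteroscedastic or curved structure into `y` reproduces the pathological patterns discussed above.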

17.3.4 Leverage and influential observations

Not all unusual points are equally consequential. A point can be unusual in response (large residual) or unusual in predictor space (high leverage). Influence tends to be largest when both occur together.

Leverage diagnostics:

  • \(h_{ii}\) measures how far observation \(i\) is in predictor space relative to the design.

  • Average leverage is \(\bar{h}=p/n\). Values well above this (a common rule of thumb flags \(h_{ii}>2p/n\)) indicate potential high leverage.

Leverage alone does not imply a bad point. High leverage can arise from rare but legitimate operating regimes (e.g., peak demand, unusual product mix). The concern is whether the fitted model depends heavily on a small number of such points.

Influence diagnostics summarize how much the fit would change if a point were removed. A widely used measure is Cook’s distance,

\[D_i = \frac{r_i^2}{p}\cdot\frac{h_{ii}}{1-h_{ii}},\]

which combines residual size and leverage. Software typically reports \(D_i\) and highlights large values; points with \(D_i\) near or above 1 are often examined first.

Figure 17.2 (file: m17_influence_plot) integrates leverage (x-axis), R-student residuals (y-axis), and Cook’s distance (bubble size). The key reading rule is that points with both high leverage and large standardized residuals tend to have high influence and should trigger verification and sensitivity checks rather than silent removal.
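The sketch below illustrates these screens on synthetic data with one deliberately extreme point; the \(2p/n\) leverage cutoff and the \(D_i>1\) cutoff are common rules of thumb, not laws, and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 25
x = np.append(rng.uniform(0, 1, n - 1), 5.0)       # last point far out in predictor space
X = np.column_stack([np.ones(n), x])
p = X.shape[1]
y = 2 + 1.5 * x + rng.normal(scale=0.3, size=n)
y[-1] += 3.0                                        # ...and shifted off the line in y

b, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ b
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)       # leverages; they sum to p
s2 = e @ e / (n - p)
r = e / np.sqrt(s2 * (1 - h))                       # internally studentized residuals
D = (r**2 / p) * (h / (1 - h))                      # Cook's distance

high_leverage = h > 2 * p / n                       # common leverage screen
influential = D > 1.0                               # one common cutoff for D_i
```

Because the extreme point is unusual in both predictor space and response, it dominates both `h` and `D`; the defensible next step is verification and a sensitivity refit, not silent removal.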

17.3.5 Prediction checking, cross-validation, and PRESS

Training fit is not the same as predictive performance. A model can have high \(R^2\) and still generalize poorly if it overfits noise or depends on influential observations.

Cross-validation separates fitting from assessment. In leave-one-out cross-validation (LOOCV), observation \(i\) is held out, the model is fitted on the remaining \(n-1\) observations, and \(y_i\) is predicted from the refit. The LOOCV prediction error is the PRESS residual

\[\delta_i = y_i - \hat{y}_{i,-i}.\]

In least squares regression, \(\delta_i\) can be computed efficiently from the ordinary residual and leverage:

\[\delta_i = \frac{e_i}{1-h_{ii}}.\]

The PRESS statistic aggregates LOOCV prediction errors:

\[\mathrm{PRESS} = \sum_{i=1}^n \delta_i^2.\]

A prediction-oriented \(R^2\) can be defined using PRESS:

\[R^2_{\mathrm{pred}} = 1 - \frac{\mathrm{PRESS}}{\sum_{i=1}^n (y_i-\bar{y})^2}.\]

A useful operational reading rule is that \(R^2_{\mathrm{pred}}\) can be much smaller than training \(R^2\) when the model is overfit or unstable; a large gap suggests that in-sample fit is not translating to predictive performance.
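The leverage shortcut makes PRESS cheap to compute. The sketch below (illustrative data, assuming NumPy) computes \(\delta_i\), \(\mathrm{PRESS}\), and \(R^2_{\mathrm{pred}}\), and verifies the shortcut against explicit leave-one-out refits:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20
x = rng.uniform(0, 10, n)
X = np.column_stack([np.ones(n), x])
y = 1 + 2 * x + rng.normal(size=n)

b, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ b
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
delta = e / (1 - h)                                # PRESS residuals, no refits needed
press = float(np.sum(delta**2))
r2_pred = 1 - press / float(np.sum((y - y.mean())**2))

# Brute-force check: actually refit without observation i
delta_loo = np.empty(n)
for i in range(n):
    mask = np.arange(n) != i
    bi, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
    delta_loo[i] = y[i] - X[i] @ bi
assert np.allclose(delta, delta_loo)
```

Note that \(1/(1-h_{ii})\ge 1\), so \(\mathrm{PRESS}\ge SSE\) always, which is one way to see why \(R^2_{\mathrm{pred}}\) never exceeds training \(R^2\).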

Another commonly reported model-size criterion is Mallows’ \(C_p\). In practice, \(C_p\) is used to compare candidate models by balancing fit and complexity. A common form (using \(\hat{\sigma}^2\) from a reference model) is

\[C_p = \frac{SSE_p}{\hat{\sigma}^2} - (n-2p),\]

where \(SSE_p\) is the error sum of squares for a model with \(p\) parameters. Under adequate conditions, models with \(C_p\) near \(p\) are often viewed as having a good balance of bias and variance. The reference variance choice should be stated because it affects \(C_p\).
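A sketch of \(C_p\) across nested candidate models follows (illustrative data; the reference variance \(\hat{\sigma}^2\) is taken from the largest candidate model, a choice that should be stated):

```python
import numpy as np

# Synthetic data: only the first two of four predictors matter
rng = np.random.default_rng(4)
n = 60
Z = rng.normal(size=(n, 4))
y = 2 + 1.0 * Z[:, 0] + 0.5 * Z[:, 1] + rng.normal(size=n)

def sse(Xc):
    """Error sum of squares for a least squares fit of y on Xc."""
    b, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    r = y - Xc @ b
    return float(r @ r)

full = np.column_stack([np.ones(n), Z])            # reference (largest) model
sigma2_hat = sse(full) / (n - full.shape[1])       # reference variance: state this choice

cps = {}
for k in range(1, 5):                              # candidate: first k predictors
    Xc = np.column_stack([np.ones(n), Z[:, :k]])
    p = Xc.shape[1]
    cps[k] = sse(Xc) / sigma2_hat - (n - 2 * p)
```

By construction, the reference model itself always has \(C_p=p\) exactly under this definition, so candidates are judged by how far \(C_p\) sits from \(p\) at a smaller model size.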

Figure 17.3 (file: m17_press_model_selection) illustrates how training \(R^2\) can rise monotonically with model size while PRESS-based criteria can worsen, indicating overfitting. When prediction is the goal, PRESS-based criteria and parsimony typically deserve priority over maximizing training \(R^2\).

17.3.6 Reporting workflow: a defensible diagnostic narrative

A defensible regression report links purpose, evidence, and limitations.

  1. Purpose (state the decision target)
     • Explanation (drivers and partial effects), prediction (forecasting), or monitoring (process stability).
     • This choice governs emphasis: coefficient inference for explanation, cross-validation/PRESS for prediction.

  2. Evidence (state fitted results with uncertainty)
     • Model form (predictors, transforms, interactions).
     • Coefficient estimates with units and ceteris paribus language.
     • Uncertainty summaries (standard errors, confidence intervals) with sample size and degrees of freedom.

  3. Adequacy checks (state what was examined and what was found)
     • Residual-versus-fitted: outliers, heteroscedasticity, curvature.
     • Q–Q plot: Normality as a working approximation (for inference).
     • Leverage/influence: any high-leverage or high-Cook’s-distance points and what was done.

  4. Sensitivity and scope (state robustness and operating region)
     • If influential points exist, report whether conclusions change materially under refits or robust alternatives.
     • Report predictor ranges used to fit the model and avoid extrapolation language.

This structure prevents a common failure mode: presenting coefficient p-values without acknowledging whether the model assumptions and stability checks support those p-values.

17.4 Discussion and Common Errors

Workflow errors dominate many regression failures.

  • Treating training \(R^2\) as predictive evidence without any out-of-sample check (PRESS or cross-validation).

  • Deleting a point solely because \(|t_i|\) is large. The defensible sequence is verify provenance, check for special causes, then evaluate sensitivity.

  • Ignoring leverage and focusing only on residual size. High leverage points can dominate coefficients even with moderate residuals.

  • Declaring misspecification from weak patterns in very small samples. With modest \(n\), use multiple diagnostics and state conclusions cautiously.

  • Using automated variable selection as a substitute for process knowledge. Selection criteria compare candidates; they do not guarantee interpretability, stability, or causal meaning.

17.5 Summary

This module treated diagnostics and reporting as required components of regression analysis. Studentized and R-student residuals were used to detect outliers on a comparable scale across observations, while leverage and influence measures (including Cook’s distance) were used to identify observations that can dominate fitted results. Prediction-focused assessment was introduced through leave-one-out logic and PRESS, motivating prediction-oriented criteria such as \(R^2_{\mathrm{pred}}\). The module concluded with a reporting workflow that aligns model purpose, diagnostic evidence, and sensitivity checks to produce defensible operational conclusions.