14. Correlation and Regression Readiness I: Covariance \(\sigma_{XY}\), Correlation \(\rho\), and Diagnostics for Linear Modeling (\(\beta_1\))

14.0 Notation Table

  • \(X,\ Y\): random variables (two quantitative measures)

  • \((x_i,\ y_i)\): observed pair for unit \(i\)

  • \(n\): number of observed pairs

  • \(\mu_X,\ \mu_Y\): population means

  • \(\sigma_X^2,\ \sigma_Y^2\): population variances

  • \(\sigma_{XY}=\mathrm{Cov}(X,Y)\): population covariance

  • \(\rho=\mathrm{Corr}(X,Y)\): population correlation

  • \(\bar{x},\ \bar{y}\): sample means

  • \(S_{xx}=\sum (x_i-\bar{x})^2\): corrected sum of squares for \(x\)

  • \(S_{yy}=\sum (y_i-\bar{y})^2\): corrected sum of squares for \(y\)

  • \(S_{xy}=\sum (x_i-\bar{x})(y_i-\bar{y})\): corrected cross-product sum

  • \(r\): sample correlation coefficient

  • \(\beta_0,\ \beta_1\): population intercept and slope (linear model)

  • \(b_0,\ b_1\): fitted intercept and slope (from data)

  • \(\varepsilon\): random error in a regression model

  • \(t\): test statistic for correlation/slope (df \(n-2\))

14.1 Introduction

In previous modules, inference focused on mean comparisons across categorical groups. In many operational and managerial settings, the primary question concerns the association between two quantitative variables measured on the same unit.

This module develops covariance and correlation as preparation for simple linear regression. The emphasis is methodological discipline: visualize first, diagnose structure and anomalies, then interpret numerical summaries. Correlation is treated as a descriptive measure of linear association, not as a substitute for modeling.

14.3.2 Correlation as a Standardized Covariance

The population correlation is

\[\rho=\frac{\sigma_{XY}}{\sigma_X\sigma_Y}\]

Correlation is dimensionless and bounded:

\[-1 \le \rho \le 1\]

Values near \(\pm 1\) indicate strong linear association. Values near \(0\) indicate weak linear association.

Important qualification:

\[\rho = 0\]

does not imply independence in general. Under a bivariate Normal model, however, zero correlation is equivalent to independence. Outside that model, nonlinear dependence may exist even when \(\rho=0\).

14.3.3 Sample Correlation and Its Connection to Regression

Define

\[S_{xx}=\sum (x_i-\bar{x})^2,\quad S_{yy}=\sum (y_i-\bar{y})^2,\quad S_{xy}=\sum (x_i-\bar{x})(y_i-\bar{y})\]

The sample correlation is

\[r=\frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}}\]

In simple linear regression

\[Y=\beta_0+\beta_1X+\varepsilon\]

the least-squares slope estimate is

\[b_1=\frac{S_{xy}}{S_{xx}}\]

so the sign of \(b_1\) matches the sign of \(r\).
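The quantities above can be computed directly from their definitions. The following is a minimal sketch using a small hypothetical data set (the values and the helper name `corr_and_slope` are illustrative, not from the module).

```python
# Sketch: computing S_xx, S_yy, S_xy, the sample correlation r, and the
# least-squares slope b1 from their definitions. Data are hypothetical.
def corr_and_slope(x, y):
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    r = sxy / (sxx * syy) ** 0.5   # r = S_xy / sqrt(S_xx * S_yy)
    b1 = sxy / sxx                 # b1 = S_xy / S_xx: same sign as r
    return r, b1

x = [1, 2, 3, 4, 5]
y = [2, 2, 4, 5, 7]
r, b1 = corr_and_slope(x, y)
```

Because \(S_{xx}\) and \(\sqrt{S_{xx}S_{yy}}\) are both positive, \(r\) and \(b_1\) share the sign of \(S_{xy}\), which is the algebraic reason their signs always agree.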

The coefficient of determination satisfies

\[r^2=\frac{SSR}{S_{yy}}\]

where \(SSR\) is the regression sum of squares. Thus \(r^2\) represents the proportion of sample variation in \(Y\) explained by a linear function of \(X\).
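The identity \(r^2 = SSR/S_{yy}\) can be verified numerically. A minimal sketch on hypothetical data:

```python
# Sketch: checking r^2 = SSR / S_yy on a small hypothetical data set.
x = [1, 2, 3, 4, 5]
y = [2, 2, 4, 5, 7]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * xi for xi in x]            # fitted values
ssr = sum((yh - ybar) ** 2 for yh in yhat)   # regression sum of squares
r2 = (sxy / (sxx * syy) ** 0.5) ** 2
# r2 and ssr / syy agree to floating-point precision
```

Algebraically, \(SSR = b_1^2 S_{xx} = S_{xy}^2/S_{xx}\), so \(SSR/S_{yy} = S_{xy}^2/(S_{xx}S_{yy}) = r^2\).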

Under a bivariate Normal model, testing

\[H_0:\rho=0\]

is equivalent to testing

\[H_0:\beta_1=0.\]

The test statistic is

\[t_0=\frac{r\sqrt{n-2}}{\sqrt{1-r^2}}\]

which follows a \(t\) distribution with \(n-2\) degrees of freedom under \(H_0\).
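The correlation form of the test statistic and the usual slope statistic \(b_1/\mathrm{se}(b_1)\) are the same number. A minimal sketch on hypothetical data:

```python
# Sketch: the correlation t statistic equals the slope t statistic
# on the same data. Values are hypothetical and for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 2, 4, 5, 7]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
r = sxy / (sxx * syy) ** 0.5
b1 = sxy / sxx
sse = syy - b1 * sxy                          # residual sum of squares
se_b1 = (sse / (n - 2) / sxx) ** 0.5          # standard error of b1
t_r = r * (n - 2) ** 0.5 / (1 - r ** 2) ** 0.5   # correlation form
t_slope = b1 / se_b1                              # slope form
# t_r and t_slope agree to floating-point precision
```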

14.3.4 Scatterplot-First Workflow

Before computing \(r\), apply the following diagnostic sequence.

  • Form: Does the relationship appear approximately linear?

  • Unusual points: Are there outliers or high-leverage points?

  • Spread: Does variability of \(Y\) change with \(X\)?

  • Structure: Is there clustering, stratification, or time ordering?

A reported value of \(r\) should summarize the visible pattern. If the plot contradicts a linear interpretation, the plot governs.
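One of the checks above (spread) can be supplemented, never replaced, by a simple numeric screen. The following sketch compares the spread of \(y\) over the lower and upper halves of \(x\); a ratio far from 1 hints that variability of \(Y\) changes with \(X\) and that the plot deserves a closer look. The helper name and data are hypothetical.

```python
# Sketch: a crude heteroscedasticity screen. Splits the data at the median
# of x and compares the sample standard deviation of y in each half.
# This supplements the scatterplot; it does not replace it.
def spread_ratio(x, y):
    pairs = sorted(zip(x, y))          # order by x
    half = len(pairs) // 2
    lo = [yi for _, yi in pairs[:half]]
    hi = [yi for _, yi in pairs[half:]]

    def sd(v):
        m = sum(v) / len(v)
        return (sum((vi - m) ** 2 for vi in v) / (len(v) - 1)) ** 0.5

    return sd(hi) / sd(lo)

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 1.2, 0.8, 1.0, 2.0, 5.0, 1.0, 6.0]   # spread grows with x
ratio = spread_ratio(x, y)
```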

14.3.7 Zero Correlation Does Not Imply “No Relationship”

A value \(r \approx 0\) should be interpreted as:

“No linear association is evident in this sample.”

It does not imply:

  • no association,

  • independence of \(X\) and \(Y\),

  • absence of predictability.

Nonlinear relationships can produce \(r=0\) because positive and negative cross-products cancel in \(S_{xy}\).
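The cancellation is easy to exhibit. In the sketch below, \(y\) is a deterministic function of \(x\) (perfect predictability), yet \(S_{xy}=0\) exactly because the symmetric quadratic makes the positive and negative cross-products cancel.

```python
# Sketch: a perfect nonlinear relationship with sample correlation exactly 0.
x = [-2, -1, 0, 1, 2]
y = [xi ** 2 for xi in x]        # y is completely determined by x
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
# sxy == 0: positive and negative cross-products cancel, so r = 0
# even though y is a deterministic function of x
```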

14.3.8 Three Distinct Independence Statements

These are distinct concepts.

  1. Independent observations: \((X_i,Y_i)\) independent across \(i\).

  2. \(X\) independent of \(Y\): A joint-distribution statement.

  3. Regression exogeneity:

    \[E(\varepsilon \mid X)=0\]

Independence of observations (statement 1) neither implies nor is implied by independence of \(X\) and \(Y\) (statement 2).

The regression condition \(E(\varepsilon \mid X)=0\) can hold even when \(X\) and \(Y\) are strongly associated.
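The sample analogue of this point can be sketched as follows: least-squares residuals sum to zero and are orthogonal to \(x\) by construction, even when \(x\) and \(y\) are strongly associated. The data below are hypothetical.

```python
# Sketch: fitted least-squares residuals satisfy sum(e_i) = 0 and
# sum(e_i * x_i) = 0 by construction, the sample analogue of E(eps|X)=0,
# even though x and y here are strongly associated.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]   # roughly y = 2x: strong association
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = ybar - b1 * xbar
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
# sum(resid) and sum(e_i * x_i) are both 0 up to rounding
```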

Example 14.3 (Outlier Sensitivity)

Let

\[x=(1,2,3,4,5,6,7,8)\]

and

\[y=(1,2,3,4,5,6,7,-10).\]

The final point has extreme leverage and a large negative residual. Without it, the remaining seven points lie exactly on the line \(y=x\), so \(r=1\). Including it drags the sample correlation to approximately \(r \approx -0.23\), demonstrating how a single high-leverage outlier can reverse the direction of association.
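The computation can be checked directly. A minimal sketch (the helper name `r_of` is illustrative):

```python
# Sketch: recomputing r for Example 14.3 with and without the final point.
def r_of(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    return sxy / (sxx * syy) ** 0.5

x = list(range(1, 9))
y = [1, 2, 3, 4, 5, 6, 7, -10]
r_all = r_of(x, y)               # about -0.23 with the outlier included
r_trim = r_of(x[:-1], y[:-1])    # exactly 1: first seven points lie on y = x
```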

14.5 Summary

Covariance describes co-variation but depends on units. Correlation standardizes covariance to a dimensionless measure bounded between \(-1\) and \(1\).

The sample correlation \(r\) summarizes linear association, and \(r^2\) connects directly to variation explained by a linear regression model.

A scatterplot-first workflow is a required step before numerical interpretation. Correlation is descriptive; regression is a model. Diagnostic checks determine whether a linear model is appropriate for inference.

Finally, independence must be specified precisely: independence of observations, independence of variables, and the regression condition \(E(\varepsilon\mid X)=0\) are distinct concepts with different implications.