End-to-end guide — from framing to communication. Works for both prediction and inference tasks.
Be precise. Not "customer value" but "total spend in the next 90 days in ₹." Vague Y = untestable model. Y type determines your method. Linear regression works well for continuous Y (revenue, price). For other Y types (binary outcomes, counts, time-to-event), specialist models such as logistic, Poisson, or survival regression are needed.
This single decision changes what you optimise for, how you build the model, and what you report. Most business problems are one or the other — be explicit.
Before looking at data: what should logically drive Y? Business knowledge is your first filter against spurious correlations. With p = 100 predictors, approximately 5 will appear significant at the 5% level purely by chance — even with no real relationship.
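A quick simulation makes this concrete. The setup below (100 pure-noise predictors, n = 500, one simple regression per predictor) is an illustrative assumption, not data from this guide:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 500, 100
X = rng.normal(size=(n, p))
y = rng.normal(size=n)               # y is unrelated to every column of X

false_positives = sum(
    stats.linregress(X[:, j], y).pvalue < 0.05 for j in range(p)
)
print(f"'Significant' noise predictors at the 5% level: {false_positives} of {p}")
```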
Any statistic that guides a modelling decision (mean for imputation, log-transformation parameters, dummy-encoding rules, correlation thresholds), if learned from the full dataset, leaks test-set information into training. The result is artificially optimistic performance metrics; this is data leakage.
Look for extreme skewness in Y and each X. Heavily skewed Y (income, revenue, house prices) often benefits from a log transformation before modelling — it linearises exponential relationships and tends to reduce heteroscedasticity. Flag impossible or suspicious values. Get a sense of ranges and outliers.
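A minimal sketch of the skewness check, assuming a pandas DataFrame `df` with a strictly positive, revenue-like target column `y` (both names are placeholders):

```python
import numpy as np

skew = df.select_dtypes("number").skew().sort_values(ascending=False)
print(skew.head(10))              # heavily skewed columns float to the top

# If Y is strictly positive and right-skewed, a log transform is a common first try.
df["log_y"] = np.log(df["y"])
```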
A linear model fitted to a curved relationship will be systematically biased. Visible curves in scatter plots signal the need for transformations or polynomial terms before fitting. Also note any points that are extreme on both X and Y simultaneously.
High correlation between predictors (multicollinearity) inflates SE(β̂ⱼ), widens confidence intervals, and makes individual coefficient estimates unstable. It does not hurt prediction accuracy, but it destroys inference — you can no longer reliably say which variable is driving Y.
The standard linear model assumes each predictor's effect on Y is independent of all others. But sometimes the effect of X₁ on Y depends on the level of X₂. This is an interaction effect. Identify candidate interactions using business logic first — not data fishing.
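In a formula interface an interaction is one extra term. The sketch below assumes hypothetical columns `sales`, `tv_spend`, and `radio_spend` in a DataFrame `df`:

```python
import statsmodels.formula.api as smf

# tv_spend * radio_spend expands to both main effects plus their interaction.
model = smf.ols("sales ~ tv_spend * radio_spend", data=df).fit()
print(model.summary())
```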
Statistical power is the probability of detecting a true effect when it exists. Low power means you may miss real relationships due to a small sample size. This step applies mainly to experimental or planned-data designs where you control n.
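One rough way to gauge power when you control n is simulation: pick a plausible effect size and noise level (both are assumptions here) and see how often a study of that size detects it.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, true_beta, n_sims = 100, 0.2, 2000   # assumed sample size, slope, and number of simulated studies
hits = 0
for _ in range(n_sims):
    x = rng.normal(size=n)
    y = true_beta * x + rng.normal(size=n)          # noise SD assumed to be 1
    hits += stats.linregress(x, y).pvalue < 0.05    # did this study detect the effect?

print(f"Estimated power: {hits / n_sims:.2f}")
```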
Lock the test set away immediately after splitting. It must not influence any modelling decision — not even preprocessing decisions. Test set exists to give you an honest, final performance number used exactly once. If any preprocessing parameter (mean for imputation, transformation fit, encoding rules) is computed from the full dataset, test data leaks into training.
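A minimal split-and-lock sketch with scikit-learn; the 80/20 ratio and the random seed are conventional choices, not requirements from this guide:

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# From here on, every modelling decision uses X_train / y_train only.
```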
Too few training observations relative to number of candidate predictors → unstable coefficients, inflated R², unreliable standard errors. Check AFTER split, using training set size only.
Variable selection, transformation choices, and model complexity decisions must all be made using CV MSE on training data only. CV gives a far more reliable estimate of generalisation error than training MSE, which always decreases as you add variables.
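A sketch of the CV MSE estimate on the training split; the 5-fold setup is an assumption:

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

cv_mse = -cross_val_score(
    LinearRegression(), X_train, y_train,
    scoring="neg_mean_squared_error", cv=5
).mean()
print(f"CV MSE: {cv_mse:.2f}")   # compare this number across candidate specifications
```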
Any statistic you compute (mean for imputation, transformation parameters, encoding rules, scaling factors, outlier thresholds, feature interactions) must come from the training set. Write down what parameters you learned on train. Apply the exact same parameters to test data without refitting.
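One way to enforce this discipline is to wrap every learned step in a scikit-learn Pipeline, so preprocessing parameters are fitted on the training data only; the particular steps shown are an illustrative choice:

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),   # mean learned from train only
    ("scale", StandardScaler()),                  # mean/SD learned from train only
    ("ols", LinearRegression()),
])
pipe.fit(X_train, y_train)                 # all preprocessing parameters learned here
test_predictions = pipe.predict(X_test)    # same parameters re-applied, never refit
```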
The mechanism behind missing data matters. Never silently drop rows without understanding why. Calculate imputation parameters (mean, median, mode) from training set only. Apply those values to both train and test.
A log-transformed Y is appropriate when effects are multiplicative rather than additive. Common in revenue, salary, price data. If transforming: compute transformation parameters (e.g., λ in Box-Cox) from the training set distribution only, not the full dataset.
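A sketch of the train-only Box-Cox fit with SciPy (Box-Cox requires a strictly positive Y; variable names follow the earlier split):

```python
from scipy import stats

y_train_bc, lmbda = stats.boxcox(y_train)       # λ estimated from the training target only
y_test_bc = stats.boxcox(y_test, lmbda=lmbda)   # reuse the training λ, never refit on test
```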
OLS cannot process text. For a categorical variable with K levels, create K−1 dummies. Identify the levels in the training set only. When you encounter a new level in test data that wasn't in train, handle it by grouping into "Other" or the baseline.
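A version-agnostic sketch with pandas dummies, assuming a hypothetical categorical column `region`; aligning the test columns to the training columns sends unseen levels to the baseline (all zeros):

```python
import pandas as pd

train_dummies = pd.get_dummies(X_train["region"], prefix="region", drop_first=True)  # K-1 dummies
test_dummies = pd.get_dummies(X_test["region"], prefix="region", drop_first=True)

# Align to the training columns: levels unseen in train drop out (become all zeros),
# levels missing from test are filled with 0.
test_dummies = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
```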
Standardization (subtract mean, divide by SD) or min-max scaling can improve numerical stability and interpretation. Compute mean and SD from training set only. Apply those exact parameters to test set. Centering is particularly important if you plan to include interaction or polynomial terms.
Once you have clean X and Y, create derived features: X² for curves, X₁ × X₂ for interactions, log(X) for skewed predictors. Document which transformations you create on training data. Apply the same transformations to test data.
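These derived features involve no learned parameters, so the same recipe is applied verbatim to both splits; the column names below are placeholders:

```python
import numpy as np

for part in (X_train, X_test):
    part["tv_sq"] = part["tv_spend"] ** 2                          # curvature term
    part["tv_x_radio"] = part["tv_spend"] * part["radio_spend"]    # interaction term
    part["log_income"] = np.log1p(part["income"])                  # skewed predictor
```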
Assumptions like homoscedasticity, normality of errors, and correlated errors cannot be verified before fitting — they require residuals which only exist after the model is run. Here you make preliminary assessments; Phase 09 confirms them formally. The full process is iterative: fit → diagnose → fix → refit.
The linearity assumption is that the combined effect of all X variables has a linear relationship with Y — i.e. E[Y | X₁, X₂, ...] = β₀ + β₁X₁ + β₂X₂ + ... It does not require that each individual X has a straight-line relationship with Y when looked at in isolation. Clear curves in individual scatter plots signal the need to transform variables (e.g., add polynomial terms like X² or use log(X)) so that the combined linear combination fits well. The definitive check is the residual vs. fitted plot after fitting — it will show a systematic U-shape if the combined relationship is violated.
For multiple regression with 2+ predictors, individual X vs Y scatter plots can be misleading due to confounding. After fitting, use partial regression plots (also called added-variable plots) to visualize the relationship between each X and Y while holding others constant. A partial regression plot shows [residual(Y after removing effect of other Xs)] vs [residual(Xⱼ after removing effect of other Xs)].
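statsmodels can draw these directly from a fitted OLS result; `fitted_model` below is a placeholder for whatever name your fitted result carries:

```python
import matplotlib.pyplot as plt
import statsmodels.api as sm

fig = plt.figure(figsize=(10, 8))
sm.graphics.plot_partregress_grid(fitted_model, fig=fig)   # one added-variable panel per predictor
plt.show()
```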
If observations are related (same customer across time, same store across weeks, repeated measurements), error terms εᵢ are correlated. This causes OLS to underestimate standard errors, making confidence intervals too narrow and p-values too small — you appear more certain than you are.
Perfect multicollinearity (e.g. including "total spend" and two components that sum to it exactly) makes (XᵀX) singular — OLS has no unique solution and fails entirely. High but not perfect collinearity is handled in Phase 09 via VIF.
The null model always predicts ȳ regardless of X. It is your absolute floor. R² literally measures improvement over this baseline — R² = 0 means your model does no better than guessing the mean. Any model that cannot beat this has zero practical value.
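A sketch of the mean-only baseline with scikit-learn's DummyRegressor, reported as RMSE so it is in the units of Y:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
baseline_rmse = np.sqrt(mean_squared_error(y_train, baseline.predict(X_train)))
print(f"Null-model RMSE (always predicting the mean): {baseline_rmse:.1f}")
```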
Before fitting a full model, understand the strongest single variable. Does the sign of β̂₁ match business expectation? Unexpected signs often signal missing confounders or colliders — problems that multi-variable models may hide. Record this as a reference baseline.
β̂₀ and β̂₁ are estimates that vary from sample to sample. SE(β̂₁) measures how much β̂₁ would vary if you repeatedly drew new samples — it is the standard deviation of the estimator across hypothetical repeated samples. Larger SE = more uncertainty about the true β₁.
A 95% CI means: if you repeated this study many times, 95% of the constructed intervals would contain the true β. It conveys the precision of your estimate. A CI that contains zero means you cannot rule out that the variable has no effect.
For each predictor, test whether there is evidence of a relationship with Y after accounting for all other predictors in the model. This is a test of the partial effect of each variable.
In multiple regression, always check the F-statistic first. It tests whether at least one predictor is useful overall. Individual t-tests are only meaningful once F confirms the model is not wholly uninformative.
RSE estimates σ, the standard deviation of the irreducible error. Roughly, it is the average amount the response deviates from the true regression line, in the same units as Y. Unlike R², it is directly interpretable in business terms.
β̂₀ is the estimated value of Y when all predictors equal zero. This is only meaningful if X=0 is a realistic value in your context. If no observation in your data has X near zero, the intercept is a mathematical extrapolation — report it but do not interpret it substantively.
When deciding whether to add a group of variables (e.g. all interaction terms, or a set of dummy variables for one categorical), use a partial F-test rather than individual t-tests. It tests whether the group of q variables jointly improve the model.
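A sketch of the partial F-test as a comparison of nested statsmodels fits; the formulas and the DataFrame `train_df` are hypothetical:

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

reduced = smf.ols("sales ~ tv_spend + radio_spend", data=train_df).fit()
full = smf.ols("sales ~ tv_spend + radio_spend + tv_spend:radio_spend", data=train_df).fit()

# anova_lm reports the F-statistic and p-value for the group of added terms.
print(sm.stats.anova_lm(reduced, full))
```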
β̂ⱼ is your estimate of how much Y changes when Xⱼ increases by one unit, holding all other variables constant. The magnitude and 95% CI are the main deliverables. The p-value merely indicates whether you can rule out β=0; it does not measure importance.
R² measures the proportion of variance in Y explained by the model. It is scale-free and always 0–1. However, R² always increases when variables are added — even noise variables. Never use plain R² to compare models with different numbers of predictors.
Backward elimination ("remove the least significant variable") and forward selection ("add the most significant variable") driven by p-values are NOT recommended: the repeated testing multiplies false-discovery risk and biases all subsequent inference, so the final model's p-values and CIs no longer mean what they claim.
For prediction tasks, stepwise can work (use CV MSE as stopping rule, not p-values). For inference, use regularization (Ridge/Lasso) instead — it shrinks coefficients without the selection bias.
Regularization shrinks coefficients toward zero, automatically performing variable selection without the biases of stepwise methods. Choose λ (shrinkage strength) via cross-validation.
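A sketch of lasso with the shrinkage strength chosen by cross-validation (scikit-learn calls λ `alpha`); predictors are standardised first so the penalty treats them on the same scale:

```python
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

lasso = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))
lasso.fit(X_train, y_train)

fitted_lasso = lasso.named_steps["lassocv"]
print("Chosen alpha:", fitted_lasso.alpha_)
print("Coefficients:", fitted_lasso.coef_)   # some are shrunk exactly to zero
```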
Start with intercept only. At each step, add the variable that most reduces CV MSE. Stop when adding any remaining variable worsens CV MSE. Use CV MSE as stopping rule, NOT p-values.
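A minimal forward-selection sketch with CV MSE as the stopping rule; the intercept-only score is approximated by the training variance of y, which is an assumption of this sketch rather than an exact CV estimate:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def cv_mse(cols):
    """5-fold CV MSE for a linear model using the given predictor columns."""
    if not cols:
        return float(np.var(y_train))   # rough stand-in for the intercept-only model
    scores = cross_val_score(LinearRegression(), X_train[cols], y_train,
                             scoring="neg_mean_squared_error", cv=5)
    return -scores.mean()

selected, remaining = [], list(X_train.columns)
best = cv_mse(selected)
while remaining:
    trial = {col: cv_mse(selected + [col]) for col in remaining}
    col, score = min(trial.items(), key=lambda kv: kv[1])
    if score >= best:                   # no candidate improves CV MSE: stop
        break
    selected.append(col)
    remaining.remove(col)
    best = score

print("Selected predictors:", selected)
```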
Fit the full model. Remove the variable that most reduces CV MSE (not the one with highest p-value). Refit. Repeat until removal worsens CV MSE. Use CV MSE, not p-values, as the stopping criterion.
Do not use "remove if p > 0.05" or "add if p < 0.05" as your stopping rule. This multiplies false discovery risk and biases all subsequent inference. Instead: use CV MSE (for prediction) or domain knowledge + regularization (for inference).
Residuals eᵢ = yᵢ − ŷᵢ plotted against fitted values ŷᵢ. In multiple regression always use fitted values (not individual Xs) on the x-axis. Want to see: random scatter around zero with no discernible pattern.
| Pattern seen | What it means | Fix |
|---|---|---|
| U-shape or curve | Non-linearity — the linear model is systematically wrong | Add X² or transform X |
| Funnel / cone shape | Heteroscedasticity — residual variance grows with fitted values | Log-transform Y, use weighted least squares, or robust SEs |
| Trending pattern | Missing variable — unexplained signal remains in residuals | Add a relevant predictor |
| Random scatter | ✓ Linearity and homoscedasticity are satisfied | — |
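A sketch of the residual vs fitted plot from a fitted statsmodels OLS result (`fitted_model` is a placeholder name):

```python
import matplotlib.pyplot as plt

plt.scatter(fitted_model.fittedvalues, fitted_model.resid, alpha=0.5)
plt.axhline(0, color="grey", linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs fitted: look for curves, funnels, or trends")
plt.show()
```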
If your residual plot shows heteroscedasticity (funnel shape) but you don't want to transform Y, robust standard errors (also called sandwich estimators or Huber-White SEs) provide valid inference under non-constant variance. The point estimates β̂ stay the same, but SEs, CIs, and p-values adjust to account for the heteroscedasticity.
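In statsmodels, robust SEs are one argument at fit time; the formula and DataFrame below are placeholders:

```python
import statsmodels.formula.api as smf

robust_fit = smf.ols("sales ~ tv_spend + radio_spend", data=train_df).fit(cov_type="HC3")
print(robust_fit.summary())   # identical coefficients, heteroscedasticity-adjusted SEs/CIs/p-values
```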
Plot residuals in time order (or group order). If adjacent residuals have similar values ("tracking"), errors are correlated. This makes standard errors too small, p-values too small, and CIs too narrow — a dangerous false precision.
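The Durbin-Watson statistic is a quick numeric companion to this plot (values near 2 suggest no first-order autocorrelation; values well below 2 suggest positive correlation):

```python
from statsmodels.stats.stattools import durbin_watson

print("Durbin-Watson:", durbin_watson(fitted_model.resid))
```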
Plot quantiles of residuals against theoretical normal quantiles. Points should fall along a straight diagonal. Normality is needed for p-values and CIs to be exactly valid under small samples.
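A one-line Q-Q plot from statsmodels, again using the placeholder `fitted_model`:

```python
import matplotlib.pyplot as plt
import statsmodels.api as sm

sm.qqplot(fitted_model.resid, line="45", fit=True)   # points should hug the diagonal
plt.show()
```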
An outlier is an observation whose Y value is far from what the model predicts. An influential point is one that strongly affects fitted coefficients. These are not the same — a point can have high leverage (unusual X values) yet still be fit well by the model. Investigate all high-influence points before deletion.
A high-leverage observation has an unusual X value (not Y). It can substantially shift the fitted line, potentially invalidating the entire model. Leverage is more dangerous than outliers because the model bends toward it — yet residuals may look small, hiding the problem.
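statsmodels exposes leverage and Cook's distance from the fitted result; the 4/n cutoff below is a common rule of thumb, not a hard threshold:

```python
influence = fitted_model.get_influence()
leverage = influence.hat_matrix_diag        # high values = unusual X (high leverage)
cooks_d = influence.cooks_distance[0]       # high values = influential on the fit

flag = cooks_d > 4 / len(cooks_d)           # rule-of-thumb cutoff (assumption)
print("High-influence points to investigate:", int(flag.sum()))
```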
VIF(β̂ⱼ) measures how much the variance of β̂ⱼ is inflated due to correlation with other predictors. Pairwise correlation (Phase 02) misses multivariate collinearity — VIF is the definitive test. High VIF doesn't hurt prediction but makes inference on individual coefficients unreliable.
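A sketch of the VIF computation with statsmodels; adding the constant column first keeps the values on their usual scale:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_const = sm.add_constant(X_train)
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(X_const.shape[1])],
    index=X_const.columns,
)
print(vif.drop("const"))   # rule of thumb: VIF above roughly 5-10 signals a problem
```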
Diagnostics → fix → refit → re-diagnose is an iterative loop, not a one-time checkbox. Keep cycling until residual plots look clean.
Confidence intervals (uncertainty about the average response at a given x*) and prediction intervals (uncertainty about a single new observation at that x*) answer fundamentally different questions. Mixing them up produces either false precision or unnecessary alarm. A PI is always wider than a CI for the same x*. Use the right one for your decision context.
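statsmodels returns both intervals from one call; `X_new` is a hypothetical DataFrame of new predictor values in the same format the model was fitted on:

```python
pred = fitted_model.get_prediction(X_new)
frame = pred.summary_frame(alpha=0.05)

print(frame[["mean", "mean_ci_lower", "mean_ci_upper",   # CI: uncertainty about the average response
             "obs_ci_lower", "obs_ci_upper"]])           # PI: wider, covers a single new observation
```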
The regression line is calibrated only within the range of X seen during training. Extrapolation beyond this range assumes the linear relationship holds in territory with no empirical support — it may not, and predictions can be wildly wrong.
After all modelling decisions are finalised via CV, apply the model to the test set exactly once. Do not retune after seeing test results — that makes the test number optimistically biased and dishonest.
The gap between training RMSE and test RMSE reveals whether your problem is under-fitting (model too simple) or over-fitting (model too complex). This diagnosis guides your next move.
"RMSE = ₹4,200" is concrete and immediately interpretable. "R² = 0.82" is abstract. Always contextualise: "Our model predicts within ₹4,200 on average; the baseline of just predicting the mean was off by ₹9,800." This frames the value of the model directly.
"β̂₁ = 0.0475" is meaningless. "Every additional ₹1,000 in TV advertising is associated with approximately 47.5 additional units sold per week, holding radio and newspaper spend constant — with a plausible range of 42 to 53 units (95% CI)." That is the deliverable for an inference task.
Linear regression identifies statistical associations. It cannot prove causation without a randomised experiment or a formal causal inference framework. Saying "X causes Y" when you mean "X is correlated with Y" is a serious analytical error that erodes credibility.
The CI communicates both statistical significance and practical magnitude in one number. A coefficient with CI [0.001, 0.003] may be significant yet trivially small. A coefficient with CI [42, 53] tells a stakeholder exactly what range of effect they should plan for.
Define the range of X values for which predictions are trustworthy. What variables were unavailable? What non-linearities might you have simplified away? Under what conditions would the model go stale (new market entrant, policy change, seasonality shifts)? Proactive honesty about limitations builds more trust than overselling.