End-to-end guide — from framing to communication. Works for both prediction and inference tasks.
Be precise. Not "customer value" but "total spend in the next 90 days in ₹." Vague Y = untestable model. Y type determines your method. Linear regression works well for continuous Y (revenue, price). For other Y types (binary outcomes, counts, time-to-event), specialist models such as logistic, Poisson, or survival regression are needed.
This single decision changes what you optimise for, how you build the model, and what you report. Most business problems are one or the other — be explicit.
Before looking at data: what should logically drive Y? Business knowledge is your first filter against spurious correlations. With p = 100 predictors, approximately 5 will appear significant at the 5% level purely by chance — even with no real relationship.
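A quick simulation makes this concrete. The setup below (100 pure-noise predictors, n = 500, one simple regression per predictor) is an illustrative assumption, not data from this guide:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 500, 100
X = rng.normal(size=(n, p))
y = rng.normal(size=n)               # y is unrelated to every column of X

false_positives = sum(
    stats.linregress(X[:, j], y).pvalue < 0.05 for j in range(p)
)
print(f"'Significant' noise predictors at the 5% level: {false_positives} of {p}")
```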
Any statistic that guides a modelling decision (mean for imputation, log-transformation parameters, dummy-encoding rules, correlation thresholds), if learned from the full dataset, leaks test-set information into training. The result is artificially optimistic performance metrics; this is data leakage.
Look for extreme skewness in Y and each X. Heavily skewed Y (income, revenue, house prices) often benefits from a log transformation before modelling — it linearises exponential relationships and tends to reduce heteroscedasticity. Flag impossible or suspicious values. Get a sense of ranges and outliers.
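A minimal sketch of the skewness check, assuming a pandas DataFrame `df` with a strictly positive, revenue-like target column `y` (both names are placeholders):

```python
import numpy as np

skew = df.select_dtypes("number").skew().sort_values(ascending=False)
print(skew.head(10))              # heavily skewed columns float to the top

# If Y is strictly positive and right-skewed, a log transform is a common first try.
df["log_y"] = np.log(df["y"])
```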
A linear model fitted to a curved relationship will be systematically biased. Visible curves in scatter plots signal the need for transformations or polynomial terms before fitting. Also note any points that are extreme on both X and Y simultaneously.
High correlation between predictors (multicollinearity) inflates SE(β̂ⱼ), widens confidence intervals, and makes individual coefficient estimates unstable. It does not hurt prediction accuracy, but it destroys inference — you can no longer reliably say which variable is driving Y.
The standard linear model assumes each predictor's effect on Y is independent of all others. But sometimes the effect of X₁ on Y depends on the level of X₂. This is an interaction effect. Identify candidate interactions using business logic first — not data fishing.
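In a formula interface an interaction is one extra term. The sketch below assumes hypothetical columns `sales`, `tv_spend`, and `radio_spend` in a DataFrame `df`:

```python
import statsmodels.formula.api as smf

# tv_spend * radio_spend expands to both main effects plus their interaction.
model = smf.ols("sales ~ tv_spend * radio_spend", data=df).fit()
print(model.summary())
```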
Statistical power is the probability of detecting a true effect when it exists. Low power means you may miss real relationships due to a small sample size. This step applies mainly to experimental or planned-data designs where you control n.
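One rough way to gauge power when you control n is simulation: pick a plausible effect size and noise level (both are assumptions here) and see how often a study of that size detects it.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, true_beta, n_sims = 100, 0.2, 2000   # assumed sample size, slope, and number of simulated studies
hits = 0
for _ in range(n_sims):
    x = rng.normal(size=n)
    y = true_beta * x + rng.normal(size=n)          # noise SD assumed to be 1
    hits += stats.linregress(x, y).pvalue < 0.05    # did this study detect the effect?

print(f"Estimated power: {hits / n_sims:.2f}")
```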
Lock the test set away immediately after splitting. It must not influence any modelling decision — not even preprocessing decisions. Test set exists to give you an honest, final performance number used exactly once. If any preprocessing parameter (mean for imputation, transformation fit, encoding rules) is computed from the full dataset, test data leaks into training.
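A minimal split-and-lock sketch with scikit-learn; the 80/20 ratio and the random seed are conventional choices, not requirements from this guide:

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# From here on, every modelling decision uses X_train / y_train only.
```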
Too few training observations relative to number of candidate predictors → unstable coefficients, inflated R², unreliable standard errors. Check AFTER split, using training set size only.
Variable selection, transformation choices, and model complexity decisions must all be made using CV MSE on training data only. CV gives a far more reliable estimate of generalisation error than training MSE, which always decreases as you add variables.
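A sketch of the CV MSE estimate on the training split; the 5-fold setup is an assumption:

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

cv_mse = -cross_val_score(
    LinearRegression(), X_train, y_train,
    scoring="neg_mean_squared_error", cv=5
).mean()
print(f"CV MSE: {cv_mse:.2f}")   # compare this number across candidate specifications
```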
Any statistic you compute (mean for imputation, transformation parameters, encoding rules, scaling factors, outlier thresholds, feature interactions) must come from the training set. Write down what parameters you learned on train. Apply the exact same parameters to test data without refitting.
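One way to enforce this discipline is to wrap every learned step in a scikit-learn Pipeline, so preprocessing parameters are fitted on the training data only; the particular steps shown are an illustrative choice:

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),   # mean learned from train only
    ("scale", StandardScaler()),                  # mean/SD learned from train only
    ("ols", LinearRegression()),
])
pipe.fit(X_train, y_train)                 # all preprocessing parameters learned here
test_predictions = pipe.predict(X_test)    # same parameters re-applied, never refit
```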
The mechanism behind missing data matters. Never silently drop rows without understanding why. Calculate imputation parameters (mean, median, mode) from training set only. Apply those values to both train and test.
A log-transformed Y is appropriate when effects are multiplicative rather than additive. Common in revenue, salary, price data. If transforming: compute transformation parameters (e.g., λ in Box-Cox) from the training set distribution only, not the full dataset.
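A sketch of the train-only Box-Cox fit with SciPy (Box-Cox requires a strictly positive Y; variable names follow the earlier split):

```python
from scipy import stats

y_train_bc, lmbda = stats.boxcox(y_train)       # λ estimated from the training target only
y_test_bc = stats.boxcox(y_test, lmbda=lmbda)   # reuse the training λ, never refit on test
```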
OLS cannot process text. For a categorical variable with K levels, create K−1 dummies. Identify the levels in the training set only. When you encounter a new level in test data that wasn't in train, handle it by grouping into "Other" or the baseline.
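A version-agnostic sketch with pandas dummies, assuming a hypothetical categorical column `region`; aligning the test columns to the training columns sends unseen levels to the baseline (all zeros):

```python
import pandas as pd

train_dummies = pd.get_dummies(X_train["region"], prefix="region", drop_first=True)  # K-1 dummies
test_dummies = pd.get_dummies(X_test["region"], prefix="region", drop_first=True)

# Align to the training columns: levels unseen in train drop out (become all zeros),
# levels missing from test are filled with 0.
test_dummies = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
```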
Standardization (subtract mean, divide by SD) or min-max scaling can improve numerical stability and interpretation. Compute mean and SD from training set only. Apply those exact parameters to test set. Centering is particularly important if you plan to include interaction or polynomial terms.
Once you have clean X and Y, create derived features: X² for curves, X₁ × X₂ for interactions, log(X) for skewed predictors. Document which transformations you create on training data. Apply the same transformations to test data.
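These derived features involve no learned parameters, so the same recipe is applied verbatim to both splits; the column names below are placeholders:

```python
import numpy as np

for part in (X_train, X_test):
    part["tv_sq"] = part["tv_spend"] ** 2                          # curvature term
    part["tv_x_radio"] = part["tv_spend"] * part["radio_spend"]    # interaction term
    part["log_income"] = np.log1p(part["income"])                  # skewed predictor
```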
Assumptions like homoscedasticity, normality of errors, and correlated errors cannot be verified before fitting — they require residuals which only exist after the model is run. Here you make preliminary assessments; Phase 09 confirms them formally. The full process is iterative: fit → diagnose → fix → refit.
The linearity assumption is that the combined effect of all X variables has a linear relationship with Y — i.e. E[Y | X₁, X₂, ...] = β₀ + β₁X₁ + β₂X₂ + ... It does not require that each individual X has a straight-line relationship with Y when looked at in isolation. Clear curves in individual scatter plots signal the need to transform variables (e.g., add polynomial terms like X² or use log(X)) so that the combined linear combination fits well. The definitive check is the residual vs. fitted plot after fitting — it will show a systematic U-shape if the combined relationship is violated.
For multiple regression with 2+ predictors, individual X vs Y scatter plots can be misleading due to confounding. After fitting, use partial regression plots (also called added-variable plots) to visualize the relationship between each X and Y while holding others constant. A partial regression plot shows [residual(Y after removing effect of other Xs)] vs [residual(Xⱼ after removing effect of other Xs)].
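statsmodels can draw these directly from a fitted OLS result; `fitted_model` below is a placeholder for whatever name your fitted result carries:

```python
import matplotlib.pyplot as plt
import statsmodels.api as sm

fig = plt.figure(figsize=(10, 8))
sm.graphics.plot_partregress_grid(fitted_model, fig=fig)   # one added-variable panel per predictor
plt.show()
```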
If observations are related (same customer across time, same store across weeks, repeated measurements), error terms εᵢ are correlated. This causes OLS to underestimate standard errors, making confidence intervals too narrow and p-values too small — you appear more certain than you are.
Perfect multicollinearity (e.g. including "total spend" and two components that sum to it exactly) makes (XᵀX) singular — OLS has no unique solution and fails entirely. High but not perfect collinearity is handled in Phase 09 via VIF.
The null model always predicts ȳ regardless of X. It is your absolute floor. R² literally measures improvement over this baseline — R² = 0 means your model does no better than guessing the mean. Any model that cannot beat this has zero practical value.
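A sketch of the mean-only baseline with scikit-learn's DummyRegressor, reported as RMSE so it is in the units of Y:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
baseline_rmse = np.sqrt(mean_squared_error(y_train, baseline.predict(X_train)))
print(f"Null-model RMSE (always predicting the mean): {baseline_rmse:.1f}")
```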
Before fitting a full model, understand the strongest single variable. Does the sign of β̂₁ match business expectation? Unexpected signs often signal missing confounders or colliders — problems that multi-variable models may hide. Record this as a reference baseline.
β̂₀ and β̂₁ are estimates that vary from sample to sample. SE(β̂₁) measures how much β̂₁ would vary if you repeatedly drew new samples — it is the standard deviation of the estimator across hypothetical repeated samples. Larger SE = more uncertainty about the true β₁.
A 95% CI means: if you repeated this study many times, 95% of the constructed intervals would contain the true β. It conveys the precision of your estimate. A CI that contains zero means you cannot rule out that the variable has no effect.
For each predictor, test whether there is evidence of a relationship with Y after accounting for all other predictors in the model. This is a test of the partial effect of each variable.
In multiple regression, always check the F-statistic first. It tests whether at least one predictor is useful overall. Individual t-tests are only meaningful once F confirms the model is not wholly uninformative.
RSE estimates σ, the standard deviation of the irreducible error. Roughly, it is the average amount the response deviates from the true regression line, in the same units as Y. Unlike R², it is directly interpretable in business terms.
β̂₀ is the estimated value of Y when all predictors equal zero. This is only meaningful if X=0 is a realistic value in your context. If no observation in your data has X near zero, the intercept is a mathematical extrapolation — report it but do not interpret it substantively.
When deciding whether to add a group of variables (e.g. all interaction terms, or a set of dummy variables for one categorical), use a partial F-test rather than individual t-tests. It tests whether the group of q variables jointly improve the model.
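A sketch of the partial F-test as a comparison of nested statsmodels fits; the formulas and the DataFrame `train_df` are hypothetical:

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

reduced = smf.ols("sales ~ tv_spend + radio_spend", data=train_df).fit()
full = smf.ols("sales ~ tv_spend + radio_spend + tv_spend:radio_spend", data=train_df).fit()

# anova_lm reports the F-statistic and p-value for the group of added terms.
print(sm.stats.anova_lm(reduced, full))
```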
β̂ⱼ is your estimate of how much Y changes when Xⱼ increases by one unit, holding all other variables constant. The magnitude and 95% CI are the main deliverables. The p-value merely indicates whether you can rule out β=0; it does not measure importance.
R² measures the proportion of variance in Y explained by the model. It is scale-free and always 0–1. However, R² always increases when variables are added — even noise variables. Never use plain R² to compare models with different numbers of predictors.
Backward elimination ("remove the least significant variable") and forward selection ("add the most significant variable") driven by p-values are NOT recommended: the repeated testing multiplies false-discovery risk and biases all subsequent inference, so the final model's p-values and CIs no longer mean what they claim.
For prediction tasks, stepwise can work (use CV MSE as stopping rule, not p-values). For inference, use regularization (Ridge/Lasso) instead — it shrinks coefficients without the selection bias.
Regularization shrinks coefficients toward zero, automatically performing variable selection without the biases of stepwise methods. Choose λ (shrinkage strength) via cross-validation.
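A sketch of lasso with the shrinkage strength chosen by cross-validation (scikit-learn calls λ `alpha`); predictors are standardised first so the penalty treats them on the same scale:

```python
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

lasso = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))
lasso.fit(X_train, y_train)

fitted_lasso = lasso.named_steps["lassocv"]
print("Chosen alpha:", fitted_lasso.alpha_)
print("Coefficients:", fitted_lasso.coef_)   # some are shrunk exactly to zero
```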
Start with intercept only. At each step, add the variable that most reduces CV MSE. Stop when adding any remaining variable worsens CV MSE. Use CV MSE as stopping rule, NOT p-values.
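A minimal forward-selection sketch with CV MSE as the stopping rule; the intercept-only score is approximated by the training variance of y, which is an assumption of this sketch rather than an exact CV estimate:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def cv_mse(cols):
    """5-fold CV MSE for a linear model using the given predictor columns."""
    if not cols:
        return float(np.var(y_train))   # rough stand-in for the intercept-only model
    scores = cross_val_score(LinearRegression(), X_train[cols], y_train,
                             scoring="neg_mean_squared_error", cv=5)
    return -scores.mean()

selected, remaining = [], list(X_train.columns)
best = cv_mse(selected)
while remaining:
    trial = {col: cv_mse(selected + [col]) for col in remaining}
    col, score = min(trial.items(), key=lambda kv: kv[1])
    if score >= best:                   # no candidate improves CV MSE: stop
        break
    selected.append(col)
    remaining.remove(col)
    best = score

print("Selected predictors:", selected)
```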
Fit the full model. Remove the variable that most reduces CV MSE (not the one with highest p-value). Refit. Repeat until removal worsens CV MSE. Use CV MSE, not p-values, as the stopping criterion.
Do not use "remove if p > 0.05" or "add if p < 0.05" as your stopping rule. This multiplies false discovery risk and biases all subsequent inference. Instead: use CV MSE (for prediction) or domain knowledge + regularization (for inference).
Residuals eᵢ = yᵢ − ŷᵢ plotted against fitted values ŷᵢ. In multiple regression always use fitted values (not individual Xs) on the x-axis. Want to see: random scatter around zero with no discernible pattern.
| Pattern seen | What it means | Fix |
|---|---|---|
| U-shape or curve | Non-linearity — the linear model is systematically wrong | Add X² or transform X |
| Funnel / cone shape | Heteroscedasticity — residual variance grows with fitted values | Log-transform Y, use weighted least squares, or robust SEs |
| Trending pattern | Missing variable — unexplained signal remains in residuals | Add a relevant predictor |
| Random scatter | ✓ Linearity and homoscedasticity are satisfied | — |
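A sketch of the residual vs fitted plot from a fitted statsmodels OLS result (`fitted_model` is a placeholder name):

```python
import matplotlib.pyplot as plt

plt.scatter(fitted_model.fittedvalues, fitted_model.resid, alpha=0.5)
plt.axhline(0, color="grey", linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs fitted: look for curves, funnels, or trends")
plt.show()
```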
If your residual plot shows heteroscedasticity (funnel shape) but you don't want to transform Y, robust standard errors (also called sandwich estimators or Huber-White SEs) provide valid inference under non-constant variance. The point estimates β̂ stay the same, but SEs, CIs, and p-values adjust to account for the heteroscedasticity.
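In statsmodels, robust SEs are one argument at fit time; the formula and DataFrame below are placeholders:

```python
import statsmodels.formula.api as smf

robust_fit = smf.ols("sales ~ tv_spend + radio_spend", data=train_df).fit(cov_type="HC3")
print(robust_fit.summary())   # identical coefficients, heteroscedasticity-adjusted SEs/CIs/p-values
```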
Plot residuals in time order (or group order). If adjacent residuals have similar values ("tracking"), errors are correlated. This makes standard errors too small, p-values too small, and CIs too narrow — a dangerous false precision.
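The Durbin-Watson statistic is a quick numeric companion to this plot (values near 2 suggest no first-order autocorrelation; values well below 2 suggest positive correlation):

```python
from statsmodels.stats.stattools import durbin_watson

print("Durbin-Watson:", durbin_watson(fitted_model.resid))
```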
Plot quantiles of residuals against theoretical normal quantiles. Points should fall along a straight diagonal. Normality is needed for p-values and CIs to be exactly valid under small samples.
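A one-line Q-Q plot from statsmodels, again using the placeholder `fitted_model`:

```python
import matplotlib.pyplot as plt
import statsmodels.api as sm

sm.qqplot(fitted_model.resid, line="45", fit=True)   # points should hug the diagonal
plt.show()
```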
An outlier is an observation whose Y value is far from what the model predicts. An influential point is one that strongly affects fitted coefficients. These are not the same — a point can have high leverage (unusual X values) yet still be fit well by the model. Investigate all high-influence points before deletion.
A high-leverage observation has an unusual X value (not Y). It can substantially shift the fitted line, potentially invalidating the entire model. Leverage is more dangerous than outliers because the model bends toward it — yet residuals may look small, hiding the problem.
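statsmodels exposes leverage and Cook's distance from the fitted result; the 4/n cutoff below is a common rule of thumb, not a hard threshold:

```python
influence = fitted_model.get_influence()
leverage = influence.hat_matrix_diag        # high values = unusual X (high leverage)
cooks_d = influence.cooks_distance[0]       # high values = influential on the fit

flag = cooks_d > 4 / len(cooks_d)           # rule-of-thumb cutoff (assumption)
print("High-influence points to investigate:", int(flag.sum()))
```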
VIF(β̂ⱼ) measures how much the variance of β̂ⱼ is inflated due to correlation with other predictors. Pairwise correlation (Phase 02) misses multivariate collinearity — VIF is the definitive test. High VIF doesn't hurt prediction but makes inference on individual coefficients unreliable.
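A sketch of the VIF computation with statsmodels; adding the constant column first keeps the values on their usual scale:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_const = sm.add_constant(X_train)
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(X_const.shape[1])],
    index=X_const.columns,
)
print(vif.drop("const"))   # rule of thumb: VIF above roughly 5-10 signals a problem
```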
Diagnostics → fix → refit → re-diagnose is an iterative loop, not a one-time checkbox. Keep cycling until residual plots look clean.
Confidence intervals (uncertainty about the average response at a given x*) and prediction intervals (uncertainty about a single new observation at that x*) answer fundamentally different questions. Mixing them up produces either false precision or unnecessary alarm. A PI is always wider than a CI for the same x*. Use the right one for your decision context.
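statsmodels returns both intervals from one call; `X_new` is a hypothetical DataFrame of new predictor values in the same format the model was fitted on:

```python
pred = fitted_model.get_prediction(X_new)
frame = pred.summary_frame(alpha=0.05)

print(frame[["mean", "mean_ci_lower", "mean_ci_upper",   # CI: uncertainty about the average response
             "obs_ci_lower", "obs_ci_upper"]])           # PI: wider, covers a single new observation
```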
The regression line is calibrated only within the range of X seen during training. Extrapolation beyond this range assumes the linear relationship holds in territory with no empirical support — it may not, and predictions can be wildly wrong.
After all modelling decisions are finalised via CV, apply the model to the test set exactly once. Do not retune after seeing test results — that makes the test number optimistically biased and dishonest.
The gap between training RMSE and test RMSE reveals whether your problem is under-fitting (model too simple) or over-fitting (model too complex). This diagnosis guides your next move.
"RMSE = ₹4,200" is concrete and immediately interpretable. "R² = 0.82" is abstract. Always contextualise: "Our model predicts within ₹4,200 on average; the baseline of just predicting the mean was off by ₹9,800." This frames the value of the model directly.
"β̂₁ = 0.0475" is meaningless. "Every additional ₹1,000 in TV advertising is associated with approximately 47.5 additional units sold per week, holding radio and newspaper spend constant — with a plausible range of 42 to 53 units (95% CI)." That is the deliverable for an inference task.
Linear regression identifies statistical associations. It cannot prove causation without a randomised experiment or a formal causal inference framework. Saying "X causes Y" when you mean "X is correlated with Y" is a serious analytical error that erodes credibility.
The CI communicates both statistical significance and practical magnitude in one number. A coefficient with CI [0.001, 0.003] may be significant yet trivially small. A coefficient with CI [42, 53] tells a stakeholder exactly what range of effect they should plan for.
Define the range of X values for which predictions are trustworthy. What variables were unavailable? What non-linearities might you have simplified away? Under what conditions would the model go stale (new market entrant, policy change, seasonality shifts)? Proactive honesty about limitations builds more trust than overselling.