From MSE to Bayesian Linear Regression
14 Feb. 2025
1. Mean Squared Error (MSE)
Let's start with the simplest approach to fitting a model: minimizing the Mean Squared Error. Given a model $f_\theta(x)$ with parameters $\theta$, and data points $(x_i, y_i)$, $i = 1, \dots, N$, we want to find:

$$\hat{\theta}_{\text{MSE}} = \arg\min_\theta \frac{1}{N} \sum_{i=1}^{N} \left( y_i - f_\theta(x_i) \right)^2$$
This is intuitive and computationally simple, but it comes with hidden assumptions:
- We care equally about errors in any direction (symmetry)
- Larger errors are penalized quadratically
- We only want a point estimate of parameters
- We have no prior knowledge to incorporate
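To make this concrete, here is a minimal sketch in Python (NumPy) for a linear model $f_\theta(x) = \theta^\top x$; the toy data and variable names are illustrative, not from any particular dataset:

```python
import numpy as np

# Toy 1-D data: y = 1 + 2x plus noise (illustrative values only)
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.uniform(-1, 1, 50)])  # bias column + feature
y = X @ np.array([1.0, 2.0]) + rng.normal(0.0, 0.3, 50)

# For a linear model, the MSE minimizer has the closed form
# theta_hat = (X^T X)^{-1} X^T y (the normal equations)
theta_mse = np.linalg.solve(X.T @ X, X.T @ y)

mse = np.mean((y - X @ theta_mse) ** 2)
print("theta_mse:", theta_mse, "MSE:", mse)
```

The same toy setup is reused in the sketches below so the different estimates can be compared directly.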
2. Maximum Likelihood Estimation (MLE)
We can generalize MSE by viewing it as a probabilistic problem. Instead of just minimizing error, we ask: what parameters make our observed data most likely?
We assume our observations follow:

$$y_i = f_\theta(x_i) + \epsilon_i$$

where $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$.

This leads to maximizing the log-likelihood:

$$\hat{\theta}_{\text{MLE}} = \arg\max_\theta \sum_{i=1}^{N} \log p(y_i \mid x_i, \theta)$$

For Gaussian noise, this expands to:

$$\hat{\theta}_{\text{MLE}} = \arg\max_\theta \left[ -\frac{1}{2\sigma^2} \sum_{i=1}^{N} \left( y_i - f_\theta(x_i) \right)^2 - \frac{N}{2} \log(2\pi\sigma^2) \right]$$

The second term does not depend on $\theta$, so maximizing the likelihood is equivalent to minimizing the squared error.
Key Generalizations over MSE:
- Can handle different noise distributions by changing the likelihood $p(y \mid x, \theta)$
- Provides a probabilistic interpretation
- Naturally extends to non-Gaussian cases
- When noise is Gaussian, reduces exactly to MSE!
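As a sanity check on that last point, here is a sketch that minimizes the Gaussian negative log-likelihood numerically (using SciPy, with an assumed-known noise level) and compares it to the least-squares solution; the toy data mirrors the earlier sketch:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.uniform(-1, 1, 50)])
y = X @ np.array([1.0, 2.0]) + rng.normal(0.0, 0.3, 50)

def neg_log_likelihood(theta, sigma=0.3):
    # Gaussian noise model: y_i ~ N(theta^T x_i, sigma^2)
    resid = y - X @ theta
    return 0.5 * np.sum(resid ** 2) / sigma ** 2 + 0.5 * len(y) * np.log(2 * np.pi * sigma ** 2)

theta_mle = minimize(neg_log_likelihood, x0=np.zeros(2)).x
theta_mse = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_mle, theta_mse)  # the two estimates agree up to optimizer tolerance
```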
3. Maximum A Posteriori (MAP) Estimation
We can further generalize by incorporating prior knowledge about parameters. Using Bayes' rule (with $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N$ denoting the data):

$$p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})} \propto p(\mathcal{D} \mid \theta)\, p(\theta)$$

The MAP estimate is:

$$\hat{\theta}_{\text{MAP}} = \arg\max_\theta \left[ \log p(\mathcal{D} \mid \theta) + \log p(\theta) \right]$$

With a Gaussian prior $\theta \sim \mathcal{N}(0, \tau^2 I)$ and Gaussian noise, this becomes:

$$\hat{\theta}_{\text{MAP}} = \arg\min_\theta \left[ \sum_{i=1}^{N} \left( y_i - f_\theta(x_i) \right)^2 + \frac{\sigma^2}{\tau^2} \|\theta\|^2 \right]$$

which is exactly ridge-regularized least squares.
Key Generalizations over MLE:
- Incorporates prior knowledge
- Naturally provides regularization
- MLE is a special case with uniform prior
- Can handle ill-posed problems better
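A sketch of the MAP estimate under the Gaussian prior above: with Gaussian noise it is ridge regression with regularization strength $\sigma^2 / \tau^2$. The variances here are assumed known, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.uniform(-1, 1, 50)])
y = X @ np.array([1.0, 2.0]) + rng.normal(0.0, 0.3, 50)

sigma2, tau2 = 0.3 ** 2, 1.0 ** 2   # noise variance and prior variance (assumed known)
lam = sigma2 / tau2                  # implied L2 regularization strength

# MAP with prior N(0, tau^2 I) and Gaussian noise is ridge regression:
# theta_map = (X^T X + lam I)^{-1} X^T y
theta_map = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
print("theta_map:", theta_map)
```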
4. Full Bayesian Linear Regression
The final generalization is to move beyond point estimates and consider the full posterior distribution:

$$p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})}$$
For linear regression with:
- Gaussian prior: $\theta \sim \mathcal{N}(0, \tau^2 I)$
- Gaussian likelihood: $y_i \mid x_i, \theta \sim \mathcal{N}(\theta^\top x_i, \sigma^2)$
The posterior is also Gaussian, with closed-form expressions for its mean and covariance:

$$\Sigma_N = \left( \frac{1}{\sigma^2} X^\top X + \frac{1}{\tau^2} I \right)^{-1}, \qquad \mu_N = \frac{1}{\sigma^2} \Sigma_N X^\top y$$

where $X$ stacks the inputs as rows and $y$ stacks the targets.
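A sketch of this closed-form posterior and the resulting predictive distribution, under the same assumed-known variances (illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.uniform(-1, 1, 50)])
y = X @ np.array([1.0, 2.0]) + rng.normal(0.0, 0.3, 50)

sigma2, tau2 = 0.3 ** 2, 1.0 ** 2   # assumed-known noise and prior variances

# Posterior over theta is Gaussian N(mu_N, Sigma_N):
#   Sigma_N = (X^T X / sigma^2 + I / tau^2)^{-1},  mu_N = Sigma_N X^T y / sigma^2
Sigma_N = np.linalg.inv(X.T @ X / sigma2 + np.eye(X.shape[1]) / tau2)
mu_N = Sigma_N @ X.T @ y / sigma2

# Posterior predictive at a new input x*: mean x*^T mu_N, variance x*^T Sigma_N x* + sigma^2
x_new = np.array([1.0, 0.5])
pred_mean = x_new @ mu_N
pred_var = x_new @ Sigma_N @ x_new + sigma2
print("posterior mean:", mu_N)
print("predictive mean %.3f, predictive std %.3f" % (pred_mean, np.sqrt(pred_var)))
```

Note that the posterior mean $\mu_N$ coincides with the MAP/ridge estimate from the previous section, since a Gaussian's mean and mode are the same; what the full posterior adds is $\Sigma_N$ and the predictive variance.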
Key Generalizations over MAP:
- Full uncertainty quantification
- Can marginalize over parameters instead of optimizing
- Handles multi-modal posteriors
- Allows for hierarchical modeling
- MAP is just the mode of this distribution
The Complete Picture
These approaches form a clear hierarchy:
- MSE: Simple point estimation
- MLE: Probabilistic framework, still point estimation
- MAP: Adds prior knowledge, still point estimation
- Full Bayesian: Complete probabilistic treatment
Each step adds capabilities while maintaining the previous approach as a special case. The Bayesian approach is the most general, containing all others as special cases:
- MSE is MAP with Gaussian noise and uniform prior
- MLE is MAP with uniform prior
- MAP is the mode of the Bayesian posterior
This hierarchy shows how seemingly different approaches are deeply connected, each adding new capabilities while preserving the insights of simpler methods.
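As a brief numerical check of these special-case relationships (same toy setup as the sketches above): as the prior variance $\tau^2$ grows, the MAP estimate approaches the MLE, which for Gaussian noise is the MSE solution.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.uniform(-1, 1, 50)])
y = X @ np.array([1.0, 2.0]) + rng.normal(0.0, 0.3, 50)
sigma2 = 0.3 ** 2

def map_estimate(tau2):
    # Ridge / MAP solution with prior variance tau2
    lam = sigma2 / tau2
    return np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)

theta_mse = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(map_estimate(1e12), theta_mse))  # nearly-flat prior -> MAP ~ MLE = MSE
print(map_estimate(1.0))                           # informative prior shrinks the estimate
```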