From MSE to Bayesian Linear Regression
14 Feb. 2025
!! This is just an AI-generated demo post (but it may still be an interesting read) !!
1. Mean Squared Error (MSE)
Let's start with the simplest approach to fitting a model: minimizing the Mean Squared Error. Given a model $f_\theta(x)$ with parameters $\theta$ and data points $\{(x_i, y_i)\}_{i=1}^N$, we want to find:

$$\hat{\theta}_{\text{MSE}} = \arg\min_\theta \frac{1}{N} \sum_{i=1}^{N} \left(y_i - f_\theta(x_i)\right)^2$$
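Here is a minimal sketch, assuming a simple linear model $f_\theta(x) = \theta_0 + \theta_1 x$ and synthetic data; ordinary least squares minimizes exactly this objective:

```python
# Minimal sketch: fitting a linear model by minimizing MSE on synthetic 1-D data.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2x + 1 + noise (assumed for illustration)
N = 50
x = rng.uniform(-3, 3, size=N)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, size=N)

# Design matrix with a bias column, so theta = (intercept, slope)
X = np.column_stack([np.ones(N), x])

# The least-squares solution minimizes the mean squared error
theta_mse, *_ = np.linalg.lstsq(X, y, rcond=None)

mse = np.mean((y - X @ theta_mse) ** 2)
print("theta_MSE:", theta_mse, "MSE:", mse)
```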
This is intuitive and computationally simple, but it comes with hidden assumptions:
- We care equally about errors in any direction (symmetry)
- Larger errors are penalized quadratically
- We only want a point estimate of parameters
- We have no prior knowledge to incorporate
2. Maximum Likelihood Estimation (MLE)
We can generalize MSE by viewing it as a probabilistic problem. Instead of just minimizing error, we ask: what parameters make our observed data most likely?
We assume our observations follow:

$$y_i = f_\theta(x_i) + \epsilon_i$$

where $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$.

This leads to maximizing the (log-)likelihood:

$$\hat{\theta}_{\text{MLE}} = \arg\max_\theta \sum_{i=1}^{N} \log p(y_i \mid x_i, \theta)$$

For Gaussian noise, this expands to:

$$\hat{\theta}_{\text{MLE}} = \arg\max_\theta \sum_{i=1}^{N} \left[ -\frac{\left(y_i - f_\theta(x_i)\right)^2}{2\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2) \right],$$

which, for fixed $\sigma$, is exactly minimizing the mean squared error.
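To make that equivalence concrete, here is a small sketch (assuming the same synthetic linear data and a known noise scale $\sigma$) that minimizes the negative Gaussian log-likelihood numerically and recovers the least-squares solution:

```python
# Minimal sketch: maximizing the Gaussian log-likelihood gives the same
# parameters as minimizing MSE (sigma is assumed known for illustration).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
N, sigma = 50, 0.5
x = rng.uniform(-3, 3, size=N)
y = 2.0 * x + 1.0 + rng.normal(0.0, sigma, size=N)
X = np.column_stack([np.ones(N), x])

def neg_log_likelihood(theta):
    # -sum_i log N(y_i | f_theta(x_i), sigma^2)
    resid = y - X @ theta
    return 0.5 * np.sum(resid**2) / sigma**2 + 0.5 * N * np.log(2 * np.pi * sigma**2)

theta_mle = minimize(neg_log_likelihood, x0=np.zeros(2)).x
theta_mse, *_ = np.linalg.lstsq(X, y, rcond=None)

print("theta_MLE:", theta_mle)   # matches theta_MSE up to optimizer tolerance
print("theta_MSE:", theta_mse)
```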
Key Generalizations over MSE:
- Can handle different noise distributions by changing the likelihood $p(y \mid x, \theta)$
- Provides a probabilistic interpretation
- Naturally extends to non-Gaussian cases
- When the noise is Gaussian, it reduces exactly to MSE!
3. Maximum A Posteriori (MAP) Estimation
We can further generalize by incorporating prior knowledge about the parameters. Using Bayes' rule (writing $\mathcal{D}$ for the observed data):

$$p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})}$$
The MAP estimate is:

$$\hat{\theta}_{\text{MAP}} = \arg\max_\theta \, p(\theta \mid \mathcal{D}) = \arg\max_\theta \left[ \log p(\mathcal{D} \mid \theta) + \log p(\theta) \right]$$
With a Gaussian prior $\theta \sim \mathcal{N}(0, \tau^2 I)$, this becomes:

$$\hat{\theta}_{\text{MAP}} = \arg\min_\theta \left[ \frac{1}{2\sigma^2} \sum_{i=1}^{N} \left(y_i - f_\theta(x_i)\right)^2 + \frac{1}{2\tau^2}\|\theta\|_2^2 \right],$$

i.e., the MSE objective plus an $L_2$ (ridge) penalty on the parameters.
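A short sketch of this, assuming a linear model with the same synthetic data and illustrative values for $\sigma$ and $\tau$; the MAP estimate is then just the closed-form ridge solution:

```python
# Minimal sketch: MAP estimation with a Gaussian prior on the weights is
# ridge regression in closed form (sigma and tau are assumed values).
import numpy as np

rng = np.random.default_rng(0)
N, sigma, tau = 50, 0.5, 1.0          # noise scale and prior scale (assumed)
x = rng.uniform(-3, 3, size=N)
y = 2.0 * x + 1.0 + rng.normal(0.0, sigma, size=N)
X = np.column_stack([np.ones(N), x])

# argmin_theta  (1/(2 sigma^2)) ||y - X theta||^2 + (1/(2 tau^2)) ||theta||^2
# => theta_MAP = (X^T X + (sigma^2 / tau^2) I)^(-1) X^T y
lam = sigma**2 / tau**2               # effective ridge strength
theta_map = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

print("theta_MAP:", theta_map)        # shrunk slightly toward zero vs. the MLE
```

Note how the ridge strength falls out of the model: a tighter prior (smaller $\tau$) or noisier data (larger $\sigma$) means more shrinkage.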
Key Generalizations over MLE:
- Incorporates prior knowledge
- Naturally provides regularization
- MLE is a special case with uniform prior
- Can handle ill-posed problems better
4. Full Bayesian Linear Regression
The final generalization is to move beyond point estimates and consider the full posterior distribution:

$$p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{\int p(\mathcal{D} \mid \theta')\, p(\theta')\, d\theta'}$$
For linear regression with:
- Gaussian prior: $\theta \sim \mathcal{N}(0, \tau^2 I)$
- Gaussian likelihood: $y_i \mid x_i, \theta \sim \mathcal{N}(\theta^\top x_i, \sigma^2)$
The posterior is also Gaussian, $\theta \mid \mathcal{D} \sim \mathcal{N}(\mu_N, \Sigma_N)$, with closed-form mean and covariance:

$$\Sigma_N = \left(\frac{1}{\sigma^2} X^\top X + \frac{1}{\tau^2} I\right)^{-1}, \qquad \mu_N = \frac{1}{\sigma^2}\,\Sigma_N X^\top y$$
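Here is a minimal sketch of those closed-form updates (same assumed synthetic data and illustrative $\sigma$, $\tau$ as above), including the posterior predictive at a new input:

```python
# Minimal sketch: closed-form Gaussian posterior over the weights, plus the
# predictive mean/variance at a new input (sigma and tau are assumed values).
import numpy as np

rng = np.random.default_rng(0)
N, sigma, tau = 50, 0.5, 1.0
x = rng.uniform(-3, 3, size=N)
y = 2.0 * x + 1.0 + rng.normal(0.0, sigma, size=N)
X = np.column_stack([np.ones(N), x])

# Sigma_N = (X^T X / sigma^2 + I / tau^2)^(-1),  mu_N = Sigma_N X^T y / sigma^2
Sigma_N = np.linalg.inv(X.T @ X / sigma**2 + np.eye(2) / tau**2)
mu_N = Sigma_N @ X.T @ y / sigma**2

# Posterior predictive at a new point x*: Gaussian with
# mean x*^T mu_N and variance x*^T Sigma_N x* + sigma^2
x_star = np.array([1.0, 2.5])         # [bias, feature]
pred_mean = x_star @ mu_N
pred_var = x_star @ Sigma_N @ x_star + sigma**2

print("posterior mean:", mu_N)        # coincides with the MAP estimate here
print("predictive:", pred_mean, "+/-", np.sqrt(pred_var))
```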
Key Generalizations over MAP:
- Full uncertainty quantification
- Can marginalize over parameters instead of optimizing
- Can represent multi-modal posteriors (in models beyond this Gaussian case)
- Allows for hierarchical modeling
- MAP is just the mode of this distribution
The Complete Picture
These approaches form a clear hierarchy:
- MSE: Simple point estimation
- MLE: Probabilistic framework, still point estimation
- MAP: Adds prior knowledge, still point estimation
- Full Bayesian: Complete probabilistic treatment
Each step adds capabilities while maintaining the previous approach as a special case. The Bayesian approach is the most general, containing all others as special cases (the short sketch after this list checks these reductions numerically):
- MSE is MAP with Gaussian noise and uniform prior
- MLE is MAP with uniform prior
- MAP is the mode of the Bayesian posterior
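A short numerical check of these reductions, under the same assumed synthetic setup: with a very broad prior the MAP estimate collapses to the MLE/MSE solution, and the MAP estimate always coincides with the mode of the Gaussian posterior:

```python
# Minimal sketch tying the hierarchy together (synthetic data, assumed sigma).
import numpy as np

rng = np.random.default_rng(0)
N, sigma = 50, 0.5
x = rng.uniform(-3, 3, size=N)
y = 2.0 * x + 1.0 + rng.normal(0.0, sigma, size=N)
X = np.column_stack([np.ones(N), x])
I = np.eye(2)

def map_estimate(tau):
    # Closed-form MAP / ridge solution for a N(0, tau^2 I) prior
    return np.linalg.solve(X.T @ X + (sigma**2 / tau**2) * I, X.T @ y)

theta_mse, *_ = np.linalg.lstsq(X, y, rcond=None)
print("MSE / MLE:        ", theta_mse)
print("MAP, broad prior: ", map_estimate(tau=1e6))   # ~ identical to MSE/MLE
print("MAP, tight prior: ", map_estimate(tau=0.1))   # shrunk toward zero

# The Gaussian posterior's mode (= mean) equals the MAP estimate for the same prior
tau = 1.0
Sigma_N = np.linalg.inv(X.T @ X / sigma**2 + I / tau**2)
print("Posterior mode:   ", Sigma_N @ X.T @ y / sigma**2)
print("MAP, tau = 1.0:   ", map_estimate(tau=1.0))
```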
This hierarchy shows how seemingly different approaches are deeply connected, each adding new capabilities while preserving the insights of simpler methods.