From MSE to Bayesian Linear Regression

14 Feb. 2025

!! This is just an AI-generated demo post (but may still be an interesting read) !!

1. Mean Squared Error (MSE)

Let's start with the simplest approach to fitting a model: minimizing the Mean Squared Error. Given a model $f_\theta$ with parameters $\theta$ and data points $(T_i, R_i)$, we want to find:

$$\theta^* = \arg\min_\theta \sum_{i=1}^N (R_i - f_\theta(T_i))^2$$

This is intuitive and computationally simple, but it comes with hidden assumptions:

  • We care equally about errors in any direction (symmetry)
  • Larger errors are penalized quadratically
  • We only want a point estimate of parameters
  • We have no prior knowledge to incorporate
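For a concrete feel, here is a minimal sketch of this fit, assuming a linear model $f_\theta(t) = \theta_0 + \theta_1 t$; the synthetic data and variable names are purely illustrative:

```python
import numpy as np

# Synthetic data for a linear model f_theta(t) = theta0 + theta1 * t
# (both the data and the names are illustrative, not from the post).
rng = np.random.default_rng(0)
T = rng.uniform(0, 10, size=50)              # inputs
R = 2.0 + 0.5 * T + rng.normal(0, 1, 50)     # noisy observations

# Design matrix with a bias column: row i is [1, T_i].
X = np.column_stack([np.ones_like(T), T])

# Minimizing the sum of squared errors has the closed-form (normal-equation)
# solution theta* = (X^T X)^{-1} X^T R, which lstsq computes stably.
theta_mse, *_ = np.linalg.lstsq(X, R, rcond=None)
print("least-squares estimate:", theta_mse)
```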

2. Maximum Likelihood Estimation (MLE)

We can generalize MSE by viewing it as a probabilistic problem. Instead of just minimizing error, we ask: what parameters make our observed data most likely?

We assume our observations follow:

$$R = f_\theta(T) + \epsilon$$

where

$$\epsilon \sim N(0, \sigma^2)$$

This leads to maximizing:

$$\theta^* = \arg\max_\theta \sum_{i=1}^N \log p(R_i|T_i;\theta)$$

For Gaussian noise, this expands to:

$$\theta^* = \arg\max_\theta \sum_{i=1}^N \left[-\frac{1}{2}\log(2\pi\sigma^2) - \frac{(R_i - f_\theta(T_i))^2}{2\sigma^2}\right]$$

Key Generalizations over MSE:

  1. Can handle different noise distributions by changing $p(R|T;\theta)$
  2. Provides a probabilistic interpretation
  3. Naturally extends to non-Gaussian cases
  4. When noise is Gaussian, reduces exactly to MSE!
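As a quick check of point 4, the sketch below (reusing the same illustrative synthetic setup, with $\sigma$ assumed known) minimizes the Gaussian negative log-likelihood numerically; since that objective differs from the squared-error sum only by a constant and a positive scale, it recovers the least-squares fit:

```python
import numpy as np
from scipy.optimize import minimize

# Same synthetic setup as the least-squares sketch above (illustrative only).
rng = np.random.default_rng(0)
T = rng.uniform(0, 10, size=50)
R = 2.0 + 0.5 * T + rng.normal(0, 1, 50)
X = np.column_stack([np.ones_like(T), T])
sigma = 1.0                                   # noise std, assumed known here

def neg_log_likelihood(theta):
    # Negative of the per-point Gaussian log-density from the expansion above.
    resid = R - X @ theta
    return np.sum(0.5 * np.log(2 * np.pi * sigma**2) + resid**2 / (2 * sigma**2))

# The NLL is the squared-error sum scaled by 1/(2 sigma^2) plus a constant,
# so its minimizer matches least squares up to optimizer tolerance.
theta_mle = minimize(neg_log_likelihood, x0=np.zeros(2)).x
print("MLE estimate:", theta_mle)
```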

3. Maximum A Posteriori (MAP) Estimation

We can further generalize by incorporating prior knowledge about parameters. Using Bayes' rule:

$$p(\theta|D) \propto p(D|\theta)\,p(\theta)$$

The MAP estimate is:

$$\theta^* = \arg\max_\theta [\log p(D|\theta) + \log p(\theta)]$$

With a Gaussian prior $\theta \sim N(0, \Sigma_{prior})$, this becomes:

$$\theta^* = \arg\max_\theta \left[-\sum_{i=1}^N \frac{(R_i - f_\theta(T_i))^2}{2\sigma^2} - \frac{\theta^T\Sigma_{prior}^{-1}\theta}{2}\right]$$

Key Generalizations over MLE:

  1. Incorporates prior knowledge
  2. Naturally provides regularization
  3. MLE is a special case with uniform prior
  4. Can handle ill-posed problems better
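In the linear case with an isotropic prior $\theta \sim N(0, \tau^2 I)$, setting the gradient of this objective to zero gives ridge regression with regularization strength $\lambda = \sigma^2 / \tau^2$. A minimal sketch, again on the illustrative synthetic data (the prior scale $\tau$ is an assumption made for the demo):

```python
import numpy as np

# Same synthetic data as above; tau is an assumed prior scale for the demo.
rng = np.random.default_rng(0)
T = rng.uniform(0, 10, size=50)
R = 2.0 + 0.5 * T + rng.normal(0, 1, 50)
X = np.column_stack([np.ones_like(T), T])
sigma, tau = 1.0, 10.0                        # noise std, prior std (theta ~ N(0, tau^2 I))

# Stationary point of the MAP objective in closed form:
#   theta_MAP = (X^T X + (sigma^2 / tau^2) I)^{-1} X^T R,
# i.e. ridge regression with lambda = sigma^2 / tau^2.
lam = sigma**2 / tau**2
theta_map = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ R)
print("MAP / ridge estimate:", theta_map)
```

A small $\tau$ (a strong prior) pulls the estimate toward zero, while a large $\tau$ recovers the MLE / least-squares fit.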

4. Full Bayesian Linear Regression

The final generalization is to move beyond point estimates and consider the full posterior distribution:

$$p(\theta|D) = \frac{p(D|\theta)\,p(\theta)}{p(D)}$$

For linear regression with:

  • Gaussian prior: $\theta \sim N(0, \Sigma_{prior})$
  • Gaussian likelihood: $R|T,\theta \sim N(f_\theta(T), \sigma^2)$

The posterior is also Gaussian with closed-form expressions for its mean and covariance.
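For a linear model written with a design matrix $X$ (so $f_\theta(T_i) = X_i\theta$) and a zero-mean prior, these expressions are

$$\Sigma_{post} = \left(\Sigma_{prior}^{-1} + \frac{X^T X}{\sigma^2}\right)^{-1}, \qquad \mu_{post} = \Sigma_{post}\,\frac{X^T R}{\sigma^2}$$

Here is a sketch of computing them, together with the Gaussian predictive distribution at a new input, on the same illustrative synthetic data with an assumed isotropic prior:

```python
import numpy as np

# Same synthetic data; isotropic prior theta ~ N(0, tau^2 I) assumed for the demo.
rng = np.random.default_rng(0)
T = rng.uniform(0, 10, size=50)
R = 2.0 + 0.5 * T + rng.normal(0, 1, 50)
X = np.column_stack([np.ones_like(T), T])
sigma, tau = 1.0, 10.0

# Posterior covariance and mean of the conjugate Gaussian model.
prior_precision = np.eye(X.shape[1]) / tau**2
post_cov = np.linalg.inv(prior_precision + X.T @ X / sigma**2)
post_mean = post_cov @ X.T @ R / sigma**2

# Predictive distribution at a new input t*: Gaussian with mean x*^T mu_post and
# variance x*^T Sigma_post x* + sigma^2 (parameter uncertainty plus noise).
x_star = np.array([1.0, 5.0])                 # [bias, t*]
pred_mean = x_star @ post_mean
pred_var = x_star @ post_cov @ x_star + sigma**2
print("posterior mean:", post_mean)
print(f"predictive at t* = 5: N({pred_mean:.2f}, {pred_var:.2f})")
```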

Key Generalizations over MAP:

  1. Full uncertainty quantification
  2. Can marginalize over parameters instead of optimizing
  3. Can represent multi-modal posteriors in non-conjugate models (here the posterior is a single Gaussian)
  4. Allows for hierarchical modeling
  5. MAP is just the mode of this distribution

The Complete Picture

These approaches form a clear hierarchy:

  1. MSE: Simple point estimation
  2. MLE: Probabilistic framework, still point estimation
  3. MAP: Adds prior knowledge, still point estimation
  4. Full Bayesian: Complete probabilistic treatment

Each step adds capabilities while keeping the previous approach as a special case; the Bayesian treatment is the most general, containing all the others:

  • MSE is MAP with Gaussian noise and uniform prior
  • MLE is MAP with uniform prior
  • MAP is the mode of the Bayesian posterior
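These containments can be checked numerically with the sketches above: for the linear-Gaussian model the posterior mean coincides with the MAP (ridge) estimate, and as the prior variance $\tau^2$ grows, both approach the MLE / least-squares fit. A quick check on the same illustrative data:

```python
import numpy as np

# Same synthetic setup as the earlier sketches (illustrative only).
rng = np.random.default_rng(0)
T = rng.uniform(0, 10, size=50)
R = 2.0 + 0.5 * T + rng.normal(0, 1, 50)
X = np.column_stack([np.ones_like(T), T])
sigma = 1.0

theta_lstsq, *_ = np.linalg.lstsq(X, R, rcond=None)   # MSE / MLE estimate
for tau in (0.1, 1.0, 100.0):
    lam = sigma**2 / tau**2
    theta_map = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ R)
    post_cov = np.linalg.inv(np.eye(2) / tau**2 + X.T @ X / sigma**2)
    post_mean = post_cov @ X.T @ R / sigma**2
    # The Gaussian posterior's mean equals its mode (the MAP estimate), and the
    # MAP estimate approaches the least-squares fit as the prior flattens.
    print(tau, np.allclose(post_mean, theta_map),
          np.linalg.norm(theta_map - theta_lstsq))
```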

This hierarchy shows how seemingly different approaches are deeply connected, each adding new capabilities while preserving the insights of simpler methods.