Linear regression

DATASCI 415: Statistical Learning and Data Mining

University of Michigan

including slides by Gareth James, Daniela Witten, Trevor Hastie, Rob Tibshirani, Jonathan Taylor

Least squares as maximum likelihood

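Assume the linear model with independent Gaussian noise of variance σ^{2}:

Y_{i} = β_{0} + β^{⊤} X_{i} + ε_{i}, \quad ε_{i} \overset{iid}{∼} N (0, σ^{2}), \quad i = 1, \dots, n.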

The likelihood function

L (β_{0}, β) ≜ \prod_{i = 1}^{n} \frac{1}{\sqrt{2 π σ^{2}}} \exp (- \frac{1}{2 σ^{2}} (Y_{i} - β_{0} - β^{⊤} X_{i})^{2})

outputs the probability (density) of observing the training data $\{(X_{i}, Y_{i})\}_{i = 1}^{n}$ for the input parameters.
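As a minimal sketch of how this function is evaluated, the Python snippet below multiplies the Gaussian densities of the residuals. The toy data, the fixed noise variance `sigma2`, and the two parameter settings are hypothetical values chosen for illustration only; they do not come from the course material.

```python
import math

def likelihood(beta0, beta, X, Y, sigma2):
    """Gaussian likelihood L(beta0, beta) for a fixed noise variance sigma2."""
    value = 1.0
    for x_i, y_i in zip(X, Y):
        # Mean of Y_i under the model: beta0 + beta^T x_i
        mean = beta0 + sum(b * x for b, x in zip(beta, x_i))
        resid = y_i - mean
        # Multiply in the N(mean, sigma2) density evaluated at y_i
        value *= math.exp(-resid ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)
    return value

# Hypothetical toy data with one predictor, roughly following Y = 1 + 2 X.
X = [[0.0], [1.0], [2.0], [3.0]]
Y = [1.1, 2.9, 5.2, 6.8]
sigma2 = 0.25  # assumed known for this sketch

good = likelihood(1.0, [2.0], X, Y, sigma2)  # parameters close to the data
bad = likelihood(0.0, [1.0], X, Y, sigma2)   # parameters far from the data
print(good > bad)
```

Parameters that fit the data well receive a larger likelihood value than parameters that fit poorly, which is what makes the likelihood usable as a criterion for choosing parameters.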

Idea: find the parameters that maximize the probability of observing the training data:

({\hat{β}}_{0}, \hat{β}) = \operatorname*{argmax}_{β_{0}, β} L (β_{0}, β);

i.e., find the parameters under which the training data are most probable.

In practice, it is often more convenient to find the parameters that (equivalently) minimize the negative log-likelihood:

({\hat{β}}_{0}, \hat{β}) = \operatorname*{argmin}_{β_{0}, β} - \log L (β_{0}, β) .
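Taking the logarithm of the Gaussian likelihood turns the product into a sum:

- \log L (β_{0}, β) = \frac{n}{2} \log (2 π σ^{2}) + \frac{1}{2 σ^{2}} \sum_{i = 1}^{n} (Y_{i} - β_{0} - β^{⊤} X_{i})^{2} .

For fixed σ^{2}, the first term does not depend on the parameters, so minimizing the negative log-likelihood is the same as minimizing the residual sum of squares. The sketch below checks this numerically under simplifying assumptions: a single predictor, a known `sigma2`, hypothetical toy data, and plain gradient descent with a hand-picked step size standing in for a proper optimizer.

```python
import math

def neg_log_likelihood(beta0, beta1, xs, ys, sigma2):
    """Negative Gaussian log-likelihood for one predictor and fixed sigma2."""
    n = len(xs)
    rss = sum((y - beta0 - beta1 * x) ** 2 for x, y in zip(xs, ys))
    return 0.5 * n * math.log(2 * math.pi * sigma2) + rss / (2 * sigma2)

def minimize_nll(xs, ys, sigma2, step=0.01, iters=5000):
    """Gradient descent on the negative log-likelihood (step size chosen by hand)."""
    beta0, beta1 = 0.0, 0.0
    for _ in range(iters):
        resid = [y - beta0 - beta1 * x for x, y in zip(xs, ys)]
        # Gradient of rss / (2 * sigma2) with respect to beta0 and beta1
        g0 = -sum(resid) / sigma2
        g1 = -sum(r * x for r, x in zip(resid, xs)) / sigma2
        beta0 -= step * g0
        beta1 -= step * g1
    return beta0, beta1

def least_squares(xs, ys):
    """Closed-form least squares fit for simple linear regression."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical toy data, roughly following Y = 1 + 2 X.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]
sigma2 = 0.25  # assumed known for this sketch

b0_ml, b1_ml = minimize_nll(xs, ys, sigma2)
b0_ls, b1_ls = least_squares(xs, ys)
print(b0_ml, b1_ml)
print(b0_ls, b1_ls)
```

Up to numerical tolerance, the two parameter pairs agree. Changing `sigma2` rescales the objective but does not move its minimizer, although the step size would need retuning.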