DATASCI 415

Least squares as maximum Gaussian likelihood

This post supplements the linear regression slides. Please see the slides for the setup.

The Gaussian linear regression model is

$$
Y \mid X = x \ \sim\ N(\beta_0 + \beta^\top x,\ \sigma^2),
$$

where $\beta_0 \in \mathbb{R}$ and $\beta \in \mathbb{R}^d$ are parameters (we assume $\sigma^2$ is known for now). It is a collection of probability distributions: one for each parameter setting/value. The model (implicitly) encodes distributional/probabilistic restrictions on the data because it rules out distributions not in the collection from generating the data. For example, the Gaussian linear regression model rules out non-linear dependencies between the inputs and the output because all distributions in the model posit a linear relation between the inputs and the output. Note that the restrictions imposed by the model may be incorrect; i.e. the data may actually come from a distribution that is not in the model. When the model does not include the data generating distribution, the model is said to be misspecified.
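
To make the model concrete, here is a minimal sketch (in Python; not from the slides) of simulating a training set from the Gaussian linear regression model. The sample size, input dimension, and parameter values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 3                                   # sample size and input dimension (hypothetical)
beta0, beta = 1.0, np.array([2.0, -1.0, 0.5])  # hypothetical "true" parameters
sigma2 = 0.25                                  # noise variance, assumed known

# The model restricts the conditional distribution of Y given X, not the
# distribution of X itself, so any choice of input distribution works here.
X = rng.normal(size=(n, d))
Y = beta0 + X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)  # Y | X ~ N(beta0 + beta^T X, sigma2)
```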

We fit the model to training data by estimating the parameters with maximum likelihood. To motivate maximum likelihood, consider two possible sets of model parameters. One way to decide which set of parameters fits the training data better is to compare the probability of observing the training data from (the distributions associated with) the two sets of parameters: the higher the probability of observing the training data, the better the parameters fit. For the Gaussian linear regression model, the probability of observing the training data $\{(X_1, Y_1), \dots, (X_n, Y_n)\}$ from parameters $(\beta_0, \beta)$ is

$$
\begin{aligned}
L(\beta_0, \beta) &\equiv P\{\{(X_i, Y_i)\}_{i=1}^n ;\, \beta_0, \beta\} \\
&= \prod_{i=1}^n P\{Y_i \mid X_i ;\, \beta_0, \beta\} && \text{(samples are independent)} \\
&= \prod_{i=1}^n \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\Bigl\{-\tfrac{1}{2\sigma^2}\bigl(Y_i - \beta_0 - \beta^\top X_i\bigr)^2\Bigr\} && \text{(Gaussian lin reg model)}.
\end{aligned}
$$

We call $L$ the likelihood; it is a function of (i.e. its inputs are) the model parameters, and it outputs the probability of observing the data from the input parameters. We emphasize that the likelihood is not the same as the pdf of the data; the likelihood is a function of parameter values while the pdf is a function of data values. Although their functional forms are similar, their inputs are different.
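
As a sanity check, here is a sketch that evaluates the likelihood at two different parameter settings and compares them; it reuses the simulated `X`, `Y`, and `sigma2` from above, and the two candidate parameter settings are made up.

```python
# Likelihood as a function of the parameters, evaluated on the training data above.
def likelihood(beta0, beta, X, Y, sigma2):
    resid = Y - beta0 - X @ beta
    densities = np.exp(-resid**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return np.prod(densities)  # product over the n independent samples

# A higher value means the parameters fit the training data better.
print(likelihood(1.0, np.array([2.0, -1.0, 0.5]), X, Y, sigma2))  # close to the simulating parameters
print(likelihood(0.0, np.zeros(d), X, Y, sigma2))                 # far from them: much smaller

# For large n the product of many small densities underflows to zero,
# which is one practical reason to work with the log-likelihood instead.
```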

The maximum likelihood estimator (MLE) is the parameter value that maximizes the likelihood; i.e. the parameter value that best fits the training data in the sense that the probability of observing the training data from the MLE is higher than the probability of observing the training data from any other parameter value. In practice, we usually maximize the log of the likelihood (called the log-likelihood) or minimize the negative of the log-likelihood (called the negative log-likelihood); because the log is strictly increasing, these yield the same maximizer as the likelihood itself. For the Gaussian linear regression model, the negative log-likelihood is

$$
-\log L(\beta_0, \beta) = \sum_{i=1}^n \left[ \frac{1}{2\sigma^2}\bigl(Y_i - \beta_0 - \beta^\top X_i\bigr)^2 + \frac{1}{2}\log(2\pi\sigma^2) \right].
$$

The second term does not depend on the parameters, and the factor $1/(2\sigma^2)$ on the first term is a positive constant (recall $\sigma^2$ is known), so dropping them does not change which parameters minimize the expression; what remains is the least squares cost function. Thus least squares is the maximum likelihood estimator for the Gaussian linear regression model.
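
To illustrate the equivalence, here is a sketch (reusing the simulated data from above) that minimizes the negative log-likelihood numerically and compares the result to the ordinary least squares solution; the optimizer and tolerance choices are incidental.

```python
from scipy.optimize import minimize

# Negative log-likelihood of the Gaussian linear regression model,
# with params = (beta0, beta) packed into a single vector.
def neg_log_likelihood(params, X, Y, sigma2):
    b0, b = params[0], params[1:]
    resid = Y - b0 - X @ b
    return np.sum(resid**2 / (2 * sigma2) + 0.5 * np.log(2 * np.pi * sigma2))

# MLE by direct numerical minimization of the negative log-likelihood.
mle = minimize(neg_log_likelihood, x0=np.zeros(d + 1), args=(X, Y, sigma2)).x

# Least squares fit (design matrix with an intercept column).
A = np.column_stack([np.ones(n), X])
ls, *_ = np.linalg.lstsq(A, Y, rcond=None)

print(np.allclose(mle, ls, atol=1e-4))  # True: the two estimates agree
```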

Posted on September 15, 2024 from San Francisco, CA.