STATS 413

Bias-variance decomposition

This post supplements the supervised learning slides. Please see the slides for the setup.

We wish to derive the bias-variance decomposition on p.21 (of the slides):

$$
E[(Y - \hat{f}(X))^2 \mid X = x] = \mathrm{bias}[\hat{f}(x)]^2 + \mathrm{var}[\hat{f}(x)] + \mathrm{var}[\epsilon \mid X = x],
$$

where

$$
\mathrm{bias}[\hat{f}(x)] \triangleq f(x) - E[\hat{f}(x)], \qquad \mathrm{var}[\hat{f}(x)] = E\big[(\hat{f}(x) - E[\hat{f}(x)])^2\big].
$$

All the expectations (unless otherwise stated) are with respect to $(X, Y)$ and $\hat{f}$. Note that the irreducible error $\mathrm{var}[\epsilon \mid X = x]$ depends on $x$. This is the more general form of the irreducible error for heteroscedastic problems, in which the (conditional) variance of $\epsilon$ depends on $x$.
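For instance (a hypothetical model, not from the slides), suppose the noise scale varies with the input:

$$
Y = f(X) + \sigma(X) Z, \qquad Z \sim N(0, 1) \text{ independent of } X,
$$

so that $\mathrm{var}[\epsilon \mid X = x] = \sigma(x)^2$ changes with $x$; the homoscedastic case corresponds to a constant $\sigma$.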

First, we decompose the MSE of a fixed $\hat{f}$ into reducible and irreducible parts (see p.5):

$$
\begin{aligned}
E[(Y - \hat{f}(X))^2 \mid X = x] &= E[(f(X) + \epsilon - \hat{f}(X))^2 \mid X = x] \\
&= E[(f(X) - \hat{f}(X))^2 \mid X = x] + E[\epsilon^2 \mid X = x] + 2E[(f(X) - \hat{f}(X))\epsilon \mid X = x] \\
&= \underbrace{(f(x) - \hat{f}(x))^2}_{\text{reducible error}} + E[\epsilon^2 \mid X = x] + 2(f(x) - \hat{f}(x))E[\epsilon \mid X = x],
\end{aligned}
$$

where $f(x) = E[Y \mid X = x]$ is the regression function. It is not hard to check that the conditional mean of $\epsilon$ is zero:

$$
E[\epsilon \mid X = x] = E[Y - f(X) \mid X = x] = E[Y \mid X = x] - f(x) = 0.
$$

Thus the second term in the decomposition of $E[(Y - \hat{f}(X))^2 \mid X = x]$ is the irreducible error (since $E[\epsilon \mid X = x] = 0$, we have $E[\epsilon^2 \mid X = x] = \mathrm{var}[\epsilon \mid X = x]$), and the third term is zero. Note that this decomposition remains valid for a (random) $\hat{f}$ fit to training data because $(X, Y)$ is a test sample that is independent of the training data. In other words, we can average/integrate the decomposition with respect to the training data to obtain

$$
E[(Y - \hat{f}(X))^2 \mid X = x] = E[(f(x) - \hat{f}(x))^2] + \mathrm{var}[\epsilon \mid X = x].
$$
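To spell out the averaging step (the notation $E_{\mathcal{D}}$ for the expectation over the training data $\mathcal{D}$ is mine, not the slides'):

$$
E_{\mathcal{D}}\big[(f(x) - \hat{f}(x))^2 + \mathrm{var}[\epsilon \mid X = x]\big] = E_{\mathcal{D}}\big[(f(x) - \hat{f}(x))^2\big] + \mathrm{var}[\epsilon \mid X = x],
$$

since the irreducible error is a constant with respect to $\mathcal{D}$.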

Second, we decompose the reducible part of the MSE into (squared) bias and variance:

$$
\begin{aligned}
E[(f(x) - \hat{f}(x))^2] &= E\big[(f(x) - E[\hat{f}(x)] + E[\hat{f}(x)] - \hat{f}(x))^2\big] \\
&= \underbrace{(f(x) - E[\hat{f}(x)])^2}_{\mathrm{bias}[\hat{f}(x)]^2} + \underbrace{E\big[(E[\hat{f}(x)] - \hat{f}(x))^2\big]}_{\mathrm{var}[\hat{f}(x)]} + 2E\big[(f(x) - E[\hat{f}(x)])(E[\hat{f}(x)] - \hat{f}(x))\big].
\end{aligned}
$$

The third term is zero because

$$
\begin{aligned}
E\big[(f(x) - E[\hat{f}(x)])(E[\hat{f}(x)] - \hat{f}(x))\big] &= (f(x) - E[\hat{f}(x)])\, E\big[E[\hat{f}(x)] - \hat{f}(x)\big] \\
&= (f(x) - E[\hat{f}(x)])\big(\underbrace{E[\hat{f}(x)] - E[\hat{f}(x)]}_{0}\big).
\end{aligned}
$$
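Combining the two steps recovers the decomposition on p.21. As a numerical sanity check, here is a minimal Monte Carlo sketch; the regression function $f(x) = \sin x$, the noise scale $\sigma(x)$, the misspecified linear fit, and the test point $x_0$ are all hypothetical choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: f(x) = sin(x), heteroscedastic noise sd sigma(x),
# and an intentionally misspecified (hence biased) degree-1 polynomial fit.
f = np.sin
sigma = lambda x: 0.1 + 0.1 * np.abs(x)

x0 = 1.0             # fixed test point
n, reps = 50, 20000  # training-set size, number of replicates

fhat_x0 = np.empty(reps)  # \hat{f}(x0) across independent training sets
sq_err = np.empty(reps)   # (Y - \hat{f}(x0))^2 with a fresh test response

for r in range(reps):
    # draw a training set and fit the linear model
    X = rng.uniform(-3.0, 3.0, size=n)
    Y = f(X) + sigma(X) * rng.standard_normal(n)
    b1, b0 = np.polyfit(X, Y, deg=1)  # slope, intercept
    fhat_x0[r] = b0 + b1 * x0
    # fresh test response at X = x0, independent of the training data
    y0 = f(x0) + sigma(x0) * rng.standard_normal()
    sq_err[r] = (y0 - fhat_x0[r]) ** 2

bias2 = (f(x0) - fhat_x0.mean()) ** 2  # squared bias at x0
var = fhat_x0.var()                    # variance of \hat{f}(x0)
irreducible = sigma(x0) ** 2           # var[eps | X = x0]

print(f"Monte Carlo MSE at x0:      {sq_err.mean():.4f}")
print(f"bias^2 + var + irreducible: {bias2 + var + irreducible:.4f}")
```

The two printed numbers should agree up to Monte Carlo error.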

Posted on September 01, 2021 from Ann Arbor, MI