STATS 413

Bias-variance decomposition

This post supplements the supervised learning slides. Please see the slides for the setup.

We wish to derive the bias-variance decomposition on p.21 (of the slides):

$$
E[(Y - \hat{f}(X))^2 \mid X = x] = \mathrm{bias}[\hat{f}(x)]^2 + \mathrm{var}[\hat{f}(x)] + \mathrm{var}[\epsilon \mid X = x],
$$

where

$$
\mathrm{bias}[\hat{f}(x)] \triangleq f(x) - E[\hat{f}(x)], \qquad \mathrm{var}[\hat{f}(x)] = E\big[(\hat{f}(x) - E[\hat{f}(x)])^2\big].
$$

All the expectations (unless otherwise stated) are with respect to $(X, Y)$ and $\hat{f}$. Note that the irreducible error $\mathrm{var}[\epsilon \mid X = x]$ depends on $x$. This is the more general form of the irreducible error for heteroscedastic problems, in which the (conditional) variance of $\epsilon$ depends on $x$.
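For instance (a hypothetical model, not from the slides), suppose the noise scale varies with the input:

$$
Y = f(X) + \sigma(X) Z, \qquad Z \sim N(0, 1) \text{ independent of } X,
$$

so that $\mathrm{var}[\epsilon \mid X = x] = \sigma(x)^2$ changes with $x$; the homoscedastic case corresponds to a constant $\sigma$.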

First, we decompose the MSE of a fixed $\hat{f}$ into reducible and irreducible parts (see p.5):

$$
\begin{aligned}
E[(Y - \hat{f}(X))^2 \mid X = x] &= E[(f(X) + \epsilon - \hat{f}(X))^2 \mid X = x] \\
&= E[(f(X) - \hat{f}(X))^2 \mid X = x] + E[\epsilon^2 \mid X = x] + 2E[(f(X) - \hat{f}(X))\epsilon \mid X = x] \\
&= \underbrace{(f(x) - \hat{f}(x))^2}_{\text{reducible error}} + E[\epsilon^2 \mid X = x] + 2(f(x) - \hat{f}(x))E[\epsilon \mid X = x],
\end{aligned}
$$

where $f(x) = E[Y \mid X = x]$ is the regression function. It is not hard to check that the conditional mean of $\epsilon$ is zero:

$$
E[\epsilon \mid X = x] = E[Y - f(X) \mid X = x] = E[Y \mid X = x] - f(x) = 0.
$$

Thus the second term in the decomposition of $E[(Y - \hat{f}(X))^2 \mid X = x]$ is the irreducible error (since $E[\epsilon \mid X = x] = 0$, we have $E[\epsilon^2 \mid X = x] = \mathrm{var}[\epsilon \mid X = x]$), and the third term is zero. Note that this decomposition remains valid for a (random) $\hat{f}$ fit to training data because $(X, Y)$ is a test sample that is independent of the training data. In other words, we can average/integrate the decomposition with respect to the training data to obtain

$$
E[(Y - \hat{f}(X))^2 \mid X = x] = E[(f(x) - \hat{f}(x))^2] + \mathrm{var}[\epsilon \mid X = x].
$$
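To spell out the averaging step (the notation $E_{\mathcal{D}}$ for the expectation over the training data $\mathcal{D}$ is mine, not the slides'):

$$
E_{\mathcal{D}}\big[(f(x) - \hat{f}(x))^2 + \mathrm{var}[\epsilon \mid X = x]\big] = E_{\mathcal{D}}\big[(f(x) - \hat{f}(x))^2\big] + \mathrm{var}[\epsilon \mid X = x],
$$

since the irreducible error is a constant with respect to $\mathcal{D}$.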

Second, we decompose the reducible part of the MSE into (squared) bias and variance:

$$
\begin{aligned}
E[(f(x) - \hat{f}(x))^2] &= E\big[(f(x) - E[\hat{f}(x)] + E[\hat{f}(x)] - \hat{f}(x))^2\big] \\
&= \underbrace{(f(x) - E[\hat{f}(x)])^2}_{\mathrm{bias}[\hat{f}(x)]^2} + \underbrace{E\big[(E[\hat{f}(x)] - \hat{f}(x))^2\big]}_{\mathrm{var}[\hat{f}(x)]} + 2E\big[(f(x) - E[\hat{f}(x)])(E[\hat{f}(x)] - \hat{f}(x))\big].
\end{aligned}
$$

The third term is zero because

$$
\begin{aligned}
E\big[(f(x) - E[\hat{f}(x)])(E[\hat{f}(x)] - \hat{f}(x))\big] &= (f(x) - E[\hat{f}(x)])\, E\big[E[\hat{f}(x)] - \hat{f}(x)\big] \\
&= (f(x) - E[\hat{f}(x)])\big(\underbrace{E[\hat{f}(x)] - E[\hat{f}(x)]}_{0}\big).
\end{aligned}
$$
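Combining the two steps recovers the decomposition on p.21. As a numerical sanity check, here is a minimal Monte Carlo sketch; the regression function $f(x) = \sin x$, the noise scale $\sigma(x)$, the misspecified linear fit, and the test point $x_0$ are all hypothetical choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: f(x) = sin(x), heteroscedastic noise sd sigma(x),
# and an intentionally misspecified (hence biased) degree-1 polynomial fit.
f = np.sin
sigma = lambda x: 0.1 + 0.1 * np.abs(x)

x0 = 1.0             # fixed test point
n, reps = 50, 20000  # training-set size, number of replicates

fhat_x0 = np.empty(reps)  # \hat{f}(x0) across independent training sets
sq_err = np.empty(reps)   # (Y - \hat{f}(x0))^2 with a fresh test response

for r in range(reps):
    # draw a training set and fit the linear model
    X = rng.uniform(-3.0, 3.0, size=n)
    Y = f(X) + sigma(X) * rng.standard_normal(n)
    b1, b0 = np.polyfit(X, Y, deg=1)  # slope, intercept
    fhat_x0[r] = b0 + b1 * x0
    # fresh test response at X = x0, independent of the training data
    y0 = f(x0) + sigma(x0) * rng.standard_normal()
    sq_err[r] = (y0 - fhat_x0[r]) ** 2

bias2 = (f(x0) - fhat_x0.mean()) ** 2  # squared bias at x0
var = fhat_x0.var()                    # variance of \hat{f}(x0)
irreducible = sigma(x0) ** 2           # var[eps | X = x0]

print(f"Monte Carlo MSE at x0:      {sq_err.mean():.4f}")
print(f"bias^2 + var + irreducible: {bias2 + var + irreducible:.4f}")
```

The two printed numbers should agree up to Monte Carlo error.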

Posted on September 01, 2021 from Ann Arbor, MI