This post supplements the supervised learning slides. Please see the slides for the setup.
We wish to show that the conditional expectation \(\Ex\big[Y\mid X=x\big]\) is the minimum mean squared error (MSE) prediction function \(f_*\) of \(Y\) from \(X\); i.e.
\[\Ex\big[(Y - f_*(X))^2\big] \le \Ex\big[(Y - f(X))^2\big]\text{ for any (other) function }f.\]

First, we note that the problem of finding the minimum MSE prediction function of \(Y\) from \(X\) is equivalent to the problem of finding, for each \(x\), the minimum MSE constant prediction of \(Y_x \triangleq Y\mid X=x\); i.e. finding the constant \(\mu_x\in\reals\) such that
\[\Ex\big[(Y_x - \mu_x)^2\big] \le \Ex\big[(Y_x - c)^2\big]\text{ for any (other) constant }c\in\reals.\]

This is because the minimum MSE prediction function \(f_*\) must equal \(\mu_x\) at \(x\); i.e. \(f_*(x) = \mu_x\). Otherwise, it is possible to reduce the MSE of \(f_*\) by replacing its value at \(x\) with \(\mu_x\):
\[f(x') = \begin{cases}\mu_x & \text{if }x' = x, \\ f_*(x') & \text{otherwise}.\end{cases}\]

Second, we show that \(\mu_x = \Ex\big[Y_x\big]\) by solving the optimization problem \(\min_{c\in\reals}\Ex\big[(Y_x - c)^2\big]\). The cost function seems complicated, but it is actually a quadratic function of \(c\):
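The patching argument above can be checked numerically. Below is a small sketch (not from the post) using a hypothetical synthetic distribution with a discrete \(X\): a candidate predictor that is deliberately wrong at one point has its value there replaced by the sample conditional mean, which lowers the empirical MSE.

```python
import random

random.seed(0)

# Hypothetical synthetic data: X uniform on {0, 1, 2}, Y = 2*X + Gaussian noise.
n = 100_000
data = []
for _ in range(n):
    x = random.randrange(3)
    data.append((x, 2 * x + random.gauss(0, 1)))

def mse(f):
    """Empirical mean squared error of predictor f on the sample."""
    return sum((y - f(x)) ** 2 for x, y in data) / n

# A candidate predictor that is deliberately off at x = 1.
f_candidate = lambda x: 2 * x + (0.7 if x == 1 else 0.0)

# Sample conditional mean of Y given X = 1.
ys_at_1 = [y for x, y in data if x == 1]
mu_1 = sum(ys_at_1) / len(ys_at_1)

# Patch the candidate at x = 1 with the conditional mean, as in the display above.
f_patched = lambda x: mu_1 if x == 1 else f_candidate(x)

assert mse(f_patched) < mse(f_candidate)
```

The inequality holds because the patch only changes predictions at \(x = 1\), where the conditional mean is the best constant, so the MSE contribution from that point can only shrink.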
\[\Ex\big[(Y_x - c)^2\big] = \Ex\big[Y_x^2\big] - 2c\Ex\big[Y_x\big] + c^2.\]

Differentiating with respect to \(c\) gives \(\frac{d}{dc}\Ex\big[(Y_x - c)^2\big] = -2\Ex\big[Y_x\big] + 2c\), which vanishes at \(c = \Ex\big[Y_x\big]\); since the quadratic opens upward, this root is the minimizer, so \(\mu_x = \Ex\big[Y_x\big]\). Recalling \(f_*(x) = \mu_x\) from the first part, we conclude
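This second step can also be sanity-checked numerically. The sketch below (a hypothetical example, using an arbitrary Gaussian as the conditional distribution of \(Y_x\)) verifies both the quadratic expansion and that the sample mean beats other constant predictions.

```python
import random

random.seed(0)

# Hypothetical samples of Y_x, i.e. Y conditioned on a fixed X = x.
# The argument holds for any Y_x with finite variance.
samples = [random.gauss(3.0, 2.0) for _ in range(50_000)]
mean = sum(samples) / len(samples)

def mse(c):
    """Empirical MSE of the constant prediction c."""
    return sum((y - c) ** 2 for y in samples) / len(samples)

# The sample mean should beat any other constant prediction.
for c in [mean - 1.0, mean - 0.1, mean + 0.1, mean + 1.0]:
    assert mse(mean) < mse(c)

# The quadratic expansion E[Y_x^2] - 2c E[Y_x] + c^2 matches the direct MSE.
ey2 = sum(y * y for y in samples) / len(samples)
c = 1.234
assert abs(mse(c) - (ey2 - 2 * c * mean + c * c)) < 1e-9
```

Here the expansion is exact for the empirical distribution, so the second assertion holds up to floating-point error.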
\[f_*(x) = \Ex\big[Y_x\big] = \Ex\big[Y\mid X=x\big].\]

Posted on August 30, 2021 from Ann Arbor, MI