This post supplements the supervised learning slides. Please see the slides for the setup.
We wish to show that the conditional expectation \(\Ex\big[Y\mid X=x\big]\) is the minimum mean squared error (MSE) prediction function \(f_*\) of \(Y\) from \(X\); i.e.
\[\Ex\big[(Y - f_*(X))^2\big] \le \Ex\big[(Y - f(X))^2\big]\text{ for any (other) function }f.\]

First, we note that the problem of finding the minimum MSE prediction function of \(Y\) from \(X\) is equivalent to the problem of finding, for each \(x\), the minimum MSE constant prediction of \(Y_x \triangleq Y\mid X=x\); i.e. finding the constant \(\mu_x\in\reals\) such that
\[\Ex\big[(Y_x - \mu_x)^2\big] \le \Ex\big[(Y_x - c)^2\big]\text{ for any (other) constant }c\in\reals.\]

This is because the minimum MSE prediction function \(f_*\) must equal \(\mu_x\) at \(x\); i.e. \(f_*(x) = \mu_x\). Otherwise, it is possible to reduce the MSE of \(f_*\) by replacing its value at \(x\) with \(\mu_x\):
\[f(x') = \begin{cases}\mu_x & \text{if }x' = x, \\ f_*(x') & \text{otherwise}.\end{cases}\]

Second, we show that \(\mu_x = \Ex\big[Y_x\big]\) by solving the optimization problem \(\min_{c\in\reals}\Ex\big[(Y_x - c)^2\big]\). The cost function seems complicated, but it is actually a quadratic function of \(c\):
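The patching argument above can be checked numerically. Below is a small sketch (not from the post) using a hypothetical synthetic distribution with a discrete \(X\): a candidate predictor that is deliberately wrong at one point has its value there replaced by the sample conditional mean, which lowers the empirical MSE.

```python
import random

random.seed(0)

# Hypothetical synthetic data: X uniform on {0, 1, 2}, Y = 2*X + Gaussian noise.
n = 100_000
data = []
for _ in range(n):
    x = random.randrange(3)
    data.append((x, 2 * x + random.gauss(0, 1)))

def mse(f):
    """Empirical mean squared error of predictor f on the sample."""
    return sum((y - f(x)) ** 2 for x, y in data) / n

# A candidate predictor that is deliberately off at x = 1.
f_candidate = lambda x: 2 * x + (0.7 if x == 1 else 0.0)

# Sample conditional mean of Y given X = 1.
ys_at_1 = [y for x, y in data if x == 1]
mu_1 = sum(ys_at_1) / len(ys_at_1)

# Patch the candidate at x = 1 with the conditional mean, as in the display above.
f_patched = lambda x: mu_1 if x == 1 else f_candidate(x)

assert mse(f_patched) < mse(f_candidate)
```

The inequality holds because the patch only changes predictions at \(x = 1\), where the conditional mean is the best constant, so the MSE contribution from that point can only shrink.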
\[\Ex\big[(Y_x - c)^2\big] = \Ex\big[Y_x^2\big] - 2c\Ex\big[Y_x\big] + c^2.\]

Differentiating with respect to \(c\) gives \(\frac{d}{dc}\Ex\big[(Y_x - c)^2\big] = -2\Ex\big[Y_x\big] + 2c\), which vanishes at \(c = \Ex\big[Y_x\big]\); since the quadratic opens upward, this root is the minimizer, so \(\mu_x = \Ex\big[Y_x\big]\). Recalling \(f_*(x) = \mu_x\) from the first part, we conclude
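This second step can also be sanity-checked numerically. The sketch below (a hypothetical example, using an arbitrary Gaussian as the conditional distribution of \(Y_x\)) verifies both the quadratic expansion and that the sample mean beats other constant predictions.

```python
import random

random.seed(0)

# Hypothetical samples of Y_x, i.e. Y conditioned on a fixed X = x.
# The argument holds for any Y_x with finite variance.
samples = [random.gauss(3.0, 2.0) for _ in range(50_000)]
mean = sum(samples) / len(samples)

def mse(c):
    """Empirical MSE of the constant prediction c."""
    return sum((y - c) ** 2 for y in samples) / len(samples)

# The sample mean should beat any other constant prediction.
for c in [mean - 1.0, mean - 0.1, mean + 0.1, mean + 1.0]:
    assert mse(mean) < mse(c)

# The quadratic expansion E[Y_x^2] - 2c E[Y_x] + c^2 matches the direct MSE.
ey2 = sum(y * y for y in samples) / len(samples)
c = 1.234
assert abs(mse(c) - (ey2 - 2 * c * mean + c * c)) < 1e-9
```

Here the expansion is exact for the empirical distribution, so the second assertion holds up to floating-point error.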
\[f_*(x) = \Ex\big[Y_x\big] = \Ex\big[Y\mid X=x\big].\]

Posted on August 30, 2021 from Ann Arbor, MI