This post accompanies the lecture video on constrained optimization. Please see the video for the problem setup and definitions.
We start by noting an important consequence of the projection theorem: projections are non-expansive. The projection theorem implies
\[\def\cC{\mathcal{C}} \begin{aligned} \langle x - P_{\cC}(x), P_{\cC}(y) - P_{\cC}(x)\rangle \le 0, \\ \langle P_{\cC}(y) - y, P_{\cC}(y) - P_{\cC}(x)\rangle \le 0. \end{aligned}\]Adding the two inequalities and rearranging, we obtain
\[\langle P_{\cC}(y) - P_{\cC}(x), P_{\cC}(y) - P_{\cC}(x)\rangle \le \langle y - x, P_{\cC}(y) - P_{\cC}(x)\rangle.\]This property is known as firm non-expansiveness. We recognize the left side as \(\|P_{\cC}(x) - P_{\cC}(y)\|_2^2\), and the right side is at most \(\|x-y\|_2\|P_{\cC}(x) - P_{\cC}(y)\|_2\) by the Cauchy-Schwarz inequality. Dividing both sides by \(\|P_{\cC}(x) - P_{\cC}(y)\|_2\) (the inequality is trivial when this norm vanishes), we obtain non-expansiveness:
\[\|P_{\cC}(x) - P_{\cC}(y)\|_2 \le \|x-y\|_2.\]
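As a quick numerical sanity check, here is a minimal sketch of non-expansiveness, taking \(\cC\) to be the Euclidean unit ball (an illustrative choice of mine; any closed convex set works):

```python
import numpy as np

# Numerical sanity check of non-expansiveness. We take C to be the
# Euclidean unit ball (an illustrative choice; any closed convex set works).
def project_ball(x, r=1.0):
    norm = np.linalg.norm(x)
    return x if norm <= r else (r / norm) * x

rng = np.random.default_rng(0)
for _ in range(1000):
    x, y = rng.normal(size=5), rng.normal(size=5)
    lhs = np.linalg.norm(project_ball(x) - project_ball(y))
    assert lhs <= np.linalg.norm(x - y) + 1e-12  # non-expansiveness
```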
We now turn to the convergence of projected gradient descent (PGD). First, we assume \(\def\intr{\rm int}x_*\in\intr(\cC)\). The PGD update rule implies
\[\begin{aligned} \|x_{t+1} - x_*\|_2 &= \|P_{\cC}(x_t - \eta_t\partial f(x_t)) - P_{\cC}(x_*)\|_2 \\ &\le \|x_t - \eta_t\partial f(x_t) - x_*\|_2 \\ &\le \frac{\kappa - 1}{\kappa + 1}\|x_t - x_*\|_2, \end{aligned}\]where we used \(x_* = P_{\cC}(x_*)\) (valid because \(x_*\in\cC\)) in the first step, appealed to the non-expansiveness of projection in the second step, and recalled the convergence result for gradient descent in the third step. We used the condition \(x_*\in\intr(\cC)\) (implicitly) in the third step because the convergence result for gradient descent relies on the fact that \(\partial f(x_*) = 0\). We iterate the preceding bound to obtain the stated result.
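Here is a minimal sketch of this case, under illustrative assumptions: a strongly convex quadratic over a box whose unconstrained minimizer is interior, with the step size \(\eta_t = \frac{2}{\mu+L}\) behind the \(\frac{\kappa-1}{\kappa+1}\) rate for gradient descent.

```python
import numpy as np

# PGD on a strongly convex quadratic f(x) = 0.5 x^T A x - b^T x over a
# box, with the unconstrained minimizer in the interior of the box
# (all problem data below are illustrative assumptions).
mu, L = 1.0, 10.0
A = np.diag([mu, L])
b = np.array([0.5, 2.0])
x_star = np.linalg.solve(A, b)            # (0.5, 0.2): interior of the box
project = lambda x: np.clip(x, -1.0, 1.0)

eta = 2.0 / (mu + L)                      # step size behind the (kappa-1)/(kappa+1) rate
rate = (L / mu - 1.0) / (L / mu + 1.0)
x = np.array([1.0, -1.0])
for _ in range(20):
    prev = np.linalg.norm(x - x_star)
    x = project(x - eta * (A @ x - b))    # PGD update
    assert np.linalg.norm(x - x_star) <= rate * prev + 1e-12
```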
Second, we drop the assumption that \(x_*\in\intr(\cC)\). Define the projected gradient step (with step size \(\eta_t\)) as
\[g_{\cC}(x) \triangleq \frac{1}{\eta_t}(x - P_{\cC}(x - \eta_t\partial f(x))).\]This is the analogue of the gradient step \(\partial f(x)\) for constrained optimization problems; many properties of \(\partial f(x)\) carry over to \(g_{\cC}(x)\). For example, \(g_{\cC}(x) = 0\) is the analogue of the usual zero-gradient optimality condition for constrained optimization problems.
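The following sketch illustrates this optimality condition on a toy problem (the choices of \(f\) and \(\cC\) here are illustrative): at a constrained minimizer on the boundary of the unit ball, the ordinary gradient is nonzero but the projected gradient step vanishes.

```python
import numpy as np

# At the constrained minimizer the projected gradient step vanishes even
# though the ordinary gradient does not. Illustrative problem: f(x) =
# 0.5 ||x - c||^2 with c outside the unit ball C, so x_* = c / ||c||.
def project_ball(x, r=1.0):
    norm = np.linalg.norm(x)
    return x if norm <= r else (r / norm) * x

def projected_gradient_step(x, grad, eta):
    return (x - project_ball(x - eta * grad(x))) / eta

c = np.array([2.0, 0.0])
grad = lambda x: x - c
x_star = c / np.linalg.norm(c)            # boundary point (1, 0)

print(np.linalg.norm(grad(x_star)))                                # 1.0: nonzero gradient
print(np.linalg.norm(projected_gradient_step(x_star, grad, 0.1)))  # 0.0: g_C vanishes
```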
In terms of this step, the PGD update rule is
\[x_{t+1} \gets x_t - \eta_tg_{\cC}(x_t).\]For constant step sizes \(\eta_t = \frac1L\), we have
\[\begin{aligned} \|x_{t+1} - x_*\|_2^2 &\textstyle= \|x_t - x_* - \frac1Lg_{\cC}(x_t)\|_2^2 \\ &= \|x_t - x_*\|_2^2 + \frac{1}{L^2}\|g_{\cC}(x_t)\|_2^2 - \frac2L\langle g_{\cC}(x_t),x_t - x_*\rangle. \end{aligned}\]This mirrors the corresponding expansion (with \(\partial f(x_t)\) in lieu of \(g_{\cC}(x_t)\)) from the convergence analysis of gradient descent under the regularity condition. As we shall see, there is a version of the regularity condition for the projected gradient step:
\[\langle g_{\cC}(x),x - x_*\rangle \ge \frac{\mu}{2}\|x - x_*\|_2^2 + \frac{1}{2L}\|g_{\cC}(x)\|_2^2\text{ for any }x\in\cC.\]With this condition in place, we proceed like in the convergence analysis of gradient descent under the regularity condition:
\[\begin{aligned} \|x_{t+1} - x_*\|_2^2 &= \|x_t - x_*\|_2^2 + \frac{1}{L^2}\|g_{\cC}(x_t)\|_2^2 - \frac2L\langle g_{\cC}(x_t),x_t - x_*\rangle \\ &\le \|x_t - x_*\|_2^2 - \frac{\mu}{L}\|x_t - x_*\|_2^2 \\ &= (1-\frac{\mu}{L})\|x_t - x_*\|_2^2. \end{aligned}\]We iterate this bound to obtain the stated result.
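As a sanity check on this contraction, here is a sketch with an illustrative quadratic over the unit ball whose unconstrained minimizer lies outside \(\cC\), so that \(x_*\) sits on the boundary:

```python
import numpy as np

# Check of the (1 - mu/L) contraction in squared distance for PGD with
# step size 1/L when x_* lies on the boundary of C. Illustrative data:
# a quadratic whose unconstrained minimizer (4, 0) is outside the unit
# ball, so the constrained minimizer is x_* = (1, 0) (verified by KKT).
mu, L = 1.0, 10.0
A = np.diag([mu, L])
b = np.array([4.0, 0.0])
grad = lambda x: A @ x - b

def project_ball(x, r=1.0):
    norm = np.linalg.norm(x)
    return x if norm <= r else (r / norm) * x

x_star = np.array([1.0, 0.0])
x = np.array([0.0, 1.0])
for _ in range(30):
    prev_sq = np.linalg.norm(x - x_star) ** 2
    x = project_ball(x - grad(x) / L)     # PGD update with eta = 1/L
    assert np.linalg.norm(x - x_star) ** 2 <= (1 - mu / L) * prev_sq + 1e-12
```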
It remains to show the regularity condition for the projected gradient step. As long as the cost function is \(\mu\)-strongly convex and \(L\)-strongly smooth, we have
\[\begin{aligned} 0 &\le f(x_{t+1}) - f(x_*) \\ &= f(x_{t+1}) - f(x_t) + f(x_t) - f(x_*) \\ &\le \langle\partial f(x_t),x_{t+1} - x_t\rangle + \frac{L}{2}\|x_{t+1} - x_t\|_2^2 + \langle\partial f(x_t),x_t - x_*\rangle - \frac{\mu}{2}\|x_t - x_*\|_2^2 \\ &= \langle\partial f(x_t),x_{t+1} - x_*\rangle + \frac{1}{2L}\|g_{\cC}(x_t)\|_2^2 - \frac{\mu}{2}\|x_t - x_*\|_2^2, \end{aligned}\]where the first step holds because \(x_{t+1}\in\cC\), we appealed to strong smoothness and strong convexity in the third step, and we plugged in the update rule \(x_{t+1} - x_t = -\frac1Lg_{\cC}(x_t)\) in the fourth step. We note that the projection theorem implies
\[\eta_t\langle\partial f(x_t) - g_{\cC}(x_t),x_{t+1} - x_*\rangle = \langle x_{t+1} - (x_t - \eta_t\partial f(x_t)),x_{t+1} - x_*\rangle \le 0,\]so
\[\begin{aligned} 0 &\le \langle\partial f(x_t),x_{t+1} - x_*\rangle + \frac{1}{2L}\|g_{\cC}(x_t)\|_2^2 - \frac{\mu}{2}\|x_t - x_*\|_2^2 \\ &\le \langle g_{\cC}(x_t),x_{t+1} - x_*\rangle + \frac{1}{2L}\|g_{\cC}(x_t)\|_2^2 - \frac{\mu}{2}\|x_t - x_*\|_2^2 \\ &= \langle g_{\cC}(x_t),x_t - x_*\rangle - \frac{1}{2L}\|g_{\cC}(x_t)\|_2^2 - \frac{\mu}{2}\|x_t - x_*\|_2^2, \end{aligned}\]where we used the preceding inequality in the second step and plugged in \(x_{t+1} = x_t - \frac1Lg_{\cC}(x_t)\) in the third step. We rearrange to obtain the regularity condition for the projected gradient step.
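We can also spot-check the regularity condition numerically at random feasible points, reusing the illustrative problem from the previous sketch:

```python
import numpy as np

# Spot-check of the regularity condition for the projected gradient step,
# <g_C(x), x - x_*> >= (mu/2) ||x - x_*||^2 + (1/(2L)) ||g_C(x)||^2,
# at random feasible points (same illustrative quadratic and unit-ball
# constraint as above, with x_* = (1, 0)).
mu, L = 1.0, 10.0
A = np.diag([mu, L])
b = np.array([4.0, 0.0])
grad = lambda x: A @ x - b

def project_ball(x, r=1.0):
    norm = np.linalg.norm(x)
    return x if norm <= r else (r / norm) * x

g_C = lambda x: L * (x - project_ball(x - grad(x) / L))  # eta = 1/L
x_star = np.array([1.0, 0.0])

rng = np.random.default_rng(1)
for _ in range(1000):
    x = project_ball(2.0 * rng.normal(size=2))           # random point in C
    lhs = g_C(x) @ (x - x_star)
    rhs = 0.5 * mu * np.linalg.norm(x - x_star) ** 2 \
        + 0.5 / L * np.linalg.norm(g_C(x)) ** 2
    assert lhs >= rhs - 1e-10
```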
Finally, we turn to the convergence of the Frank-Wolfe (FW) method. We have
\[\begin{aligned} f(x_{t+1}) - f(x_t) &\le \langle\partial f(x_t),x_{t+1} - x_t\rangle + \frac{L}{2}\|x_{t+1} - x_t\|_2^2 \\ &= \eta_t\langle\partial f(x_t),x_{t+\frac12} - x_t\rangle + \frac{L}{2}\eta_t^2\|x_{t+\frac12} - x_t\|_2^2 \\ &\le \eta_t\langle\partial f(x_t),x_* - x_t\rangle + \frac{L}{2}\eta_t^2D^2 \\ &\le \eta_t(f(x_*) - f(x_t)) + \frac{L}{2}\eta_t^2D^2, \end{aligned}\]where we appealed to the strong smoothness of \(f\) in the first step, plugged in the FW update rule in the second step, recalled that \(x_{t+\frac12}\) solves the FW subproblem and that \(\cC\) is bounded with diameter \(D\) in the third step, and appealed to the convexity of \(f\) in the fourth step. We rearrange to obtain a recursion on the sequence of suboptimality gaps \(\epsilon_t \triangleq f(x_t) - f(x_*)\):
\[\epsilon_{t+1} \le (1-\eta_t)\epsilon_t + \frac{L}{2}\eta_t^2D^2.\]We induct on \(t\) to obtain the stated result.
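Here is a minimal FW sketch under illustrative assumptions (a quadratic over the unit Euclidean ball, where the FW subproblem has the closed-form solution \(-\partial f(x)/\|\partial f(x)\|_2\)), with the classical step size \(\eta_t = \frac{2}{t+2}\); inducting on the recursion with this choice gives \(\epsilon_t \le \frac{2LD^2}{t+2}\), which the loop checks:

```python
import numpy as np

# FW on an illustrative quadratic over the unit Euclidean ball, where the
# FW subproblem has the closed form s = -grad / ||grad||. With the
# classical step size eta_t = 2/(t + 2), inducting on the recursion gives
# f(x_t) - f(x_*) <= 2 L D^2 / (t + 2).
mu, L = 1.0, 10.0
A = np.diag([mu, L])
b = np.array([4.0, 0.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b

def fw_subproblem(g):
    # argmin over the unit ball of <g, s>
    return -g / np.linalg.norm(g)

D = 2.0                                   # diameter of the unit ball
x = np.zeros(2)
f_star = f(np.array([1.0, 0.0]))          # constrained minimizer (1, 0)
for t in range(100):
    eta = 2.0 / (t + 2)
    x = x + eta * (fw_subproblem(grad(x)) - x)   # FW update
    assert f(x) - f_star <= 2 * L * D**2 / (t + 3) + 1e-12
```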
Posted on January 26, 2021 from San Francisco, CA.