This post accompanies the lecture video on constrained optimization. Please see the video for the problem setup and definitions.
We start by noting an important consequence of the projection theorem: projections are non-expansive. The projection theorem implies
\[\def\cC{\mathcal{C}} \begin{aligned} \langle x - P_{\cC}(x), P_{\cC}(y) - P_{\cC}(x)\rangle \le 0, \\ \langle P_{\cC}(y) - y, P_{\cC}(y) - P_{\cC}(x)\rangle \le 0. \end{aligned}\]Adding the two inequalities and rearranging, we obtain
\[\langle P_{\cC}(y) - P_{\cC}(x), P_{\cC}(y) - P_{\cC}(x)\rangle \le \langle y - x, P_{\cC}(y) - P_{\cC}(x)\rangle.\]This property is known as firm non-expansiveness. We recognize the left side as \(\|P_{\cC}(x) - P_{\cC}(y)\|_2^2\), and the right side is at most \(\|x-y\|_2\|P_{\cC}(x) - P_{\cC}(y)\|_2\) by the Cauchy-Schwarz inequality. Dividing both sides by \(\|P_{\cC}(x) - P_{\cC}(y)\|_2\) (the inequality is trivial when this norm vanishes), we obtain non-expansiveness:
\[\|P_{\cC}(x) - P_{\cC}(y)\|_2 \le \|x-y\|_2.\]
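As a quick numerical sanity check, here is a minimal sketch of non-expansiveness, taking \(\cC\) to be the Euclidean unit ball (an illustrative choice of mine; any closed convex set works):

```python
import numpy as np

# Numerical sanity check of non-expansiveness. We take C to be the
# Euclidean unit ball (an illustrative choice; any closed convex set works).
def project_ball(x, r=1.0):
    norm = np.linalg.norm(x)
    return x if norm <= r else (r / norm) * x

rng = np.random.default_rng(0)
for _ in range(1000):
    x, y = rng.normal(size=5), rng.normal(size=5)
    lhs = np.linalg.norm(project_ball(x) - project_ball(y))
    assert lhs <= np.linalg.norm(x - y) + 1e-12  # non-expansiveness
```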
We now turn to the convergence of projected gradient descent (PGD). First, we assume \(\def\intr{\rm int}x_*\in\intr(\cC)\). The PGD update rule implies
\[\begin{aligned} \|x_{t+1} - x_*\|_2 &= \|P_{\cC}(x_t - \eta_t\partial f(x_t)) - P_{\cC}(x_*)\|_2 \\ &\le \|x_t - \eta_t\partial f(x_t) - x_*\|_2 \\ &\le \frac{\kappa - 1}{\kappa + 1}\|x_t - x_*\|_2, \end{aligned}\]where we used \(x_* = P_{\cC}(x_*)\) (valid because \(x_*\in\cC\)) in the first step, appealed to the non-expansiveness of projection in the second step, and recalled the convergence result for gradient descent in the third step. We used the condition \(x_*\in\intr(\cC)\) (implicitly) in the third step because the convergence result for gradient descent relies on the fact that \(\partial f(x_*) = 0\). We iterate the preceding bound to obtain the stated result.
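Here is a minimal sketch of this case, under illustrative assumptions: a strongly convex quadratic over a box whose unconstrained minimizer is interior, with the step size \(\eta_t = \frac{2}{\mu+L}\) behind the \(\frac{\kappa-1}{\kappa+1}\) rate for gradient descent.

```python
import numpy as np

# PGD on a strongly convex quadratic f(x) = 0.5 x^T A x - b^T x over a
# box, with the unconstrained minimizer in the interior of the box
# (all problem data below are illustrative assumptions).
mu, L = 1.0, 10.0
A = np.diag([mu, L])
b = np.array([0.5, 2.0])
x_star = np.linalg.solve(A, b)            # (0.5, 0.2): interior of the box
project = lambda x: np.clip(x, -1.0, 1.0)

eta = 2.0 / (mu + L)                      # step size behind the (kappa-1)/(kappa+1) rate
rate = (L / mu - 1.0) / (L / mu + 1.0)
x = np.array([1.0, -1.0])
for _ in range(20):
    prev = np.linalg.norm(x - x_star)
    x = project(x - eta * (A @ x - b))    # PGD update
    assert np.linalg.norm(x - x_star) <= rate * prev + 1e-12
```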
Second, we drop the assumption that \(x_*\in\intr(\cC)\). Define the projected gradient step (with step size \(\eta_t\)) as
\[g_{\cC}(x) \triangleq \frac{1}{\eta_t}(x - P_{\cC}(x - \eta_t\partial f(x))).\]This is the analogue of the gradient step \(\partial f(x)\) for constrained optimization problems; many properties of \(\partial f(x)\) carry over to \(g_{\cC}(x)\). For example, \(g_{\cC}(x) = 0\) is the analogue of the usual zero-gradient optimality condition for constrained optimization problems.
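The following sketch illustrates this optimality condition on a toy problem (the choices of \(f\) and \(\cC\) here are illustrative): at a constrained minimizer on the boundary of the unit ball, the ordinary gradient is nonzero but the projected gradient step vanishes.

```python
import numpy as np

# At the constrained minimizer the projected gradient step vanishes even
# though the ordinary gradient does not. Illustrative problem: f(x) =
# 0.5 ||x - c||^2 with c outside the unit ball C, so x_* = c / ||c||.
def project_ball(x, r=1.0):
    norm = np.linalg.norm(x)
    return x if norm <= r else (r / norm) * x

def projected_gradient_step(x, grad, eta):
    return (x - project_ball(x - eta * grad(x))) / eta

c = np.array([2.0, 0.0])
grad = lambda x: x - c
x_star = c / np.linalg.norm(c)            # boundary point (1, 0)

print(np.linalg.norm(grad(x_star)))                                # 1.0: nonzero gradient
print(np.linalg.norm(projected_gradient_step(x_star, grad, 0.1)))  # 0.0: g_C vanishes
```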
In terms of this step, the PGD update rule is
\[x_{t+1} \gets x_t - \eta_tg_{\cC}(x_t).\]For constant step sizes \(\eta_t = \frac1L\), we have
\[\begin{aligned} \|x_{t+1} - x_*\|_2^2 &\textstyle= \|x_t - x_* - \frac1Lg_{\cC}(x_t)\|_2^2 \\ &= \|x_t - x_*\|_2^2 + \frac{1}{L^2}\|g_{\cC}(x_t)\|_2^2 - \frac2L\langle g_{\cC}(x_t),x_t - x_*\rangle. \end{aligned}\]This mirrors the corresponding expansion (with \(\partial f(x_t)\) in lieu of \(g_{\cC}(x_t)\)) from the convergence analysis of gradient descent under the regularity condition. As we shall see, there is a version of the regularity condition for the projected gradient step:
\[\langle g_{\cC}(x),x - x_*\rangle \ge \frac{\mu}{2}\|x - x_*\|_2^2 + \frac{1}{2L}\|g_{\cC}(x)\|_2^2\text{ for any }x\in\cC.\]With this condition in place, we proceed like in the convergence analysis of gradient descent under the regularity condition:
\[\begin{aligned} \|x_{t+1} - x_*\|_2^2 &= \|x_t - x_*\|_2^2 + \frac{1}{L^2}\|g_{\cC}(x_t)\|_2^2 - \frac2L\langle g_{\cC}(x_t),x_t - x_*\rangle \\ &\le \|x_t - x_*\|_2^2 - \frac{\mu}{L}\|x_t - x_*\|_2^2 \\ &= (1-\frac{\mu}{L})\|x_t - x_*\|_2^2. \end{aligned}\]We iterate this bound to obtain the stated result.
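As a sanity check on this contraction, here is a sketch with an illustrative quadratic over the unit ball whose unconstrained minimizer lies outside \(\cC\), so that \(x_*\) sits on the boundary:

```python
import numpy as np

# Check of the (1 - mu/L) contraction in squared distance for PGD with
# step size 1/L when x_* lies on the boundary of C. Illustrative data:
# a quadratic whose unconstrained minimizer (4, 0) is outside the unit
# ball, so the constrained minimizer is x_* = (1, 0) (verified by KKT).
mu, L = 1.0, 10.0
A = np.diag([mu, L])
b = np.array([4.0, 0.0])
grad = lambda x: A @ x - b

def project_ball(x, r=1.0):
    norm = np.linalg.norm(x)
    return x if norm <= r else (r / norm) * x

x_star = np.array([1.0, 0.0])
x = np.array([0.0, 1.0])
for _ in range(30):
    prev_sq = np.linalg.norm(x - x_star) ** 2
    x = project_ball(x - grad(x) / L)     # PGD update with eta = 1/L
    assert np.linalg.norm(x - x_star) ** 2 <= (1 - mu / L) * prev_sq + 1e-12
```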
It remains to show the regularity condition for the projected gradient step. As long as the cost function is \(\mu\)-strongly convex and \(L\)-strongly smooth, we have
\[\begin{aligned} 0 &\le f(x_{t+1}) - f(x_*) \\ &= f(x_{t+1}) - f(x_t) + f(x_t) - f(x_*) \\ &\le \langle\partial f(x_t),x_{t+1} - x_t\rangle + \frac{L}{2}\|x_{t+1} - x_t\|_2^2 + \langle\partial f(x_t),x_t - x_*\rangle - \frac{\mu}{2}\|x_t - x_*\|_2^2 \\ &= \langle\partial f(x_t),x_{t+1} - x_*\rangle + \frac{1}{2L}\|g_{\cC}(x_t)\|_2^2 - \frac{\mu}{2}\|x_t - x_*\|_2^2, \end{aligned}\]where the first step holds because \(x_{t+1}\in\cC\), we appealed to strong smoothness and strong convexity in the third step, and we plugged in the update rule \(x_{t+1} - x_t = -\frac1Lg_{\cC}(x_t)\) in the fourth step. We note that the projection theorem implies
\[\eta_t\langle\partial f(x_t) - g_{\cC}(x_t),x_{t+1} - x_*\rangle = \langle x_{t+1} - (x_t - \eta_t\partial f(x_t)),x_{t+1} - x_*\rangle \le 0,\]so
\[\begin{aligned} 0 &\le \langle\partial f(x_t),x_{t+1} - x_*\rangle + \frac{1}{2L}\|g_{\cC}(x_t)\|_2^2 - \frac{\mu}{2}\|x_t - x_*\|_2^2 \\ &\le \langle g_{\cC}(x_t),x_{t+1} - x_*\rangle + \frac{1}{2L}\|g_{\cC}(x_t)\|_2^2 - \frac{\mu}{2}\|x_t - x_*\|_2^2 \\ &= \langle g_{\cC}(x_t),x_t - x_*\rangle - \frac{1}{2L}\|g_{\cC}(x_t)\|_2^2 - \frac{\mu}{2}\|x_t - x_*\|_2^2, \end{aligned}\]where we used the preceding inequality in the second step and plugged in \(x_{t+1} = x_t - \frac1Lg_{\cC}(x_t)\) in the third step. We rearrange to obtain the regularity condition for the projected gradient step.
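We can also spot-check the regularity condition numerically at random feasible points, reusing the illustrative problem from the previous sketch:

```python
import numpy as np

# Spot-check of the regularity condition for the projected gradient step,
# <g_C(x), x - x_*> >= (mu/2) ||x - x_*||^2 + (1/(2L)) ||g_C(x)||^2,
# at random feasible points (same illustrative quadratic and unit-ball
# constraint as above, with x_* = (1, 0)).
mu, L = 1.0, 10.0
A = np.diag([mu, L])
b = np.array([4.0, 0.0])
grad = lambda x: A @ x - b

def project_ball(x, r=1.0):
    norm = np.linalg.norm(x)
    return x if norm <= r else (r / norm) * x

g_C = lambda x: L * (x - project_ball(x - grad(x) / L))  # eta = 1/L
x_star = np.array([1.0, 0.0])

rng = np.random.default_rng(1)
for _ in range(1000):
    x = project_ball(2.0 * rng.normal(size=2))           # random point in C
    lhs = g_C(x) @ (x - x_star)
    rhs = 0.5 * mu * np.linalg.norm(x - x_star) ** 2 \
        + 0.5 / L * np.linalg.norm(g_C(x)) ** 2
    assert lhs >= rhs - 1e-10
```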
Finally, we turn to the convergence of the Frank-Wolfe (FW) method. We have
\[\begin{aligned} f(x_{t+1}) - f(x_t) &\le \langle\partial f(x_t),x_{t+1} - x_t\rangle + \frac{L}{2}\|x_{t+1} - x_t\|_2^2 \\ &= \eta_t\langle\partial f(x_t),x_{t+\frac12} - x_t\rangle + \frac{L}{2}\eta_t^2\|x_{t+\frac12} - x_t\|_2^2 \\ &\le \eta_t\langle\partial f(x_t),x_* - x_t\rangle + \frac{L}{2}\eta_t^2D^2 \\ &\le \eta_t(f(x_*) - f(x_t)) + \frac{L}{2}\eta_t^2D^2, \end{aligned}\]where we appealed to the strong smoothness of \(f\) in the first step, plugged in the FW update rule in the second step, recalled that \(x_{t+\frac12}\) solves the FW subproblem and that \(\cC\) is bounded with diameter \(D\) in the third step, and appealed to the convexity of \(f\) in the fourth step. We rearrange to obtain a recursion on the sequence of suboptimality gaps \(\epsilon_t \triangleq f(x_t) - f(x_*)\):
\[\epsilon_{t+1} \le (1-\eta_t)\epsilon_t + \frac{L}{2}\eta_t^2D^2.\]We induct on \(t\) to obtain the stated result.
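Here is a minimal FW sketch under illustrative assumptions (a quadratic over the unit Euclidean ball, where the FW subproblem has the closed-form solution \(-\partial f(x)/\|\partial f(x)\|_2\)), with the classical step size \(\eta_t = \frac{2}{t+2}\); inducting on the recursion with this choice gives \(\epsilon_t \le \frac{2LD^2}{t+2}\), which the loop checks:

```python
import numpy as np

# FW on an illustrative quadratic over the unit Euclidean ball, where the
# FW subproblem has the closed form s = -grad / ||grad||. With the
# classical step size eta_t = 2/(t + 2), inducting on the recursion gives
# f(x_t) - f(x_*) <= 2 L D^2 / (t + 2).
mu, L = 1.0, 10.0
A = np.diag([mu, L])
b = np.array([4.0, 0.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b

def fw_subproblem(g):
    # argmin over the unit ball of <g, s>
    return -g / np.linalg.norm(g)

D = 2.0                                   # diameter of the unit ball
x = np.zeros(2)
f_star = f(np.array([1.0, 0.0]))          # constrained minimizer (1, 0)
for t in range(100):
    eta = 2.0 / (t + 2)
    x = x + eta * (fw_subproblem(grad(x)) - x)   # FW update
    assert f(x) - f_star <= 2 * L * D**2 / (t + 3) + 1e-12
```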
Posted on January 26, 2021 from San Francisco, CA.