This post accompanies the lecture video on accelerated gradient methods. Please see the video for the problem setup and definitions.
Recall that the iterates of the heavy ball method satisfy
\[\begin{bmatrix} x_{t+1} \\ x_t \end{bmatrix} = \begin{bmatrix} (1+\theta)I_d & -\theta I_d \\ I_d & 0_{d\times d} \end{bmatrix}\begin{bmatrix} x_t \\ x_{t-1} \end{bmatrix} - \begin{bmatrix} \eta\partial f(x_t) \\ 0_d \end{bmatrix}\]It is not hard to check that the iterate errors satisfy
\[\begin{aligned} \begin{bmatrix} x_{t+1} - x_* \\ x_t - x_* \end{bmatrix} &= \begin{bmatrix} (1+\theta)I_d & -\theta I_d \\ I_d & 0_{d\times d} \end{bmatrix}\begin{bmatrix} x_t - x_* \\ x_{t-1} - x_* \end{bmatrix} - \begin{bmatrix} \eta\partial f(x_t) \\ 0_d \end{bmatrix} \\ &= \underbrace{\begin{bmatrix} (1+\theta)I_d - \eta Q & -\theta I_d \\ I_d & 0_{d\times d} \end{bmatrix}}_{H} \begin{bmatrix} x_t - x_* \\ x_{t-1} - x_* \end{bmatrix}, \end{aligned}\]where we used \(\partial f(x_t) = Q(x_t - x_*)\) (\(f\) is quadratic and \(\partial f(x_*) = 0\)) in the second step. We iterate the preceding expression and take norms to obtain
\[\left\|\begin{bmatrix} x_{T+1} - x_* \\ x_T - x_* \end{bmatrix}\right\|_2 \le \|H^T\|_2\left\|\begin{bmatrix} x_1 - x_* \\ x_0 - x_* \end{bmatrix} \right\|_2.\]By Gelfand’s formula, \(\|H^T\|_2 \approx \rho(H)^T\), where \(\rho(H) \triangleq \max_j\{\vert\lambda_j(H)\vert\}\) is the spectral radius of \(H\). In the rest of this proof, we show that \(\rho(H) \le \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\).
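To get a feel for Gelfand’s formula, here is a quick numerical sketch (assuming NumPy; the dimension and the random quadratic are illustrative choices, with \(\eta\) and \(\theta\) set to the values derived below) comparing \(\|H^T\|_2^{1/T}\) with \(\rho(H)\):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5

# Random positive definite quadratic f(x) = (1/2) x^T Q x.
A = rng.standard_normal((d, d))
Q = A @ A.T + np.eye(d)
evals = np.linalg.eigvalsh(Q)
lmin, lmax = evals[0], evals[-1]

# Step size and momentum set to the values derived later in the proof.
eta = 4 / (np.sqrt(lmax) + np.sqrt(lmin)) ** 2
kappa = lmax / lmin
theta = ((np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)) ** 2

# Heavy-ball error-iteration matrix H.
H = np.block([
    [(1 + theta) * np.eye(d) - eta * Q, -theta * np.eye(d)],
    [np.eye(d), np.zeros((d, d))],
])

rho = np.max(np.abs(np.linalg.eigvals(H)))
for T in [10, 50, 200]:
    opnorm = np.linalg.norm(np.linalg.matrix_power(H, T), 2)
    print(T, opnorm ** (1 / T), rho)  # opnorm**(1/T) approaches rho as T grows
```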
We start by observing \(H\) is similar to a block diagonal matrix with \(2\times 2\) blocks, so their spectral radii coincide:
\[\begin{aligned} \rho(H) &= \rho\left(\begin{bmatrix}(1+\theta)I_d - \eta\Lambda & -\theta I_d \\ I_d & 0_{d\times d}\end{bmatrix}\right) \\ &= \max\nolimits_{j\in[d]}\rho\left(\begin{bmatrix}1 + \theta - \eta\lambda_j & -\theta \\ 1 & 0\end{bmatrix}\right), \end{aligned}\]where \(Q = V\Lambda V^T\) is the eigenvalue decomposition of \(Q\) and the \(\lambda_j\)’s are the eigenvalues of \(Q\). The eigenvalues of the \(2\times 2\) blocks are the roots of
\[z^2 - (1+\theta-\eta\lambda_j)z + \theta = 0.\]If \((1+\theta-\eta\lambda_j)^2 \le 4\theta\), then both roots have magnitude \(\sqrt{\theta}\). We set \(\theta = \max\{(1-\sqrt{\eta\lambda_{\max}(Q)})^2,(1-\sqrt{\eta\lambda_{\min}(Q)})^2\}\) to ensure
\[\textstyle\theta\in[(1-\sqrt{\eta\lambda_j})^2,(1+\sqrt{\eta\lambda_j})^2],\]which in turn ensures \((1+\theta-\eta\lambda_j)^2 \le 4\theta\). This choice of momentum parameter ensures \(\rho(H) \le \sqrt{\theta}\). We set \(\eta = \frac{4}{(\lambda_{\max}(Q)^{\frac12} + \lambda_{\min}(Q)^{\frac12})^2}\) so that \((1-\sqrt{\eta\lambda_{\max}(Q)})^2 = (1-\sqrt{\eta\lambda_{\min}(Q)})^2\). This implies
\[\begin{aligned} \theta &= \max\left\{\left(1-\frac{2\lambda_{\max}(Q)^{\frac12}}{\lambda_{\max}(Q)^{\frac12} + \lambda_{\min}(Q)^{\frac12}}\right)^2,\left(1-\frac{2\lambda_{\min}(Q)^{\frac12}}{\lambda_{\max}(Q)^{\frac12} + \lambda_{\min}(Q)^{\frac12}}\right)^2\right\} \\ &= \left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^2, \end{aligned}\]where \(\kappa \triangleq \frac{\lambda_{\max}(Q)}{\lambda_{\min}(Q)}\) is the condition number of \(Q\). Thus \(\rho(H) \le \sqrt{\theta} = \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\), as claimed.
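As a sanity check on this algebra, the following sketch (assuming NumPy; the eigenvalue range is an illustrative choice) verifies that the max formula for \(\theta\) collapses to \(\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^2\) and that the roots of each \(2\times 2\) block’s characteristic polynomial have magnitude \(\sqrt{\theta}\):

```python
import numpy as np

lmin, lmax = 1.0, 25.0   # illustrative eigenvalue range; kappa = 25
eta = 4 / (np.sqrt(lmax) + np.sqrt(lmin)) ** 2
theta = max((1 - np.sqrt(eta * lmax)) ** 2, (1 - np.sqrt(eta * lmin)) ** 2)

kappa = lmax / lmin
assert np.isclose(theta, ((np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)) ** 2)

# For every eigenvalue in [lmin, lmax] the discriminant is <= 0, so both
# roots of z^2 - (1 + theta - eta*lam) z + theta = 0 have magnitude sqrt(theta).
for lam in np.linspace(lmin, lmax, 7):
    roots = np.roots([1.0, -(1 + theta - eta * lam), theta])
    assert np.allclose(np.abs(roots), np.sqrt(theta))
```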
Rearranging the N83 update rule, we have
\[\frac{x_{t+1}-x_t}{\sqrt{\eta}} = \frac{t-1}{t+2}\frac{x_t - x_{t-1}}{\sqrt{\eta}} - \sqrt{\eta}\partial f(y_t).\]We wish to view the sequence of iterates \((x_t)\) as a discrete version of a curve \(X(s)\). It turns out the correct way to discretize the curve is \(x_t \approx X(t\sqrt{\eta})\), which suggests the change of variables \(t = \frac{s}{\sqrt{\eta}}\). This change of variables leads to \(X(s) \approx x_{s/\sqrt{\eta}} = x_t\) and \(X(s + \sqrt{\eta}) \approx x_{t+1}\). We Taylor expand \(X(s)\) to obtain
\[\begin{aligned} \frac{x_{t+1}-x_t}{\sqrt{\eta}} &\approx \dot{X}(s) + \frac12\ddot{X}(s)\sqrt{\eta} \\ \frac{x_t-x_{t-1}}{\sqrt{\eta}} &\approx \dot{X}(s) - \frac12\ddot{X}(s)\sqrt{\eta}, \end{aligned}\]where \(X(s)\) is the N83 flow. The N83 update rule implies
\[\dot{X}(s) + \frac12\ddot{X}(s)\sqrt{\eta} \approx \left(1 - \frac{3\sqrt{\eta}}{s}\right)\left(\dot{X}(s) - \frac12\ddot{X}(s)\sqrt{\eta}\right) - \sqrt{\eta}\partial f(X(s)),\]where we used \(\frac{t-1}{t+2} \approx 1 - \frac{3}{t} = 1 - \frac{3\sqrt{\eta}}{s}\). We rearrange and drop the asymptotically negligible terms (as \(\eta\to 0\)) to obtain the N83 ODE:
\[\ddot{X}(s) + \frac3s\dot{X}(s) + \partial f(X(s)) = 0.\]
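To see the correspondence between the iterates and the ODE numerically, here is a sketch (assuming NumPy and SciPy; the quadratic objective, step size, and horizon are illustrative choices) that runs the N83 updates and integrates the ODE from just after the singularity at \(s = 0\), then compares \(x_t\) with \(X(t\sqrt{\eta})\):

```python
import numpy as np
from scipy.integrate import solve_ivp

# Illustrative quadratic objective f(x) = (1/2) x^T Q x with minimizer x_* = 0.
Q = np.diag([1.0, 10.0])
grad = lambda x: Q @ x

# N83 iterates: y_t = x_t + ((t-1)/(t+2)) (x_t - x_{t-1}),
#               x_{t+1} = y_t - eta * grad f(y_t).
eta = 1e-4
x0 = np.array([3.0, -2.0])
x_prev, x = x0, x0
iterates = [x0]
T = 5000
for t in range(1, T + 1):
    y = x + ((t - 1) / (t + 2)) * (x - x_prev)
    x_prev, x = x, y - eta * grad(y)
    iterates.append(x)

# N83 ODE, integrated from just after the singularity at s = 0.
def ode(s, z):
    xs, vs = z[:2], z[2:]
    return np.concatenate([vs, -(3.0 / s) * vs - grad(xs)])

sol = solve_ivp(ode, (1e-3, T * np.sqrt(eta)),
                np.concatenate([x0, np.zeros(2)]),
                dense_output=True, rtol=1e-9, atol=1e-12)

# Under the discretization x_t ~ X(t * sqrt(eta)), the two columns should be close.
for t in [100, 1000, 5000]:
    print(t, iterates[t], sol.sol(t * np.sqrt(eta))[:2])
```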
Define the energy/Lyapunov function \(E(s) \triangleq s^2(f(X) - f_*) + 2\|X + \frac{s}{2}\dot{X} - x_*\|_2^2\). We have
\[\begin{aligned} \dot{E} &= 2s(f(X) - f_*) + s^2\langle\partial f(X),\dot{X}\rangle + 4\langle X+\frac{s}{2}\dot{X} - x_*,\frac32\dot{X} + \frac{s}{2}\ddot{X}\rangle \\ &= 2s(f(X) - f_*) - 2s\langle X-x_*,\partial f(X)\rangle \\ &\le 0, \end{aligned}\]where we appealed to the N83 ODE in the second step and the convexity of \(f\) in the third step. This implies \(E\) is non-increasing (in \(s\)). Thus
\[f(X(s)) - f_* \le \frac{E(s)}{s^2} \le \frac{E(0)}{s^2} = O(\frac{1}{s^2}),\]where we recalled the definition of \(E\) in the first step.
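The following sketch (assuming NumPy and SciPy; same illustrative quadratic as above) integrates the N83 ODE and checks both that \(E(s)\) is non-increasing and that \(f(X(s)) - f_*\) stays below \(E(0)/s^2\):

```python
import numpy as np
from scipy.integrate import solve_ivp

# Illustrative quadratic: f(x) = (1/2) x^T Q x, so x_* = 0 and f_* = 0.
Q = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ Q @ x

def ode(s, z):
    xs, vs = z[:2], z[2:]
    return np.concatenate([vs, -(3.0 / s) * vs - Q @ xs])

x0 = np.array([3.0, -2.0])
sol = solve_ivp(ode, (1e-3, 50.0), np.concatenate([x0, np.zeros(2)]),
                dense_output=True, rtol=1e-9, atol=1e-12)

# E(s) = s^2 (f(X) - f_*) + 2 || X + (s/2) Xdot - x_* ||^2 on a grid.
grid = np.linspace(0.1, 50.0, 500)
E = []
for s in grid:
    z = sol.sol(s)
    xs, vs = z[:2], z[2:]
    E.append(s ** 2 * f(xs) + 2 * np.sum((xs + 0.5 * s * vs) ** 2))
assert all(e1 <= e0 + 1e-6 for e0, e1 in zip(E, E[1:]))   # E is non-increasing

# O(1/s^2) rate: f(X(s)) - f_* <= E(0)/s^2 = 2 ||x0 - x_*||^2 / s^2.
assert all(f(sol.sol(s)[:2]) <= 2 * np.dot(x0, x0) / s ** 2 + 1e-6 for s in grid)
```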
We start by recalling the basic inequality for proximal gradient methods:
\[f(x_{t+1}) - f(x) \le \frac{L}{2}(\|x - y_t\|_2^2 - \|x - x_{t+1}\|_2^2).\]
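Before using it, here is a small numerical check of the basic inequality (assuming NumPy; the test problem is a smooth quadratic with no non-smooth term, so the prox step reduces to a plain gradient step, and all problem data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4

# Smooth test objective f(x) = (1/2) x^T Q x; its gradient is L-Lipschitz
# with L = lambda_max(Q). With no non-smooth term, the prox step is a gradient step.
A = rng.standard_normal((d, d))
Q = A @ A.T
L = np.linalg.eigvalsh(Q)[-1]
f = lambda x: 0.5 * x @ Q @ x

y = rng.standard_normal(d)      # current extrapolated point y_t
x_next = y - (Q @ y) / L        # x_{t+1} = y_t - (1/L) grad f(y_t)

# The basic inequality must hold for every reference point x.
for _ in range(100):
    x = rng.standard_normal(d)
    lhs = f(x_next) - f(x)
    rhs = 0.5 * L * (np.sum((x - y) ** 2) - np.sum((x - x_next) ** 2))
    assert lhs <= rhs + 1e-8
```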
Using this basic inequality, we show that FISTA reduces an energy/Lyapunov function at each iteration. Letting \(x = \frac{1}{\theta_t}x_* + (1-\frac{1}{\theta_t})x_t\) in the basic inequality, we obtain
\[\begin{aligned} &\textstyle f(x_{t+1}) - f(\frac{1}{\theta_t}x_* + (1-\frac{1}{\theta_t})x_t) \\ &\quad\le\frac{L}{2}\|\frac{1}{\theta_t}x_* + (1-\frac{1}{\theta_t})x_t - y_t\|_2^2 - \frac{L}{2}\|\frac{1}{\theta_t}x_* + (1-\frac{1}{\theta_t})x_t - x_{t+1}\|_2^2 \\ &\quad=\frac{L}{2\theta_t^2}\|x_* + (\theta_t-1)x_t - \theta_ty_t\|_2^2 - \frac{L}{2\theta_t^2}\|x_* + (\theta_t-1)x_t - \theta_tx_{t+1}\|_2^2 \\ &\quad=\frac{L}{2\theta_t^2}(\|u_t\|_2^2 - \|u_{t+1}\|_2^2), \end{aligned}\]where we defined \(u_t \triangleq \theta_{t-1}x_t - (x_* + (\theta_{t-1}-1)x_{t-1})\) and recalled the \(y_t\) update rule in the third step. The convexity of \(f\) implies
\[\textstyle f(\frac{1}{\theta_t}x_* + (1-\frac{1}{\theta_t})x_t) \le \frac{1}{\theta_t}f_* + (1-\frac{1}{\theta_t})f(x_t),\]so the left side of the preceding inequality is at least
\[\textstyle f(x_{t+1}) - f_* - (1-\frac{1}{\theta_t})(f(x_t) - f_*) \le f(x_{t+1}) - f(\frac{1}{\theta_t}x_* + (1-\frac{1}{\theta_t})x_t).\]Combining the upper and lower bounds of \(f(x_{t+1}) - f(\frac{1}{\theta_t}x_* + (1-\frac{1}{\theta_t})x_t)\), we have
\[\textstyle \theta_t^2(f(x_{t+1}) - f_*) - (\theta_t^2 - \theta_t)(f(x_t) - f_*) \le \frac{L}{2}(\|u_t\|_2^2 - \|u_{t+1}\|_2^2).\]Finally, we recognize \(\theta_t^2 - \theta_t = \theta_{t-1}^2\) and rearrange to show that FISTA reduces an energy function at each iteration:
\[\|u_{t+1}\|_2^2 + \frac{2\theta_t^2}{L}(f(x_{t+1}) - f_*) \le \|u_t\|_2^2 + \frac{2\theta_{t-1}^2}{L}(f(x_t) - f_*).\]
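To confirm this numerically, the following sketch (assuming NumPy; the least-squares objective is an illustrative choice, and we take the standard FISTA sequence \(\theta_0 = 1\), \(\theta_t = \frac{1+\sqrt{1+4\theta_{t-1}^2}}{2}\), which satisfies the identity above) runs FISTA and asserts that the energy is non-increasing:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 20

# Illustrative least-squares objective f(x) = (1/2) ||A x - b||^2.
A = rng.standard_normal((30, d))
b = rng.standard_normal(30)
x_star = np.linalg.lstsq(A, b, rcond=None)[0]
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad = lambda x: A.T @ (A @ x - b)
L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of grad f
f_star = f(x_star)

# FISTA with theta_0 = 1 and theta_t = (1 + sqrt(1 + 4 theta_{t-1}^2)) / 2.
x_prev = x0 = rng.standard_normal(d)
x = x0 - grad(x0) / L                  # x_1, from y_0 = x_0
theta_prev, theta = 1.0, (1 + np.sqrt(5.0)) / 2

energies = []
for t in range(1, 200):
    u = theta_prev * x - (x_star + (theta_prev - 1) * x_prev)
    energies.append(np.sum(u ** 2) + (2 * theta_prev ** 2 / L) * (f(x) - f_star))
    y = x + ((theta_prev - 1) / theta) * (x - x_prev)
    x_prev, x = x, y - grad(y) / L
    theta_prev, theta = theta, (1 + np.sqrt(1 + 4 * theta ** 2)) / 2

# The energy should be non-increasing in t (up to floating-point error).
assert all(e1 <= e0 + 1e-6 * (1 + e0) for e0, e1 in zip(energies, energies[1:]))
```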
In particular, we have
\[\begin{aligned} \frac{2\theta_{t-1}^2}{L}(f(x_t) - f_*) &\le \|u_1\|_2^2 + \frac{2\theta_0^2}{L}(f(x_1) - f_*) \\ &= \|x_1 - x_*\|_2^2 + \frac{2}{L}(f(x_1) - f_*), \end{aligned}\]where we used \(\theta_0 = 1\), so that \(u_1 = x_1 - x_*\). Recalling the basic inequality for proximal gradient methods (with \(x = x_*\) and \(t = 0\)), we have
\[\begin{aligned} \frac{2}{L}(f(x_1) - f_*) &\le \|y_0 - x_*\|_2^2 - \|x_1 - x_*\|_2^2 \\ &= \|x_0 - x_*\|_2^2 - \|x_1 - x_*\|_2^2, \end{aligned}\]where we used \(y_0 = x_0\). Rearranging, we have
\[\|x_1 - x_*\|_2^2 + \frac{2}{L}(f(x_1) - f_*) \le \|x_0 - x_*\|_2^2.\]We combine this with the bound on the reduction in energy from the first iteration to obtain
\[\frac{2\theta_{t-1}^2}{L}(f(x_t) - f_*) \le \|x_0 - x_*\|_2^2.\]Rearranging, we have
\[f(x_t) - f_* \le \frac{L}{2\theta_{t-1}^2}\|x_0 - x_*\|_2^2 \le \frac{2L\|x_0 - x_*\|_2^2}{(t+1)^2},\]where we recalled \(\theta_t \ge \frac{t+2}{2}\) for all \(t \ge 0\) in the second step.
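As a final check, a short sketch (again assuming the standard recursion \(\theta_t = \frac{1+\sqrt{1+4\theta_{t-1}^2}}{2}\) with \(\theta_0 = 1\)) verifies the lower bound on \(\theta_t\) used in the last step:

```python
import numpy as np

# theta_0 = 1 and theta_t = (1 + sqrt(1 + 4 theta_{t-1}^2)) / 2, which satisfies
# theta_t^2 - theta_t = theta_{t-1}^2; check that theta_t >= (t + 2) / 2.
theta = 1.0
for t in range(1, 10_000):
    theta = (1 + np.sqrt(1 + 4 * theta ** 2)) / 2
    assert theta >= (t + 2) / 2
```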
Posted on February 16, 2021 from San Francisco, CA.