This post accompanies the lecture video on gradient descent. Please see the video for the problem setup and definitions.
The GD update rule implies successive optimization errors satisfy
\[x_{t+1} - x_* = x_t - x_* - \eta\partial f(x_t) = (I - \eta A)(x_t - x_*).\]Taking norms, we have
\[\|x_{t+1} - x_*\|_2 \le \|I - \eta A\|_2\|x_t - x_*\|_2,\]where \(\|\cdot\|_2\) with a matrix argument denotes the norm induced by the Euclidean norm on vectors. Recalling that the \(\|\cdot\|_2\) norm of a symmetric matrix is the magnitude of its eigenvalue furthest from zero,
\[\|I - \eta A\|_2 = \max\{|1 - \eta\lambda_{\max}(A)|,|1-\eta\lambda_{\min}(A)|\}.\]Thus the choice of step length that minimizes \(\|I - \eta A\|_2\) is \(\eta = \frac{2}{\lambda_{\max}(A) + \lambda_{\min}(A)}\), and the minimum is \(\|I - \eta A\|_2 = \frac{\lambda_{\max}(A) - \lambda_{\min}(A)}{\lambda_{\max}(A) + \lambda_{\min}(A)}\). Combining this expression with the preceding error bound, we have
\[\begin{aligned} \|x_t - x_*\|_2 &\le (\frac{\lambda_{\max}(A) - \lambda_{\min}(A)}{\lambda_{\max}(A) + \lambda_{\min}(A)})\|x_{t-1} - x_*\|_2 \\ &\quad\vdots \\ &\le (\frac{\lambda_{\max}(A) - \lambda_{\min}(A)}{\lambda_{\max}(A) + \lambda_{\min}(A)})^t\|x_0 - x_*\|_2. \end{aligned}\]It is not hard to check that exact line search leads to \(\eta_t = \frac{\|\partial f(x_t)\|_2^2}{\langle\partial f(x_t),A\partial f(x_t)\rangle}\). This implies successive cost values satisfy
\[\begin{aligned} f(x_{t+1}) &= \frac12\langle x_t - \eta_t\partial f(x_t) - x_*,A(x_t - \eta_t\partial f(x_t) - x_*)\rangle \\ &= \frac12\langle x_t - x_*,A(x_t - x_*)\rangle - \eta_t\|\partial f(x_t)\|_2^2 + \frac{\eta_t^2}{2}\langle\partial f(x_t),A\partial f(x_t)\rangle \\ &= \frac12\langle x_t - x_*,A(x_t - x_*)\rangle - \frac{\|\partial f(x_t)\|_2^4}{2\langle\partial f(x_t),A\partial f(x_t)\rangle} \\ &= (1 - \frac{\|\partial f(x_t)\|_2^4}{\langle\partial f(x_t),A\partial f(x_t)\rangle\langle\partial f(x_t),A^{-1}\partial f(x_t)\rangle})f(x_t), \end{aligned}\]where we recalled \(f(x_t) = \frac12\langle\partial f(x_t),A^{-1}\partial f(x_t)\rangle\) in the last step. We recall Kantorovich’s inequality:
\[\frac{\|x\|_2^4}{\langle x,Ax\rangle\langle x,A^{-1}x\rangle} \ge \frac{4\lambda_{\max}(A)\lambda_{\min}(A)}{(\lambda_{\max}(A) + \lambda_{\min}(A))^2}.\]This inequality implies
\[\begin{aligned} f(x_{t+1}) &\le (1-\frac{4\lambda_{\max}(A)\lambda_{\min}(A)}{(\lambda_{\max}(A) + \lambda_{\min}(A))^2})f(x_t) \\ &= (\frac{\lambda_{\max}(A) - \lambda_{\min}(A)}{\lambda_{\max}(A) + \lambda_{\min}(A)})^2f(x_t). \end{aligned}\]We iterate the preceding bound (and recall \(f(x_*) = 0\)) to obtain the stated result.
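To make these rates concrete, here is a small numerical sketch (not from the lecture; the quadratic, its spectrum, and the iteration counts are my own choices) that checks both per-step contraction factors on a random positive definite quadratic: the fixed step \(\eta = \frac{2}{\lambda_{\max}(A)+\lambda_{\min}(A)}\) contracts the error norm by at most \(\frac{\lambda_{\max}(A)-\lambda_{\min}(A)}{\lambda_{\max}(A)+\lambda_{\min}(A)}\), and exact line search contracts the cost by at most the square of that factor.

```python
import numpy as np

# Numerical check of both contraction factors on a random PD quadratic
# f(x) = 0.5 * (x - x_star)' A (x - x_star). The dimension, spectrum, and
# iteration counts below are arbitrary choices for illustration.
rng = np.random.default_rng(0)
d = 20
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
eigvals = np.linspace(1.0, 50.0, d)              # lambda_min = 1, lambda_max = 50
A = Q @ np.diag(eigvals) @ Q.T
x_star = rng.standard_normal(d)
lam_min, lam_max = eigvals.min(), eigvals.max()
rate = (lam_max - lam_min) / (lam_max + lam_min)

grad = lambda x: A @ (x - x_star)
f = lambda x: 0.5 * (x - x_star) @ A @ (x - x_star)

# Fixed step eta = 2/(lam_max + lam_min): error norm contracts by at most `rate`.
x = np.ones(d)
eta = 2.0 / (lam_max + lam_min)
for _ in range(10):
    err_before = np.linalg.norm(x - x_star)
    x = x - eta * grad(x)
    assert np.linalg.norm(x - x_star) <= rate * err_before + 1e-9

# Exact line search: cost contracts by at most rate**2 (via Kantorovich).
x = np.ones(d)
for _ in range(10):
    g = grad(x)
    eta_t = (g @ g) / (g @ A @ g)                # exact line search step
    cost_before = f(x)
    x = x - eta_t * g
    assert f(x) <= rate**2 * cost_before + 1e-9
```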
The fundamental theorem of calculus implies
\[\partial f(x_t) - \partial f(x_*) = \int_0^1\partial^2f(x_s)(x_t - x_*)ds,\]where \(x_s \triangleq (1-s)x_t + sx_*\). This and the GD update rule imply successive optimization errors satisfy
\[\begin{aligned} x_{t+1} - x_* = x_t - x_* - \eta\partial f(x_t) = (I - \eta\int_0^1\partial^2f(x_s)ds)(x_t - x_*). \end{aligned}\]Taking norms and plugging in \(\eta = \frac{2}{L + \mu}\), we have
\[\|x_{t+1} - x_*\|_2 \le \sup\nolimits_{s\in[0,1]}\|I - \eta\partial^2f(x_s)\|_2\|x_t - x_*\|_2 \le (\frac{L-\mu}{L+\mu})\|x_t - x_*\|_2.\]We recognize \(\frac{L-\mu}{L+\mu} = \frac{\kappa -1}{\kappa + 1}\) and iterate the preceding bound to obtain the stated result.
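As a sanity check (a sketch under my own assumptions, not part of the lecture), the contraction also shows up numerically on a non-quadratic strongly convex function. Below, \(f(x) = \frac{\mu}{2}\|x\|_2^2 + \sum_i\log(1+e^{x_i})\) has Hessian eigenvalues in \([\mu, \mu + \tfrac14]\), so we may take \(L = \mu + \tfrac14\); the minimizer is estimated by running GD to high precision beforehand.

```python
import numpy as np

# Sanity check of the (kappa-1)/(kappa+1) per-step contraction with eta = 2/(L+mu)
# on f(x) = (mu/2)||x||^2 + sum_i log(1 + exp(x_i)); Hessian eigenvalues lie in
# [mu, mu + 1/4], so L = mu + 0.25 is a valid smoothness constant.
mu = 1.0
L = mu + 0.25
kappa = L / mu
rate = (kappa - 1.0) / (kappa + 1.0)

def grad(x):
    # gradient of f: mu * x + sigmoid(x)
    return mu * x + 1.0 / (1.0 + np.exp(-x))

d = 10
# Estimate the minimizer x_* by running GD to high precision first.
x_star = np.zeros(d)
for _ in range(10_000):
    x_star = x_star - (1.0 / L) * grad(x_star)

x = 5.0 * np.ones(d)
eta = 2.0 / (L + mu)
for _ in range(15):
    err_before = np.linalg.norm(x - x_star)
    x = x - eta * grad(x)
    assert np.linalg.norm(x - x_star) <= rate * err_before + 1e-8
```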
We have
\[\begin{aligned} \|x_{t+1} - x_*\|_2^2 &\textstyle= \|x_t - x_* - \frac1L\partial f(x_t)\|_2^2 \\ &= \|x_t - x_*\|_2^2 + \frac{1}{L^2}\|\partial f(x_t)\|_2^2 - \frac2L\langle \partial f(x_t),x_t - x_*\rangle \\ &\le \|x_t - x_*\|_2^2 - \frac{\mu}{L}\|x_t - x_*\|_2^2 \\ &= (1-\frac{\mu}{L})\|x_t - x_*\|_2^2, \end{aligned}\]where we appealed to the regularity condition \(\langle\partial f(x_t),x_t - x_*\rangle \ge \frac{\mu}{2}\|x_t - x_*\|_2^2 + \frac{1}{2L}\|\partial f(x_t)\|_2^2\) in the third step. We iterate the preceding bound to obtain the stated result.
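A quick numerical sketch (again a toy example of my own, using the quadratic model, where the regularity condition holds with \(\mu = \lambda_{\min}(A)\) and \(L = \lambda_{\max}(A)\)) confirming the per-step contraction of the squared distance with \(\eta = \frac1L\):

```python
import numpy as np

# Check that GD with eta = 1/L contracts the squared distance by (1 - mu/L)
# per step on a random PD quadratic with mu = lambda_min and L = lambda_max.
rng = np.random.default_rng(2)
d = 15
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
eigvals = np.linspace(0.5, 10.0, d)
A = Q @ np.diag(eigvals) @ Q.T
x_star = rng.standard_normal(d)
mu, L = eigvals.min(), eigvals.max()

x = rng.standard_normal(d)
for _ in range(10):
    sq_err_before = np.linalg.norm(x - x_star) ** 2
    x = x - (1.0 / L) * (A @ (x - x_star))       # gradient step with eta = 1/L
    assert np.linalg.norm(x - x_star) ** 2 <= (1 - mu / L) * sq_err_before + 1e-9
```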
The strong smoothness condition implies
\[\begin{aligned} f(x_{t+1}) - f(x_t) &\le \langle\partial f(x_t),x_{t+1} - x_t\rangle + \frac{L}{2}\|x_{t+1} - x_t\|_2^2 \\ &= -\eta_t\|\partial f(x_t)\|_2^2 + \frac{\eta_t^2 L}{2}\|\partial f(x_t)\|_2^2 \\ &= -\frac{1}{2L}\|\partial f(x_t)\|_2^2, \end{aligned}\]where the last equality substitutes \(\eta_t = \frac1L\). This bound is a stronger version of the observation that GD is a descent method (on SS functions). It implies
\[\begin{aligned} f(x_{t+1}) - f(x_*) &\le f(x_t) - f(x_*) - \frac{1}{2L}\|\partial f(x_t)\|_2^2 \\ &\le f(x_t) - f(x_*) - \frac{\mu}{L}(f(x_t) - f(x_*)) \\ &= (1-\frac{\mu}{L})(f(x_t) - f(x_*)), \end{aligned}\]where the second inequality follows from strong convexity, which gives \(\|\partial f(x_t)\|_2^2 \ge 2\mu(f(x_t) - f(x_*))\). We iterate the preceding bound to obtain the stated result.
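The analogous check for the cost gap (a sketch on the quadratic model, where \(f(x_*) = 0\)):

```python
import numpy as np

# Check that with eta = 1/L the cost gap f(x_t) - f(x_*) contracts by (1 - mu/L)
# per step on a random PD quadratic (here f(x_*) = 0 by construction).
rng = np.random.default_rng(3)
d = 15
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
eigvals = np.linspace(0.5, 10.0, d)
A = Q @ np.diag(eigvals) @ Q.T
x_star = rng.standard_normal(d)
mu, L = eigvals.min(), eigvals.max()

def f(x):
    return 0.5 * (x - x_star) @ A @ (x - x_star)

x = rng.standard_normal(d)
for _ in range(10):
    gap_before = f(x)                            # f(x_t) - f(x_*) since f(x_*) = 0
    x = x - (1.0 / L) * (A @ (x - x_star))       # gradient step with eta = 1/L
    assert f(x) <= (1 - mu / L) * gap_before + 1e-9
```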
Recall GD on \(L\)-SS functions is a descent method:
\[\frac{1}{2L}\|\partial f(x_t)\|_2^2 \le f(x_t) - f(x_{t+1}).\]Summing the preceding bound over \(t = 0,\dots,T-1\), we have
\[\begin{aligned} \frac{1}{2L}\sum_{t=0}^{T-1}\|\partial f(x_t)\|_2^2 &\le \sum_{t=0}^{T-1}(f(x_t) - f(x_{t+1})) \\ &= f(x_0) - f(x_T) \\ &\le f(x_0) - f(x_*). \end{aligned}\]We lower bound the left side with \(\frac{T}{2L}\min_{0\le t<T}\|\partial f(x_t)\|_2^2\), and rearrange to obtain the stated result.
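A final sketch (the test function and horizon are my own choices, not from the lecture) checking the resulting bound \(\min_{0\le t<T}\|\partial f(x_t)\|_2^2 \le \frac{2L(f(x_0)-f(x_*))}{T}\) on a smooth but non-convex function, \(f(x)=\sum_i(1-\cos x_i)\), which is \(1\)-smooth with minimum value \(0\):

```python
import numpy as np

# Check the min-gradient-norm bound for an L-smooth, non-convex function:
# f(x) = sum_i (1 - cos(x_i)) is 1-smooth with f(x_*) = 0.
L = 1.0

def f(x):
    return np.sum(1.0 - np.cos(x))

def grad(x):
    return np.sin(x)

rng = np.random.default_rng(4)
x = rng.uniform(-3.0, 3.0, size=25)
T = 200
f0 = f(x)
min_sq_grad = np.inf
for _ in range(T):
    g = grad(x)
    min_sq_grad = min(min_sq_grad, np.sum(g ** 2))
    x = x - (1.0 / L) * g                        # GD with eta = 1/L
assert min_sq_grad <= 2 * L * f0 / T + 1e-12     # f(x_*) = 0 here
```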
Posted on January 19, 2021 from San Francisco, CA.