This post accompanies the lecture video on stochastic optimization. Please see the video for the problem setup and definitions.
First, we study the convergence of SGD with a fixed step size \(\eta < \frac1L\). Since \(F\) is \(L\)-strongly smooth,
\[\begin{aligned} F(x_{t+1}) - F(x_t) &\le \langle\partial F(x_t),x_{t+1}-x_t\rangle + \frac{L}{2}\|x_{t+1} - x_t\|_2^2 \\ &= -\eta\langle\partial F(x_t),g(x_t,\xi_t)\rangle + \frac{L}{2}\eta^2\|g(x_t,\xi_t)\|_2^2, \end{aligned}\]where we plugged in the SGD update rule in the second step. Conditioning on \(\cH_t\triangleq\sigma(\xi_0,\dots,\xi_{t-1})\), we have
\[\begin{aligned} \Ex_t\big[F(x_{t+1})\big] - F(x_t) &\le -\eta\|\partial F(x_t)\|_2^2 + \frac{L}{2}\eta^2\Ex_t\big[\|g(x_t,\xi_t)\|_2^2\big] \\ &\le -(1 - \frac{L}{2}\eta)\eta\|\partial F(x_t)\|_2^2 + \frac{L}{2}\eta^2\sigma_g^2, \end{aligned}\]where we used the decomposition \(\Ex_t\big[\|g(x_t,\xi_t)\|_2^2\big] = \|\partial F(x_t)\|_2^2 + \Ex_t\big[\|g(x_t,\xi_t) - \partial F(x_t)\|_2^2\big]\) in the second step and let \(\sigma_g^2\) denote a uniform bound on the conditional variance \(\Ex_t\big[\|g(x_t,\xi_t) - \partial F(x_t)\|_2^2\big]\). For any \(\eta < \frac1L\), \(1 - \frac{L}{2}\eta > \frac12\), so the preceding expression simplifies to
\[\Ex_t\big[F(x_{t+1})\big] - F(x_t) \le -\frac{\eta}{2}\|\partial F(x_t)\|_2^2 + \frac{L}{2}\eta^2\sigma_g^2.\]As long as \(F\) is \(\mu\)-strongly convex, \(\mu(F(x_t) - F_*) \le \frac12\|\partial F(x_t)\|_2^2\), so
\[\Ex_t\big[F(x_{t+1})\big] - F(x_t) \le -\eta\mu(F(x_t) - F_*)+ \frac{L}{2}\eta^2\sigma_g^2.\]We subtract \(F_*\) from both sides and rearrange to obtain
\[\Ex_t\big[F(x_{t+1})\big] - F_* \le (1-\eta\mu)(F(x_t) - F_*)+ \frac{L}{2}\eta^2\sigma_g^2.\]We unravel the recursion to obtain
\[\Ex\big[F(x_T)\big] - F_* \le (1-\eta\mu)^T(F(x_0) - F_*)+ \frac{L}{2}\eta^2\sigma_g^2\sum_{s=0}^{T-1}(1-\eta\mu)^s.\]Recalling \(\eta <\frac1L\) and \(\mu \le L\), we have \(0 \le 1-\eta\mu < 1\), so we can bound the geometric series by \(\frac{1}{\eta\mu}\) to obtain
\[\Ex\big[F(x_T)\big] - F_* \le (1-\eta\mu)^T(F(x_0) - F_*) + \frac{L}{2\mu}\eta\sigma_g^2.\]The first term decays linearly in \(T\), while the second is a noise floor proportional to the step size \(\eta\); the simulation sketch below illustrates both effects.
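The following is a minimal sketch (not from the lecture) of fixed-step-size SGD on the toy objective \(F(x) = \frac12\|x\|_2^2\), for which \(\mu = L = 1\), \(x_* = 0\), and \(F_* = 0\); the noise scale and step size are made-up illustration values.

```python
import numpy as np

# Minimal sketch: fixed-step SGD on F(x) = 0.5 * ||x||_2^2 (mu = L = 1,
# x_* = 0, F_* = 0). The noise scale below is a made-up illustration value.
rng = np.random.default_rng(0)
d, eta, noise, T = 10, 0.1, 1.0, 2000          # eta < 1/L = 1

x = rng.standard_normal(d)
for t in range(T):
    g = x + noise * rng.standard_normal(d)     # unbiased estimate of grad F(x) = x
    x = x - eta * g                            # SGD update

sigma_g_sq = d * noise**2                      # E||g - grad F||_2^2 for this model
print("F(x_T) - F_* =", 0.5 * x @ x)           # stalls near the noise floor
print("noise floor  =", eta * sigma_g_sq / 2)  # (L / (2 mu)) * eta * sigma_g^2
```

Running this, the suboptimality should hover near the predicted noise floor rather than converging to zero. Second, we study the convergence of SGD with diminishing step sizes \(\eta_t\gets\frac{1}{t+1}\). The SGD update rule implies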
\[\begin{aligned} \|x_{t+1} - x_*\|_2^2 &= \|x_t - \eta_t g(x_t,\xi_t) - x_*\|_2^2 \\ &= \|x_t - x_*\|_2^2 - 2\eta_t\langle g(x_t,\xi_t),x_t-x_*\rangle + \eta_t^2\|g(x_t,\xi_t)\|_2^2. \end{aligned}\]Conditioning on \(\cH_t\), we have
\[\Ex_t\langle g(x_t,\xi_t),x_t-x_*\rangle = \langle\partial F(x_t),x_t-x_*\rangle \ge \mu\|x_t-x_*\|_2^2,\]where we used the unbiasedness of \(g(x_t,\xi_t)\) in the first step and the \(\mu\)-strong convexity of \(F\) (together with \(\partial F(x_*) = 0\)) in the second step. Thus
\[\Ex_t\big[\|x_{t+1} - x_*\|_2^2\big] \le (1-2\mu\eta_t)\|x_t - x_*\|_2^2 + \eta_t^2\sigma_g^2,\]where we additionally assumed the second moment of the stochastic gradient is bounded: \(\Ex_t\big[\|g(x_t,\xi_t)\|_2^2\big] \le \sigma_g^2\). With suitably scaled step sizes (e.g., \(\eta_t = \frac{1}{\mu(t+1)}\)), unrolling this recursion by induction yields \(\Ex\big[\|x_t - x_*\|_2^2\big] = O(1/t)\), as the sketch below suggests numerically.
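Here is the same toy problem as above (so \(\mu = 1\) and the post's \(\eta_t = \frac{1}{t+1}\) matches the scaled choice); again a sketch with made-up constants rather than the lecture's setup.

```python
import numpy as np

# Minimal sketch: SGD with diminishing step sizes eta_t = 1 / (t + 1) on
# F(x) = 0.5 * ||x||_2^2 (mu = 1), same made-up noise model as above.
rng = np.random.default_rng(1)
d, noise, T = 10, 1.0, 10_000

x = rng.standard_normal(d)
for t in range(T):
    eta_t = 1.0 / (t + 1)
    g = x + noise * rng.standard_normal(d)  # unbiased stochastic gradient
    x = x - eta_t * g

# ||x_T - x_*||^2 keeps shrinking (roughly like 1/T) instead of stalling.
print("||x_T - x_*||^2 =", x @ x)
print("O(1/T) scale    =", d * noise**2 / T)
```

Third, we study the convergence of SVRG. To keep the proof simple, we analyze a version that picks an anchor point uniformly at random from the iterates of the previous epoch: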
\[x_s \sim \unif\{x_{s-1,0},\dots,x_{s-1,m-1}\}.\]Theorem (linear convergence of SVRG): Let the \(f_i\)’s be convex and \(L\)-strongly smooth and \(F\) be \(\mu\)-strongly convex. For any step size \(\eta \in (0,\frac{1}{2L})\), as long as the epoch length \(m\) is large enough so that the contraction factor \(\gamma \triangleq \frac{1}{\mu\eta(1-2L\eta)m} + \frac{2L\eta}{1-2L\eta}\) is less than 1, the major iterates of SVRG converge linearly:
\[\Ex\big[F(x_s) - F_*\big] \le \gamma^s(F(x_{0,0}) - F_*).\]For example, \(\eta = \frac{1}{10L}\) and \(m = \frac{50L}{\mu}\) give \(\gamma = \frac14 + \frac14 = \frac12\). Proof: Define \(\Ex_{s,t}\) as the expectation conditioned on the outputs of the stochastic gradient oracle up to the end of the \((t-1)\)-th iteration of the \(s\)-th epoch and
\[g_{s,t} \triangleq \partial f_{i_{s,t}}(x_{s,t}) - \partial f_{i_{s,t}}(x_s) + \partial F(x_s)\]as the SVRG gradient estimate. We expand the SVRG update rule to obtain
\[\begin{aligned} \Ex_{s,t}\big[\|x_{s,t+1} - x_*\|_2^2\big] &= \Ex_{s,t}\big[\|x_{s,t} - \eta g_{s,t} - x_*\|_2^2\big] \\ &= \|x_{s,t} - x_*\|_2^2 - 2\eta\Ex_{s,t}\langle g_{s,t},x_{s,t} - x_*\rangle + \eta^2\Ex_{s,t}\big[\|g_{s,t}\|_2^2\big] \\ &= \|x_{s,t} - x_*\|_2^2 - 2\eta\langle\partial F(x_{s,t}),x_{s,t} - x_*\rangle + \eta^2\Ex_{s,t}\big[\|g_{s,t}\|_2^2\big] \\ &\le \|x_{s,t} - x_*\|_2^2 - 2\eta(F(x_{s,t}) - F_*) + \eta^2\Ex_{s,t}\big[\|g_{s,t}\|_2^2\big], \end{aligned}\]where we appealed to the unbiasedness of \(g_{s,t}\) in the third step and the convexity of \(F\) in the fourth step. We expand \(\Ex_{s,t}\big[\|g_{s,t}\|_2^2\big]\) to obtain
\[\begin{aligned} \Ex_{s,t}\big[\|g_{s,t}\|_2^2\big] &= \Ex_{s,t}\big[\|\partial f_{i_{s,t}}(x_{s,t}) - \partial f_{i_{s,t}}(x_s) + \partial F(x_s)\|_2^2\big] \\ &= \Ex_{s,t}\big[\|\partial f_{i_{s,t}}(x_{s,t}) - \partial f_{i_{s,t}}(x_*) - (\partial f_{i_{s,t}}(x_s) - \partial f_{i_{s,t}}(x_*) - \partial F(x_s))\|_2^2\big] \\ &\le 2\Ex_{s,t}\big[\|\partial f_{i_{s,t}}(x_{s,t}) - \partial f_{i_{s,t}}(x_*)\|_2^2\big] \\ &\quad+ 2\Ex_{s,t}\big[\|\partial f_{i_{s,t}}(x_s) - \partial f_{i_{s,t}}(x_*) - \partial F(x_s)\|_2^2\big] \\ &= 2\Ex_{s,t}\big[\|\partial f_{i_{s,t}}(x_{s,t}) - \partial f_{i_{s,t}}(x_*)\|_2^2\big] \\ &\quad+ 2\Ex_{s,t}\big[\|\partial f_{i_{s,t}}(x_s) - \partial f_{i_{s,t}}(x_*) - \Ex_{s,t}\big[\partial f_{i_{s,t}}(x_s) - \partial f_{i_{s,t}}(x_*)\big]\|_2^2\big] \\ &\le 2\Ex_{s,t}\big[\|\partial f_{i_{s,t}}(x_{s,t}) - \partial f_{i_{s,t}}(x_*)\|_2^2\big] + 2\Ex_{s,t}\big[\|\partial f_{i_{s,t}}(x_s) - \partial f_{i_{s,t}}(x_*)\|_2^2\big], \end{aligned}\]where we used \(\|a - b\|_2^2 \le 2\|a\|_2^2 + 2\|b\|_2^2\) in the third step and plugged
\[\partial F(x_s) = \partial F(x_s) - \partial F(x_*) = \Ex_{s,t}\big[\partial f_{i_{s,t}}(x_s) - \partial f_{i_{s,t}}(x_*)\big]\]into the fourth step; the final step uses the fact that the variance of a random vector is at most its second moment. To bound the right side of the preceding expression, we appeal to the convexity and strong smoothness of the \(f_i\)’s. The \(L\)-strong smoothness of \(f_i\) implies
\[\textstyle\frac{1}{2L}\|\partial f_i(x) - \partial f_i(x_*)\|_2^2 \le f_i(x) - f_i(x_*) - \langle\partial f_i(x_*),x-x_*\rangle.\]We sum over \(i\in[n]\) to obtain
\[\begin{aligned} \textstyle\frac{1}{2L}\sum_{i=1}^n\|\partial f_i(x) - \partial f_i(x_*)\|_2^2 &\le nF(x) - nF(x_*) - n\langle\partial F(x_*),x-x_*\rangle \\ &= n(F(x) - F_*), \end{aligned}\]where the equality uses \(\partial F(x_*) = 0\). Since \(i_{s,t}\) is drawn uniformly from \([n]\), dividing by \(n\) gives \(\Ex_{s,t}\big[\|\partial f_{i_{s,t}}(x) - \partial f_{i_{s,t}}(x_*)\|_2^2\big] \le 2L(F(x) - F_*)\), so the bound on \(\Ex_{s,t}\big[\|g_{s,t}\|_2^2\big]\) becomes
\[\Ex_{s,t}\big[\|g_{s,t}\|_2^2\big] \le 4L(F(x_{s,t}) - F_* + F(x_s) - F_*).\]This is the crucial result in this analysis of SVRG: it shows that the size of the SVRG gradient estimate decreases as the iterates and the anchor approach the optimal point, which the numerical check below illustrates.
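As a sanity check (not part of the lecture), the sketch below evaluates \(\Ex_{s,t}\big[\|g_{s,t}\|_2^2\big]\) on a made-up least squares problem \(f_i(x) = \frac12(a_i^\top x - b_i)^2\) at points increasingly close to \(x_*\), and compares it against the second moment of the plain stochastic gradient.

```python
import numpy as np

# Made-up least squares instance: f_i(x) = 0.5 * (a_i @ x - b_i)^2,
# F(x) = (1/n) * sum_i f_i(x), minimized at x_star.
rng = np.random.default_rng(2)
n, d = 200, 5
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
x_star = np.linalg.lstsq(A, b, rcond=None)[0]

def grad_i(x, i):                 # gradient of f_i at x
    return (A[i] @ x - b[i]) * A[i]

def full_grad(x):                 # gradient of F at x
    return A.T @ (A @ x - b) / n

for scale in [1.0, 0.1, 0.01]:    # iterate and anchor at distance ~scale from x_*
    x = x_star + scale * rng.standard_normal(d)
    anchor = x_star + scale * rng.standard_normal(d)
    svrg = np.mean([np.sum((grad_i(x, i) - grad_i(anchor, i)
                            + full_grad(anchor))**2) for i in range(n)])
    plain = np.mean([np.sum(grad_i(x, i)**2) for i in range(n)])
    print(f"scale {scale}: E||g_svrg||^2 = {svrg:.2e}, E||grad f_i||^2 = {plain:.2e}")
```

The SVRG estimate's second moment shrinks with the distance to \(x_*\), while the plain stochastic gradient's does not (since \(\partial f_i(x_*) \ne 0\) in general). We plug this bound into the expansion of the SVRG update rule to obtain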
\[\Ex_{s,t}\big[\|x_{s,t+1} - x_*\|_2^2\big] \le \|x_{s,t} - x_*\|_2^2 - 2\eta(1-2L\eta)(F(x_{s,t}) - F_*) + 4L\eta^2(F(x_s) - F_*).\]Rearranging,
\[\Ex_{s,t}\big[\|x_{s,t+1} - x_*\|_2^2\big] + 2\eta(1-2L\eta)(F(x_{s,t}) - F_*) \le \|x_{s,t} - x_*\|_2^2 + 4L\eta^2(F(x_s) - F_*).\]Let \(\Ex_s \triangleq \Ex_{s,0}\) be the expectation conditioned on the outputs of the stochastic gradient oracle up to the end of the \((s-1)\)-th epoch. We sum over \(t = 0,\dots,m-1\), take the expectation \(\Ex_s\) over the randomness in the \(s\)-th epoch, and telescope to obtain a bound on the progress of SVRG in an epoch:
\[\begin{aligned} &\textstyle\Ex_s\big[\|x_{s,m} - x_*\|_2^2\big] + 2\eta(1-2L\eta)\sum_{t=0}^{m-1}\Ex_s\big[F(x_{s,t}) - F_*\big] \\ &\quad\le \|x_s - x_*\|_2^2 + 4Lm\eta^2(F(x_s) - F_*) \\ &\textstyle\quad\le (\frac2\mu + 4Lm\eta^2)(F(x_s) - F_*), \end{aligned}\]where we recalled \(x_{s,0} = x_s\) in the first step and appealed to the \(\mu\)-strong convexity of \(F\), \(\frac\mu2\|x_s - x_*\|_2^2 \le F(x_s) - F_*\), in the second step. The left side of the preceding expression is at least
\[\begin{aligned} &\textstyle\Ex_s\big[\|x_{s,m} - x_*\|_2^2\big] + 2\eta(1-2L\eta)\sum_{t=0}^{m-1}\Ex_s\big[F(x_{s,t}) - F_*\big] \\ &\textstyle\quad\ge 2\eta(1-2L\eta)\sum_{t=0}^{m-1}\Ex_s\big[F(x_{s,t}) - F_*\big] \\ &\quad= 2\eta(1-2L\eta)m\Ex_s\big[F(x_{s+1}) - F_*\big], \end{aligned}\]where we recalled \(x_{s+1} \sim \unif\{x_{s,0},\dots,x_{s,m-1}\}\) in the second step. Thus
\[\textstyle 2\eta(1-2L\eta)m\Ex_s\big[F(x_{s+1}) - F_*\big] \le (\frac2\mu + 4Lm\eta^2)(F(x_s) - F_*).\]Rearranging,
\[\textstyle\Ex_s\big[F(x_{s+1}) - F_*\big] \le \frac{\frac2\mu + 4Lm\eta^2}{2\eta(1-2L\eta)m}(F(x_s) - F_*).\]We recognize the constant in front of \(F(x_s) - F_*\) on the right side as the contraction factor \(\gamma\); taking total expectations and iterating over the \(s\) epochs completes the proof.
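To make the analyzed variant concrete, here is a minimal end-to-end sketch (not the lecture's reference implementation) on the same made-up least squares problem; the step size \(\eta\), epoch length \(m\), and epoch count are illustration values.

```python
import numpy as np

# Minimal sketch of the analyzed SVRG variant on a made-up least squares
# problem: each epoch recomputes the full gradient at the anchor x_s, runs m
# inner steps with the variance-reduced estimate g_{s,t}, and draws the next
# anchor uniformly from the epoch's iterates x_{s,0}, ..., x_{s,m-1}.
rng = np.random.default_rng(3)
n, d = 200, 5
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
x_star = np.linalg.lstsq(A, b, rcond=None)[0]
F = lambda x: 0.5 * np.mean((A @ x - b)**2)

eta, m = 0.01, 2 * n     # illustration values; eta should satisfy eta < 1/(2L)
x_s = np.zeros(d)
for s in range(20):
    anchor_grad = A.T @ (A @ x_s - b) / n          # full gradient at the anchor
    x, iterates = x_s.copy(), []
    for t in range(m):
        iterates.append(x.copy())                  # record x_{s,t}
        i = rng.integers(n)
        g = ((A[i] @ x - b[i]) * A[i]              # grad f_i(x_{s,t})
             - (A[i] @ x_s - b[i]) * A[i]          # - grad f_i(x_s)
             + anchor_grad)                        # + grad F(x_s)
        x = x - eta * g
    x_s = iterates[rng.integers(m)]                # x_{s+1} ~ unif{x_{s,t}}
    print(f"epoch {s}: F(x_s) - F_* = {F(x_s) - F(x_star):.3e}")
```

The per-epoch suboptimality should shrink by a roughly constant factor, matching the theorem's geometric rate.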
Posted on April 07, 2021 from San Francisco, CA.