This post accompanies the lecture video on stochastic optimization. Please see the video for the problem setup and definitions.
First, we study the convergence of SGD with a fixed step size \(\eta < \frac1L\). Since \(F\) is \(L\)-strongly smooth,
\[\begin{aligned} F(x_{t+1}) - F(x_t) &\le \langle\partial F(x_t),x_{t+1}-x_t\rangle + \frac{L}{2}\|x_{t+1} - x_t\|_2^2 \\ &= -\eta\langle\partial F(x_t),g(x_t,\xi_t)\rangle + \frac{L}{2}\eta^2\|g(x_t,\xi_t)\|_2^2, \end{aligned}\]where we plugged in the SGD update rule in the second step. Conditioning on \(\cH_t\triangleq\sigma(\xi_0,\dots,\xi_{t-1})\), we have
\[\begin{aligned} \Ex_t\big[F(x_{t+1})\big] - F(x_t) &\le -\eta\|\partial F(x_t)\|_2^2 + \frac{L}{2}\eta^2\Ex_t\big[\|g(x_t,\xi_t)\|_2^2\big] \\ &\le -(1 - \frac{L}{2}\eta)\eta\|\partial F(x_t)\|_2^2 + \frac{L}{2}\eta^2\sigma_g^2, \end{aligned}\]where we used the decomposition \(\Ex_t\big[\|g(x_t,\xi_t)\|_2^2\big] = \|\partial F(x_t)\|_2^2 + \Ex_t\big[\|g(x_t,\xi_t) - \partial F(x_t)\|_2^2\big]\) in the second step and let \(\sigma_g^2\) denote a uniform bound on the conditional variance \(\Ex_t\big[\|g(x_t,\xi_t) - \partial F(x_t)\|_2^2\big]\). For any \(\eta < \frac1L\), \(1 - \frac{L}{2}\eta > \frac12\), so the preceding expression simplifies to
\[\Ex_t\big[F(x_{t+1})\big] - F(x_t) \le -\frac{\eta}{2}\|\partial F(x_t)\|_2^2 + \frac{L}{2}\eta^2\sigma_g^2.\]As long as \(F\) is \(\mu\)-strongly convex, \(\mu(F(x_t) - F_*) \le \frac12\|\partial F(x_t)\|_2^2\), so
\[\Ex_t\big[F(x_{t+1})\big] - F(x_t) \le -\eta\mu(F(x_t) - F_*)+ \frac{L}{2}\eta^2\sigma_g^2.\]We subtract \(F_*\) from both sides and rearrange to obtain
\[\Ex_t\big[F(x_{t+1})\big] - F_* \le (1-\eta\mu)(F(x_t) - F_*)+ \frac{L}{2}\eta^2\sigma_g^2.\]We unravel the recursion to obtain
\[\Ex\big[F(x_T)\big] - F_* \le (1-\eta\mu)^T(F(x_0) - F_*)+ \frac{L}{2}\eta^2\sigma_g^2\sum_{s=0}^{T-1}(1-\eta\mu)^s.\]Recalling \(\eta <\frac1L\) and \(\mu \le L\), we have \(0 \le 1-\eta\mu < 1\), so we can bound the geometric series by \(\frac{1}{\eta\mu}\) to obtain
\[\Ex\big[F(x_T)\big] - F_* \le (1-\eta\mu)^T(F(x_0) - F_*) + \frac{L}{2\mu}\eta\sigma_g^2.\]The first term decays linearly in \(T\), while the second is a noise floor proportional to the step size \(\eta\); the simulation sketch below illustrates both effects.
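The following is a minimal sketch (not from the lecture) of fixed-step-size SGD on the toy objective \(F(x) = \frac12\|x\|_2^2\), for which \(\mu = L = 1\), \(x_* = 0\), and \(F_* = 0\); the noise scale and step size are made-up illustration values.

```python
import numpy as np

# Minimal sketch: fixed-step SGD on F(x) = 0.5 * ||x||_2^2 (mu = L = 1,
# x_* = 0, F_* = 0). The noise scale below is a made-up illustration value.
rng = np.random.default_rng(0)
d, eta, noise, T = 10, 0.1, 1.0, 2000          # eta < 1/L = 1

x = rng.standard_normal(d)
for t in range(T):
    g = x + noise * rng.standard_normal(d)     # unbiased estimate of grad F(x) = x
    x = x - eta * g                            # SGD update

sigma_g_sq = d * noise**2                      # E||g - grad F||_2^2 for this model
print("F(x_T) - F_* =", 0.5 * x @ x)           # stalls near the noise floor
print("noise floor  =", eta * sigma_g_sq / 2)  # (L / (2 mu)) * eta * sigma_g^2
```

Running this, the suboptimality should hover near the predicted noise floor rather than converging to zero. Second, we study the convergence of SGD with diminishing step sizes \(\eta_t\gets\frac{1}{t+1}\). The SGD update rule implies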
\[\begin{aligned} \|x_{t+1} - x_*\|_2^2 &= \|x_t - \eta_t g(x_t,\xi_t) - x_*\|_2^2 \\ &= \|x_t - x_*\|_2^2 - 2\eta_t\langle g(x_t,\xi_t),x_t-x_*\rangle + \eta_t^2\|g(x_t,\xi_t)\|_2^2. \end{aligned}\]Conditioning on \(\cH_t\), we have
\[\Ex_t\langle g(x_t,\xi_t),x_t-x_*\rangle = \langle\partial F(x_t),x_t-x_*\rangle \ge \mu\|x_t-x_*\|_2^2,\]where we used the unbiasedness of \(g(x_t,\xi_t)\) in the first step and the \(\mu\)-strong convexity of \(F\) (together with \(\partial F(x_*) = 0\)) in the second step. Thus
\[\Ex_t\big[\|x_{t+1} - x_*\|_2^2\big] \le (1-2\mu\eta_t)\|x_t - x_*\|_2^2 + \eta_t^2\sigma_g^2,\]where we additionally assumed the second moment of the stochastic gradient is bounded: \(\Ex_t\big[\|g(x_t,\xi_t)\|_2^2\big] \le \sigma_g^2\). With suitably scaled step sizes (e.g., \(\eta_t = \frac{1}{\mu(t+1)}\)), unrolling this recursion by induction yields \(\Ex\big[\|x_t - x_*\|_2^2\big] = O(1/t)\), as the sketch below suggests numerically.
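Here is the same toy problem as above (so \(\mu = 1\) and the post's \(\eta_t = \frac{1}{t+1}\) matches the scaled choice); again a sketch with made-up constants rather than the lecture's setup.

```python
import numpy as np

# Minimal sketch: SGD with diminishing step sizes eta_t = 1 / (t + 1) on
# F(x) = 0.5 * ||x||_2^2 (mu = 1), same made-up noise model as above.
rng = np.random.default_rng(1)
d, noise, T = 10, 1.0, 10_000

x = rng.standard_normal(d)
for t in range(T):
    eta_t = 1.0 / (t + 1)
    g = x + noise * rng.standard_normal(d)  # unbiased stochastic gradient
    x = x - eta_t * g

# ||x_T - x_*||^2 keeps shrinking (roughly like 1/T) instead of stalling.
print("||x_T - x_*||^2 =", x @ x)
print("O(1/T) scale    =", d * noise**2 / T)
```

Third, we study the convergence of SVRG. To keep the proof simple, we analyze a version that picks an anchor point uniformly at random from the iterates of the previous epoch: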
\[x_s \sim \unif\{x_{s-1,0},\dots,x_{s-1,m-1}\}.\]Theorem (linear convergence of SVRG): Let the \(f_i\)’s be convex and \(L\)-strongly smooth and \(F\) be \(\mu\)-strongly convex. For any step size \(\eta \in (0,\frac{1}{2L})\), as long as the epoch length \(m\) is large enough so that the contraction factor \(\gamma \triangleq \frac{1}{\mu\eta(1-2L\eta)m} + \frac{2L\eta}{1-2L\eta}\) is less than 1, the major iterates of SVRG converge linearly:
\[\Ex\big[F(x_s) - F_*\big] \le \gamma^s(F(x_{0,0}) - F_*).\]For example, \(\eta = \frac{1}{10L}\) and \(m = \frac{50L}{\mu}\) give \(\gamma = \frac14 + \frac14 = \frac12\). Proof: Define \(\Ex_{s,t}\) as the expectation conditioned on the outputs of the stochastic gradient oracle up to the end of the \((t-1)\)-th iteration of the \(s\)-th epoch and
\[g_{s,t} \triangleq \partial f_{i_{s,t}}(x_{s,t}) - \partial f_{i_{s,t}}(x_s) + \partial F(x_s)\]as the SVRG gradient estimate. We expand the SVRG update rule to obtain
\[\begin{aligned} \Ex_{s,t}\big[\|x_{s,t+1} - x_*\|_2^2\big] &= \Ex_{s,t}\big[\|x_{s,t} - \eta g_{s,t} - x_*\|_2^2\big] \\ &= \|x_{s,t} - x_*\|_2^2 - 2\eta\Ex_{s,t}\langle g_{s,t},x_{s,t} - x_*\rangle + \eta^2\Ex_{s,t}\big[\|g_{s,t}\|_2^2\big] \\ &= \|x_{s,t} - x_*\|_2^2 - 2\eta\langle\partial F(x_{s,t}),x_{s,t} - x_*\rangle + \eta^2\Ex_{s,t}\big[\|g_{s,t}\|_2^2\big] \\ &\le \|x_{s,t} - x_*\|_2^2 - 2\eta(F(x_{s,t}) - F_*) + \eta^2\Ex_{s,t}\big[\|g_{s,t}\|_2^2\big], \end{aligned}\]where we appealed to the unbiasedness of \(g_{s,t}\) in the third step and the convexity of \(F\) in the fourth step. We expand \(\Ex_{s,t}\big[\|g_{s,t}\|_2^2\big]\) to obtain
\[\begin{aligned} \Ex_{s,t}\big[\|g_{s,t}\|_2^2\big] &= \Ex_{s,t}\big[\|\partial f_{i_{s,t}}(x_{s,t}) - \partial f_{i_{s,t}}(x_s) + \partial F(x_s)\|_2^2\big] \\ &= \Ex_{s,t}\big[\|\partial f_{i_{s,t}}(x_{s,t}) - \partial f_{i_{s,t}}(x_*) - (\partial f_{i_{s,t}}(x_s) - \partial f_{i_{s,t}}(x_*) - \partial F(x_s))\|_2^2\big] \\ &\le 2\Ex_{s,t}\big[\|\partial f_{i_{s,t}}(x_{s,t}) - \partial f_{i_{s,t}}(x_*)\|_2^2\big] \\ &\quad+ 2\Ex_{s,t}\big[\|\partial f_{i_{s,t}}(x_s) - \partial f_{i_{s,t}}(x_*) - \partial F(x_s)\|_2^2\big] \\ &= 2\Ex_{s,t}\big[\|\partial f_{i_{s,t}}(x_{s,t}) - \partial f_{i_{s,t}}(x_*)\|_2^2\big] \\ &\quad+ 2\Ex_{s,t}\big[\|\partial f_{i_{s,t}}(x_s) - \partial f_{i_{s,t}}(x_*) - \Ex_{s,t}\big[\partial f_{i_{s,t}}(x_s) - \partial f_{i_{s,t}}(x_*)\big]\|_2^2\big] \\ &\le 2\Ex_{s,t}\big[\|\partial f_{i_{s,t}}(x_{s,t}) - \partial f_{i_{s,t}}(x_*)\|_2^2\big] + 2\Ex_{s,t}\big[\|\partial f_{i_{s,t}}(x_s) - \partial f_{i_{s,t}}(x_*)\|_2^2\big], \end{aligned}\]where we used \(\|a - b\|_2^2 \le 2\|a\|_2^2 + 2\|b\|_2^2\) in the third step and plugged
\[\partial F(x_s) = \partial F(x_s) - \partial F(x_*) = \Ex_{s,t}\big[\partial f_{i_{s,t}}(x_s) - \partial f_{i_{s,t}}(x_*)\big]\]into the fourth step; the final step uses the fact that the variance of a random vector is at most its second moment. To bound the right side of the preceding expression, we appeal to the convexity and strong smoothness of the \(f_i\)’s. The \(L\)-strong smoothness of \(f_i\) implies
\[\textstyle\frac{1}{2L}\|\partial f_i(x) - \partial f_i(x_*)\|_2^2 \le f_i(x) - f_i(x_*) - \langle\partial f_i(x_*),x-x_*\rangle.\]We sum over \(i\in[n]\) to obtain
\[\begin{aligned} \textstyle\frac{1}{2L}\sum_{i=1}^n\|\partial f_i(x) - \partial f_i(x_*)\|_2^2 &\le nF(x) - nF(x_*) - n\langle\partial F(x_*),x-x_*\rangle \\ &= n(F(x) - F_*), \end{aligned}\]where the equality uses \(\partial F(x_*) = 0\). Since \(i_{s,t}\) is drawn uniformly from \([n]\), dividing by \(n\) gives \(\Ex_{s,t}\big[\|\partial f_{i_{s,t}}(x) - \partial f_{i_{s,t}}(x_*)\|_2^2\big] \le 2L(F(x) - F_*)\), so the bound on \(\Ex_{s,t}\big[\|g_{s,t}\|_2^2\big]\) becomes
\[\Ex_{s,t}\big[\|g_{s,t}\|_2^2\big] \le 4L(F(x_{s,t}) - F_* + F(x_s) - F_*).\]This is the crucial result in this analysis of SVRG: it shows that the size of the SVRG gradient estimate decreases as the iterates and the anchor approach the optimal point, which the numerical check below illustrates.
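As a sanity check (not part of the lecture), the sketch below evaluates \(\Ex_{s,t}\big[\|g_{s,t}\|_2^2\big]\) on a made-up least squares problem \(f_i(x) = \frac12(a_i^\top x - b_i)^2\) at points increasingly close to \(x_*\), and compares it against the second moment of the plain stochastic gradient.

```python
import numpy as np

# Made-up least squares instance: f_i(x) = 0.5 * (a_i @ x - b_i)^2,
# F(x) = (1/n) * sum_i f_i(x), minimized at x_star.
rng = np.random.default_rng(2)
n, d = 200, 5
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
x_star = np.linalg.lstsq(A, b, rcond=None)[0]

def grad_i(x, i):                 # gradient of f_i at x
    return (A[i] @ x - b[i]) * A[i]

def full_grad(x):                 # gradient of F at x
    return A.T @ (A @ x - b) / n

for scale in [1.0, 0.1, 0.01]:    # iterate and anchor at distance ~scale from x_*
    x = x_star + scale * rng.standard_normal(d)
    anchor = x_star + scale * rng.standard_normal(d)
    svrg = np.mean([np.sum((grad_i(x, i) - grad_i(anchor, i)
                            + full_grad(anchor))**2) for i in range(n)])
    plain = np.mean([np.sum(grad_i(x, i)**2) for i in range(n)])
    print(f"scale {scale}: E||g_svrg||^2 = {svrg:.2e}, E||grad f_i||^2 = {plain:.2e}")
```

The SVRG estimate's second moment shrinks with the distance to \(x_*\), while the plain stochastic gradient's does not (since \(\partial f_i(x_*) \ne 0\) in general). We plug this bound into the expansion of the SVRG update rule to obtain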
\[\Ex_{s,t}\big[\|x_{s,t+1} - x_*\|_2^2\big] \le \|x_{s,t} - x_*\|_2^2 - 2\eta(1-2L\eta)(F(x_{s,t}) - F_*) + 4L\eta^2(F(x_s) - F_*).\]Rearranging,
\[\Ex_{s,t}\big[\|x_{s,t+1} - x_*\|_2^2\big] + 2\eta(1-2L\eta)(F(x_{s,t}) - F_*) \le \|x_{s,t} - x_*\|_2^2 + 4L\eta^2(F(x_s) - F_*).\]Let \(\Ex_s \triangleq \Ex_{s,0}\) be the expectation conditioned on the outputs of the stochastic gradient oracle up to the end of the \((s-1)\)-th epoch. We sum over \(t = 0,\dots,m-1\), take the expectation \(\Ex_s\) over the randomness in the \(s\)-th epoch, and telescope to obtain a bound on the progress of SVRG in an epoch:
\[\begin{aligned} &\textstyle\Ex_s\big[\|x_{s,m} - x_*\|_2^2\big] + 2\eta(1-2L\eta)\sum_{t=0}^{m-1}\Ex_s\big[F(x_{s,t}) - F_*\big] \\ &\quad\le \|x_s - x_*\|_2^2 + 4Lm\eta^2(F(x_s) - F_*) \\ &\textstyle\quad\le (\frac2\mu + 4Lm\eta^2)(F(x_s) - F_*), \end{aligned}\]where we recalled \(x_{s,0} = x_s\) in the first step and appealed to the \(\mu\)-strong convexity of \(F\), \(\frac\mu2\|x_s - x_*\|_2^2 \le F(x_s) - F_*\), in the second step. The left side of the preceding expression is at least
\[\begin{aligned} &\textstyle\Ex_s\big[\|x_{s,m} - x_*\|_2^2\big] + 2\eta(1-2L\eta)\sum_{t=0}^{m-1}\Ex_s\big[F(x_{s,t}) - F_*\big] \\ &\textstyle\quad\ge 2\eta(1-2L\eta)\sum_{t=0}^{m-1}\Ex_s\big[F(x_{s,t}) - F_*\big] \\ &\quad= 2\eta(1-2L\eta)m\Ex_s\big[F(x_{s+1}) - F_*\big], \end{aligned}\]where we recalled \(x_{s+1} \sim \unif\{x_{s,0},\dots,x_{s,m-1}\}\) in the second step. Thus
\[\textstyle 2\eta(1-2L\eta)m\Ex_s\big[F(x_{s+1}) - F_*\big] \le (\frac2\mu + 4Lm\eta^2)(F(x_s) - F_*).\]Rearranging,
\[\textstyle\Ex_s\big[F(x_{s+1}) - F_*\big] \le \frac{\frac2\mu + 4Lm\eta^2}{2\eta(1-2L\eta)m}(F(x_s) - F_*).\]We recognize the constant in front of \(F(x_s) - F_*\) on the right side as the contraction factor \(\gamma\); taking total expectations and iterating over the \(s\) epochs completes the proof.
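To make the analyzed variant concrete, here is a minimal end-to-end sketch (not the lecture's reference implementation) on the same made-up least squares problem; the step size \(\eta\), epoch length \(m\), and epoch count are illustration values.

```python
import numpy as np

# Minimal sketch of the analyzed SVRG variant on a made-up least squares
# problem: each epoch recomputes the full gradient at the anchor x_s, runs m
# inner steps with the variance-reduced estimate g_{s,t}, and draws the next
# anchor uniformly from the epoch's iterates x_{s,0}, ..., x_{s,m-1}.
rng = np.random.default_rng(3)
n, d = 200, 5
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
x_star = np.linalg.lstsq(A, b, rcond=None)[0]
F = lambda x: 0.5 * np.mean((A @ x - b)**2)

eta, m = 0.01, 2 * n     # illustration values; eta should satisfy eta < 1/(2L)
x_s = np.zeros(d)
for s in range(20):
    anchor_grad = A.T @ (A @ x_s - b) / n          # full gradient at the anchor
    x, iterates = x_s.copy(), []
    for t in range(m):
        iterates.append(x.copy())                  # record x_{s,t}
        i = rng.integers(n)
        g = ((A[i] @ x - b[i]) * A[i]              # grad f_i(x_{s,t})
             - (A[i] @ x_s - b[i]) * A[i]          # - grad f_i(x_s)
             + anchor_grad)                        # + grad F(x_s)
        x = x - eta * g
    x_s = iterates[rng.integers(m)]                # x_{s+1} ~ unif{x_{s,t}}
    print(f"epoch {s}: F(x_s) - F_* = {F(x_s) - F(x_star):.3e}")
```

The per-epoch suboptimality should shrink by a roughly constant factor, matching the theorem's geometric rate.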
Posted on April 07, 2021 from San Francisco, CA.