This post accompanies the lecture video on proximal methods. Please see the video for the problem setup and definitions.
We start with the definition of the convex conjugate:
\[f^*(y) \triangleq \sup\nolimits_x\langle y,x\rangle - f(x).\]The zero-subgradient condition implies \(y\in\partial f(x_*)\), where \(x_*\) is the argmax. This shows
\[\langle x,y\rangle = f(x) + f^*(y) \iff y\in\partial f(x).\]To deduce \(x\in\partial f^*(y)\), we recall \(f = f^{**}\). Thus \(f\) has the variational form
\[f(x) = \sup\nolimits_y\langle x,y\rangle - f^*(y).\]We repeat the preceding argument to obtain
\[\begin{aligned} \langle x,y\rangle &= f^*(y) + f^{**}(x) \\ &= f^*(y) + f(x) \end{aligned}\iff x\in\partial f^*(y).\]Together with the earlier equivalence, this shows \(y\in\partial f(x)\iff x\in\partial f^*(y)\), i.e. \(\partial f^* = (\partial f)^{-1}\).
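As a quick sanity check (an example of our own choosing, not from the lecture), take \(f(x) = \frac{1}{2}x^2\) on \(\mathbb{R}\), so that \(f^*(y) = \frac{1}{2}y^2\). Then
\[\langle x,y\rangle = f(x) + f^*(y) \iff xy = \tfrac{1}{2}x^2 + \tfrac{1}{2}y^2 \iff (x-y)^2 = 0 \iff y = x,\]which is precisely \(y\in\partial f(x) = \{x\}\) and, symmetrically, \(x\in\partial f^*(y) = \{y\}\).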
The next property of proximal operators generalizes the analogous property of projection maps. Let \(\def\prox{\text{prox}}x_1' \triangleq \prox_f(x_1)\) and \(x_2'\triangleq\prox_f(x_2)\). The zero-subgradient condition implies
\[\begin{aligned} x_1 - x_1'\in\partial f(x_1'), \\ x_2 - x_2'\in\partial f(x_2'). \end{aligned}\]The definition of the subgradient implies
\[\begin{aligned} f(x_1') \ge f(x_2') + \langle x_2 - x_2', x_1' - x_2'\rangle, \\ f(x_2') \ge f(x_1') + \langle x_1 - x_1', x_2' - x_1'\rangle. \end{aligned}\]We add the preceding inequalities (and simplify) to obtain
\[\langle x_2 - x_2' + (x_1' - x_1),x_1' - x_2'\rangle \le 0.\]Rearranging, we have
\[\|x_1' - x_2'\|_2^2 \le \langle x_1 - x_2,x_1' - x_2'\rangle.\]This is known as firm non-expansiveness. By the Cauchy-Schwarz inequality, it implies non-expansiveness, \(\|\prox_f(x_1) - \prox_f(x_2)\|_2 \le \|x_1 - x_2\|_2\) (hence its name).
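As a numeric illustration (not from the lecture), the sketch below checks firm non-expansiveness for the proximal operator of \(f(x) = \lambda\|x\|_1\), whose prox is the soft-thresholding operator; the choice of \(f\), \(\lambda\), and all variable names are assumptions made for the example.

```python
import numpy as np

def prox_l1(x, lam):
    """Proximal operator of f(x) = lam * ||x||_1, i.e. soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

rng = np.random.default_rng(0)
lam = 0.5
for _ in range(1000):
    x1, x2 = rng.normal(size=5), rng.normal(size=5)
    p1, p2 = prox_l1(x1, lam), prox_l1(x2, lam)
    lhs = np.dot(p1 - p2, p1 - p2)   # ||x_1' - x_2'||^2
    rhs = np.dot(x1 - x2, p1 - p2)   # <x_1 - x_2, x_1' - x_2'>
    assert lhs <= rhs + 1e-12        # firm non-expansiveness
print("firm non-expansiveness held on all random pairs")
```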
Let \(x' \triangleq \prox_f(x)\). The zero-subgradient condition implies \(x - x'\in\partial f(x')\). We recall \(\partial f^* = (\partial f)^{-1}\) to obtain \(x' \in \partial f^*(x - x')\). This is equivalent to
\[x - (x-x') \in \partial f^*(x - x'),\]which is the zero-subgradient condition showing that \(x - x' = \prox_{f^*}(x)\). Thus
\[x = x' + (x-x') = \prox_f(x) + \prox_{f^*}(x).\]This identity is the Moreau decomposition.
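As a concrete instance (again our own example, not from the lecture), take \(f(x) = \lambda|x|\) on \(\mathbb{R}\), whose conjugate \(f^*\) is the indicator of \([-\lambda,\lambda]\). Then \(\prox_f\) is soft-thresholding and \(\prox_{f^*}\) is clipping,
\[\prox_f(x) = \operatorname{sign}(x)\max(|x| - \lambda, 0),\qquad \prox_{f^*}(x) = \max(-\lambda,\min(x,\lambda)),\]and checking the three cases \(x < -\lambda\), \(|x|\le\lambda\), and \(x > \lambda\) confirms that the two terms sum to \(x\).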
We now establish the basic inequality for the proximal gradient method:
\[f(x_{t+1}) - f(x) \le \frac{L}{2}(\|x - x_t\|_2^2 - \|x - x_{t+1}\|_2^2) - g(x,x_t),\]where \(\def\sm{\text{sm}}\def\ns{\text{ns}}g(x,x_t) \triangleq f_{\sm}(x) - f_{\sm}(x_t) - \langle\nabla f_{\sm}(x_t),x-x_t\rangle\). Define
\[\tilde{f}_t(x) \triangleq f_{\sm}(x_t) + \langle\nabla f_{\sm}(x_t),x - x_t\rangle + \frac{L}{2}\|x - x_t\|_2^2 + f_{\ns}(x).\]We note that \(g\) is the difference between \(f_{\sm}(x)\) and the value at \(x\) of the linearization of \(f_{\sm}\) at \(x_t\). It is not hard to check that \(x_{t+1} = \operatorname{argmin}_x\tilde{f}_t(x)\) (see the remark after this derivation), and that \(\tilde{f}_t\) majorizes \(f\). The latter is a consequence of the \(L\)-strong smoothness of \(f_{\sm}\). We note that \(\tilde{f}_t\) is \(L\)-strongly convex, so
\[\begin{aligned} \tilde{f}_t(x) &\ge \tilde{f}_t(x_{t+1}) + \frac{L}{2}\|x - x_{t+1}\|_2^2 \\ &\ge f(x_{t+1}) + \frac{L}{2}\|x - x_{t+1}\|_2^2, \end{aligned}\]where we recalled \(x_{t+1}\) minimizes \(\tilde{f}_t\) in the first step and \(\tilde{f}_t\) majorizes \(f\) in the second step. We recognize the left side as
\[\tilde{f}_t(x) = f(x) - g(x,x_t) + \frac{L}{2}\|x - x_t\|_2^2\]and rearrange to obtain the basic inequality.
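For completeness (a standard identity, stated here as a remark since the update rule itself is specified in the video), completing the square in \(\tilde{f}_t\) identifies its minimizer with the proximal gradient step, assuming the definition of \(\prox\) used above:
\[x_{t+1} = \operatorname{argmin}\nolimits_x\tilde{f}_t(x) = \operatorname{argmin}\nolimits_x\Big\{\frac{L}{2}\big\|x - \big(x_t - \tfrac{1}{L}\nabla f_{\sm}(x_t)\big)\big\|_2^2 + f_{\ns}(x)\Big\} = \prox_{\frac{1}{L}f_{\ns}}\big(x_t - \tfrac{1}{L}\nabla f_{\sm}(x_t)\big).\]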
It remains to show the proximal gradient method converges linearly. Letting \(x = x_*\) in the basic inequality, we have
\[\begin{aligned} f(x_{t+1}) - f(x_*) &\le \frac{L}{2}(\|x_* - x_t\|_2^2 - \|x_* - x_{t+1}\|_2^2) - g(x_*,x_t) \\ &\le \frac{L-\mu}{2}\|x_* - x_t\|_2^2 - \frac{L}{2}\|x_* - x_{t+1}\|_2^2, \end{aligned}\]where we appealed to the \(\mu\)-strong convexity of \(f_{\sm}\) (which gives \(g(x_*,x_t)\ge\frac{\mu}{2}\|x_* - x_t\|_2^2\)) in the second step. We recognize \(f(x_{t+1}) - f(x_*) \ge 0\) and rearrange to obtain
\[\|x_t - x_*\|_2^2 \le (1 - \frac{\mu}{L})\|x_{t-1} - x_*\|_2^2.\]We iterate this inequality over the first \(T-1\) iterations to obtain the stated result.
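To make the method concrete, here is a minimal sketch (our own, not the lecture's code) of the iteration \(x_{t+1} = \prox_{\frac{1}{L}f_{\ns}}(x_t - \frac{1}{L}\nabla f_{\sm}(x_t))\) on a lasso-style objective; the problem data and all names are assumptions made for illustration.

```python
import numpy as np

def prox_l1(x, t):
    """Proximal operator of t * ||.||_1 (soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

# Lasso-style instance: f_sm(x) = 0.5 * ||A x - b||^2, f_ns(x) = lam * ||x||_1.
rng = np.random.default_rng(0)
n, d, lam = 50, 10, 0.1
A, b = rng.normal(size=(n, d)), rng.normal(size=n)

L = np.linalg.norm(A.T @ A, 2)  # smoothness constant of f_sm (largest eigenvalue of A^T A)
x = np.zeros(d)
for t in range(200):
    grad = A.T @ (A @ x - b)             # gradient of f_sm at x_t
    x = prox_l1(x - grad / L, lam / L)   # x_{t+1} = prox_{f_ns / L}(x_t - grad / L)

obj = 0.5 * np.linalg.norm(A @ x - b) ** 2 + lam * np.abs(x).sum()
print(f"objective after 200 iterations: {obj:.6f}")
```

With \(A\) of full column rank, \(f_{\sm}\) is \(\mu\)-strongly convex, so the squared distance to the minimizer contracts by the factor \(1 - \mu/L\) per iteration, matching the bound above.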
Posted on February 09, 2021 from San Francisco, CA.