This post accompanies the lecture video on proximal methods. Please see the video for the problem setup and definitions.
We start with the definition of the convex conjugate:
\[f^*(y) \triangleq \sup\nolimits_x\langle y,x\rangle - f(x).\]The zero-subgradient condition implies \(y\in\partial f(x_*)\), where \(x_*\) is the argmax. This shows
\[\langle x,y\rangle = f(x) + f^*(y) \iff y\in\partial f(x).\]To deduce \(x\in\partial f^*(y)\), we recall \(f = f^{**}\). Thus \(f\) has the variational form
\[f(x) = \sup\nolimits_y\langle x,y\rangle - f^*(y).\]We repeat the preceding argument to obtain
\[\begin{aligned} \langle x,y\rangle &= f^*(y) + f^{**}(x) \\ &= f^*(y) + f(x) \end{aligned}\iff x\in\partial f^*(y).\]Together with the earlier equivalence, this shows \(y\in\partial f(x)\iff x\in\partial f^*(y)\), i.e. \(\partial f^* = (\partial f)^{-1}\).
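As a quick sanity check (an example of our own choosing, not from the lecture), take \(f(x) = \frac{1}{2}x^2\) on \(\mathbb{R}\), so that \(f^*(y) = \frac{1}{2}y^2\). Then
\[\langle x,y\rangle = f(x) + f^*(y) \iff xy = \tfrac{1}{2}x^2 + \tfrac{1}{2}y^2 \iff (x-y)^2 = 0 \iff y = x,\]which is precisely \(y\in\partial f(x) = \{x\}\) and, symmetrically, \(x\in\partial f^*(y) = \{y\}\).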
The next property of proximal operators generalizes the analogous property of projection maps. Let \(\def\prox{\text{prox}}x_1' \triangleq \prox_f(x_1)\) and \(x_2'\triangleq\prox_f(x_2)\). The zero-subgradient condition implies
\[\begin{aligned} x_1 - x_1'\in\partial f(x_1'), \\ x_2 - x_2'\in\partial f(x_2'). \end{aligned}\]The definition of the subgradient implies
\[\begin{aligned} f(x_1') \ge f(x_2') + \langle x_2 - x_2', x_1' - x_2'\rangle, \\ f(x_2') \ge f(x_1') + \langle x_1 - x_1', x_2' - x_1'\rangle. \end{aligned}\]We add the preceding inequalities (and simplify) to obtain
\[\langle x_2 - x_2' + (x_1' - x_1),x_1' - x_2'\rangle \le 0.\]Rearranging, we have
\[\|x_1' - x_2'\|_2^2 \le \langle x_1 - x_2,x_1' - x_2'\rangle.\]This is known as firm non-expansiveness. By the Cauchy-Schwarz inequality, it implies non-expansiveness, \(\|\prox_f(x_1) - \prox_f(x_2)\|_2 \le \|x_1 - x_2\|_2\) (hence its name).
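As a numeric illustration (not from the lecture), the sketch below checks firm non-expansiveness for the proximal operator of \(f(x) = \lambda\|x\|_1\), whose prox is the soft-thresholding operator; the choice of \(f\), \(\lambda\), and all variable names are assumptions made for the example.

```python
import numpy as np

def prox_l1(x, lam):
    """Proximal operator of f(x) = lam * ||x||_1, i.e. soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

rng = np.random.default_rng(0)
lam = 0.5
for _ in range(1000):
    x1, x2 = rng.normal(size=5), rng.normal(size=5)
    p1, p2 = prox_l1(x1, lam), prox_l1(x2, lam)
    lhs = np.dot(p1 - p2, p1 - p2)   # ||x_1' - x_2'||^2
    rhs = np.dot(x1 - x2, p1 - p2)   # <x_1 - x_2, x_1' - x_2'>
    assert lhs <= rhs + 1e-12        # firm non-expansiveness
print("firm non-expansiveness held on all random pairs")
```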
Let \(x' \triangleq \prox_f(x)\). The zero-subgradient condition implies \(x - x'\in\partial f(x')\). We recall \(\partial f^* = (\partial f)^{-1}\) to obtain \(x' \in \partial f^*(x - x')\). This is equivalent to
\[x - (x-x') \in \partial f^*(x - x'),\]which is the zero-subgradient condition showing that \(x - x' = \prox_{f^*}(x)\). Thus
\[x = x' + (x-x') = \prox_f(x) + \prox_{f^*}(x).\]This identity is the Moreau decomposition.
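As a concrete instance (again our own example, not from the lecture), take \(f(x) = \lambda|x|\) on \(\mathbb{R}\), whose conjugate \(f^*\) is the indicator of \([-\lambda,\lambda]\). Then \(\prox_f\) is soft-thresholding and \(\prox_{f^*}\) is clipping,
\[\prox_f(x) = \operatorname{sign}(x)\max(|x| - \lambda, 0),\qquad \prox_{f^*}(x) = \max(-\lambda,\min(x,\lambda)),\]and checking the three cases \(x < -\lambda\), \(|x|\le\lambda\), and \(x > \lambda\) confirms that the two terms sum to \(x\).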
We now establish the basic inequality for the proximal gradient method:
\[f(x_{t+1}) - f(x) \le \frac{L}{2}(\|x - x_t\|_2^2 - \|x - x_{t+1}\|_2^2) - g(x,x_t),\]where \(\def\sm{\text{sm}}\def\ns{\text{ns}}g(x,x_t) \triangleq f_{\sm}(x) - f_{\sm}(x_t) - \langle\nabla f_{\sm}(x_t),x-x_t\rangle\). Define
\[\tilde{f}_t(x) \triangleq f_{\sm}(x_t) + \langle\nabla f_{\sm}(x_t),x - x_t\rangle + \frac{L}{2}\|x - x_t\|_2^2 + f_{\ns}(x).\]We note that \(g\) is the difference between \(f_{\sm}(x)\) and the value at \(x\) of the linearization of \(f_{\sm}\) at \(x_t\). It is not hard to check that \(x_{t+1} = \operatorname{argmin}_x\tilde{f}_t(x)\) (see the remark after this derivation), and that \(\tilde{f}_t\) majorizes \(f\). The latter is a consequence of the \(L\)-strong smoothness of \(f_{\sm}\). We note that \(\tilde{f}_t\) is \(L\)-strongly convex, so
\[\begin{aligned} \tilde{f}_t(x) &\ge \tilde{f}_t(x_{t+1}) + \frac{L}{2}\|x - x_{t+1}\|_2^2 \\ &\ge f(x_{t+1}) + \frac{L}{2}\|x - x_{t+1}\|_2^2, \end{aligned}\]where we recalled \(x_{t+1}\) minimizes \(\tilde{f}_t\) in the first step and \(\tilde{f}_t\) majorizes \(f\) in the second step. We recognize the left side as
\[\tilde{f}_t(x) = f(x) - g(x,x_t) + \frac{L}{2}\|x - x_t\|_2^2\]and rearrange to obtain the basic inequality.
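For completeness (a standard identity, stated here as a remark since the update rule itself is specified in the video), completing the square in \(\tilde{f}_t\) identifies its minimizer with the proximal gradient step, assuming the definition of \(\prox\) used above:
\[x_{t+1} = \operatorname{argmin}\nolimits_x\tilde{f}_t(x) = \operatorname{argmin}\nolimits_x\Big\{\frac{L}{2}\big\|x - \big(x_t - \tfrac{1}{L}\nabla f_{\sm}(x_t)\big)\big\|_2^2 + f_{\ns}(x)\Big\} = \prox_{\frac{1}{L}f_{\ns}}\big(x_t - \tfrac{1}{L}\nabla f_{\sm}(x_t)\big).\]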
It remains to show the proximal gradient method converges linearly. Letting \(x = x_*\) in the basic inequality, we have
\[\begin{aligned} f(x_{t+1}) - f(x_*) &\le \frac{L}{2}(\|x_* - x_t\|_2^2 - \|x_* - x_{t+1}\|_2^2) - g(x_*,x_t) \\ &\le \frac{L-\mu}{2}\|x_* - x_t\|_2^2 - \frac{L}{2}\|x_* - x_{t+1}\|_2^2, \end{aligned}\]where we appealed to the \(\mu\)-strong convexity of \(f_{\sm}\) (which gives \(g(x_*,x_t)\ge\frac{\mu}{2}\|x_* - x_t\|_2^2\)) in the second step. We recognize \(f(x_{t+1}) - f(x_*) \ge 0\) and rearrange to obtain
\[\|x_t - x_*\|_2^2 \le (1 - \frac{\mu}{L})\|x_{t-1} - x_*\|_2^2.\]We iterate this inequality over the first \(T-1\) iterations to obtain the stated result.
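To make the method concrete, here is a minimal sketch (our own, not the lecture's code) of the iteration \(x_{t+1} = \prox_{\frac{1}{L}f_{\ns}}(x_t - \frac{1}{L}\nabla f_{\sm}(x_t))\) on a lasso-style objective; the problem data and all names are assumptions made for illustration.

```python
import numpy as np

def prox_l1(x, t):
    """Proximal operator of t * ||.||_1 (soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

# Lasso-style instance: f_sm(x) = 0.5 * ||A x - b||^2, f_ns(x) = lam * ||x||_1.
rng = np.random.default_rng(0)
n, d, lam = 50, 10, 0.1
A, b = rng.normal(size=(n, d)), rng.normal(size=n)

L = np.linalg.norm(A.T @ A, 2)  # smoothness constant of f_sm (largest eigenvalue of A^T A)
x = np.zeros(d)
for t in range(200):
    grad = A.T @ (A @ x - b)             # gradient of f_sm at x_t
    x = prox_l1(x - grad / L, lam / L)   # x_{t+1} = prox_{f_ns / L}(x_t - grad / L)

obj = 0.5 * np.linalg.norm(A @ x - b) ** 2 + lam * np.abs(x).sum()
print(f"objective after 200 iterations: {obj:.6f}")
```

With \(A\) of full column rank, \(f_{\sm}\) is \(\mu\)-strongly convex, so the squared distance to the minimizer contracts by the factor \(1 - \mu/L\) per iteration, matching the bound above.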
Posted on February 09, 2021 from San Francisco, CA.