# Duality

**STATS 606:** Computation and Optimization Methods in Statistics, University of Michigan

Including slides by Stephen Boyd and Lieven Vandenberghe.
## Lagrangian duality summary $$ \begin{aligned} &\min\nolimits_{x\in\reals^n} &&f_0(x) \\\\ &\subjectto &&\\{f_i(x) \le 0:\lambda_i\ge 0\\}\_{i=1}^m \\\\ & &&\\{h_i(x) = 0:\nu_i\in\reals\\}_{i=1}^p \end{aligned} $$ Lagrangian: $$\textstyle L(x,\lambda,\nu) \triangleq f_0(x) + \sum_{i=1}^m\lambda_if_i(x) + \sum_{i=1}^p\nu_ih_i(x) $$ The Lagrangian encodes the original problem: $$ \begin{aligned} f(x) &\triangleq \sup\nolimits_{\lambda\in\reals_+^m,\nu\in\reals^p} L(x,\lambda,\nu) \\\\ &= \begin{cases} f_0(x) &\begin{cases}\\{f_i(x) \le 0\\}\_{i=1}^m \\\\ \\{h_i(x) = 0\\}_{i=1}^p\end{cases} \\\\ \infty &\text{otherwise} \end{cases} \end{aligned} $$
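As a sanity check, the "encoding" property can be seen on a one-dimensional toy problem (an assumption for illustration): $\min x^2$ subject to $1 - x \le 0$. Taking the sup of the Lagrangian over $\lambda \ge 0$ returns $x^2$ at feasible points and blows up at infeasible ones:

```python
import numpy as np

# Toy problem (assumed for illustration): min x^2  s.t.  1 - x <= 0.
# sup_{lam >= 0} L(x, lam) = sup_{lam >= 0} x^2 + lam*(1 - x)
# equals x^2 when x is feasible (x >= 1) and diverges otherwise.
def sup_L(x, lam_grid):
    return np.max(x**2 + lam_grid * (1 - x))

lams = np.linspace(0.0, 100.0, 1001)   # finite grid standing in for lam -> inf
assert np.isclose(sup_L(1.5, lams), 1.5**2)  # feasible: sup attained at lam = 0
assert sup_L(0.5, lams) > 10.0               # infeasible: grows without bound
```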
## Lagrangian duality summary primal function & problem: $$ \begin{aligned} f(x) &\triangleq \sup\nolimits_{\lambda\in\reals_+^m,\nu\in\reals^p} L(x,\lambda,\nu) \\\\ p^* &\triangleq \inf\nolimits_xf(x) \\\\ &= \inf\nolimits_{x\in\reals^n}\sup\nolimits_{\lambda\in\reals_+^m,\nu\in\reals^p} L(x,\lambda,\nu) \end{aligned} $$ dual function & problem: $$ \begin{aligned} g(\lambda,\nu) &\triangleq \inf\nolimits_x L(x,\lambda,\nu) \\\\ d^* &\triangleq \sup\nolimits_{\lambda\in\reals_+^m,\nu\in\reals^p}g(\lambda,\nu) \\\\ &= \sup\nolimits_{\lambda\in\reals_+^m,\nu\in\reals^p}\inf\nolimits_{x\in\reals^n} L(x,\lambda,\nu) \end{aligned} $$
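A minimal numerical sketch of weak and strong duality on a toy problem (an assumption for illustration): $\min x^2$ s.t. $1 - x \le 0$, whose dual function $g(\lambda) = \lambda - \lambda^2/4$ is available in closed form:

```python
import numpy as np

# Toy problem: minimize x^2 subject to 1 - x <= 0, so p* = 1 at x = 1.
# Dual function: g(lambda) = inf_x x^2 + lambda*(1 - x) = lambda - lambda^2/4.
lam = np.linspace(0.0, 4.0, 401)
g = lam - lam**2 / 4          # dual function on a grid

d_star = g.max()              # dual optimal value (attained at lambda = 2)
p_star = 1.0                  # primal optimal value

# Weak duality: g(lambda) <= p* for every lambda >= 0.
assert np.all(g <= p_star + 1e-12)
# Strong duality holds here (convex problem + Slater, e.g. x = 2 is strictly feasible).
assert abs(d_star - p_star) < 1e-3
```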
## Ex: optimal transport Given 1. transport cost function $c:\cX\times\cY\to\reals$ 2. two distributions $P\in\Delta(\cX)$, $Q\in\Delta(\cY)$ find a transport plan $\pi:\cX\times\cY\to\reals$ ($\pi(x,y)$ is the mass transported from $x$ to $y$) that 1. transports $P$ to $Q$: $$\textstyle\int_{\cY}\pi(x,y)dy = p(x),\quad\int_{\cX}\pi(x,y)dx = q(y)$$ 2. minimizes the total transport cost $\int_{\cX\times\cY}c(x,y)\pi(x,y)dxdy$
## Ex: optimal transport OT problem (finite $\cX$ and $\cY$): $$ \begin{aligned} &\min\nolimits_{\Pi\in\reals^{m\times n}} &&\Tr(C^\top\Pi) \\\\ &\subjectto &&\Pi1_n = p &&: f\in\reals^m \\\\ & &&\Pi^\top 1_m = q &&: g\in\reals^n \\\\ & &&\Pi\succeq 0 &&:\Lambda\in\reals_+^{m\times n} \end{aligned} $$ * $C_{i,j}$ is the transport cost from $x_i$ to $y_j$ * $\Pi_{i,j}$ is the mass transported from $x_i$ to $y_j$ As long as $p,q\succ 0$, Slater's CQ is satisfied because $pq^\top$ is a strictly feasible point.
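Since the finite OT problem is a linear program, it can be handed to an off-the-shelf LP solver. A sketch using `scipy.optimize.linprog`; the cost matrix and marginals below are assumptions for illustration:

```python
import numpy as np
from scipy.optimize import linprog

# Small OT instance: m = n = 3 points on a line, squared-distance cost.
x = np.array([0.0, 1.0, 2.0])
y = np.array([0.0, 1.0, 2.0])
C = (x[:, None] - y[None, :])**2
m, n = C.shape
p = np.array([0.5, 0.3, 0.2])   # source marginal
q = np.array([0.2, 0.3, 0.5])   # target marginal

# Row-sum and column-sum constraints on the flattened plan vec(Pi).
A_rows = np.kron(np.eye(m), np.ones((1, n)))   # Pi 1_n   = p
A_cols = np.kron(np.ones((1, m)), np.eye(n))   # Pi^T 1_m = q
A_eq = np.vstack([A_rows, A_cols])
b_eq = np.concatenate([p, q])

res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
Pi = res.x.reshape(m, n)

# Primal feasibility and the primal optimal value.
assert np.allclose(Pi.sum(axis=1), p) and np.allclose(Pi.sum(axis=0), q)
print("optimal cost:", res.fun)
```

For this one-dimensional instance with convex cost, the optimizer recovers the monotone coupling.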
## Ex: optimal transport OT Lagrangian: $$ \begin{aligned} L(\Pi,f,g,\Lambda) &\triangleq \Tr(C^\top\Pi) +f^\top(p - \Pi 1_n) + g^\top(q - \Pi^\top 1_m) \\\\ &\quad{\color{red}-} \Tr(\Lambda^\top\Pi) \end{aligned} $$ OT dual function (written $D$ to avoid clashing with the dual variable $g$): $$D(f,g,\Lambda) = \begin{cases}f^\top p + g^\top q & 0 = C - f1_n^\top - 1_mg^\top - \Lambda\\\\ -\infty & \text{otherwise}\end{cases}$$ OT dual problem: $$ \begin{aligned} &\max\nolimits_{f\in\reals^m,g\in\reals^n} &&f^\top p + g^\top q \\\\ &\subjectto && f1_n^\top + 1_mg^\top \preceq C \end{aligned} $$
## Ex: SVM dual SVM problem: $$ \begin{aligned} &\min\nolimits_{\beta_0\in\reals,\beta\in\reals^p} &&\textstyle\frac12\\|\beta\\|\_2^2 \\\\ &\subjectto && \\{Y_i(\beta_0 + \beta^\top X_i) \ge 1:\alpha_i \ge 0\\}_{i=1}^n \end{aligned} $$ * $\\{(X_i,Y_i)\\}_{i=1}^n$: training data * $\ones\\{\beta^\top x + \beta_0 \ge 0\\}$: linear classifier * $\alpha_i$: Lagrange multipliers Slater's CQ is equivalent to the two classes being strictly separable in the training data.
## Ex: SVM dual SVM Lagrangian: $$\textstyle L(\beta_0,\beta,\alpha) \triangleq \frac12\\|\beta\\|\_2^2 + \sum_{i=1}^n\alpha_i(1 - Y_i(\beta_0 + \beta^\top X_i)) $$ SVM dual function: $$ g(\alpha) = \begin{cases}\sum_{i=1}^n\alpha_i - \frac12\sum_{i,j=1}^n\alpha_i\alpha_jY_iY_jX_i^\top X_j &\alpha^\top Y = 0 \\\\ -\infty &\text{otherwise}\end{cases} $$ SVM dual problem: $$ \begin{aligned} &\max\nolimits_{\alpha\in\reals_+^n} &&\textstyle\sum_{i=1}^n\alpha_i - \frac12\sum_{i,j=1}^n\alpha_i\alpha_jY_iY_jX_i^\top X_j \\\\ &\subjectto &&\alpha^\top Y = 0. \end{aligned} $$
## Ex: optimal transport OT problem: $$ \begin{aligned} &\min\nolimits_{\Pi\in\reals^{m\times n}} &&\Tr(C^\top\Pi) \\\\ &\subjectto &&\Pi1_n = p &&: f\in\reals^m \\\\ & &&\Pi^\top 1_m = q &&: g\in\reals^n \\\\ & &&\Pi\succeq 0 &&:\Lambda\in\reals_+^{m\times n} \end{aligned} $$ OT KKT conditions: $$ \begin{aligned} C - f1_n^\top - 1_mg^\top - \Lambda = 0&&\text{(stationarity)} \\\\ \Pi1_n = p, \Pi^\top 1_m = q, \Pi\succeq 0 &&\text{(primal feasibility)} \\\\ \Lambda\succeq 0 &&\text{(dual feasibility)} \\\\\textstyle \Tr(\Lambda^\top\Pi) = 0 &&\text{(comp. slackness)} \end{aligned} $$
## Ex: optimal transport OT dual problem: $$ \begin{aligned} &\max\nolimits_{f\in\reals^m,g\in\reals^n} && f^\top p + g^\top q \\\\ &\subjectto && f 1_n^\top + 1_mg^\top \preceq C \equiv\\{f_i + g_j \le C_{i,j}\\}_{i,j = 1}^{m,n} \end{aligned} $$ The $c$-concavity argument shows that $$ \begin{aligned} f_i = \min\nolimits_j(C_{i,j} - g_j),\quad i\in[m], \\\\ g_j = \min\nolimits_i(C_{i,j} - f_i),\quad j\in[n]. \end{aligned} $$ Complementary slackness + stationarity: $$ \begin{aligned} \Pi_{i,j} > 0 &\Rightarrow \Lambda_{i,j} = 0 \\\\ &\equiv f_i + g_j = C_{i,j} \end{aligned} $$
## Ex: Sinkhorn algorithm entropy regularized OT problem (finite $\cX$ and $\cY$): $$ \begin{aligned} &\min\nolimits_{\Pi\in\reals^{m\times n}} &&\Tr(C^\top\Pi) + \eps \Tr(\Pi^\top(\log\Pi - 1_m1_n^\top)) \\\\ &\subjectto &&\Pi1_n = p &&: f\in\reals^m \\\\ & &&\Pi^\top 1_m = q &&: g\in\reals^n \end{aligned} $$ The non-negativity constraint is implicit in the domain of log. entropy regularized OT KKT conditions: $$ \begin{aligned}\textstyle \Pi = \diag(\exp(\frac{f}{\eps}))\exp(-\frac{C}{\eps})\diag(\exp(\frac{g}{\eps})) &&\text{(stationarity)} \\\\ \begin{aligned} \Pi1_n = p \\\\ \Pi^\top1_m = q \end{aligned} &&\text{(primal feasibility)} \end{aligned} $$
## Ex: Sinkhorn algorithm *(figure: OT map vs. entropy regularized OT map)*
## Ex: Sinkhorn algorithm From stationarity, $$\textstyle \Pi = \diag(u)\exp(-\frac1\eps C)\diag(v)\text{ for some }u\in\reals^m,v\in\reals^n. $$ From primal feasibility, we see that $u$, $v$ satisfy $$ \begin{aligned}\textstyle \diag(u)\exp(-\frac1\eps C)\diag(v)1_n = p, \\\\\textstyle \diag(v)\exp(-\frac1\eps C)^\top\diag(u)1_m = q. \end{aligned} $$ Sinkhorn algorithm (with elementwise division): $$ \begin{aligned}\textstyle u_{t+1} \gets \frac{p}{\exp(-\frac1\eps C)v_t}\\\\\textstyle v_{t+1} \gets \frac{q}{\exp(-\frac1\eps C)^\top u_{t+1}} \end{aligned} $$
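A minimal NumPy sketch of Sinkhorn with kernel $K = \exp(-C/\eps)$ (the data and $\eps$ below are assumptions for illustration); each half-step rescales $u$ to match the row marginals and $v$ to match the column marginals:

```python
import numpy as np

# Sinkhorn iterations for entropy-regularized OT, a minimal sketch.
x = np.linspace(0.0, 1.0, 5)
C = (x[:, None] - x[None, :])**2       # assumed squared-distance cost
p = np.full(5, 0.2)                    # uniform source marginal
q = np.array([0.1, 0.2, 0.4, 0.2, 0.1])

eps = 0.5                              # regularization strength (assumed)
K = np.exp(-C / eps)
u = np.ones(5)
v = np.ones(5)
for _ in range(500):
    u = p / (K @ v)                    # match row marginals
    v = q / (K.T @ u)                  # match column marginals

Pi = np.diag(u) @ K @ np.diag(v)       # recovered transport plan
assert np.allclose(Pi.sum(axis=1), p, atol=1e-8)
assert np.allclose(Pi.sum(axis=0), q, atol=1e-8)
```

Smaller $\eps$ sharpens the plan toward the unregularized LP solution but slows convergence and can underflow `K`; log-domain implementations address this.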
## Ex: support vectors SVM problem: $$ \begin{aligned} &\min\nolimits_{\beta_0\in\reals,\beta\in\reals^p} &&\textstyle\frac12\\|\beta\\|\_2^2 \\\\ &\subjectto && \\{1 - Y_i(\beta_0 + \beta^\top X_i) \le 0:\alpha_i \ge 0\\}_{i=1}^n \end{aligned} $$ SVM KKT conditions: $$ \begin{aligned} \begin{aligned}\textstyle 0 = \sum_{i=1}^n\alpha_iY_i \\\\\textstyle \beta = \sum_{i=1}^n\alpha_iX_iY_i \end{aligned}&&\text{(stationarity)} \\\\ \\{Y_i(\beta_0 + \beta^\top X_i) \ge 1\\}\_{i=1}^n &&\text{(primal feasibility)} \\\\ \alpha\succeq 0 &&\text{(dual feasibility)} \\\\\textstyle 0 = \sum_{i=1}^n\alpha_i(1 - Y_i(\beta_0 + \beta^\top X_i)) &&\text{(comp. slackness)} \end{aligned} $$
## Ex: support vectors From stationarity $$\textstyle \beta = \sum_{i=1}^n\alpha_iX_iY_i,\quad\alpha\in\reals_+^n, $$ we see that only the $(X_i,Y_i)$'s such that $\alpha_i > 0$ (directly) affect $\beta$. From complementary slackness $$\textstyle 0 = \sum_{i=1}^n\alpha_i(1 - Y_i(\beta_0 + \beta^\top X_i)), $$ every term in the sum is nonpositive, so each term must vanish; hence $\alpha_i > 0$ implies $Y_i(\beta_0 + \beta^\top X_i) = 1$. Training samples such that $\alpha_i > 0$ (i.e. $Y_i(\beta_0 + \beta^\top X_i) = 1$) are called **support vectors** because they "support" the decision boundary.
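One way to see support vectors numerically: solve the SVM dual as a small QP with `scipy.optimize.minimize` (a generic solver, not a dedicated SVM routine; the four training points are assumptions for illustration) and inspect which $\alpha_i$ are positive:

```python
import numpy as np
from scipy.optimize import minimize

# Tiny separable data set (assumed): two points per class on the line y = x.
X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -2.0]])
Y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(Y)
Q = (Y[:, None] * Y[None, :]) * (X @ X.T)    # Q_ij = Y_i Y_j X_i^T X_j

# Maximize the dual = minimize its negation, s.t. alpha >= 0 and alpha^T Y = 0.
res = minimize(
    lambda a: 0.5 * a @ Q @ a - a.sum(),
    x0=np.zeros(n),
    jac=lambda a: Q @ a - np.ones(n),
    bounds=[(0, None)] * n,
    constraints=[{"type": "eq", "fun": lambda a: a @ Y}],
    method="SLSQP",
)
alpha = res.x
beta = (alpha * Y) @ X                       # stationarity: beta = sum_i alpha_i Y_i X_i
i = np.argmax(alpha)                         # any support vector fixes beta_0
beta0 = Y[i] - X[i] @ beta

# Support vectors (alpha_i > 0) sit exactly on the margin.
sv = alpha > 1e-4
margins = Y * (X @ beta + beta0)
assert np.allclose(margins[sv], 1.0, atol=1e-3)
```

Here the inner points $(1,1)$ and $(-1,-1)$ come out as the support vectors, while the farther points get $\alpha_i \approx 0$.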
## Ex: LASSO dual The LASSO problem $$\textstyle \min_{\beta\in\reals^d}\frac12\\|y - X\beta\\|_2^2 + \lambda\\|\beta\\|_1 $$ has no constraints, so its dual function is constant. Consider the equivalent linearly-constrained problem $$ \begin{aligned} &\min\nolimits_{\widehat{y}\in\reals^n,\beta\in\reals^d} &&\textstyle\frac12\\|y-\widehat{y}\\|\_2^2 + \lambda\\|\beta\\|_1 \\\\ &\subjectto &&\widehat{y} = X\beta &&:r\in\reals^n \end{aligned} $$ This is a standard trick to obtain a non-trivial dual function/problem for unconstrained problems. Slater's condition is satisfied because the constraint is affine and the problem is feasible (take any $\beta$ and $\widehat{y} = X\beta$).
## Ex: LASSO dual The dual function is $$ \begin{aligned} g(r) &\textstyle= \min_{\beta,\widehat{y}}\frac12\\|y-\widehat{y}\\|\_2^2 + \lambda\\|\beta\\|_1 + r^\top(\widehat{y} - X\beta) \\\\ &\textstyle= \frac12\\|y\\|_2^2 - \sup\_{\widehat{y}}\\{(y-r)^\top\widehat{y} - \frac12\\|\widehat{y}\\|_2^2\\} \\\\ &\textstyle\quad - \sup_\beta\\{(X^\top r)^\top\beta - \lambda\\|\beta\\|_1\\} \\\\ &\textstyle= \frac12\\|y\\|\_2^2 - \frac12\\|y - r\\|\_2^2 - I\_{\lambda\bB_\infty^d}(X^\top r). \end{aligned} $$ The dual problem is $$ \begin{aligned} &\max\nolimits_{r\in\reals^n} &&\textstyle\frac12\\|y\\|\_2^2 - \frac12\\|y - r\\|_2^2, \\\\ &\subjectto &&\\|X^\top r\\|_\infty \le \lambda. \end{aligned} $$
## Ex: LASSO dual The dual problem is the projection of $y$ onto the pre-image of the unit $\ell_\infty$ norm ball under $\frac1\lambda X^\top$: $$ \begin{aligned} &\min\nolimits_{r\in\reals^n} &&\textstyle\frac12\\|y - r\\|_2^2, \\\\ &\subjectto &&\\|X^\top r\\|_\infty \le \lambda. \end{aligned} $$ Note that the optimal value of this version of the dual problem is **not** equal to the optimal value of the LASSO problem.
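The primal-dual relationship can be checked numerically: solve the primal with proximal gradient (ISTA, a standard method not covered on this slide), set $r = y - X\widehat\beta$, and verify dual feasibility and a vanishing gap. The data below are random, for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 5
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
lam = 1.0

# ISTA: gradient step on the smooth part, then soft-thresholding for the l1 term.
step = 1.0 / np.linalg.norm(X, 2)**2          # 1/L, L = largest eigenvalue of X^T X
beta = np.zeros(d)
for _ in range(5000):
    z = beta - step * X.T @ (X @ beta - y)
    beta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)

r = y - X @ beta                               # optimal dual variable
primal = 0.5 * np.sum((y - X @ beta)**2) + lam * np.abs(beta).sum()
dual = 0.5 * np.sum(y**2) - 0.5 * np.sum((y - r)**2)

assert np.max(np.abs(X.T @ r)) <= lam + 1e-6   # dual feasibility
assert abs(primal - dual) < 1e-6               # strong duality: gap ~ 0
```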
## Ex: $D$-optimal experiment design **Idea:** minimize volume of confidence ellipsoids $$ \begin{aligned} &\min\nolimits_{w\in\reals_+^N} &&\textstyle\log\det((X\diag(w) X^\top)^{-1}) \\\\ &\subjectto &&w_1 + \dots + w_N = n \end{aligned} $$ Consider the equivalent linearly-constrained problem $$ \begin{aligned} &\min\nolimits_{V\in\symm^N,w\in\reals_+^N} &&\textstyle-\log\det V \\\\ &\subjectto &&V = X\diag(w)X^\top &&:\Lambda\in\symm^N \\\\ & &&w_1 + \dots + w_N = n &&:\lambda\in\reals \end{aligned}. $$
## Ex: $D$-optimal experiment design $D$-optimal design Lagrangian: $$ \begin{aligned} &L(V,w,\Lambda,\lambda) \\\\ &\quad\triangleq -\log\det V + \Tr(\Lambda(V - X\diag(w)X^\top)) \\\\ &\qquad+ \lambda(1_N^\top w - n) \\\\ &\textstyle\quad= -\log\det V + \Tr(\Lambda V) - \Tr(\Lambda(\sum_{i=1}^Nw_iX_iX_i^\top))\\\\ &\textstyle\qquad +\lambda\sum_{i=1}^Nw_i - \lambda n \\\\ &\textstyle\quad= -\log\det V + \Tr(\Lambda V) + \sum_{i=1}^Nw_i(\lambda - X_i^\top\Lambda X_i) - \lambda n \end{aligned} $$
## Ex: $D$-optimal experiment design $D$-optimal design dual function: $$ \begin{aligned} g(\Lambda,\lambda) &= \begin{cases}\log\det\Lambda + N - \lambda n & \begin{cases}\Lambda\in\symm_+^N \\\\ \\{X_i^\top\Lambda X_i \le \lambda\\}_{i=1}^N\end{cases}\\\\ -\infty & \text{otherwise}\end{cases} \end{aligned} $$ * $\min_{V\in\symm^N}-\log\det V + \Tr(\Lambda V) = \begin{cases}\log\det\Lambda + N & \Lambda\in\symm_+^N \\\\-\infty & \text{otherwise}\end{cases}$ * $\min_{w\in\reals_+^N}\sum_{i=1}^Nw_i(\lambda - X_i^\top\Lambda X_i) = \begin{cases}0 & \\{X_i^\top\Lambda X_i \le \lambda\\}_{i=1}^N \\\\-\infty & \text{otherwise}\end{cases}$
## Ex: $D$-optimal experiment design $D$-optimal design dual problem: $$ \begin{aligned} &\max\nolimits_{\Lambda\in\symm_+^N,\lambda\in\reals} &&\textstyle\log\det\Lambda + N - \lambda n\\\\ &\subjectto &&\\{X_i^\top\Lambda X_i \le \lambda\\}_{i=1}^N \end{aligned} $$ CoV: $M \triangleq \frac{n}{\lambda}\Lambda$ $$ \begin{aligned} &\max\nolimits_{M\in\symm_+^N,\lambda\in\reals} &&\textstyle\log\det M + N\log\frac{\lambda}{n} + N - \lambda n\\\\ &\subjectto &&\\{X_i^\top MX_i \le n\\}_{i=1}^N \end{aligned} $$ We maximize WRT $\lambda$ (the maximum is at $\lambda = \frac{N}{n}$) to obtain $$ \begin{aligned} &\max\nolimits_{M\in\symm_+^N} &&\textstyle\log\det M + N\log N - 2N\log n\\\\ &\subjectto &&\\{X_i^\top MX_i \le n\\}_{i=1}^N \end{aligned} $$
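For completeness, a common way to actually solve the primal $D$-optimal design problem is a multiplicative-weights scheme (Titterington-style; this algorithm is not from the slides, and the candidate points below are assumptions, with weights normalized to sum to 1). At a fixed point, the KKT/equivalence-theorem condition $X_i^\top M^{-1} X_i \le d$ holds, with equality on the support of $w$:

```python
import numpy as np

# A minimal sketch of the multiplicative algorithm for D-optimal design.
rng = np.random.default_rng(1)
N, d = 30, 3                      # N candidate points in R^d (assumed sizes)
Xc = rng.standard_normal((d, N))  # columns are candidate design points x_i

w = np.full(N, 1.0 / N)           # start from the uniform design
for _ in range(2000):
    M = (Xc * w) @ Xc.T                          # M(w) = sum_i w_i x_i x_i^T
    Minv = np.linalg.inv(M)
    lev = np.einsum("ij,jk,ki->i", Xc.T, Minv, Xc)  # x_i^T M^-1 x_i
    w = w * lev / d               # multiplicative update; preserves sum(w) = 1

# Check the optimality (equivalence-theorem) condition at the final design.
M = (Xc * w) @ Xc.T
lev = np.einsum("ij,jk,ki->i", Xc.T, np.linalg.inv(M), Xc)
assert abs(w.sum() - 1.0) < 1e-8
assert np.all(lev <= d + 1e-2)
```

The update preserves the simplex constraint because $\sum_i w_i\, x_i^\top M^{-1} x_i = \Tr(M^{-1}M) = d$.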