A Variational Perspective on Accelerated Methods in Optimization | Columbia Advanced Machine Learning Seminar

This is a follow-up to my previous post. As I mentioned, Dar lead our discussion about two basically unrelated papers. This post is about the second of the two, “A Variational Perspective on Accelerated Methods in Optimization” by Wibisono et al. [1]. This is a more theoretical paper investigating the nature of accelerated gradient methods and the natural scope for such concepts. Here we’ll introduce and motivate some of the mathematical aspects and physical intuition used in the paper, along with an overview of the main contributions.

Generally, whenever we are able to prove lower bound results about the performance of a class of algorithms to solve a problem, we would like a (practically applicable) method which attains this lower bound. In the case of optimizing a $C^1$ convex function $f$ with $\nabla f$ Lipschitz, we can obtain a lower bound of $1/t^2$ for the rate of convergence (after $t$ steps) of any procedure depending linearly on the gradients of $f$ ; however, vanilla gradient descent only attains a rate of $1/t$ . Nesterov’s accelerated gradient descent, on the other hand, achieves this lower bound rate; much more widely, the phenomenon of acceleration allows us to at least improve convergence rates, and in certain circumstances obtain the optimal rate exactly.

However, the nature of acceleration is not well understood. This paper attempts to place Nesterov’s approach as a methodology for the discretization of a certain class of differential equations. These differential equations are derived as solutions to the Euler-Lagrange problem of minimizing what the authors call the Bregman Lagrangian, a functional depending (partially) on the Bregman divergence (this being the only link to the generative models paper we also discussed).

Nesterov’s accelerated gradient descent

I probably don’t need to remind you of the defining equation for vanilla gradient descent - but here it is anyways: given an initial position $x_1$ and a sequence of step sizes $\{ \gamma_n \}_{n \geq 1}$ , we perform iterative updates of the form

$x_{n+1} = x_n - \gamma_n \nabla f(x_n).$

Nor do I likely need to remind you of where it can go wrong and why we’d like to be able to do better. For now though, we’re only concerned about one aspect - that the convergence rate of gradient descent is $O(1/t)$ after $t$ steps (under the regularity conditions mentioned above in the introduction), whereas the oracle is $O(1/t^2)$ , and so we’d like to try and attain it.

Thankfully, we’re able to do so, via the following algorithm proposed by Nesterov [4]. We begin by defining the following sequences:

$\lambda_0 = 0, \lambda_{s+1} = \frac{ 1 + \sqrt{ 1 + 4 \lambda_s^2}}{2}, \gamma_s = \frac{1 - \lambda_s}{\lambda_{s+1}}.$

Note that $\lambda_{s+1}$ is the positive solution to the quadratic equation $x^2 - x - \lambda_s^2 = 0$ , and that $\gamma_s \geq 0$ . Taking a pair of initializations $(x_1, y_1)$ with $x_1 = y_1$ , we then update $x_n$ as follows (given some arbitary $\epsilon > 0$ ):

$y_{n+1} = x_n - \frac{1}{\epsilon} \nabla f(x_n),$ $x_{n+1} = (1 - \gamma_n) y_{n+1} + \gamma_n y_n.$

In other words, we perform gradient descent steps, but then update $x_n$ as a weighted average from the current and past step. However, why this approach attains the oracle (as compared to why it provides a speed up) isn’t well understood - looking at the algorithm itself doesn’t provide much intuition, and looking at the derivation of its convergence rate isn’t illuminating either. The approach by the authors which we’re about to discuss is one particular way of looking at how this phenomenen can arise.

Optimization in non-Euclidean geometries and the Bregman Lagrangian

Consider the optimization problem of minimizing $f(x)$ over $\mathcal{X} = \mathbb{R}^d$ , where $f: \mathcal{X} \mapsto \mathbb{R}$ is a continuously differentiable convex function, with a unique minimizer $x^* \in \mathcal{X}$ . To consider a non-Euclidean setting, we suppose $\mathcal{X}$ is endowed with a distance generation function $h : \mathcal{X} \mapsto \mathbb{R}$ which is continuous differentiable, convex, and is such that $\| \nabla h(x) \| \to \infty$ as $\| x\| \to \infty$ . This can then be used to define the Bregman divergence

$D_h(y, x) = h(y) - h(x) - \langle \nabla h(x), y - x \rangle$

which is non-negative as $h$ is convex. (Note that this is not a metric, as it is neither necessarily symmetric not satisify the triangle inequality.) When $y$ and $x$ are close to each other, this is approximately equal to the Hessian metric

$\frac{1}{2} \langle y - x, \nabla^2 h(x) (y - x) \rangle.$

The authors then introduce (albeit without much intution) the Bregman Lagrangian

$\mathcal{L}(X, V, t) = e^{\alpha_t + \gamma_t} \left( D_h(X_t + e^{-\alpha_t}V_t, X_t) - e^{\beta_t} f(X_t) \right)$

where $X_t \in \mathcal{X}$ denotes position at time $t$ , $V_t \in \mathbb{R}^d$ the velocity at time $t$ , and times belong to an interval $\mathbb{T}$ of time. Here, the functions $\alpha, \beta, \gamma : \mathbb{T} \to \mathbb{R}$ are all continuously differentiable functions of time. From a physical perspective,

$f$ plays the role of a potential function, whose damping is controlled by $\beta$ ,
$D_h\left( \cdot, \cdot \right)$ plays the role of the kinetic energy (which can be motivated by the approximation to the Hessian metric)
$\alpha$ controls the damping of the velocity term, and $\gamma$ that of the overall Lagrangian.

In such a scenario, if we have kinetic energy $T$ and potential energy $V$ , we would set the Lagrangian to be $\mathcal{L} = T - V$ ; it is clear to see how the Bregman Langrangian is similar in this respect. We would then like to find a path $\{ X_t : t \in \mathbb{T}\}$ which minimizes the action

$\mathcal{J}[X] = \int_{\mathbb{T}} \mathcal{L}(X_t, \dot{X}_t, t ) \, dt$

(where over-dot’s denote differentiation with respect to time) which is a functional on paths. The principle of least action from physics states that the true observed motion of a particle with Langrangian $\mathcal{L}$ corresponds to that which minimizes the action; this is the reason why we are interested in such quantities.

The Euler-Lagrange equation

We can then use the calculus of variations in order to find a curve which minimizes this functional. In order to explain how we can try and do this, we can use the following intutition for when considering the simpler problem of finding the minimizing point of a curve $f(x)$ for $x \in \mathbb{R}$ , for example. Generally, the idea of a stationary point in ordinary calculus is that a function $f$ is stationary at a point $x^*$ when, after perturbing the function slightly, it does not change too much in the sense that

$f(x^* + \epsilon) = f(x^*) + O(\epsilon^2).$

If $f$ is sufficiently differentiable so that for any point $x$ we know that (say via Taylor’s theorem)

$f(x + \epsilon) = f(x) + \epsilon f'(x) + O(\epsilon^2),$

then we know that points for which $f'(x) = 0$ are stationary points. From this, we can then build up critera for when stationary points are local/global minima/maxima and so on.

The same idea is used in the calculus of variations; we aim to find paths $X$ for when we perturb them slightly by a new path, the corresponding action does not change too much. However, there is a lot more freedom in what it means to perturb away from a path than for a simple point; although this seems like it may be a problem, provided we formulate our pertubations correctly, it is actually quite advantageous. Namely, we consider perturbing the path by another $\eta$ so that the end-points of the new path are unchanged, and examine the behaviour as the scale of $\eta$ changes, which we can do simply by perturbing by $\epsilon \eta$ for a scalar $\epsilon$ , and examine what happens as $\epsilon \to 0$ . We now express this idea mathematically.

In the simplest case, we impose boundary conditions on $X$ at the end points of $\mathbb{T}$ , so if $\mathbb{T} = [t_1, t_2]$ , we enforce that $X(t_1) = a$ and $X(t_2) = b$ (say). Given this, we then seek to find paths $X$ for which

$\mathcal{J}[X + \epsilon \eta] = \mathcal{J}[X] + O_{\eta}(\epsilon^2)$

whenever $\eta$ is a (sufficently smooth) path with $\eta(t_1) = \eta(t_2) = 0$ , and $\epsilon$ here acts as a scaling of the pertubation. (The subscript $\eta$ on the big-Oh term denotes that the exact scaling depends on $\eta$ .) As $\mathcal{J}[X + \epsilon \eta]$ is a function of $\epsilon$ (keeping $\eta$ for now fixed), by using Taylor’s theorem, integration by parts and the boundary conditions $\eta(t_1) = \eta(t_2) = 0$ , it is a straightforward exercise to show that

$\mathcal{J}[X + \epsilon \eta] = \mathcal{J}[X] + \epsilon \left( \int_{\mathbb{T}} \left[ \frac{ \partial \mathcal{L}}{\partial X}(X_t, \dot{X}_t, t) - \frac{d}{dt} \left\{ \frac{\partial \mathcal{L}}{\partial V}(X_t, \dot{X}_t, t) \right\} \right] \eta(t) \, dt \right) + O_{\eta}(\epsilon^2).$

As we want the $\epsilon$ term to be equal to $0$ for all “pertubed paths” $\eta$ , it follows that the $[ \cdots ]$ term must be equal to zero, and so stationary paths of the action correspond to solutions of the Euler-Lagrange equation

$\frac{d}{dt} \left\{ \frac{\partial \mathcal{L}}{\partial V}(X_t, \dot{X}_t, t) \right\} = \frac{ \partial \mathcal{L}}{\partial X}(X_t, \dot{X}_t, t).$

As expected from ordinary calculus, a path being stationary is a necessary criterion for it to be a local/global minima/maxima of the action; sufficient conditions can be motivated along a similar line as to how we do so in ordinary calculus.

Back to the Bregman Lagrangian

The Euler-Lagrange equation for the Bregman Lagrangian gives rise to a second order differential equation of the form (under the assumption that the Hessian matrix $\nabla^2 h$ is invertible)

$\ddot{X}_t + \left( e^{\alpha_t} - \alpha_t \right) \dot{X}_t + e^{2\alpha_t + \beta_t} \left[ \nabla^2 h(X_t + e^{-\alpha_t} \dot{X}_t) \right]^{-1} \nabla f(X_t)$ $+ e^{\alpha_t}\left( \dot{\gamma}_t - e^{\alpha}_t \right) \left[ \nabla^2 h(X_t + e^{-\alpha_t} \dot{X}_t) \right]^{-1} \left( \nabla h(X_t + e^{-\alpha_t} \dot{X}_t) - \nabla h(X_t) \right) = 0.$

In order to simplify it slightly, the authors impose the ideal scaling conditions

$\dot{\beta}_t \leq e^{\alpha_t} \qquad \text{ and } \qquad \dot{\gamma}_t = e^{\alpha_t}$

so that the Euler-Lagrange equation simplifies to (after using the second scaling condition)

$\ddot{X}_t + \left( e^{\alpha_t} - \alpha_t \right) \dot{X}_t + e^{2\alpha_t + \beta_t} \left[ \nabla^2 h(X_t + e^{-\alpha_t} \dot{X}_t) \right]^{-1} \nabla f(X_t) = 0$

where the only change is that the last term has been removed.

As the overall plan is to find discretized solutions for this class of ODE’s and analyse their convergence properties, we first want to obtain a convergence rate for exact solutions to the Euler-Lagrange equation. A frequent tool in the analysis of ODE’s and PDE’s are that of energy methods - from a physical perspective, we are interested in when quantities such as energy are conserved - which allow us to analyse the behaviour of the system. The authors then define an energy functional of the form

$\mathcal{E}_t := D_h\left( x^*, X_t + e^{-\alpha_t}\dot{X}_t \right) + e^{\beta_t} \left( f(X_t) - f(x^*) \right).$

Again a physical perspective can be used to motivate this. As mentioned before, if a system has kinetic energy $T$ and potential energy $V$ , we let the Lagrangian be of the form $\mathcal{L} = T - V$ . The corresponding Hamiltonian, corresponding to the total energy of the system, would be of the form $\mathcal{H} = T + V$ ; the above energy functional therefore can be thought of corresponding to the system without the extra damping $e^{\alpha_t + \gamma_t}$ in the original Langrangian. However, it is worth noting that here $\mathcal{E}_t$ does not correspond to the true Hamiltonian, which is given by

$\mathcal{H}(X, P, t) = e^{\alpha_t + \gamma_t} \left( D_{h^*}( \nabla h(X) + e^{-\gamma_t}P, \nabla h(X) ) + e^{\beta_t} f(X) \right),$

where $h^*$ is the convex dual of $h$ and $P$ now corresponds to the momentum. Anyways, under the ideal scaling conditions, it is then straightforward to show that if $X_t$ is a solution to the Euler-Lagrange equation, $\mathcal{E}_t$ is non-increasing in time (this is where the first scaling condition is used), and so it follows that

$f(X_t) - f(x^*) \leq C e^{-\beta_t}$

for some constant $C$ . The authors then show that the family of Lagrangians is closed under time dilations (that is, upon time changes $t \mapsto \tau(t)$ ) after suitable transformation of $(\alpha, \beta, \gamma)$ .

Discretization schemes

Letting $p > 0$ be a parameter, if we choose the choice of parameters

$\alpha_t = \log p - \log t$ $\beta_t = p \log t + \log C$ $\gamma_t = p \log t$

where $C > 0$ is a constant, then the ideal scaling conditions are satisfied, and the corresponding Euler-Lagrange equation

$\ddot{X}_t + \frac{p +1}{t} \dot{X}_t + C p^2 t^{p-2} \left[ \nabla^2 h\left( X_t + \frac{t}{p} \dot{X}_t \right) \right]^{-1} \nabla f(X_t) = 0$

corresponds to the continuous-time limit of Nesterov’s accelerated mirror descent (when $p = 2$ ) and that of Nesterov’s accelerated cubic-regularized Newton’s method (when $p = 3$ ).

The idea is to now discretize the above ODE in a way so that the discrete-time algorithm attains the same convergence rate as that in the continuous time case. The authors note that a naive discretization appears to be unstable (via empirical observation), and ultimately show that, provided $f \in \, \mathcal{C}^p$ and some other smoothness constraints on $f$ and $h$ hold, and provided $C$ is not sufficiently large, the following discretization sceheme has an $O(1/(\epsilon k^p))$ convergence rate:

$x_{k+1} = \frac{p}{k+p} z_k + \frac{k}{k+p} y_k,$ $y_k = \mathrm{argmin}_y \left\{ f_{p-1}(y; x_k) + \frac{N}{\epsilon p} \| y - x_k \|^p \right\},$ $z_k = \mathrm{argmin}_z \left\{ C p k^{(p-1)} \langle \nabla f(y_k), z \rangle + \frac{1}{\epsilon} D_h(z, z_{k-1}) \right\},$

where $k^{(p-1)} = k(k+1)\cdots(k+p-2)$ is the rising factorial, and $f_{p-1}(y;x)$ is the $(p-1)$ -st order Taylor approximation of $f$ centered at $x$ . One can show that the $p=2$ case will recover Nesterov’s original method in some form. However, while this then gives an algorithm which works for all $p \geq 2$ , from a practical perspective it is unclear on whether it is worthwhile considering considering higher orders of $p$ than 2 (due to the need to calculate $f_{p-1}(y;x_k)$ at each iteration).

Conclusion

By considering discretization schemes of the Euler-Lagrange equations corresponding to a family of Lagrangians, the authors are able to recover acceleration phenomenen for convergence using techniques motivated by Nesterov’s original arguments. However, there are still some interesting observations of the Bregman Lagrangian to be explored further; for example, whether the corresponding Hamiltonian can be studied further and whether the gauge invariance of the Euler-Lagrange equation can be used to give more natural interpretations of the role of the function $h$ . In addition, Dar also mentioned that there is a similarity in the comparison between vanilla and accelerated gradient descent, and that of Adagrad [2] and Adam [3]; currently there is no known link to a family of ODE’s for these algorithms, and so it would be interesting to see if there is a similar approach could be used here.

References

[1] Wibisono, Andre, Ashia C. Wilson, and Michael I. Jordan. “A Variational Perspective on Accelerated Methods in Optimization.” arXiv preprint arXiv:1603.04245 (2016). link

[2] Duchi, John, Elad Hazan, and Yoram Singer. “Adaptive subgradient methods for online learning and stochastic optimization.” Journal of Machine Learning Research 12. Jul (2011): 2121-2159. link

[3] D.P. Kingma, J. Ba. “Adam: A Method for Stochastic Optimization”. The International Conference on Learning Representations (ICLR), San Diego, 2015. link

[4] Y. Nesterov. “A method of solving a convex programming problem with convergence rate $O(1/k^2)$ ”. Soviet Mathematics Doklady, 27(2):372–376, 1983.