Keyon Vafa
October 27, 2016
Two papers this week proved convergence results for optimizing non-convex loss functions using stochastic gradients.
In The Landscape of Empirical Risk for Non-convex Losses by Song Mei, Yu Bai, and Andrea Montanari (2016) [1], the authors
show that although the empirical risk under squared loss is non-convex for linear classifiers, the
landscape becomes well behaved once the sample size is large enough: the risk has a unique local
minimum, which is also the global minimum, and gradient descent converges to it exponentially fast.
In Deep Learning without Poor Local Minima by Kawaguchi (2016) [2], the author shows, by reducing
deep nonlinear networks to deep linear networks, that (among other things) every local minimum is a
global minimum and every critical point that is not a minimum is a saddle point.
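To build intuition for the first result, here is a small self-contained sketch (our own illustration, not the authors' code; the sample size, dimension, noise level, and step size are arbitrary choices). It runs plain stochastic gradient descent on the squared loss of a sigmoid-link linear model, whose empirical risk is non-convex in the weights; for a sample size this large relative to the dimension, the paper's results suggest the landscape has a single benign minimum for SGD to find.

import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 5                                       # sample size and dimension (arbitrary choices)
w_true = rng.normal(size=d)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
X = rng.normal(size=(n, d))
y = sigmoid(X @ w_true) + 0.1 * rng.normal(size=n)   # noisy responses from a sigmoid-link model

def empirical_risk(w):
    # mean squared error of the sigmoid-link model; non-convex in w
    return np.mean((y - sigmoid(X @ w)) ** 2)

w = np.zeros(d)
step = 0.5
for t in range(50000):
    i = rng.integers(n)                              # one example chosen at random
    p = sigmoid(X[i] @ w)
    grad = -2.0 * (y[i] - p) * p * (1.0 - p) * X[i]  # stochastic gradient of the squared loss
    w -= step * grad

print("final empirical risk:", empirical_risk(w))
print("distance to true parameter:", np.linalg.norm(w - w_true))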
Ghazal Fazelnia
October 24, 2016
This week we read Scalable Exact Algorithm for Bayesian Inference for Big Data by
Pollock et al. (2016) [1].
They introduce a Monte Carlo algorithm based on a Markov process whose quasi-stationary
distribution coincides with the distribution of interest, and they establish theoretical guarantees
that the algorithm recovers the correct limiting target distribution.
Moreover, they argue that the
methodology is practical for big data because it relies on a subsampling technique whose
iterative cost is sub-linear in the size of the data set.
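The subsampling point is easy to see in isolation. The toy sketch below is our own illustration of the generic idea, not the authors' algorithm: a sum over the full data set (here the score of a Gaussian model, a hypothetical stand-in) is estimated unbiasedly from a small random subset, so each estimate costs O(m) for a subsample of size m rather than O(n) for the full data.

import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
data = rng.normal(loc=2.0, scale=1.0, size=n)    # synthetic data set

def full_score(theta):
    # gradient of the Gaussian log-likelihood in theta; touches every point, O(n)
    return np.sum(data - theta)

def subsampled_score(theta, m=100):
    # unbiased estimate built from m randomly chosen points; O(m) per call
    idx = rng.integers(n, size=m)
    return n * np.mean(data[idx] - theta)

theta = 0.0
print("full-data score:     ", full_score(theta))
print("subsampled estimates:", [round(subsampled_score(theta)) for _ in range(3)])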
We read two papers last Thursday: the “DRAW” paper by Gregor et al. (2015) and the “Show, Attend and Tell” paper by Xu et al. (2015). Both embed an attention model in a deep neural network (DNN). The first paper generates images that attempt to match the distribution of the input images, while the second generates captions for images. With the attention model, both produce their output over multiple time steps, focusing on one region of the image at each step. To implement this sequential generative process, both models use long short-term memory units (LSTMs) as one layer of the network.
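As a rough picture of how such an attention layer fits together, here is a sketch of a single soft-attention step followed by one LSTM update. It is our own simplified illustration with made-up dimensions, not the exact architecture of either paper (for instance, the LSTM input here is only the context vector and the previous hidden state, whereas the papers condition on more).

import numpy as np

rng = np.random.default_rng(2)
L, D, H, K = 49, 512, 256, 64          # regions, feature size, hidden size, attention size (assumed)
features = rng.normal(size=(L, D))     # annotation vectors, e.g. a 7x7 grid of conv features
h_prev = 0.1 * rng.normal(size=H)      # previous LSTM hidden state
c_prev = np.zeros(H)                   # previous LSTM cell state

# parameters (randomly initialized here; learned jointly in a real model)
W_f = rng.normal(scale=0.01, size=(K, D))        # projects a region feature
W_h = rng.normal(scale=0.01, size=(K, H))        # projects the hidden state
v = rng.normal(scale=0.01, size=K)               # turns the projection into a scalar score
W_lstm = rng.normal(scale=0.01, size=(4 * H, D + H))
b_lstm = np.zeros(4 * H)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# 1. attend: score each region given the previous hidden state, then normalize
scores = np.tanh(features @ W_f.T + W_h @ h_prev) @ v    # one score per region
alpha = softmax(scores)                                  # attention weights over the L regions
context = alpha @ features                               # attention-weighted region feature, shape (D,)

# 2. one LSTM step driven by the attended context
gates = W_lstm @ np.concatenate([context, h_prev]) + b_lstm
i_g, f_g, o_g, g_g = np.split(gates, 4)
c = sigmoid(f_g) * c_prev + sigmoid(i_g) * np.tanh(g_g)
h = sigmoid(o_g) * np.tanh(c)

print("attention weights sum to", alpha.sum())
print("new hidden state shape:", h.shape)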