A few favorite papers of 2017

January 9, 2018

This isn't an exhaustive list, and I will inevitably forget some papers. I'll keep updating as I remember them, and will probably expand some of the background/contribution sections as I have time, so that they're more accessible.

Background: Language models and NLP tasks almost always use a softmax to compute a distribution over the vocabulary, and usually this is computed as $\sigma(\mathbf{W}\mathbf{h})$, where $\mathbf{h}$ is a $d$-dimensional context vector from a previous layer and $\mathbf{W} \in \mathbb{R}^{M \times d}$ is a word embedding matrix, with $M$ the vocabulary size. In language modeling, the model might generate a sequence of contexts $\mathbf{h}_1, \ldots, \mathbf{h}_N$ [1]. Stacking the probability vector for each context gives a probability matrix $\mathbf{A} \in \mathbb{R}^{N \times M}$ generated by the model [2].
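
To make the setup concrete, here's a minimal numpy sketch of this parametrization (the dimensions are placeholders I've picked for illustration, not numbers from the paper). Every row of the logit matrix lives in the same $d$-dimensional space, so its rank can't exceed $d$:

```python
import numpy as np

d, M, N = 30, 1000, 200            # context dim, vocab size, number of contexts (made up)
W = np.random.randn(M, d)          # word embedding matrix
H = np.random.randn(N, d)          # one context vector h_t per row

logits = H @ W.T                   # (N, M): a score for every (context, word) pair
A = np.exp(logits)
A /= A.sum(axis=1, keepdims=True)  # row-wise softmax: row t is P(word | h_t)

print(np.linalg.matrix_rank(logits))  # at most d = 30, even though min(N, M) = 200
```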

Contribution: The authors show that this probability matrix is inherently limited in rank by the dimension of the embedding. This is bad, because there is a whole set of probability distributions that we can't even represent using our model! So no matter how good the architecture before the softmax is (all sorts of fancy dropouts / weight drops / etc.), we are limited by this "softmax bottleneck". Note that $d \approx 300$, while the maximum rank of this matrix is $M$, which is usually 2 orders of magnitude larger. The (simple) solution? Just blend together the softmaxes from $k$ different embeddings. They train this new model, reducing $d$ to keep the number of parameters comparable, and get a ~4 PPL reduction from the previous SOTA on PTB + WikiText.
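
Here's a rough PyTorch sketch of the mixture-of-softmaxes idea, just to show the shape of the computation; the sizes, layer names, and the tanh nonlinearity are my own choices, not necessarily the authors' architecture:

```python
import torch
import torch.nn.functional as F

batch, d, M, k = 32, 280, 10_000, 5           # all sizes made up for illustration
h = torch.randn(batch, d)                     # context vectors from the RNN
W = torch.randn(M, d)                         # shared word embeddings
proj = torch.nn.Linear(d, k * d)              # one projected context per mixture component
prior = torch.nn.Linear(d, k)                 # mixture weights pi(h)

hk = torch.tanh(proj(h)).view(batch, k, d)    # (batch, k, d)
pi = F.softmax(prior(h), dim=-1)              # (batch, k)
comps = F.softmax(hk @ W.t(), dim=-1)         # (batch, k, M): k separate softmaxes
probs = (pi.unsqueeze(-1) * comps).sum(dim=1) # convex combination: (batch, M)
```

Because `probs` is a mixture, its log no longer factors through a single low-rank logit matrix, which is what breaks the rank bound.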

Why I like this: It's so simple! And yet somewhat surprising at the same time. It's the kind of thing I worried about briefly (in an abstract way, without the framework they've provided) when I first learned about softmaxes, but then kind of brushed away. And yet they've shown that just adding a simple correction makes models better. It's transferable to translation / summarization models, and I think whenever I think about a probabilistic model from now on, I'll think briefly about what kinds of probability distributions it can even represent.

Background: This paper falls under the umbrella of 'interpretability', which is a heated subject in the research world at the moment.

Contribution: This paper starts from earlier work (Cook & Weisberg, 1982) on how the parameters of a learned model change slightly (i.e., the gradient of the parameters) when we change the weighting of a datapoint. Via the chain rule, it derives the gradient of the test loss with respect to upweighting/downweighting a training point. It then develops (somewhat) fast algorithms for computing this quantity, and applies them to some (small) neural networks. Essentially, it gives a method for 'walking in the space of images' to find a training image that can make your model screw up. Think: insert a single crafted datapoint into the online MNIST files, and suddenly your model is horrible. (Note that Yann LeCun doesn't use https on that website.)
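
Written out (in the notation I remember from the paper's use of the Cook & Weisberg result), the central quantity is the influence of upweighting a training point $z$ on the loss at a test point $z_{\text{test}}$:

$$\mathcal{I}_{\text{up,loss}}(z, z_{\text{test}}) = -\nabla_\theta L(z_{\text{test}}, \hat\theta)^{\top} \, H_{\hat\theta}^{-1} \, \nabla_\theta L(z, \hat\theta), \qquad H_{\hat\theta} = \frac{1}{n}\sum_{i=1}^{n} \nabla^2_\theta L(z_i, \hat\theta),$$

so the expensive part is the inverse-Hessian-vector product.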

Why I like this: I'm a big fan of this paper because it uses some fundamental research on influence functions in a new way. Also, the framework it proposes, that of adversarial training images, is definitely interesting. There are some caveats, of course, with work like this: (non)convexity, complicated models (for which the Hessian is not easy!), and efficiency; the authors treat those issues head-on rather than brushing them aside. That being said, it's unclear whether it'll work without some more algorithmic refinement (a problem I tackled a bit this semester but didn't make much progress on). I also have an implementation of this work in PyTorch that I hope to open source soon, since it's a little nontrivial and probably of use. This is now possible since PyTorch has some support for automatic HVP (Hessian-vector product) calculations.
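
For reference, here's roughly what the HVP primitive looks like in PyTorch (a minimal sketch, not my actual implementation; the model and data are just placeholders):

```python
import torch

def hvp(loss, params, vec):
    """Hessian-vector product: returns H v, where H is the Hessian of `loss` w.r.t. `params`."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    dot = sum((g * v).sum() for g, v in zip(grads, vec))
    return torch.autograd.grad(dot, params)

# toy usage
model = torch.nn.Linear(10, 1)
x, y = torch.randn(64, 10), torch.randn(64, 1)
loss = torch.nn.functional.mse_loss(model(x), y)
v = [torch.randn_like(p) for p in model.parameters()]
Hv = hvp(loss, list(model.parameters()), v)
```

Chained with conjugate gradients or a stochastic approximation, this is roughly what makes the inverse-Hessian-vector products in the influence computation tractable.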

I'm a bit biased because I took Pierre's excellent inference class this spring, but this paper was definitely interesting to me.

Background: Because of burn-in, MCMC samples are always biased. The bias can be reduced by extending burn-in, of course, so this might not seem like a practical concern, but burn-in can actually be a significant part of the cost (imagine you have to take a few samples from each of a large set of distributions, like $p(x_i | z_i)$ after you've already sampled each $z_i$ in an ad-hoc way).

Contribution: This paper gives a method, based on coupled chains, for completely eliminating the bias in MCMC estimates.
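
If I remember the construction correctly, the idea is to run two coupled chains $(X_t)$ and $(Y_t)$ with the same marginal law, offset by one step, that eventually meet exactly at a random time $\tau$; an estimator along the lines of

$$H_k(X, Y) = h(X_k) + \sum_{t=k+1}^{\tau - 1} \big( h(X_t) - h(Y_{t-1}) \big), \qquad \tau = \inf\{\, t \ge 1 : X_t = Y_{t-1} \,\},$$

is then unbiased for $\mathbb{E}_{\pi}[h(X)]$ for any fixed $k$, and you can average independent copies of it.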

It's very rare to see a work both push SOTA on a benchmark and be faster at the same time.

I really mean this section to be about these papers and all of the follow-ups, which I think are somewhat natural extensions of the same line of work. See REBAR, which makes this estimator unbiased by using it as a control variate for the REINFORCE estimator, and RELAX, which takes things one step further by extending it to the RL paradigm, where you can only sample one action rather than a weighted mix of actions.

Background: Variational inference is a powerful technique (see Jeffrey's blog post here), and the research community has made a lot of progress in SVI recently. However, for discrete latent variables, the best we can do is black-box variational inference, which is interesting but very hard to implement and tune [3].
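
For context, BBVI is built on the score-function (REINFORCE) identity,

$$\nabla_\phi \, \mathbb{E}_{q_\phi(z)}\big[ f(z) \big] = \mathbb{E}_{q_\phi(z)}\big[ f(z)\, \nabla_\phi \log q_\phi(z) \big],$$

which works for discrete $z$ but has high variance, hence all the machinery of control variates and Rao-Blackwellization.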

Contribution: These papers give a solution by introducing a relaxation of discrete distributions to distributions over the simplex, and also give them a reparametrization gradient. Actually, adding Gumbel noise is a common trick in the economics and ranking literature (see: McFadden 1974, Mattson et al. 2011), but the real innovation here is realizing that taking a softmax over logits + Gumbel noise is reparametrizable, and useful as an approximation to the real argmax (which would be the Gumbel-max trick). The papers themselves have good explanations, and Eric Jang has a nice blog post about how to do it in practice. I think this connection is worth exploring, so I'll go over it in a later blog post.
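
The whole trick fits in a few lines; here's a minimal PyTorch sketch (the temperature is a hyperparameter I'm picking arbitrarily here):

```python
import torch
import torch.nn.functional as F

def gumbel_softmax_sample(logits, temperature=0.5):
    u = torch.rand_like(logits)
    gumbel = -torch.log(-torch.log(u + 1e-20) + 1e-20)   # Gumbel(0, 1) noise
    return F.softmax((logits + gumbel) / temperature, dim=-1)

logits = torch.randn(4, 10, requires_grad=True)  # 4 samples over 10 categories
y = gumbel_softmax_sample(logits)                # points on the simplex, near the vertices
y.sum().backward()                               # reparametrized: gradients flow back to logits
```

As the temperature goes to zero the samples concentrate on one-hot vectors (recovering the Gumbel-max trick), at the cost of noisier gradients.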

Why I like this: It's really useful! Especially with the two follow-up works, this is actually a very useful alternative to BBVI (just see the results from our paper), and it makes it much easier to build and train generative models with discrete latent states. For about a month after reading this paper, whenever I heard about a generative process I would think: "can this be modeled using a VAE?". Also, I think the papers are well written.

I'm not sure if this paper counts as "this year", but it was in ICLR 2017 so I'll count it.

Background: It's difficult to evaluate some generative models. Suppose that there's a latent vector $\mathbf{z}$, and we generate $\mathbf{x}$ stepwise, so that $\mathbf{z} \sim p(\mathbf{z})$ and then $\mathbf{x} \sim p(\mathbf{x} | \mathbf{z})$. Then for a held-out element $\mathbf{x}_{\operatorname{test}}$, we have to integrate out all values of $\mathbf{z}$! This can be expensive, since integration usually means Monte Carlo sampling. We have some alternatives, like the ELBO or IWAE bounds for VAEs, but even then these are only lower bounds, and we're not confident that the ELBO is even asymptotically consistent (the IWAE metric is, but we don't know how fast it converges). For GANs we only seem to have a kernel density estimator (KDE).
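
For reference, the IWAE bound with $K$ samples is

$$\log p(\mathbf{x}) \;\ge\; \mathbb{E}_{\mathbf{z}_1, \ldots, \mathbf{z}_K \sim q(\mathbf{z} \mid \mathbf{x})} \left[ \log \frac{1}{K} \sum_{k=1}^{K} \frac{p(\mathbf{x}, \mathbf{z}_k)}{q(\mathbf{z}_k \mid \mathbf{x})} \right],$$

which recovers the ELBO at $K = 1$ and tightens as $K$ grows, but it's still a lower bound, so on its own it can't tell you how far off you are.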

Contribution: First, the authors use bidirectional Monte Carlo (BDMC) to upper bound the log marginal, and show that annealed importance sampling (AIS) is accurate enough to be used to evaluate the quality of other estimators. Then, they use it to evaluate kernel density estimators and IWAE on a number of datasets, and also to compare VAEs and GANs (!). They essentially show that IWAE is not quite good enough yet, and that KDE is very bad in high-dimensional spaces, as expected (it's not even consistent). Also, they show that VAEs do significantly better on this evaluation than GANs (since they actually fit to a log-marginal objective).
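
The BDMC sandwich, roughly as I understand it: a forward AIS weight $\hat{w}$ satisfies $\mathbb{E}[\hat{w}] = p(\mathbf{x})$, so by Jensen's inequality $\mathbb{E}[\log \hat{w}] \le \log p(\mathbf{x})$; running AIS in reverse from an exact posterior sample (which you only have for simulated data) gives a weight $\hat{w}_{\text{rev}}$ with $\mathbb{E}[\hat{w}_{\text{rev}}] = 1 / p(\mathbf{x})$, so

$$\mathbb{E}[\log \hat{w}] \;\le\; \log p(\mathbf{x}) \;\le\; -\,\mathbb{E}[\log \hat{w}_{\text{rev}}],$$

and the gap between the two bounds tells you how much to trust the estimates.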

Why I like this: It ties together the promises of a lot of other papers, and has good experimental results, which are cleanly interpreted. I don't think it's foundational, but it gives good motivation for trying to come up with good evaluation metrics for GANs, and lets people know that their concerns about "lower bounds not being good enough" are warranted.

Frequentist Consistency of Variational Bayes

Background: Variational Bayes / variational inference is well established as a fast alternative to Markov chain Monte Carlo (MCMC), but we have few theoretical results about it. In particular, people are concerned that, even setting aside the fact that the variational posterior might be misspecified, we don't know whether in the large-data limit the variational posterior is centered at the 'best possible estimate' of the latent variables (in some sense) and is asymptotically normal.

Contribution: Consider just the case of estimating the posterior distribution over $\theta$, the model's parameters (so, shared global latent variables) [4]. We're being pretty Bayesian here, so we have a variational posterior $q(\theta)$. Given $\theta$, we can pick the best possible $q(z)$ (in the sense of maximizing the ELBO), and then pick the best possible $\theta$ in this way; call it the variational frequentist estimate (VFE). Then, assuming that this VFE is consistent, the paper shows (1) that the VB posterior $q^*(\theta)$ converges to the member of the variational family that minimizes the KL divergence to a normal centered at the VFE / true parameter, with some variance, and (2) that it is consistent. And then they do applications!
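
Written out very roughly (my notation, glossing over the technical conditions), result (1) says something like

$$\Big\| \, q^*(\theta) \;-\; \arg\min_{q \in \mathcal{Q}} \mathrm{KL}\!\big( q(\theta) \,\|\, \mathcal{N}(\theta; \hat{\theta}_{\mathrm{VFE}}, V_n) \big) \, \Big\|_{\mathrm{TV}} \;\to\; 0$$

in probability, i.e., a variational analogue of the Bernstein-von Mises theorem.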

Why I liked it: While it doesn't attempt to tackle all of variational inference in one sitting, it is well written, and the style of proof mirrors how people prove the Bernstein-von Mises theorem, so I assume it will have a similar level of impact (though I have a bad sense for this). Note that there are liberal technical conditions throughout, but they're usually not egregious (see: Robert Nickl's lecture notes on the BVM here, which give some insight). Also, I appreciate these kinds of consistency results because they are difficult, but they let us be confident that in the long run the work on variational inference will pay off. Mostly, I like proofs :)

footnotes

[1]

The paper uses $N$ to denote the number of all possible contexts in a language, making some assumptions about finiteness. I'm simplifying a bit here and talking about the contexts in a single sentence, which is a smaller set. The implications are the same, I think.

[2]

Note that this idea of a probability matrix assumes that we're using teacher forcing, which is the usual practice in language modeling. In other words, the distribution over word 2 depends only on the context vector, which depends only on the context vector from the previous state and the correct word 1, not on the word 1 we actually predicted. This isn't a hack - it's the right thing to do when trying to estimate the true log marginal.

[3]

Does anyone know of an open-source implementation of BBVI (i.e., with control variates and Rao-Blackwellization)? Jeffrey and I worked on this, and we think it's very, very expensive to compute the 'correct' control variate recommended in their paper, given how PyTorch and TensorFlow are implemented, but we might be missing a detail.