Last week we read two new papers on Approximate Bayesian Computation (ABC), a method of approximate Bayesian inference for models with intractable likelihoods. These papers explore how stochastic gradients of the ABC log likelihood can be brought to bear on these challenging problems.

The ABC’s of Approximate Bayesian Computation

Consider a complex simulator, like a weather forecasting algorithm, a coalescent model of population genetics, a nonlinear stochastic differential equation, or a graphics rendering engine. These simulators aim to model some natural process or phenomenon, and while they inevitably fall short, they are still invaluable resources. The simulator takes in a set of parameters, $\theta$, and produces a potentially stochastic output, $x \sim p(x \mid \theta)$. Given a simulator and a set of real-world observations, $y$, our goal is to compute the posterior distribution over parameters, using our simulator as a model:

$$p(\theta \mid y) \propto p(y \mid \theta) \, p(\theta).$$
Unfortunately, the likelihood $p(y \mid \theta)$ is intractable: we can sample from this distribution by running the simulator, but if the simulator is a black box, we can't compute the likelihood of a given observation!
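To make this concrete, here is a minimal Python sketch of the interface such a simulator exposes (the names and internals are hypothetical stand-ins): we can draw samples given parameters, but no density function is available.

```python
import numpy as np

def simulate(theta, rng):
    """Hypothetical black-box simulator: draw one dataset given parameters.

    Internally it may compose many stochastic steps (a weather model, a
    coalescent, an SDE solver, ...). From the outside we can only sample;
    there is no tractable density p(x | theta) to evaluate.
    """
    z = rng.standard_normal(100)                # latent randomness
    return np.sinh(theta[0] + theta[1] * z)     # nonlinear push-forward of noise

rng = np.random.default_rng(0)
x = simulate(np.array([0.5, 1.2]), rng)         # sampling is easy; density is not
```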

ABC

Fortunately, we still have a few tricks up our sleeve. For example, we could imagine rejection sampling by proposing $\theta \sim p(\theta)$, simulating $x \sim p(x \mid \theta)$, and accepting if $x = y$. Of course, in practice the probability of exactly generating the data may be zero. Roughly speaking, the idea behind ABC is to weaken this constraint and accept if the simulated data is “close” to the observed data, as measured by some kernel, such as,

$$K_\epsilon(y, x) = \mathcal{N}(y \mid x, \epsilon^2 I).$$
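For instance, a minimal rejection-ABC sketch under this Gaussian kernel (continuing the hypothetical `simulate` above) accepts each proposal with probability $K_\epsilon(y, x) / K_\epsilon(y, y)$:

```python
def rejection_abc(y, sample_prior, simulate, eps, n_accept, rng):
    """Rejection sampling from the ABC posterior with a Gaussian kernel.

    The acceptance probability exp(-||x - y||^2 / (2 eps^2)) lies in [0, 1]
    and approaches the exact-match rule x == y as eps -> 0.
    """
    samples = []
    while len(samples) < n_accept:
        theta = sample_prior(rng)               # theta ~ p(theta)
        x = simulate(theta, rng)                # x ~ p(x | theta)
        accept_prob = np.exp(-np.sum((x - y) ** 2) / (2 * eps ** 2))
        if rng.uniform() < accept_prob:
            samples.append(theta)
    return np.array(samples)
```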
While this rejection sampling approach may work in low dimensions, it will be horribly inefficient for non-trivial problems. Instead, we will use Markov chain Monte Carlo algorithms to collect samples from the posterior, using the kernel in place of the intractable likelihood. Specifically, we will approximate it with the ABC likelihood,

$$p_\epsilon(y \mid \theta) = \int K_\epsilon(y, x) \, p(x \mid \theta) \, dx,$$

which we approximate using Monte Carlo,

$$\hat{p}_\epsilon(y \mid \theta) = \frac{1}{S} \sum_{s=1}^S K_\epsilon(y, x^{(s)}), \qquad x^{(s)} \sim p(x \mid \theta).$$
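In code, this estimator is just an average of kernel evaluations at simulated datasets (a sketch, reusing the hypothetical `simulate` and Gaussian kernel from above):

```python
def abc_likelihood_hat(theta, y, simulate, eps, S, rng):
    """Unbiased Monte Carlo estimate of the ABC likelihood p_eps(y | theta)."""
    total = 0.0
    for _ in range(S):
        x = simulate(theta, rng)                # x ~ p(x | theta)
        # Gaussian kernel K_eps(y, x), up to a normalizing constant
        # (which cancels in Metropolis-Hastings ratios).
        total += np.exp(-np.sum((x - y) ** 2) / (2 * eps ** 2))
    return total / S
```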
Since this is an unbiased estimate of the ABC likelihood, we can plug it into a Metropolis-Hastings algorithm and appeal to the theory of pseudomarginal MCMC. With proposal distribution $q(\theta' \mid \theta)$, we obtain the following acceptance probability,

$$a = \min\left\{ 1, \; \frac{\hat{p}_\epsilon(y \mid \theta') \, p(\theta') \, q(\theta \mid \theta')}{\hat{p}_\epsilon(y \mid \theta) \, p(\theta) \, q(\theta' \mid \theta)} \right\}.$$
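Putting the pieces together, here is a sketch of the resulting ABC-MCMC loop, with a symmetric random-walk proposal so the $q$ terms cancel (again, all names are hypothetical):

```python
def abc_mcmc(theta0, y, simulate, log_prior, eps, S, n_iters, step, rng):
    """Pseudomarginal Metropolis-Hastings targeting the ABC posterior.

    The likelihood estimate for the current state is carried forward rather
    than recomputed; this is what makes the chain pseudomarginal.
    """
    theta = theta0
    p_hat = abc_likelihood_hat(theta, y, simulate, eps, S, rng)
    chain = []
    for _ in range(n_iters):
        theta_prop = theta + step * rng.standard_normal(theta.shape)
        p_hat_prop = abc_likelihood_hat(theta_prop, y, simulate, eps, S, rng)
        log_ratio = (np.log(p_hat_prop + 1e-300) + log_prior(theta_prop)
                     - np.log(p_hat + 1e-300) - log_prior(theta))  # guard log(0)
        if np.log(rng.uniform()) < log_ratio:
            theta, p_hat = theta_prop, p_hat_prop
        chain.append(theta)
    return np.array(chain)
```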
For a thorough yet readily accessible discussion of pseudomarginal methods and their relation to ABC, please see the excellent set of notes Christian wrote for this blog.

Finally, we note that the simple Monte Carlo estimate is not the only way of defining the ABC likelihood. As an alternative, Wood (2010) [3] defined the synthetic likelihood based on a conditionally Gaussian model,

$$\hat{p}_{\mathrm{SL}}(y \mid \theta) = \mathcal{N}\left(y \mid \hat{\mu}_\theta, \hat{\Sigma}_\theta\right),$$

where $\hat{\mu}_\theta$ and $\hat{\Sigma}_\theta$ are the sample mean and covariance of the simulated outputs $x^{(1)}, \ldots, x^{(S)} \sim p(x \mid \theta)$.
While this conditionally Gaussian assumption may be a gross simplification of the true likelihood, this Gaussian form can have advantageous properties, as we will see below.
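A sketch of this synthetic log likelihood (the jitter term is an assumption added for numerical stability when $S$ is small relative to the output dimension):

```python
from scipy.stats import multivariate_normal

def synthetic_log_likelihood(theta, y, simulate, S, rng):
    """Wood-style synthetic likelihood: fit a Gaussian to simulated outputs."""
    xs = np.stack([simulate(theta, rng) for _ in range(S)])
    mu = xs.mean(axis=0)
    # rowvar=False: rows are simulations, columns are output dimensions.
    cov = np.cov(xs, rowvar=False) + 1e-6 * np.eye(xs.shape[1])
    return multivariate_normal.logpdf(y, mean=mu, cov=cov)
```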

It is important to note that these methods target an approximate posterior distribution, $p_\epsilon(\theta \mid y)$, which is not equal to the true posterior for any nonzero $\epsilon$. Moreover, for high dimensional problems, the kernel is typically not applied directly to the observations, but rather to hand-chosen statistics of the observations, $s(y)$, and the simulations, $s(x)$. If these are sufficient statistics then we have some further guarantees on the asymptotic correctness of the algorithm, but in general this will not be the case.

Hamiltonian ABC

Metropolis-Hastings is a great start, but for high dimensional parameters, it may not do the job. For such problems, state-of-the-art approaches rely on gradients of the log likelihood to guide the exploration of the posterior distribution. For example, Hamiltonian Monte Carlo (HMC) and its recently proposed stochastic variants have shown great promise. Last week’s reading by Meeds, Leenders, and Welling (2015) [1] applied these techniques to the domain of ABC.

The critical issue is how to approximate gradients of the log likelihood given only the ability to sample from the model. Meeds et al. (2015) suggest a simultaneous perturbation stochastic approximation (SPSA), which boils down to a finite difference approximation that requires fewer simulations than the standard approach. Moreover, they suggest using persistent random seeds in their simulation algorithm, similar to some applications of dependent random streams in MCMC, in order to achieve lower variance gradient estimates. Finally, they compare the standard ABC likelihood and the synthetic likelihood and find that, in some cases, the synthetic likelihood affords lower variance gradient estimates at the cost of some bias.
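To see how SPSA and persistent seeds fit together, here is a sketch (with hypothetical names; `log_lik_hat` stands in for either the ABC or synthetic log likelihood above, taking a parameter vector and a random generator):

```python
def spsa_gradient(theta, log_lik_hat, c, seed):
    """SPSA estimate of the full gradient from just two simulations.

    Standard finite differences need 2 * dim evaluations per gradient;
    SPSA perturbs all coordinates at once and needs only two.
    """
    rng_dir = np.random.default_rng(seed)
    delta = np.where(rng_dir.uniform(size=theta.shape) < 0.5, -1.0, 1.0)
    # Persistent seed: the same random stream drives both evaluations, so
    # the simulation noise is common to both terms and largely cancels.
    f_plus = log_lik_hat(theta + c * delta, np.random.default_rng(seed + 1))
    f_minus = log_lik_hat(theta - c * delta, np.random.default_rng(seed + 1))
    return (f_plus - f_minus) / (2 * c * delta)   # delta_i in {-1, +1}
```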

At first glance, it seems we have a problem: stochastic HMC algorithms typically rely on unbiased estimates of the log likelihood in order for their guarantees to hold (and even with such estimates there remain concerns about the asymptotic convergence of the algorithms). Here we have access to unbiased estimates of the ABC likelihood, not its log (recall that we used the unbiased likelihood estimate to claim the correctness of the pseudomarginal approach). To reconcile this, we should think of Hamiltonian ABC as inference on the augmented space of both parameters and simulations; i.e. as a Markov chain that targets the joint posterior, $p(\theta, x^{(1:S)} \mid y)$. Conditioning on the simulations, the finite difference approximation to the gradient of the log likelihood w.r.t. $\theta$ is just a deterministic function. Minibatches of datapoints introduce stochasticity, but not bias, since the log likelihood is a sum over datapoints. Using SPSA gradients introduces some additional stochasticity and a slight bias, but in practice this seems to be insignificant given the other approximations inherent in ABC.

Leveraging gradient information, even if it comes from finite difference approximations, could provide substantial gains in efficiency for high dimensional inference problems. Hamiltonian ABC seems like a promising step in this direction.

Variational ABC

As we all know, MCMC isn’t the only game in town. Building on recent work on variational Bayes with intractable likelihood (VBIL) [4], Moreno et al. (2016) [2] propose to approximate the posterior as,

$$p_\epsilon(\theta \mid y) \approx q_\phi(\theta),$$

and maximize the evidence lower bound (ELBO),

$$\mathcal{L}(\phi) = \mathbb{E}_{q_\phi(\theta)} \left[ \log \hat{p}_\epsilon(y \mid \theta) + \log p(\theta) - \log q_\phi(\theta) \right].$$
Rather than approximating gradients of the ELBO with finite differences, as in Meeds et al. (2015) [1], here the authors assume that we can actually take gradients through the simulator using an (awesome) automatic differentiation library like autograd (more on the implications of this assumption shortly). Given gradients through the simulator, the authors then suggest parameterizing both the variational factor and the simulator such that they become deterministic functions of the parameters and a stream of random bits. Using these “reparameterization tricks,” we can write the ELBO as an expectation with respect to a random stream and compute lower variance Monte Carlo estimates of its gradient with respect to the variational parameters, $\phi$.
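A minimal sketch of the idea with autograd, assuming a Gaussian variational factor $q_\phi(\theta)$ and a hypothetical reparameterized simulator `simulate_reparam(theta, z)` whose randomness `z` has been factored out:

```python
import autograd.numpy as anp
from autograd import grad

def elbo_hat(phi, y, z_theta, z_sim, log_prior, eps):
    """Monte Carlo ELBO under the reparameterization trick.

    phi = (mu, log_std) parameterizes a Gaussian q(theta). z_theta and
    z_sim are fixed streams of standard normal draws, so the whole
    expression is a deterministic, differentiable function of phi.
    """
    mu, log_std = phi
    total = 0.0
    for zt, zs in zip(z_theta, z_sim):
        theta = mu + anp.exp(log_std) * zt          # theta = g(phi, z)
        x = simulate_reparam(theta, zs)             # hypothetical: x = f(theta, z)
        log_k = -anp.sum((x - y) ** 2) / (2 * eps ** 2)   # Gaussian kernel
        log_q = -anp.sum(0.5 * zt ** 2 + log_std)   # log q(theta), up to a constant
        total = total + log_k + log_prior(theta) - log_q
    return total / len(z_theta)

elbo_grad = grad(elbo_hat)    # differentiates through the simulator w.r.t. phi
```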

Effectively, Moreno et al. (2016) suggest that the simulator isn’t always as black-box as it appears. If we can implement it in Python and factor out the randomness, then many of the now-standard tools for variational inference are at our disposal. This raises the question, however: if our black-box simulator is actually gray, when is our likelihood really intractable? If the output is a simple function of some latent variables that we can integrate over with Monte Carlo, then why invoke ABC? Clearly there are still many cases where the likelihood is indeed inaccessible, but in my opinion, this paper suggests that if we pop the lid on some of our complicated simulators, we may in fact have more tools at our disposal than we at first thought.

References

[1] Meeds, Edward, Robert Leenders, and Max Welling. “Hamiltonian ABC.” arXiv preprint arXiv:1503.01916 (2015).

[2] Moreno, Alexander, et al. “Automatic Variational ABC.” arXiv preprint arXiv:1606.08549 (2016).

[3] Wood, Simon N. “Statistical inference for noisy nonlinear ecological dynamic systems.” Nature 466.7310 (2010): 1102-1104.

[4] Tran, Minh-Ngoc, David J. Nott, and Robert Kohn. “Variational Bayes with intractable likelihood.” arXiv preprint arXiv:1503.08621 (2015).

Footnotes

1. Updated per John's comments below.
2. See John's comment regarding the validity of this approach.
3. See Dustin's comment for more on our discussion of this point in last week's meeting.