A few weeks ago, Dar lead our discussion of “Learning in Implicit Generative Models” by Mohamed and Lakshminarayanan [1]. This paper gives a good overview of techniques for learning in implicit generative models, and has links to several of the areas we’ve discussed this past year, which I’ll reference throughout.

Learning in Implicit Generative Models

Usually, when thinking of probabilistic methods we think of an explicit parametric specification of the distribution of a random variable $X$ , in which we directly specify a log-likelihood $\log q_{\theta}(x)$ with $\theta$ indexing a family of possible distributions. The goal is then to learn $\theta$ from the log-likelihood. In contrast, it is frequently more natural to specify models implicitly via a latent variable $Z \in \mathbb{R}^m$ which is very easy to generate (e.g multivariate gaussian random variables) which we then transform via a deterministic function $\mathcal{G}_{\theta} : \mathbb{R}^m \mapsto \mathbb{R}^n$ , again with $\theta$ indexing a family of such functions.

But why is the problem of learning $\theta$ different here? Namely, it boils down to that the density $q_{\theta}(x)$ of $X$ can be highly intractable for a number of different reasons. Writing $q(z)$ for the density of the latent variable $Z$ , we have that

$q_{\theta}(x) = \frac{ \partial }{\partial x_1 } \cdots \frac{ \partial }{\partial x_d } \int_{ \{z : \mathcal{G}_{\theta}(z) \leq x \}} q(z) \, dz$

where for $z, z' \in \mathbb{R}^m$ , $z \leq z'$ if and only if $z_i \leq z'_i$ for $i = 1, \ldots, m$ . Although for certain types of $\mathcal{G}$ (e.g when it is easily invertible) this integral is tractable, frequently we have scenarios where $\mathcal{G}$ is non-linear and is specified by a deep network. As mentioned in our discussion, it could be possible to use normalizing flows (which we’ve earlier discussed in the context of variational inference) to allow for this integral to be evaluated in certain circumstances; however if $d > m$ or $\mathcal{G}$ is not invertible, then we would have to proceed differently. We would then likely encounter several compounding issues:

The integration region itself may be very hard to determine.
Even if we knew the integration region, the integral itself may then be computationally demanding, depending on the geometry of the integration region, it’s dimension, and the form of the density $q(z)$ .
Even if we can calculate the integral, we then have to take $d$ partial derivatives, which in a high-dimensional scenario is very challenging.

We would therefore like to have a likelihood-free approach to inference, for which there have been various solutions proposed, including generative adversarial networks (GANs) [2] and classifier ABC [3]. We’ve discussed ABC (Approximate Bayesian Computation) before on the blog (see Scott’s blog post from when we discussed a few methods extending it to work in higher dimensions), and GANs have been a frequent reoccuring topic; for a nice quick review of GANs, see the beginning of Robin’s blog post. The contribution of this paper is to discuss and bring together the ideas developed in these methods in unison, where previously they have been isolated to more specialized areas of the literature.

Notation: The data are denoted by a random variable $X$ , whose true data density is $p^*(x)$ , with intractable model density $q_{\theta}(x)$ . $\theta$ is reserved for model parameters; $\phi$ is used to denote other parameters, such as for discrminator functions.

Likelihood-free inference via density ratios and differences

What are the advantages of generative models? Namely, we can easily draw samples from the model - the latent variables $z$ should be chosen so that they can be generated easily, and then we can simply evaluate the function $\mathcal{G}_{\theta}(\cdot)$ at that point to obtain a draw. Therefore, we should try and use methods which allow for two sets of samples to be compared in order to drive learning. We, for example, can test for whether the true data distribution $p^*(x)$ and the model distribution $q_{\theta}(x)$ are equal by directly estimating

the density difference $r_{\theta}(x) = p^*(x) - q_{\theta}(x)$ , or
the density ratio $r_{\theta}(x) = p^*(x) / q_{\theta}(x)$ .

As estimating the individual marginal likelihoods will be more difficult than estimating the above ratios/differences directly, which we can then use for comparisons (by seeing how close they are to 0 and 1 respectively), this is therefore an attractive way to proceed. There are (broadly speaking) four general approaches to this problem we could take:

class-probability matching,
divergence matching,
ratio matching, and
moment matching

which we’ll discuss in a bit more detail now.

Four general approaches

Class-probability matching - This boils down to using a classifier which can distingush between the observed data and data generated from the model. To describe the general approach, let $Y$ denote a random variable which assigns label $Y = 1$ to samples from the true data distribution, and $Y = 0$ to those from the model distribution, so we can write $p^*(x) = p(x \mid Y = 1)$ and $q_{\theta}(x) = p(x \mid Y = 0)$ . Writing $p(Y = 1) = \pi$ , it therefore follows by Bayes’ formula that

$\frac{ p^*(x) }{q_{\theta}(x) } = \frac{ p(Y = 1\mid x) }{ p(Y = 0 \mid x) } \cdot \frac{ 1 - \pi}{\pi}$

and so estimating the density ratio is equivalent to that of class-probability estimation, as we now only need to compute $p(Y = 1\mid x)$ .

We now need to specify a scoring function or discriminator $\mathcal{D}(x; \phi) = p(Y = 1\mid x)$ depending on some parameters $\phi$ , which is linked to the density ratio via the mapping $\mathcal{D} = \frac{r}{r+1}$ . This means we can use tools for building classifiers (e.g deep neural networks) to specify these functions. Given a scoring function, we now need to specify a proper scoring rule in order to learn parameters. Choosing a Bernoulli (logarithmic) loss, it follows that we can arrive at equation 5 of the paper, namely the GAN objective

$\mathcal{L}(\phi, \theta) = \pi \mathbb{E}_{p^*(x)}[ - \log \mathcal{D}(x; \phi) ] + (1 - \pi) \mathbb{E}_{q(z)}\left[ - \log\left( 1 - \mathcal{D}(\mathcal{G}_{\theta}(z); \phi) \right) \right],$

which we then try and minimize by bi-level optimization. Of course, we can use other scoring rules if desired, with the overall strategy remaining the same.

Divergence minimization - Here the idea is to try and use a divergence between $p^*(\cdot)$ and $q_{\theta}(\cdot)$ in order to drive learning. One class of divergences is the $f$ -divergence, which for a convex function $f$ is of the form

$D_f\left[ p^*(X) \| q_{\theta}(X) \right] = \int q_{\theta}(x) f\left( \frac{p^*(x)}{q_{\theta}(x)} \right) \, dx.$

If these seem completely foreign to you, then note that by the following choices of $f$ , we can obtain some more familar measures of distance $^1$ :

the KL-divergence - $f(u) = u \log(u)$
the total variation distance - $f(u) = \tfrac{1}{2} \lvert u - 1 \rvert$
the $\chi^2$ divergence - $f(u) = (u - 1)^2$
the Jensen-Shannon divergence - $f(u) = u \log\left( \tfrac{2u}{1+u} \right)$

Writing $f'$ for the derivative of $f$ and $f^\dagger$ for the Fenchel conjugate (or convex dual), this is lower bounded by

$\sup_t \mathbb{E}_{p^*(x)}\left[ t(X) \right] - \mathbb{E}_{q_{\theta}(x)}[f^\dagger(t(X))].$

The optimum $t^*(x)$ is related to the density ratio via $t^*(x) = f'(r(x))$ , and so by substituting in this to the above formula, it is equivalent to the following minimization problem in $r_{\phi}(\cdot)$

$\mathcal{L} = \mathbb{E}_{p^*(x)}\left[ - f'(r_{\phi}(X)) \right] + \mathbb{E}_{q_{\theta}(x)}\left[ f^\dagger(f'(r_{\phi}(X))) \right],$

which we can again optimize via bi-level optimization. Here there is no discriminator persay, with its role taken by the ratio function itself. We note that if we let $f(u) = u \log(u) - (u+1) \log(u +1)$ , then we again obtain the original GAN objective; the paper on f-GANs by Nowizin et al. [4] discusses this approach further. As mentioned by the authors, there is also an equivalence $^2$ between this and class probability estimation.

Ratio matching - Writing the true density ratio as $r^*(x) = p^*(x) / q_{\theta}(x)$ and allowing $r_{\phi}(x)$ to be an approxiation parameterized by $\phi$ , we could consider the loss

$\mathcal{L} = \frac{1}{2} \int q_{\theta}(x) \left( r_{\phi}(x) - r^*(x) \right)^2 \, dx,$

which can be generalized to trying to minimize the Bregman divergence for a given function $f$ . Although we’ll encounter the Bregman divergence later (including its definition), for now we’ll simply remark that one can show

$B_f\left( r^*(x) \| r_{\phi}(x) \right) = \mathcal{L}\left( r_{\phi}(x) \right) + D_f\left[ p^*(x) \| q_{\theta}(x) \right]$

where $\mathcal{L}\left(r_{\phi}(x) \right)$ is the same ratio loss as that for divergence minimization above. However, as the divergence term is dependent on $q_{\theta}(x)$ , which is unknown, we could consider an approximation where e.g the ratio is near-optimal, in which case $D_f\left[ p^*(x) \| q_{\theta}(x) \right] = \mathbb{E}_{q_{\theta}(x)}\left[ f(r_{\phi}(x)) \right]$ , with the corresponding approximate loss then used for learning.

Moment matching - One last technique is to try and use the fact that $p^*$ and $q_{\theta}$ are identical if and only if the expectations of any test statistic $s(x)$ are identical under both laws, and so we try and minimize the gap

$\mathcal{L}(\phi, \theta) = \left( \mathbb{E}_{p^*(x)}\left[ s(x) \right] - \mathbb{E}_{q(z)}\left[ s(\mathcal{G}_{\theta}(z)) \right] \right)^2.$

Here the choice of $s$ is very important to the performance of the procedure, as ideally we would like all the moments of the distributions to be matched. If the functions $s(x)$ belong to a reproducing kernel Hilbert space $\mathcal{H}$ , this objective can then be reformulated to give

$\sup_{f \in \mathcal{F}} \mathbb{E}_{p^*(x)}\left[ f(x) \right] - \mathbb{E}_{q(z)}\left[ f(\mathcal{G}_{\theta}(z)) \right]$

where $\mathcal{F} \subseteq \mathcal{H}$ is some function space, giving rise to the maximum mean discrepancy. We can then generalize this to other integral probability metrics, such as the Wasserstein distance, which is used in Wasserstein GANs. In addition, ABC also generally uses a moment matching approach to learning for these types of generative models.

Conclusion

To conclude, although the choice of whether we implicitly define or prescribe our model affects the type of learning and inferential procedures we can use, the authors highlight that

“Any implicit model can be easily turned into a prescribed model by adding a simple likelihood function on the generated outputs, so the distinction is not essential”.

As a consequence, it is important to distingush between the choice of model, the choice of inference and the resulting choice of algorithms; within this, it is perhaps worth noting e.g GANs come under only this third umbrella, and there are other algorithms out there.

References

[1] Mohamed, Shakir, and Balaji Lakshminarayanan. “Learning in Implicit Generative Models.” arXiv preprint arXiv:1610.03483 (2016). link

[2] Goodfellow, Ian, et al. “Generative adversarial nets.” Neural Information Processing Systems (NIPS) (2014). link

[3] M. U. Gutmann, R. Dutta, S. Kaski, and J. Corander. “Statistical inference of intractable generative models via classification”. arXiv preprint arXiv:1407.4981, 2014. link

[4] Nowozin, Sebastian, Botond Cseke, and Ryota Tomioka. “f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization.” Neural Information Processing Systems (NIPS) (2016). link

Footnotes

Strictly, only the total variation distance is actually a metric; infact, one can show that it is the only $f$ -divergence (up to scaling) which gives rise to a proper metric.
Having had a quick look, I’m unsure on whether this is something which would be practically useful; it could be interesting to give this a further look for whether it is the case.

Columbia Advanced Machine Learning Seminar

Implicit Generative Models — What are you GAN-na do?