This week we discussed MADE (Germain et al., 2015 [1]) and NADE (Uria et al., 2016 [2]), two papers on autoregressive distribution estimation. The two papers take a similar approach: both modify the structure of autoencoder neural networks to yield properly normalized, autoregressive models. NADE, first introduced in 2011, lays the groundwork for these models. MADE extends these ideas to deep networks on binary data. The more recent journal paper on NADE further extends them to real-valued data, explores more interesting network architectures, and performs more extensive experiments.

The figures and algorithms below are taken from the aforementioned papers.

Distribution Estimation

Set up

Given a set of examples $\{x^{(1)}, \dots, x^{(N)}\}$, where each $x^{(i)}$ is a $D$-dimensional vector, the goal is to estimate the joint distribution $p(x)$. This distribution quantifies the statistical properties of the data and can be used as a generative model to sample new data. Such a generative model is useful in many applications, such as classification, data denoising, or missing-data completion. The problem is relatively easy if the dimensionality of the data is low (e.g., estimating a distribution over scalars from many real-valued examples). However, when the data are high dimensional (e.g., the space of pixels of an image), estimating the data distribution becomes difficult. The main problem is that as the dimensionality increases, the volume of the space the distribution needs to cover grows exponentially, making it harder for finite datasets to give a clear picture of the statistical properties of that space.

Autoencoder neural networks:

One powerful idea for estimating the distribution of data is to exploit the power of neural networks as function approximators. In this setting, a neural network learns a feed-forward representation of the input examples in its hidden layers, with the goal of regenerating the input as accurately as possible. These hidden representations can thus reveal the statistical structure of the data-generating distribution. For example, to learn a representation of binary data $x \in \{0,1\}^D$ using a one-layer network, we can frame the problem as follows:

$$\mathbf{h}(x) = g(b + Wx), \qquad \hat{x} = \mathrm{sigm}(c + V\,\mathbf{h}(x)),$$

where $g$ is the hidden-layer nonlinear activation function, $\mathrm{sigm}$ is the logistic sigmoid, $W$ and $V$ are the network input-to-hidden and hidden-to-output weights, respectively, and $b$ and $c$ are the bias terms. The main advantage of this framework is that it is very flexible and easy to train with stochastic gradient descent. The typical loss function when the data are binary is the cross-entropy:

$$\ell(x) = \sum_{d=1}^{D} -x_d \log \hat{x}_d - (1 - x_d) \log(1 - \hat{x}_d).$$
The output of the network, $\hat{x}_d$, is interpreted as the probability that the $d$-th output is one, i.e. $\hat{x}_d = p(x_d = 1)$. From this perspective, the network maps an input to a collection of probabilities, and the loss represents the negative log likelihood of the data under an independent Bernoulli noise model. It is tempting to interpret $\ell(x)$ as a negative log probability of $x$. However, $e^{-\ell(x)}$ is not a proper probability mass function; it is non-negative, but it does not sum to one over all $x$. Normalizing it would require an intractable sum over all inputs. Moreover, in the general case of a fully connected network, this is not an ideal approach to density estimation. The main confound is that in a fully connected network the generative process for the data at dimension $d$ depends on the input data at all dimensions, including $x_d$ itself. Thus, with enough hidden units, the network can learn a trivial map that simply copies the input to the output (i.e., there is a trivial setting of the weights that assigns $\hat{x}_d$ arbitrarily close to one when $x_d$ equals one and arbitrarily close to zero otherwise). In this trivial copying case, $\ell(x) \approx 0$ for all $x$, and hence after normalization the output would be the uniform distribution. NADE and MADE address these two issues by placing restrictions on the autoencoder network.
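To make the setup concrete, here is a minimal NumPy sketch of the one-layer autoencoder and its cross-entropy loss, following the notation above ($W$, $V$, $b$, $c$). The choice of a sigmoid hidden activation, the layer sizes, and the random initialization are illustrative assumptions, not choices taken from either paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def autoencoder_forward(x, W, V, b, c):
    """One-layer autoencoder: x -> h(x) -> x_hat.

    x: (D,) binary input; W: (H, D) input-to-hidden weights;
    V: (D, H) hidden-to-output weights; b: (H,) and c: (D,) biases.
    """
    h = sigmoid(b + W @ x)       # hidden representation g(b + Wx), here g = sigmoid
    x_hat = sigmoid(c + V @ h)   # per-dimension "probabilities"
    return x_hat

def cross_entropy(x, x_hat, eps=1e-12):
    """Sum over dimensions of the binary cross-entropy."""
    x_hat = np.clip(x_hat, eps, 1 - eps)
    return float(np.sum(-x * np.log(x_hat) - (1 - x) * np.log(1 - x_hat)))

# Tiny usage example with random weights (illustration only).
rng = np.random.default_rng(0)
D, H = 8, 16
x = (rng.random(D) < 0.5).astype(float)
W, V = 0.1 * rng.standard_normal((H, D)), 0.1 * rng.standard_normal((D, H))
b, c = np.zeros(H), np.zeros(D)
print(cross_entropy(x, autoencoder_forward(x, W, V, b, c)))
```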

Distribution estimation with autoregression:

The decomposition of a joint distribution into a product of conditionals offers a solution to the above problem. In general, the joint distribution over the data can be written as a product of conditionals as follows:

$$p(x) = \prod_{d=1}^{D} p(x_d \mid x_{<d}), \qquad \text{where } x_{<d} = [x_1, \dots, x_{d-1}].$$
Recall that the main problem with the autoencoder is that $\hat{x}_d$ depends on all of the input $x$, including $x_d$ itself, due to the full connections in the neural network. However, if the connections are modified so that $\hat{x}_d$ depends only on $x_{<d}$, satisfying the autoregressive property, this eliminates the possibility of trivial representations and allows the network to learn a proper joint distribution. The loss function then becomes a valid negative log probability:

$$-\log p(x) = \sum_{d=1}^{D} -\log p(x_d \mid x_{<d}) = \sum_{d=1}^{D} -x_d \log \hat{x}_d - (1 - x_d) \log(1 - \hat{x}_d),$$

where now $\hat{x}_d = p(x_d = 1 \mid x_{<d})$.
Masked Autoencoder for Distribution Estimation (MADE)

The main idea of MADE is to modify the connections in the autoencoder to satisfy the autoregressive property using masked weights. To enforce that $\hat{x}_d$ does not depend on $x_{d'}$ for $d' \geq d$, MADE ensures there are no computational paths between them by multiplying the network weights elementwise by binary masks as follows:

$$\mathbf{h}(x) = g\big(b + (W \odot M^{W})\, x\big), \qquad \hat{x} = \mathrm{sigm}\big(c + (V \odot M^{V})\,\mathbf{h}(x)\big),$$

where $M^{W}$ and $M^{V}$ are binary mask matrices. The matrix product $M^{V} M^{W}$ counts the number of computational paths from each input to each output in this one-layer network. Thus, to satisfy the autoregressive property, we need to choose $M^{W}$ and $M^{V}$ such that $M^{V} M^{W}$ is strictly lower triangular; that is, there are no computational paths from $x_{d'}$ to $\hat{x}_d$ whenever $d' \geq d$. The same framework generalizes to deep networks with more than one hidden layer by ensuring that the product of all the masks has this lower-triangular structure (Figure 1). The mask-construction procedure is detailed in Algorithm 1. MADE focuses entirely on estimating the distribution of binary data.

[MADE, Figure 1] [MADE, Algorithm 1]
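Below is a rough NumPy sketch of how masks with the required path structure can be constructed, in the spirit of MADE's Algorithm 1: each unit is assigned a degree, hidden degrees are sampled uniformly between the smallest degree in the previous layer and $D-1$, and a connection is kept only when the downstream degree is greater than or equal to the upstream one (strictly greater for the output layer). Using the natural input ordering, the specific layer sizes, and the final path-count check are illustrative assumptions on my part.

```python
import numpy as np

def made_masks(D, hidden_sizes, rng):
    """Build masks so that output d depends only on inputs 1..d-1."""
    # Degree of each input unit: its position in the (here, natural) ordering 1..D.
    degrees = [np.arange(1, D + 1)]
    # Hidden degrees are sampled uniformly in [min previous degree, D-1].
    for H in hidden_sizes:
        low = degrees[-1].min()
        degrees.append(rng.integers(low, D, size=H))
    masks = []
    # Keep a connection only if the downstream degree >= the upstream degree.
    for m_in, m_out in zip(degrees[:-1], degrees[1:]):
        masks.append((m_out[:, None] >= m_in[None, :]).astype(float))
    # Output mask: output d may only see hidden units whose degree is < d.
    masks.append((np.arange(1, D + 1)[:, None] > degrees[-1][None, :]).astype(float))
    return masks

# Sanity check: the product of the masks (path counts from inputs to outputs)
# must be strictly lower triangular, i.e. no forbidden computational paths.
rng = np.random.default_rng(0)
D = 5
masks = made_masks(D, hidden_sizes=[8, 8], rng=rng)
paths = masks[-1]
for M in reversed(masks[:-1]):
    paths = paths @ M
assert np.allclose(np.triu(paths), 0)
```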

Neural Autoregressive Distribution Estimation (NADE)

The same approach of enforcing the autoregressive property by modifying the autoencoder weights is also adopted in NADE, except that NADE uses a fixed set of masks (NADE Algorithm 1 and Figure 1), whereas in MADE the masks are allowed to change.

[NADE, Figure 1] [NADE, Algorithm 1]
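For comparison, here is a compact sketch of how NADE's conditionals can be evaluated for binary data. It uses the well-known NADE recurrence in which a running hidden pre-activation is updated one input dimension at a time, so every conditional shares the same weights; the parameter shapes and the random initialization in the example are assumptions made only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nade_log_likelihood(x, W, V, b, c):
    """log p(x) for a binary NADE.

    x: (D,) binary vector; W: (H, D) shared input-to-hidden weights;
    V: (D, H) hidden-to-output weights; b: (D,) output and c: (H,) hidden biases.
    """
    a = c.copy()          # hidden pre-activation; at step d it has seen only x_1..x_{d-1}
    log_p = 0.0
    for d in range(x.shape[0]):
        h = sigmoid(a)
        p_d = np.clip(sigmoid(b[d] + V[d] @ h), 1e-12, 1 - 1e-12)  # p(x_d = 1 | x_<d)
        log_p += x[d] * np.log(p_d) + (1 - x[d]) * np.log(1 - p_d)
        a = a + W[:, d] * x[d]   # fold x_d into the pre-activation for the next step
    return log_p

rng = np.random.default_rng(0)
D, H = 8, 16
x = (rng.random(D) < 0.5).astype(float)
print(nade_log_likelihood(x, W=0.1 * rng.standard_normal((H, D)),
                          V=0.1 * rng.standard_normal((D, H)),
                          b=np.zeros(D), c=np.zeros(H)))
```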

One major extension of NADE over MADE is its ability to handle real-valued data, by modifying the model along the lines of the Gaussian RBM (Welling et al. [3]). That is, each of the conditionals is modeled by a mixture of Gaussians as follows:

$$p(x_d \mid x_{<d}) = \sum_{c=1}^{C} \pi_{d,c}\, \mathcal{N}\!\big(x_d;\, \mu_{d,c},\, \sigma^{2}_{d,c}\big),$$

where the mixture weights $\pi_{d,c}$, means $\mu_{d,c}$, and variances $\sigma^{2}_{d,c}$ for dimension $d$ are computed by the network from $x_{<d}$.
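As a small illustration of what one such conditional looks like numerically, the sketch below evaluates the log density of a $C$-component mixture of Gaussians at a scalar value. How NADE computes the mixture weights, means, and variances from its hidden state is omitted here, and the function and argument names are hypothetical.

```python
import numpy as np

def mog_log_density(x_d, log_pi, mu, sigma):
    """log p(x_d | x_<d) for a C-component mixture of Gaussians.

    log_pi: (C,) log mixture weights; mu, sigma: (C,) component means and std devs.
    """
    # Per-component Gaussian log densities.
    log_comp = -0.5 * np.log(2 * np.pi * sigma ** 2) - (x_d - mu) ** 2 / (2 * sigma ** 2)
    # Mix the components with a numerically stable log-sum-exp.
    z = log_pi + log_comp
    m = z.max()
    return m + np.log(np.sum(np.exp(z - m)))

print(mog_log_density(0.1, log_pi=np.log([0.5, 0.3, 0.2]),
                      mu=np.array([0.0, 1.0, -1.0]),
                      sigma=np.array([1.0, 0.5, 2.0])))
```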

In deep architectures, NADE uses masks in a slightly different way than MADE: the input to the network is the concatenation of the masked data and the mask itself (Figure 2). This allows the network to distinguish cases where an input value is truly zero from cases where it is zero because it was masked out. NADE also explores other autoencoder architectures, such as convolutional neural networks (Figure 3).

[NADE, Figure 2] [NADE, Figure 3]
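The input construction described above is simple to write down; the fragment below only shows the shape of the idea (masked values concatenated with the mask), and the way the mask is sampled here is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8
x = rng.standard_normal(D)                     # one real-valued example
mask = (rng.random(D) < 0.5).astype(x.dtype)   # 1 = observed, 0 = to be predicted
net_input = np.concatenate([x * mask, mask])   # masked data plus the mask itself
```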

Results:

MADE

MADE was trained on the UCI binary datasets using stochastic gradient descent with mini-batches of size 100 and a lookahead of 30 for early stopping. The results are quantified by the average negative log-likelihood on the test set of each dataset.

The results on the UCI data show that MADE is the best-performing model on almost half of the tested datasets (Table 4).

[MADE, Table 4]

NADE

NADE has a more extensive experimental section. Here, I discuss the results of the experiments on the UCI datasets. Table 2 in the NADE paper compares its log-likelihood performance against other models, including MADE. In these experiments, both NADE and MADE were better than the other models in many cases. The performance of NADE and MADE was similar on almost all of the datasets, with NADE slightly better.

[NADE, Table 2]

Conclusions

Both NADE and MADE are methods motivated by the idea of modeling valid distributions using the autoregressive property. The two methods modify autoencoder networks to enforce this property through constraints on the network weights, and in doing so they learn valid joint distributions while avoiding trivial solutions and intractable normalization constants. NADE takes the idea of autoregressive models one step further by additionally estimating the distributions of non-binary data and by extending the approach to other network architectures, such as convolutional networks.

References

[1] Germain, Mathieu, et al. “MADE: Masked Autoencoder for Distribution Estimation.” International Conference on Machine Learning, 2015.

[2] Uria, Benigno, et al. “Neural Autoregressive Distribution Estimation.” arXiv preprint arXiv:1605.02226, 2016.

[3] Welling, Max, Michal Rosen-Zvi, and Geoffrey E. Hinton. “Exponential Family Harmoniums with an Application to Information Retrieval.” Advances in Neural Information Processing Systems 17, pages 1481–1488. MIT Press, 2005.