On “Better Exploiting Latent Variables in Text Modeling”

I’ve been working on latent variable language models for some time, and intend to make them the topic of my PhD. So when Google Scholar recommended “Better Exploiting Latent Variables in Text Modeling”, I was naturally excited to see that this work has continued beyond Bowman et al.’s paper on VAE language models. Of course, there have been multiple improvements on the original model since then. More recently, Yoon Kim from Harvard has been publishing particularly interesting papers on this topic.

Can Chinese Rooms Think?

Machine learning and CS researchers tend to get drawn into philosophical debates about whether machines will ever be able to think like humans. The argument goes back so far that the people who started the field had to grapple with it. It’s also fun to think about, especially with sci-fi always portraying AI-versus-human apocalypse showdowns in which humans prevail because of love, friendship or humanity.

But there’s a tendency for people in such a debate to wind up talking past each other.

Computing Log Normal for Isotropic Gaussians

Consider a matrix $\mathbf{X}$ of shape $(n, d)$ whose rows are the datapoints $\mathbf{x}_i$. The matrix $\mathbf{M}$ is made up of the means $\boldsymbol{\mu}_j$ of $k$ different Gaussian components. The task is to compute the log probability of each of the $n$ data points under each of these $k$ components. In [1]: import theano import theano.tensor as T import numpy as np import time X = T.matrix('X') M = T.
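The computation described above can be sketched in plain NumPy rather than Theano. This is a minimal sketch, not the post's own code: the function name `isotropic_gaussian_logpdf` and the shared scalar standard deviation `sigma` are assumptions made for illustration, and $\mathbf{M}$ is assumed to have shape $(k, d)$.

```python
import numpy as np

def isotropic_gaussian_logpdf(X, M, sigma=1.0):
    """Log density of every row of X under every isotropic Gaussian N(M[j], sigma^2 I).

    X: (n, d) data points; M: (k, d) component means;
    sigma: shared scalar standard deviation (an assumption of this sketch).
    Returns an (n, k) array.
    """
    n, d = X.shape
    # Squared distances via ||x||^2 - 2 x.mu + ||mu||^2,
    # which avoids building an (n, k, d) intermediate array.
    sq_dist = (
        np.sum(X ** 2, axis=1)[:, None]
        - 2.0 * X @ M.T
        + np.sum(M ** 2, axis=1)[None, :]
    )
    sq_dist = np.maximum(sq_dist, 0.0)  # guard against tiny negatives from round-off
    log_norm = -0.5 * d * np.log(2.0 * np.pi * sigma ** 2)
    return log_norm - sq_dist / (2.0 * sigma ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # n = 5 points, d = 3
M = rng.normal(size=(4, 3))   # k = 4 component means
log_probs = isotropic_gaussian_logpdf(X, M)
print(log_probs.shape)
```

Expanding the squared distance turns the pairwise computation into a single matrix product, which is usually the cheap way to do this when $n$ and $k$ are large.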

Details of the Hierarchical VAE (Part 2)

So, as a recap, here’s a (badly drawn) diagram of the architecture:

Details of the Hierarchical VAE (Part 1)

The motivation to use a hierarchical architecture for this task was two-fold:

1. A vanilla encoder-decoder architecture would be the go-to deep learning model for such a task. However, if we train it by maximum likelihood, the only noise it models is at the pixel level. This seems inappropriate, as it implies there is one “right answer” given an outer context, up to some pixel colour variations. The hierarchical VAE models different uncertainties at different levels of abstraction, so it seems like a good fit.
2. I wanted to investigate how the hierarchical factorisation of the latent variables affects learning in such a model. It turns out that certain layers overfit (or at least I would diagnose the problem as overfitting), and I’m unsure how to remedy those problems.

Samples from the Hierarchical VAE

Each of the following plots shows samples from the conditional VAE that I’m using for the inpainting task. As expected of results from a VAE, they’re blurry. However, the fun thing about having a hierarchy of latent variables is that I can freeze all the layers except one, and vary just that one to see the type of noise it models. The pictures are generated by using the mean $\mu_{z_l}(z_{l-1})$ for all layers except the $i$-th layer, which is sampled.

$i=1$
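The procedure of fixing every layer to its mean except the $i$-th can be sketched as follows. This is only an illustration under stated assumptions: `sample_varying_one_layer`, `mean_fns` and `std_fns` are hypothetical names, and the random `tanh` layers with a constant standard deviation are stand-ins for the learned networks, which the post does not show.

```python
import numpy as np

def sample_varying_one_layer(mean_fns, std_fns, z0, i, rng):
    """Pass z0 through the chain of layers.

    Every layer l outputs its mean mu_l(z_{l-1}), except layer i,
    which outputs a sample mu_i + sigma_i * eps.
    """
    z = z0
    for l, (mean_fn, std_fn) in enumerate(zip(mean_fns, std_fns), start=1):
        mu = mean_fn(z)
        if l == i:
            z = mu + std_fn(z) * rng.standard_normal(mu.shape)
        else:
            z = mu
    return z

rng = np.random.default_rng(0)
dim, num_layers = 4, 3

# Hypothetical stand-ins for the learned per-layer networks.
weights = [rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(num_layers)]
mean_fns = [lambda z, W=W: np.tanh(z @ W) for W in weights]
std_fns = [lambda z: 0.1 * np.ones_like(z) for _ in range(num_layers)]

z0 = rng.normal(size=(dim,))  # hypothetical input to the first layer
# Eight samples that differ only through the noise injected at layer i = 2.
samples = np.stack([
    sample_varying_one_layer(mean_fns, std_fns, z0, i=2, rng=rng)
    for _ in range(8)
])
print(samples.shape)
```

Because only one layer is stochastic, any variation across the rows of `samples` is attributable to that layer alone.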

Hierarchical Variational Autoencoders

$$\newcommand{\expected}[2]{\mathbb{E}_{#1}\left[ #2 \right]} \newcommand{\prob}[3]{{#1}_{#2} \left( #3 \right)} \newcommand{\condprob}[4]{{#1}_{#2} \left( #3 \middle| #4 \right)} \newcommand{\Dkl}[2]{D_{\mathrm{KL}}\left( #1 \| #2 \right)} \newcommand{\muvec}{\boldsymbol \mu} \newcommand{\sigmavec}{\boldsymbol \sigma} \newcommand{\uttid}{s} \newcommand{\lspeakervec}{\vec{w}} \newcommand{\lframevec}{\vec{z}} \newcommand{\lframevect}{\lframevec_t} \newcommand{\inframevec}{\vec{x}} \newcommand{\inframevect}{\inframevec_t} \newcommand{\inframeset}{\inframevec_1,\hdots,\inframevec_T} \newcommand{\lframeset}{\lframevec_1,\hdots,\lframevec_T} \newcommand{\model}[2]{\condprob{#1}{#2}{\lspeakervec,\lframeset}{\inframeset}} \newcommand{\joint}{\prob{p}{}{\lspeakervec,\lframeset,\inframeset}} \newcommand{\normalparams}[2]{\mathcal{N}(#1,#2)} \newcommand{\normal}{\normalparams{\mathbf{0}}{\mathbf{I}}} \newcommand{\hidden}[1]{\vec{h}^{(#1)}} \newcommand{\pool}{\max} \newcommand{\hpooled}{\hidden{\pool}} \newcommand{\Weight}[1]{\mathbf{W}^{(#1)}} \newcommand{\Bias}[1]{\vec{b}^{(#1)}}$$

I’ve decided to approach the inpainting problem given for our class project IFT6266 using a hierarchical variational autoencoder.

While the basic VAE only has a single latent variable, this architecture assumes the image generation process comes from a hierarchy of latent variables, each dependent on its parents. So the factorisation looks like this:

$$p(x, z_1, z_2,\dots,z_L) = p(x|z_1)p(z_1|z_2) \dots p(z_{L-1}|z_L)p(z_L)$$

An architecture like this was used in the PixelVAE paper, but there they use a more complex PixelCNN structure at each layer, which I am attempting to do without. In their model, the recognition model or the encoder is not hierarchical — the $q_\phi$ network is structured in the following way:

$$q(z_1, z_2,\dots,z_L | x) = q(z_1|x)q(z_2|x) \dots q(z_L|x)$$
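Combining the generative factorisation with this recognition model gives a variational lower bound of the following form. This is a sketch derived from the two factorisations rather than a formula quoted from the PixelVAE paper, and it assumes a prior $p(z_L)$ over the top-level latent variable:

$$\log p(x) \geq \mathbb{E}_{q(z_1,\dots,z_L|x)}\left[ \log p(x|z_1) + \sum_{l=1}^{L-1} \log p(z_l|z_{l+1}) + \log p(z_L) - \sum_{l=1}^{L} \log q(z_l|x) \right]$$

Each generative conditional $p(z_l|z_{l+1})$ is paired with a recognition term $q(z_l|x)$ that depends only on $x$, which is what makes the encoder non-hierarchical.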

Deciding <u>W</u>hen <u>T</u>o <u>F</u>eedforward (or WTF gates)

Another paper of mine, titled “Towards Implicit Complexity Control using Variable-Depth DNNs for ASR Systems” got accepted to the International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2016 in Shanghai, which happened not too long ago.

The idea behind this one was the intuition that in a classification task, some instances should be simpler than others to classify. Similarly, the problem of deciding when to stop in an RNN setting is also an important one. If we take the bAbI task for example, and go an extra step and assume the number of logical steps to arrive at the answer is not provided for you, then you need to know when the network is ‘ready’ to give an answer.