## Details of the Hierarchical VAE (Part 2)

So, as a recap, here’s a (badly drawn) diagram of the architecture:

## Details of the Hierarchical VAE (Part 1)

The motivation to use a hierarchical architecture for this task was two-fold:

1. A vanilla encoder-decoder architecture would be the usual deep learning go-to model for this task. However, if we train it by maximum likelihood, the only noise it models is at the pixel level. This seems inappropriate, as it implies that, given an outer context, there is one “right answer” up to some pixel colour variations. The hierarchical VAE models different uncertainties at different levels of abstraction, so it seems like a good fit.
2. I wanted to investigate how the hierarchical factorisation of the latent variables affects learning in such a model. It turns out that certain layers overfit (or at least, overfitting is how I would diagnose the problem), and I’m unsure how to remedy that.

## Samples from the Hierarchical VAE

Each of the following plots shows samples from the conditional VAE that I’m using for the inpainting task. As expected of results from a VAE, they’re blurry. However, the fun thing about having a hierarchy of latent variables is that I can freeze all the layers except one, and vary just that one to see the type of noise it models. The pictures are generated by using the mean $\mu_{z_l}(z_{l-1})$ for all layers except the $i$-th layer, which is sampled.
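
To make the freezing procedure concrete, here is a minimal NumPy sketch. It is not the actual model: `make_layer`, the tanh/exp parameterisation, the random weights and the layer sizes are all hypothetical stand-ins for the learned conditionals, and the indexing follows the $\mu_{z_l}(z_{l-1})$ notation above, where layer $l$ is conditioned on layer $l-1$. Every layer passes on its mean except layer `i`, which is sampled from its Gaussian.

```python
import numpy as np

def make_layer(rng, in_dim, out_dim):
    """Hypothetical stand-in for one learned conditional Gaussian layer."""
    W_mu = rng.normal(scale=0.5, size=(in_dim, out_dim))
    W_logsig = rng.normal(scale=0.1, size=(in_dim, out_dim))
    return W_mu, W_logsig

def sample_varying_layer(layers, context, i, rng):
    """Propagate `context` through the chain of layers.

    Each layer outputs its mean, except layer `i` (1-indexed), which is
    sampled from its Gaussian. With i=0 no layer is sampled at all.
    """
    z = context
    for l, (W_mu, W_logsig) in enumerate(layers, start=1):
        mu = np.tanh(z @ W_mu)        # assumed form of mu_{z_l}(z_{l-1})
        sigma = np.exp(z @ W_logsig)  # assumed form of the std. deviation
        if l == i:
            z = mu + sigma * rng.standard_normal(mu.shape)
        else:
            z = mu
    return z

rng = np.random.default_rng(0)
dims = [8, 6, 4, 3]  # context size followed by three layer sizes (made up)
layers = [make_layer(rng, a, b) for a, b in zip(dims[:-1], dims[1:])]
context = rng.standard_normal(dims[0])

# Five draws where only layer 2 is stochastic, and five with everything frozen.
samples = np.stack([sample_varying_layer(layers, context, 2, rng) for _ in range(5)])
frozen = np.stack([sample_varying_layer(layers, context, 0, rng) for _ in range(5)])
```

In this toy setup the rows of `frozen` are identical, while the rows of `samples` differ only through the noise injected at layer 2, which is the effect the plots are meant to isolate.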

$i=1$

## Hierarchical Variational Autoencoders

$$\newcommand{\expected}[2]{\mathbb{E}_{#1}\left[ #2 \right]} \newcommand{\prob}[3]{{#1}_{#2} \left( #3 \right)} \newcommand{\condprob}[4]{{#1}_{#2} \left( #3 \middle| #4 \right)} \newcommand{\Dkl}[2]{D_{\mathrm{KL}}\left( #1 \| #2 \right)} \newcommand{\muvec}{\boldsymbol \mu} \newcommand{\sigmavec}{\boldsymbol \sigma} \newcommand{\uttid}{s} \newcommand{\lspeakervec}{\vec{w}} \newcommand{\lframevec}{\vec{z}} \newcommand{\lframevect}{\lframevec_t} \newcommand{\inframevec}{\vec{x}} \newcommand{\inframevect}{\inframevec_t} \newcommand{\inframeset}{\inframevec_1,\hdots,\inframevec_T} \newcommand{\lframeset}{\lframevec_1,\hdots,\lframevec_T} \newcommand{\model}[2]{\condprob{#1}{#2}{\lspeakervec,\lframeset}{\inframeset}} \newcommand{\joint}{\prob{p}{}{\lspeakervec,\lframeset,\inframeset}} \newcommand{\normalparams}[2]{\mathcal{N}(#1,#2)} \newcommand{\normal}{\normalparams{\mathbf{0}}{\mathbf{I}}} \newcommand{\hidden}[1]{\vec{h}^{(#1)}} \newcommand{\pool}{\max} \newcommand{\hpooled}{\hidden{\pool}} \newcommand{\Weight}[1]{\mathbf{W}^{(#1)}} \newcommand{\Bias}[1]{\vec{b}^{(#1)}}$$
I’ve decided to approach the inpainting problem given for our IFT6266 class project using a hierarchical variational autoencoder.

While the basic VAE only has a single latent variable, this architecture assumes the image generation process comes from a hierarchy of latent variables, each dependent on its parents. So the factorisation looks like this:

$$p(x, z_1, z_2,\dots,z_L) = p(x|z_1)p(z_1|z_2) \dots p(z_{L-1}|z_L)p(z_L)$$

An architecture like this was used in the PixelVAE paper, but there they use a more complex PixelCNN structure at each layer, which I am attempting to do without. In their model, the recognition model or the encoder is not hierarchical — the $q_\phi$ network is structured in the following way:
$$q(z_1, z_2,\dots,z_L | x) = q(z_1|x)q(z_2|x) \dots q(z_L|x)$$
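
As a rough sketch of what this factorised encoder implies for training, and assuming a prior $p(z_L)$ on the top layer (my assumption, not something taken from the paper), the variational lower bound splits into a reconstruction term and one KL term per layer:

$$\mathcal{L}(x) = \mathbb{E}_{q(z_1|x)}\left[\log p(x|z_1)\right] - \sum_{l=1}^{L-1} \mathbb{E}_{q(z_{l+1}|x)}\left[ D_{\mathrm{KL}}\left( q(z_l|x) \,\|\, p(z_l|z_{l+1}) \right) \right] - D_{\mathrm{KL}}\left( q(z_L|x) \,\|\, p(z_L) \right)$$

Each intermediate KL term compares the encoder's $q(z_l|x)$ against the generative conditional $p(z_l|z_{l+1})$, averaged over the layer above. The expectation is taken under $q(z_{l+1}|x)$ only because the encoder factorises across layers.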

## Deciding When To Feedforward (or WTF gates)

Another paper of mine, titled “Towards Implicit Complexity Control using Variable-Depth DNNs for ASR Systems” got accepted to the International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2016 in Shanghai, which happened not too long ago.

The idea behind this one was the intuition that in a classification task, some instances should be simpler than others to classify. Similarly, the problem of deciding when to stop in an RNN setting is also an important one. If we take the bAbI task for example, and go an extra step and assume the number of logical steps to arrive at the answer is not provided for you, then you need to know when the network is ‘ready’ to give an answer.

## Constraining Hidden Layers for Interpretability (eventually, hopefully…)

I haven’t written much this past year, so I guess as a parting post for 2015, I’ll talk a little bit about the poster I presented at ASRU 2015. The bulk of the stuff’s in the paper, plus I’m still kind of unsure about the legality of putting stuff that’s in the paper on this blog, so I think I’ll talk about the other things that didn’t make it in.

## Learning to Transduce with Unbounded Memory – The Neural Stack

DeepMind has in the past week released a paper proposing yet another approach to having a memory structure within a neural network. This time, they implement stack, queue and deque “data structures” within their models. While this idea is not necessarily new, it incorporates some of the broad ideas seen in the Neural Turing Machines, where they try to have a model that is end-to-end differentiable, rather than have the data structure decoupled from the training process. I have to admit I haven’t read any of these previous papers, but they’re definitely on my to-read list.

In any case, this paper claims that using these memory structures beats having an 8-layered LSTM network trained for the same task. If this is true, this may mean we finally have some justification for these fancier models — simply throwing bigger networks at problems just isn’t as efficient.

I’ve spent some time trying to puzzle out what exactly they’re trying to do here with the neural stack. I suspect once I’ve figured this out, the queue and deque will be pretty similar, so I don’t think I will go through them in the same detail.

## Generating Singlish with LSTMs

So in the last week, Andrej Karpathy wrote a post about the current state of RNNs, and proceeded to dump a whole bunch of different kinds of text data into them to see what they learn. Training language models and then sampling from them is lots of fun, and a character-level model is extra interesting because you see it come up with new words that actually kind of mean something sometimes. Even Geoffrey Hinton has some fun with it in this talk.

So after reading through Karpathy’s code, I got a few tips about how to do this properly, as I haven’t been able to train a proper language model before.

The data is obtained from one of Singapore’s most active sub-forums on HardwareZone.com.sg, Eat-Drink-Man-Woman. I like to think it’s a… localised 4chan. So know what to expect going in. If you want to just see what the model generates, go here.