## Details of the Hierarchical VAE (Part 2)

So, as a recap, here’s a (badly drawn) diagram of the architecture:

$$\newcommand{\expected}[2]{\mathbb{E}_{#1}\left[ #2 \right]} \newcommand{\prob}[3]{{#1}_{#2} \left( #3 \right)} \newcommand{\condprob}[4]{{#1}_{#2} \left( #3 \middle| #4 \right)} \newcommand{\Dkl}[2]{D_{\mathrm{KL}}\left( #1 \| #2 \right)} \newcommand{\muvec}{\boldsymbol \mu} \newcommand{\sigmavec}{\boldsymbol \sigma} \newcommand{\uttid}{s} \newcommand{\x}{\mathbf{x}} \newcommand{\z}{\mathbf{z}} \newcommand{\h}{\mathbf{h}} \newcommand{\u}{\mathbf{u}} \newcommand{\xinner}{\x_\text{inner}} \newcommand{\xouter}{\x_\text{outer}} \newcommand{\model}[2]{\condprob{#1}{#2}{\lspeakervec,\lframeset}{\inframeset}} \newcommand{\joint}{\prob{p}{}{\lspeakervec,\lframeset,\inframeset}} \newcommand{\normalparams}[2]{\mathcal{N}(#1,#2)} \newcommand{\normal}{\normalparams{\mathbf{0}}{\mathbf{I}}} \newcommand{\hidden}[1]{\vec{h}^{(#1)}} \newcommand{\pool}{\max} \newcommand{\hpooled}{\hidden{\pool}} \newcommand{\Weight}[1]{\mathbf{W}^{(#1)}} \newcommand{\Bias}[1]{\vec{b}^{(#1)}}$$
Hopefully, it makes sense. The point is to reiterate that feature extraction from the image is performed by the same function with shared parameters. Also, when sampling from the conditional prior, each layer’s latent variables are fed back into the convolutional upsampling path once they have been sampled.
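To make the top-down sampling order concrete, here is a minimal sketch in plain Python. It assumes diagonal Gaussian conditional priors and stands in for the convolutional upsampling path with hypothetical callables (`prior_fns`), each mapping the latent from the layer above to a `(mu, sigma)` pair. None of these names come from the actual model code.

```python
import random


def sample_top_down(prior_fns, top_dim, rng):
    """Ancestral sampling through a stack of conditional Gaussian priors.

    prior_fns: hypothetical callables, ordered from the layer just below
               the top down to the lowest layer; each maps the latent
               above to (mu, sigma) lists for its own layer.
    top_dim:   size of the top-level latent, drawn from N(0, I).
    rng:       a random.Random instance.
    """
    # The top layer has no parent, so it uses the standard normal prior.
    z = [rng.gauss(0.0, 1.0) for _ in range(top_dim)]
    samples = [z]
    for prior_fn in prior_fns:
        # The freshly sampled latent is fed back in to parameterise
        # the prior of the next layer down.
        mu, sigma = prior_fn(z)
        z = [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]
        samples.append(z)
    return samples
```

In the real model the `(mu, sigma)` computation would be part of the upsampling network rather than a standalone function; the sketch only shows the ordering of "sample, feed back, sample again".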

For the sake of simplicity in the model descriptions, I’ve left out a couple of minor details. For example, I’ve added connections from the encoder’s feature maps to the decoder’s feature maps of the same size in order to “re-include” information from the immediate border. This helps keep the colours at the edges of the inpainted region consistent with the border.

## Instability in the KL terms

One problem that occurs when training VAE models with conditional priors is that the distributions learnt for $p(\z_l|\z_{l+1})$ and $q(\z_l|\z_{l-1})$ don’t match up. The mismatch incurs a large KL penalty, which affects the gradients being backpropagated. Because this model has a larger number of random variables, the problem is compounded. Batch norm worked exceedingly well at remedying these problems.
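To see why a mismatch is so costly, it helps to look at the closed-form KL term. The following is a minimal sketch, assuming both the approximate posterior and the conditional prior are diagonal Gaussians; the function name and argument layout are made up for illustration.

```python
import math


def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL(q || p) between two diagonal Gaussians, summed over dimensions.

    Per dimension:
        log(sigma_p / sigma_q)
        + (sigma_q^2 + (mu_q - mu_p)^2) / (2 * sigma_p^2)
        - 1/2
    """
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        total += (
            math.log(sp / sq)
            + (sq ** 2 + (mq - mp) ** 2) / (2.0 * sp ** 2)
            - 0.5
        )
    return total
```

The penalty grows quadratically with the distance between the two means, scaled by the prior's variance, so a prior that is confidently wrong about where the posterior sits produces a very large term and correspondingly large gradients.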

However, the model then starts to overfit very easily. Despite this, samples from an overfitted model show no sign of memorising pictures from the training data. The following are plots of the training and validation curves.

The gap between the training and validation KL-divergence terms in the lower layers grows as training continues. This means that the generative portion of the model cannot successfully match the distribution of the lower layers’ latent variables given the latent variables in the higher layers.
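One way to keep an eye on this is to log the KL term of each layer separately and track the validation-minus-training gap. A minimal sketch, assuming the per-layer KL values have already been averaged over each dataset (the helper name is hypothetical):

```python
def kl_gap_per_layer(train_kl, val_kl):
    """Return validation-minus-training KL for each layer.

    train_kl, val_kl: per-layer mean KL values, in the same layer order.
    A gap that keeps growing for a layer suggests its conditional prior
    is overfitting to the training set.
    """
    if len(train_kl) != len(val_kl):
        raise ValueError("expected one KL value per layer in both lists")
    return [v - t for t, v in zip(train_kl, val_kl)]
```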

## Remedies and Future Work?

One possible solution would be to go the way of the PixelVAE and introduce an autoregressive structure at every layer. I did briefly consider implementing DeepMind’s multi-scale alternative to pixel-by-pixel “drawing”, but it would have taken too much time.

Also included in the project was caption data, which may provide more information about the region that requires inpainting. I didn’t use that data in my project, though I doubt it would help much.