Details of the Hierarchical VAE (Part 1)

The motivation to use a hierarchical architecture for this task was two-fold:

First, a vanilla encoder-decoder architecture would be the go-to deep learning model for this task. However, if we train it by maximum likelihood, the only noise being modelled is at the pixel level. This seems inappropriate, as it implies that, given an outer context, there is one “right answer” up to some pixel colour variation. The hierarchical VAE models different uncertainties at different levels of abstraction, so it seems like a good fit.
Second, I wanted to investigate how the hierarchical factorisation of the latent variables affects learning in such a model. It turns out that certain layers overfit (or at least I would diagnose the problem as overfitting), and I’m unsure how to remedy this.

$$\newcommand{\expected}[2]{\mathbb{E}_{#1}\left[ #2 \right]}

\newcommand{\prob}[3]{{#1}_{#2} \left( #3 \right)}

\newcommand{\condprob}[4]{{#1}_{#2} \left( #3 \middle| #4 \right)}

\newcommand{\Dkl}[2]{D_{\mathrm{KL}}\left( #1 \,\middle\|\, #2 \right)}

\newcommand{\muvec}{\boldsymbol \mu}

\newcommand{\sigmavec}{\boldsymbol \sigma}

\newcommand{\uttid}{s}

\newcommand{\x}{\mathbf{x}}

\newcommand{\z}{\mathbf{z}}

\newcommand{\h}{\mathbf{h}}

\newcommand{\u}{\mathbf{u}}

\newcommand{\xinner}{\x_\text{inner}}

\newcommand{\xouter}{\x_\text{outer}}

\newcommand{\model}[2]{\condprob{#1}{#2}{\lspeakervec,\lframeset}{\inframeset}}

\newcommand{\joint}{\prob{p}{}{\lspeakervec,\lframeset,\inframeset}}

\newcommand{\normalparams}[2]{\mathcal{N}(#1,#2)}

\newcommand{\normal}{\normalparams{\mathbf{0}}{\mathbf{I}}}

\newcommand{\hidden}[1]{\vec{h}^{(#1)}}

\newcommand{\pool}{\max}

\newcommand{\hpooled}{\hidden{\pool}}

\newcommand{\Weight}[1]{\mathbf{W}^{(#1)}}

\newcommand{\Bias}[1]{\vec{b}^{(#1)}}$$

Model description

I denote by $\xinner$ the inner square that the inpainting task requires us to predict, and by $\xouter$ the outer context. The latent variables are $\z_1,\dots,\z_L$. The generative process is the following:

$$\begin{align*}
&\condprob{p}{\theta}{\xinner, \z_1, \z_2,\dots,\z_L}{\xouter} = \\
&\qquad\qquad\condprob{p}{\theta}{\xinner}{\z_1,\dots,\z_L, \xouter}\condprob{p}{\theta}{\z_1}{\z_2,\dots,\z_L ,\xouter} \dots \condprob{p}{\theta}{\z_{L-1}}{\z_L, \xouter}\condprob{p}{\theta}{\z_L}{\xouter},
\end{align*}$$

with the goal of maximising $\condprob{p}{}{\xinner}{\xouter}$.

The approximating posterior distribution we then use factorises in the following way:

$$\begin{align*}
&\condprob{q}{\phi}{\z_1, \z_2,\dots,\z_L}{\xinner, \xouter} = \\
&\qquad\qquad\condprob{q}{\phi}{\z_1}{\xinner, \xouter}\condprob{q}{\phi}{\z_2}{\xinner, \xouter} \dots \condprob{q}{\phi}{\z_L}{\xinner, \xouter}
\end{align*}$$
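To make the roles of the two factorisations concrete, here is the variational lower bound they would give rise to. This is the standard VAE bound written out for this model, not something stated elsewhere in this post, and it assumes the top-level latent has a prior $\condprob{p}{\theta}{\z_L}{\xouter}$ (for $l = L$ the conditioning set below is just $\xouter$):

$$\begin{align*}
\log \condprob{p}{\theta}{\xinner}{\xouter} &\geq \expected{q_\phi}{\log \condprob{p}{\theta}{\xinner}{\z_1,\dots,\z_L, \xouter}} \\
&\qquad - \sum_{l=1}^{L} \expected{q_\phi}{D_{\mathrm{KL}}\left( \condprob{q}{\phi}{\z_l}{\xinner, \xouter} \,\middle\|\, \condprob{p}{\theta}{\z_l}{\z_{l+1},\dots,\z_L, \xouter} \right)}
\end{align*}$$

The KL term splits into one term per layer because $q_\phi$ factorises over the $\z_l$; the outer expectation remains because each conditional of $p_\theta$ still depends on the latent variables above it.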

Architecture

In particular, our model assigns a 2D layout to the latent variables $\z_1,\dots,\z_{L-1}$, while $\z_L$ is a 1D vector. To be conservative about parameter usage, feature extraction for $q_\phi$ and $p_\theta$ shares the same convolutional network, which is composed of convolution-pooling layers.

The input is parameterised as a 4 x 64 x 64 tensor. The last channel is a binary mask: it is 0 if the pixel is missing (part of the inpainting task output) and 1 if the pixel is given.

For $q_\phi$, the first 3 feature maps of the input ($\h_1$) correspond to the R, G, B channels of the full image and, since $q_\phi$ is conditioned on both $\xinner$ and $\xouter$, the fourth channel consists entirely of $1$s.

For $p_\theta$, the first 3 feature maps of the input ($\h_1$) correspond to the R, G, B channels of the masked image (the middle 32 x 32 is filled with $0$s). The fourth channel consists entirely of $1$s except for the middle 32 x 32 region, which is $0$.
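The two input layouts described above can be sketched in NumPy as follows. This is only an illustration: the function name `make_inputs` and the use of a random image are my own placeholders, not taken from the actual training code.

```python
import numpy as np

def make_inputs(image, hole=32):
    """Build the 4-channel inputs for q_phi and p_theta.

    image: float array of shape (3, H, W); H = W = 64 in this post.
    Returns (q_input, p_input), each of shape (4, H, W).
    """
    _, height, width = image.shape
    top = (height - hole) // 2
    left = (width - hole) // 2

    # Mask channel: 1 where the pixel is given, 0 where it must be predicted.
    mask = np.ones((1, height, width), dtype=image.dtype)
    mask[:, top:top + hole, left:left + hole] = 0

    # q_phi sees the full image, so its fourth channel is all ones.
    q_input = np.concatenate([image, np.ones_like(mask)], axis=0)
    # p_theta sees only the outer context: RGB is zeroed inside the hole.
    p_input = np.concatenate([image * mask, mask], axis=0)
    return q_input, p_input

# Hypothetical stand-in for a real 64 x 64 RGB training image.
image = np.random.default_rng(0).random((3, 64, 64)).astype(np.float32)
q_input, p_input = make_inputs(image)
print(q_input.shape, p_input.shape)  # (4, 64, 64) (4, 64, 64)
```

Feeding the mask as a channel lets the shared convolutional network tell a genuinely black pixel apart from a missing one.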

For $l = 2,\dots,L$,

$$\h_{l} = g_l(\h_{l-1})$$

Then,

$$\condprob{q}{\phi}{\z_l}{\xinner, \xouter} = \mathcal{N}\left(\mu_l(\h_l),(\sigma_l(\h_l))^2I\right),$$

where $\h_l$ is a hidden layer of the convolutional network, using both $\xinner$ and $\xouter$ (the full image) as input.
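Drawing $\z_l$ from such a diagonal Gaussian is usually done with the reparameterisation trick, so that gradients can flow through $\mu_l$ and $\sigma_l$. The sketch below assumes that is what is used here (this post does not spell it out); the function name and the 8 x 16 x 16 latent shape are made up for the example.

```python
import numpy as np

def sample_diag_gaussian(mu, sigma, rng):
    """Sample z ~ N(mu, sigma^2 I) as z = mu + sigma * eps, eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

rng = np.random.default_rng(0)
# Hypothetical outputs of mu_l(h_l) and sigma_l(h_l) for one 2D latent layer.
mu = np.zeros((8, 16, 16))
sigma = np.full((8, 16, 16), 0.5)
z = sample_diag_gaussian(mu, sigma, rng)
print(z.shape)  # (8, 16, 16)
```

Since the covariance is $\sigma^2 I$, every latent unit is sampled independently, and the 2D layout of $\z_l$ is simply the shape of the arrays.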

On the generative model side of things, the approach is similar to the decoder in an encoder-decoder model. However, each upsampling-convolution step additionally outputs $\mu_{L+l}(\h_{L+l})$ and $\sigma_{L+l}(\h_{L+l})$. So for $l = L,\dots,2L$

$$\begin{align*}
\h_{l} &= f_l(\h_{l-1}, \z_{l-1}), \\
\condprob{p}{\theta}{\z_l}{\z_{l-1},\dots,\z_L, \xouter} &= \mathcal{N}\left(\mu_l(\h_l),(\sigma_l(\h_l))^2I\right).
\end{align*}$$

This parameterisation results in $p_\theta$ being dependent on all previous layers of latent variables, because each latent variable depends on $\h_{l-1}$, which was itself computed from the latent variables sampled before it.
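Since both $q_\phi$ and $p_\theta$ are diagonal Gaussians at every layer, the KL divergence between them has a closed form. Here is a minimal sketch, assuming the usual VAE objective with one KL term per layer (the loss is not described in this part); `kl_diag_gaussians` is a name I made up for the example.

```python
import numpy as np

def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, sigma_q^2 I) || N(mu_p, sigma_p^2 I) ), summed over units."""
    var_q = sigma_q ** 2
    var_p = sigma_p ** 2
    per_unit = (
        np.log(sigma_p / sigma_q)
        + (var_q + (mu_q - mu_p) ** 2) / (2.0 * var_p)
        - 0.5
    )
    return per_unit.sum()

# Toy values: q is shifted by 1 from p on each of 4 units, same variance.
mu_q, sigma_q = np.ones(4), np.ones(4)
mu_p, sigma_p = np.zeros(4), np.ones(4)
print(kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p))  # 2.0
```

Each unit contributes $\frac{1}{2}(\mu_q - \mu_p)^2$ when the variances match, which is where the 2.0 comes from.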

@misc{tan2017-04-08,
  title        = {Details of the Hierarchical VAE (Part 1)},
  author       = {Tan, Shawn},
  howpublished = {\url{https://blog.wtf.sg/2017/04/09/details-of-the-hierarchical-vae-part-1/}},
  year         = {2017}
}