Hierarchical Variational Autoencoders
$$\newcommand{\expected}[2]{\mathbb{E}_{#1}\left[ #2 \right]}
\newcommand{\prob}[3]{{#1}_{#2} \left( #3 \right)}
\newcommand{\condprob}[4]{{#1}_{#2} \left( #3 \middle| #4 \right)}
\newcommand{\Dkl}[2]{D_{\mathrm{KL}}\left( #1 \middle\| #2 \right)}
\newcommand{\muvec}{\boldsymbol \mu}
\newcommand{\sigmavec}{\boldsymbol \sigma}
\newcommand{\uttid}{s}
\newcommand{\lspeakervec}{\vec{w}}
\newcommand{\lframevec}{\vec{z}}
\newcommand{\lframevect}{\lframevec_t}
\newcommand{\inframevec}{\vec{x}}
\newcommand{\inframevect}{\inframevec_t}
\newcommand{\inframeset}{\inframevec_1,\hdots,\inframevec_T}
\newcommand{\lframeset}{\lframevec_1,\hdots,\lframevec_T}
\newcommand{\model}[2]{\condprob{#1}{#2}{\lspeakervec,\lframeset}{\inframeset}}
\newcommand{\joint}{\prob{p}{}{\lspeakervec,\lframeset,\inframeset}}
\newcommand{\normalparams}[2]{\mathcal{N}(#1,#2)}
\newcommand{\normal}{\normalparams{\mathbf{0}}{\mathbf{I}}}
\newcommand{\hidden}[1]{\vec{h}^{(#1)}}
\newcommand{\pool}{\max}
\newcommand{\hpooled}{\hidden{\pool}}
\newcommand{\Weight}[1]{\mathbf{W}^{(#1)}}
\newcommand{\Bias}[1]{\vec{b}^{(#1)}}$$
I’ve decided to approach the inpainting problem for our class project (IFT6266) using a hierarchical variational autoencoder.
While the basic VAE has only a single latent variable, this architecture assumes the image is generated from a hierarchy of latent variables, each conditioned on the one above it. The factorisation of the joint then looks like this:
$$p(x, z_1, z_2,\dots,z_L) = p(x|z_1)\,p(z_1|z_2) \cdots p(z_{L-1}|z_L)\,p(z_L)$$
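To make this concrete, here is a minimal PyTorch sketch of ancestral sampling through such a hierarchy. Everything in it is an assumption for illustration — the layer sizes are made up, each $p(z_l|z_{l+1})$ is a diagonal Gaussian given by a single linear layer, and the decoder outputs Bernoulli means — so this is not the PixelVAE architecture itself.

```python
import torch
import torch.nn as nn


class HierarchicalGenerator(nn.Module):
    """Ancestral sampling through p(z_L) p(z_{L-1}|z_L) ... p(z_1|z_2) p(x|z_1).

    z_dims is ordered top-down (z_L first); each conditional is a diagonal
    Gaussian produced by a single linear layer, and the decoder outputs
    Bernoulli means for p(x|z_1). All sizes are hypothetical.
    """

    def __init__(self, z_dims=(16, 32, 64), x_dim=784):
        super().__init__()
        self.cond = nn.ModuleList(
            nn.Linear(z_dims[i], 2 * z_dims[i + 1]) for i in range(len(z_dims) - 1)
        )
        self.decoder = nn.Linear(z_dims[-1], x_dim)

    def sample(self, batch_size=1):
        # start at the top of the hierarchy with p(z_L) = N(0, I)
        z = torch.randn(batch_size, self.cond[0].in_features)
        for layer in self.cond:
            mu, logvar = layer(z).chunk(2, dim=-1)
            # reparameterised sample of the next latent down the chain
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        return torch.sigmoid(self.decoder(z))  # Bernoulli means for p(x|z_1)
```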
An architecture like this was used in the PixelVAE paper, but there they use a more complex PixelCNN structure at each layer, which I am trying to do without. In their model, the recognition model (the encoder) is not hierarchical — the $q_\phi$ network factorises in the following way:
$$q(z_1, z_2,\dots,z_L | x) = q(z_1|x)q(z_2|x) \dots q(z_L|x)$$
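A matching sketch of this flat recognition network, again with made-up sizes and with a shared MLP feature extractor standing in for the convolutional stacks used in the paper:

```python
class Recognition(nn.Module):
    """Non-hierarchical encoder: every q(z_l|x) is conditioned directly on x.

    A shared MLP computes features once, then one Gaussian head per level
    produces (mu, logvar); z_dims is ordered bottom-up (z_1 first).
    """

    def __init__(self, x_dim=784, h_dim=256, z_dims=(64, 32, 16)):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.heads = nn.ModuleList(nn.Linear(h_dim, 2 * d) for d in z_dims)

    def forward(self, x):
        h = self.features(x)
        # [(mu_1, logvar_1), ..., (mu_L, logvar_L)]
        return [head(h).chunk(2, dim=-1) for head in self.heads]
```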
The paper has a nice diagram illustrating the structure of the model.
One interesting observation about the model is that the $p_\theta(z_l|z_{l+1})$s are never trained directly with signals from the reconstruction of $x$. Instead, the model first learns to produce a good reconstruction (maximising $\expected{q(z_1|x)}{\log p(x|z_1)}$), which increases $\Dkl{q(z_1|x)}{p(z_1|z_2)}$. In minimising $\Dkl{q(z_1|x)}{p(z_1|z_2)}$, it then has to learn a good $q(z_2|x)$, which in turn increases $\Dkl{q(z_2|x)}{p(z_2|z_3)}$. This process propagates slowly up to the top-level latent variable.
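For reference, the objective all of this falls out of is the usual ELBO, which under this factorisation splits into a reconstruction term plus one KL term per level (with a standard normal prior $p(z_L) = \normal$ at the top in the unconditional case):

$$\mathcal{L} = \expected{q(z_1|x)}{\log p(x|z_1)} - \sum_{l=1}^{L-1} \expected{q(z_{l+1}|x)}{\Dkl{q(z_l|x)}{p(z_l|z_{l+1})}} - \Dkl{q(z_L|x)}{\normal}$$

Here is a single-sample sketch of these terms, reusing the two hypothetical modules above. The closed-form Gaussian KL is standard; the indexing just pairs the bottom-up recognition heads with the top-down prior layers.

```python
import torch
import torch.nn.functional as F


def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """Closed-form KL(N(mu_q, diag e^logvar_q) || N(mu_p, diag e^logvar_p)), summed over dims."""
    return 0.5 * (
        logvar_p - logvar_q
        + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
        - 1.0
    ).sum(dim=-1)


def elbo_terms(gen, recog, x):
    """Single-sample estimate of the reconstruction term and the per-level KLs.

    gen/recog are the HierarchicalGenerator and Recognition sketches above;
    x is a batch of pixel values scaled to [0, 1].
    """
    stats = recog(x)  # [(mu_1, lv_1), ..., (mu_L, lv_L)], bottom-up
    z = [mu + torch.randn_like(mu) * (0.5 * lv).exp() for mu, lv in stats]
    # E_{q(z_1|x)}[log p(x|z_1)] with a Bernoulli decoder
    recon = -F.binary_cross_entropy(
        torch.sigmoid(gen.decoder(z[0])), x, reduction="none"
    ).sum(dim=-1)
    # top level: KL(q(z_L|x) || N(0, I))
    mu_top, lv_top = stats[-1]
    kls = [gaussian_kl(mu_top, lv_top, torch.zeros_like(mu_top), torch.zeros_like(lv_top))]
    # intermediate levels: KL(q(z_l|x) || p(z_l|z_{l+1})), using a sample of z_{l+1}
    num_levels = len(stats)
    for i in range(num_levels - 2, -1, -1):  # i indexes z_{i+1}, from z_{L-1} down to z_1
        mu_p, lv_p = gen.cond[num_levels - 2 - i](z[i + 1]).chunk(2, dim=-1)
        kls.append(gaussian_kl(stats[i][0], stats[i][1], mu_p, lv_p))
    return recon, kls  # maximise recon - sum(kls), i.e. loss = -(recon - sum(kls)).mean()
```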
One change I’ve made to my formulation of the model for the inpainting problem is to turn it into a “conditional” VAE. I split $x$ into $x_\text{outer}$ and $x_\text{inner}$. The prior for the top-level latent variable is then conditioned on $x_\text{outer}$, giving $p(z_L|x_\text{outer})$. This results in a model that requires the outer border during the generation process.
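Here is a sketch of what that conditional top-level prior could look like, with hypothetical layer sizes: a small network maps the flattened border pixels to the mean and log-variance of $z_L$. The top KL term in the ELBO then becomes $\Dkl{q(z_L|x)}{p(z_L|x_\text{outer})}$ rather than a KL against $\normal$, and at generation time $z_L$ is sampled from this conditional prior before running the hierarchy down to $p(x_\text{inner}|z_1)$.

```python
import torch.nn as nn


class ConditionalTopPrior(nn.Module):
    """p(z_L | x_outer): the border pixels parameterise a diagonal Gaussian over z_L.

    Hypothetical sizes; outer_dim is the number of border pixel values after flattening.
    """

    def __init__(self, outer_dim=3072, h_dim=256, z_top_dim=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(outer_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, 2 * z_top_dim)
        )

    def forward(self, x_outer):
        mu, logvar = self.net(x_outer.flatten(start_dim=1)).chunk(2, dim=-1)
        return mu, logvar  # parameters of p(z_L | x_outer)
```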