Details of the Hierarchical VAE (Part 1)
The motivation to use a hierarchical architecture for this task was two-fold:
- A vanilla encoder-decoder architecture would be the go-to deep learning model for such a task. However, if we train it by maximum likelihood, the only noise modelled is at the pixel level. This seems inappropriate, as it implies there is one “right answer”, with some pixel colour variations, given an outer context. The hierarchical VAE models different uncertainties at different levels of abstraction, so it seems like a good fit.
- I wanted to investigate how the hierarchical factorisation of the latent variables affects learning in such a model. It turns out certain layers overfit (or at least I would diagnose the problem as overfitting), and I’m unsure how to remedy this.
Samples from the Hierarchical VAE
Each of the following plots shows samples from the conditional VAE that I’m using for the inpainting task. As expected of a VAE, they’re blurry. However, the fun thing about having a hierarchy of latent variables is that I can freeze all the layers except one and vary just that layer to see the type of noise it models. The pictures are generated by using the mean $\mu_{z_l}(z_{l-1})$ for all layers except the $i$-th layer, which is sampled.
$i=1$
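The freeze-all-but-one procedure can be sketched in a few lines. This is a toy illustration under stated assumptions, not the model used for the plots: the latents are scalars, each layer is a Gaussian with a fixed standard deviation, and the names `generate`, `means`, `stds` and `context` are hypothetical. In the real model each mean would be a neural network and the final latent would be decoded into an image.

```python
import random


def generate(means, stds, vary_layer, rng, context=0.0):
    """Run the latent chain, sampling only at `vary_layer` (1-indexed).

    means[l-1] maps z_{l-1} to the conditional mean of z_l, and stds[l-1]
    is a fixed standard deviation for that layer (a toy assumption).
    Every other layer is set to its conditional mean, so all the variation
    in the output comes from a single layer.
    """
    z = context
    for layer, (mean_fn, std) in enumerate(zip(means, stds), start=1):
        mu = mean_fn(z)
        z = rng.gauss(mu, std) if layer == vary_layer else mu
    return z


# Hypothetical three-layer chain with hand-picked mean functions.
means = [lambda z: 0.5 * z + 1.0, lambda z: 2.0 * z, lambda z: z - 1.0]
stds = [1.0, 0.3, 0.1]
rng = random.Random(0)

# Vary only layer 1; layers 2 and 3 follow their means deterministically.
samples_layer1 = [generate(means, stds, 1, rng) for _ in range(5)]
# With no layer varied, the output is the same on every call.
baseline = generate(means, stds, None, rng)
```

Comparing the spread of the outputs for different values of `vary_layer` gives a rough picture of how much variation each layer contributes.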
Hierarchical Variational Autoencoders
$$\newcommand{\expected}[2]{\mathbb{E}_{#1}\left[ #2 \right]}
\newcommand{\prob}[3]{{#1}_{#2} \left( #3 \right)}
\newcommand{\condprob}[4]{{#1}_{#2} \left( #3 \middle| #4 \right)}
\newcommand{\Dkl}[2]{D_{\mathrm{KL}}\left( #1 \| #2 \right)}
\newcommand{\muvec}{\boldsymbol \mu}
\newcommand{\sigmavec}{\boldsymbol \sigma}
\newcommand{\uttid}{s}
\newcommand{\lspeakervec}{\vec{w}}
\newcommand{\lframevec}{\vec{z}}
\newcommand{\lframevect}{\lframevec_t}
\newcommand{\inframevec}{\vec{x}}
\newcommand{\inframevect}{\inframevec_t}
\newcommand{\inframeset}{\inframevec_1,\hdots,\inframevec_T}
\newcommand{\lframeset}{\lframevec_1,\hdots,\lframevec_T}
\newcommand{\model}[2]{\condprob{#1}{#2}{\lspeakervec,\lframeset}{\inframeset}}
\newcommand{\joint}{\prob{p}{}{\lspeakervec,\lframeset,\inframeset}}
\newcommand{\normalparams}[2]{\mathcal{N}(#1,#2)}
\newcommand{\normal}{\normalparams{\mathbf{0}}{\mathbf{I}}}
\newcommand{\hidden}[1]{\vec{h}^{(#1)}}
\newcommand{\pool}{\max}
\newcommand{\hpooled}{\hidden{\pool}}
\newcommand{\Weight}[1]{\mathbf{W}^{(#1)}}
\newcommand{\Bias}[1]{\vec{b}^{(#1)}}$$
I’ve decided to approach the inpainting problem given for our class project IFT6266 using a hierarchical variational autoencoder.
While the basic VAE only has a single latent variable, this architecture assumes the image generation process comes from a hierarchy of latent variables, each dependent on its parents. So the factorisation looks like this:
$$p(x, z_1, z_2,\dots,z_L) = p(x|z_1)p(z_1|z_2) \dots p(z_{L-1}|z_L)p(z_L)$$
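To make the top-down factorisation concrete, here is a minimal sketch of ancestral sampling and of evaluating the log joint density. It assumes scalar latents, a standard normal prior on $z_L$, and linear-Gaussian conditionals with fixed standard deviations. The names `sample_top_down`, `log_joint`, `weights` and `stds` are hypothetical, and the actual model would replace each linear map with a neural network.

```python
import math
import random


def gaussian_logpdf(value, mean, std):
    """Log density of a scalar Gaussian N(mean, std^2) at `value`."""
    return (-0.5 * math.log(2.0 * math.pi * std ** 2)
            - (value - mean) ** 2 / (2.0 * std ** 2))


def sample_top_down(weights, stds, rng):
    """Ancestral sampling: z_L ~ N(0, 1), then each child given its parent.

    weights[k] and stds[k] parameterise the k-th conditional on the way
    down, so len(weights) == L: L - 1 latent-to-latent steps plus the
    final p(x | z_1). Returns (x, [z_1, ..., z_L]).
    """
    z = rng.gauss(0.0, 1.0)          # top-level prior p(z_L), an assumption
    chain = [z]
    for w, s in zip(weights, stds):
        z = rng.gauss(w * z, s)      # child ~ N(w * parent, s^2)
        chain.append(z)
    x = chain.pop()                  # the last draw is the observation
    return x, chain[::-1]            # reorder as z_1, ..., z_L


def log_joint(x, latents, weights, stds):
    """log p(x|z_1) + sum_l log p(z_l|z_{l+1}) + log p(z_L)."""
    top_down = latents[::-1] + [x]   # z_L, ..., z_1, x
    total = gaussian_logpdf(top_down[0], 0.0, 1.0)
    for parent, child, w, s in zip(top_down, top_down[1:], weights, stds):
        total += gaussian_logpdf(child, w * parent, s)
    return total
```

For example, `sample_top_down([0.9, 0.8, 0.7], [0.5, 0.5, 0.1], random.Random(0))` draws one observation together with its three latents, and `log_joint` scores that draw one conditional at a time, mirroring the product in the factorisation.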
An architecture like this was used in the PixelVAE paper, but there they use a more complex PixelCNN structure at each layer, which I am attempting to do without. In their model, the recognition model (the encoder) is not hierarchical: the $q_\phi$ network is structured in the following way:
$$q(z_1, z_2,\dots,z_L | x) = q(z_1|x)q(z_2|x) \dots q(z_L|x)$$
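For reference, combining a top-down generative model with a fully factorised posterior of this form splits the standard evidence lower bound into a reconstruction term and one KL term per layer. This is a sketch that assumes a top-level prior $p(z_L)$ and the usual VAE bound. It is not necessarily the exact objective trained here.

$$\log p(x) \ge \mathbb{E}_{q(z_1|x)}\left[ \log p(x|z_1) \right] - \sum_{l=1}^{L-1} \mathbb{E}_{q(z_{l+1}|x)}\left[ D_{\mathrm{KL}}\left( q(z_l|x) \,\|\, p(z_l|z_{l+1}) \right) \right] - D_{\mathrm{KL}}\left( q(z_L|x) \,\|\, p(z_L) \right)$$

Each KL term compares the encoder's distribution for one layer with the generative conditional given the layer above, so the per-layer terms can be monitored separately during training.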