$$\newcommand{\expected}[2]{\mathbb{E}_{#1}\left[ #2 \right]}
\newcommand{\prob}[3]{{#1}_{#2} \left( #3 \right)}
\newcommand{\condprob}[4]{{#1}_{#2} \left( #3 \middle| #4 \right)}
\newcommand{\Dkl}[2]{D_{\mathrm{KL}}\left( #1 \,\middle\|\, #2 \right)}
\newcommand{\muvec}{\boldsymbol \mu}
\newcommand{\sigmavec}{\boldsymbol \sigma}
\newcommand{\uttid}{s}
\newcommand{\lspeakervec}{\vec{w}}
\newcommand{\lframevec}{\vec{z}}
\newcommand{\lframevect}{\lframevec_t}
\newcommand{\inframevec}{\vec{x}}
\newcommand{\inframevect}{\inframevec_t}
\newcommand{\inframeset}{\inframevec_1,\hdots,\inframevec_T}
\newcommand{\lframeset}{\lframevec_1,\hdots,\lframevec_T}
\newcommand{\model}[2]{\condprob{#1}{#2}{\lspeakervec,\lframeset}{\inframeset}}
\newcommand{\joint}{\prob{p}{}{\lspeakervec,\lframeset,\inframeset}}
\newcommand{\normalparams}[2]{\mathcal{N}(#1,#2)}
\newcommand{\normal}{\normalparams{\mathbf{0}}{\mathbf{I}}}
\newcommand{\hidden}[1]{\vec{h}^{(#1)}}
\newcommand{\pool}{\max}
\newcommand{\hpooled}{\hidden{\pool}}
\newcommand{\Weight}[1]{\mathbf{W}^{(#1)}}
\newcommand{\Bias}[1]{\vec{b}^{(#1)}}$$
I’ve decided to approach the inpainting problem given for our class project (IFT6266) using a hierarchical variational autoencoder.
While the basic VAE has only a single latent variable, this architecture assumes the image generation process is driven by a hierarchy of latent variables, each conditioned on the one above it. So the factorisation of the generative model looks like this:
$$p(\vec{x}, \vec{z}_1, \dots, \vec{z}_L) = p(\vec{x} \mid \vec{z}_1)\, p(\vec{z}_1 \mid \vec{z}_2) \cdots p(\vec{z}_{L-1} \mid \vec{z}_L)\, p(\vec{z}_L)$$
An architecture like this was used in the PixelVAE paper, but there the authors use a more complex PixelCNN structure at each layer, which I am attempting to do without. In their model the `recognition model`, i.e. the encoder, is not hierarchical: the $q_\phi$ network is structured in the following way:
$$q_\phi(\vec{z}_1, \vec{z}_2, \dots, \vec{z}_L \mid \vec{x}) = q_\phi(\vec{z}_1 \mid \vec{x})\, q_\phi(\vec{z}_2 \mid \vec{x}) \cdots q_\phi(\vec{z}_L \mid \vec{x})$$