
## Can Chinese Rooms Think?

As a machine learning or CS researcher, it’s tempting to get drawn into philosophical debates about whether machines will ever be able to think like humans. The argument goes back far enough that the people who started the field had to grapple with it. It’s also fun to think about, especially since sci-fi keeps portraying world-ending AI-versus-human showdowns in which humans prevail through love, friendship, or humanity.

However, there’s a tendency for people in such a debate to wind up talking past each other.

## Details of the Hierarchical VAE (Part 2)

So, as a recap, here’s a (badly drawn) diagram of the architecture:

## Details of the Hierarchical VAE (Part 1)

The motivation to use a hierarchical architecture for this task was two-fold:

1. A vanilla encoder-decoder architecture would be the default deep learning model for this task. However, if we train it by maximum likelihood, the only noise it models is at the pixel level. This seems inappropriate, as it implies there is one “right answer” given an outer context, up to some variation in pixel colour. The hierarchical VAE models uncertainty at different levels of abstraction, so it seems like a good fit.
2. I wanted to investigate how the hierarchical factorisation of the latent variables affects learning in such a model. It turns out certain layers overfit (or at least I would diagnose the problem as overfitting), and I’m unsure how to remedy this.

## Samples from the Hierarchical VAE

Each of the following plots shows samples from the conditional VAE that I’m using for the inpainting task. As expected of results from a VAE, they’re blurry. However, the fun thing about having a hierarchy of latent variables is that I can freeze all the layers except one and vary just that one to see the type of noise it models. The pictures are generated by using the mean $\mu_{z_l}(z_{l-1})$ for every layer except the $i$-th, which is sampled.
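The procedure can be sketched in a few lines of NumPy. This is only a toy illustration, not the model used for the plots: the helper names (`make_layer`, `sample_varying_layer`), the random linear-plus-`tanh` means, the fixed standard deviation of 0.5, and the layer sizes are all assumptions made up for the example. The only part taken from the text is the idea of passing the conditional mean through every layer except the $i$-th, where a sample is drawn instead.

```python
import numpy as np

def make_layer(rng, d_in, d_out):
    """Hypothetical conditional Gaussian layer: returns mu(z) and sigma(z)."""
    W = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)
    mu = lambda z: np.tanh(W @ z)            # toy stand-in for a learned mean
    sigma = lambda z: 0.5 * np.ones(d_out)   # toy fixed standard deviation
    return mu, sigma

def sample_varying_layer(layers, z0, vary, rng):
    """Propagate z0 through the hierarchy, sampling only at layer `vary`.

    Layers are numbered from 1. Every other layer is "frozen" by passing
    its conditional mean forward. With vary=None nothing is sampled.
    """
    z = z0
    for l, (mu, sigma) in enumerate(layers, start=1):
        mean = mu(z)
        if l == vary:
            # Reparameterised Gaussian sample at the chosen layer.
            z = mean + sigma(z) * rng.standard_normal(mean.shape)
        else:
            z = mean
    return z

rng = np.random.default_rng(0)
dims = [8, 6, 4, 4]  # assumed sizes: conditioning input, then three layers
layers = [make_layer(rng, dims[k], dims[k + 1]) for k in range(3)]
z0 = rng.standard_normal(dims[0])  # stand-in for the conditioning input

# Five outputs that differ only through the noise injected at layer 2.
samples = [sample_varying_layer(layers, z0, vary=2, rng=rng) for _ in range(5)]
```

Repeating the last line with a different `vary` gives one set of samples per layer, which is how each group of plots would be produced in this toy setting.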

$i=1$

## Hierarchical Variational Autoencoders

$$\newcommand{\expected}[2]{\mathbb{E}_{#1}\left[ #2 \right]} \newcommand{\prob}[3]{{#1}_{#2} \left( #3 \right)} \newcommand{\condprob}[4]{{#1}_{#2} \left( #3 \middle| #4 \right)} \newcommand{\Dkl}[2]{D_{\mathrm{KL}}\left( #1 \| #2 \right)} \newcommand{\muvec}{\boldsymbol \mu} \newcommand{\sigmavec}{\boldsymbol \sigma} \newcommand{\uttid}{s} \newcommand{\lspeakervec}{\vec{w}} \newcommand{\lframevec}{\vec{z}} \newcommand{\lframevect}{\lframevec_t} \newcommand{\inframevec}{\vec{x}} \newcommand{\inframevect}{\inframevec_t} \newcommand{\inframeset}{\inframevec_1,\hdots,\inframevec_T} \newcommand{\lframeset}{\lframevec_1,\hdots,\lframevec_T} \newcommand{\model}[2]{\condprob{#1}{#2}{\lspeakervec,\lframeset}{\inframeset}} \newcommand{\joint}{\prob{p}{}{\lspeakervec,\lframeset,\inframeset}} \newcommand{\normalparams}[2]{\mathcal{N}(#1,#2)} \newcommand{\normal}{\normalparams{\mathbf{0}}{\mathbf{I}}} \newcommand{\hidden}[1]{\vec{h}^{(#1)}} \newcommand{\pool}{\max} \newcommand{\hpooled}{\hidden{\pool}} \newcommand{\Weight}[1]{\mathbf{W}^{(#1)}} \newcommand{\Bias}[1]{\vec{b}^{(#1)}}$$
I’ve decided to approach the inpainting problem set for our IFT6266 class project using a hierarchical variational autoencoder.

While the basic VAE has only a single latent variable, this architecture assumes the image generation process comes from a hierarchy of latent variables, each dependent on its parent. So the joint distribution factorises like this:

$$p(x, z_1, z_2,\dots,z_L) = p(x|z_1)p(z_1|z_2) \dots p(z_{L-1}|z_L)p(z_L)$$

An architecture like this was used in the PixelVAE paper, but there they use a more complex PixelCNN structure at each layer, which I am attempting to do without. In their model, the recognition model (the encoder) is not hierarchical; the $q_\phi$ network factorises as follows:
$$q(z_1, z_2,\dots,z_L | x) = q(z_1|x)q(z_2|x) \dots q(z_L|x)$$
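
To see how the two factorisations fit together, here is the variational lower bound they would give, written out as a sketch. It assumes the generative chain is completed by a prior $p(z_L)$ on the top layer and that $q$ factorises over layers exactly as above; it is the standard bound, not necessarily the exact objective used in the project.

$$\log p(x) \ge \mathbb{E}_{q(z_1,\dots,z_L|x)}\left[ \log p(x|z_1) + \sum_{l=1}^{L-1} \log p(z_l|z_{l+1}) + \log p(z_L) - \sum_{l=1}^{L} \log q(z_l|x) \right]$$

The first term is the reconstruction term, and the remaining terms compare each layer of the generative hierarchy against the corresponding factor of the encoder.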

## Deciding When To Feedforward (or WTF gates)

Another paper of mine, titled “Towards Implicit Complexity Control using Variable-Depth DNNs for ASR Systems”, was accepted to the International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2016 in Shanghai, which took place not too long ago.

The idea behind this one came from the intuition that, in a classification task, some instances should be simpler to classify than others. Deciding when to stop is similarly important in an RNN setting. Take the bAbI tasks, for example: if we go a step further and assume the number of logical steps needed to arrive at the answer is not provided, then the network needs to know when it is ‘ready’ to give an answer.