## Deciding When To Feedforward (or WTF gates)

Another paper of mine, titled “Towards Implicit Complexity Control using Variable-Depth DNNs for ASR Systems”, was accepted at the International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2016 in Shanghai, which took place not too long ago.

The idea behind this one was the intuition that in a classification task, some instances should be simpler to classify than others. Similarly, deciding when to stop in an RNN setting is an important problem. Take the bAbI tasks, for example: if we go a step further and assume the number of logical steps needed to arrive at the answer is not provided, then the network needs to know when it is ‘ready’ to give an answer.

Being in a lab that mostly does work related to DNN acoustic modelling for speech recognition, I figured I could pitch it as a way to reduce computation time at runtime. Silence (‘sil’) frames, I reasoned, must be pretty straightforward to classify, and a huge proportion of the frames in speech recognition are silence.

If we consider each layer in the network as a representation of the original input that is being ‘untangled’ to perform the final task of discrimination at the final layer, then it might be possible to perform the classification using any of the representations. Further, some of the lower representations might even be useful/better for certain classes, if we could only find a way to dynamically decide when to use them, and when to continue the feedforward process.

I was thinking about how some kind of gating system could work for this, one that would decide at every layer: ‘Do I output, or do I feedforward?’ Eventually Professor Sim and I arrived at a kind of cascading mechanism that satisfied two important properties:

• The value of the gate at the current layer had to be evaluable as a probability without seeing the subsequent layers, or you’d defeat the purpose
• The gates should define a distribution over layers that sums to 1

This gave me a way to frame it as a kind of marginalisation over a conditional probability (the probability of a phoneme given the stop signal ($s_l$) and the frame):
$$P(y|x) = \sum_{l=1}^L P(y|s_l,x)P(s_l|x),$$
where $L$ is the number of layers and $P(s_l|x)$ is given by,
$$P(s_l|x) = g_l \prod_{l'=1}^{l-1} (1-g_{l'})$$

In order to make this a distribution, $g_L$ is always 1. Graphically,

(the variable names are different here, this is a diagram I drew early on in the paper writing process)
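The cascading gates are easy to sketch numerically. Below is a minimal NumPy illustration (the function names and the toy values are mine, not from the paper): it turns per-layer gate activations $g_l$ into the stop distribution $P(s_l|x)$ defined above, with $g_L$ pinned to 1, and then mixes the per-layer posteriors $P(y|s_l,x)$ accordingly to get $P(y|x)$.

```python
import numpy as np

def halting_distribution(gates):
    """Turn per-layer gate values g_l in [0, 1] into the stop
    distribution P(s_l|x) = g_l * prod_{l' < l} (1 - g_{l'})."""
    g = np.array(gates, dtype=float)
    g[-1] = 1.0  # g_L is fixed to 1 so the distribution sums to 1
    # Probability of not having stopped at any layer before l
    survive = np.concatenate(([1.0], np.cumprod(1.0 - g[:-1])))
    return g * survive

def marginal_prediction(layer_posteriors, gates):
    """P(y|x) = sum_l P(y|s_l, x) P(s_l|x): mix the per-layer
    softmax outputs (shape L x K) with the halting distribution."""
    p_stop = halting_distribution(gates)
    return np.einsum('l,lk->k', p_stop, np.asarray(layer_posteriors))

# With gates [0.7, 0.5, anything], the stop distribution is
# [0.7, 0.3*0.5, 0.3*0.5] = [0.7, 0.15, 0.15], which sums to 1.
print(halting_distribution([0.7, 0.5, 0.3]))
```

Note that each $P(s_l|x)$ depends only on the gates up to layer $l$, so at test time the network can sample or threshold the stop decision layer by layer and skip the remaining computation.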

We called this model VDNN, instead of the very cool WTF gates I’d have used if I only thought of it sooner (though I’m pretty sure it’d never fly).

Indeed, there were differences in the average number of layers used depending on the class of the frame being classified:

But silence was not where I expected it to be!

Incidentally, Alex Graves at DeepMind had a very similar idea (that he actually got to work much better, and on RNNs too) which he named Adaptive Computation Time (ACT). The method we used is nothing but a footnote of failure in his paper:

Page 5:

> However experiments show that networks trained to minimise expected rather than total halting time learn to ‘cheat’…

Oh well.