## Constraining Hidden Layers for Interpretability (eventually, hopefully…)

I haven’t written much this past year, so I guess as a parting post for 2015, I’d talk a little bit about the poster I presented at ASRU 2015. The bulk of the stuff’s in the paper, plus I’m still kind of unsure about the legality about putting stuff that’s in the paper on this blog post, so I think I’ll talk about the other things that didn’t make it in.

## Learning to Transduce with Unbounded Memory – The Neural Stack

DeepMind has in the past week released a paper proposing yet another approach to having a memory structure within a neural network. This time, they implement a stack, queue and a deque “data structure” within their models. While this idea is not necessarily new, it incorporates some of the broad ideas seen in the Neural Turing Machines, where they try to have a model that is end-to-end differentiable, rather than have the data structure decoupled from the training process. I have to admit I haven’t read any of these previous papers before, but it’s definitely on my to read list.

In any case, this paper claims that using these memory structures beats having an 8-layered LSTM network trained for the same task. If this is true, this may mean we finally have some justification for these fancier models — simply throwing bigger networks at problems just isn’t as efficient.

I’ve spent some time trying to puzzle out what exactly they’re trying to do here with the neural stack. I suspect once I’ve figured this out, the queue and deque will be pretty similar, so I don’t think I will go through them in the same detail.

## Generating Singlish with LSTMs

So in in the last week, Andrej Karpathy wrote a post about the current state of RNNs, and proceeded to dump a whole bunch of different kinds of text data into them to see what they learn. Training language models and then sampling from them is lots of fun, and a character-level model is extra interesting because you see it come up with new words that actually kind of mean something sometimes. Even Geoffrey Hinton has some fun with it in this talk.

So after reading through Karpathy’s code, I got a few tips about how to do this properly, as I haven’t been able to train a proper language model before.

The data is obtained from one of Singapore’s most active sub-forums on HardwareZone.com.sg, Eat-Drink-Man-Woman. I like to think it’s a… localised 4chan. So know what to expect going in. If you want to just see what the model generates, go here.

## Long Short-Term Memory

There seems to be a resurgence in using these units in the past year. They were first proposed in 1997 by Hochreiter and Schmidhuber, but, along with most neural network literature seemed to have been forgotten for a while, until work on neural networks made a comeback, and focus started shifting toward RNNs again. Some of the more interesting recent work using LSTMs have come from Schmidhuber’s student Alex Graves. Notice the spike here in 2009 when Graves first wrote about cursive handwriting recognition (and generation) using LSTMs.

## Neural Turing Machines FAQ

There’s been some interest in Neural Turing Machines paper, and I’ve been getting some questions regarding my implementation via e-mail and the comments section on this blog. I plan to make this a blog post where I’ll regularly come back and update with answers to some of these questions as they come up, so do check back!

## Learning Gaussian Feature Extractors

While playing around with the MNIST dataset and the example code, I tried to visualise the weights of the connections from the weights to the hidden layer. These can be thought of as feature extractors of the input.

## Neural Turing Machines – Copy Task

After much fiddling around with the instability of the training procedure, I still haven’t found a recipe that would get it to converge consistently.

I did find though, that training it on shorter sequences first, before letting it see longer ones avoids huge gradients that would make the parameters explode into NaNs. And that is a huge help. Doing that still does not guarantee convergence though, and I only get a good model at random, like this one I’ve trained here copying a sequence of length 10:

## Neural Turing Machines – Implementation Hell

I’ve been struggling with the implementation of the NTM for the past week and a half now.

  <p>
There are various problems that I&#8217;ve been trying to deal with. The paper is relatively sparse when it comes to details of the architecture, and a lot more brief when it comes to the training process. Alex Graves trains RNNs a lot in his work, and it seems to me some of the tricks he has used here might have been distributed through his previous work.
</p>
</div>


## Neural Turing Machines – A First Look

Some time last week, a paper from Google DeepMind caught my attention.

  <p>
The paper is of particular interest to me because I&#8217;ve been thinking about how a recurrent neural network could learn to have access to an external form of memory. The approach taken here is interesting as it makes use of a balance between seeking using similarity of content, and shifting from that using location.
</p>

<p>
My focus this time would be on some of the details needed for implementation. Some of these specifics are glossed over in the paper, and I&#8217;ll try to infer whatever I can and, perhaps in the next post, have code (in Theano, what else?) to present.
</p>

<p>