## Generating Singlish with LSTMs

So in in the last week, Andrej Karpathy wrote a post about the current state of RNNs, and proceeded to dump a whole bunch of different kinds of text data into them to see what they learn. Training language models and then sampling from them is lots of fun, and a character-level model is extra interesting because you see it come up with new words that actually kind of mean something sometimes. Even Geoffrey Hinton has some fun with it in this talk.

So after reading through Karpathy’s code, I got a few tips about how to do this properly, as I haven’t been able to train a proper language model before.

The data is obtained from one of Singapore’s most active sub-forums on HardwareZone.com.sg, Eat-Drink-Man-Woman. I like to think it’s a… localised 4chan. So know what to expect going in. If you want to just see what the model generates, go here.

## Getting the data

I trained an initial model using some data that was first crawled by a friend some years back. It wasn’t really big enough, and so Jingwen took it upon to himself to write a quick scraper for EDMW as an exercise. I’m not sure about the legalities of actually posting the scraped data, so I’m just directing you to the thing that gets said data.

Comparing the two datasets, it seems EDMW 4 years back was a little more… civilised. I’ll just leave it at that. The new data makes for more hilarious generated sentences anyway, so I’m calling it a win.

## Speeding up mini-batch training by leveraging GPU capabilities

The power of the GPU in applications like ours is in the parallelism it is capable of when doing matrix multiplications. This, coupled with algorithms implemented in matrix manipulation libraries, can give a huge gain once your code is vectorised.

I previously did mini-batch training the naive way when training RNNs. This involved calculating gradients for each instance and accumulating them, and then averaging them before applying them to the model. The problem with this approach is that it’s sequential, and leaves the GPU capabilities unexploited. What I then implemented was to group sequences with similar lengths together, and then padding them to the same length using the END token. The RNN is then run over all of the sequences in the batch at once. If you’re dealing with a standard RNN with 100 hidden units and have 10 sequences per minibatch, then at every time step, you’re dealing with a 10 x 100 matrix and calculating the 10 hidden states at the next time step simultaneously.

This has required some tensor acrobatics in my LSTM implementation. Some of the tips came from the Theano usergroup.

The key points to note are:

1. Whenever something can be achieved via a matrix multiplication or indexing using a matrix of indices, do it that way.
2. Shuffle your dimensions in the tensors around so that you can achieve the above. Reshape them if you have to.
3. Keep track of which axis represents what. It’s easy to go insane once you have more than 2 dimensions.

## Implementing the neural network in Javascript

So in trying to put up the text generator for all the world to see, I decided I wasn’t going to run it as an API that people could use to abuse my tiny little instance with. So I ported the network to Javascript, and let it run on the viewer’s browser. Implementing the forward propagation was actually a pretty fun exercise, and I exploited some of Javascript’s more functional aspects to get it to somewhat behave the way numpy would.

In doing this, a lot of new arrays are created at every step of the computation due to each pairwise operation creating a new array. Some of these operations can actually be optimised as one specialised operation over 4 different elements. An example of this is the usual cell computation formula, which, if written as a series of vector multiplication and additions, creates 2 redundant arrays. Over the course of a sequence and many sequences, this incurs quite a memory footprint. I got around this by creating an applier function that takes in a function to be computed over any number of arrays, and returns a function that takes in those number of arrays:

NeuNet.applier = function(fun) {
var arg_count = fun.length;
var buffer = new Array(arg_count);
return function() {
var result = new Array(arguments[0].shape[0]);
for ( var i=0; i < result.length; i++ ) {
for ( var j=0; j < arguments.length; j++ ) {
buffer[j] = arguments[j].data[i];
}
result[i] = fun.apply(null,buffer);
}
return new NeuNet(result);
}
};


This then allows me to build a computeCell function like so:

var computeCell = NeuNet.applier(
return forget_gate * prev_cell + in_gate * cell_updates;
}
);


In this scenario only one new array is created, and one smaller buffer array is maintained over the life of the function.

The next steps to take this would be to look at using ArrayBuffers for the vectors and matrices to use those Javascript typed array views to implement splicing. That would be another reduction in memory cost.

Something more extreme would be to eventually have something like what Theano does in Javascript. Build the expression tree without computing anything first, and then we can make optimisations like the above by traversing the tree, and then subsequently calculate gradients over the expression.

The code for all of this is hosted here. I’ll be making some changes to the Javascript over the next few days, but I think it’s all good to go. Enjoy!