Category: classification
Constraining Hidden Layers for Interpretability (eventually, hopefully…)
I haven’t written much this past year, so I guess, as a parting post for 2015, I’ll talk a little bit about the poster I presented at ASRU 2015. The bulk of the material is in the paper, and I’m still somewhat unsure about the legality of reposting it here, so I’ll talk about the things that didn’t make it in.
NLP with Neural Networks
Category: complex-structure
Deciding <u>W</u>hen <u>T</u>o <u>F</u>eedforward (or WTF gates)
Another paper of mine, titled “Towards Implicit Complexity Control using Variable-Depth DNNs for ASR Systems” got accepted to the International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2016 in Shanghai, which happened not too long ago.
The idea behind this one came from the intuition that, in a classification task, some instances should be simpler to classify than others. Similarly, deciding when to stop in an RNN setting is an important problem. Take the bAbI task, for example, and go a step further by assuming the number of logical steps needed to arrive at the answer is not provided: you then need to know when the network is ‘ready’ to give an answer.
Neural Turing Machines FAQ
There’s been some interest in the Neural Turing Machines paper, and I’ve been getting questions about my implementation via e-mail and in the comments section of this blog. I plan to make this a post that I’ll regularly come back to and update with answers as new questions come up, so do check back!
Learning Gaussian Feature Extractors
While playing around with the MNIST dataset and the example code, I tried to visualise the weights of the connections from the inputs to the hidden layer. These can be thought of as feature extractors of the input.
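A quick sketch of how such a visualisation can be done (hypothetical names; this assumes `W` is the 784 × n_hidden input-to-hidden weight matrix learned on MNIST):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_feature_extractors(W, rows=4, cols=5):
    """Reshape each column of the input-to-hidden weight matrix into a
    28x28 image and tile the first rows*cols of them into a grid."""
    fig, axes = plt.subplots(rows, cols, figsize=(cols, rows))
    for i, ax in enumerate(axes.ravel()):
        ax.imshow(W[:, i].reshape(28, 28), cmap='gray')
        ax.axis('off')
    plt.show()

# e.g. with random weights, just to check the plumbing
plot_feature_extractors(np.random.randn(784, 20))
```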
Neural Turing Machines – A First Look
Some time last week, a paper from Google DeepMind caught my attention.
The paper is of particular interest to me because I’ve been thinking about how a recurrent neural network could learn to have access to an external form of memory. The approach taken here is interesting: it balances content-based addressing (seeking by similarity of content) with location-based shifts away from that position.
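Roughly, my reading of the two addressing mechanisms, sketched in numpy (this is just my interpretation of the paper, not a faithful reference implementation):

```python
import numpy as np

def content_weights(memory, key, beta):
    """Content-based addressing: cosine similarity between the key and each
    memory row, sharpened by beta and normalised with a softmax."""
    sims = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
    w = np.exp(beta * sims)
    return w / w.sum()

def shifted_weights(w, shift):
    """Location-based addressing: blend the weights with soft circular
    shifts, where `shift` is a distribution over offsets (-1, 0, +1)."""
    out = np.zeros_like(w)
    for offset, s in zip((-1, 0, 1), shift):
        out += s * np.roll(w, offset)
    return out

w = content_weights(np.random.randn(8, 4), np.random.randn(4), beta=5.0)
w = shifted_weights(w, shift=[0.1, 0.8, 0.1])
```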
My focus this time will be on some of the details needed for implementation. Some of these specifics are glossed over in the paper, so I’ll try to infer whatever I can and, perhaps in the next post, have code (in Theano, what else?) to present.
Recursive Auto-encoders: An Introduction
I’ve talked a little bit about recursive auto-encoders a couple of posts ago. In deep learning lingo, an auto-encoder network usually refers to an architecture that takes in an input vector and, through a series of transformations, is trained to reproduce that input at its prediction layer. The reason for doing this is to extract features that describe the input. One might think of it as a form of compression: if the network is asked to reproduce an input after passing it through hidden layers with far fewer neurons than the input layer, then some sort of compression has to happen for it to produce a good reconstruction. So let’s consider the above network: 8 inputs, 8 outputs, and 3 neurons in the hidden layer. If we feed the network a one-hot encoding of 1 to 8 (setting only the neuron corresponding to the input to 1), and insist that that input be reconstructed at the output layer, guess what happens?
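Here’s a minimal Theano sketch of that 8-3-8 experiment (the sizes match the figure, but the hyperparameters are just my guesses, not the exact setup from the post):

```python
import numpy as np
import theano
import theano.tensor as T

rng = np.random.RandomState(0)
floatX = theano.config.floatX
shared = lambda a: theano.shared(np.asarray(a, dtype=floatX))
lr = np.asarray(0.5, dtype=floatX)

x = T.matrix('x')                                # rows are one-hot vectors of length 8
W1, b1 = shared(0.1 * rng.randn(8, 3)), shared(np.zeros(3))
W2, b2 = shared(0.1 * rng.randn(3, 8)), shared(np.zeros(8))
params = [W1, b1, W2, b2]

h = T.nnet.sigmoid(T.dot(x, W1) + b1)            # 3-neuron bottleneck
x_hat = T.nnet.sigmoid(T.dot(h, W2) + b2)        # reconstruction
cost = T.nnet.binary_crossentropy(x_hat, x).mean()

updates = [(p, p - lr * T.grad(cost, p)) for p in params]
train = theano.function([x], cost, updates=updates)

data = np.eye(8).astype(floatX)                  # the eight one-hot inputs
for _ in range(5000):
    train(data)
```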
Category: computer-science
Computing Log Normal for Isotropic Gaussians
Neural Turing Machines FAQ
There’s been some interest in the Neural Turing Machines paper, and I’ve been getting questions about my implementation via e-mail and in the comments section of this blog. I plan to make this a post that I’ll regularly come back to and update with answers as new questions come up, so do check back!
Category: cost-function
Constraining Hidden Layers for Interpretability (eventually, hopefully…)
I haven’t written much this past year, so I guess, as a parting post for 2015, I’ll talk a little bit about the poster I presented at ASRU 2015. The bulk of the material is in the paper, and I’m still somewhat unsure about the legality of reposting it here, so I’ll talk about the things that didn’t make it in.
Learning Gaussian Feature Extractors
While playing around with the MNIST dataset and the example code, I tried to visualise the weights of the connections from the inputs to the hidden layer. These can be thought of as feature extractors of the input.
Connectionist Temporal Classification (CTC) with Theano
This will be the first time I’m presenting code I’ve written in an IPython notebook. The style’s different, but I think I’ll permanently switch to this method of presentation for code-intensive posts from now on. A nifty little tool that makes doing this so convenient is ipy2wp, which uses WordPress’s XML-RPC interface to post the HTML directly to the platform.
In any case, I’ve started working with the NUS School of Computing speech recognition group, who have been using deep neural networks to classify audio frames into phonemes. This requires a preprocessing step that aligns the audio frames to phonemes, in order to reduce the task to a simple classification problem.
CTC describes a way to compute the probability of a phoneme sequence given a sequence of audio frames, accounting for all possible alignments. We can then define an objective function that maximises the probability of the phoneme sequence given the audio frame sequence in the training data.
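To make the “accounting for all possible alignments” part concrete, here’s a toy numpy sketch of the CTC forward pass (kept in the probability domain for readability; a real implementation works in log space for numerical stability):

```python
import numpy as np

def ctc_likelihood(probs, labels, blank=0):
    """probs: (T, K) per-frame softmax outputs; labels: target phoneme
    sequence, e.g. [3, 1, 4]. Returns p(labels | audio frames), summed
    over every alignment that collapses to `labels`."""
    ext = [blank]                            # interleave blanks: [b, l1, b, l2, b, ...]
    for l in labels:
        ext += [l, blank]
    T, S = probs.shape[0], len(ext)

    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, ext[0]]
    alpha[0, 1] = probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]
            if s > 0:
                a += alpha[t - 1, s - 1]
            # a blank may be skipped, unless the neighbouring labels repeat
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]
            alpha[t, s] = a * probs[t, ext[s]]
    return alpha[T - 1, S - 1] + alpha[T - 1, S - 2]
```

The training objective is then just the negative log of this quantity, summed over the training set.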
Category: fancy-penalty-terms
Constraining Hidden Layers for Interpretability (eventually, hopefully…)
I haven’t written much this past year, so I guess, as a parting post for 2015, I’ll talk a little bit about the poster I presented at ASRU 2015. The bulk of the material is in the paper, and I’m still somewhat unsure about the legality of reposting it here, so I’ll talk about the things that didn’t make it in.
Connectionist Temporal Classification (CTC) with Theano
This will be the first time I’m presenting code I’ve written in an IPython notebook. The style’s different, but I think I’ll permanently switch to this method of presentation for code-intensive posts from now on. A nifty little tool that makes doing this so convenient is ipy2wp, which uses WordPress’s XML-RPC interface to post the HTML directly to the platform.
In any case, I’ve started working with the NUS School of Computing speech recognition group, who have been using deep neural networks to classify audio frames into phonemes. This requires a preprocessing step that aligns the audio frames to phonemes, in order to reduce the task to a simple classification problem.
CTC describes a way to compute the probability of a phoneme sequence given a sequence of audio frames, accounting for all possible alignments. We can then define an objective function that maximises the probability of the phoneme sequence given the audio frame sequence in the training data.
Category: hack
Computing Log Normal for Isotropic Gaussians
Category: ift6266
Details of the Hierarchical VAE (Part 1)
The motivation to use a hierarchical architecture for this task was two-fold:
- A vanilla encoder-decoder architecture would be the basic deep learning go-to model for this task. However, if we perform maximum likelihood, the noise is modelled only at the pixel level. This seems inappropriate, as it implies there is one “right answer”, up to some pixel colour variation, given an outer context. The hierarchical VAE models different uncertainties at different levels of abstraction, so it seems like a good fit.
- I wanted to investigate how the hierarchical factorisation of the latent variables affects learning in such a model. It turns out certain layers overfit (or at least that’s my diagnosis of the problem), and I’m unsure how to remedy it.
Samples from the Hierarchical VAE
Each of the following plots shows samples from the conditional VAE that I’m using for the inpainting task. As expected of results from a VAE, they’re blurry. However, the fun thing about having a hierarchy of latent variables is that I can freeze all the layers except one, and vary only that layer to see the type of noise it models. The pictures are generated by using $\mu_{z_l}(z_{l-1})$ for all layers except the $i$-th layer.
[Figure: samples varying only layer $i=1$]
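A rough sketch of that sampling procedure (the `mu_fns`/`sigma_fns` below are hypothetical stand-ins for the per-layer decoder networks, not the actual model code):

```python
import numpy as np

def sample_varying_layer(mu_fns, sigma_fns, z_top, i):
    """Walk down the hierarchy taking the mean mu at every layer except the
    i-th (counted from the top), where we actually sample, to isolate the
    kind of noise that layer models."""
    z = z_top
    for l, (mu_fn, sigma_fn) in enumerate(zip(mu_fns, sigma_fns), start=1):
        m = mu_fn(z)
        if l == i:
            z = m + sigma_fn(z) * np.random.randn(*m.shape)
        else:
            z = m
    return z  # the final step produces the image
```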
Hierarchical Variational Autoencoders
$$\newcommand{\expected}[2]{\mathbb{E}_{#1}\left[ #2 \right]}
\newcommand{\prob}[3]{{#1}_{#2} \left( #3 \right)}
\newcommand{\condprob}[4]{{#1}_{#2} \left( #3 \middle| #4 \right)}
\newcommand{\Dkl}[2]{D_{\mathrm{KL}}\left( #1 | #2 \right)}
\newcommand{\muvec}{\boldsymbol \mu}
\newcommand{\sigmavec}{\boldsymbol \sigma}
\newcommand{\uttid}{s}
\newcommand{\lspeakervec}{\vec{w}}
\newcommand{\lframevec}{\vec{z}}
\newcommand{\lframevect}{\lframevec_t}
\newcommand{\inframevec}{\vec{x}}
\newcommand{\inframevect}{\inframevec_t}
\newcommand{\inframeset}{\inframevec_1,\hdots,\inframevec_T}
\newcommand{\lframeset}{\lframevec_1,\hdots,\lframevec_T}
\newcommand{\model}[2]{\condprob{#1}{#2}{\lspeakervec,\lframeset}{\inframeset}}
\newcommand{\joint}{\prob{p}{}{\lspeakervec,\lframeset,\inframeset}}
\newcommand{\normalparams}[2]{\mathcal{N}(#1,#2)}
\newcommand{\normal}{\normalparams{\mathbf{0}}{\mathbf{I}}}
\newcommand{\hidden}[1]{\vec{h}^{(#1)}}
\newcommand{\pool}{\max}
\newcommand{\hpooled}{\hidden{\pool}}
\newcommand{\Weight}[1]{\mathbf{W}^{(#1)}}
\newcommand{\Bias}[1]{\vec{b}^{(#1)}}$$
I’ve decided to approach the inpainting problem given for our class project IFT6266 using a hierarchical variational autoencoder.
While the basic VAE has only a single latent variable, this architecture assumes the image generation process comes from a hierarchy of latent variables, each dependent on its parent. So the factorisation looks like this:
$$p(x, z_1, z_2,\dots,z_L) = p(x|z_1)\,p(z_1|z_2) \cdots p(z_{L-1}|z_L)\,p(z_L)$$
An architecture like this was used in the PixelVAE paper, but there they use a more complex PixelCNN structure at each layer, which I am attempting to do without. In their model, the recognition model (the encoder) is not hierarchical; the $q_\phi$ network is structured in the following way:
$$q(z_1, z_2,\dots,z_L | x) = q(z_1|x)q(z_2|x) \dots q(z_L|x)$$
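To make the generative side of that factorisation concrete, here’s a small numpy sketch of ancestral sampling down the chain (with random affine maps standing in for the decoder networks, purely for illustration):

```python
import numpy as np

def ancestral_sample(layers, z_top):
    """layers: (mu_fn, sigma_fn) pairs for z_L -> z_{L-1} -> ... -> z_1 -> x."""
    z = z_top                                    # z_L ~ N(0, I)
    for mu_fn, sigma_fn in layers:
        m, s = mu_fn(z), sigma_fn(z)
        z = m + s * np.random.randn(*m.shape)    # z_{l-1} ~ N(mu(z_l), sigma(z_l))
    return z                                     # the final draw is x

# toy usage: two latent layers of sizes 2 and 4, and an 8-dimensional "image"
rng = np.random.RandomState(0)
dims = [2, 4, 8]
layers = [(lambda z, W=rng.randn(d_out, d_in): W @ z,
           lambda z, d=d_out: 0.1 * np.ones(d))
          for d_in, d_out in zip(dims[:-1], dims[1:])]
x = ancestral_sample(layers, z_top=rng.randn(dims[0]))
```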
Category: natural-language-processing
NLP with Neural Networks
Category: neural-turing-machines
Neural Turing Machines FAQ
There’s been some interest in the Neural Turing Machines paper, and I’ve been getting questions about my implementation via e-mail and in the comments section of this blog. I plan to make this a post that I’ll regularly come back to and update with answers as new questions come up, so do check back!
Neural Turing Machines – Copy Task
After much fiddling around with the instability of the training procedure, I still haven’t found a recipe that would get it to converge consistently.
I did find, though, that training it on shorter sequences first, before letting it see longer ones, avoids the huge gradients that make the parameters explode into NaNs. And that is a huge help. Doing that still does not guarantee convergence, though, and I only get a good model at random, like this one I’ve trained here copying a sequence of length 10:
Neural Turing Machines – Implementation Hell
I’ve been struggling with the implementation of the NTM for the past week and a half now.
There are various problems that I’ve been trying to deal with. The paper is relatively sparse when it comes to details of the architecture, and even briefer when it comes to the training process. Alex Graves trains RNNs a lot in his work, and it seems to me some of the tricks used here may be scattered across his previous papers.
Neural Turing Machines – A First Look
Some time last week, a paper from Google DeepMind caught my attention.
The paper is of particular interest to me because I’ve been thinking about how a recurrent neural network could learn to have access to an external form of memory. The approach taken here is interesting: it balances content-based addressing (seeking by similarity of content) with location-based shifts away from that position.
My focus this time will be on some of the details needed for implementation. Some of these specifics are glossed over in the paper, so I’ll try to infer whatever I can and, perhaps in the next post, have code (in Theano, what else?) to present.
Category: recurrent-neural-networks
Deciding <u>W</u>hen <u>T</u>o <u>F</u>eedforward (or WTF gates)
Another paper of mine, titled “Towards Implicit Complexity Control using Variable-Depth DNNs for ASR Systems” got accepted to the International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2016 in Shanghai, which happened not too long ago.
The idea behind this one came from the intuition that, in a classification task, some instances should be simpler to classify than others. Similarly, deciding when to stop in an RNN setting is an important problem. Take the bAbI task, for example, and go a step further by assuming the number of logical steps needed to arrive at the answer is not provided: you then need to know when the network is ‘ready’ to give an answer.
Learning to Transduce with Unbounded Memory – The Neural Stack
DeepMind has in the past week released a paper proposing yet another approach to having a memory structure within a neural network. This time, they implement stack, queue, and deque “data structures” within their models. While the idea is not necessarily new, it incorporates one of the broad ideas seen in the Neural Turing Machines: the model is end-to-end differentiable, rather than having the data structure decoupled from the training process. I have to admit I haven’t read any of those previous papers, but they’re definitely on my to-read list.
In any case, this paper claims that using these memory structures beats an 8-layer LSTM network trained on the same task. If this is true, it may mean we finally have some justification for these fancier models: simply throwing bigger networks at problems just isn’t as efficient.
I’ve spent some time trying to puzzle out what exactly they’re trying to do here with the neural stack. I suspect once I’ve figured this out, the queue and deque will be pretty similar, so I don’t think I will go through them in the same detail.
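For what it’s worth, here’s a numpy sketch of my current reading of the stack update (push strength d, pop strength u, read vector r); treat it as my interpretation rather than the paper’s reference implementation:

```python
import numpy as np

def neural_stack_step(V, s, v, d, u):
    """V: (n, dim) stored values, s: (n,) strengths, v: (dim,) new value,
    d: push strength in [0, 1], u: pop strength in [0, 1]."""
    relu = lambda a: np.maximum(a, 0.0)

    # pop: remove up to u units of strength, starting from the top of the stack
    new_s = np.zeros(len(s) + 1)
    for i in range(len(s)):
        above = s[i + 1:].sum()
        new_s[i] = relu(s[i] - relu(u - above))
    new_s[-1] = d                        # push the new value with strength d
    new_V = np.vstack([V, v])

    # read: a weighted sum of the values nearest the top, totalling at most 1
    r = np.zeros_like(v)
    for i in range(len(new_s)):
        above = new_s[i + 1:].sum()
        r += min(new_s[i], relu(1.0 - above)) * new_V[i]
    return new_V, new_s, r

# toy usage: push one 3-dimensional value onto an empty stack, then peek
V, s = np.zeros((0, 3)), np.zeros(0)
V, s, r = neural_stack_step(V, s, np.array([1.0, 0.0, 0.0]), d=1.0, u=0.0)
```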
Generating Singlish with LSTMs
So in the last week, Andrej Karpathy wrote a post about the current state of RNNs, and proceeded to dump a whole bunch of different kinds of text data into them to see what they learn. Training language models and then sampling from them is lots of fun, and a character-level model is extra interesting because you see it come up with new words that sometimes actually kind of mean something. Even Geoffrey Hinton has some fun with it in this talk.
So after reading through Karpathy’s code, I got a few tips about how to do this properly, as I haven’t been able to train a proper language model before.
The data is obtained from one of Singapore’s most active sub-forums on HardwareZone.com.sg, Eat-Drink-Man-Woman. I like to think it’s a… localised 4chan. So know what to expect going in. If you want to just see what the model generates, go here.
Long Short-Term Memory
There seems to be a resurgence in the use of these units over the past year. They were first proposed in 1997 by Hochreiter and Schmidhuber but, along with most neural network literature, seemed to have been forgotten for a while, until work on neural networks made a comeback and focus started shifting toward RNNs again. Some of the more interesting recent work using LSTMs has come from Schmidhuber’s student Alex Graves. Notice the spike here in 2009, when Graves first wrote about cursive handwriting recognition (and generation) using LSTMs.
Neural Turing Machines – Copy Task
After much fiddling around with the instability of the training procedure, I still haven’t found a recipe that would get it to converge consistently.
I did find, though, that training it on shorter sequences first, before letting it see longer ones, avoids the huge gradients that make the parameters explode into NaNs. And that is a huge help. Doing that still does not guarantee convergence, though, and I only get a good model at random, like this one I’ve trained here copying a sequence of length 10:
Neural Turing Machines – Implementation Hell
I’ve been struggling with the implementation of the NTM for the past week and a half now.
There are various problems that I’ve been trying to deal with. The paper is relatively sparse when it comes to details of the architecture, and even briefer when it comes to the training process. Alex Graves trains RNNs a lot in his work, and it seems to me some of the tricks used here may be scattered across his previous papers.
Neural Turing Machines – A First Look
Some time last week, a paper from Google DeepMind caught my attention.
The paper is of particular interest to me because I’ve been thinking about how a recurrent neural network could learn to have access to an external form of memory. The approach taken here is interesting: it balances content-based addressing (seeking by similarity of content) with location-based shifts away from that position.
My focus this time will be on some of the details needed for implementation. Some of these specifics are glossed over in the paper, so I’ll try to infer whatever I can and, perhaps in the next post, have code (in Theano, what else?) to present.
NLP with Neural Networks
Recursive Auto-encoders: Momentum
In the previous post, we wrote the code for the RAE using the Theano library, but it wasn’t successful at the simple task of reversing a randomised sequence of 1 to 8. One of the tricks we can use when dealing with time-sequence data is a small learning rate combined with momentum. I’ll discuss what momentum is, and show a simple way momentum can be implemented in Theano.
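As a preview, a minimal sketch of classical momentum written as Theano updates (the parameter names here are mine, not necessarily the ones used in the post):

```python
import numpy as np
import theano
import theano.tensor as T

def momentum_updates(cost, params, lr=0.01, mu=0.9):
    """For each parameter, keep a velocity that accumulates a decayed sum of
    past gradients, then step the parameter along that velocity."""
    lr = np.asarray(lr, dtype=theano.config.floatX)
    mu = np.asarray(mu, dtype=theano.config.floatX)
    updates = []
    for p in params:
        v = theano.shared(np.zeros_like(p.get_value()))
        v_new = mu * v - lr * T.grad(cost, p)
        updates.append((v, v_new))       # update the velocity
        updates.append((p, p + v_new))   # move the parameter along it
    return updates
```

The returned list is meant to be passed as the `updates` argument of `theano.function`.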
Recursive Auto-encoders: An Introduction
I’ve talked a little bit about recursive auto-encoders a couple of posts ago. In deep learning lingo, an auto-encoder network usually refers to an architecture that takes in an input vector and, through a series of transformations, is trained to reproduce that input at its prediction layer. The reason for doing this is to extract features that describe the input. One might think of it as a form of compression: if the network is asked to reproduce an input after passing it through hidden layers with far fewer neurons than the input layer, then some sort of compression has to happen for it to produce a good reconstruction. So let’s consider the above network: 8 inputs, 8 outputs, and 3 neurons in the hidden layer. If we feed the network a one-hot encoding of 1 to 8 (setting only the neuron corresponding to the input to 1), and insist that that input be reconstructed at the output layer, guess what happens?
Remembering sequences (poorly) with RNNs
I’ve had a project going recently that aims to train a recurrent network to memorise and repeat sequences of characters. It’s here, and it hasn’t been going really well, but I thought I’d share a little about why I wanted to do this and why I thought it might work.
Category: reinforcement-learning
Learning about reinforcement learning, with Tetris
For our final assignment in the NUS Introduction to Artificial Intelligence class (CS3243), we were asked to design a Tetris-playing agent. The goal of the assignment was to get students familiar with the idea of heuristics and how they work, by having them manually tune features to produce a reasonably intelligent agent. However, the professor included this in the assignment folder, which made me think we had to implement the Least-Squares Policy Iteration (LSPI) algorithm for the task.
I’ll probably discuss LSPI in more detail in another post, but for now, here are the useful features we found for anyone trying to do the same thing.
Category: uncategorized
Can Chinese Rooms Think?
There’s a tendency as a machine learning or CS researcher to get drawn into philosophical debates about whether machines will ever be able to think like humans. The argument goes so far back that the people who started the field had to grapple with it. It’s also fun to think about, especially with sci-fi always portraying AI-versus-human world-ending apocalypse showdowns, and humans always prevailing because of love or friendship or humanity.
But there’s a tendency for people in such a debate to wind up talking past each other.
Dropout using Theano
A month ago I tried my hand at the Higgs Boson Challenge on Kaggle. I tried an approach using neural networks that got me pretty far initially, but other techniques seem to have won out.
“It’s like Hinton diagrams, but for the terminal.”
Finding Maximum Dot (or Inner) Product
A problem that often arises in machine learning tasks is trying to find a row in a matrix that gives the highest dot product given a query vector. Some examples of such situations:
- You’ve performed some kind of matrix factorisation for collaborative filtering in, say, a movie recommendation system, and now, given a new user, you want to pick out a few movies your system predicts they would rate highly.
- A neural network where the final softmax predictive layer is huge (but you managed to train it, somehow).
In both these cases, the problem boils down to trying to search a collection of vectors to find the one that gives the highest (or the $k$ highest) dot product(s).
A simple way to do this would be to perform a matrix multiplication, and then to find the best scoring vector by scanning through the values. This is effectively performing $N$ dot product computations for a matrix with $N$ rows. Can we do better?
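Here’s what that brute-force baseline looks like in numpy (a sketch with hypothetical names):

```python
import numpy as np

def top_k_dot_products(W, q, k=5):
    """One matrix-vector product gives all N scores at once; then pick out
    the k best-scoring rows and return their indices, best first."""
    scores = W @ q
    top = np.argpartition(-scores, k)[:k]     # unordered top-k, O(N)
    return top[np.argsort(-scores[top])]      # sort just those k

# e.g. 10000 item vectors of dimension 64, one query vector
best = top_k_dot_products(np.random.randn(10000, 64), np.random.randn(64))
```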
March Madness with Theano
I’m not particularly familiar with NCAA Men’s Division I Basketball Championship, but I’ve seen the March Machine Learning Madness challenge come up for a few years now, and I’ve decided to try my hand at it today.
I also haven’t tried a machine learning task quite like this one before. At its simplest (assuming you don’t harvest more data about each team and their players), all you have is a set of game data: who won, who lost, and their respective scores. Intuitively, we should be able to look at tables like these and get a rough sense of who the better teams are. But how do we model it as a machine learning problem?
Naive Bayes Categorisation (with some help from Elasticsearch)
Back in November, I gave a talk during one of the Friday Hackers and Painters sessions at Plug-in@Block 71, aptly titled “How I do categorisation and some naive bayes sh*t” by Calvin Cheng. I promised I’d write a follow-up blog post with the materials I presented during the talk, so here it is.
My Quora Codesprint Submission
(this is x-posted on Quora)
I’ve had some experience in the past with machine learning, but I feel like I still don’t have a proper methodology. I’d like to hear what you guys think about what I’ve done here.
Category: updating-rule
Implementing AdaDelta
The end of this post (I don’t know where the article is now; I can’t find it) had a diagram showing the improvements of AdaDelta over standard SGD and AdaGrad, so I decided to look up what AdaDelta actually does. The details, including its “derivation”, are written in the paper. It’s basically an improvement over AdaGrad, using rolling averages of the squared gradients and also multiplying by the RMS of the rolling average of changes to the weight.
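As a sketch of the update being described (my own numpy transcription of the rules in the paper, so take the details with a grain of salt):

```python
import numpy as np

def adadelta_step(param, grad, acc_g2, acc_dx2, rho=0.95, eps=1e-6):
    """One AdaDelta update: a rolling average of squared gradients in the
    denominator, scaled by the RMS of the rolling average of past updates."""
    acc_g2 = rho * acc_g2 + (1 - rho) * grad ** 2            # E[g^2]
    dx = -np.sqrt(acc_dx2 + eps) / np.sqrt(acc_g2 + eps) * grad
    acc_dx2 = rho * acc_dx2 + (1 - rho) * dx ** 2            # E[dx^2]
    return param + dx, acc_g2, acc_dx2
```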
Recursive Auto-encoders: Momentum
In the previous post, we wrote the code for RAE using the Theano library, but it wasn’t successful in performing the simple task of reversing a randomised sequence of 1 to 8. One of the tricks we can use for dealing with time sequence data is to use a small learning rate, along with momentum. I’ll be discussing what momentum is, and showing a simple way momentum can be implemented in Theano.