## Dropout using Theano

A month ago I tried my hand at the Higgs Boson Challenge on Kaggle. I tried an approach using neural networks that got me pretty far initially, but other techniques seem to have won out.

### Update

Seeing how much traffic this particular post gets, I’d like to update it after playing around with some of the functions available. I’m now convinced the following is a better way to do dropout given a batch:

```python
hidden1 = T.nnet.relu(T.dot(X, W_input_to_hidden1) + b_hidden1)
if training:
    hidden1 = T.switch(srng.binomial(size=hidden1.shape, p=0.5), hidden1, 0)
else:
    hidden1 = 0.5 * hidden1
```

Using `T.switch` seems to be much faster than multiplying by the mask, and `T.nnet.relu` now exists for the ReLU activation function. Enjoy!
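As a plain-NumPy sketch (not the Theano code itself), the two formulations can be compared directly: `T.switch(mask, h, 0)` behaves like `np.where`, while the original version multiplies by the 0/1 mask, and both produce the same masked activations:

```python
import numpy as np

rng = np.random.default_rng(1)

h = rng.random(8).astype(np.float32)           # stand-in hidden activations
mask = rng.binomial(n=1, p=0.5, size=h.shape)  # 0/1 dropout mask

via_switch = np.where(mask, h, 0)  # T.switch-style selection
via_multiply = h * mask            # mask-multiply style

# For a binary mask, the two give identical results.
```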

### Original Post

The model is a simple neural network with one hidden layer. I started with a sigmoid activation in the hidden layer, and then eventually switched to rectified linear units (ReLUs). One simple way to implement this in Theano is to first do the linear step, and then set all negative values to 0.

```python
hidden1 = T.dot(X, W_input_to_hidden1) + b_hidden1  # linear step
hidden1 = hidden1 * (hidden1 > 0)                   # sets negative values to 0
```

Doing all of that got me pretty far, but I decided to see if I could implement the dropout technique in a simple way. It turns out it isn't too hard. Theano provides `RandomStreams`, which can be used to sample from various distributions. The approach I use is to sample a vector of the same size as the hidden layer from a binomial distribution with a probability of 0.5. Multiplying this with the ReLU outputs randomly sets half of them to 0, acting like a random mask, which is exactly what dropout needs. When the gradient is calculated, since the dropped values are not fed forward to the output layer, there is no error to backpropagate through those units, and everything works as it should.
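A small NumPy sketch (again, an illustration rather than the Theano code) of the masking idea: a binomial sample acts as a random 0/1 mask over the hidden layer, and since the gradient of `hidden * mask` with respect to `hidden` is just the mask, dropped units receive no error signal during backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)

hidden = np.array([0.3, 1.2, 0.0, 2.5, 0.7])        # example ReLU outputs
mask = rng.binomial(n=1, p=0.5, size=hidden.shape)  # random 0/1 mask

dropped = hidden * mask               # masked activations fed forward
grad_wrt_hidden = mask.astype(float)  # d(dropped)/d(hidden), elementwise
```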

Putting it all together, the function that constructs the network looks like this:

```python
def build_network(input_size, hidden_size):
    srng = RandomStreams(seed=12345)

    X = T.fmatrix('X')
    W_input_to_hidden1 = U.create_shared(U.initial_weights(input_size, hidden_size))
    b_hidden1 = U.create_shared(U.initial_weights(hidden_size))
    W_hidden1_to_output = U.create_shared(U.initial_weights(hidden_size))
    b_output = U.create_shared(U.initial_weights(1)[0])

    def network(training):
        hidden1 = T.dot(X, W_input_to_hidden1) + b_hidden1
        hidden1 = hidden1 * (hidden1 > 0)
        if training:
            hidden1 = hidden1 * srng.binomial(size=(hidden_size,), p=0.5)
        else:
            hidden1 = 0.5 * hidden1
        output = T.nnet.sigmoid(T.dot(hidden1, W_hidden1_to_output) + b_output)
        return output

    parameters = [W_input_to_hidden1, b_hidden1, W_hidden1_to_output, b_output]
    return X, network(True), network(False), parameters
```

Theano doesn’t allow you to replace something in the middle of the compute graph after it has been constructed. That is the reason for the `network()` function: calling it with `training = True` constructs the network with the random mask, while `training = False` builds a network where all the hidden layer outputs are halved.
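A quick sanity check of that test-time halving, sketched in NumPy: with a keep probability of 0.5, the expected value of a masked activation is 0.5 times the activation, so scaling by 0.5 at test time matches the average behaviour of the dropout mask during training.

```python
import numpy as np

rng = np.random.default_rng(0)

h = 2.0                                         # a single hidden activation
masks = rng.binomial(n=1, p=0.5, size=100_000)  # many sampled masks
average_dropped = (h * masks).mean()            # empirical E[h * mask]

print(average_dropped)  # close to 0.5 * h == 1.0
```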

The full code is available here. If anyone manages to push what the neural network can do even further, do let me know!

Thanks for the code! Why are you multiplying the hidden units by 0.5 when you are not training?

That’s a standard thing to do when applying dropout.

Check out this lecture for more information: https://class.coursera.org/neuralnets-2012-001/lecture/119

A couple of clarifications:

1. If you want to compute dropout on the GPU instead of the CPU, use

   ```python
   from theano.sandbox.rng_mrg import MRG_RandomStreams as RandomStreams
   ```

   instead of `shared_randomstreams.RandomStreams`.

2. On the GPU, for some reason, your original implementation (`layer * mask`) runs about 4 times faster than the `T.switch` implementation on my Maxwell Titan X with CUDA 7.5.

Thanks Michael!