Dropout using Theano

A month ago I tried my hand at the Higgs Boson Challenge on Kaggle. I went with a neural network approach that got me pretty far initially, but other techniques seemed to have won out.

Update

Seeing how much traffic this particular post gets, I’d like to update it after playing around with some of the functions available. I’m now convinced the following is a better way to do dropout given a batch:

hidden1 = T.nnet.relu(T.dot(X,W_input_to_hidden1) + b_hidden1)
if training:
	# keep a unit where the sampled mask is 1, output 0 where it is 0
	hidden1 = T.switch(srng.binomial(size=hidden1.shape,p=0.5),hidden1,0)
else:
	# halve the activations at test time to match the expected training output
	hidden1 = 0.5 * hidden1

Using T.switch seems to be much faster than multiplying, and T.nnet.relu now exists to get a ReLU activation. Enjoy!
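For context, here is a rough sketch of how that updated snippet would slot into the network() function from the original post below, assuming srng and the shared variables are set up exactly as in the original code:

def network(training):
	# linear step followed by the built-in ReLU
	hidden1 = T.nnet.relu(T.dot(X,W_input_to_hidden1) + b_hidden1)
	if training:
		# keep a unit where the sampled mask is 1, output 0 where it is 0
		hidden1 = T.switch(srng.binomial(size=hidden1.shape,p=0.5),hidden1,0)
	else:
		# halve the activations at test time
		hidden1 = 0.5 * hidden1
	output = T.nnet.sigmoid(T.dot(hidden1,W_hidden1_to_output) + b_output)
	return output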

Original Post

The model is a simple neural network with one hidden layer. I started with a sigmoid activation in the hidden layer, and eventually switched to rectified linear units (ReLUs). One simple way to implement this in Theano is to do the linear step first, and then set all negative values to 0.

hidden1 = T.dot(X,W_input_to_hidden1) + b_hidden1 # linear step
hidden1 = hidden1 * (hidden1 > 0) # has effect of setting negative values to 0.

Doing all of that got me pretty far, but I decided to see if I could implement the dropout technique in a simple way. It turns out it isn’t too hard. Theano provides RandomStreams, which can be used to sample from various distributions. The approach I use is to sample a vector of the same size as the hidden layer from a binomial distribution with p = 0.5. Multiplying this 0/1 vector with the ReLU outputs randomly sets about half of them to 0, acting as the random mask we need for dropout. When the gradient is calculated, the dropped units contribute nothing to the output, so no error is back-propagated through them, and everything works as it should.
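To see the mask in isolation, here is a minimal, self-contained sketch (the names h, mask and f are purely illustrative) that samples a binomial mask and applies it to a vector:

import numpy as np
import theano
import theano.tensor as T
from theano.tensor.shared_randomstreams import RandomStreams

srng = RandomStreams(seed=12345)
h = T.vector('h')                          # stands in for the hidden layer output
mask = srng.binomial(size=h.shape, p=0.5)  # vector of 0s and 1s, one per unit
f = theano.function([h], h * mask)         # roughly half the entries come back as 0
print(f(np.ones(10, dtype=theano.config.floatX)))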

Putting it all together, the function that constructs the network looks like this:

import theano.tensor as T
from theano.tensor.shared_randomstreams import RandomStreams

# U is a small utility module (create_shared, initial_weights) from the full code linked below.

def build_network(input_size,hidden_size):
	srng = RandomStreams(seed=12345)

	X = T.fmatrix('X')
	W_input_to_hidden1  = U.create_shared(U.initial_weights(input_size,hidden_size))
	b_hidden1 = U.create_shared(U.initial_weights(hidden_size))
	W_hidden1_to_output = U.create_shared(U.initial_weights(hidden_size))
	b_output = U.create_shared(U.initial_weights(1)[0])

	def network(training):
		hidden1 = T.dot(X,W_input_to_hidden1) + b_hidden1
		hidden1 = hidden1 * (hidden1 > 0)   # ReLU: set negative values to 0
		if training:
			# random 0/1 mask over the hidden units
			hidden1 = hidden1 * srng.binomial(size=(hidden_size,),p=0.5)
		else:
			# halve the activations at test time
			hidden1 = 0.5 * hidden1

		output = T.nnet.sigmoid(T.dot(hidden1,W_hidden1_to_output) + b_output)
		return output

	parameters = [
		W_input_to_hidden1,
		b_hidden1,
		W_hidden1_to_output,
		b_output
	]

	return X,network(True),network(False),parameters

Theano doesn’t allow you to replace something in the middle of your computation graph after it’s been constructed. This is the reason for having the network() function: calling it with training = True constructs the graph with the random mask, while training = False builds one where all the hidden layer outputs are halved.
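For completeness, here is a rough sketch of how the two returned graphs might be compiled into training and prediction functions. The binary cross-entropy cost, the plain gradient descent updates, the learning rate and the layer sizes below are placeholders, not necessarily what the full code linked below does:

import theano
import theano.tensor as T

input_size, hidden_size = 30, 100  # illustrative sizes
X, train_output, test_output, parameters = build_network(input_size, hidden_size)

Y = T.fvector('Y')                                        # binary labels
cost = T.nnet.binary_crossentropy(train_output, Y).mean()
grads = T.grad(cost, parameters)
updates = [(p, p - 0.01 * g) for p, g in zip(parameters, grads)]

train   = theano.function([X, Y], cost, updates=updates)  # uses the graph with the random mask
predict = theano.function([X], test_output)               # uses the graph with halved outputs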

The full code is available here. If anyone manages to push what the neural network can do even further, do let me know!

Comments

  1. Thanks for the code! Why are you multiplying the hidden units by 0.5 when you are not training?

  2. Couple of clarifications:

    1. If you want to compute dropout on the GPU instead of on the CPU, use:
    from theano.sandbox.rng_mrg import MRG_RandomStreams as RandomStreams
    instead of
    shared_randomstreams.RandomStreams

    2. On the GPU, for some reason, your original implementation (layer * mask) runs about 4 times faster than the T.switch implementation on my Maxwell Titan X with CUDA 7.5.
