I’m not particularly familiar with NCAA Men’s Division I Basketball Championship, but I’ve seen the  March Machine Learning Madness challenge come up for a few years now, and I’ve decided to try my hand at it today.

I also haven’t tried a machine learning task like this one. At it’s simplest (assuming you don’t harvest more data about each team and their players), all you have is a set of game data: who won, who lost, and their respective scores. Intuitively, we should be able to look at tables like these and get a rough sense of who the better teams are. But how do we model it as a machine learning problem?

A quick search brought up Danny Tarlow’s blog and a post that describes a simple model that looked really easy to implement.

## The Model

This section is just a complete rehash of what Danny already covers in his post.

Each team is described by two vectors: one that represents their offense and another that represents their defense.

So a team $i$ would be represented as such:
$$o_i = \left[\begin{matrix} 5\\ 10\\ 4 \end{matrix}\right], d_i = \left[\begin{matrix} 2\\ 3\\ 10 \end{matrix}\right]$$
and team $j$:
$$o_j = \left[\begin{matrix} 3\\ 2\\ 4 \end{matrix}\right], d_j = \left[\begin{matrix} 2\\ 5\\ 5 \end{matrix}\right]$$
and the dot products between the team’s offense vector and the it’s opponent’s defense vector would be the team’s predicted score,
\begin{align*} \hat{S}(i,j)_i &= \left[\begin{matrix} 5\\ 10\\ 4 \end{matrix}\right] \cdot \left[\begin{matrix} 2\\ 5\\ 5 \end{matrix}\right] = 80 \\ \hat{S}(i,j)_j &= \left[\begin{matrix} 3\\ 2\\ 4 \end{matrix}\right] \cdot \left[\begin{matrix} 2\\ 3\\ 10 \end{matrix}\right] = 52 \end{align*}
So in this case, our model predicts Team $i$ wins!

These vector pairs for each team will be learnt from the score data for past games. As always, it’s an optimisation problem and we need a cost function. Danny suggests the squared error function, I however, implemented the mean squared error:
$$\frac{1}{|\text{Games}|}\sum_{(i,j) \in \text{Games}} (\hat{S}(i,j)_i – S(i,j)_i)^2 + (\hat{S}(i,j)_j – S(i,j)_j)^2$$
The difference between the squared errors and the mean squared errors after differentiation is a constant factor. This is usually absorbed into the gradient descent step-size variable $alpha$. Using mean squared error simply means I take smaller steps at each update.

Okay, let’s get our hands dirty!

Let’s get the boring stuff over and done with.

Kaggle provides the data in CSV format, and my weapon of choice for dealing with CSV files is pandas. So,

import pandas as pd from itertools import izip df = pd.read_csv( 'regular_season_results.csv', usecols = ['wteam','wscore','lteam','lscore'] # import only what we need ) team_ids = df['wteam'].append(df['lteam']).unique() # have an ID for each team that starts from 0 # this will help us later with indexing mapping = dict(izip(team_ids,range(team_ids.shape[0]))) df['wteam'] = [ mapping[i] for i in df['wteam'] ] # use those new IDs df['lteam'] = [ mapping[i] for i in df['lteam'] ]

Now, df holds a DataFrame containing just the data we need.

## Theano

If you don’t already know, Theano is a library developed at the University of Montreal and used heavily in their neural networks research. It optimises and compiles symbolic trees of mathematical expressions to CPU or GPU code. I’m not aware how well it does this, but I like being able to express these mathematical equations (like those above) in nearly the same exact form, and not have to worry about writing code optimised for computation.

First, let’s set up the variables. We need two shared variables to hold our data and our parameters that need to be tuned. df.values creates a numpy array from the DataFrame. I chose to have a size 10 vector to represent defense and offense, and these vectors are organised in a tensor in the following way: id $rightarrow$ vector-pair $rightarrow$ 0 for offense, 1 for defense. This calls for a three-dimensional tensor to hold all our parameters.

data = theano.shared(np.asarray(df.values,dtype=np.int16)) W = theano.shared(np.random.random([len(mapping),2,10]))

The parameters are initialised as positive random numbers from 0 to 1. I can’t think of a reason not to try sampling random numbers from a normal distribution centred around 0, but I figured since the dot products have to be positive (they’re predicting scores), then the elements should preferably be positive.

Next, we initialise the theano variables. A matrix for the score table, and a 3D tensor for the parameters to be tuned.

matches = T.wmatrix('matches') weights = T.dtensor3('weights')

Here’s the fun part: the cost function.

wteam_tensor = weights[matches[:,0]] # extract the vector-pairs of the winning teams wteam_score = matches[:,1] # extract the scores of the winning teams as an array lteam_tensor = weights[matches[:,2]] # extract the vector-pairs of the losing teams lteam_score = matches[:,3] # extract the scores of the losing teams as an array cost = T.mean( ( (wteam_tensor[:,0]*lteam_tensor[:,1]).sum(1) - wteam_score )**2 + ( (wteam_tensor[:,1]*lteam_tensor[:,0]).sum(1) - lteam_score )**2 )

Broadcasting makes the expression of row-wise dot products simple. To break it down: First, an element wise multiplication between of the winning offense vectors and the losing defense vectors. Next, sum up the rows to get an array of dot products. Ditto for the losing team. The rest expresses the cost function discussed earlier.

In order to perform gradient descent, we need the gradient! Since all we’ve done so far is express cost as a symbolic tree of mathematical expressions, Theano can perform symbolic differentiation with that tree with just one simple call:

grad = T.grad(cost,wrt=weights)

Now we have everything we need to start the training! Let’s compile the function,

train = theano.function( inputs = [], outputs = cost, givens = [ (matches, data), (weights, W) ], updates = [ (W, W - grad) ] )

Binding data and W to the symbolic variables matches and weights, and doing updates at every call allows us to reduce transfer between GPU memory and main memory, assuming we are running this computation on a GPU. This is the reason for declaring the shared variables earlier.

Now, everytime train() is called, the gradient is computed and the parameters are updated based on the data. Run this for as many epochs as you like!

That’s all for now! The code presented here is all up on Github. Hopefully, I’ll have the time to read more of Danny’s articles about how he does his predictions and learn a thing or two. With any luck, I’ll finally make a competitive submission to Kaggle.