## Finding Maximum Dot (or Inner) Product

A problem that often arises in machine learning tasks is trying to find a row in a matrix that gives the highest dot product given a query vector. Some examples of such situations:

- You’ve performed some kind of matrix factorisation for collaborative filtering for say, a movie recommendation system, and now, given a new user, you want to be able to specify a couple of movies that your system would predict he would rate highly.
- A neural network where the final softmax predictive layer is huge (but you managed to train it, somehow).

In both these cases, the problem boils down to trying to search a collection of vectors to find the one that gives the highest (or the $k$ highest) dot product(s).

A simple way to do this would be to perform a matrix multiplication, and then to find the best scoring vector by scanning through the values. This is effectively performing $N$ dot product computations for a matrix with $N$ rows. Can we do better?

## Tree Structures

I figured a good place to start would be to look at how we can store the vectors in some form of tree structure so that in the search for the best vector, we’d be able to prune out sub-trees that are not worth looking at. This problem may look really similar to a nearest neighbour search, and it might follow that using a $k$d-tree here might make sense. But while points that are close by would have a similar dot-product score, far away points may also have an equal score, or higher.

If we consider geometrically what a dot-product means, it’s

$$u \dot{} v = |u||v|\cos \theta$$

where $\theta$ is the angle between the vectors $u$ and $v$. This means, is in order to find a high scoring vector, we need to look in the same direction that the query vector is pointing in, but also as far as possible in that direction. Additionally, a vector that’s pointing in some angle away from the query vector may score even higher if it has the magnitude to make up for the loss.

I haven’t figured out the best way to do this yet. I’ve been thinking about this in terms of making a database to which new vectors can be added and removed, so I figured something that wouldn’t be fixed after being built would be a good choice. Some form of index which can account for both angle and magnitudes of the vectors would be great as well. Since how we perform our search also plays a big role in deciding how best to do this, let’s get into that.

## Search

If we’re going to be able to prune sub-trees from the search, we need a heuristic, a score that gives an upper bound for the maximum dot-product in that sub-tree. This is so if you already have a candidate for the maximum and the heuristic gives you an upper bound that’s *lower* than the candidate, we can safely ignore that sub-tree.

So consider the following method:

Notice that for the non-leaf nodes, the information stored are the minimum and maximum values for each dimension of every vector in the sub-tree. Then, given a query vector $v$, we can define a heuristic function:

$$h(u,v) = \sum_i \max(v_i u_{\text{max},i},v_i u_{\text{min},i})$$

where $u_{\text{max},i}$ and $u_{\text{min},i}$ are the $i$-th values of the the maximum and minimum values respectively. Notice that, negative values in the query vectors will cause the minimum value to be picked, and positive values the maximum. This will give us a really optimistic score that we can use to decide if the sub-tree is worth exploring.

As an example, let’s see what happens when we run a query using $(1,-1)$.

- Evaluating $h(v)$ on the left sub-tree gives us 2.2, and -1.7 on the right sub-tree.
- We then decide to explore the left one, we find that the best score there just so happens to be 2.2. This is now our best score.
- Since the heuristic for the right sub-tree is less than that of our best score, we can ignore it.

You can try calculating all the scores in the right sub-tree as an exercise, and you’ll find they’re all lesser than -1.7.

## Parting notes…

I expected there to be more academic literature pertaining to this topic, but I found just one in my search for a solution. This is particularly odd to me given that it’s a problem that I’ve found coming up fairly often. The paper discusses a different heuristic based on the best possible inner product given a region in space. I haven’t spent the time to figure out how it works yet, but I’ll update this space once I do.

There’s also the problem of finding out what a good tree partitioning scheme would be to make the most of the heuristic I’ve specified, while still allowing fast insertions and deletions.

It seems like a problem worth solving to me, not to mention it combines both data structures and vector math topics, which makes it pretty interesting.

I’d appreciate pointers to any related work or improvements to what I currently have, so leave a comment!

I know this is an old post, but if you haven’t somehow stumbled upon this in the mean time: locality sensitive hashing deals with this. There are some nice pictures to explain here: https://www.cs.jhu.edu/~vandurme/papers/VanDurmeLallACL10-slides.pdf

Thanks for this!