## Naive Bayes Categorisation (with some help from Elasticsearch)

Back in November, I gave a talk during one of the Friday Hackers and Painters sessions at Plug-in@Block 71, aptly titled “How I do categorisation and some naive bayes sh*t” by Calvin Cheng. I promised I’d write a follow-up blog post with the materials I presented during the talk, so here it is.

First off, I’d like to start with a little motivation about why we need to do this.

The Semantics3 Product API provides data from different sites about each one of the 35 million products we index. One of the biggest problems we face is being able to tell that a product from two different sites are the same thing: a process we call **disambiguation**.

For example, check out the following pages on newegg.com,

and this one, from amazon.com,

They’re pages from two different sites, about the same product. A lot of the work that we do goes into trying to aggregate these products into one single record so that tasks like price comparisons become easy.

Assuming we already have a function that reliably tells us if two products are the same thing, it still wouldn’t be feasible to run a data from a newly discovered product against all 35 million records in our database.

This is where categorisation comes in.

Being able to estimate, say, the top three categories which a given product may be in can drastically narrow down our search space of products we need to check against for disambiguation. As it turns out, using a simple naive bayes as a model works remarkably well. There are plenty of explanations about how naive bayes works, and I’ll attempt to provide my own here.

## Some probability

Generally, the probability of an event $A$ occurring based on past experience is given by:

$$P(A) = \frac{\text{No. of times event A occurred}}{\text{No. of times any event occurred}}$$

If, however, I am only interested in the probability of event A occurring when I know event B has already happened, then this is written as:

$$P(A|B) = \frac{\text{No. of times event A and event B both occurred}}{\text{No. of times event B occurred}}$$

This can be thought of as the probability of the event A occurring within the “B already happened” universe and therefore should not be confused with the probability that event A and event B occur, which is given by:

$$P(A,B) = \frac{\text{No. of times event A and B both occurred}}{\text{No. of times any event occured}}$$

Notice that in $P(A)$ and $P(A,B)$, the denominators are the same, this means they cancel out when divided:

$$\frac{P(A,B)}{P(A)} = \frac{\text{No. of times event A and B both occurred}}{\text{No. of times event A occurred}} = P(B|A)$$

which also means $P(A|B)P(B) = P(A,B)$.

This is how, with $P(A)$ and $P(B)$, you can reverse $P(A|B)$ to $P(B|A)$.

$$P(B|A)=\frac{P(A|B)P(B)}{P(A)}$$

This is known as Bayes’ Rule, and is at the heart of how the naive bayes algorithm works.

## An example

Consider the task of classifying a product either into “Laptops” or “Mobile phones”.

We know the words that commonly appear in previously seen instances of mobile phones and laptops. Therefore, with the data we have, we can build a table like the following:

```
<td>
Laptops
</td>
<td>
Mobile Phones
</td>
```

```
<td>
</td>
<td>
23
</td>
```

```
<td>
46
</td>
<td>
56
</td>
```

```
<td>
91
</td>
<td>
1
</td>
```

```
<td>
40
</td>
<td>
</td>
```

```
<td>
51
</td>
<td>
52
</td>
```

Word |

motorola |

samsung |

lenovo |

hewlett |

LG |

With that, we can intuitively tell that any product that has the word “lenovo” or “hewlett” in it has an extremely high probability to be a laptop. We knew this before through our own domain knowledge acquired from personal experience, but we now have data that we can use to let our model reason about automatically.

How is any of this related to naive bayes?

- We have the counts for the individual words given the category, so we can compute the probability that the word would occur in that category, this gives us $P(\text{word}|\text{category})$.
- We have the counts of number of items from each category, that gives us $P(\text{category})$.
- We have the counts of number of words for all words, that gives us $P(\text{word})$.
- We want to know the probability that a product comes from a category, given that we observe a certain word. So we need to know $P(\text{category}|\text{word})$

Seems like we have everything we need in numbers 1-3 in order for us to get 4! Not quite.

There are more signals given a product’s name — “Lenovo 4G no contract” tells us that its far likelier to be a mobile phone than a laptop, but if we take only the word “lenovo” into consideration, then the product would be categorised as such. We need to be able to take the other words into consideration, and that’s where the ‘naive’ assumption in naive bayes comes in.

The probability that the words ‘no’ and ‘contract’ occur often together may be pretty high, but for current purposes, it is far easier (and more efficient) to assume that they are independent events on their own, and the probability that they appear is only dependent on the category that the data is from. Then, probability theory tells us that the probability of two independent events $A$ and $B$ occurring together is simply the product of their individual probabilities:

$$P(A,B) = P(A)P(B)$$

If we apply this idea to the word probabilities, the probability of seeing a set of words given the categories becomes:

$$P(w_1,w_2,\ldots,w_k|\text{category}) = \prod_i P(w_i|\text{category})$$

*Now* we have everything.

We know everything we need to know now to compute $P(\text{category}|w_1,w_2,\ldots,w_k)$. We just need to find the category that gives us the highest value. In nerdspeak:

$$\mathop{\arg\max}\limits_c P(c|w_1,w_2,\ldots,w_k) = \mathop{\arg\max}\limits_c \frac{P(c)\prod_i P(w_i|c)}{\prod_i P(w_i)}$$

## Computational tricks

Notice that the denominator on the right-hand side above is independent of $c$ (the category we’re considering). This means that quantity will always remain the same regardless of $c$ and does not have to be considered if we are simply trying to find the classification that gives the maximum probability. This yields a simpler formula:

$$\mathop{\arg\max}\limits_c P(c|w_1,w_2,\ldots,w_k) = \mathop{\arg\max}\limits_c P(c)\prod_i P(w_i|c)$$

With large counts also comes underflow issues. In such cases, the standard technique is to use log probabilities. Since taking log does not affect order, we can just safely apply log on both sides:

$$\mathop{\arg\max}\limits_c \log P(c|w_1,w_2,\ldots,w_k) = \mathop{\arg\max}\limits_c \log P(c) + \sum_i \log P(w_i|c)$$

For retrieving normalised probabilities after making the prediction, check out the post I wrote about logsumexp

## Counts by faceting with Elasticsearch

One way to obtain the count table shown above would be to loop through and count every entry in the database. However in our case, that table is already computed and stored, just in the form of the Elasticsearch index.

Faceting allows you to get the counts of a field of a search term. We routinely query our database for these counts, and use them in our categorisation model.

An example of the query we do is the following:

```
{
"query": { "match_all": {} },
"facets" : {
"name": {
'terms' : {
'field': "description",
'size' : 10
}
},
},
"size":0
}
```

This gives us the top 10 most common words in our sample database. Since we separate each category by index, running this query on each index will give us the equivalent of a column of the table shown above.

```
{'_shards': {'failed': 0, 'successful': 3, 'total': 3},
'facets': {'name': {'_type': 'terms',
'missing': 5,
'other': 995,
'terms': [{'count': 23, 'term': 'from'},
{'count': 21, 'term': '4'},
{'count': 20, 'term': 'your'},
{'count': 19, 'term': 'up'},
{'count': 18, 'term': 'iphon'},
{'count': 18, 'term': 'ani'},
{'count': 14, 'term': 'stay'},
{'count': 11, 'term': 'provid'},
{'count': 11, 'term': 'built'},
{'count': 11, 'term': '4s'}],
'total': 1161}},
'hits': {'hits': [], 'max_score': 1.0, 'total': 28},
'timed_out': False,
'took': 239}
```

It’s important to note that these words have been lowercased and stemmed, all preprocessing steps that can be configured using Elasticsearch. Of course after building a naive bayes model with this data, any query fed into the model must also be put through the same preprocessing steps.

This gives you the count table you need without having to iterate through every entry you have, tokenising them and counting the tokens category by category. It’s all been done by Elasticsearch, so use it!

Hope you find this post useful, and have a great 2014 ahead!