Schaun Wheeler
October 5, 2022
Data Science

An implementation of Complement Naive Bayes for Google BigQuery

It can be a pain to batch and export data from BigQuery just to train a Naive Bayes classifier. We fixed that.

This is a technical post. If your eyes glaze over from talk of supervised learning and class balancing and the variance-bias tradeoff, you should go read something else (we offer a whole lot of choices!). 

If those more technical topics interest you, however, then this post is for you.

Let’s dive in.

The field of data science is currently on a “deep learning” kick. 

To some extent, that's warranted — network models have done some amazing things in the fields of text, image, and sound generation, for example. However, if the task is something a bit more modest (say you have a bunch of text, a few labels, and you want to decide which label best characterizes each piece of text), then pulling out something like a Long Short Term Memory Network or Generative Adversarial Network is like bringing a low-yield tactical nuke to a knife fight. There's just no call for it.

Deep learning requires a whole lot of data to work well, and partially because of that (but also partially because the methods are just really pretty complicated), it also requires a whole lot of computational power. More conventional models like random forests or support vector machines are less computationally intensive and require a little less data to work well, but Naive Bayes can work well on ridiculously small amounts of data, and's literally just a series of basic math problems.

That's because any particular Naive Bayes classifier is just a convenience wrapper around Bayes Theorem, a formulation independently developed at least twice since the 1700s to estimate conditional probability.

A gentle explanation of conditional probability

You may have heard of the duck test:

“If it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck.”

That’s conditional probability. For any given set of animals, the probability that any one animal is a duck can be estimated by:

P(is a duck) = [total # of ducks] / [total # of animals]

And the probability that any one animal, say, swims like a duck, can be estimated by:

P(swims like a duck) = [total # of animals who swim like a duck] / [total # of animals]

So our conditional probability is the probability that an animal is a duck if we already know that the animal swims like a duck. That’s calculated by,  first, taking the joint probability (the percentage of animals that both are a duck and swim like a duck):

P(is a duck and swims like a duck) = P(is a duck) * P(swims like a duck)

And, second, dividing it by the percentage of animals who swim like a duck. So this is the joint probability:

P(is a duck, given that it swims like a duck) = P(is a duck and swims like a duck) / P(swims like a duck)

In other words, reduce our universe from the total number of animals to just the percentage of animals that swim like a duck. By using a percentage instead of a count, we’re continuing to use the original probabilities (based on total number of animals), but we’re forcing our calculation to only consider those animals that meet the condition.

What does this have to do with classification? 

Well, if we have a lot of animals, and we want to figure out which animals are ducks, we can combine conditional probabilities. For example, if one animal looks like a duck, swims like a duck, and quacks like a duck, then then:

P(animal is a duck) = P(is a duck, given that it looks like a duck) * P(is a duck, given that it swims like a duck) * P(is a duck, given that it quacks like a duck)

We could compare that the probability that an animal was, say, a goose:

P(animal is a goose) = P(is a goose, given that it looks like a duck) * P(is a goose, given that it swims like a duck) * P(is a goose, given that it quacks like a duck)

A goose could conceivably look a little like a duck, and could very likely swim like a duck, most likely wouldn’t quack like a duck. Therefore, if an animal looked, swam, and quacked like a duck, the probability of it being a duck would be higher than the probability of it being a goose, and therefore we would classify it as a duck.

Naive Bayes is powerful because it’s basic. 

Every-day, normal-run-of-business data science faces tons of situations where you need to look at a bunch of stuff that quacks, swims, walks, etc. and estimate the probability that it’s a duck. 

Need to classify tweets, business names, or any other kind of document? That’s a classification problem. 

Have a bunch of customer behaviors you care about and want to know which customer is likely to do each behavior? Another classification problem. 

Naive Bayes is a very easy and inexpensive way to set up a good baseline solution to a classification problem: before investing in a more data-hungry or computationally-expensive approach to classification, set up a Naive Bayes classifier, and then compare the performance of any subsequent classifiers to that more basic one. 

You’ll be surprised at how often the basic approach performs just as well as, and sometimes better than, more complicated methods.

A Naive Bayes Implementation for BigQuery

At Aampe, we use Google BigQuery as our data warehouse, and, generally, we’ve found it wonderfully scalable and easy to use. 

BigQuery comes with a bunch of different machine learning algorithm implementations available right out of the box (in fact, we use a couple of those implementations for our core learning systems), but there is currently no implementation of Naive Bayes. 

There are many other implementations available — for example, Python’s Scikit-Learn library has several variations to choose from — and those models can even be fit iteratively to accommodate training data that are too big to fit into memory at one time.

That said, even though a lot of our workflow happens in Python, it can be a pain to batch and export data from BigQuery just to train the model, and it’s an even bigger pain to have to export data in batches to then use the model to make predictions and then load those predictions back into the database to be used in other parts of the system.

That’s why we adapted the Scikit-Learn implementation of ComplementNB to run in BigQuery (We wrapped the whole thing in a Python class for convenience, but all of the heavy lifting actually happens within Bigquery itself). The implementation has the following methods:

  • stage_training_data This method ingests a query statement and saves the results of that query as a separate temporary table. The table needs to have a field that uniquely identifies records (if it were a spreadsheet, this would be the line numbers), and arrays of structs that contain the feature names (column headers) and values (cells). Also, of course, this table needs to have a column of labels that you want the algorithm to learn. Optionally, this table can also contain a column of weights, if you want to weight your samples in a specific way.
  • train_model This method processes the training data using the Complement Naive Bayes algorithm. It creates a new table that contains a log-probability for every combination of label and features. This is the representation of everything the model learned.
  • stage_test_data This method can take a query that defines a test dataset, the same way `stage_training_data` can. Alternatively, it can take one of two keywords in place of the query. The keyword `evaluation` will just load the training data as the test data (This is useful for evaluating model accuracy). The keyword “features” will create a new dataset where each “row” contains only one of the features (This is convenient for assigning a probability to each feature (which can be used similar to model coefficients from, say, a logistic regression)).
  • predict_probabilities Using the data loaded from `stage_test_data`, this method uses the log probabilities to estimate the probability that each record belong to each label. These probabilities are stored in a `probabilities` struct, and the label with the highest probability is stored in a nother column.
  • evaluate_model This outputs precision, recall, and f1-score measures for a particular label, as well as a confusion matrix.

If you’d like to look at the code — or, even better, take it for a spin — you can find it here.