Bayes Classifier

What is a Naive Bayes Classifier?

A Naive Bayes classifier classifies items into categories based on the presence or absence of features. By combining the probabilities of each feature appearing in a given category, we can classify an unknown item. For this to work, we must first train our classifier with known data.

Note:

A critical assumption is feature independence: the presence of one feature is independent of the presence of any other feature. E.g., the fact that an apple is red is independent of the fact that it has seeds.
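
Formally, the independence assumption is what lets us multiply the per-feature probabilities together. A sketch of the standard rule, where C is a category and w_1, ..., w_n are the observed features:

P(C \mid w_1, \dots, w_n) \propto P(C) \prod_{i=1}^{n} P(w_i \mid C)

Taking the log turns this product into the sum \log P(C) + \sum_{i=1}^{n} \log P(w_i \mid C), which is exactly what the classification code below computes.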

Example Usage

A good example is a spam filter (real ones aren't this simple any more). A spam filter needs to classify an email as either spam or not spam. A very simple breakdown of the classification process:

  • Parsing: Split and count the individual words in the email we are trying to learn from or classify.
  • Learning: Sample previously classified emails and count the words that show up in spam vs non-spam emails. (wiki: supervised learning)
  • Classifying: Compare the frequency of spammy words vs. non-spammy words and choose the more likely classification.
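
Concretely, the learned state is nothing more than per-category word counts. With made-up numbers, it might look like this:

spam     = {'viagra': 20, 'winner': 12, 'free': 15, ...}
not_spam = {'meeting': 30, 'report': 22, 'lunch': 9, ...}

At classification time we ask which of these count tables makes the words in the new email more likely.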

Implementation:

This isn't optimized for speed or anything like that, but it does a good job of outlining what is involved in the training and classification steps. For the full code, go to GitHub.

Parsing

  • Split the document into individual words
  • Filter out words with fewer than 3 letters
  • Group and count duplicate words
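
A minimal sketch of a get_words method along these lines (the name matches the self.get_words calls below; the exact regex is an assumption):

import re
from collections import Counter

def get_words(self, data):
    # split the document into individual lowercase words
    words = re.findall(r'[a-z]+', data.lower())
    # filter out words with fewer than 3 letters, then group and count duplicates
    return Counter(word for word in words if len(word) >= 3)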

Training

  • Parse the document into word counts
  • Update the category-specific count for each word that appears in the document, e.g. spam['viagra'] = 20.
  • Update how many words we have counted for the given category.
for data, category in docs:
    words = self.get_words(data)                      # Counter of word -> count
    self.categories[category].update(words)           # bump per-category word counts
    self.word_count[category] += sum(words.values())  # total words seen for this category
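
For context, self.categories and self.word_count can be plain defaultdicts. A minimal skeleton of the surrounding class might look like this (the structure is an assumption, not the exact code from the repo):

from collections import Counter, defaultdict

class Classifier(object):
    def __init__(self):
        self.categories = defaultdict(Counter)  # category -> Counter of word counts
        self.word_count = defaultdict(int)      # category -> total words counted

    def train(self, docs):
        for data, category in docs:
            words = self.get_words(data)
            self.categories[category].update(words)
            self.word_count[category] += sum(words.values())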

Classifying

  • Parse the document into word counts
  • Set up a counter for each possible classification in your classifier.
  • Loop through each word in the document
    • Loop through each category and add log(category.word / category.total) to your counter. We use log to avoid underflow when multiplying very small probabilities together (it may be overkill here).
  • Loop through each category and add log(word_count[category] / total), the category's prior probability, to each category's score.
  • Return the category with the highest score
words = self.get_words(string)  # a Counter of the words in the string
score = defaultdict(float)      # log-probability score for each category

for word in words:
  for category, category_words in self.categories.items():
    if category_words[word]:
      # log P(word | category); float() avoids integer division in Python 2
      score[category] += log(float(category_words[word]) /
        self.word_count[category])
    else:
      # smooth unseen words with a small constant count so log() stays defined
      score[category] += log(.01 / self.word_count[category])

# add the prior, log P(category), to each score
total = sum(self.word_count.values())
for category in self.word_count:
  score[category] += log(float(self.word_count[category]) / total)

if score:
  return max(score, key=lambda x: score[x])

return None
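
The .01 in the else branch is ad-hoc smoothing: a word that never appeared in a category would otherwise contribute log(0). Add-one (Laplace) smoothing is the more standard way to handle unseen words, but a small constant works fine for a sketch like this.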

Usage

import bayes
classifier = bayes.Classifier()

data = [
    ('training string 1', 'type_1'),
    ('training string 2', 'type_1'),
    ('training string 3', 'type_1'),
    ('training string 4', 'type_2'),
    ('training string 5', 'type_2')
    # ... more training examples
]
classifier.train(data)

# you can save the data for later
classifier.save('demo.bayes')

print classifier.classify('this is a test string')
Full source on GitHub.