Machine Learning Technique -> Naive Bayes

In the context of spam filtering, the actual text of the e-mail x corresponds to the test and the label y is equivalent to the diagnosis. We could use the latter to infer the label from the text via Bayes rule:

p(y | x) = p(x | y) p(y) / p(x)
We may have a good estimate of p(y), that is, the probability of receiving a spam or ham e-mail. Denote by mham and mspam the number of ham and spam e-mails in X, and let m = mham + mspam. In this case we can estimate

p(ham) ≈ mham / m and p(spam) ≈ mspam / m.
The key problem, however, is that we do not know p(x | y) or p(x). We may dispose of the requirement of knowing p(x) by settling for a likelihood ratio

L(x) = p(spam | x) / p(ham | x) = [p(x | spam) p(spam)] / [p(x | ham) p(ham)].

Whenever L(x) exceeds a given threshold c we decide that x is spam and consequently reject the e-mail. If c is large, our algorithm is conservative and classifies an e-mail as spam only if p(spam | x) >> p(ham | x). On the other hand, if c is small, the algorithm aggressively classifies e-mails as spam.
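For illustration, here is a minimal Python sketch of this decision rule; the function name and arguments are our own, and it assumes the class-conditional probabilities and priors have already been estimated (how to estimate them is discussed below).

```python
# A minimal sketch of the likelihood-ratio decision rule (not from the
# original text). It assumes estimates of p(x | spam), p(x | ham), p(spam)
# and p(ham) are already available.

def classify(p_x_spam: float, p_x_ham: float,
             p_spam: float, p_ham: float, c: float = 1.0) -> str:
    """Return 'spam' if the likelihood ratio L(x) exceeds the threshold c."""
    L = (p_x_spam * p_spam) / (p_x_ham * p_ham)
    return "spam" if L > c else "ham"
```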

The key obstacle is that we have no access to p(x | y). This is where we make our key approximation. In order to model the distribution of the test outcomes T1 and T2 we made the assumption that they are conditionally independent of each other given the diagnosis. Analogously, we may now treat the occurrence of each word in a document as a separate test and combine the outcomes in a naive fashion by assuming that

p(x | y) ≈ product over j of p(wj | y),

where wj denotes the j-th word in document x. This amounts to the assumption that the probability of occurrence of a word in a document is independent of all other words given the category of the document.
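In code, the naive assumption simply multiplies per-word probabilities. The sketch below is illustrative only: it assumes the document has already been tokenized into a list of words and that p(w | y) is stored in a dictionary.

```python
# Naive factorization: approximate p(x | y) by the product of the per-word
# conditionals p(wj | y) over all words occurring in the document.

def p_doc_given_class(words, p_word_given_y):
    prob = 1.0
    for w in words:
        # Words never seen in class y get probability 0 here; smoothing,
        # discussed further below, avoids this degenerate case.
        prob *= p_word_given_y.get(w, 0.0)
    return prob
```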

This assumption reduces the difficulty of knowing p(x | y) to that of estimating the probabilities of occurrence of individual words w. Estimates for p(w | y) can be obtained, for instance, by simply counting the frequency of occurrence of the word within documents of a given class. That is, we estimate

p(w | spam) ≈ [sum over i and j of {yi = spam and wji = w}] / [sum over i and j of {yi = spam}].

Here the indicator {yi = spam and wji = w} equals 1 if and only if xi is labeled as spam and w occurs as its j-th word. The denominator is simply the total number of words in spam documents. Similarly one can compute p(w | ham). In principle we could perform the above summation whenever we see a new document x. This would be terribly inefficient, since each such computation requires a full pass through X and Y. Instead, we can perform a single pass through X and Y and store the resulting statistics as a good estimate of the conditional probabilities. One way of implementing this single pass is sketched below.
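The following is a minimal Python sketch of such a counting pass, not the original algorithm: the training set is assumed to be an iterable of (word list, label) pairs, and the counts are kept in plain dictionaries.

```python
from collections import Counter

# Single pass through the training data: accumulate, per class, how often each
# word occurs, how many words the class contains in total, and how many
# documents carry each label. From these counts, p(w | spam) is
# word_counts['spam'][w] / total_words['spam'], and p(spam) is
# doc_counts['spam'] / (doc_counts['spam'] + doc_counts['ham']).

def train(documents):
    """documents: iterable of (words, label) pairs with label in {'spam', 'ham'}."""
    word_counts = {'spam': Counter(), 'ham': Counter()}
    total_words = {'spam': 0, 'ham': 0}
    doc_counts = {'spam': 0, 'ham': 0}
    for words, label in documents:
        doc_counts[label] += 1
        word_counts[label].update(words)
        total_words[label] += len(words)
    return word_counts, total_words, doc_counts
```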

Note that we performed a number of optimizations: Firstly, the normalization by mspam^-1 and mham^-1 respectively is independent of x, hence we incorporate it as a fixed offset. Secondly, since we are computing a product over a large number of factors, the numbers might lead to numerical overflow or underflow. This can be addressed by summing the logarithms of the terms rather than computing products. Thirdly, we need to address the issue of estimating p(w | y) for words w which we might not have seen before. One way of dealing with this is to increment all counts by 1. This method is commonly referred to as Laplace smoothing.
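Combining these optimizations, the scoring step might look roughly as follows. This is a sketch under our own assumptions: it reuses the counts produced by the training sketch above, and the vocabulary size used for smoothing is passed in as a parameter.

```python
import math

# Score a document in log space. The class priors contribute a fixed offset,
# each word adds log p(w | spam) - log p(w | ham), and all counts are
# incremented by 1 (Laplace smoothing) so unseen words do not produce
# zero probabilities.

def log_likelihood_ratio(words, word_counts, total_words, doc_counts, vocab_size):
    m = doc_counts['spam'] + doc_counts['ham']
    score = math.log(doc_counts['spam'] / m) - math.log(doc_counts['ham'] / m)
    for w in words:
        p_w_spam = (word_counts['spam'][w] + 1) / (total_words['spam'] + vocab_size)
        p_w_ham = (word_counts['ham'][w] + 1) / (total_words['ham'] + vocab_size)
        score += math.log(p_w_spam) - math.log(p_w_ham)
    return score
```

With this formulation, a document is classified as spam whenever the score exceeds log c; choosing c = 1 reduces the test to checking whether the score is positive.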


This simple algorithm is known to perform surprisingly well, and variants of it can be found in most modern spam filters. It amounts to what is commonly known as "Bayesian spam filtering". Obviously, we may apply it to problems other than document categorization, too.


Happy Exploring!
