In the context of spam filtering the actual text of the e-mail x corresponds to the test and the label y is equivalent to the diagnosis. We could use the latter to infer, via Bayes rule,

p(spam | x) = p(x | spam) p(spam) / p(x).

We may have a good estimate of p(y), that is, the probability of receiving a spam or ham mail. Denote by m_ham and m_spam the number of ham and spam e-mails in X. In this case we can estimate

p(ham) ≈ m_ham / m  and  p(spam) ≈ m_spam / m,

where m = m_ham + m_spam is the total number of e-mails. The key problem, however, is that we do not know p(x | y) or p(x). We may dispose of the requirement of knowing p(x) by settling for a likelihood ratio

L(x) := p(spam | x) / p(ham | x) = p(x | spam) p(spam) / (p(x | ham) p(ham)).

Whenever L(x) exceeds a given threshold c we decide that x is spam and consequently reject the e-mail. If c is large then our algorithm is conservative and classifies an e-mail as spam only if p(spam | x) >> p(ham | x). On the other hand, if c is small then the algorithm aggressively classifies e-mails as spam.
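As a minimal illustration of this decision rule (the function and argument names below are hypothetical, not part of the original text), one might write:

    def is_spam(p_x_given_spam, p_x_given_ham, p_spam, p_ham, c=1.0):
        # Likelihood ratio L(x) = p(spam|x) / p(ham|x); the factor p(x) cancels out.
        L = (p_x_given_spam * p_spam) / (p_x_given_ham * p_ham)
        # Reject the e-mail (classify as spam) whenever L(x) exceeds the threshold c.
        return L > c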
The key obstacle is that we have no access to p(x | y). This is where we make our key approximation. In order to model the distribution of the test outcomes T1 and T2 we made the assumption that they are conditionally independent of each other given the diagnosis. Analogously, we may now treat the occurrence of each word in a document as a separate test and combine the outcomes in a naive fashion by assuming that

p(x | y) = ∏_j p(w^j | y),

where the product runs over all words of x and w^j denotes the j-th word in document x. This amounts to the assumption that the probability of occurrence of a word in a document is independent of all other words given the category of the document.
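For concreteness, this factorization can be evaluated as a plain product over the words of a document. In the sketch below, p_w_given_y is an assumed dictionary mapping each word to its estimated probability given the class y:

    def p_x_given_y(words, p_w_given_y):
        # Naive Bayes assumption: multiply the per-word probabilities, treating
        # each word as conditionally independent of the others given the class y.
        prob = 1.0
        for w in words:
            prob *= p_w_given_y[w]
        return prob

For long documents this product can easily underflow; that issue is addressed further below by working with logarithms.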
This assumption reduces the difficulty of knowing p(x | y) to that of estimating the probabilities of occurrence of individual words w. Estimates for p(w | y) can be obtained, for instance, by simply counting the frequency of occurrence of the word within documents of a given class. That is, we estimate

p(w | spam) ≈ Σ_i Σ_j {y_i = spam and w_i^j = w} / Σ_i Σ_j {y_i = spam},

where the sums run over all documents x_i and over all word positions j within x_i. Here {y_i = spam and w_i^j = w} equals 1 if and only if x_i is labeled as spam and w occurs as the j-th word in x_i. The denominator is simply the total number of words in spam documents. Similarly one can compute p(w | ham).
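A direct transcription of this counting estimate might look as follows (the argument names X and Y and the use of whitespace tokenization are assumptions made for the sake of the example):

    def estimate_p_w_given_class(X, Y, label):
        # X: list of documents (strings); Y: list of labels ('spam' or 'ham').
        # Count how often each word w occurs in documents of the given class
        # and divide by the total number of words in that class.
        counts, total = {}, 0
        for x, y in zip(X, Y):
            if y != label:
                continue
            for w in x.split():
                counts[w] = counts.get(w, 0) + 1
                total += 1
        return {w: n / total for w, n in counts.items()}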
In principle we could perform the above summation whenever we see a new document x. This would be terribly inefficient, since each such computation requires a full pass through X and Y. Instead, we can perform a single pass through X and Y and store the resulting statistics as a good estimate of the conditional probabilities. The following sketch shows one possible implementation.
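The original algorithm is not reproduced in this post, so the Python sketch below shows what such a single-pass implementation might look like; the class and method names are illustrative. It already incorporates the optimizations discussed next: class priors folded in as fixed offsets, sums of logarithms instead of products, and Laplace smoothing for unseen words.

    import math
    from collections import Counter

    class NaiveBayesSpamFilter:
        def train(self, X, Y):
            # Single pass through X and Y: accumulate word counts per class.
            self.counts = {'spam': Counter(), 'ham': Counter()}
            m = {'spam': 0, 'ham': 0}
            for x, y in zip(X, Y):
                m[y] += 1
                self.counts[y].update(x.split())
            vocab = set(self.counts['spam']) | set(self.counts['ham'])
            # Fixed offsets: the log-priors are independent of the document x.
            total = m['spam'] + m['ham']
            self.offset = {y: math.log(m[y] / total) for y in ('spam', 'ham')}
            # Precompute log p(w|y) with Laplace smoothing (add 1 to every count).
            self.log_p = {}
            for y in ('spam', 'ham'):
                denom = sum(self.counts[y].values()) + len(vocab) + 1
                self.log_p[y] = {w: math.log((self.counts[y][w] + 1) / denom)
                                 for w in vocab}
                # Probability mass reserved for words never seen during training.
                self.log_p[y]['<unseen>'] = math.log(1 / denom)

        def log_likelihood_ratio(self, x):
            # log L(x) = log p(spam|x) - log p(ham|x): sum the per-word
            # log-probabilities (naive assumption) plus the fixed offsets.
            scores = {}
            for y in ('spam', 'ham'):
                s = self.offset[y]
                for w in x.split():
                    s += self.log_p[y].get(w, self.log_p[y]['<unseen>'])
                scores[y] = s
            return scores['spam'] - scores['ham']

        def classify(self, x, c=1.0):
            # Classify as spam whenever L(x) > c, i.e. log L(x) > log c.
            return self.log_likelihood_ratio(x) > math.log(c)

A possible usage, given lists X of e-mail texts and Y of labels:

    nb = NaiveBayesSpamFilter()
    nb.train(X, Y)
    print(nb.classify("buy cheap meds now"))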
Note that we performed a number of optimizations: Firstly, the normalization by 1/m_spam and 1/m_ham respectively is independent of x, hence we incorporate it as a fixed offset. Secondly, since we are computing a product over a large number of factors, the numbers might lead to numerical overflow or underflow. This can be addressed by summing over the logarithms of the terms rather than computing products. Thirdly, we need to address the issue of estimating p(w | y) for words w which we might not have seen before. One way of dealing with this is to increment all counts by 1. This method is commonly referred to as Laplace smoothing. This simple algorithm is known to perform surprisingly well, and variants of it can be found in most modern spam filters. It amounts to what is commonly known as "Bayesian spam filtering". Obviously, we may apply it to problems other than document categorization, too.