In the context of spam filtering the actual text of the e-mail x corresponds to the test and the label y is equivalent to the diagnosis. We could use the latter to infer, via Bayes rule,

p(spam | x) = p(x | spam) p(spam) / p(x).

We may have a good estimate of p(y), that is, the probability of receiving a spam or ham mail. Denote by m_ham and m_spam the number of ham and spam e-mails in X. In this case we can estimate

p(ham) ≈ m_ham / m  and  p(spam) ≈ m_spam / m,

where m = m_ham + m_spam is the total number of e-mails. The key problem, however, is that we do not know p(x | y) or p(x). We may dispose of the requirement of knowing p(x) by settling for a likelihood ratio

L(x) := p(spam | x) / p(ham | x) = p(x | spam) p(spam) / (p(x | ham) p(ham)).

Whenever L(x) exceeds a given threshold c we decide that x is spam and consequently reject the e-mail. If c is large then our algorithm is conservative and classifies an e-mail as spam only if p(spam | x) >> p(ham | x). On the other hand, if c is small then the algorithm aggressively classifies e-mails as spam.
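As a minimal illustration of this decision rule (the function and argument names below are hypothetical, not part of the original text), one might write:

    def is_spam(p_x_given_spam, p_x_given_ham, p_spam, p_ham, c=1.0):
        # Likelihood ratio L(x) = p(spam|x) / p(ham|x); the factor p(x) cancels out.
        L = (p_x_given_spam * p_spam) / (p_x_given_ham * p_ham)
        # Reject the e-mail (classify as spam) whenever L(x) exceeds the threshold c.
        return L > c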
The key obstacle is that we have no access to p(x | y). This is where we make our key approximation. In order to model the distribution of the test outcomes T1 and T2 we made the assumption that they are conditionally independent of each other given the diagnosis. Analogously, we may now treat the occurrence of each word in a document as a separate test and combine the outcomes in a naive fashion by assuming that

p(x | y) = ∏_j p(w^j | y),

where the product runs over all words of x and w^j denotes the j-th word in document x. This amounts to the assumption that the probability of occurrence of a word in a document is independent of all other words given the category of the document.
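For concreteness, this factorization can be evaluated as a plain product over the words of a document. In the sketch below, p_w_given_y is an assumed dictionary mapping each word to its estimated probability given the class y:

    def p_x_given_y(words, p_w_given_y):
        # Naive Bayes assumption: multiply the per-word probabilities, treating
        # each word as conditionally independent of the others given the class y.
        prob = 1.0
        for w in words:
            prob *= p_w_given_y[w]
        return prob

For long documents this product can easily underflow; that issue is addressed further below by working with logarithms.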
This assumption reduces the difficulty of knowing p(x | y) to that of estimating the probabilities of occurrence of individual words w. Estimates for p(w | y) can be obtained, for instance, by simply counting the frequency of occurrence of the word within documents of a given class. That is, we estimate

p(w | spam) ≈ Σ_i Σ_j {y_i = spam and w_i^j = w} / Σ_i Σ_j {y_i = spam},

where the sums run over all documents x_i and over all word positions j within x_i. Here {y_i = spam and w_i^j = w} equals 1 if and only if x_i is labeled as spam and w occurs as the j-th word in x_i. The denominator is simply the total number of words in spam documents. Similarly one can compute p(w | ham).
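A direct transcription of this counting estimate might look as follows (the argument names X and Y and the use of whitespace tokenization are assumptions made for the sake of the example):

    def estimate_p_w_given_class(X, Y, label):
        # X: list of documents (strings); Y: list of labels ('spam' or 'ham').
        # Count how often each word w occurs in documents of the given class
        # and divide by the total number of words in that class.
        counts, total = {}, 0
        for x, y in zip(X, Y):
            if y != label:
                continue
            for w in x.split():
                counts[w] = counts.get(w, 0) + 1
                total += 1
        return {w: n / total for w, n in counts.items()}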
In principle we could perform the above summation whenever we see a new document x. This would be terribly inefficient, since each such computation requires a full pass through X and Y. Instead, we can perform a single pass through X and Y and store the resulting statistics as a good estimate of the conditional probabilities. The following sketch shows one possible implementation.
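The original algorithm is not reproduced in this post, so the Python sketch below shows what such a single-pass implementation might look like; the class and method names are illustrative. It already incorporates the optimizations discussed next: class priors folded in as fixed offsets, sums of logarithms instead of products, and Laplace smoothing for unseen words.

    import math
    from collections import Counter

    class NaiveBayesSpamFilter:
        def train(self, X, Y):
            # Single pass through X and Y: accumulate word counts per class.
            self.counts = {'spam': Counter(), 'ham': Counter()}
            m = {'spam': 0, 'ham': 0}
            for x, y in zip(X, Y):
                m[y] += 1
                self.counts[y].update(x.split())
            vocab = set(self.counts['spam']) | set(self.counts['ham'])
            # Fixed offsets: the log-priors are independent of the document x.
            total = m['spam'] + m['ham']
            self.offset = {y: math.log(m[y] / total) for y in ('spam', 'ham')}
            # Precompute log p(w|y) with Laplace smoothing (add 1 to every count).
            self.log_p = {}
            for y in ('spam', 'ham'):
                denom = sum(self.counts[y].values()) + len(vocab) + 1
                self.log_p[y] = {w: math.log((self.counts[y][w] + 1) / denom)
                                 for w in vocab}
                # Probability mass reserved for words never seen during training.
                self.log_p[y]['<unseen>'] = math.log(1 / denom)

        def log_likelihood_ratio(self, x):
            # log L(x) = log p(spam|x) - log p(ham|x): sum the per-word
            # log-probabilities (naive assumption) plus the fixed offsets.
            scores = {}
            for y in ('spam', 'ham'):
                s = self.offset[y]
                for w in x.split():
                    s += self.log_p[y].get(w, self.log_p[y]['<unseen>'])
                scores[y] = s
            return scores['spam'] - scores['ham']

        def classify(self, x, c=1.0):
            # Classify as spam whenever L(x) > c, i.e. log L(x) > log c.
            return self.log_likelihood_ratio(x) > math.log(c)

A possible usage, given lists X of e-mail texts and Y of labels:

    nb = NaiveBayesSpamFilter()
    nb.train(X, Y)
    print(nb.classify("buy cheap meds now"))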
Note that we performed a number of optimizations: Firstly, the normalization by 1/m_spam and 1/m_ham respectively is independent of x, hence we incorporate it as a fixed offset. Secondly, since we are computing a product over a large number of factors, the numbers might lead to numerical overflow or underflow. This can be addressed by summing over the logarithms of the terms rather than computing products. Thirdly, we need to address the issue of estimating p(w | y) for words w which we might not have seen before. One way of dealing with this is to increment all counts by 1. This method is commonly referred to as Laplace smoothing. This simple algorithm is known to perform surprisingly well, and variants of it can be found in most modern spam filters. It amounts to what is commonly known as "Bayesian spam filtering". Obviously, we may apply it to problems other than document categorization, too.