PHP Naive Bayesian Filter 2005-03-30


Thanks to Bitflux Blog for this link to a PHP Bayesian filter.

The linked page also points to James Seng's plugin for Movable Type to do Bayesian filtering of comments.

In case you don't read French, I've done a quick and rough translation (my French is bad and I need to use it more, so feel free to correct me in the comments):

This is about filtering comments, pingbacks or trackbacks to your site. I don't do much of that myself, but the idea of a filter based on Bayes' theorem intrigued me too much to resist doing a PHP implementation.

Simple and efficient

Bayes' theorem is a simple relationship between probabilities. For example, if you have a document and two categories, spam and nonspam, it is difficult to estimate directly the probability that the document belongs to one category or the other. On the other hand, it is simple to estimate it by analysing each word of the document.
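
To make that concrete, here is a sketch in standard notation. The symbols are my own and do not come from the original article: c is a category, d is a document, and w_1 ... w_n are the words of d. Bayes' theorem rewrites the hard quantity P(c | d) in terms of P(d | c), and the "naive" assumption that words are independent given the category approximates P(d | c) by a product of per-word probabilities, which can be estimated by counting words in training documents.

```latex
P(c \mid d) = \frac{P(d \mid c)\, P(c)}{P(d)},
\qquad
P(d \mid c) \approx \prod_{i=1}^{n} P(w_i \mid c)
```

Since P(d) is the same for every category, comparing P(c) multiplied by the product of the P(w_i | c) across categories is enough to pick the most likely one.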

For the theory, a simple search on Google for "naive bayes theorem" gives you numerous references. And if English doesn't stop you, you ought to read Machine Learning in Automated Text Categorization by Fabrizio Sebastiani. If you prefer Perl to PHP, look at Ken Williams's CPAN modules, such as Algorithm::NaiveBayes.

The naive Bayes algorithm is interesting because it is fast and broadly useful. You could, for example, use it to classify comments on your site. See the filter for MT that motivated me to do it all in PHP.

Use in practice

In the archive you will find a script which allows you to train your database and run a query. It is meant to be integrated into a larger system such as your blog software.

First, use the file "mysql.sql" to initialise the database. Then use the script to create at least two categories, for example "spam" and "nonspam". After that you must train the filter a bit before testing it.

Important functions:

1. train(): To train the filter
2. untrain(): To untrain the filter
3. categorize(): To classify a document
4. updateProbabilities(): To update the probabilities in the database after a series of train() or untrain() calls.

The use of categorize() does not add any information to the database. It only returns the result of the probability calculation.
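
To illustrate how the four functions fit together, here is a minimal sketch in Python. It is not the PHP code from the archive: it keeps its counts in memory instead of MySQL, and the class name, method signatures, tokenizer and add-one smoothing are all my own assumptions, chosen only to mirror the roles described above.

```python
import math
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayesSketch:
    """Hypothetical in-memory stand-in for the database-backed filter."""

    def __init__(self, categories):
        # word_counts[category][word] -> how often the word was seen
        self.word_counts = {c: defaultdict(int) for c in categories}
        self.doc_counts = {c: 0 for c in categories}
        self.priors = {}

    def train(self, category, text):
        self.doc_counts[category] += 1
        for word in tokenize(text):
            self.word_counts[category][word] += 1

    def untrain(self, category, text):
        # Reverse an earlier train() call made with the same arguments.
        self.doc_counts[category] = max(0, self.doc_counts[category] - 1)
        counts = self.word_counts[category]
        for word in tokenize(text):
            counts[word] = max(0, counts[word] - 1)

    def update_probabilities(self):
        # Recompute category priors after a batch of train()/untrain() calls.
        total = sum(self.doc_counts.values())
        self.priors = {
            c: (n / total if total else 0.0) for c, n in self.doc_counts.items()
        }

    def categorize(self, text):
        # Read-only: scores the text without changing any stored counts.
        vocab = set()
        for counts in self.word_counts.values():
            vocab.update(w for w, n in counts.items() if n > 0)
        scores = {}
        for category, counts in self.word_counts.items():
            prior = self.priors.get(category, 0.0)
            if prior <= 0.0:
                continue  # category has no training data
            denominator = sum(counts.values()) + max(len(vocab), 1)
            score = math.log(prior)
            for word in tokenize(text):
                # Add-one smoothing so unseen words do not zero the score.
                score += math.log((counts.get(word, 0) + 1) / denominator)
            scores[category] = score
        return max(scores, key=scores.get) if scores else None


nb = NaiveBayesSketch(["spam", "nonspam"])
nb.train("spam", "cheap pills buy cheap pills now")
nb.train("nonspam", "thanks for the translation of this post")
nb.update_probabilities()
print(nb.categorize("buy cheap pills"))
```

The sketch sums log-probabilities instead of multiplying raw probabilities, which avoids numeric underflow on long documents. Whether the PHP implementation does the same is not stated in the article.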

Update: Replaced the Machine Learning URL with a working one provided by Audun. Thanks!

The correct url to the "Machine Learning in Automated Text Categorization" article is: http://www.math.unipd.it/fabseb60/Publications/ACMCS02.pdf

Audun,

Thanks a lot. I've updated the entry with the URL you provided :)

