Mailing List Archive: Re: fighting spam paper & links / naive bayes / anybody ?

[REBOL] Re: fighting spam paper & links / naive bayes / anybody ?

From: gchiu:compkarori at: 11-Sep-2002 8:15

On Mon, 26 Aug 2002 11:30:30 +1000
  "Brett Handley" <[brett--codeconscious--com]> wrote:

>Paul Graham quoted 4000 messages, I only worked with a couple of hundred
>good emails and 14 bad (all I've kept) so with such a low sample size it is
>likely that my tests of the filter will be suspect.

I've been playing around with Brett's code.

It took 17 minutes to tokenise 770 sample spam messages and
516 "good" messages (the email was first pulled from my mail
server and saved to local storage before starting the
test).

I ended up with 34052 unique tokens from the good mail,
and 60516 tokens from the spam.
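
For anyone who wants to reproduce the counting step outside REBOL, here is a rough Python sketch. It is not Brett's code. The token pattern loosely follows the description in Paul Graham's essay (letters, digits, dashes, apostrophes and dollar signs, with all-digit tokens skipped), and the names tokenise and count_tokens are made up for illustration.

```python
import re
from collections import Counter

# Assumption: a token is a run of letters, digits, dollar signs, apostrophes
# or dashes. Brett's REBOL script may well split messages differently.
TOKEN_RE = re.compile(r"[A-Za-z0-9$'-]+")


def tokenise(text):
    """Return the tokens of one message, skipping tokens that are all digits."""
    return [t for t in TOKEN_RE.findall(text) if not t.isdigit()]


def count_tokens(messages):
    """Count how often each token occurs across a corpus of messages."""
    counts = Counter()
    for message in messages:
        counts.update(tokenise(message))
    return counts


if __name__ == "__main__":
    # Tiny made-up corpora, only to show the shape of the result.
    good_counts = count_tokens(["Meeting moved to 3pm", "Patch attached, see notes"])
    spam_counts = count_tokens(["FREE offer!!! Click now", "Click here for a free offer"])
    print(len(good_counts), "unique tokens in the good mail")
    print(len(spam_counts), "unique tokens in the spam")
```

The length of each Counter is the "unique tokens" figure for that corpus. Case is left as-is here, which is a choice rather than a requirement.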

I then ran a test on the same body of good and bad emails.
The script detected one of the "good" emails as being spam,
and looking at that email, I found that I had misclassified
it as good when it was in fact spam!

The script only detected 604 of the 770 (about 78%) as being spam.
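
To make the classification step concrete, here is a sketch of the scoring as I understand it from Paul Graham's essay, written in Python rather than REBOL. I have not checked that Brett's script matches it in every detail. The function names are hypothetical. The constants (good counts doubled, a minimum of 5 occurrences, 0.4 for unknown tokens, the 15 most interesting tokens, a 0.9 threshold) are the ones from the essay, and counting each token once per message is my assumption.

```python
def token_probability(token, good, bad, ngood, nbad):
    """Spam probability for one token, or None if it has been seen too rarely."""
    g = 2 * good.get(token, 0)  # good counts are doubled to bias against false positives
    b = bad.get(token, 0)
    if g + b < 5:
        return None
    bad_ratio = min(1.0, b / nbad)
    good_ratio = min(1.0, g / ngood)
    # Clamp so that no single token can be absolutely decisive.
    return max(0.01, min(0.99, bad_ratio / (good_ratio + bad_ratio)))


def spam_probability(tokens, good, bad, ngood, nbad, top_n=15, unknown=0.4):
    """Combine the most interesting token probabilities into one score."""
    probs = []
    for token in set(tokens):  # assumption: each token counts once per message
        p = token_probability(token, good, bad, ngood, nbad)
        probs.append(unknown if p is None else p)
    # "Interesting" means far from the neutral value 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    product, inverse = 1.0, 1.0
    for p in probs[:top_n]:
        product *= p
        inverse *= 1.0 - p
    return product / (product + inverse)


if __name__ == "__main__":
    # Made-up token counts; 516 and 770 are the corpus sizes from this test.
    good = {"meeting": 10, "click": 1}
    bad = {"click": 20, "free": 15}
    score = spam_probability(["click", "free", "now"], good, bad, 516, 770)
    print("spam" if score > 0.9 else "good", score)
```

A message is treated as spam when the combined score is above 0.9. With a small training set many tokens fall under the 5-occurrence cutoff and get the neutral 0.4, which may be part of why some spam slips through.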

I suspect others will have better results than this.  My
email is already heavily filtered - I have about 40
filters running on my mail server, so the tests were run
on messages that had got through the filters.  Also, a lot of
what I consider spam actually looks like my good mail.

The only significant change I made to Brett's code was
to strip out attachments before tokenising the message.
However, I still need to decode base64-encoded text/html
messages and tokenise them rather than discarding them
along with the attachments.
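
For comparison, this is roughly what the decoding step looks like with Python's standard email package. It is only a sketch of the idea, not something I have added to the REBOL script, and extract_text_parts is a made-up name. It keeps every text/* part (plain or HTML) and lets the library undo the base64 or quoted-printable encoding.

```python
import base64
import email
from email import policy


def extract_text_parts(raw_message):
    """Return the decoded bodies of all text/* parts in a raw message."""
    msg = email.message_from_string(raw_message, policy=policy.default)
    bodies = []
    for part in msg.walk():
        # Multipart containers and binary attachments are skipped.
        if part.get_content_maintype() == "text":
            # get_content() reverses the transfer encoding and the charset.
            # It can raise on a bogus charset, which real spam may well have.
            bodies.append(part.get_content())
    return bodies


if __name__ == "__main__":
    # Made-up message with a base64-encoded HTML body.
    html = "<html><body>Buy now</body></html>"
    raw = (
        "From: someone@example.com\n"
        "Subject: test\n"
        "MIME-Version: 1.0\n"
        'Content-Type: text/html; charset="utf-8"\n'
        "Content-Transfer-Encoding: base64\n"
        "\n"
        + base64.b64encode(html.encode("utf-8")).decode("ascii")
        + "\n"
    )
    for body in extract_text_parts(raw):
        print(body)
```

The decoded HTML still contains its tags, so whether to tokenise them or strip them first remains an open choice.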

--
Graham Chiu