Mailing List Archive: 49091 messages

[REBOL] Re: fighting spam paper & links / naive bayes / anybody ?

From: gchiu:compkarori at: 11-Sep-2002 8:15

On Mon, 26 Aug 2002 11:30:30 +1000 "Brett Handley" <[brett--codeconscious--com]> wrote:
>Paul Graham quoted 4000 messages, I only worked with a
>couple of hundred good emails and 14 bad (all I've kept)
>so with such a low sample size it is likely that my tests
>of the filter will be suspect.

I've been playing around with Brett's code. It took 17 minutes to tokenise 770 sample spam messages and 516 "good" messages (the email was first pulled from my mail server and saved to local storage before the test started). I ended up with 34052 unique tokens from the good mail and 60516 tokens from the spam.

I then ran a test on the same body of good and bad emails. The script flagged one of the "good" emails as spam, and on looking at that email I found that I had misclassified it as good when it was in fact spam! The script detected only 604 of the 770 spam messages (about 78%) as spam.

I suspect others will get better results than this. My email is already heavily filtered: I have about 40 filters running on my mail server, so the tests were run on messages that had got through those filters. Also, a lot of what I consider spam actually looks like my good mail.

The only significant change I made to Brett's code was to strip out attachments before tokenising the message. However, I still need to decode base64-encoded text/html parts and tokenise them rather than discarding them as attachments.

-- Graham Chiu
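
For the step still outstanding above (decoding base64-encoded text/html parts and tokenising them instead of throwing them away), something along these lines might work. It is only a rough Python sketch, not Brett's code: the names `tokens_from_message` and `html_to_text`, the token pattern, and the latin-1 fallback are all assumptions made for illustration, and headers are not tokenised here.

```python
import re
from email import message_from_string
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text between HTML tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    """Return the text content of an HTML string, tags removed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Assumed token pattern: runs of letters, digits, dashes, apostrophes, dollar signs.
TOKEN_RE = re.compile(r"[A-Za-z0-9$'-]+")


def tokens_from_message(raw):
    """Tokenise the text/plain and text/html parts of a raw message string."""
    msg = message_from_string(raw)
    tokens = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no text of their own
        ctype = part.get_content_type()
        if ctype not in ("text/plain", "text/html"):
            continue  # non-text attachments are still dropped
        # decode=True undoes the base64 / quoted-printable transfer encoding
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "latin-1"
        try:
            text = payload.decode(charset, errors="replace")
        except LookupError:
            # unknown charset label: fall back to latin-1 (an assumption)
            text = payload.decode("latin-1", errors="replace")
        if ctype == "text/html":
            text = html_to_text(text)
        tokens.extend(TOKEN_RE.findall(text.lower()))
    return tokens
```

Note that this naive HTML handling also keeps the contents of any script or style elements, which may or may not be wanted when building the token tables.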