[REBOL] Re: Rebol indexer

From: hallvard:ystad:helpinhand at: 25-Aug-2003 23:31

Dixit Andreas Bolka (21.14 25.08.2003):
>Saturday, August 23, 2003, 5:56:50 PM, Hallvard wrote:
>> http://folk.uio.no/hallvary/rix/
>looks nice :)
Thanks.
>> * The robot obeys robots.txt (agent id: RixBot)
>would you like to factor that code out, so that future writers of bots
>could reuse that bit :) ?
It's on my rebsite, Diddeley-do, available from /view desktop. Also (also? It's the very same file!) on my website: http://folk.uio.no/hallvary/rebol/server.r.

I peeked a bit at the ht://dig package and constructed this as an object. It has a function to see whether a url is permitted or not:

    forbidden? "/some/path/with/file.html"

Forbidden paths are stored in a hash! I reconstruct such objects from a MySQL database every now and then. I think it would probably be less memory consuming to have the forbidden? function as a global word, and keep nothing but the object's hash! of forbidden paths (a rough sketch of that idea is at the end of this message). Some guru might have a qualified opinion on this? I'm nothing but a script kiddie myself.

The robots.txt standard is unclear about one thing. Is this allowed:

    user-agent: someAgent
    user-agent: someOtherAgent
    user-agent: someOtherAgentsAunt
    disallow: /

Or must it be like this:

    user-agent: someAgent
    disallow: /

    user-agent: someOtherAgent
    disallow: /

    user-agent: someOtherAgentsAunt
    disallow: /

I believe there really _should_ be one disallow: line for each user-agent, but I've seen examples of the first approach (http://www.rebol.org/robots.txt, for instance), so the script also accepts those.

Comments are welcome.

Regards,
Hallvard
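P.S. Something like this is what I have in mind for the "global word plus bare hash!" idea. Just a sketch with made-up paths, not the actual code in server.r:

    REBOL [Title: "Sketch: global forbidden? over a plain hash! of paths"]

    ; per-site data is nothing but a hash! of disallowed path prefixes
    ; (made-up entries; in the bot these would come from robots.txt / MySQL)
    site-paths: make hash! ["/cgi-bin/" "/private/"]

    ; one global function shared by all sites, instead of one per object
    forbidden?: func [paths [hash!] path [string!]] [
        foreach rule paths [
            ; true if the path starts with a disallowed prefix
            if equal? rule copy/part path length? rule [return true]
        ]
        false
    ]

    print forbidden? site-paths "/private/secret.html"  ; true
    print forbidden? site-paths "/index.html"           ; false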