[REBOL] Re: Rebol indexer
From: hallvard:ystad:helpinhand at: 25-Aug-2003 23:31
Andreas Bolka wrote (21.14 25.08.2003):
>Saturday, August 23, 2003, 5:56:50 PM, Hallvard wrote:
>> http://folk.uio.no/hallvary/rix/
>looks nice :)
Thanks.
>> * The robot obeys robots.txt (agent id: RixBot)
>would you like to factor that code out, so that future writers of bots
>could reuse that bit :) ?
It's on my rebsite, Diddeley-do, available from /view desktop. Also (also? It's the very
same file!) on my website: http://folk.uio.no/hallvary/rebol/server.r.
I peeked a bit at the ht://dig package and constructed this as an object. It has
a function to check whether a URL is permitted or not:
forbidden? "/some/path/with/file.html"
Forbidden paths are stored in a hash!
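For illustration, here's roughly the shape of it (an untested sketch; the object
name and the paths are made up):

robots-rules: make object! [
    ;; disallowed path prefixes for one host (made-up examples)
    paths: make hash! ["/cgi-bin/" "/private/"]
    forbidden?: func [url-path [string!] /local p] [
        ;; a path is off-limits when it begins with a disallowed prefix
        foreach p paths [
            if find/match url-path p [return true]
        ]
        false
    ]
]
robots-rules/forbidden? "/some/path/with/file.html"  ;== false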
I reconstruct such objects from a MySQL database every now and then. I suspect it
would consume less memory to have the forbidden? function as a single global word
and keep nothing but each object's hash! of forbidden paths. Some guru might have
a qualified opinion on this? I'm nothing but a script kiddie myself.
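The refactoring I have in mind looks roughly like this (untested, and
rebol-org-paths is just a made-up name): one global forbidden? that takes the
hash! as an argument, so each site keeps only its data:

forbidden?: func [url-path [string!] paths [hash!] /local p] [
    foreach p paths [
        if find/match url-path p [return true]
    ]
    false
]

;; per site, nothing but the data survives:
rebol-org-paths: make hash! ["/cgi-bin/"]  ;; made-up prefix
forbidden? "/cgi-bin/search.r" rebol-org-paths  ;== true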
The robots.txt standard is unclear about one thing. Is this allowed:
user-agent: someAgent
user-agent: someOtherAgent
user-agent: someOtherAgentsAunt
disallow: /
Or must it be like this:
user-agent: someAgent
disallow: /
user-agent: someOtherAgent
disallow: /
user-agent: someOtherAgentsAunt
disallow: /
I believe there really _should_ be one disallow: line for each user-agent, but
I've seen examples of the first approach (http://www.rebol.org/robots.txt, for
instance), so the script also accepts those.
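The trick is just to let consecutive user-agent: lines accumulate before the
disallow: lines are applied. Roughly like this (a simplified, untested sketch of
the idea, not the actual code from server.r):

parse-robots: func [
    "Collect disallow prefixes for one agent from robots.txt text"
    txt [string!] agent [string!]
    /local paths applies? in-group? line name value match?
][
    paths: make hash! []
    applies?: in-group?: false
    foreach line parse/all txt "^/" [
        trim line
        ;; skip blank lines and # comments
        if all [not empty? line #"#" <> first line] [
            if parse/all line [copy name to ":" skip copy value to end] [
                trim value
                switch lowercase name [
                    "user-agent" [
                        match?: any [value = "*" value = agent]
                        ;; consecutive user-agent lines form one group (the first form above)
                        either in-group? [applies?: any [applies? match?]] [applies?: match?]
                        in-group?: true
                    ]
                    "disallow" [
                        ;; a disallow line closes the current user-agent group
                        in-group?: false
                        if all [applies? not empty? value] [append paths value]
                    ]
                ]
            ]
        ]
    ]
    paths
]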
Comments are welcome.
Regards,
Hallvard