[REBOL] Re: Rebol indexer
From: hallvard:ystad:helpinhand at: 25-Aug-2003 23:31
Andreas Bolka wrote (21.14 25.08.2003):
>Saturday, August 23, 2003, 5:56:50 PM, Hallvard wrote:
>> http://folk.uio.no/hallvary/rix/
>looks nice :)
Thanks.
>> * The robot obeys robots.txt (agent id: RixBot)
>would you like to factor that code out, so that future writers of bots
>could reuse that bit :) ?
It's on my rebsite, Diddeley-do, available from /view desktop. Also (also? It's the very
same file!) on my website: http://folk.uio.no/hallvary/rebol/server.r.
I peeked a bit at the ht://dig package and constructed this as an object. It has
a function to check whether a URL is permitted or not:
forbidden? "/some/path/with/file.html"
Forbidden paths are stored in a hash!
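For illustration, here's roughly the shape of it (an untested sketch; the object
name and the paths are made up):

robots-rules: make object! [
    ;; disallowed path prefixes for one host (made-up examples)
    paths: make hash! ["/cgi-bin/" "/private/"]
    forbidden?: func [url-path [string!] /local p] [
        ;; a path is off-limits when it begins with a disallowed prefix
        foreach p paths [
            if find/match url-path p [return true]
        ]
        false
    ]
]
robots-rules/forbidden? "/some/path/with/file.html"  ;== false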
I reconstruct such objects from a MySQL database every now and then. I suspect it
would consume less memory to have the forbidden? function as a single global word
and keep nothing but each object's hash! of forbidden paths. Some guru might have
a qualified opinion on this? I'm nothing but a script kiddie myself.
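The refactoring I have in mind looks roughly like this (untested, and
rebol-org-paths is just a made-up name): one global forbidden? that takes the
hash! as an argument, so each site keeps only its data:

forbidden?: func [url-path [string!] paths [hash!] /local p] [
    foreach p paths [
        if find/match url-path p [return true]
    ]
    false
]

;; per site, nothing but the data survives:
rebol-org-paths: make hash! ["/cgi-bin/"]  ;; made-up prefix
forbidden? "/cgi-bin/search.r" rebol-org-paths  ;== true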
The robots.txt standard is unclear about one thing. Is this allowed:
user-agent: someAgent
user-agent: someOtherAgent
user-agent: someOtherAgentsAunt
disallow: /
Or must it be like this:
user-agent: someAgent
disallow: /
user-agent: someOtherAgent
disallow: /
user-agent: someOtherAgentsAunt
disallow: /
I believe there really _should_ be one disallow: line for each user-agent, but
I've seen examples of the first approach (http://www.rebol.org/robots.txt, for
instance), so the script also accepts those.
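The trick is just to let consecutive user-agent: lines accumulate before the
disallow: lines are applied. Roughly like this (a simplified, untested sketch of
the idea, not the actual code from server.r):

parse-robots: func [
    "Collect disallow prefixes for one agent from robots.txt text"
    txt [string!] agent [string!]
    /local paths applies? in-group? line name value match?
][
    paths: make hash! []
    applies?: in-group?: false
    foreach line parse/all txt "^/" [
        trim line
        ;; skip blank lines and # comments
        if all [not empty? line #"#" <> first line] [
            if parse/all line [copy name to ":" skip copy value to end] [
                trim value
                switch lowercase name [
                    "user-agent" [
                        match?: any [value = "*" value = agent]
                        ;; consecutive user-agent lines form one group (the first form above)
                        either in-group? [applies?: any [applies? match?]] [applies?: match?]
                        in-group?: true
                    ]
                    "disallow" [
                        ;; a disallow line closes the current user-agent group
                        in-group?: false
                        if all [applies? not empty? value] [append paths value]
                    ]
                ]
            ]
        ]
    ]
    paths
]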
Comments are welcome.
Regards,
Hallvard