Mailing List Archive: Re: Need some url purification functions

[REBOL] Re: Need some url purification functions

From: al::bri::xtra::co::nz at: 11-Jan-2001 15:49


> I have made rules that check whether the extracted url has http:// at the
> beginning; if not, they prepend the domain name that was being used while
> crawling. The code snippet for the same is given below.
> This algo definitely has tons of issues (viz. what happens if the url being
> crawled is not http://www.yahoo.com but http://www.yahoo.com/temp.html: will
> the new relative url ./demo become http://www.yahoo.com/temp.html/demo? What
> should happen if the relative url is ../temp or, worse, ../../../temp? And
> the same relative urls can be written as ./demo or demo or /demo, etc.)
> I wanted to know if I am thinking in the right direction. Is there a simpler
> way of achieving what I want, or do I have to write rules for each
> condition? Are there any readymade functions or Rebol code that might give
> me a purified absolute url based on certain inputs?

You're roughly on the right track. Use 'load/markup to automatically split
the HTML into tag! and string! values -- this saves a lot of time.
Also consider absolute URLs inside JavaScript (you need to scan inside the
JavaScript code). You also need to write parse code to handle URIs. Check
out the RFCs (I've forgotten the numbers); there are several on URLs, URIs
and email that are very helpful. Also, the construct:
        base/:File
    is very useful for forming absolute URLs.
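
For the relative-url cases raised in the question (./demo, demo, /demo,
../temp), here is a minimal sketch of how resolution against the page's url
usually works. It is written in Python rather than REBOL, purely as an
illustration, and it is not the private script mentioned in this post. It
assumes the standard library's urllib.parse.urljoin, which follows the RFC
3986 reference-resolution rules; the helper name absolutize and the sample
urls are hypothetical.

```python
from urllib.parse import urljoin, urldefrag


def absolutize(base_url, link):
    """Resolve a link found on the page at base_url into an absolute URL.

    Hypothetical helper for illustration only. The fragment (#...) is
    dropped so the same page is not queued twice under different anchors.
    """
    absolute = urljoin(base_url, link.strip())
    return urldefrag(absolute).url


if __name__ == "__main__":
    # Sample page being crawled (an assumption, taken from the question).
    base = "http://www.yahoo.com/temp.html"
    for link in ["./demo", "demo", "/demo", "http://www.rebol.com/"]:
        print(link, "->", absolutize(base, link))
```

The point to note is that a relative link is resolved against the directory
of the page, not appended to the page's own file name: with the page
http://www.yahoo.com/temp.html, the link ./demo gives
http://www.yahoo.com/demo, never http://www.yahoo.com/temp.html/demo. Links
that are already absolute pass through unchanged.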

I've got a script which handles all of this, but it was written under
contract, so it's private and not for free use.

I hope that helps!

Andrew Martin
ICQ: 26227169 http://members.nbci.com/AndrewMartin/