Mailing List Archive: 49091 messages

[REBOL] Re: Need some url purification functions

From: al::bri::xtra::co::nz at: 11-Jan-2001 15:49

> I have made rules that check whether the extracted url has http:// at the
> beginning; if not, they append the domain name that was being used while
> crawling. The code snippet for the same is given below.
>
> This algo definitely has tons of issues. For example, what happens if the
> url being crawled is not http://www.yahoo.com but
> http://www.yahoo.com/temp.html? Will the new relative url ./demo become
> http://www.yahoo.com/temp.html/demo? What should happen if the relative
> url is ../temp or, worse, ../../../temp? And the same relative urls can be
> written as ./demo or demo or /demo etc.
>
> I wanted to know if I am thinking in the right direction. Is there a
> simpler way of achieving what I want, or do I have to write rules for each
> condition? Are there any ready-made functions or REBOL code that might
> give me a purified absolute url based on certain inputs?

You're roughly on the right track.

Use 'load/markup to automatically split the HTML into tag! and string! datatypes; this saves a lot of time. Also consider absolute URLs inside JavaScript (you need to scan inside the JavaScript code).

You also need to write parse code to handle URIs. Check out the RFCs (I've forgotten the numbers); there are several on URLs, URIs and email that are very helpful.

The construct base/:File is also very useful for forming absolute URLs.

I've got a script which handles all of this, but it was written under contract, so it's private and not for free use.

I hope that helps!

Andrew Martin
ICQ: 26227169
http://members.nbci.com/AndrewMartin/
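
For the relative-URL cases raised in the question (demo, ./demo, /demo, ../temp), here is a minimal sketch of the resolution rules those RFCs describe. It is written in Python rather than REBOL, only because Python's standard library ships a resolver (urllib.parse.urljoin) that makes the expected behaviour easy to check. The helper name absolutize and the sample URLs are illustrative assumptions, not code from this thread or from the private script mentioned above.

```python
from urllib.parse import urljoin


def absolutize(base, link):
    """Resolve `link` against the page URL `base` (hypothetical helper).

    urljoin replaces the last path segment of the base (e.g. temp.html)
    rather than appending to it, and collapses "." and ".." segments.
    Links that are already absolute are returned unchanged.
    """
    return urljoin(base, link)


if __name__ == "__main__":
    # Assumed sample page URL, taken from the example in the question.
    page = "http://www.yahoo.com/temp.html"
    for link in ["demo", "./demo", "/demo", "http://www.rebol.com/"]:
        print(link, "->", absolutize(page, link))

    # A deeper page shows how ".." climbs one directory level.
    print(absolutize("http://www.yahoo.com/a/b/temp.html", "../demo"))
```

With these rules, ./demo found on http://www.yahoo.com/temp.html resolves to http://www.yahoo.com/demo, not http://www.yahoo.com/temp.html/demo, so there is no need for a hand-written rule per spelling of the relative link.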