Mailing List Archive: 49091 messages

[REBOL] Need some url purification functions

From: gunjan:ezeenet at: 10-Jan-2001 21:56

Hi Gurus,

In the endeavor to learn REBOL, I have decided to make a custom GUI crawler (OK, OK, this was the most creative and challenging thing that a 5-day-old in REBOL could think of :) and the good news is that it does crawl ALMOST as desired.

The problem arises when my crawler extracts URLs like /demo, ../demo, ../../demo or /demo/index.html, i.e. when a relative path is used instead of an absolute path. Then my poor crawler gets absolutely confused and does not know what to do. (Imagine it trying to get a site called ../demo :( )

I have made rules that check whether the extracted URL has http:// at the beginning. If it does not, the crawler prepends the domain name that was being used while crawling. The code snippet for this is given below.

This algo definitely has tons of issues. What happens if the URL being crawled is not http://www.yahoo.com but http://www.yahoo.com/temp.html? Will the relative URL ./demo become http://www.yahoo.com/temp.html/demo? What should happen if the relative URL is ../temp or, worse, ../../../temp? And then the same relative URLs can be written as ./demo or demo or /demo etc.

I wanted to know if I am thinking in the right direction. Is there a simpler way of achieving what I want, or do I have to write rules for each condition? Are there any ready-made functions or REBOL code that might give me a purified absolute URL based on certain inputs?

Here is the code:

;start of code snippet
active-domain: make url! ""

get-links: func [site [url!]] [
    tags: make block! 0
    text: make string! 0
    html-code: [
        copy tag ["<" thru ">"] (append tags tag)
        | copy txt to "<" (append text txt)
    ]
    page: read site
    parse page [to "<" some html-code]
    foreach tag tags [
        if parse tag ["<A" thru "HREF=" [{"} copy link to {"} | copy link to ">"] to end] [
            print ["Before : " link]
            dirs: parse link "/"
            ;first I check whether the link has http: in the beginning or not
            if not dirs/1 = "http:" [
                ;if not then check whether link begins with / or not
                either dirs/1 = "" [
                    link: join active-domain link
                ][
                    link: join active-domain ["/" link]
                ]
            ]
            print ["After : " link]
        ]
    ]
]

site: ask "Enter the site : "
site: to-url join "http://" site
active-domain: site
print ["Just set the active domain to " active-domain]
get-links site
;end of code snippet

-----------------------------------------
Gunjan Karun
Technical Presales Consultant
Zycus, Mumbai, India
Tel: +91-22-8730591/8760625/8717251 Extension: 120
Fax: +91-22-8717251
URL: http://www.zycus.com
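
The cases listed in the message (demo, ./demo, /demo, ../demo, ../../demo) can all be handled by one rule set instead of one rule per case: split the page URL into host and path, drop the last path segment, append the link, then walk the segments and let ".." remove the segment before it. What follows is a minimal sketch of that idea, not a built-in REBOL function. The name resolve-url and all of its local words are hypothetical, and it assumes the base is an http:// URL without a query string or fragment.

```rebol
REBOL [Title: "Relative URL resolution (sketch)"]

; resolve-url is a hypothetical helper, not a REBOL built-in.
; Assumptions: base is an http:// URL with no query string or fragment,
; and link is the raw HREF value as a string.
resolve-url: func [
    base [url!] link [string!]
    /local root path segs out trail
][
    ; An absolute link needs no resolution.
    if find/match link "http://" [return to-url link]

    ; Split the base into "http://host" (root) and the path after it.
    parse/all to-string base [
        copy root ["http://" [to "/" | to end]]
        copy path to end
    ]
    if any [none? path  empty? path] [path: copy "/"]

    either find/match link "/" [
        ; Root-relative link: it replaces the whole base path.
        path: copy link
    ][
        ; Otherwise drop the last segment of the base path (e.g. temp.html)
        ; and put the link after the remaining directory.
        path: join copy/part path next find/last path "/" link
    ]

    ; Walk the segments: skip "" and ".", let ".." pop the previous one.
    segs: parse/all path "/"
    trail: any [#"/" = last path  found? find ["." ".."] last segs]
    out: copy []
    foreach seg segs [
        if not any [empty? seg  seg = "."] [
            either seg = ".." [
                if not empty? out [remove back tail out]
            ][
                append out seg
            ]
        ]
    ]

    ; Rebuild the path, keeping a trailing slash where one is implied.
    path: copy ""
    foreach seg out [append path join "/" seg]
    if any [trail  empty? out] [append path "/"]
    to-url join root path
]

; Usage: resolve a few of the link forms from the message.
base: http://www.yahoo.com/dir/temp.html
foreach link ["demo" "./demo" "/demo" "../demo" "../../demo"] [
    print [link "->" resolve-url base link]
]
```

Under these assumptions, ./demo found on http://www.yahoo.com/temp.html is meant to resolve to http://www.yahoo.com/demo rather than http://www.yahoo.com/temp.html/demo, and a ".." that would climb above the site root is simply ignored. Inside get-links, the page URL being read (site) would be passed as the base instead of active-domain, since the directory of the current page matters for relative links.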