Need some url purification functions
[1/7] from: gunjan:ezeenet at: 10-Jan-2001 21:56
Hi Gurus,
In the endeavor to learn REBOL, I have decided to make a custom GUI crawler (OK, OK, this
was the most creative and challenging thing that a 5-day-old in REBOL could think of
:) and the good news is that it does crawl ALMOST as desired.
The problem arises when my crawler extracts URLs like /demo or ../demo or ../../demo or
/demo/index.html etc., i.e. when a relative path is used instead of an absolute path. Then
my poor crawler gets absolutely confused and does not know what to do. (Imagine it trying
to get a site called ../demo :(
I have made rules that check whether the extracted URL has http:// at the beginning
or not; if not, the crawler appends the domain name that was being used while crawling.
The code snippet for the same is given below.
This algo definitely has tons of issues (viz. what happens if the URL being crawled
is not http://www.yahoo.com but http://www.yahoo.com/temp.html? Will the new relative
URL ./demo become http://www.yahoo.com/temp.html/demo? What should happen if the relative
URL is ../temp, or worse ../../../temp? And the same relative URLs can be written as
./demo or demo or /demo etc.)
I wanted to know if I am thinking in the right direction, and whether there is a simpler way
of achieving what I want or I have to write rules for each condition. Are there any ready-made
functions or REBOL code that might give me a purified absolute URL based on certain inputs?
Here is the code
;start of code snippet
active-domain: make url! ""
get-links: func [site [url!]] [
    tags: make block! 0
    text: make string! 0
    html-code: [
        copy tag ["<" thru ">"] (append tags tag) | copy txt to "<" (append text txt)
    ]
    page: read site
    parse page [to "<" some html-code]
    foreach tag tags [
        if parse tag [
            "<A" thru "HREF="
            [{"} copy link to {"} | copy link to ">"]
            to end
        ][
            print ["Before : " link]
            dirs: parse link "/"
            ;first I check whether the link has http: in the beginning or not
            if not dirs/1 = "http:" [
                ;if not then check whether link begins with / or not
                either dirs/1 = "" [
                    link: join active-domain link
                ][
                    link: join active-domain ["/" link]
                ]
            ]
            print ["After : " link]
        ]
    ]
]
site: ask "Enter the site : "
site: to-url join "http://" site
active-domain: site
print ["Just set the active domain to " active-domain]
get-links site
;end of code snippet
-----------------------------------------
Gunjan Karun
Technical Presales Consultant
Zycus, Mumbai, India
Tel: +91-22-8730591/8760625/8717251
Extension: 120
Fax: +91-22-8717251
URL: http://www.zycus.com
[2/7] from: al::bri::xtra::co::nz at: 11-Jan-2001 15:49
> I have made rules that will check whether the url extracted has http:// at
> the beginning or not, if not then it appends the domain name that was being
> used. while crawling. the code snippet for the same is given below.
> This algo definitely has tons of issues (viz. what happens if the url that
> is being crawling is not http://www.yahoo.com but
> http://www.yahoo.com/temp.html, will the new relative url ./demo become
> http://www.yahoo.com/temp.html/demo, what should happen if the relative url
> is ../temp worse ../../../temp and then the same relative urls can be
> written as ./demo or demo or /demo etc)
> I wanted to know if I am thinking in the right direction and is there a
> simpler way of achieving what I want or do I have to write rules for each
> condition. Are there any readymade functions or Rebol code that might give
> me a purified absolute url based on certain inputs)
You're roughly on the right track. Use 'load/markup to automatically split
the HTML into tag! and string! datatypes -- this saves a lot of time.
Consider also absolute URLs inside JavaScript (you need to scan inside the
JavaScript code). Also, you need to write parse code to handle URIs. Check
out the RFCs (I've forgotten the numbers); there are several on URLs, URIs
and email that are very helpful. Also, the construct:
base/:File
is very useful for forming absolute URLs.
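A minimal sketch of these two ideas (the page URL and the simplified href-matching rule here are just assumptions for illustration, not Andrew's contracted script):

```rebol
REBOL []

; Split the HTML into tag! and string! values with load/markup,
; then collect href values from anchor tags (simplified rule).
page: read http://www.example.com
links: make block! 20
foreach item load/markup page [
    if all [
        tag? item
        parse item [thru {href="} copy link to {"} to end]
    ][
        append links link
    ]
]
probe links

; The path notation base/:file evaluates the word 'file and joins
; its value onto the base, forming an absolute URL:
base: http://www.example.com/docs
file: %index.html
probe base/:file
```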
I've got a script which handles this all, but it's written under contract.
It's private, not for free use.
I hope that helps!
Andrew Martin
ICQ: 26227169 http://members.nbci.com/AndrewMartin/
[3/7] from: rebol:techscribe at: 10-Jan-2001 22:15
Hi,
If I understand your problem correctly, then the following code -
adapted from REBOL's built-in net-utils (source net-utils) - should be
helpful (code follows after my sign-off).
Usage:
>> print mold url-parser/parse-url http://www.yahoo.com/sub-directory/filename.html
make object! [
scheme: "http"
username: none
password: none
host: "www.yahoo.com"
port: none
path: "sub-directory/"
target: "filename.html"
tag: none
]
Hope this helps,
Elan
REBOL []
URL-Parser: make object! [
    scheme: none
    user: none
    pass: none
    host: none
    port-id: none
    path: none
    target: none
    tag: none
    p2: none
    digit: make bitset! #{
        000000000000FF03000000000000000000000000000000000000000000000000
    }
    alpha-num: make bitset! #{
        000000000000FF03FEFFFF07FEFFFF0700000000000000000000000000000000
    }
    scheme-char: make bitset! #{
        000000000068FF03FEFFFF07FEFFFF0700000000000000000000000000000000
    }
    path-char: make bitset! #{
        00000000F17FFFA7FFFFFFAFFEFFFF5700000000000000000000000000000000
    }
    user-char: make bitset! #{
        00000000F87CFF23FEFFFF87FEFFFF1700000000000000000000000000000000
    }
    pass-char: make bitset! #{
        FFF9FFFFFEFFFFFFFEFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
    }
    url-rules: [scheme-part user-part host-part path-part file-part tag-part]
    scheme-part: [copy scheme some scheme-char #":" ["//" | none]]
    user-part: [copy user uchars [#":" pass-part | none] #"@" | none (user: pass: none)]
    pass-part: [copy pass to #"@" [skip copy p2 to "@" (append append pass "@" p2) | none]]
    host-part: [copy host uchars [#":" copy port-id digits | none]]
    path-part: [slash copy path path-node | none]
    path-node: [pchars slash path-node | none]
    file-part: [copy target pchars | none]
    tag-part: [#"#" copy tag pchars | none]
    uchars: [some user-char | none]
    pchars: [some path-char | none]
    digits: [1 5 digit]
    parse-url: func [
        {Return url dataset or cause an error if not a valid URL}
        url
    ][
        parse/all url url-rules
        return make object! compose [
            scheme: (scheme)
            username: (user)
            password: (pass)
            host: (host)
            port: (port-id)
            path: (path)
            target: (target)
            tag: (tag)
        ]
    ]
]
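One way to use the parser's output (a hypothetical helper, not part of net-utils) is to rebuild the directory portion of the base URL, so relative links can be appended to it:

```rebol
; Hypothetical helper: given the object returned by parse-url above,
; reconstruct the base URL up to (and including) the directory part.
base-of: func [obj [object!]] [
    to-url rejoin [
        obj/scheme "://" obj/host
        either obj/path [join "/" obj/path] ["/"]
    ]
]
; For the parse of http://www.yahoo.com/sub-directory/filename.html
; this would yield http://www.yahoo.com/sub-directory/ -- a link like
; demo or ./demo can then simply be joined onto it.
```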
[4/7] from: al:bri:xtra at: 11-Jan-2001 19:25
> then the following code - adapted from REBOL's built-in net-utils (source
net-utils) - should be helpful (Code follows after my sign-off).
I learn something new every day! Thanks for showing this, Elan!
Andrew Martin
ICQ: 26227169 http://members.nbci.com/AndrewMartin/
[5/7] from: rebol:techscribe at: 10-Jan-2001 23:30
Thanks Andrew.
I just realized that the net-utils version is right in setting the
object's values to none before beginning to parse, otherwise old values
may continue to live. In short, we need a vars block, and the parse-url
function should be modified as follows:
make object! [
    ....
    ....
    vars: [scheme user pass host port-id path target tag]
    parse-url: func [
        {Return url dataset or cause an error if not a valid URL}
        url
    ][
        set vars none
        ...
        ....
    ]
]
The reason is that the included parse rules do not under all
circumstances clear out values that result from previous parsing. My
vars block is probably overkill - compare to the original in parse-url
in net-utils/URL-parser - but it can't hurt.
Take Care,
Elan
[6/7] from: gunjan:ezeenet at: 11-Jan-2001 21:19
Thanks folks for all the great tips. Lemme work on it tonight, and tomorrow
I'll have some more queries.
Gunjan
[7/7] from: g:santilli:tiscalinet:it at: 11-Jan-2001 19:13
Hello Gunjan!
On 10-Jan-01, you wrote:
GK> The problem arises when my crawler extracts urls like /demo
GK> or ../demo or../../demo or /demo/index.html etc i.e. when a
GK> relative path is used instead of an absolute path. Then my
GK> poor crawlers gets absolutely confused and does not know what
GK> to do.
Some hints that should help you start out:
>> url-obj: make object! [user: pass: host: port-id: path: target: none]
>> net-utils/url-parser/parse-url url-obj http://www.yahoo.com/temp.html
== "temp.html"
>> print mold url-obj
make object! [
user: none
pass: none
host: "www.yahoo.com"
port-id: none
path: none
target: "temp.html"
]
>> net-utils/url-parser/parse-url url-obj http://www.yahoo.com/dir/temp.html
== "temp.html"
>> print mold url-obj
make object! [
user: none
pass: none
host: "www.yahoo.com"
port-id: none
path: "dir/"
target: "temp.html"
]
>> clean-path join %/ [url-obj/path %./demo]
== %/dir/demo
>> clean-path join %/ [url-obj/path %../demo]
== %/demo
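Putting the hints above together, a resolver could look something like this (a sketch only: the function name is made up, it assumes http links, and it ignores the port, username and query parts):

```rebol
resolve-link: func [base [url!] link [string!] /local obj path] [
    obj: make object! [user: pass: host: port-id: path: target: none]
    net-utils/url-parser/parse-url obj base
    ; let clean-path collapse any ./ and ../ segments in the link
    path: clean-path join %/ [any [obj/path ""] to-file link]
    to-url rejoin ["http://" obj/host path]
]
```

For example, resolve-link http://www.yahoo.com/dir/temp.html "../demo" should give http://www.yahoo.com/demo, matching the clean-path results above.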
HTH,
Gabriele.
--
Gabriele Santilli <[giesse--writeme--com]> - Amigan - REBOL programmer
Amiga Group Italia sez. L'Aquila -- http://www.amyresource.it/AGI/