Need some url purification functions
[1/7] from: gunjan:ezeenet at: 10-Jan-2001 21:56
Hi Gurus,
In the endeavor to learn REBOL, I have decided to make a custom GUI crawler (OK, OK, this
was the most creative and challenging thing that a 5-day-old in REBOL could think of
:) and the good news is that it does crawl ALMOST as desired.
The problem arises when my crawler extracts URLs like /demo or ../demo or ../../demo or
/demo/index.html etc., i.e. when a relative path is used instead of an absolute path. Then
my poor crawler gets absolutely confused and does not know what to do. (Imagine it trying
to get a site called ../demo :(
I have made rules that check whether the extracted URL has http:// at the beginning
or not; if not, the crawler appends the domain name that was being used while crawling.
The code snippet for the same is given below.
This algo definitely has tons of issues (viz. what happens if the URL being crawled
is not http://www.yahoo.com but http://www.yahoo.com/temp.html? Will the new relative
URL ./demo become http://www.yahoo.com/temp.html/demo? What should happen if the relative
URL is ../temp, or worse ../../../temp? And the same relative URLs can be written as
./demo or demo or /demo etc.)
I wanted to know if I am thinking in the right direction, and whether there is a simpler way
of achieving what I want or I have to write rules for each condition. Are there any ready-made
functions or REBOL code that might give me a purified absolute URL based on certain inputs?
Here is the code
;start of code snippet
active-domain: make url! ""
get-links: func [site [url!]] [
    tags: make block! 0
    text: make string! 0
    html-code: [
        copy tag ["<" thru ">"] (append tags tag) | copy txt to "<" (append text txt)
    ]
    page: read site
    parse page [to "<" some html-code]
    foreach tag tags [
        if parse tag [
            "<A" thru "HREF="
            [{"} copy link to {"} | copy link to ">"]
            to end
        ][
            print ["Before : " link]
            dirs: parse link "/"
            ;first I check whether the link has http: in the beginning or not
            if not dirs/1 = "http:" [
                ;if not then check whether link begins with / or not
                either dirs/1 = "" [
                    link: join active-domain link
                ][
                    link: join active-domain ["/" link]
                ]
            ]
            print ["After : " link]
        ]
    ]
]
site: ask "Enter the site : "
site: to-url join "http://" site
active-domain: site
print ["Just set the active domain to " active-domain]
get-links site
;end of code snippet
-----------------------------------------
Gunjan Karun
Technical Presales Consultant
Zycus, Mumbai, India
Tel: +91-22-8730591/8760625/8717251
Extension: 120
Fax: +91-22-8717251
URL: http://www.zycus.com
[2/7] from: al::bri::xtra::co::nz at: 11-Jan-2001 15:49
> I have made rules that will check whether the url extracted has http:// at
> the beginning or not, if not then it appends the domain name that was being
> used. while crawling. the code snippet for the same is given below.
> This algo definitely has tons of issues (viz. what happens if the url that
> is being crawling is not http://www.yahoo.com but
> http://www.yahoo.com/temp.html, will the new relative url ./demo become
> http://www.yahoo.com/temp.html/demo, what should happen if the relative url
> is ../temp worse ../../../temp and then the same relative urls can be
> written as ./demo or demo or /demo etc)
> I wanted to know if I am thinking in the right direction and is there a
> simpler way of achieving what I want or do I have to write rules for each
> condition. Are there any readymade functions or Rebol code that might give
> me a purified absolute url based on certain inputs)
You're roughly on the right track. Use 'load/markup to automatically split
the HTML into tag! and string! datatypes -- this saves a lot of time.
Consider also absolute URLs inside JavaScript (you need to scan inside the
JavaScript code). Also, you need to write parse code to handle URIs. Check
out the RFCs (I've forgotten the numbers); there are several on URLs, URIs
and email that are very helpful. Also, the construct:
base/:File
is very useful for forming absolute URLs.
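A minimal sketch of these two ideas (the page URL and the simplified href-matching rule here are just assumptions for illustration, not Andrew's contracted script):

```rebol
REBOL []

; Split the HTML into tag! and string! values with load/markup,
; then collect href values from anchor tags (simplified rule).
page: read http://www.example.com
links: make block! 20
foreach item load/markup page [
    if all [
        tag? item
        parse item [thru {href="} copy link to {"} to end]
    ][
        append links link
    ]
]
probe links

; The path notation base/:file evaluates the word 'file and joins
; its value onto the base, forming an absolute URL:
base: http://www.example.com/docs
file: %index.html
probe base/:file
```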
I've got a script which handles this all, but it's written under contract.
It's private, not for free use.
I hope that helps!
Andrew Martin
ICQ: 26227169 http://members.nbci.com/AndrewMartin/
[3/7] from: rebol:techscribe at: 10-Jan-2001 22:15
Hi,
If I understand your problem correctly, then the following code -
adapted from REBOL's built-in net-utils (source net-utils) - should be
helpful (code follows after my sign-off).
Usage:
>> print mold url-parser/parse-url http://www.yahoo.com/sub-directory/filename.html
make object! [
scheme: "http"
username: none
password: none
host: "www.yahoo.com"
port: none
path: "sub-directory/"
target: "filename.html"
tag: none
]
Hope this helps,
Elan
REBOL []
URL-Parser: make object! [
    scheme: none
    user: none
    pass: none
    host: none
    port-id: none
    path: none
    target: none
    tag: none
    p2: none
    digit: make bitset! #{
        000000000000FF03000000000000000000000000000000000000000000000000
    }
    alpha-num: make bitset! #{
        000000000000FF03FEFFFF07FEFFFF0700000000000000000000000000000000
    }
    scheme-char: make bitset! #{
        000000000068FF03FEFFFF07FEFFFF0700000000000000000000000000000000
    }
    path-char: make bitset! #{
        00000000F17FFFA7FFFFFFAFFEFFFF5700000000000000000000000000000000
    }
    user-char: make bitset! #{
        00000000F87CFF23FEFFFF87FEFFFF1700000000000000000000000000000000
    }
    pass-char: make bitset! #{
        FFF9FFFFFEFFFFFFFEFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
    }
    url-rules: [scheme-part user-part host-part path-part file-part tag-part]
    scheme-part: [copy scheme some scheme-char #":" ["//" | none]]
    user-part: [copy user uchars [#":" pass-part | none] #"@" | none (user: pass: none)]
    pass-part: [copy pass to #"@" [skip copy p2 to "@" (append append pass "@" p2) | none]]
    host-part: [copy host uchars [#":" copy port-id digits | none]]
    path-part: [slash copy path path-node | none]
    path-node: [pchars slash path-node | none]
    file-part: [copy target pchars | none]
    tag-part: [#"#" copy tag pchars | none]
    uchars: [some user-char | none]
    pchars: [some path-char | none]
    digits: [1 5 digit]
    parse-url: func [
        {Return url dataset or cause an error if not a valid URL}
        url
    ][
        parse/all url url-rules
        return make object! compose [
            scheme: (scheme)
            username: (user)
            password: (pass)
            host: (host)
            port: (port-id)
            path: (path)
            target: (target)
            tag: (tag)
        ]
    ]
]
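One way to use the parser's output (a hypothetical helper, not part of net-utils) is to rebuild the directory portion of the base URL, so relative links can be appended to it:

```rebol
; Hypothetical helper: given the object returned by parse-url above,
; reconstruct the base URL up to (and including) the directory part.
base-of: func [obj [object!]] [
    to-url rejoin [
        obj/scheme "://" obj/host
        either obj/path [join "/" obj/path] ["/"]
    ]
]
; For the parse of http://www.yahoo.com/sub-directory/filename.html
; this would yield http://www.yahoo.com/sub-directory/ -- a link like
; demo or ./demo can then simply be joined onto it.
```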
[4/7] from: al:bri:xtra at: 11-Jan-2001 19:25
> then the following code - adapted from REBOL's built-in net-utils (source
net-utils) - should be helpful (Code follows after my sign-off).
I learn something new every day! Thanks for showing this, Elan!
Andrew Martin
ICQ: 26227169 http://members.nbci.com/AndrewMartin/
[5/7] from: rebol:techscribe at: 10-Jan-2001 23:30
Thanks Andrew.
I just realized that the net-utils version is right in setting the
object's values to none before beginning to parse, otherwise old values
may continue to live. In short, we need a vars block, and the parse-url
function should be modified as follows:
make object! [
    ....
    ....
    vars: [scheme user pass host port-id path target tag]
    parse-url: func [
        {Return url dataset or cause an error if not a valid URL}
        url
    ][
        set vars none
        ...
        ....
    ]
]
The reason is that the included parse rules do not under all
circumstances clear out values that result from previous parsing. My
vars block is probably overkill - compare to the original in parse-url
in net-utils/URL-parser - but it can't hurt.
Take Care,
Elan
[6/7] from: gunjan:ezeenet at: 11-Jan-2001 21:19
Thanks folks for all the great tips. Lemme work on it tonight, and tomorrow
I'll have some more queries.
Gunjan
[7/7] from: g:santilli:tiscalinet:it at: 11-Jan-2001 19:13
Hello Gunjan!
On 10-Jan-01, you wrote:
GK> The problem arises when my crawler extracts urls like /demo
GK> or ../demo or../../demo or /demo/index.html etc i.e. when a
GK> relative path is used instead of an absolute path. Then my
GK> poor crawlers gets absolutely confused and does not know what
GK> to do.
Some hints that should help you start out:
>> url-obj: make object! [user: pass: host: port-id: path: target: none]
>> net-utils/url-parser/parse-url url-obj http://www.yahoo.com/temp.html
== "temp.html"
>> print mold url-obj
make object! [
user: none
pass: none
host: "www.yahoo.com"
port-id: none
path: none
target: "temp.html"
]
>> net-utils/url-parser/parse-url url-obj http://www.yahoo.com/dir/temp.html
== "temp.html"
>> print mold url-obj
make object! [
user: none
pass: none
host: "www.yahoo.com"
port-id: none
path: "dir/"
target: "temp.html"
]
>> clean-path join %/ [url-obj/path %./demo]
== %/dir/demo
>> clean-path join %/ [url-obj/path %../demo]
== %/demo
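Putting the hints above together, a resolver could look something like this (a sketch only: the function name is made up, it assumes http links, and it ignores the port, username and query parts):

```rebol
resolve-link: func [base [url!] link [string!] /local obj path] [
    obj: make object! [user: pass: host: port-id: path: target: none]
    net-utils/url-parser/parse-url obj base
    ; let clean-path collapse any ./ and ../ segments in the link
    path: clean-path join %/ [any [obj/path ""] to-file link]
    to-url rejoin ["http://" obj/host path]
]
```

For example, resolve-link http://www.yahoo.com/dir/temp.html "../demo" should give http://www.yahoo.com/demo, matching the clean-path results above.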
HTH,
Gabriele.
--
Gabriele Santilli <[giesse--writeme--com]> - Amigan - REBOL programmer
Amiga Group Italia sez. L'Aquila -- http://www.amyresource.it/AGI/