[REBOL] Need some url purification functions
From: gunjan:ezeenet at: 10-Jan-2001 21:56
Hi Gurus,
In the endeavor to learn REBOL, I have decided to make a custom GUI crawler (OK, OK, this
was the most creative and challenging thing that a 5-day-old in REBOL could think of :)
and the good news is that it does crawl ALMOST as desired.
The problem arises when my crawler extracts URLs like /demo, ../demo, ../../demo or
/demo/index.html, i.e. when a relative path is used instead of an absolute one. Then
my poor crawler gets absolutely confused and does not know what to do. (Imagine it trying
to get a site called ../demo :(
I have made rules that check whether the extracted URL has http:// at the beginning
or not; if not, the crawler prepends the domain name that was being crawled. The
code snippet for the same is given below.
This algo definitely has tons of issues. For example, what happens if the URL being crawled
is not http://www.yahoo.com but http://www.yahoo.com/temp.html? Will the new relative
URL ./demo become http://www.yahoo.com/temp.html/demo? What should happen if the relative
URL is ../temp or, worse, ../../../temp? And the same relative URLs can be written as
./demo or demo or /demo etc.
I wanted to know whether I am thinking in the right direction. Is there a simpler way of
achieving what I want, or do I have to write rules for each condition? Are there any ready-made
functions or REBOL code that would give me a purified absolute URL, given the URL of the page
being crawled and the link extracted from it?
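To make the question concrete, here is a rough sketch of the behaviour I am after. The name
resolve-link is made up for illustration (it is not a built-in), and the sketch assumes that
the base always starts with http:// and that neither the base nor the link carries a query
string or a #fragment. The idea is to take the directory of the page being crawled, glue the
link onto it, and then drop every "." segment and let every ".." segment cancel the one
before it, so that ./demo found on http://www.yahoo.com/temp.html would be meant to come out
as http://www.yahoo.com/demo.

```rebol
; resolve-link is a hypothetical helper, not a built-in REBOL function.
; base : URL of the page being crawled (assumed to start with http://)
; link : the HREF text extracted from that page
resolve-link: func [
    base [url!] link [string!]
    /local text rest slash root path dir combined seg segs dir-like step result
][
    ; a link that already carries the scheme is returned as it is
    if find/match link "http://" [return to-url link]

    ; split the base into root ("http://host") and path ("/dir/page.html")
    text: form base
    rest: any [find/match text "http://" text]
    either slash: find rest "/" [
        root: copy/part text slash
        path: copy slash
    ][
        root: copy text
        path: copy "/"
    ]

    ; the directory of the base is its path up to and including the last "/"
    dir: copy/part path next find/last path "/"

    ; "/demo" restarts at the site root, anything else is relative to dir
    combined: either find/match link "/" [link] [join dir link]

    ; step handles one path segment: "" and "." are dropped,
    ; ".." removes the previously kept segment (if there is one)
    segs: make block! 8
    dir-like: false
    step: [(
        seg: any [seg ""]
        dir-like: any [seg = "" seg = "." seg = ".."]
        either seg = ".." [
            if not empty? segs [remove back tail segs]
        ][
            if not dir-like [append segs seg]
        ]
    )]
    parse/all combined [
        any [copy seg to "/" skip step]
        copy seg to end step
    ]

    ; rebuild the URL; keep a trailing "/" when the last segment named a directory
    result: copy root
    foreach seg segs [append result join "/" seg]
    if any [empty? segs dir-like] [append result "/"]
    to-url result
]

; try the relative forms mentioned above against one base page
foreach link ["./demo" "demo" "/demo/index.html" "../temp" "../../../temp"] [
    print [link "->" resolve-link http://www.yahoo.com/a/temp.html link]
]
```

Extra ".." segments that would climb above the site root are simply ignored here, which is
my guess at the sensible behaviour rather than something I have verified.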
Here is my current code:
;start of code snippet
active-domain: make url! ""
get-links: func [site [url!]] [
    tags: make block! 0
    text: make string! 0
    html-code: [
        copy tag ["<" thru ">"] (append tags tag) | copy txt to "<" (append text txt)
    ]
    page: read site
    parse page [to "<" some html-code]
    foreach tag tags [
        if parse tag [
            "<A" thru "HREF="
            [{"} copy link to {"} | copy link to ">"]
            to end
        ][
            print ["Before : " link]
            dirs: parse link "/"
            ;first I check whether the link has http: in the beginning or not
            if not dirs/1 = "http:" [
                ;if not then check whether link begins with / or not
                either dirs/1 = "" [
                    link: join active-domain link
                ][
                    link: join active-domain ["/" link]
                ]
            ]
            print ["After : " link]
        ]
    ]
]
site: ask "Enter the site : "
site: to-url join "http://" site
active-domain: site
print ["Just set the active domain to " active-domain]
get-links site
;end of code snippet
-----------------------------------------
Gunjan Karun
Technical Presales Consultant
Zycus, Mumbai, India
Tel: +91-22-8730591/8760625/8717251
Extension: 120
Fax: +91-22-8717251
URL: http://www.zycus.com