[REBOL] ANN: SiteCrawl
From: ryan::christiansen::intellisol::com at: 24-Jul-2001 11:43
Following is the function 'SiteCrawl and its dependent functions 'linkURL
and 'pageLinks.
'SiteCrawl will crawl an entire web site and gather the URLs in a block!
for each page on the site.
USAGE:
rebol-pages: copy []
SiteCrawl http://www.rebol.com rebol-pages
I need feedback on this. Do you have a small site where you can test
'SiteCrawl for me?
Thanks.
-Ryan
pageLinks: func [
"Return a block of links from an HTML page"
page [string!] "The code for an HTML page as a string"
/local links
][
links: copy []
parse page [any [thru {<A HREF="} copy text to {"} (append links
text)]]
return links
]
linkURL: func [
"Create a complete url for a link based on a link's relativity to the
URL of the HTML page where the link appears"
link "The url of a link parsed from an HTML page"
page-url "The url for the HTML page where the link appears"
/local protocol domain path-branch
][
link: make string! link
page-url: make string! page-url
parse page-url [copy text thru "://" (protocol: copy text)]
parse page-url [thru "://" copy text [to "/" | to end] (domain: copy
text)]
path-branch: parse/all page-url "/"
either find link "://" [
return link
][
either link/1 = #"/" [
insert link (rejoin [protocol domain])
return link
][
either (last path-branch) = domain [
insert link (rejoin [page-url "/"])
return link
][
replace page-url (last path-branch) ""
insert link page-url
return link
]
]
]
]
SiteCrawl: func [
"Crawl an entire web site"
site [url!] "The url of the site to crawl"
pages [block!] "A block where all found pages will be gathered"
/local page links link target find-page site-domain target-domain
][
page: read site
site: make string! site
links: pageLinks page
foreach link links [
target: LinkURL link site
either find target "mailto:" [
; nothing
][
parse site [thru "://" copy text [to "/" | to end]
(site-domain: copy text)]
parse target [thru "://" copy text [to "/" | to end]
(target-domain: copy text)]
either site-domain == target-domain [
find-page: select pages target
if (find-page == none) [
append/only pages target
if error? try [SiteCrawl make url! target][links: next
links]
]
][
; nothing
]
]
]
return pages
]
Ryan C. Christiansen
Web Developer
Intellisol International
4733 Amber Valley Parkway
Fargo, ND 58104
701-235-3390 ext. 6671
FAX: 701-235-9940
http://www.intellisol.com
Global Leader in People Performance Software
_____________________________________
Confidentiality Notice
This message may contain privileged and confidential information. If you
think, for any reason, that this message may have been addressed to you in
error, you must not disseminate, copy or take any action in reliance on it,
and we would ask you to notify us immediately by return email to
[ryan--christiansen--intellisol--com]