Mailing List Archive: 49091 messages
  • Home
  • Script library
  • AltME Archive
  • Mailing list
  • Articles Index
  • Site search
 

[REBOL] ANN: SiteCrawl

From: ryan::christiansen::intellisol::com at: 24-Jul-2001 11:43

Following is the function 'SiteCrawl and its dependent functions 'linkURL and 'pageLinks. 'SiteCrawl will crawl an entire web site and gather the URLs in a block! for each page on the site. USAGE: rebol-pages: copy [] SiteCrawl http://www.rebol.com rebol-pages I need feedback on this. Do you have a small site where you can test 'SiteCrawl for me? Thanks. -Ryan pageLinks: func [ "Return a block of links from an HTML page" page [string!] "The code for an HTML page as a string" /local links ][ links: copy [] parse page [any [thru {<A HREF="} copy text to {"} (append links text)]] return links ] linkURL: func [ "Create a complete url for a link based on a link's relativity to the URL of the HTML page where the link appears" link "The url of a link parsed from an HTML page" page-url "The url for the HTML page where the link appears" /local protocol domain path-branch ][ link: make string! link page-url: make string! page-url parse page-url [copy text thru "://" (protocol: copy text)] parse page-url [thru "://" copy text [to "/" | to end] (domain: copy text)] path-branch: parse/all page-url "/" either find link "://" [ return link ][ either link/1 = #"/" [ insert link (rejoin [protocol domain]) return link ][ either (last path-branch) = domain [ insert link (rejoin [page-url "/"]) return link ][ replace page-url (last path-branch) "" insert link page-url return link ] ] ] ] SiteCrawl: func [ "Crawl an entire web site" site [url!] "The url of the site to crawl" pages [block!] "A block where all found pages will be gathered" /local page links link target find-page site-domain target-domain ][ page: read site site: make string! site links: pageLinks page foreach link links [ target: LinkURL link site either find target "mailto:" [ ; nothing ][ parse site [thru "://" copy text [to "/" | to end] (site-domain: copy text)] parse target [thru "://" copy text [to "/" | to end] (target-domain: copy text)] either site-domain == target-domain [ find-page: select pages target if (find-page == none) [ append/only pages target if error? try [SiteCrawl make url! target][links: next links] ] ][ ; nothing ] ] ] return pages ] Ryan C. Christiansen Web Developer Intellisol International 4733 Amber Valley Parkway Fargo, ND 58104 701-235-3390 ext. 6671 FAX: 701-235-9940 http://www.intellisol.com Global Leader in People Performance Software _____________________________________ Confidentiality Notice This message may contain privileged and confidential information. If you think, for any reason, that this message may have been addressed to you in error, you must not disseminate, copy or take any action in reliance on it, and we would ask you to notify us immediately by return email to [ryan--christiansen--intellisol--com]