Download a whole website

[1/7] from: belkasri::1stlegal::com at: 25-Jul-2002 15:36

Hi everyone, I am fairly new to REBOL. I know how to download a web page and save it to disk if I know the name of the page. But what if I want to download a whole website and save it to my hard drive? Thanks. --Abdel.

[2/7] from: oliva:david:seznam:cz at: 2-Aug-2002 14:54

Hello Abdel,

Thursday, July 25, 2002, 10:36:13 PM, you wrote:

AB> I know how to download a web page and save it to the disk, that if I know
AB> the name of the page. But if I want to download a whole website and save it
AB> to my hard drive?

I had one reb-bot for travelling the net and searching for images, but it is really old (born in 2000), did not save pages, and had some bugs, so I decided to make a new generation. Here is what I have now (excuse the function 'uprav-url; it's from the old bot and needs to be improved, and translated, as well).

What does it do? It simply parses the HTML and sorts the URLs it finds into blocks: images, stylesheets, linked scripts. I will add applets and embedded objects as well.

There are two things that I want to discuss:

1. How to save a page from a URL such as http://localhost/ (it may be index.html, default.html, or whatever is specified on the server side).

2. How to encode file names of dynamic documents such as http://127.0.0.1:85/cgi-bin/getboard.r?boardID=default&lang=cz

And here is the script:

<code>
rebol [
    title: "Site downloader"
    purpose: {To download pages from some url with all content}
    author: {Oldes}
    email: [oliva--david--seznam--cz]
    comment: {This is not finished version... now it just parses the page
        and returns sorted types of urls. Need to make saving the content
        and recursion for traveling from one page to another}
    version: 0.0.1
]

page-url: to url! ask "start URL: "
;page: read/binary page-url
page-markup: load/markup page-url ;to-string page
purl: decode-url page-url
if none? purl/path [purl/path: "/"]
purl/port-id: either none? purl/port-id [""][purl/port-id: join ":" purl/port-id]
base-href: rejoin [http:// purl/host purl/port-id purl/path]

; one block per kind of URL found on the page
images: make block! 50
links: make block! 500
scripts: make block! 10
stylesheets: make block! 10

; rules applied to the text of each tag
tag-rules: [
    "img" copy x thru {src=} copy url [to { } | to end] y: to end (
        tag-name: "img"
        url: uprav-url url
        if all [none? find images url not none? url] [insert images url]
    )
    | "link" copy x thru {href=} copy url [to { } | to end] y: to end (
        tag-name: "link"
        url: uprav-url url
        if all [none? find stylesheets url not none? url] [insert stylesheets url]
    )
    | "script" copy x thru {src=} copy url [to { } | to end] y: to end (
        tag-name: "script"
        url: uprav-url url
        if all [none? find scripts url not none? url] [insert scripts url]
    )
    | "BASE" copy x thru {href=} copy url [to { } | to end] y: to end (
        tag-name: "base"
        base-href: uprav-url url
        ;print rejoin ["new base-href: " base-href]
    )
    | "EMBED" copy x thru {src=} copy url [to { } | to end] y: to end (tag-name: "EMBED")
    | "a" copy x thru {href=} copy url [to { } | to end] y: to end (
        tag-name: "a"
        url: uprav-url url
        if all [none? find links url not none? url] [insert links url]
    )
]

; uprav-url ("fix URL"): turns a link found on the page into an absolute URL
uprav-url: func [path [string!] /local u q w newurl site][
    path: trim/with path {"}
    if find path "javascript:" [return none]
    if find path "mailto:" [return path]
    either found? find path "://" [
        ; already absolute
        return path
    ][
        either path/1 = #"/" [
            ; relative to the server root
            parse base-href [copy w thru "://" copy q [to "/" | to end]]
            return newurl: rejoin [w q path]
        ][
            ; relative to the current directory: resolve "." and ".."
            site: tail parse (to-string skip base-href 7) "/"
            path: parse path "/"
            foreach p path [
                either p = ".." [
                    ; the message means "Bad relative link"
                    if error? try [remove back site] [print "Spatny relativni odkaz"]
                ][
                    if p <> "." [append site p]
                ]
            ]
            newurl: make string! ""
            foreach p head site [append newurl join p "/"]
            newurl: head clear back tail newurl
            replace/all newurl "//" "/"
            insert head newurl "http://"
            return head newurl
        ]
    ]
]

parse/all page-markup [
    some [
        set tag tag! (
            if parse/all tag tag-rules [
                ; print reform [x url y tag-name]
            ]
        )
        | any-type!
    ]
]

probe stylesheets
probe images
probe scripts
probe links
</code>

[3/7] from: gscottjones:mchsi at: 3-Aug-2002 16:50

Hi, Oldes, From: "RebOldes"

<snip>
> There are two things that want to discuss:
>
> 1.
> How to save page from url: http://localhost/
> :(may be index.html default.html or what is
> specified on the server side:(

Unfortunately, I do not believe that the HTTP protocol gives a way of knowing the target in this case. It is the server software that fills in a default name, such as "index.html" or "default.htm", if it is only given a directory path such as http://localhost/ . Warning: I am not an expert on this topic, but as a confirmation I have experimented to verify that browser clients (including REBOL) do not receive this information automatically in the protocol header. I believe that an exception to this is when the reference is *forwarded* to a fully qualified target (as Hotmail does). In that case, embedded in the http scheme is a local variable named target that contains the path and file. With a hacked version of the http scheme, this information can be used if needed.
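Since the server does not report which file it served for a directory URL, a downloader has to pick a local name itself. The following is only a minimal sketch of that idea: the helper name page-name and the fallback "index.html" are assumptions made for illustration, and it relies on decode-url leaving the target field as none when the URL names no file.

```rebol
REBOL [title: "Default page name (sketch)"]

; Assumption: "index.html" is only a guess at what the server serves
; for a directory URL; the HTTP response does not name the file.
default-page: "index.html"

; Hypothetical helper: returns a local file name for a URL.
; decode-url splits the URL into host, path and target; the target
; is expected to be none when the URL ends at a directory.
page-name: func [url [url!] /local parts][
    parts: decode-url url
    to-file any [parts/target default-page]
]

print page-name http://localhost/
print page-name http://localhost/docs/about.html
```

If the request turns out to be forwarded to a fully qualified target, as described above, that target would be the better choice of name than the guessed default.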

> 2.
> Way how to encode file names of dynamic documents as:
> http://127.0.0.1:85/cgi-bin/getboard.r?boardID=default&lang=cz

I am unsure what you are asking. Do you mean how to create a static file name for what is a dynamically created web page? I would be tempted to take a shortcut, unless I really needed to preserve the embedded data in the URL. Something like:

my-url: http://127.0.0.1:85/cgi-bin/getboard.r?boardID=default&lang=cz
split-url: split-path my-url
new-file-name: to-file join checksum split-url/2 ".html"
;=====yielding %11808560.html

It is just a thought. Hope I've helped.

--Scott Jones

[4/7] from: oliva:david:seznam:cz at: 4-Aug-2002 16:11

Hello G., Saturday, August 3, 2002, 11:50:03 PM, you wrote:

>> 2.
>> Way how to encode file names of dynamic documents as:
>> http://127.0.0.1:85/cgi-bin/getboard.r?boardID=default&lang=cz

GSJ> I am unsure what you are asking. Do you mean how to create a static file
GSJ> name for what is a dynamically created web page? I would be tempted to take
GSJ> a shortcut, unless I really needed to preserve the embeded data in the url.
GSJ> Something like:
GSJ> my-url: http://127.0.0.1:85/cgi-bin/getboard.r?boardID=default&lang=cz
GSJ> split-url: split-path my-url
GSJ> new-file-name: to-file join checksum split-url/2 ".html"
GSJ> ;=====yielding %11808560.html

At first I thought of storing the files in a more readable form, as in the REBOL/View structure of the public/ dir, but I quite like your (crypted) version as well. At least there are no problems with port ids (see my "path-thru" email) or with characters that are not allowed in paths. In both cases I will have to test the content-type of the dynamically created document to add the correct extension (not all documents are simple HTML :-)

--oldes
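The content-type test mentioned above could be combined with the checksum name roughly as follows. This is a sketch under assumptions, not the bot's actual code: the table extensions, the helper extension-for and the ".bin" fallback are all hypothetical, and the content-type value is passed in as a string because reading it from the response is outside the sketch.

```rebol
REBOL [title: "Extension from content type (sketch)"]

; Hypothetical lookup table; the entries are illustrative, not complete.
extensions: [
    "text/html"  ".html"
    "text/plain" ".txt"
    "text/css"   ".css"
    "image/gif"  ".gif"
    "image/jpeg" ".jpg"
    "image/png"  ".png"
]

; Hypothetical helper: content-type is the header value as a string,
; e.g. "text/html; charset=iso-8859-1". Parameters after ";" are
; dropped, and unknown types fall back to ".bin" (an assumption).
extension-for: func [content-type [string!] /local type][
    type: trim lowercase copy content-type
    if find type ";" [type: trim copy/part type find type ";"]
    any [select extensions type ".bin"]
]

; Checksum-based name as in the quoted suggestion, with the
; extension chosen from the content type instead of fixed ".html".
name: to-file join checksum "getboard.r?boardID=default&lang=cz" extension-for "text/html"
print name
```

Keeping the table as a flat block of pairs makes it easy to extend when the bot meets a type it does not know yet.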

[5/7] from: oliva:david:seznam:cz at: 6-Aug-2002 1:38

Hello Rebol-list,

from this URL you can download the first beta of my Reb-Web-Bot. For now it just downloads one page, but it's not so difficult to improve it :-)

http://oldes.multimedia.cz/utils/reb-web-bot.r

[6/7] from: oliva:david:seznam:cz at: 12-Aug-2002 23:52

Hello Rebol-list,

it's not perfect yet, but there are a lot of fixes:

do load-thru/update http://oldes.multimedia.cz/utils/reb-web-bot.r

to download the whole Internet :( because I have no limits, so it will keep downloading until you fill your drive or stop it :)
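One way to give such a bot a limit is to keep a queue of pages to visit, a list of pages already visited, and a cap on the number of pages. The sketch below illustrates only that loop and is not taken from reb-web-bot.r: site-map is a hypothetical in-memory table that stands in for fetching and parsing real pages, and the names links-of and crawl are made up for the example.

```rebol
REBOL [title: "Bounded crawl loop (sketch)"]

; Stand-in for the network: a hypothetical table from page name to
; the links found on that page. A real bot would read and parse here.
site-map: [
    "a" ["b" "c"]
    "b" ["a" "d"]
    "c" []
    "d" ["e"]
    "e" []
]

; Links found on a page; unknown pages have no links.
links-of: func [page [string!]][
    any [select site-map page copy []]
]

; Breadth-first walk from start that stops after max-pages pages.
; Returns the block of visited pages in visiting order.
crawl: func [start [string!] max-pages [integer!] /local queue visited page][
    queue: reduce [start]
    visited: copy []
    while [all [not empty? queue (length? visited) < max-pages]][
        page: first queue
        remove queue
        if not find visited page [
            append visited page
            foreach link links-of page [
                if not find visited link [append queue link]
            ]
        ]
    ]
    visited
]

probe crawl "a" 3
probe crawl "a" 10
```

The visited list also keeps the loop from going round in circles when two pages link to each other, as "a" and "b" do in the table.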

[7/7] from: rebol665:ifrance at: 13-Aug-2002 9:18

Hi,

I have tested it and it worked fine for me. One suggestion: maybe it could prompt before going one level up.

Patrick
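A prompt of that kind could be sketched as follows. The helper names below-base? and follow? are hypothetical and this is not how reb-web-bot.r works; the check simply treats a link as "below" the start when its text begins with the start URL.

```rebol
REBOL [title: "Stay below the start URL (sketch)"]

; Hypothetical helper: true when url begins with base, that is,
; when it lies at or below the directory the download started from.
below-base?: func [base [url! string!] url [url! string!]][
    found? find/match to-string url to-string base
]

; Hypothetical wrapper: links below the start are followed without
; asking; for anything else the user has to answer "y".
follow?: func [base [url! string!] url [url! string!]][
    either below-base? base url [
        true
    ][
        "y" = trim ask rejoin ["Leave " base " and fetch " url "? (y/n) "]
    ]
]

print below-base? http://localhost/docs/ http://localhost/docs/a.html
print below-base? http://localhost/docs/ http://localhost/index.html
```

A prefix test is deliberately simple; it assumes links have already been made absolute before they are compared.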

