Download a whole website
[1/7] from: belkasri::1stlegal::com at: 25-Jul-2002 15:36
Hi everyone,
I am kind of new to REBOL.
I know how to download a web page and save it to disk, provided I know
the name of the page. But what if I want to download a whole website and
save it to my hard drive?
Thanks.
--Abdel.
[2/7] from: oliva:david:seznam:cz at: 2-Aug-2002 14:54
Hello Abdel,
Thursday, July 25, 2002, 10:36:13 PM, you wrote:
AB> I know how to download a web page and save it to disk, provided I know
AB> the name of the page. But what if I want to download a whole website and
AB> save it to my hard drive?
I had a reb-bot for travelling the net and searching for images,
but it is really old (born in 2000), did not save pages, and had some
bugs, so I decided to make a new generation. Here is what I have
now (excuse the function 'uprav-url; it's from the old bot and needs
to be improved and translated as well).
What does it do? It simply parses the HTML and sorts the URLs it finds
into blocks: images, stylesheets, linked scripts. I will add applets and
embedded objects as well.
There are two things that I want to discuss:
1. How to save a page from a URL such as http://localhost/
(the file may be index.html, default.html, or whatever is specified on
the server side).
2. How to encode file names of dynamic documents such as:
http://127.0.0.1:85/cgi-bin/getboard.r?boardID=default&lang=cz
And here is the script:
<code>
rebol [
title: "Site downloader"
purpose: {To download pages from some url with all content}
author: {Oldes}
email: [oliva--david--seznam--cz]
comment: {This is not a finished version. For now it just parses the page and returns
the URLs sorted by type. Saving the content and recursing from one page to another
still need to be written.}
version: 0.0.1
]
page-url: to url! ask "start URL: "
;page: read/binary page-url
page-markup: load/markup page-url ;to-string page
purl: decode-url page-url
if none? purl/path [purl/path: "/"]
purl/port-id: either none? purl/port-id [""][purl/port-id: join ":" purl/port-id]
base-href: rejoin [http:// purl/host purl/port-id purl/path]
images: make block! 50
links: make block! 500
scripts: make block! 10
stylesheets: make block! 10
tag-rules: [
"img" copy x thru {src=} copy url [to { } | to end ] y: to end (
tag-name: "img"
url: uprav-url url
if all [none? find images url not none? url] [insert images url]
)
| "link" copy x thru {href=} copy url [to { } | to end ] y: to end (
tag-name: "link"
url: uprav-url url
if all [none? find stylesheets url not none? url] [insert stylesheets url ]
)
| "script" copy x thru {src=} copy url [to { } | to end ] y: to end (
tag-name: "script"
url: uprav-url url
if all [none? find scripts url not none? url] [insert scripts url ]
)
| "BASE" copy x thru {href=} copy url [to { } | to end ] y: to end (
tag-name: "base"
base-href: uprav-url url
;print rejoin ["new base-href: " base-href]
)
| "EMBED" copy x thru {src=} copy url [to { } | to end ] y: to end (tag-name: "EMBED")
| "a" copy x thru {href=} copy url [to { } | to end ] y: to end (
tag-name: "a"
url: uprav-url url
if all [none? find links url not none? url] [insert links url ]
)
]
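; uprav-url (Czech for "fix-url") resolves a link against base-href.
; site and newurl are declared local so they do not leak into the global context.
uprav-url: func [path [string!] /local q w site newurl][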
path: trim/with path {"}
if find path "javascript:" [return none]
if find path "mailto:" [return path]
either found? find path "://" [
return path
] [
either path/1 = #"/" [
parse base-href [copy w thru "://" copy q [to "/" | to end]]
return newurl: rejoin [w q path]
] [
site: tail parse (to-string skip base-href 7) "/"
path: parse path "/"
foreach p path [
either p = ".." [
if error? try [remove back site] [print "Bad relative link"]
] [
if p <> "." [append site p]
]
]
newurl: make string! ""
foreach p head site [append newurl join p "/"]
newurl: head clear back tail newurl
replace/all newurl "//" "/"
insert head newurl "http://"
return head newurl
]
]
]
parse/all page-markup [
some [
set tag tag! (
if parse/all tag tag-rules [
; print reform [x url y tag-name]
]
)
| any-type!
]
]
probe stylesheets
probe images
probe scripts
probe links
</code>
[3/7] from: gscottjones:mchsi at: 3-Aug-2002 16:50
Hi, Oldes,
From: "RebOldes"
<snip>
> There are two things that I want to discuss:
>
> 1.
> How to save a page from a URL such as http://localhost/
> (the file may be index.html, default.html, or whatever is
> specified on the server side).
Unfortunately, I do not believe that the HTTP protocol gives a way of knowing
the target in this case. It is the server software that fills in a default
name, such as "index.html" or "default.htm", if only given a directory path
such as http://localhost/ . Warning: I am not an expert on this topic, but
as a confirmation I have experimented to verify that browser clients
(including REBOL) do not receive this information automatically in the
protocol header. I believe that an exception to this is when the
reference is *forwarded* to a fully qualified target (as Hotmail does). In
this case, embedded in the http scheme is a local variable named target
that contains the path and file. With a hacked version of the http scheme,
this information can be used, if needed.
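Since the server does not reveal the real name, one workaround is to pick a local name yourself whenever the URL ends in a slash. The following is only a minimal sketch under that assumption: the helper save-name and the choice of "index.html" are hypothetical, and URLs with no path at all (such as http://localhost) are not handled.

<code>
REBOL [Title: "Default page name sketch"]

; ASSUMPTION: "index.html" is only a local naming convention. The server may
; really have served default.htm or anything else.
default-name: "index.html"

; Hypothetical helper: give directory-style URLs a file name to save under.
save-name: func [url [url!]] [
    either #"/" = last url [join url default-name] [url]
]

print save-name http://localhost/          ; directory URL gets the default name
print save-name http://localhost/a.html    ; a URL naming a file is left unchanged
</code>

Links inside the saved pages that point at the directory URL would have to be rewritten to the same chosen name for the local copy to stay navigable.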
> 2.
> How to encode file names of dynamic documents such as:
> http://127.0.0.1:85/cgi-bin/getboard.r?boardID=default&lang=cz
I am unsure what you are asking. Do you mean how to create a static file
name for what is a dynamically created web page? I would be tempted to take
a shortcut, unless I really needed to preserve the embedded data in the URL.
Something like:
my-url: http://127.0.0.1:85/cgi-bin/getboard.r?boardID=default&lang=cz
split-url: split-path my-url
new-file-name: to-file join checksum split-url/2 ".html"
;=====yielding %11808560.html
It is just a thought.
Hope I've helped.
--Scott Jones
[4/7] from: oliva:david:seznam:cz at: 4-Aug-2002 16:11
Hello G.,
Saturday, August 3, 2002, 11:50:03 PM, you wrote:
>> 2.
>> How to encode file names of dynamic documents such as:
>> http://127.0.0.1:85/cgi-bin/getboard.r?boardID=default&lang=cz
GSJ> I am unsure what you are asking. Do you mean how to create a static file
GSJ> name for what is a dynamically created web page? I would be tempted to take
GSJ> a shortcut, unless I really needed to preserve the embedded data in the URL.
GSJ> Something like:
GSJ> my-url: http://127.0.0.1:85/cgi-bin/getboard.r?boardID=default&lang=cz
GSJ> split-url: split-path my-url
GSJ> new-file-name: to-file join checksum split-url/2 ".html"
GSJ> ;=====yielding %11808560.html
At first I thought of storing the files in a more readable form, as in the
REBOL/View structure of the public/ dir, but I quite like your (crypted) version
as well: at least there are no problems with port ids (see my "path-thru"
email) or with characters that are not allowed in paths. In both cases I will
have to test the content-type of the dynamically created document to
add the correct extension (not all documents are simple HTML :-)
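The content-type test could be sketched as a simple lookup from MIME type to extension. This is only an illustration: the table extensions, the helper ext-for, and the ".bin" fallback are all hypothetical, and the content-type string is assumed to have already been obtained from the response headers.

<code>
REBOL [Title: "Content-type to extension sketch"]

; ASSUMPTION: a small hand-made table; extend it as needed.
extensions: [
    "text/html"  ".html"
    "text/css"   ".css"
    "image/gif"  ".gif"
    "image/jpeg" ".jpg"
]

; Hypothetical helper: map a Content-Type header value to a file extension.
ext-for: func [content-type [string!] /local type] [
    ; keep only the part before any parameters such as "; charset=..."
    type: trim copy/part content-type any [find content-type ";" tail content-type]
    ; fall back to ".bin" when the type is not in the table
    any [select extensions type ".bin"]
]

print ext-for "text/html; charset=iso-8859-2"
print ext-for "image/gif"
</code>

The extension could then be appended to either kind of name, the readable one or the checksum one.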
--oldes
[5/7] from: oliva:david:seznam:cz at: 6-Aug-2002 1:38
Hello Rebol-list,
From this URL you can download the first beta of my Reb-Web-Bot.
For now it just downloads one page, but it's not so difficult to improve
it :-)
http://oldes.multimedia.cz/utils/reb-web-bot.r
[6/7] from: oliva:david:seznam:cz at: 12-Aug-2002 23:52
Hello Rebol-list,
It's not perfect yet, but there are a lot of fixes:
do load-thru/update http://oldes.multimedia.cz/utils/reb-web-bot.r
to download the whole Internet (because it has no limits, it will
keep downloading until you fill your drive or stop it :)
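One way to add limits is to stay below the start URL and to cap the number of pages. The following is a minimal sketch of that idea, not the actual reb-web-bot code: the names start-url, max-pages, visited, queue and within-start? are hypothetical, the example.com and example.org URLs are placeholders, and no page is really fetched.

<code>
REBOL [Title: "Crawl limit sketch"]

start-url: http://www.example.com/docs/    ; placeholder start URL
max-pages: 100                             ; assumed upper bound on saved pages
visited: make block! max-pages
queue: reduce [
    start-url
    http://www.example.com/docs/a.html
    http://other.example.org/
]

; true only for URLs that begin with the start URL
within-start?: func [url [url!]] [found? find/match url start-url]

while [all [not empty? queue  max-pages > length? visited]] [
    url: first queue
    remove queue
    if all [within-start? url  not find visited url] [
        append visited url
        ; a real bot would read and parse url here and append its links to queue
    ]
]

probe visited
</code>

Instead of silently skipping a URL that fails within-start?, the bot could ask the user whether to follow it.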
[7/7] from: rebol665:ifrance at: 13-Aug-2002 9:18
Hi
I have tested it and it worked fine for me.
One suggestion: maybe it could prompt before going one level up.
Patrick
----- Original Message -----
From: "RebOldes" <[oliva--david--seznam--cz]>
To: "RebOldes" <[rebol-list--rebol--com]>
Sent: Monday, August 12, 2002 11:52 PM
Subject: [REBOL] Re: Download a whole website
> Hello Rebol-list,
> it's not perfect yet, but a lot of fixes:
<<quoted lines omitted: 4>>
Notes
- Quoted lines have been omitted from some messages.