[REBOL] Re: Downloading patents?
From: bpaddock:csonline at: 1-Sep-2001 21:11
On Saturday 01 September 2001 09:32 am, Bob Paddock wrote:
> On Saturday 01 September 2001 03:31 am, Anton wrote:
> > Now, it's up to you to stick it all together into a nice
> > program. :)
I did get the following to download the 7 pages of this patent. Is there a
better way to do this than what I have?
Watch out for line wrapping...
REBOL [
    Title: "Get Patent, Page Download test"
    Date: 31-Aug-2001/21:07:00-4:00
]

; Navigation-bar page for this patent; its HTML carries the page count (TOTPG).
url:
http://l2.espacenet.com/dips/bnsviewnav?CY=gb&LG=en&DB=EPD&PN=US4215330&ID=US+++4215330A1+I+
page: read url

print copy/part find page "PG" 10
print copy/part find page "TOTPG" 10

; Split "TOTPG=7&..." on "=" and "&"; the second piece is the page count.
LastPage: second parse copy/part find page "TOTPG=" 10 "=&"
print LastPage

; Host and query for the per-page PDF, kept as a string so the page number
; can be appended after "PG=".
pdf-url:
"l2.espacenet.com/dips/bns.pdf?&PN=US4215330&ID=US+++4215330A1+I+&PG="

repeat CurrentPageNumber to-integer LastPage [
    get-page-url: to-url rejoin ["http://" pdf-url CurrentPageNumber]
    print get-page-url

    output-file-name: to-file rejoin ["US4215330pg" CurrentPageNumber ".pdf"]
    print output-file-name

    ; Fetch the page as binary and save it under the generated file name.
    write/binary output-file-name read/binary get-page-url
]
comment {
From: "Anton" <[arolls--bigpond--net--au]>
To: <[rebol-list--rebol--com]>
Subject: [REBOL] Re: Downloading patents?
Date: Sat, 1 Sep 2001 17:31:23 +1000
Your example has 7 pages, right?
If you look in the source to your example link below, you see that there are
two frames. The first one is the navigation bar, which you are clicking the
right arrow button all the time to get to the next page.
If you alter your url below, replacing "bnsviewer" with "bnsviewnav" (and
keeping the rest of the query junk on the end) you have the link to the
specific navbar for your specific patent.
So, in rebol (watch out for line wrapping):
url:
http://l2.espacenet.com/dips/bnsviewnav?CY=gb&LG=en&DB=EPD&PN=US4215330&ID=US+++4215330A1+I+
print page: read url
If you look at that html code and search for TOTPG you can see "TOTPG=7".
So now you can find out how many pages there are.
find page "TOTPG"
You should not find it difficult to grab the first number from that string.
Now to construct the urls that point to each page.
If we look back a little bit from there, we can see another nice variable, "PG":
find page "PG="
Great. Now we can simply modify url, adding PG=x, where x is your desired
page number, for example, to go to page 3 (watch wrap):
http://l2.espacenet.com/dips/bnsviewer?CY=gb&LG=en&DB=EPD&PN=US4215330&ID=US+++4215330A1+I+&PG=3
Now to find out how to get directly to each pdf file.
find page ".pdf"
We can see this relative link:
/dips/bns.pdf?CY=gb&LG=en&PN=US4215330&ID=US+++4215330A1+I+&PG=1
So, here is an absolute link that returns the pdf file for page 3 (watch
wrap):
pdf-url:
http://l2.espacenet.com/dips/bns.pdf?&PN=US4215330&ID=US+++4215330A1+I+&PG=3
This has (to my thinking) the essentials: PN, ID and PG. (You can keep CY and
LG if leaving them out causes problems.)
write/binary %test.pdf read/binary pdf-url
browse %test.pdf
}
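
As one possible answer to the "any better way" question, here is a minimal sketch that pulls the page count out with a parse rule instead of copy/part plus a delimiter split, and builds the per-page URLs in a small function. It runs against a made-up sample string rather than the live site, so nothing here is confirmed against the real espacenet HTML. The names sample-page, page-count and make-page-url are hypothetical, and the sample markup is only an assumption about how "PG=" and "TOTPG=" appear in the page.

```rebol
REBOL [
    Title: "Page-count extraction sketch"
]

; Hypothetical stand-in for the navigation-bar HTML. The real page is only
; assumed to contain "TOTPG=<digits>" somewhere in a query string.
sample-page: {<a href="/dips/bnsviewer?PN=US4215330&PG=1&TOTPG=7&CY=gb">next</a>}

digit: charset "0123456789"

; page-count (hypothetical helper): returns the integer that follows
; "TOTPG=", or none when the marker or its digits are missing.
page-count: func [page [string!] /local digits] [
    either parse/all page [thru "TOTPG=" copy digits some digit to end] [
        to-integer digits
    ] [
        none
    ]
]

; make-page-url (hypothetical helper): appends a page number to a base
; string that already ends in "PG=".
make-page-url: func [base [string!] n [integer!]] [
    to-url rejoin [base n]
]

base: "http://l2.espacenet.com/dips/bns.pdf?&PN=US4215330&ID=US+++4215330A1+I+&PG="

total: page-count sample-page
if total [
    repeat n total [print make-page-url base n]
]
```

Copying only a run of digits means the count does not depend on "TOTPG=" being followed by "&" within ten characters, and the none result gives the caller a way to stop before the download loop if the page layout ever changes. To use it for real, sample-page would be replaced by the result of read url, and the print inside the loop by the write/binary ... read/binary line from the script above.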