Downloading patents?
[1/8] from: bpaddock::csonline::net at: 31-Aug-2001 19:16
Every now and then I need to get a patent from one of the Internet Patent
Servers.
Unfortunately they make you download one page at a time. Downloading a 38
page patent this way took over an hour: wait for the page to finish loading,
click the 'next page' button, and repeat until there are no more pages.
Here is a Representative URL:
http://l2.espacenet.com/dips/bnsviewer?CY=gb&LG=en&DB=EPD&PN=US4215330&ID=US+++4215330A1+I+
I've played with proxy.r some to see what information the server and browsers
are exchanging.
What I was wondering is whether anyone has already written a patent-downloading
program. I don't want to reinvent the wheel. I know there are a few
commercial ones out there, but they have too many zeros in the price tag for my
occasional-usage budget.
I've done little in the way of Rebol coding, so does anyone have any tips that
might help me get this done?
I need to parse out the maximum number of pages from the first header, then ask
the server for each subsequent page.
[2/8] from: arolls:bigpond:au at: 1-Sep-2001 17:31
Your example has 7 pages, right?
If you look in the source of your
example link below, you see that there
are two frames. The first one is the
navigation bar, where you keep clicking
the right arrow button
to get to the next page.
If you alter your url below, replacing
"bnsviewer" with "bnsviewnav" (and keeping
the rest of the query junk on the end),
you have the link to the specific navbar
for your specific patent.
So, in rebol (watch out for line wrapping):
url: http://l2.espacenet.com/dips/bnsviewnav?CY=gb&LG=en&DB=EPD&PN=US4215330&ID=US+++4215330A1+I+
print page: read url
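The substitution described above can also be done in code rather than by hand. This is a minimal sketch, assuming the viewer URL from the first message; the names viewer-url and nav-url are made up for illustration.

```rebol
REBOL [Title: "Viewer URL to navbar URL (sketch)"]

; Viewer URL as posted in the first message of the thread
viewer-url: http://l2.espacenet.com/dips/bnsviewer?CY=gb&LG=en&DB=EPD&PN=US4215330&ID=US+++4215330A1+I+

; copy first, so the original viewer URL is left untouched by replace
nav-url: replace copy viewer-url "bnsviewer" "bnsviewnav"
print nav-url
```

The copy matters because replace modifies its first argument in place.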
If you look at that html code and search for TOTPG
you can see "TOTPG=7".
So now you can find out how many pages there are.
find page "TOTPG"
You should not find it difficult to grab the
first number from that string.
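One way to grab that number is a small parse rule. The sketch below runs on a made-up one-line sample instead of the real navbar HTML (only the "TOTPG=7" fragment is taken from the thread; the rest of the sample is an assumption), and the names digits and total are illustrative.

```rebol
REBOL [Title: "Extract TOTPG (sketch)"]

; Stand-in for the navbar page; the real HTML will look different.
page: {<frame src="/dips/bnsviewer?PN=US4215330&PG=1&TOTPG=7">}

digits: charset "0123456789"
total: none

; Skip to just past "TOTPG=", then copy the run of digits that follows.
parse/all page [thru "TOTPG=" copy total some digits to end]

; total stays none if the marker was not found
if total [total: to-integer total]
print total
```

Checking total for none before using it gives a natural place to report a page that does not contain the marker at all.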
Now to construct the urls that point to each page.
If we look back a little bit from there, we can see
another nice variable, "PG":
find page "PG="
Great. Now we can simply modify url, adding
PG=x, where x is your desired page number, for example,
to go to page 3 (watch wrap):
http://l2.espacenet.com/dips/bnsviewer?CY=gb&LG=en&DB=EPD&PN=US4215330&ID=US+++4215330A1+I+&PG=3
Now to find out how to get directly to each pdf file.
find page ".pdf"
We can see this relative link:
/dips/bns.pdf?CY=gb&LG=en&PN=US4215330&ID=US+++4215330A1+I+&PG=1
So, here is an absolute link that returns the pdf file for page 3 (watch
wrap):
pdf-url:
http://l2.espacenet.com/dips/bns.pdf?&PN=US4215330&ID=US+++4215330A1+I+&PG=3
This has (my thinking) the essentials, PN, ID and PG.
(You can keep CY and LG if it causes problems).
write/binary %test.pdf read/binary pdf-url
browse %test.pdf
Now, it's up to you to stick it all together into a nice
program. :)
[3/8] from: bpaddock:csonline at: 1-Sep-2001 9:32
On Saturday 01 September 2001 03:31 am, Anton wrote:
> Now, it's up to you to stick it all together into a nice
> program. :)
Just working out the looping through all of the pages parts, this is what I
came up with. Gets the max page count and simply prints the page number at
this point.
I expect it is more complex than it needs to be, since this is my first real
Rebol program. How can I do better?
Watch for line wrapping in the url: line.
REBOL [
Title: "Get Patent, looping test"
Date: 31-Aug-2001/06:39:00:00-4:00
]
url:
http://l2.espacenet.com/dips/bnsviewnav?CY=gb&LG=en&DB=EPD&PN=US4215330&ID=US+++4215330A1+I+
page: read url
LastPage: second parse copy/part find page "TOTPG=" 10 "=&"
print LastPage
repeat CurrentPageNumber to-integer LastPage [ print CurrentPageNumber ]
[4/8] from: bpaddock:csonline at: 1-Sep-2001 21:11
On Saturday 01 September 2001 09:32 am, Bob Paddock wrote:
> On Saturday 01 September 2001 03:31 am, Anton wrote:
> > Now, it's up to you to stick it all together into a nice
> > program. :)
I did get the following to download the 7 pages of this patent. Is there a
better way to do this?
Watch out for line wrapping...
REBOL [
Title: "Get Patent, Page Download test"
Date: 31-Aug-2001/21:07:00:00-4:00
]
url:
http://l2.espacenet.com/dips/bnsviewnav?CY=gb&LG=en&DB=EPD&PN=US4215330&ID=US+++4215330A1+I+
page: read url
print copy/part find page "PG" 10
print copy/part find page "TOTPG" 10
LastPage: second parse copy/part find page "TOTPG=" 10 "=&"
print LastPage
pdf-url: "l2.espacenet.com/dips/bns.pdf?&PN=US4215330&ID=US+++4215330A1+I+&PG="
repeat CurrentPageNumber to-integer LastPage [
    get-page-url: to-url rejoin ["http://" pdf-url to-string CurrentPageNumber]
    print get-page-url
    output-file-name: to-file rejoin ["US4215330pg" to-string CurrentPageNumber ".pdf"]
    print output-file-name
    write/binary output-file-name read/binary get-page-url
]
[5/8] from: arolls:bigpond:au at: 2-Sep-2001 16:41
Well, it's working isn't it? :) Good job.
You don't need to convert CurrentPageNumber
to a string inside the repeat loop; as
you can see, the integer gets converted
automatically by rejoin:
>> rejoin ["astring" 23]
== "astring23"
I am also of the opinion that such long variable
names end up not being that helpful. I find that
it gets hard to read because the code is too long.
Obviously, you should keep it how it looks best
to you, but I would name my variables much shorter:
LastPage -> pages (or totpg, showing where it came from)
CurrentPageNumber -> page (or pg)
Also, once you have figured out how to print PG
and TOTPG, you can remove the print statement and
use probe instead:
LastPage: second parse probe copy/part find page "TOTPG=" 10 "=&"
                       ^
Probe is cool, it allows you to print the value to the right
but returns that same value to the left, so it has no effect
on the function of the code, it just prints out stuff at a
particular point, or "probes" a point. I use it all the time.
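To see the pass-through behaviour in isolation, here is a small sketch on a literal string instead of the downloaded page. The sample text and the word count are assumptions for illustration only.

```rebol
REBOL [Title: "probe passes its value through (sketch)"]

; Stand-in for the fragment of the page that holds the counters
sample: "TOTPG=7&PG=1"

; probe prints the 7 characters copied from "TOTPG=" onward, then hands
; that same string on to parse, so the result is unchanged by probing.
count: to-integer second parse probe copy/part find sample "TOTPG=" 7 "=&"
print count
```

Removing the word probe from the line changes nothing except that the intermediate string is no longer printed.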
[6/8] from: bpaddock:csonline at: 3-Sep-2001 18:07
On Sunday 02 September 2001 02:41 am, Anton wrote:
> Well, it's working isn't it? :) Good job.
Now it's working better. I've got the look and feel how I want it, with one
exception.
How do I get the 'cancel' button to work in the GetPatent function? Is
there an event tutorial someplace, or some simpler way?
Also, I need to handle faults gracefully, like network timeouts. Are there any
tutorials on that out there?
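On the fault-handling question, one common Rebol idiom is to wrap the read in try and test the result with error?. The sketch below is only an illustration: safe-read is a hypothetical helper name, and it reads a missing local file so that it fails without needing a network connection. For network reads, REBOL/Core also has a default timeout setting under system/schemes/default/timeout that may be worth checking in the documentation.

```rebol
REBOL [Title: "Guarded read (sketch)"]

; Hypothetical helper: returns the data read, or none if the read
; raised an error (missing file, unreachable host, timeout, ...).
safe-read: func [target [url! file!] /local result] [
    either error? try [result: read/binary target] [none] [result]
]

; A file that is assumed not to exist, to force the failure path
data: safe-read %no-such-file.bin
either data [print "read ok"] [print "read failed"]
```

In a download loop the none result could be used to show an alert and break out instead of letting the script stop with an error.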
> I am also of the opinion that such long variable
> names end up not being that helpful. I find that
> it gets hard to read because the code is too long.
Let's just say on that point we don't agree.
> > Watch out for line wrapping...
REBOL [
Title: "Get Patent"
Date: 03-Sep-2001/17:56:00:00-4:00
File: %patent.r
Author: "Bob Paddock"
Version: 1.0.0
]
GetPatent: func [
{Request a Patent Number to download from the net. Show progress. Return
none on error.}
PatentNumber
/local url page pdf-url LastPage CurrentPage GetPageURL OutputNameFILE stop
PatentDownload
] [
;Examples PatentNumber: 4215330, 6163242
;http://l2.espacenet.com/dips/bnsviewnav?CY=ep&LG=en&DB=EPD&PN=US6163242&ID=US+++6163242A1+I+
;http://l2.espacenet.com/dips/bnsviewnav?CY=gb&LG=en&DB=EPD&PN=US4215330&ID=US+++4215330A1+I+
; Read the 'navbar' to find out how many pages that there are to download:
url: to-url rejoin ["http://l2.espacenet.com/dips/bnsviewnav?DB=EPD&PN=US"
PatentNumber "&ID=US+++" PatentNumber "A1+I+"]
page: read url
LastPage: to-integer second parse copy/part find page "TOTPG=" 10 "=&"
; Need to test LastPage for zero
;LastPage: 3 ; testing
print rejoin ["There are " LastPage " pages to this Patent, downloading now:"]
pdf-url: rejoin ["l2.espacenet.com/dips/bns.pdf?&PN=US" PatentNumber
"&ID=US+++" PatentNumber "A1+I+&PG="]
; Download all of the pages in the following loop,
; printout the URL and the name of the file being saved as we go:
view/new center-face PatentDownload: layout [
title: text 300 bold red black center
ProgressBar: progress 300x30
across
button 90 "Cancel" [stop: true]
stat: text 240 bold red black middle
return
ElapsedTimeText: text 240 bold red black center
return
EstimatedTimeText: text 240 bold red black center
return
RemainingTimeText: text 240 bold red black center
]
stop: false
ProgressBar/data: 0
title/text: reform ["Patent " PatentNumber " has " LastPage "pages"]
show title
StartTime: now/time
ElapsedTimeText/text: reform ["Start Time: " StartTime]
show ElapsedTimeText
repeat CurrentPageNumber LastPage [
if stop [break]
stat/text: reform ["Downloading Page " CurrentPageNumber " Now"]
show stat
GetPageURL: to-url rejoin ["http://" pdf-url to-string CurrentPageNumber]
print GetPageURL
OutputNameFILE: probe to-file rejoin ["US" PatentNumber "pg"
CurrentPageNumber ".pdf"]
; Don't get pages that we do not need:
if not exists? OutputNameFILE [
write/binary OutputNameFILE read/binary GetPageURL
]
ProgressBar/data: ProgressBar/data + (1 / LastPage)
elapsed: now/time - StartTime
estimated: elapsed * ((LastPage + 1) / CurrentPageNumber)
ElapsedTimeText/text: reform ["Elapsed Time: " elapsed]
EstimatedTimeText/text: reform ["Estimated Time: " estimated]
RemainingTimeText/text: reform ["Remaining Time: " estimated - elapsed]
show [stat ProgressBar ElapsedTimeText EstimatedTimeText RemainingTimeText]
] ; Repeat
unview/only PatentDownload
print "Leaving GetPatent"
] ; GetPatent
; Derived from emailsend.r:
view layout [
backdrop 30.40.100 effect [grid 10x10]
origin 40x20
h2 white "Download Patent:"
msg: field "Enter Patent Number here..." 210
text white "By Your Command:"
across return
button "Get Patent" [GetPatent msg/text]
return
button "Quit" [quit]
]
;do-events
[7/8] from: sanghabum:aol at: 4-Sep-2001 6:47
[bpaddock--csonline--net] writes:
> How do I get the 'cancel' button to work in the GetPatent function? Is
> there a Event tutorial some place, or some simpler way?
>
Hi there,
To give other buttons (like your cancel button) a look-in, you need to put a
wait into the main loop, e.g.:
repeat CurrentPageNumber LastPage [
wait 0
if stop [break]
....
Wait 0 should work, if not try a very short time like wait 0.01.
It's a trick I learnt from the masters on this list.
--Colin
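The wait trick can be tried on its own, without any downloading, in REBOL/View. This is a minimal sketch with made-up names (note, stop) and an arbitrary 20-step loop standing in for the page downloads.

```rebol
REBOL [Title: "Cancelable loop (sketch)"]

stop: false

; Open the window without blocking, so the loop below can run
view/new layout [
    note: text 200 "Starting..."
    button "Cancel" [stop: true]
]

repeat n 20 [
    wait 0.1            ; lets pending GUI events (the button click) be handled
    if stop [break]
    note/text: reform ["Step" n "of 20"]
    show note
]

unview
```

Without the wait, the loop never hands control back to the event system, so the button action that sets stop does not get a chance to run until the loop is over.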
[8/8] from: bpaddock:csonline at: 22-Nov-2001 7:27
On Friday 31 August 2001 07:16 pm, you wrote:
> Every now and then I need to get a patent from one of the Internet Patent
> Servers.
>
> Unfortunately they make you download one page at a time. Downloading a 38
> page patent this way took over a hour. Wait-for page to finish loading,
> click 'next page' button, repeat cycle till no more pages.
At long last I got my patent downloading script into a workable state. If
nothing else it will act as an example of how to do progress bars and downloads.
-- Attached file included as plaintext by Listar --
-- File: patent.r
REBOL [
Title: "Get Patent"
Date: "Nov-21-2001 20:48"
File: %patent.r
Author: "Bob Paddock and Astrid Sindle"
Version: 1.0.7
Purpose: {
Downloads various types of patents from the l2.espacenet.com server.
Normally espace forces you to download the patents one page at a time.
This script gets all of the pages for you automatically.
Shows various progress bars and time estimates.
There may be patents that this does not get because I can not
find any documentation on how l2.espacenet encodes its URL's.
If you know how to encode a specific URL please let me know
so that I can add support for it. - [bpaddock--csonline--net]
}
]
GetPatent: func [
{
Request a Patent Number to download from the net.
Show progress. Displays alert box then aborts script on error.
}
PatentServer
PatentNumber
/local url page pdf-url LastPage CurrentPage GetPageURL OutputNameFILE stop PatentDownload
] [
url: probe to-url rejoin ["http://l2.espacenet.com/dips/bnsviewnav?DB=EPD&PN=" PatentServer
PatentNumber "&ID=" PatentServer "+++" PatentNumber "A1+I+"]
page: read url
; Uncomment the following to see what the page we just got looks like:
; print PatentServer
; Print PatentNumber
; print page
not-now1: "Service is temporarily unavailable"
if find page not-now1 [alert not-now1 quit]
not-now2: "The document request could not be processed"
if find page not-now2 [alert not-now2 quit]
; Copy 10 chars starting at "TOTPG=" to find the number of pages in this patent
LastPage: to-integer second parse copy/part find page "TOTPG=" 10 "=&"
;LastPage: 3 ; testing
if LastPage < 1 [alert "Zero Pages to This Patent?" quit]
print rejoin ["There are " LastPage " pages to this Patent, downloading now:"]
pdf-url: rejoin ["l2.espacenet.com/dips/bns.pdf?&PN=" PatentServer PatentNumber "&ID="
PatentServer "+++" PatentNumber "A1+I+&PG="]
; Download all of the pages in the following loop,
; printout the URL and the name of the file being saved as we go:
view/new center-face PatentDownload: layout [
title: text 300 bold red black center
ProgressBar: progress 300x30
across
toggle 90 "Cancel" "Stop" [stop: true]
stat: text 240 bold red black middle
return
ElapsedTimeText: text 240 bold red black center
return
EstimatedTimeText: text 240 bold red black center
return
RemainingTimeText: text 240 bold red black center
]
stop: false
ProgressBar/data: 0
title/text: reform ["Patent " PatentNumber " has " LastPage "pages"]
show title
StartTime: now/time
ElapsedTimeText/text: reform ["Start Time: " StartTime]
show ElapsedTimeText
; Do{}While CurrentPageNumber <= LastPage:
repeat CurrentPageNumber LastPage [
wait 1 ; Required to get the 'cancel' button to work
if stop [break]
stat/text: reform ["Downloading Page " CurrentPageNumber " Now"]
show stat
GetPageURL: probe to-url rejoin ["http://" pdf-url to-string CurrentPageNumber]
OutputNameFILE: probe to-file rejoin [PatentServer PatentNumber "pg" CurrentPageNumber
".pdf"]
; Don't get pages that we do not need:
if not exists? OutputNameFILE [
local-file: OutputNameFILE
if not request-download/to GetPageURL local-file [
alert "Download failed or canceled." quit
]
]
ProgressBar/data: ProgressBar/data + (1 / LastPage)
elapsed: now/time - StartTime
estimated: elapsed * ((LastPage + 1) / CurrentPageNumber)
ElapsedTimeText/text: reform ["Elapsed Time: " elapsed]
EstimatedTimeText/text: reform ["Estimated Time: " estimated]
RemainingTimeText/text: reform ["Remaining Time: " estimated - elapsed]
show [stat ProgressBar ElapsedTimeText EstimatedTimeText RemainingTimeText]
] ; Repeat
unview/only PatentDownload
print "Leaving GetPatent"
] ; GetPatent
; Derived from emailsend.r:
view layout [
backdrop 30.40.100 effect [grid 10x10]
origin 40x20
help-lbl: h2 white "Select Patent Server:"
help-lbl-2: h3 white "" 200
PatentServer: choice "Select" "EP" "US" "WO"
[
switch PatentServer/text [
"Select" [ help-lbl/text: "Select patent server:"
help-lbl-2/text: ""
]
"US" [ help-lbl/text: "Download US Patent:"
help-lbl-2/text: "e.g. 4215330 or 6163242"
]
"WO" [ help-lbl/text: "Download PCT Application [WO]:"
help-lbl-2/text: "e.g. 0177456 or 9912345"
]
"EP" [ help-lbl/text: "Download EP Application:"
help-lbl-2/text: "e.g. 0234567 (7 digit)"
]
]
show help-lbl
show help-lbl-2
]
msg: field "Enter number here..." 210
text white "Press button to retrieve patent:"
across return
button "Get Patent" [
    if all [not equal? msg/text "Enter number here..." not equal? PatentServer/text "Select"] [
        GetPatent PatentServer/text msg/text
    ]
]
return
button "Quit" [quit]
]
do-events
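The time estimate in the download loop is simple proportional arithmetic on time! values. The sketch below works it through with made-up numbers (30 seconds elapsed, 2 of 7 pages done). It uses a plain pages ratio, which is an assumption and differs slightly from the script above, where the numerator is LastPage + 1.

```rebol
REBOL [Title: "Download time estimate (sketch)"]

elapsed: 0:00:30      ; assumed time spent so far
total-pages: 7        ; assumed page count
pages-done: 2         ; assumed pages downloaded so far

; Scale the elapsed time by the share of pages still outstanding
estimated: elapsed * (total-pages / pages-done)

print ["Estimated total:" estimated]
print ["Remaining:" estimated - elapsed]
```

With the plain ratio the remaining time reaches zero when the last page is done, whereas adding one to the page count leaves a small positive remainder at the end.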