Mailing List Archive: 49091 messages

Downloading patents?

 [1/8] from: bpaddock::csonline::net at: 31-Aug-2001 19:16


Every now and then I need to get a patent from one of the Internet patent servers. Unfortunately they make you download one page at a time. Downloading a 38-page patent this way took over an hour: wait for the page to finish loading, click the 'next page' button, and repeat the cycle until there are no more pages.

Here is a representative URL:

    http://l2.espacenet.com/dips/bnsviewer?CY=gb&LG=en&DB=EPD&PN=US4215330&ID=US+++4215330A1+I+

I've played with proxy.r some to see what information the server and browsers are exchanging. What I was wondering is whether anyone has already done a patent downloading program. I don't want to reinvent the wheel. I know there are a few commercial ones out there, but they have too many zeros in the price tag for my occasional-usage budget.

I've done little in the way of Rebol coding, so does anyone have any tips that might help me get this done? I need to parse out the maximum number of pages from the first header, then ask the server for each subsequent page.

 [2/8] from: arolls:bigpond:au at: 1-Sep-2001 17:31


Your example has 7 pages, right?

If you look in the source to your example link below, you see that there are two frames. The first one is the navigation bar, whose right arrow button you keep clicking to get to the next page.

If you alter your url below, replacing "bnsviewer" with "bnsviewnav" (and keeping the rest of the query junk on the end), you have the link to the specific navbar for your specific patent.

So, in rebol (watch out for line wrapping):

    url: http://l2.espacenet.com/dips/bnsviewnav?CY=gb&LG=en&DB=EPD&PN=US4215330&ID=US+++4215330A1+I+
    print page: read url

If you look at that html code and search for TOTPG you can see "TOTPG=7". So now you can find out how many pages there are.

    find page "TOTPG"

You should not find it difficult to grab the first number from that string.

Now to construct the urls that point to each page. If we look back a little bit from TOTPG we can see another nice variable, "PG":

    find page "PG="

Great. Now we can simply modify url, adding PG=x, where x is your desired page number. For example, to go to page 3 (watch wrap):

    http://l2.espacenet.com/dips/bnsviewer?CY=gb&LG=en&DB=EPD&PN=US4215330&ID=US+++4215330A1+I+&PG=3

Now to find out how to get directly to each pdf file.

    find page ".pdf"

We can see this relative link:

    /dips/bns.pdf?CY=gb&LG=en&PN=US4215330&ID=US+++4215330A1+I+&PG=1

So, here is an absolute link that returns the pdf file for page 3 (watch wrap):

    pdf-url: http://l2.espacenet.com/dips/bns.pdf?&PN=US4215330&ID=US+++4215330A1+I+&PG=3

This has (my thinking) the essentials: PN, ID and PG. (You can keep CY and LG if leaving them out causes problems.)

    write/binary %test.pdf read/binary pdf-url
    browse %test.pdf

Now, it's up to you to stick it all together into a nice program. :)
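
As a rough illustration of the two steps described above (grab the number after TOTPG, then build one URL per page), here is a minimal sketch. It does not touch the network: the `page` string is a made-up stand-in for the navbar HTML, and the names `digit`, `total`, `pages` and `base` are chosen only for this example. It assumes the page count is a plain run of digits directly after "TOTPG=".

```rebol
REBOL [Title: "TOTPG extraction sketch"]

; Hypothetical stand-in for the HTML that `read url` would return.
page: {<a href="/dips/bnsviewer?PN=US4215330&PG=1&TOTPG=7&ID=x">next</a>}

digit: charset "0123456789"
total: none

; Skip to just past "TOTPG=", then copy the run of digits that follows.
either parse/all page [thru "TOTPG=" copy total some digit to end] [
    pages: to-integer total
][
    pages: 0    ; marker not found: treat as "no pages"
]

; Base of the per-page PDF link quoted in the message, ending in "PG=".
base: "http://l2.espacenet.com/dips/bns.pdf?&PN=US4215330&ID=US+++4215330A1+I+&PG="

; Build and print one URL per page; rejoin forms the integer for us.
repeat pg pages [
    print rejoin [base pg]
]
```

Using a parse rule rather than a fixed-width copy means the count is still picked up if it runs to two or three digits, and a missing TOTPG shows up as zero pages instead of an error.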

 [3/8] from: bpaddock:csonline at: 1-Sep-2001 9:32


On Saturday 01 September 2001 03:31 am, Anton wrote:
> Now, it's up to you to stick it all together into a nice > program. :)
Just working out the looping through all of the pages, this is what I came up with. It gets the max page count and simply prints the page number at this point.

I expect it is more complex than it needs to be, since this is my first real Rebol program. How do I do better?

Watch for line wrapping in the url: line.

    REBOL [
        Title: "Get Patent, looping test"
        Date: 31-Aug-2001/06:39:00:00-4:00
    ]

    url: http://l2.espacenet.com/dips/bnsviewnav?CY=gb&LG=en&DB=EPD&PN=US4215330&ID=US+++4215330A1+I+

    page: read url

    LastPage: second parse copy/part find page "TOTPG=" 10 "=&"
    print LastPage

    repeat CurrentPageNumber to-integer LastPage [
        print CurrentPageNumber
    ]

 [4/8] from: bpaddock:csonline at: 1-Sep-2001 21:11


On Saturday 01 September 2001 09:32 am, Bob Paddock wrote:
> On Saturday 01 September 2001 03:31 am, Anton wrote: > > Now, it's up to you to stick it all together into a nice > > program. :)
I did get the following to download the 7 pages of this patent. Are there any ways to do this better than I have?

Watch out for line wrapping...

    REBOL [
        Title: "Get Patent, Page Download test"
        Date: 31-Aug-2001/21:07:00:00-4:00
    ]

    url: http://l2.espacenet.com/dips/bnsviewnav?CY=gb&LG=en&DB=EPD&PN=US4215330&ID=US+++4215330A1+I+

    page: read url

    print copy/part find page "PG" 10
    print copy/part find page "TOTPG" 10

    LastPage: second parse copy/part find page "TOTPG=" 10 "=&"
    print LastPage

    pdf-url: "l2.espacenet.com/dips/bns.pdf?&PN=US4215330&ID=US+++4215330A1+I+&PG="

    repeat CurrentPageNumber to-integer LastPage [
        get-page-url: to-url rejoin ["http://" pdf-url to-string CurrentPageNumber]
        print get-page-url

        output-file-name: to-file rejoin ["US4215330pg" to-string CurrentPageNumber ".pdf"]
        print output-file-name

        write/binary output-file-name read/binary get-page-url
    ]

 [5/8] from: arolls:bigpond:au at: 2-Sep-2001 16:41


Well, it's working, isn't it? :) Good job.

You don't need to convert CurrentPageNumber to a string inside the repeat loop. As you can see, the integer gets converted automatically by rejoin:

    >> rejoin ["astring" 23]
    == "astring23"

I am also of the opinion that such long variable names end up not being that helpful. I find that it gets hard to read because the code is too long. Obviously, you should keep it how it looks best to you, but I would name my variables much shorter:

    LastPage          -> pages (or totpg, showing where it came from)
    CurrentPageNumber -> page (or pg)

Also, once you have figured out how to print PG and TOTPG, you can remove the print statements and use probe instead:

    LastPage: second parse probe copy/part find page "TOTPG=" 10 "=&"
                           ^

Probe is cool. It prints the value to its right but returns that same value to the left, so it has no effect on the function of the code. It just prints out stuff at a particular point, or "probes" a point. I use it all the time.

 [6/8] from: bpaddock:csonline at: 3-Sep-2001 18:07


On Sunday 02 September 2001 02:41 am, Anton wrote:
> Well, it's working isn't it? :) Good job.
Now it's working better. I've got the look and feel how I want it, with one exception. How do I get the 'cancel' button to work in the GetPatent function? Is there an Event tutorial some place, or some simpler way? Also, I need to handle faults gracefully, like network timeouts. Are there any tutorials on that out there?
> I am also of the opinion that such long variable > names end up not being that helpful. I find that > it gets hard to read because the code is too long.
Let's just say on that point we don't agree.
> > Watch out for line wrapping...

    REBOL [
        Title: "Get Patent"
        Date: 03-Sep-2001/17:56:00:00-4:00
        File: %patent.r
        Author: "Bob Paddock"
        Version: 1.0.0
    ]

    GetPatent: func [
        {Request a Patent Number to download from the net. Show progress.
        Return none on error.}
        PatentNumber
        /local url page pdf-url LastPage CurrentPage GetPageURL OutputNameFILE stop PatentDownload
    ] [
        ; Examples PatentNumber: 4215330, 6163242
        ; http://l2.espacenet.com/dips/bnsviewnav?CY=ep&LG=en&DB=EPD&PN=US6163242&ID=US+++6163242A1+I+
        ; http://l2.espacenet.com/dips/bnsviewnav?CY=gb&LG=en&DB=EPD&PN=US4215330&ID=US+++4215330A1+I+

        ; Read the 'navbar' to find out how many pages there are to download:
        url: to-url rejoin ["http://l2.espacenet.com/dips/bnsviewnav?DB=EPD&PN=US" PatentNumber "&ID=US+++" PatentNumber "A1+I+"]
        page: read url

        LastPage: to-integer second parse copy/part find page "TOTPG=" 10 "=&"
        ; Need to test LastPage for zero
        ;LastPage: 3 ; testing
        print rejoin ["There are " LastPage " pages to this Patent, downloading now:"]

        pdf-url: rejoin ["l2.espacenet.com/dips/bns.pdf?&PN=US" PatentNumber "&ID=US+++" PatentNumber "A1+I+&PG="]

        ; Download all of the pages in the following loop,
        ; printout the URL and the name of the file being saved as we go:
        view/new center-face PatentDownload: layout [
            title: text 300 bold red black center
            ProgressBar: progress 300x30
            across
            button 90 "Cancel" [stop: true]
            stat: text 240 bold red black middle
            return
            ElapsedTimeText: text 240 bold red black center
            return
            EstimatedTimeText: text 240 bold red black center
            return
            RemainingTimeText: text 240 bold red black center
        ]

        stop: false
        ProgressBar/data: 0

        title/text: reform ["Patent " PatentNumber " has " LastPage "pages"]
        show title

        StartTime: now/time
        ElapsedTimeText/text: reform ["Start Time: " StartTime]
        show ElapsedTimeText

        repeat CurrentPageNumber LastPage [
            if stop [break]

            stat/text: reform ["Downloading Page " CurrentPageNumber " Now"]
            show stat

            GetPageURL: to-url rejoin ["http://" pdf-url to-string CurrentPageNumber]
            print GetPageURL

            OutputNameFILE: probe to-file rejoin ["US" PatentNumber "pg" CurrentPageNumber ".pdf"]

            ; Don't get pages that we do not need:
            if not exists? OutputNameFILE [
                write/binary OutputNameFILE read/binary GetPageURL
            ]

            ProgressBar/data: ProgressBar/data + (1 / LastPage)

            elapsed: now/time - StartTime
            estimated: elapsed * ((LastPage + 1) / CurrentPageNumber)

            ElapsedTimeText/text: reform ["Elapsed Time: " elapsed]
            EstimatedTimeText/text: reform ["Estimated Time: " estimated]
            RemainingTimeText/text: reform ["Remaining Time: " estimated - elapsed]

            show [stat ProgressBar ElapsedTimeText EstimatedTimeText RemainingTimeText]
        ] ; Repeat

        unview/only PatentDownload
        print "Leaving GetPatent"
    ] ; GetPatent

    ; Derived from emailsend.r:
    view layout [
        backdrop 30.40.100 effect [grid 10x10]
        origin 40x20
        h2 white "Download Patent:"
        msg: field "Enter Patent Number here..." 210
        text white "By Your Command:"
        across
        return
        button "Get Patent" [GetPatent msg/text]
        return
        button "Quit" [quit]
    ]
    ;do-events

 [7/8] from: sanghabum:aol at: 4-Sep-2001 6:47


[bpaddock--csonline--net] writes:
> How do I get the 'cancel' button to work in the GetPatent function? Is > there a Event tutorial some place, or some simpler way? >
Hi there,

To give other buttons (like your cancel button) a look in, you need to put a wait into the main loop, e.g.:

    repeat CurrentPageNumber LastPage [
        wait 0
        if stop [break]
        ....

Wait 0 should work; if not, try a very short time like wait 0.01.

It's a trick I learnt from the masters on this list.

--Colin
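
A self-contained sketch of the same trick, for anyone who wants to try it outside the patent script. It assumes REBOL/View; the window contents, the 20-step loop and the names `stop` and `status` are invented for the example and stand in for the real download loop.

```rebol
REBOL [Title: "Cancelable loop sketch"]

stop: false

; view/new opens the window and returns at once instead of
; entering the event loop, so the code below keeps running.
view/new layout [
    status: text 200 "Starting..."
    button "Cancel" [stop: true]
]

repeat n 20 [
    ; Give View a chance to process pending events (such as a click
    ; on Cancel) before doing the next chunk of work.
    wait 0.1
    if stop [break]
    status/text: reform ["Step" n "of 20"]
    show status
]

unview
```

Without the wait, the button action would not normally get a chance to run until the loop had finished, which is why the Cancel button appeared dead in the earlier version.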

 [8/8] from: bpaddock:csonline at: 22-Nov-2001 7:27


On Friday 31 August 2001 07:16 pm, you wrote:
> Every now and then I need to get a patent from one of the Internet Patent > Servers. > > Unfortunately they make you download one page at a time. Downloading a 38 > page patent this way took over a hour. Wait-for page to finish loading, > click 'next page' button, repeat cycle till no more pages.
At long last I got my patent downloading script into a workable state. If nothing else it will act as an example of how to do progress bars and downloads.

-- Attached file included as plaintext by Listar --
-- File: patent.r

    REBOL [
        Title: "Get Patent"
        Date: "Nov-21-2001 20:48"
        File: %patent.r
        Author: "Bob Paddock and Astrid Sindle"
        Version: 1.0.7
        Purpose: {
            Downloads various types of patents from the l2.espacenet.com server.
            Normally espace forces you to download the patents one page at a time.
            This script gets all of the pages for you automatically.
            Shows various progress bars and time estimates.

            There may be patents that this does not get because I can not find any
            documentation on how l2.espacenet encodes its URL's. If you know how to
            encode a specific URL please let me know so that I can add support for it.
            - [bpaddock--csonline--net]
        }
    ]

    GetPatent: func [
        {
        Request a Patent Number to download from the net. Show progress.
        Displays alert box then aborts script on error.
        }
        PatentServer PatentNumber
        /local url page pdf-url LastPage CurrentPage GetPageURL OutputNameFILE stop PatentDownload
    ] [
        url: probe to-url rejoin ["http://l2.espacenet.com/dips/bnsviewnav?DB=EPD&PN=" PatentServer PatentNumber "&ID=" PatentServer "+++" PatentNumber "A1+I+"]
        page: read url

        ; Uncomment the following to see what the page we just got looks like:
        ; print PatentServer
        ; print PatentNumber
        ; print page

        not-now1: "Service is temporarily unavailable"
        if find page not-now1 [alert not-now1 quit]

        not-now2: "The document request could not be processed"
        if find page not-now2 [alert not-now2 quit]

        ; Copy 10 chars starting at "TOTPG=" to find the number of pages in this patent
        LastPage: to-integer second parse copy/part find page "TOTPG=" 10 "=&"
        ;LastPage: 3 ; testing
        if LastPage < 1 [alert "Zero Pages to This Patent?" quit]

        print rejoin ["There are " LastPage " pages to this Patent, downloading now:"]

        pdf-url: rejoin ["l2.espacenet.com/dips/bns.pdf?&PN=" PatentServer PatentNumber "&ID=" PatentServer "+++" PatentNumber "A1+I+&PG="]

        ; Download all of the pages in the following loop,
        ; printout the URL and the name of the file being saved as we go:
        view/new center-face PatentDownload: layout [
            title: text 300 bold red black center
            ProgressBar: progress 300x30
            across
            toggle 90 "Cancel" "Stop" [stop: true]
            stat: text 240 bold red black middle
            return
            ElapsedTimeText: text 240 bold red black center
            return
            EstimatedTimeText: text 240 bold red black center
            return
            RemainingTimeText: text 240 bold red black center
        ]

        stop: false
        ProgressBar/data: 0

        title/text: reform ["Patent " PatentNumber " has " LastPage "pages"]
        show title

        StartTime: now/time
        ElapsedTimeText/text: reform ["Start Time: " StartTime]
        show ElapsedTimeText

        ; Do{}While CurrentPageNumber <= LastPage:
        repeat CurrentPageNumber LastPage [
            wait 1 ; Required to get the 'cancel' button to work
            if stop [break]

            stat/text: reform ["Downloading Page " CurrentPageNumber " Now"]
            show stat

            GetPageURL: probe to-url rejoin ["http://" pdf-url to-string CurrentPageNumber]
            OutputNameFILE: probe to-file rejoin [PatentServer PatentNumber "pg" CurrentPageNumber ".pdf"]

            ; Don't get pages that we do not need:
            if not exists? OutputNameFILE [
                local-file: OutputNameFILE
                if not request-download/to GetPageURL local-file [
                    alert "Download failed or canceled."
                    quit
                ]
            ]

            ProgressBar/data: ProgressBar/data + (1 / LastPage)

            elapsed: now/time - StartTime
            estimated: elapsed * ((LastPage + 1) / CurrentPageNumber)

            ElapsedTimeText/text: reform ["Elapsed Time: " elapsed]
            EstimatedTimeText/text: reform ["Estimated Time: " estimated]
            RemainingTimeText/text: reform ["Remaining Time: " estimated - elapsed]

            show [stat ProgressBar ElapsedTimeText EstimatedTimeText RemainingTimeText]
        ] ; Repeat

        unview/only PatentDownload
        print "Leaving GetPatent"
    ] ; GetPatent

    ; Derived from emailsend.r:
    view layout [
        backdrop 30.40.100 effect [grid 10x10]
        origin 40x20
        help-lbl: h2 white "Select Patent Server:"
        help-lbl-2: h3 white "" 200
        PatentServer: choice "Select" "EP" "US" "WO" [
            switch PatentServer/text [
                "Select" [
                    help-lbl/text: "Select patent server:"
                    help-lbl-2/text: ""
                ]
                "US" [
                    help-lbl/text: "Download US Patent:"
                    help-lbl-2/text: "e.g. 4215330 or 6163242"
                ]
                "WO" [
                    help-lbl/text: "Download PCT Application [WO]:"
                    help-lbl-2/text: "e.g. 0177456 or 9912345"
                ]
                "EP" [
                    help-lbl/text: "Download EP Application:"
                    help-lbl-2/text: "e.g. 0234567 (7 digit)"
                ]
            ]
            show help-lbl
            show help-lbl-2
        ]
        msg: field "Enter number here..." 210
        text white "Press button to retrieve patent:"
        across
        return
        button "Get Patent" [
            if all [
                not equal? msg/text "Enter number here..."
                not equal? PatentServer/text "Select"
            ] [
                GetPatent PatentServer/text msg/text
            ]
        ]
        return
        button "Quit" [quit]
    ]
    do-events