Mailing List Archive: 49091 messages

[REBOL] Re: Downloading patents?

From: bpaddock:csonline at: 1-Sep-2001 21:11

On Saturday 01 September 2001 09:32 am, Bob Paddock wrote:
> On Saturday 01 September 2001 03:31 am, Anton wrote:
> > Now, it's up to you to stick it all together into a nice
> > program. :)

I did get the following to download the 7 pages of this patent.
Is there a better way to do this than what I have?

Watch out for line wrapping...

    REBOL [
        Title: "Get Patent, Page Download test"
        Date: 31-Aug-2001/21:07:00:00-4:00
    ]

    url:
    page: read url

    print copy/part find page "PG" 10
    print copy/part find page "TOTPG" 10

    ; Split "TOTPG=7&..." on "=" and "&"; the second item is the page count
    LastPage: second parse copy/part find page "TOTPG=" 10 "=&"
    print LastPage

    pdf-url:

    repeat CurrentPageNumber to-integer LastPage [
        get-page-url: to-url rejoin [
            "http://" pdf-url to-string CurrentPageNumber
        ]
        print get-page-url

        output-file-name: to-file rejoin [
            "US4215330pg" to-string CurrentPageNumber ".pdf"
        ]
        print output-file-name

        write/binary output-file-name read/binary get-page-url
    ]

    comment {
    From: "Anton" <[arolls--bigpond--net--au]>
    To: <[rebol-list--rebol--com]>
    Subject: [REBOL] Re: Downloading patents?
    Date: Sat, 1 Sep 2001 17:31:23 +1000

    Your example has 7 pages, right?

    If you look in the source to your example link below, you see
    that there are two frames. The first one is the navigation bar,
    where you keep clicking the right arrow button to get to the
    next page.

    If you alter your url below, replacing "bnsviewer" with
    "bnsviewnav" (and keeping the rest of the query junk on the end),
    you have the link to the specific navbar for your specific patent.

    So, in rebol (watch out for line wrapping):

        url:
        print page: read url

    If you look at that html code and search for TOTPG you can see
    "TOTPG=7". So now you can find out how many pages there are.

        find page "TOTPG"

    You should not find it difficult to grab the first number from
    that string.

    Now to construct the urls that point to each page. If we look
    back a little bit from there, we can see another nice variable,
    "PG":

        find page "PG="

    Great. Now we can simply modify url, adding PG=x, where x is your
    desired page number, for example, to go to page 3 (watch wrap):

    Now to find out how to get directly to each pdf file.

        find page ".pdf"

    We can see this relative link:

        /dips/bns.pdf?CY=gb&LG=en&PN=US4215330&ID=US+++4215330A1+I+&PG=1

    So, here is an absolute link that returns the pdf file for page 3
    (watch wrap):

        pdf-url:

    This has (to my thinking) the essentials: PN, ID and PG.
    (You can keep CY and LG if leaving them out causes problems.)

        write/binary %test.pdf read/binary pdf-url
        browse %test.pdf
    }
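
The page-count step can also be tried without touching the network. The following is only a minimal sketch, assuming REBOL 2: `sample-page` is a made-up stand-in for the navbar HTML (the real text comes from `read url`), and the variable names are illustrative rather than taken from the script above. It reads the digits after "TOTPG=" up to the next "&" instead of relying on a fixed 10-character window, then prints the file name each page would be saved under.

```rebol
REBOL [Title: "Page-count parsing sketch"]

; Hypothetical sample of the navbar HTML (assumption, not the real page)
sample-page: {<a href="/dips/bnsviewnav?PN=US4215330&PG=1&TOTPG=7&LG=en">next</a>}

; Position just after "TOTPG="
rest: find/tail sample-page "TOTPG="

; Copy up to the next "&", or to the end of the string if there is none
total-pages: to-integer copy/part rest any [find rest "&" tail rest]
print total-pages

; One output file name per page, built the same way as in the download loop
repeat n total-pages [
    print to-file rejoin ["US4215330pg" n ".pdf"]
]
```

If "TOTPG=" is missing from the page, `find/tail` returns none and the `copy/part` line fails, so a real script would probably want to check `rest` before using it.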