Mailing List Archive: 49091 messages

A novice question

 [1/2] from: vkmodgil::yahoo::com at: 19-Aug-2000 15:27

I was trying to modify the web parser code from the User's Guide. The original code is like this:

    tag-parser: make object! [
        tags: make block! 100
        text: make string! 8000
        html-code: [
            copy tag ["<" thru ">"] (append tags tag) |
            copy txt to "<" (append text txt)
        ]
        parse-tags: func [site [url!]] [
            clear tags clear text
            parse read site [to "<" some html-code]
            print text
        ]
    ]

My aim is to pick up job listings from a web site, where each listing begins with "keyword_one" and ends with "keyword_two", but I would still like to get rid of the tags. So I tried this:

    html-code: [
        copy tag ["<" thru "keyword1"] (append tags tag) |
        copy txt to "keyword2" (append text txt)
    ]

etc., and then used tag-parser/parse-tags modified-url. But this now hangs.

Any help welcomed by this novice.

-Vik

 [2/2] from: bhandley:zip:au at: 20-Aug-2000 11:59

Hi Vik,

Did you retry your program in a fresh session of Rebol? It may have been that during your writing/testing of your program you got to a point that triggered the Rebol GC bug (which I understand is being looked at by RT).

Regarding the keywords: are they tags or text? This might change the approach. If, say, your keywords are part of the text, come immediately before and after your job posting information, and are unique enough, then you could ignore the tags completely and parse based on your keywords. Something like this maybe:

    parse-rules: [
        some [
            thru keyword-one-text
            copy text to keyword-two-text
            (print text)
        ]
    ]

Also, it may not be relevant, but note that the parse function as used in script examples ignores spaces by default (use parse/all if you want parse to process spaces).

On a different track, Rebol version 2.3 has the ability to load markup. Like this:
>> loaded-page: load/markup
loaded-page is now a block that contains values of type tag! and type string!.

    foreach item loaded-page [
        if not tag? item [
            print item
        ]
    ]

Or use parse in block mode:

    abc-news-headlines: [
        thru <!-- start insert of main story copy -->
        some [
            thru <b> copy text to </b> (print text)
        ]
        <!--end insert of copy for top stories-->
        to end
    ]

>> parse loaded-page abc-news-headlines
Supply ship approaches rescue site as hopes fade
Muslim extremists collapse hostage release talks
Monsoon bus tragedy in central India
US bushfires not letting up
Gore pulls ahead in US presidential poll
Man falls overboard in crocodile-infested waters
Fighting couple force jumbo jet to land
Sport news

This is good if you know exactly what the values of some items in the block are, but not so good if you need to do pattern matching. For example, finding the title text is easy because we know a tag <title> exists in the block.
>> copy/part find/tail loaded-page <title> 1
== ["ABC Online News - Latest Bulletin"]

Brett.
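
The keyword-based approach suggested in the reply could be sketched roughly as below. This is only an illustration under stated assumptions: the sample string `page`, the helper `strip-tags`, and the block `listings` are hypothetical names that do not appear in the thread, and the keywords are assumed to be plain text rather than tags. In Rebol 2, a `some` or `any` loop whose rule can succeed without consuming input (for example `to "keyword2"` when the input is already at that keyword) may repeat forever, which is one plausible cause of the hang. The sketch therefore matches the closing keyword explicitly so that every pass advances.

```rebol
REBOL [Title: "Keyword-delimited extraction (sketch)"]

; Hypothetical sample input; a real page would come from `read some-url`.
page: {<p>intro</p> keyword_one <b>Rebol developer</b>, Sydney keyword_two <hr> keyword_one <i>Parser tester</i> keyword_two <p>footer</p>}

listings: make block! 10

; Hypothetical helper: returns a copy of the string with anything that
; looks like <...> removed.
strip-tags: func [html [string!] /local out txt] [
    out: make string! length? html
    parse/all html [
        any [
            "<" [thru ">" | to end]             ; skip a tag (or an unterminated "<")
            | copy txt to "<" (append out txt)  ; keep the text before the next tag
        ]
        copy txt to end (if txt [append out txt])   ; keep any trailing text
    ]
    out
]

; Collect whatever lies between each keyword pair, with tags removed.
parse/all page [
    any [
        thru "keyword_one"
        copy raw to "keyword_two"
        "keyword_two"                           ; consume the closing keyword
        (if raw [append listings trim strip-tags raw])
    ]
    to end
]

foreach item listings [print item]
```

Whether tags should be stripped before or after locating the keywords depends on whether the keywords can themselves occur inside tags; here they are assumed to occur only in the text.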