Mailing List Archive: 49091 messages

[REBOL] Re: How to extract content of HTML table?

From: edoconnor::gmail::com at: 28-Aug-2006 16:46

Hi Jose-- Parse can be very complex, and HTML is often poorly structured. Here are some general, non-industrial-strength tips. If you use parse to find and isolate your HTML table, and then use load/markup to split it into tags and strings so the tags can be filtered out, you should be in pretty good shape.
>> page: read http://www.example.com
...
>> parse/all page [any [to {<table} copy table thru </table>] to end]
== true
>> table
== {<table><tr><td>row 1 col 1</td><td>row 1 col 2</td></tr><tr><td>row 2 col 1</td><td>row 2 col 2</td></tr></table>}
Then use load/markup to break the table into a block of tags and strings, and remove the tags from that block. I think there's a one-liner from Carl on Rebol.com that shows this simple tag-stripping technique.
>> loaded-table: load/markup table
== [<table> <tr> <td> "row 1 col 1" </td> <td> "row 1 col 2" </td> </tr> <tr> <td> "row 2 col 1" </td> <td> "row 2 col 2" </td> </t...
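Note that the loaded block still carries the row boundaries as <tr> tags at this point, so the cells can be grouped by row before the tags are thrown away. Here is a minimal sketch, not from the original post: the names rows and item are only illustrative, and it assumes the <tr> tags carry no attributes and that there is no whitespace between tags (load/markup would return such whitespace as extra strings).

```rebol
REBOL [Title: "Group table cells by row (illustrative sketch)"]

; Sample input, same shape as the table isolated above.
table: {<table><tr><td>row 1 col 1</td><td>row 1 col 2</td></tr><tr><td>row 2 col 1</td><td>row 2 col 2</td></tr></table>}

rows: copy []    ; will hold one block of cell strings per <tr>
foreach item load/markup table [
    either tag? item [
        ; a bare <tr> starts a new, empty row; all other tags are ignored
        if item = <tr> [append/only rows copy []]
    ][
        ; anything that is not a tag is cell text;
        ; text that appears before the first <tr> is skipped
        if not empty? rows [append last rows item]
    ]
]
probe rows
```

With the sample table this should leave rows as one block per table row, each holding that row's cell strings. Real pages with attributes such as <tr class="..."> would need a looser test than item = <tr>.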
>> remove-each item loaded-table [tag? item]
== ["row 1 col 1" "row 1 col 2" "row 2 col 1" "row 2 col 2"]
This approach is reasonable for quick-and-dirty text extraction. To get to well-defined nested structures, you'll want to use more advanced parse techniques, ideally screened through HTMLtidy first.
Ed
On 8/28/06, José Antonio <joseantoniorocha-gmail.com> wrote: