Mailing List Archive: 49091 messages

How to extract content of HTML table?

 [1/9] from: joseantoniorocha::gmail at: 28-Aug-2006 16:26


Is there a fast and easy way to do that?
--
name: "José Antonio Meira da Rocha"
title: "Prof. MS."
occupation: "Consulting and training in print and online journalism"
googletalk: MSN: email: joseantoniorocha-gmail.com
site: http://meiradarocha.jor.br
ICQ: 658222
AIM: "meiradarochajor"
Skype: yahoo: "meiradarocha_jor"

 [2/9] from: tim-johnsons:web at: 28-Aug-2006 11:33


* José Antonio <joseantoniorocha-gmail.com> [060828 11:27]:
> Is there a fast and easy way to do that?
To start: how about load/markup? That will convert HTML to a block of tags and strings. HTH, tim
> --
> name: "José Antonio Meira da Rocha" title: "Prof. MS."
<<quoted lines omitted: 6>>
> To unsubscribe from the list, just send an email to > lists at rebol.com with unsubscribe as the subject.
-- Tim Johnson <tim-johnsons-web.com> http://www.alaska-internet-solutions.com
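As a minimal sketch of the behaviour Tim mentions (hypothetical fragment, REBOL 2 semantics assumed), load/markup splits markup into a flat block of tag! and string! values:

```rebol
; load/markup returns a flat block: tags as tag! values, text as string!s
html: {<p>Hello <b>world</b></p>}
probe load/markup html
; == [<p> "Hello " <b> "world" </b> </p>]
```

Because tags and text come back as different datatypes, they can be told apart afterwards with simple tag? and string? tests.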

 [3/9] from: joseantoniorocha::gmail at: 28-Aug-2006 16:39


I will try. Maybe this tip can solve my problem. Thx! On 8/28/06, Tim Johnson <tim-johnsons-web.com> wrote:
> * José Antonio <joseantoniorocha-gmail.com> [060828 11:27]: > >
<<quoted lines omitted: 20>>
> To unsubscribe from the list, just send an email to > lists at rebol.com with unsubscribe as the subject.
--
name: "José Antonio Meira da Rocha"
title: "Prof. MS."
occupation: "Consulting and training in print and online journalism"
googletalk: MSN: email: joseantoniorocha-gmail.com
site: http://meiradarocha.jor.br
ICQ: 658222
AIM: "meiradarochajor"
Skype: yahoo: "meiradarocha_jor"

 [4/9] from: joseantoniorocha::gmail at: 28-Aug-2006 17:13


No prize for me. load/markup just made a huge block with the HTML page content. Not useful, as table elements are not nested in blocks. I guess this can be accomplished with the parse function, but parse is very complex. Has anyone already made a function that does this task, extracting table content as, say, comma-separated values or nested blocks? On 8/28/06, Tim Johnson <tim-johnsons-web.com> wrote:
> * José Antonio <joseantoniorocha-gmail.com> [060828 11:27]: > >
<<quoted lines omitted: 20>>
> To unsubscribe from the list, just send an email to > lists at rebol.com with unsubscribe as the subject.
--
name: "José Antonio Meira da Rocha"
title: "Prof. MS."
occupation: "Consulting and training in print and online journalism"
googletalk: MSN: email: joseantoniorocha-gmail.com
site: http://meiradarocha.jor.br
ICQ: 658222
AIM: "meiradarochajor"
Skype: yahoo: "meiradarocha_jor"

 [5/9] from: edoconnor::gmail::com at: 28-Aug-2006 16:46


Hi Jose-- Parse can be very complex, and HTML is often poorly structured. Here are some general, non-industrial-strength tips. Use parse to find and/or isolate your HTML table, then use load/markup to filter out the tags, and you should be in pretty good shape.
>> page: read http://www.example.com
...
>> parse/all page [any [to "<table" copy table thru </table>] to end]
== true
>> table
== {<table><tr><td>row 1 col 1</td><td>row 1 col 2</td></tr><tr><td>row 2 col 1</td><td>row 2 col 2</td></tr></table>}
Then use load/markup to iterate through the block and remove the tags. I think there's a one-liner from Carl on REBOL.com that shows the simple tag-stripping technique.
>> loaded-table: load/markup table
== [<table> <tr> <td> "row 1 col 1" </td> <td> "row 1 col 2" </td> </tr> <tr> <td> "row 2 col 1" </td> <td> "row 2 col 2" </td> </t...
>> remove-each item loaded-table [tag? item]
== ["row 1 col 1" "row 1 col 2" "row 2 col 1" "row 2 col 2"]
This approach is reasonable for quick-and-dirty text extraction. To get to well-defined nested structures, you'll want to use more advanced parse techniques, ideally after screening the document through HTML Tidy first. Ed
On 8/28/06, José Antonio <joseantoniorocha-gmail.com> wrote:

 [6/9] from: tim-johnsons:web at: 28-Aug-2006 14:21


* José Antonio <joseantoniorocha-gmail.com> [060828 12:18]:
> No prize for me. load/markup just made a huge block with the HTML page
> content. Not useful, as table elements are not nested in blocks.
'Should' be *very* useful :-) 'cuz I use it all the time. What you want to do is test datatypes for strings. Also, what is a bit counterintuitive is that you can test a tag for a substring, as in:
    find <table> "table"    ;; beginning of table, set some boolean: processing-table: true
or:
    find </table> "/table"  ;; end of table, set some boolean: processing-table: false
and then:
    ;; untested, incomplete code!
    foreach element load/markup some-document [
        ;; test for the closing tag first: </table> contains both "table"
        ;; and "/table", so the opening-tag test must exclude "/table"
        if all [tag? element find element "/table"] [processing-table: false]
        if all [tag? element find element "table" not find element "/table"] [
            processing-table: true
        ]
        if all [processing-table string? element] [
            do-something-with-string element
        ]
    ]
> I guess this can be accomplished with the parse function, but parse is
> very complex.
<<quoted lines omitted: 16>>
> > > name: "José Antonio Meira da Rocha" title: "Prof. MS."
> > > occupation: "Consulting and training in print and online journalism"
> > > googletalk: MSN: email: joseantoniorocha-gmail.com
> > > site: http://meiradarocha.jor.br
<<quoted lines omitted: 22>>
> To unsubscribe from the list, just send an email to > lists at rebol.com with unsubscribe as the subject.
-- Tim Johnson <tim-johnsons-web.com> http://www.alaska-internet-solutions.com

 [7/9] from: joseantoniorocha::gmail at: 28-Aug-2006 20:21


Thx, good tips. I will work on them. On 8/28/06, Tim Johnson <tim-johnsons-web.com> wrote:
> * José Antonio <joseantoniorocha-gmail.com> [060828 12:18]: > >
<<quoted lines omitted: 34>>
> > Has anyone already made a function that does this task, extracting table
> > content as, say, comma-separated values or nested blocks?
--
name: "José Antonio Meira da Rocha"
title: "Prof. MS."
occupation: "Consulting and training in print and online journalism"
googletalk: MSN: email: joseantoniorocha-gmail.com
site: http://meiradarocha.jor.br
ICQ: 658222
AIM: "meiradarochajor"
Skype: yahoo: "meiradarocha_jor"

 [8/9] from: anton::wilddsl::net::au at: 29-Aug-2006 14:14


Hi José,
Just a small tip: quick and dirty is better. Your code will be smaller and easier to maintain. I made some HTML extractors for weather, train timetables and TV programme guides. I found that every ~8 months or so they change the damned HTML layout, breaking my code in a hard-to-predict way. Parsing the whole HTML document properly, while interesting, does not make your code less susceptible to this problem, because they make changes like:
- nesting the table with the main content inside another table
- breaking the content into separate pages
- adding cells just for layout spacing
- adding markup to text, like <b>
- changing the titles of key fields which you are looking for
- etc. etc.
You need artificial intelligence to reliably handle all that! So it's really not worth it to start parsing at "<html>..."; you'll just make a huge parse rule which will be difficult to maintain. I use parse, not load/markup, by the way. I think I tried load/markup and found that it's too "correct", i.e. it can't handle messy HTML very well. (But it's been a long time; maybe I don't remember well.) If you have any troubles with parse, let us know. Regards, Anton.
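The landmark style Anton advocates can be sketched like this (page content and marker text here are hypothetical, REBOL 2 assumed): anchor the rule on a unique string near the data instead of parsing from <html> down:

```rebol
; Sketch: jump to a landmark near the data, then grab just that fragment.
; Layout changes elsewhere on the page do not break this rule.
page: {...<h2>Forecast</h2><table><tr><td>High: 21</td></tr></table>...}
forecast: none
parse/all page [
    thru "Forecast"            ; unique landmark close to the data
    thru "<table" thru ">"     ; start of the table that follows it
    copy forecast to "</table>"
    to end
]
probe forecast
; == {<tr><td>High: 21</td></tr>}
```

The rule never mentions anything before the landmark, so redesigns outside that small region leave it working.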

 [9/9] from: mike::yaunish::shaw::ca at: 29-Aug-2006 17:12


I have run into the same issue. My solution has been to use a small function called delim-extract, as shown below. I have found that if I can break the pages I am parsing into chunks first, then if something changes I am able to modify the delimiters I am using quite easily later. It's not pretty, but it works. Probably not terribly fast either.

delim-extract: func [
    "Returns a block of every string found that is surrounded by defined delimiters"
    source-str [string!] "Text string to extract from."
    left-delim [string!] "Text string delimiting the left side of the desired string."
    right-delim [string!] "Text string delimiting the right side of the desired string."
    /include-delimiters "Returned extractions will include the delimiters"
    /use-head "Head of string is used as left delimiter"
    /first "Return the first match found only"
    /local tags tag
] [
    tag: copy ""
    tags: make block! []
    if use-head [
        either include-delimiters [
            parse source-str [copy tag thru right-delim]
            insert head tag left-delim
        ][
            parse source-str [copy tag to right-delim]
        ]
        append tags tag
    ]
    either include-delimiters [
        parse source-str [
            some [
                thru left-delim copy tag to right-delim
                (append tags rejoin [left-delim tag right-delim])
            ]
        ]
    ][
        parse source-str [
            some [thru left-delim copy tag to right-delim (append tags tag)]
        ]
    ]
    either first [
        either 0 = length? tags [return none] [return tags/1]
    ][
        return tags
    ]
]

test-extract: func [] [
    page: read http://www.rebol.com
    title: delim-extract/first page "<title>" "</title>"
    print ["The title is = " title]
    cgi-str: "Query=REBOL&SearchView=VERBOSE&MaxResults=10&Sort=1"
    cgi-variable-names: delim-extract/use-head cgi-str "&" "="
    print ["cgi-variable-names = " cgi-variable-names]
]
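The same chunk-first idea gets all the way to the comma-separated output asked about at the top of the thread. A rough sketch, not from the thread (table-to-csv is a hypothetical helper; it assumes simple, well-formed markup with no nested tables):

```rebol
; Sketch: one CSV line per <tr>, one field per <td>
table-to-csv: func [html [string!] /local csv row cell line] [
    csv: copy ""
    parse/all html [
        some [
            thru "<tr" thru ">" copy row to "</tr>" (
                line: copy ""
                parse/all row [
                    some [
                        thru "<td" thru ">" copy cell to "</td>"
                        (repend line [cell ","])
                    ]
                    to end
                ]
                if not empty? line [remove back tail line]  ; drop trailing comma
                repend csv [line newline]
            )
        ]
        to end
    ]
    csv
]

print table-to-csv {<table><tr><td>a</td><td>b</td></tr><tr><td>c</td><td>d</td></tr></table>}
; a,b
; c,d
```

Cell contents that themselves contain commas or markup would need escaping, which this sketch deliberately ignores.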

Notes
  • Quoted lines have been omitted from some messages.