
[REBOL] Re: How to extract content of HTML table?

From: anton::wilddsl::net::au at: 29-Aug-2006 14:14

Hi José,

Just a small tip: quick and dirty is better. Your code will be smaller and easier to maintain.

I made some HTML extractors for weather, train timetables and TV programme guides. I found that every ~8 months or so they change the damned html layout, breaking my code in a hard-to-predict way. Parsing the whole html document properly, while interesting, does not make your code less susceptible to this problem, because they make changes like:

- nesting the table with the main content inside another table
- breaking the content into separate pages
- adding cells just for layout spacing
- adding markup to text like <b>
- changing the titles of key fields which you are looking for

etc. etc. You need artificial intelligence to reliably handle all that!

So it's really not worth it to start parsing at "<html>..." You'll just make a huge parse rule which will be difficult to maintain.

I use parse, not load/markup, by the way. I think I tried load/markup and found that it's too "correct", i.e. it can't handle messy html very well. (But it's been a long time, maybe I don't remember well.)

If you have any troubles with parse, let us know.

Regards,
Anton.
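
For illustration, here is a minimal sketch of the quick-and-dirty style described above, assuming REBOL 2 and a page whose values sit in plain <td> cells. The sample string and the words html, cells and cell are made up for the example; this is not one of the actual extractors mentioned in the message. It never looks at "<html>" or the table structure, it just skips to each cell and copies the text.

```rebol
REBOL []

; Made-up sample data; a real page would normally be fetched with READ.
html: {<table><tr><td>Mon</td><td class="x">Sunny</td></tr>
<tr><td>Tue</td><td>Rain</td></tr></table>}

cells: copy []

; /all makes PARSE treat whitespace as ordinary characters (REBOL 2).
parse/all html [
    any [
        thru "<td" thru ">"      ; skip past the next opening <td ...> tag
        copy cell to "</td>"     ; copy the text up to the closing tag
        (append cells cell)      ; collect it
    ]
    to end                       ; ignore whatever follows the last cell
]

probe cells
```

A rule this small is easy to adjust when the layout changes, for example by first skipping to a known title with thru before collecting cells. Note that in REBOL 2 an empty cell leaves cell set to none, so real code may want to check for that before using the value.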