World: r3wp

[Power Mezz] Discussions of the Power Mezz

Kaj
18-Dec-2010
[85]
Use "5.10 Parse HTML text into a tree" instead
PatrickP61
18-Dec-2010
[86]
Thank you Kaj.  I'll check that out!
Kaj
18-Dec-2010
[87]
There's no usage documentation, though, only code documentation
Kaj
19-Dec-2010
[88]
There's also "7.4 [X][HT]ML Parser" so it's not clear to me which 
one is preferred
Oldes
19-Dec-2010
[89x2]
To be honest, if you just want to parse some HTML page to get some 
parts of it, you don't need to use the Power Mezz at all. I've been using 
REBOL for more than 10 years and still consider the Power Mezz too complex 
for me. If you are a REBOL newbie, it's better to start by reading the 
REBOL docs; in your case, something about parsing.
There are a lot of pages about 'parse' on the net... for example this 
one: http://www.rebol.com/docs/core23/rebolcore-15.html
PatrickP61
19-Dec-2010
[91]
Thanks Oldes -- Will check into that
Kaj
19-Dec-2010
[92x2]
Yeah, there's some tipping point in parsing web pages and such. When 
the pages are consistent and the data you want to scrape is simple, 
I use PARSE, too, or even just string processing
But when the HTML and the data become more complex, there are so 
many exceptions you have to program that a real HTML parser becomes 
more convenient
Henrik
19-Dec-2010
[94x2]
would a real HTML parser convert the data to a REBOL object?
hmm... not sure that is possible.
Kaj
19-Dec-2010
[96]
I don't know what the two PowerMezz ones do, but I figure they just 
produce blocks. It's static data, so no need for bindings
Henrik
19-Dec-2010
[97x2]
a good one would be to convert R3 rich text to HTML and vice versa.
but that is of course not related to parsing...
Kaj
19-Dec-2010
[99]
Yeah, I think someone will have to do that :-)
Anton
20-Dec-2010
[100x2]
Kaj, I think it's the other way around! I found that when the HTML and 
the data become more complex, a simpler "hack" parse job is 
more likely to survive changes to the source. This happened to me 
several times with a weather forecast and television guide scraper 
etc. that I made (and remade, and remade..).
(back when I used to care about television, that is)
Kaj
20-Dec-2010
[102x2]
That's true when you have to write the parser yourself, but I'm assuming 
the PowerMezz parsers handle all of HTML :-)
Also, writing your own is probably not as much within reach for novice programmers
Oldes
20-Dec-2010
[104x2]
I was using REBOL for datamining a few years ago and I can say it 
was easier to do string-based parsing to get what I needed.
It's always easier to do:  parse html [thru "<title>" copy title 
to "<"]  than to parse the complete HTML into something like a block 
structure and dig the title out of it.
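To make that one-liner runnable end to end, a minimal REBOL 2 sketch (the 
URL is only a placeholder) looks like this:

    page: read http://www.rebol.com                      ; fetch the page as a string
    title: none
    parse page [thru "<title>" copy title to "<" to end]
    print title                                          ; the page title, if one was found

The trailing TO END only lets the parse run to completion; TITLE is captured 
either way.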
Kaj
20-Dec-2010
[106]
For you, but my business partner wants to scrape web pages, and I 
don't think he would understand how to do it with parse
Oldes
20-Dec-2010
[107]
I believe that if he wouldn't understand simple parse, then he wouldn't 
understand the PowerMezz either, but maybe I'm wrong. Also, it very much 
depends on what page you parse.
Kaj
20-Dec-2010
[108]
Scraping a title is the simplest example. In reality, you get all 
sorts of tags with extra attributes that you need to skip, and values 
with extraneous newlines. He wouldn't understand how to normalise 
that, so his data would be left as a mess
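A hedged sketch of that extra handling, still with plain string parsing 
(PAGE is assumed to hold the HTML as in the earlier example; the tag may 
carry attributes and the value may contain stray newlines):

    title: none
    parse page [thru "<title" thru ">" copy title to "</title>" to end]
    if title [trim/lines title]       ; collapse newlines and repeated spaces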
Oldes
20-Dec-2010
[109]
Again... it's easier to do manual changes to parse malformed pages 
if you do string parsing where you don't care about 99% of the page 
content.
Kaj
20-Dec-2010
[110]
We did a simple address list as an example, even preconverting it 
to plain text. It took all afternoon to discover all exceptions and 
fix them, so in most cases, it isn't worth it
Maxim
20-Dec-2010
[111x3]
I've done quite a few HTML analysers, and with a bit of experience 
I have found that going with a brute-force parse on pages ends up 
being very effective.  I did a font downloading script for www.acidfonts.com 
a few years ago and it took more time to actually download all the 
pages than to build the script.  :-)
Some servers are anti-indexing, and it's in these cases that the brute 
parse is most effective.  I've even had to cater to an Oracle web server 
which didn't have ANY css, type, or id fields in any of its pages, which 
are all driven by forms.  All URLs can only be read once, and every page 
read is a redirect.   


Only parse allowed me to cope in such a drastic anti-robot environment. 
 It still took a day to build the robot, and in the end it even 
had an auto-user-creation step every 200 pages which created a 
Google Gmail account for the next batch.  :-)

In those cases, parse is king.
A fellow REBOL parse data miner tells me that some servers have pretty 
good algorithms which identify robots out of the thousands of requests 
they get, and you even have to put random-length pauses between reads, 
which can go up to a few hours.
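A minimal sketch of that throttling idea in REBOL 2 (the URLs and the 
ten-minute cap are just placeholders):

    foreach url [http://example.com/page1 http://example.com/page2] [
        wait random 0:10:00           ; pause a random time, up to ten minutes
        page: read url
        ; ... scrape PAGE here ...
    ]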
Kaj
20-Dec-2010
[114]
I don't see my partner doing that :-)
Maxim
20-Dec-2010
[115]
that's why he has you  ;-)
Kaj
20-Dec-2010
[116x2]
I'm not going to do it for him, either, thank you very much
So if he can get by with an HTML parser, that would be great for 
him
Gabriele
21-Dec-2010
[118x10]
About examples: if you download the repository, or the source archive, 
you'll find a test directory which has the test suite for (almost) 
all the modules. The tests are often usable as examples of how 
to use the modules themselves.
About the HTML parser:


This started out because we had a need for an HTML filter. The mail 
client in Qtask allows users to view HTML emails; furthermore, the 
wiki editor submits HTML to Qtask. Whenever you are embedding HTML 
from an external (untrusted) source within your HTML, you have security 
problems. For this reason, we had to take the untrusted HTML and 
1) "filter" it so that it would be safe, and 2) make it embeddable (eg. 
only take what's inside <body>, and not the whole <html> etc.).
This *had* to work with *any* HTML (think about the stuff that comes 
out of Outlook, or the stuff you get as spam, or newsletters, and 
all that). You can't imagine how bad that can be. That had to be 
turned into something that would not break our own HTML pages.
My approach, instead of doing what many others do (trying to remove 
things from the HTML that are known to be "bad", eg. using regexps 
to remove anything that starts with "javascript:" or anything between 
<script>...</script> etc.), was to only pass what was known to be 
good, and ignore everything else. This is a bit more limiting, but 
I consider it to be safer (you don't have to play a game with the attacker 
where every time they find a new vector, you have to add it to the 
"bad" list).
So the first HTML filter was done. It parsed the HTML (any version), 
and going through two finite state machines in a pipeline, rebuilt 
it as XHTML 1.0 strict.
This method avoided keeping any intermediate representations in memory. 
However, because of that there were a number of things it could not 
do (eg. no look ahead, and you get an explosion of the number of 
states if you want to "look behind" more).
So, as our needs became more complex (esp. because of the initial, 
never released version of the wiki editor), I had to change approach. 
Also, at that time Maarten was doing the S3 stuff and needed an XML 
parser as well.


So, first, the Filter was split up into three modules. The first 
is the parser, which takes an HTML or XML string and just sends "events" 
to a callback function. This can be used basically for anything. 
(Maarten never used it in the end.) The second part was the first 
FSM, the HTML normalizer. You'll still find it within the Power Mezz, 
but it's deprecated. The third part was the actual filter and regenerator 
(second FSM). You can find it in the repository history.
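The "events to a callback" shape can be illustrated with a small hedged 
sketch (this is not ml-parser.r's actual API -- see its tests for that -- 
just the general idea, built here on the native LOAD/MARKUP):

    emit-events: func [html [string!] callback [any-function!]] [
        foreach item load/markup html [
            either tag? item [
                callback 'tag item        ; one event per tag
            ][
                callback 'text item       ; one event per run of text
            ]
        ]
    ]

    emit-events "<b>hi</b> there" func [kind value] [print [kind mold value]]
    ; tag <b>
    ; text "hi"
    ; tag </b>
    ; text " there"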
Then, the latter two modules were replaced by a different approach 
that builds a tree of blocks from the HTML and rewrites it as it 
is being built (to avoid doing too many passes). This is done by 
LOAD-HTML, which allows passing a set of rules used for filtering 
(so the actual filter is now a bunch of rules for LOAD-HTML). LOAD-HTML 
handles a lot of HTML's weird cases; it's probably not at the level 
of a web browser, but it comes close.
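A usage sketch, with the caveat that the calling convention and the shape 
of the returned tree are assumptions here and should be checked against 
the tests in the repository:

    ; assumption: LOAD-HTML has already been loaded from the Power Mezz and
    ; takes the HTML string, returning the tree of blocks described above
    page: read http://www.rebol.com
    tree: load-html page
    probe tree                        ; inspect the block structure it builds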
The tree is being built with the Niwashi module, which was separated 
as a generic way to build trees incrementally following rules etc. 
(Niwashi means gardener in Japanese)
The HTML to text module has still not been rewritten to use LOAD-HTML 
instead of the older approach of the HTML normalizer followed by 
an FSM.
Kaj
21-Dec-2010
[128x2]
Thanks for the clarification
So the parser in 5.10 is the newest one? But where does the parser 
in 7.4 fit in?
Gabriele
22-Dec-2010
[130x4]
7.4 parses a string into a sequence of tags and text (etc.). (it 
also has a load-markup function that is similar to load/markup but 
also parses tag attributes and so on). 5.10 uses 7.4 and builds a 
tree from that sequence of tags and text.
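For comparison, the native LOAD/MARKUP only splits the string into tag! 
and string! values, leaving attributes unparsed inside each tag; per the 
description above, the Power Mezz load-markup additionally breaks those 
attributes out:

    probe load/markup {<a href="x.html">link</a> and text}
    ; == [<a href="x.html"> "link" </a> " and text"]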
(I never got around to changing wetan to show module dependencies. 
If you look at the script header, though, you'll see that load-html.r 
depends on ml-parser.r.)
http://www.rebol.it/power-mezz/mezz/load-html.r
The code uses a number of tricks to be "fast" (esp. expand-macros.r), 
so it's not as clean as it could be.
Kaj
22-Dec-2010
[134]
Thanks