
World: r3wp

[Power Mezz] Discussions of the Power Mezz

Kaj
20-Dec-2010
[102x2]
That's true when you have to write the parser yourself, but I'm assuming 
the PowerMezz parsers handle all of HTML :-)
Also, it's probably not as much within reach for novice programmers
Oldes
20-Dec-2010
[104x2]
I was using REBOL for data mining a few years ago and I can say it 
was easier to do string-based parsing to get what I needed.
It's always easier to do:  parse html [thru "<title>" copy title 
to "<"]  than to parse the complete html into something like a block 
structure and dig the title out of it.
Kaj
20-Dec-2010
[106]
For you, but my business partner wants to scrape web pages, and I 
don't think he would understand how to do it with parse
Oldes
20-Dec-2010
[107]
I believe that if he would not understand simple parse, then he would 
not understand PowerMezz either, but maybe I'm wrong. Also, it depends 
very much on which page you parse.
Kaj
20-Dec-2010
[108]
Scraping a title is the simplest example. In reality, you get all 
sorts of tags with extra attributes that you need to skip, and values 
with extraneous newlines. He wouldn't understand how to normalise 
that, so his data would be left as a mess
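A minimal sketch of the kind of normalisation described above: skipping tag attributes and collapsing stray newlines while still scraping with plain string searches. It is written in Python rather than REBOL purely for illustration; the function name scrape_between and the sample page are assumptions, not anything from this thread or from Power Mezz.

```python
def scrape_between(html, tag_name, close_tag):
    """Return the normalised text of the first <tag_name ...> element, or None."""
    start = html.find("<" + tag_name)
    if start == -1:
        return None
    # Skip any attributes by jumping to the end of the opening tag.
    body_start = html.find(">", start)
    if body_start == -1:
        return None
    end = html.find(close_tag, body_start)
    if end == -1:
        return None
    raw = html[body_start + 1:end]
    # Collapse newlines and runs of whitespace into single spaces.
    return " ".join(raw.split())


# Hypothetical input with an attribute and extraneous newlines.
page = '<html><head><title\n lang="en">  Address\n   list </title></head></html>'
title = scrape_between(page, "title", "</title>")
print(title)  # Address list
```

The sketch is deliberately naive (for instance, "<title" would also match a tag whose name merely starts with "title"), which is the sort of exception that has to be discovered page by page.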
Oldes
20-Dec-2010
[109]
Again... it's easier to do manual changes to parse malformed pages 
if you do string parsing where you don't care about 99% of the page 
content.
Kaj
20-Dec-2010
[110]
We did a simple address list as an example, even preconverting it 
to plain text. It took all afternoon to discover all exceptions and 
fix them, so in most cases, it isn't worth it
Maxim
20-Dec-2010
[111x3]
I've done quite a few html analysers and with a bit of experience 
I have found that going with a brute force parse on pages ends up 
being very effective.  I did a font downloading script for www.acidfonts.com 
a few years ago and it took more time to actually download all the 
pages than build the script.  :-)
some servers are anti-indexing and it's in these cases where the brute 
parse is most effective.  I've even had to cater to an Oracle web server 
which didn't have ANY css, type, or id fields in any of its pages, which 
are all driven by forms.  all urls can only be read once, and every page 
read is a redirect.   


only parse allowed me to cope in such a drastic anti-robot environment. 
it still took a day to build the robot.  and in the end, it even 
had an auto-user-creation step every 200 pages which created a 
google gmail account for the next batch.  :-)

in those cases, parse is king.
a fellow REBOL parse data miner tells me that some servers have pretty 
good algorithms which identify robots out of the thousands of requests 
they get, and you even have to put random-length pauses between reads 
which can go up to a few hours.
Kaj
20-Dec-2010
[114]
I don't see my partner doing that :-)
Maxim
20-Dec-2010
[115]
that's why he has you  ;-)
Kaj
20-Dec-2010
[116x2]
I'm not going to do it for him, either, thank you very much
So if he can get by with an HTML parser, that would be great for 
him
Gabriele
21-Dec-2010
[118x10]
About examples: if you download the repository, or the source archive, 
you'll find a test directory which has the test suite for (almost) 
all the modules. the tests are often usable as "examples" for how 
to use the modules themselves.
About the HTML parser:


This started out because we had a need for an HTML filter. The mail 
client in Qtask allows users to view HTML emails; furthermore, the 
wiki editor submits HTML to Qtask. Whenever you are embedding HTML 
from an external (untrusted) source within your HTML, you have security 
problems. For this reason, we had to take the untrusted HTML and 
1) "filter" it so that it would be safe 2) make it embeddable (eg. 
only take what's inside <body>, and not the whole <html> etc.).
This *had* to work with *any* HTML (think about the stuff that comes 
out of Outlook, or the stuff you get as spam, or newsletters, and 
all that). You can't imagine how bad that can be. That had to be 
turned into something that would not break our own HTML pages.
My approach, instead of doing what many others do (try to remove 
things from the HTML that are known to be "bad", eg. use regexps 
to remove anything that starts with "javascript:" or anything between 
<script>...</script> etc.), was to only pass what was known to be 
good, and ignore everything else. This is a bit more limiting but 
I consider it to be safer (you don't have to play a game with attackers 
where every time they find a new vector, you have to add it to the 
"bad" list).
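The "only pass what is known to be good" idea can be sketched roughly as follows. This is Python using the standard html.parser module, not the Qtask or Power Mezz filter; the ALLOWED_TAGS set, the class name and filter_html are hypothetical choices made for the example, and the sketch does not repair unbalanced tags the way the real filter does.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist, chosen only for illustration.
ALLOWED_TAGS = {"p", "b", "i", "em", "strong", "ul", "ol", "li"}


class WhitelistFilter(HTMLParser):
    """Re-emit only whitelisted tags (without attributes) and escaped text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
        elif tag in ALLOWED_TAGS and not self.skip_depth:
            self.out.append("<%s>" % tag)  # all attributes are dropped

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in ALLOWED_TAGS and not self.skip_depth:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data))  # text is always re-escaped


def filter_html(untrusted):
    f = WhitelistFilter()
    f.feed(untrusted)
    f.close()
    return "".join(f.out)


dirty = '<p onclick="x()">Hi <a href="javascript:evil()">there</a><script>alert(1)</script></p>'
print(filter_html(dirty))  # <p>Hi there</p>
```

Nothing in the sketch looks for "javascript:" or any other known attack; unknown tags and all attributes simply never reach the output, so a new attack vector does not require a new rule.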
So the first HTML filter was done. It parsed the HTML (any version), 
and going through two finite state machines in a pipeline, rebuilt 
it as XHTML 1.0 strict.
This method avoided keeping any intermediate representations in memory. 
However, because of that there were a number of things it could not 
do (eg. no look ahead, and you get an explosion of the number of 
states if you want to "look behind" more).
So, as our needs became more complex (esp. because of the initial, 
never released version of the wiki editor), I had to change approach. 
Also, at that time Maarten was doing the S3 stuff and needed an XML 
parser as well.


So, first, the Filter was split up into three modules. The first 
is the parser, that takes an HTML or XML string and just sends "events" 
to a callback function. This can be used basically for anything. 
(Maarten never used it in the end.) The second part was the first 
FSM, the HTML normalizer. You'll still find it within the Power Mezz, 
but it's deprecated. The third part was the actual filter and regenerator 
(second FSM). You can find it in the repository history.
Then, the latter two modules were replaced by a different approach 
that builds a tree of blocks from the HTML and rewrites it as it 
is being built (to avoid doing too many passes). This is done by 
LOAD-HTML, that allows passing a set of rules used for filtering 
(so the actual filter is now a bunch of rules for LOAD-HTML). LOAD-HTML 
handles a lot of weird HTML cases; it's probably not at the level 
of a web browser, but it comes close.
The tree is being built with the Niwashi module, which was separated 
as a generic way to build trees incrementally following rules etc. 
(Niwashi means gardener in Japanese)
The HTML to text module has still not been rewritten to use LOAD-HTML 
instead of the older approach of the HTML normalizer followed by 
an FSM.
Kaj
21-Dec-2010
[128x2]
Thanks for the clarification
So the parser in 5.10 is the newest one? But where does the parser 
in 7.4 fit in?
Gabriele
22-Dec-2010
[130x4]
7.4 parses a string into a sequence of tags and text (etc.). (it 
also has a load-markup function that is similar to load/markup but 
also parses tag attributes and so on). 5.10 uses 7.4 and builds a 
tree from that sequence of tags and text.
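The two-stage design described here (first a flat sequence of tags and text, then a tree built from that sequence) can be sketched as follows. This is Python, not the Power Mezz code; the token format and the build_tree function are hypothetical, and the real LOAD-HTML applies many more rules while building.

```python
# Hypothetical token format: ("open", name), ("close", name), ("text", value).
def build_tree(tokens):
    """Build nested [name, children] nodes from a flat token sequence."""
    root = ["root", []]
    stack = [root]  # path of currently open nodes, innermost last
    for kind, value in tokens:
        if kind == "open":
            node = [value, []]
            stack[-1][1].append(node)
            stack.append(node)
        elif kind == "close":
            # Pop back to the matching open tag; ignore stray close tags.
            open_names = [n[0] for n in stack[1:]]
            if value in open_names:
                while stack.pop()[0] != value:
                    pass
        else:
            # Text nodes are stored as ["text", value].
            stack[-1][1].append(["text", value])
    return root


# <p>A <b>bold</p> : the <b> is never closed explicitly.
tokens = [("open", "p"), ("text", "A "), ("open", "b"),
          ("text", "bold"), ("close", "p")]
tree = build_tree(tokens)
print(tree)
```

Closing "p" here also closes the still-open "b", which is one small example of tolerating malformed input at the tree-building stage.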
(i never got around to changing wetan to show module dependencies. 
if you look at the script header though, you'll see that load-html.r 
depends on ml-parser.r)
http://www.rebol.it/power-mezz/mezz/load-html.r
the code is using a number of tricks to be "fast" (esp. expand-macros.r), 
so it's not as clean as it could be.
Kaj
22-Dec-2010
[134]
Thanks
Janko
29-Apr-2011
[135x2]
Hi, first thanks for making and open sourcing power-mezz. 


I am trying to use load-html and am getting some strange results. 
It seems it makes, for example, a recursive [ html [ html [ html  ..... 
]]] on my simple html input (and on a real one that I tried). I prepared 
two examples to make the point as clear as possible.


http://paste.factorcode.org/paste?id=2263 (notice the stack overflow 
error)
I added 2 more cases to the paste (2 annotations). load-html seems 
quite complex since it uses many other modules (that I don't understand 
either).. so I'd rather see if you find something obvious in my approach 
or the bug in power-mezz
Gabriele
30-Apr-2011
[137x11]
First: you only need to import %mezz/load-html.r in your examples. 
You're not using the other modules; they will be loaded automatically 
by load-html.r - you never need to worry about dependencies.
Second: your problem is that you are trying to mold the result, which 
is a tree where each node has a reference to the parent node. (much 
like faces in R2). That's why you see the "loop".
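Why parent references make a generic dump "loop" can be shown with a small sketch. It is Python rather than REBOL, and the Node class, naive_dump and tree_dump are hypothetical names for the example; they are not the Power Mezz tree functions.

```python
class Node:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent  # back-reference: this creates a cycle
        self.children = []
        if parent is not None:
            parent.children.append(self)


def naive_dump(value, depth=0, limit=50):
    """Follow every field, the way a generic serializer would."""
    if depth > limit:
        raise RecursionError("cycle: child -> parent -> child ...")
    if isinstance(value, Node):
        return [value.name,
                naive_dump(value.parent, depth + 1, limit),
                [naive_dump(c, depth + 1, limit) for c in value.children]]
    return value


def tree_dump(node):
    """Tree-aware dump: descend into children only, never into parent."""
    return [node.name, [tree_dump(c) for c in node.children]]


root = Node("html")
Node("head", root)
Node("body", root)

print(tree_dump(root))  # ['html', [['head', []], ['body', []]]]

try:
    naive_dump(root)
except RecursionError as err:
    print("naive dump failed:", err)
```

The generic dump walks from html to head, back up to html through the parent field, down to head again, and so on, which is the html [ html [ html ... nesting; a dump that knows the structure simply skips the parent field.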
there is a mold-tree function in %mezz/trees.r if you want to mold 
the tree. Or, you could simply use form-html to pretty print the 
tree for you.
Eg. for your first example:

t: load-html p
print mold-tree t


[root [] [html [] [head [] [title [] [text [value "t"]]]] [body [] 
[h2 [] [text [value "HEADING"]]] [p [] [text [value "first para"]]] 
[p [] [text [value "second para"]]]]]]

print form-html/with t [pretty?: yes]

<html>
    <head>
        <title>t</title>
        </head>
    <body>
        <h2>HEADING</h2>
        <p>first para</p>
        <p>second para</p>
        </body>
    </html>
(the pretty? option to form-html is something i only use for debugging, 
so it's not as pretty as it should be i guess)
You can also do things like:

>> mold-tree get-node t/childs/html/childs/head/childs/title
== {[title [] [text [value "t"]]]}
get-node and set-node are also from %mezz/trees.r ; most likely you 
don't want to mess around with %mezz/macros/trees.r , that is deep 
voodoo i use to make the html filter fast.
(if you have performance problems, we'll talk about it :)
other examples:


>> get-node t/childs/html/childs/head/childs/title/childs/text/prop/value 
== "t"


>> get-node t/childs/html/childs/body/childs/h2/childs/text/prop/value 
== "HEADING"
Also note that:


>> print form-html/with load-html "<p>A paragraph!" [pretty?: yes]
<html>
    <head>
        <title></title>
        </head>
    <body>
        <p>A paragraph!</p>
        </body>
    </html>
ie. load-html tries to cope with malformed input as much as possible.
Janko
30-Apr-2011
[148x3]
wow, thank you a lot! I knew this was too obvious a "bug" to be real 
and I was probably doing something wrong. GREAT!


I initially imported only needed modules but got errors .. ( I will 
try and report ) the errors went away as I manually imported them. 
Just a second
very good that you cope with bad html .. I will need that functionality 
because no html is perfect.
I was planning to use BeautifulSoup if you didn't, but since you 
do, that is even much better
Janko
1-May-2011
[151]
I tried now, the problem with import was that I didn't set the absolute 
path to load-module/from before.