World: r3wp
[Power Mezz] Discussions of the Power Mezz
florin 3-Oct-2010 [73] | Thanks. |
PatrickP61 15-Dec-2010 [74x3] | Hi Gabriele, I'm trying out your Power Mezz for the first time. Do you have any other documentation on how to set it up properly? Here is what I'm doing:
    power-mezz-path: to-path e:/Projects/PT/Rebol/power-mezz-built-1.0.0/
    print "Starting mezz/module.r"
    do power-module: to-url ajoin [power-mezz-path 'mezz/module.r]
    print "Returned mezz/module.r"
    load-module/from power-mezz-path module [
        imports: [%mezz/html-to-text.r]
    ]
which outputs:
    --> e:/Projects/PT/Rebol/power-mezz-built-1.0.0/
    --> Starting mezz/module.r
    ** Access Error: Invalid port spec: e:/Projects/PT/Rebol/power-mezz-built-1.0.0/mezz/module.r
    ** Near: do power-module: to-url ajoin [power-mezz-path 'mezz/module.r]
Any ideas on what I did wrong? |
I've got a meeting to run to -- will check back in a couple of hours :-) | |
Here now. Anyone have ideas on how a beginner can use Power Mezz? Also, what is the difference between power-mezz-1.0.0 and power-mezz-built-1.0.0? | |
PatrickP61 17-Dec-2010 [77] | Anyone have info on how to use Power-Mezz? |
Maxim 17-Dec-2010 [78x2] | I installed it yesterday, it worked pretty well for what I needed. |
do you need help on install or on what the actually mezz code does? | |
PatrickP61 17-Dec-2010 [80] | Hi Maxim, I'm still learning Rebol, but I'd like to see how I can use Power-Mezz. How do you install it? |
Gabriele 18-Dec-2010 [81] | Patrick, e:/Projects/... is not a valid REBOL file path. Try with something like:
    power-mezz-path: %/E/Projects/PT/Rebol/power-mezz-built-1.0.0/
    do power-mezz-path/mezz/module.r
    load-module/from power-mezz-path
    ; etc. |
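For reference, an OS path string can also be converted to REBOL's file! notation programmatically with to-rebol-file; a minimal sketch (REBOL 2):
    ; convert a Windows path string into a REBOL file! value
    ; (backslash is not an escape character in REBOL strings)
    power-mezz-path: to-rebol-file "E:\Projects\PT\Rebol\power-mezz-built-1.0.0\"
    probe power-mezz-path    ; == %/E/Projects/PT/Rebol/power-mezz-built-1.0.0/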
PatrickP61 18-Dec-2010 [82x2] | Oops -- Didn't see the malformed file path!!! Where can I find examples of how to use Power Mezz? |
The particular script I am writing is called GET ADDRESS. It takes a CSV file called CONTACTS, which has the first and last name, city, and state of friends I'd like to send Christmas cards to but whose addresses I have forgotten or misplaced. So far, the script takes each entry and sends it to SUPERPAGES.com, and the HTML sent back contains the information. Right now, I'm simply saving the HTML as a file for each entry in my CSV. What I would like to do is parse the HTML and extract the address lines, zip code, phone number, etc. But I admit that parsing through HTML is daunting to me. So after looking around on the internet, I discovered HTML-TO-TEXT in your Power Mezz. That is where I am now, trying to figure it out and see how it works. I've read some of your documentation, but I admit I am still in the dark as to how it works -- at least for my application. Any advice you have is welcome. Thanks in advance. | |
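For reference, reading such a CSV in REBOL 2 can be as simple as the sketch below (the file name and column order are assumptions):
    ; read the contacts file line by line
    ; assumed columns: first name, last name, city, state
    foreach line read/lines %contacts.csv [
        set [first-name last-name city state] parse/all line ","
        print [first-name last-name "-" city "," state]
    ]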
Kaj 18-Dec-2010 [84x2] | I don't think that's a good function to use for that. It seems to me it's meant for making readable text, not processable text |
Use "5.10 Parse HTML text into a tree" instead | |
PatrickP61 18-Dec-2010 [86] | Thank you Kaj. I'll check that out! |
Kaj 18-Dec-2010 [87] | There's no usage documentation, though, only code documentation |
Kaj 19-Dec-2010 [88] | There's also "7.4 [X][HT]ML Parser" so it's not clear to me which one is preferred |
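Neither message shows an actual call; as a purely hypothetical sketch of what using a tree parser would look like (the module file name and the parse-html function name are assumed, not confirmed in this thread -- check the Power Mezz documentation for the real exports):
    do %power-mezz/mezz/module.r          ; load the module system first
    load-module/from %power-mezz/
    module [imports: [%mezz/html-to-tree.r]] [
        ; NOTE: %mezz/html-to-tree.r and parse-html are assumed names
        tree: parse-html read http://www.example.com
        probe tree    ; a block structure representing the document
    ]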
Oldes 19-Dec-2010 [89x2] | To be honest, if you just want to parse some HTML page to get some parts of it, you don't need to use Power Mezz at all. I've been using REBOL for more than 10 years and still consider Power Mezz too complex for me. If you are a REBOL newbie, it's better to start by reading the REBOL docs -- in your case, something about parsing. |
There are a lot of pages about 'parse' on the net... for example this one: http://www.rebol.com/docs/core23/rebolcore-15.html | |
PatrickP61 19-Dec-2010 [91] | Thanks Oldes -- Will check into that |
Kaj 19-Dec-2010 [92x2] | Yeah, there's some tipping point in parsing web pages and such. When the pages are consistent and the data you want to scrape is simple, I use PARSE, too, or even just string processing |
But when the HTML and the data become more complex, there are so many exceptions you have to program that a real HTML parser becomes more convenient | |
Henrik 19-Dec-2010 [94x2] | would a real HTML parser convert the data to a REBOL object? |
hmm... not sure that is possible. | |
Kaj 19-Dec-2010 [96] | I don't know what the two PowerMezz ones do, but I figure they just produce blocks. It's static data, so no need for bindings |
Henrik 19-Dec-2010 [97x2] | a good one would be to convert R3 rich text to HTML and vice versa. |
but that is of course not related to parsing... | |
Kaj 19-Dec-2010 [99] | Yeah, I think someone will have to do that :-) |
Anton 20-Dec-2010 [100x2] | Kaj, I think it's the other way around! I found that when the HTML and the data become more complex, a simpler "hack" parse job is more likely to survive changes to the source. This happened to me several times with a weather forecast scraper, a television guide scraper, etc. that I made (and remade, and remade..). |
(back when I used to care about television, that is) | |
Kaj 20-Dec-2010 [102x2] | That's true when you have to write the parser yourself, but I'm assuming the PowerMezz parsers handle all of HTML :-) |
Also, it's probably not as much within reach for novice programmers | |
Oldes 20-Dec-2010 [104x2] | I was using REBOL for datamining a few years ago, and I can say it was easier to do string-based parsing to get what I needed. |
It's always easier to do: parse html [thru "<title>" copy title to "<"] than to parse the complete HTML into something like a block structure and dig the title out of it. | |
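Oldes's one-liner, expanded into a runnable sketch (the URL is only an example):
    ; fetch a page and extract the contents of its <title> tag
    html: read http://www.rebol.com
    title: none
    parse html [thru "<title>" copy title to "<" to end]
    print title    ; the page title, if a <title> tag was found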
Kaj 20-Dec-2010 [106] | For you, but my business partner wants to scrape web pages, and I don't think he would understand how to do it with parse |
Oldes 20-Dec-2010 [107] | I believe that if he wouldn't understand simple parse, then he wouldn't understand Power Mezz either, but maybe I'm wrong. It also depends a lot on what page you're parsing. |
Kaj 20-Dec-2010 [108] | Scraping a title is the simplest example. In reality, you get all sorts of tags with extra attributes that you need to skip, and values with extraneous newlines. He wouldn't understand how to normalise that, so his data would be left as a mess |
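For the extraneous-newlines part specifically, REBOL's trim/lines does that normalisation; a minimal sketch:
    ; collapse newlines and runs of whitespace in a scraped value
    value: "  123 Main Street^/   Springfield  "
    print trim/lines value    ; == "123 Main Street Springfield"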
Oldes 20-Dec-2010 [109] | Again... it's easier to make manual changes to a parser for malformed pages if you do string parsing, where you don't care about 99% of the page content. |
Kaj 20-Dec-2010 [110] | We did a simple address list as an example, even preconverting it to plain text. It took all afternoon to discover all exceptions and fix them, so in most cases, it isn't worth it |
Maxim 20-Dec-2010 [111x3] | I've done quite a few HTML analysers, and with a bit of experience I have found that going with a brute-force parse on pages ends up being very effective. I did a font-downloading script for www.acidfonts.com a few years ago, and it took more time to actually download all the pages than to build the script. :-) |
Some servers are anti-indexing, and it's in these cases that the brute parse is most effective. I've even had to cater to an Oracle web server which didn't have ANY css, type, or id fields in any of its pages, which were all driven by forms. All URLs could only be read once, and every page read was a redirect. Only parse allowed me to cope in such a drastic anti-robot environment. It still took a day to build the robot, and in the end it even had an auto-user-creation step every 200 pages, which created a Google Gmail account for the next batch. :-) In those cases, parse is king. | |
A fellow REBOL parse data miner tells me that some servers have pretty good algorithms which identify robots out of the thousands of requests they get, and you even have to put random-length pauses between reads, which can go up to a few hours. | |
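A minimal sketch of such a pause between reads (the bounds are made up):
    ; wait a random number of seconds between requests to look less robot-like
    random/seed now
    wait 5 + random 55    ; somewhere between 6 and 60 seconds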
Kaj 20-Dec-2010 [114] | I don't see my partner doing that :-) |
Maxim 20-Dec-2010 [115] | that's why he has you ;-) |
Kaj 20-Dec-2010 [116x2] | I'm not going to do it for him, either, thank you very much |
So if he can get by with an HTML parser, that would be great for him | |
Gabriele 21-Dec-2010 [118x5] | About examples: if you download the repository, or the source archive, you'll find a test directory which has the test suite for (almost) all the modules. The tests are often usable as "examples" of how to use the modules themselves. |
About the HTML parser: this started out because we had a need for an HTML filter. The mail client in Qtask allows users to view HTML emails; furthermore, the wiki editor submits HTML to Qtask. Whenever you embed HTML from an external (untrusted) source within your own HTML, you have security problems. For this reason, we had to take the untrusted HTML and 1) "filter" it so that it would be safe, and 2) make it embeddable (eg. only take what's inside <body>, and not the whole <html>, etc.). |
This *had* to work with *any* HTML (think about the stuff that comes out of Outlook, or the stuff you get as spam, or newsletters, and all that). You can't imagine how bad that can be. That had to be turned into something that would not break our own HTML pages. | |
My approach, instead of doing what many others do (try to remove things from the HTML that are known to be "bad", eg. use regexps to remove anything that starts with "javascript:" or anything between <script>...</script>, etc.), was to only pass what was known to be good, and ignore everything else. This is a bit more limiting, but I consider it safer (you don't have to play a game with the attacker where every time they find a new vector, you have to add it to the "bad" list). |
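A toy illustration of the whitelist idea -- this is NOT the actual Power Mezz filter, which is far more thorough (for one thing, a real filter must also drop the *content* of disallowed containers like <script>):
    ; keep only tags whose name is in a known-good set; pass text through
    allowed: ["b" "/b" "i" "/i" "p" "/p"]
    filter-html: func [html /local out tag txt] [
        out: copy ""
        parse/all html [
            any [
                ; a tag: keep it only if it is whitelisted, drop it otherwise
                "<" copy tag to ">" skip (
                    if find allowed tag [append out rejoin ["<" tag ">"]]
                )
                ; plain text up to the next tag: always kept
                | copy txt [skip to "<"] (append out txt)
            ]
            ; trailing text after the last tag
            copy txt to end (if txt [append out txt])
        ]
        out
    ]
    print filter-html {<p>Hello <script>alert(1)</script><b>world</b></p>}
    ; == "<p>Hello alert(1)<b>world</b></p>"  (script tags gone, content kept)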
So the first HTML filter was done. It parsed the HTML (any version) and, passing it through two finite state machines in a pipeline, rebuilt it as XHTML 1.0 Strict. |