Mailing List Archive: 49091 messages
  • Home
  • Script library
  • AltME Archive
  • Mailing list
  • Articles Index
  • Site search
 

parse vs. load/markup (a bit long)

 [1/6] from: hallvard::ystad::helpinhand::com at: 24-Jan-2002 12:57


Hello folks, I have a script that loads a given HTML page, parses it, and triggers different actions for the different tags it comes across. For the parsing, I have a parse rule that I invoke like this (those of you who don't bother reading all the code, skip to the bottom to see some figures): parse/all site-content [ copy txt to "<" (do_txt) some html-code copy txt to end (do_txt) ] Here's the 'html-code rule: ---code start--- html-code: [ copy tag [ "<script" thru "</script>" | ;"<script" thru "</script>" (append?: false) | ;["<style" whitespace thru "</style>"] (append?: false) | copy comm [ "<!--" thru "-->" ] | [ "<title>" copy title [to "</title"] thru ">" ] ( do_title append?: false) | copy base [ "<base" thru ">"] (do_base) | [ "<tr" thru ">"] (do_tr) | [ "<pop-from-stack" thru ">"] (pop-from-stack) | [ "</tr" thru ">"] (do_tr_end) | [ "<td" thru ">"] (do_td) | [ "</td" thru ">"] (do_td_end) | [ "<table" copy hei thru ">"] ( table: join "<table" hei do_table ) | copy frame-tag insert-point: [ "<frame" whitespace thru ">" ] ( insert-point: next find insert-point ">" frame-url: ex_att frame-tag "src" if frame-url <> "none" [ to-fetch: to-url HURL/frame frame-url HURL/address_list to-fetch insert HURL/hstack to-url HURL/address frame-content: fetch to-fetch either string? frame-content [ trim/lines frame-content insert insert-point join "<source " [to-fetch ">" frame-content "<pop-from-stack>"] ] [ ; hvis det er block, s er det en error print form frame-content ] ] ) | copy tag_a [ "<a" whitespace copy tab_h ar_tag] ( ) | [ "</table>"] (do_table_end) | ["<" thru ">"]] (do_tag append?: true) | copy txt to "<" (do_txt) ] ; end html-code ---code end--- And 'html-code in turn invokes this: ---code start--- ar_tag: [ [to "<" fmark: [ ["</a" thru ">"] | "</td" (insert fmark " </a> " :fmark) ["</a" thru ">"] | "<a" (insert fmark " </a> " :fmark) ["</a" thru ">"] | "<t" (insert fmark " </a> " :fmark) ["</a" thru ">"] | thru "<" to "<" ar_tag ] ] ] ---code end--- 'Ar_tag is just a mechanism to make sure <a>-tags are properly closed. Then it struck me I should use the built-in mechanism load/markup instead. So I replaced the above rule with load/markup and made a switch: ---code start--- content-block: load/markup site-content forall content-block [ either tag? first content-block [ tag: first content-block p: parse tag none switch first p [ "script" [ forever [ if (length? content-block) = 0 [ break ] if (first content-block) = </script> [ break ] content-block: next content-block ] ] "!--" [ comm: first content-block ] "title" [ title: second content-block do_title append?: false ] "base" [ base: tag do_base ] "tr" [ do_tr ] "pop-from-stack" [ pop-from-stack ] "/tr" [ do_tr_end ] "td" [ do_td ] "/td" [ do_td_end ] "table" [ table: tag do_table ] "frame" [ frame-url: ex_att tag "src" if frame-url <> "none" [ to-fetch: to-url HURL/frame frame-url HURL/address_list to-fetch insert HURL/hstack to-url HURL/address frame-content: fetch to-fetch either string? frame-content [ trim/lines frame-content insert content-block join "<source " [to-fetch ">" frame-content "<pop-from-stack>"] ] [ ; hvis vi mottar en block, s er det en error print form frame-content ] ] ] "/table" [ do_table_end ] "a" [ ; should rearrange structures here, I believe... ] ] do_tag append?: true ] [ txt: first content-block do_txt ] ] ---code end--- Which of the two is faster? I made a little test: ---code start--- test-it: func [] [ a: now/precise/time prin "Checking" loop 10 [ prin "." process-html ] b: now/precise/time print join "^/Done in " (b - a) ] ---code end--- To get accurate results, I stored an HTML page on my harddisk, so that networking shouldn't be a factor. And here are the results: 1) The 'html-code parse rule:
>> test-it
Checking.......... Done in 0:00:02.444
>> test-it
Checking.......... Done in 0:00:02.404
>> test-it
Checking.......... Done in 0:00:02.383
>> test-it
Checking.......... Done in 0:00:02.474
>> test-it
Checking.......... Done in 0:00:02.383
>> test-it
Checking.......... Done in 0:00:02.354
>> test-it
Checking.......... Done in 0:00:02.354
>> test-it
Checking.......... Done in 0:00:02.333
>> test-it
Checking.......... Done in 0:00:02.363
>> test-it
Checking.......... Done in 0:00:02.344
>>
2) using load/markup and a switch:
>> test-it
Checking.......... Done in 0:00:03.205
>> test-it
Checking.......... Done in 0:00:03.195
>> test-it
Checking.......... Done in 0:00:03.194
>> test-it
Checking.......... Done in 0:00:03.225
>> test-it
Checking.......... Done in 0:00:03.205
>> test-it
Checking.......... Done in 0:00:03.204
>> test-it
Checking.......... Done in 0:00:03.204
>> test-it
Checking.......... Done in 0:00:03.244
>> test-it
Checking.......... Done in 0:00:03.225
>> test-it
Checking.......... Done in 0:00:03.244
>>
So, even though the example with load/markup potentially triggers fewer actions (e.g. it doesn't clean up messy <a> tags), it is slower. Does this surprise anyone? I'm not surprised myself, since parse indeed is native and my rule probably is less thorough in its html parsing than load/markup is. This is not really a problem. I'll go back to the 'html-code parse rule. Just thought I'd share this little piece of comparison with the list. ~H

 [2/6] from: joel:neely:fedex at: 24-Jan-2002 7:51


Hi, Hallvard, Interesting results! Hallvard Ystad wrote:
> I have a script that loads a given HTML page, parses it, and > triggers different actions for the different tags...
<<quoted lines omitted: 35>>
> probably is less thorough in its html parsing than load/markup > is...
Thanks for the very nice summary! OBTW, how big was the page on which you ran your test? In addition to the completeness-of-parsing-all-tags issue that you mentioned, I wonder if the SWITCH itself is contributing to the time (especially since it may be evaluated for many tags that are of no interest to your application, which requires it to search the entire block before doing nothing). It might be interesting to see the effect of the following, if you have time to fiddle with it: - Adding counters to the branches (and a default case to count searches for irrelevant tags) to see where the time is being spent. - Using that information, reordering the options to put the most frequent cases earlier in the list (e.g., there's probably only one <title> per page, but it's tested early in the order of your sample code; <td> is probably more frequent than <tr> which is probably more frequent than <table>, but the sample tests them in reverse order, etc.) - Replacing the SWITCH with the equivalent if activity: select [...] first p [do activity] to eliminate the overhead of a mezzanine call and the needless (in this case) check for a /DEFAULT option. - Eliminating the console display from the testing time... test-it: func [] [ prin "Checking" a: now/precise/time loop 10 [ process-html ] b: now/precise/time print join ": Done in " (b - a) ] ...just to avoid a bit of noise in the data. It would be *VERY* interesting (to me, at least) to know how much memory management is an issue in the differences between the two approaches that you tested. Specifically, what if there were a callback option to LOAD/MARKUP similar to the hypothetical load/markup/callback source my-callback-func such that (instead of building a block of results which the client code subsequently iterates over) MY-CALLBACK-FUNC would be invoked directly for every value as encountered in LOAD/MARKUP ? A quick experiment suggests that the memory management overhead of returning a block versus using a callback may be worth some consideration: ; reusable totalizer use [total] [ total: 0.0 init: func [] [total: 0.0] incr: func [n [number!]] [total: total + n] curr: func [] [total] ] ; filter that creates a block result evens-only: func [b [block!] /local result] [ result: make block! length? b foreach item b [if even? item [append result item]] result ] ; function to operate on filter result block sum-block: func [b [block!]] [ init foreach item b [incr item] curr ] ; function to perform work directly on original block sum-evens: func [b [block!]] [ init foreach item b [if even? item [incr item]] curr ] ; function to "call back" supplied function on elements ; that would have passed the filter test do-for-evens: func [b [block!] f [any-function!]] [ foreach item b [if even? item [f item]] ] ; now create a big block-o-numbers bigblock: make block! 100000 repeat i 100000 [append bigblock i] The resulting timings...
>> do [
[ t: now/time/precise [ es: sum-evens bigblock [ t: now/time/precise - t [ print [es t] [ ] 2500050000 0:00:01.48
>> do [
[ t: now/time/precise [ es: sum-block evens-only bigblock [ t: now/time/precise - t [ print [es t] [ ] 2500050000 0:00:03.35
>> do [
[ t: now/time/precise [ init do-for-evens bigblock :incr es: curr [ t: now/time/precise - t [ print [es t] [ ] 2500050000 0:00:01.59 ...seem to indicate that the time penalty for a callback pattern (versus doing all the work in line) is substantially less than the time penalty for creating a result block with subsequent equivalent processing. Note that the penalty can actually be *MUCH* worse than the above timings indicate. The first version of EVENS-ONLY was allowed to burn memory (by pre-allocating the largest possible result block) in the interest of time -- not always a good strategy! A revised version of EVENS-ONLY that has to let the result block grow as it may... evens-only: func [b [block!] /local result] [ result: copy [] foreach item b [if even? item [append result item]] result ] Incurs *WAY*MORE* run-time overhead:
>> do [
[ t: now/time/precise [ es: sum-block evens-only bigblock [ t: now/time/precise - t [ print [es t] [ ] 2500050000 0:00:17.3
>> do [
[ t: now/time/precise [ es: sum-block evens-only bigblock [ t: now/time/precise - t [ print [es t] [ ] 2500050000 0:00:16.69 I wonder what assumptions built-in functions such as LOAD/MARKUP make about the memory needed for their results...? -jn- -- ; sub REBOL {}; sub head ($) {@_[0]} REBOL [] # despam: func [e] [replace replace/all e ":" "." "#" "@"] ; sub despam {my ($e) = @_; $e =~ tr/:#/.@/; return "\n$e"} print head reverse despam "moc:xedef#yleen:leoj" ;

 [3/6] from: rotenca:telvia:it at: 24-Jan-2002 18:09


Hi Joel,
> if activity: select [...] first p [do activity] > > to eliminate the overhead of a mezzanine call and the needless > (in this case) check for a /DEFAULT option.
An idea: if the select block is big, could be used an hash. I've tested this idea with the big block svv/vid-styles, here the result on one of the last field: blk: svv/vid-styles has: to-hash blk
>>length? blk
== 88 ;not so big to be sure:
>> equal? do select has 'TEXT-LIST do select blk 'TEXT-LIST
== true Now a loop of 100000 iteration:
>> timer2 [do select has 'TEXT-LIST] [do select blk 'TEXT-LIST] 100000
1 0:00:00.55 2 0:00:02.14 The time for block reduce to the start of block, so the average time should be of 1.0 -1.10 against 0.5 - 0.55. Because switch uses Select, we could use an hash also with switch. --- Ciao Romano

 [4/6] from: rotenca:telvia:it at: 24-Jan-2002 18:45


Hi all,
> Now a loop of 100000 iteration: > > >> timer2 [do select has 'TEXT-LIST] [do select blk 'TEXT-LIST] 100000 > 1 0:00:00.55 > 2 0:00:02.14
A little correction to my test, to remove the 'do:
>> timer2 [select has 'H] [select blk 'H] 100000
1 0:00:00.33 2 0:00:02.03
>> timer2 [select has 'ANIME] [select blk 'ANIME] 100000
1 0:00:00.28 2 0:00:01.98
>> timer2 [select has 'H3] [select blk 'H3] 100000
1 0:00:00.33 2 0:00:01.1
>> timer2 [select has 'IMAGE] [select blk 'IMAGE] 100000
1 0:00:00.39 2 0:00:00.38 Average 88 items: Block: 1-1.10 against Hash: 0.33-0.6 which should confirm what we know about find with hash! and block!. --- Ciao Romano

 [5/6] from: hallvard:ystad:helpinhand at: 24-Jan-2002 22:24


Joel, Romano, thanks for interesting comments. Dixit Joel Neely (14.51 24.01.2002):
>Thanks for the very nice summary! > >OBTW, how big was the page on which you ran your test?
About 93k
>It might be interesting to see the effect of the following, if >you have time to fiddle with it:
Well, I really shouldn't, but... I ran a few tests with hash! instead of block!, with 'select instead of 'switch, with nested 'either blocks, but it all turned out as bad or worse than the initial experiment. I manipulate the block too much for a hash! to be effective, and I might have misunderstood your 'select example, but anyway, it's very logical that the fastes way to do it is with the original 'html-code parse rule, since two different operations (load/markup & content manipulation) is there united into one. A /callback refinement to 'load/markup, as you suggested, would be cool. ~H

 [6/6] from: rotenca:telvia:it at: 25-Jan-2002 1:59


Hi Hallvard,
> I ran a few tests with hash! instead of block!, with 'select instead of
'switch, with nested 'either blocks, but it all turned >out as bad or worse than the initial experiment. I manipulate the block too much for a hash! to be effective, and I might >have misunderstood your 'select example, but anyway, it's very logical that the fastes way to do it is with the original
>'html-code parse rule, since two different operations (load/markup & content
manipulation) is there united into one. I think of hash like an alternative to block for select ot switch with many cases (>50 ?). Your is of les off 15 and the dfference between hash and block is too low to change the result. The idea is change: switch x [a [] b []..] with k: to hash [a [] b []..] ;make it only one time and then do select k x I had no hope on your markup example, i made the opposite in my script: started with markup and end with string because more fast and more easy to control, because checked inside a parse rule block. --- Ciao Romano

Notes
  • Quoted lines have been omitted from some messages.
    View the message alone to see the lines that have been omitted