Mailing List Archive: 49091 messages
  • Home
  • Script library
  • AltME Archive
  • Mailing list
  • Articles Index
  • Site search
 

[REBOL] Re: parse vs. load/markup (a bit long)

From: joel:neely:fedex at: 24-Jan-2002 7:51

Hi, Hallvard, Interesting results! Hallvard Ystad wrote:
> I have a script that loads a given HTML page, parses it, and > triggers different actions for the different tags... > > Then it struck me I should use the built-in mechanism > load/markup instead. So I replaced the above rule with > load/markup and made a switch... > > Which of the two is faster? I made a little test... > > To get accurate results, I stored an HTML page on my harddisk, > so that networking shouldn't be a factor. And here are the > results: > > 1) The 'html-code parse rule: > Done in 0:00:02.444 > Done in 0:00:02.404 > Done in 0:00:02.383 > Done in 0:00:02.474 > Done in 0:00:02.383 > Done in 0:00:02.354 > Done in 0:00:02.354 > Done in 0:00:02.333 > Done in 0:00:02.363 > Done in 0:00:02.344 > >> > > 2) using load/markup and a switch: > Done in 0:00:03.205 > Done in 0:00:03.195 > Done in 0:00:03.194 > Done in 0:00:03.225 > Done in 0:00:03.205 > Done in 0:00:03.204 > Done in 0:00:03.204 > Done in 0:00:03.244 > Done in 0:00:03.225 > Done in 0:00:03.244 > >> > > So, even though the example with load/markup potentially > triggers fewer actions (e.g. it doesn't clean up messy <a> > tags), it is slower. Does this surprise anyone? I'm not > surprised myself, since parse indeed is native and my rule > probably is less thorough in its html parsing than load/markup > is...
Thanks for the very nice summary! OBTW, how big was the page on which you ran your test? In addition to the completeness-of-parsing-all-tags issue that you mentioned, I wonder if the SWITCH itself is contributing to the time (especially since it may be evaluated for many tags that are of no interest to your application, which requires it to search the entire block before doing nothing). It might be interesting to see the effect of the following, if you have time to fiddle with it: - Adding counters to the branches (and a default case to count searches for irrelevant tags) to see where the time is being spent. - Using that information, reordering the options to put the most frequent cases earlier in the list (e.g., there's probably only one <title> per page, but it's tested early in the order of your sample code; <td> is probably more frequent than <tr> which is probably more frequent than <table>, but the sample tests them in reverse order, etc.) - Replacing the SWITCH with the equivalent if activity: select [...] first p [do activity] to eliminate the overhead of a mezzanine call and the needless (in this case) check for a /DEFAULT option. - Eliminating the console display from the testing time... test-it: func [] [ prin "Checking" a: now/precise/time loop 10 [ process-html ] b: now/precise/time print join ": Done in " (b - a) ] ...just to avoid a bit of noise in the data. It would be *VERY* interesting (to me, at least) to know how much memory management is an issue in the differences between the two approaches that you tested. Specifically, what if there were a callback option to LOAD/MARKUP similar to the hypothetical load/markup/callback source my-callback-func such that (instead of building a block of results which the client code subsequently iterates over) MY-CALLBACK-FUNC would be invoked directly for every value as encountered in LOAD/MARKUP ? A quick experiment suggests that the memory management overhead of returning a block versus using a callback may be worth some consideration: ; reusable totalizer use [total] [ total: 0.0 init: func [] [total: 0.0] incr: func [n [number!]] [total: total + n] curr: func [] [total] ] ; filter that creates a block result evens-only: func [b [block!] /local result] [ result: make block! length? b foreach item b [if even? item [append result item]] result ] ; function to operate on filter result block sum-block: func [b [block!]] [ init foreach item b [incr item] curr ] ; function to perform work directly on original block sum-evens: func [b [block!]] [ init foreach item b [if even? item [incr item]] curr ] ; function to "call back" supplied function on elements ; that would have passed the filter test do-for-evens: func [b [block!] f [any-function!]] [ foreach item b [if even? item [f item]] ] ; now create a big block-o-numbers bigblock: make block! 100000 repeat i 100000 [append bigblock i] The resulting timings...
>> do [
[ t: now/time/precise [ es: sum-evens bigblock [ t: now/time/precise - t [ print [es t] [ ] 2500050000 0:00:01.48
>> do [
[ t: now/time/precise [ es: sum-block evens-only bigblock [ t: now/time/precise - t [ print [es t] [ ] 2500050000 0:00:03.35
>> do [
[ t: now/time/precise [ init do-for-evens bigblock :incr es: curr [ t: now/time/precise - t [ print [es t] [ ] 2500050000 0:00:01.59 ...seem to indicate that the time penalty for a callback pattern (versus doing all the work in line) is substantially less than the time penalty for creating a result block with subsequent equivalent processing. Note that the penalty can actually be *MUCH* worse than the above timings indicate. The first version of EVENS-ONLY was allowed to burn memory (by pre-allocating the largest possible result block) in the interest of time -- not always a good strategy! A revised version of EVENS-ONLY that has to let the result block grow as it may... evens-only: func [b [block!] /local result] [ result: copy [] foreach item b [if even? item [append result item]] result ] Incurs *WAY*MORE* run-time overhead:
>> do [
[ t: now/time/precise [ es: sum-block evens-only bigblock [ t: now/time/precise - t [ print [es t] [ ] 2500050000 0:00:17.3
>> do [
[ t: now/time/precise [ es: sum-block evens-only bigblock [ t: now/time/precise - t [ print [es t] [ ] 2500050000 0:00:16.69 I wonder what assumptions built-in functions such as LOAD/MARKUP make about the memory needed for their results...? -jn- -- ; sub REBOL {}; sub head ($) {@_[0]} REBOL [] # despam: func [e] [replace replace/all e ":" "." "#" "@"] ; sub despam {my ($e) = @_; $e =~ tr/:#/.@/; return "\n$e"} print head reverse despam "moc:xedef#yleen:leoj" ;