[REBOL] Re: parse vs. load/markup (a bit long)
From: joel:neely:fedex at: 24-Jan-2002 7:51
Hi, Hallvard,
Interesting results!
Hallvard Ystad wrote:
> I have a script that loads a given HTML page, parses it, and
> triggers different actions for the different tags...
>
> Then it struck me I should use the built-in mechanism
> load/markup instead. So I replaced the above rule with
> load/markup and made a switch...
>
> Which of the two is faster? I made a little test...
>
> To get accurate results, I stored an HTML page on my hard disk,
> so that networking shouldn't be a factor. And here are the
> results:
>
> 1) The 'html-code parse rule:
> Done in 0:00:02.444
> Done in 0:00:02.404
> Done in 0:00:02.383
> Done in 0:00:02.474
> Done in 0:00:02.383
> Done in 0:00:02.354
> Done in 0:00:02.354
> Done in 0:00:02.333
> Done in 0:00:02.363
> Done in 0:00:02.344
> >>
>
> 2) using load/markup and a switch:
> Done in 0:00:03.205
> Done in 0:00:03.195
> Done in 0:00:03.194
> Done in 0:00:03.225
> Done in 0:00:03.205
> Done in 0:00:03.204
> Done in 0:00:03.204
> Done in 0:00:03.244
> Done in 0:00:03.225
> Done in 0:00:03.244
> >>
>
> So, even though the example with load/markup potentially
> triggers fewer actions (e.g. it doesn't clean up messy <a>
> tags), it is slower. Does this surprise anyone? I'm not
> surprised myself, since parse indeed is native and my rule
> probably is less thorough in its html parsing than load/markup
> is...
Thanks for the very nice summary!
OBTW, how big was the page on which you ran your test?
In addition to the completeness-of-parsing-all-tags issue that
you mentioned, I wonder whether the SWITCH itself contributes
to the time (especially since it may be evaluated for many tags
that are of no interest to your application, forcing it to
search the entire block of cases before doing nothing).
It might be interesting to see the effect of the following, if
you have time to fiddle with it:
- Adding counters to the branches (and a default case to count
searches for irrelevant tags) to see where the time is
being spent.
- Using that information, reordering the options to put the
most frequent cases earlier in the list (e.g., there's
probably only one <title> per page, but it's tested early
in the order of your sample code; <td> is probably more
frequent than <tr> which is probably more frequent than
<table>, but the sample tests them in reverse order, etc.)
- Replacing the SWITCH with the equivalent
if activity: select [...] first p [do activity]
to eliminate the overhead of a mezzanine call and the needless
(in this case) check for a /DEFAULT option.
- Eliminating the console display from the testing time...
test-it: func [] [
prin "Checking"
a: now/precise/time
loop 10 [
process-html
]
b: now/precise/time
print join ": Done in " (b - a)
]
...just to avoid a bit of noise in the data.
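For illustration, an untested sketch of the SELECT-based dispatch
with counters (HANDLE-TD, HANDLE-TR, etc. are hypothetical
stand-ins for the real actions, and P is assumed to be the current
position in the LOAD/MARKUP result block, so FIRST P is a tag!
value; note that tags carrying attributes, e.g. <td class="x">,
would not match these literal keys without normalization):
counts: copy [td 0 tr 0 table 0 title 0 other 0]
actions: [
    <td>    [counts/td: counts/td + 1       handle-td]
    <tr>    [counts/tr: counts/tr + 1       handle-tr]
    <table> [counts/table: counts/table + 1 handle-table]
    <title> [counts/title: counts/title + 1 handle-title]
]
; in the main loop, instead of SWITCH:
either activity: select actions first p [
    do activity
][
    counts/other: counts/other + 1  ; default: irrelevant tag
]
Afterwards, PROBE COUNTS shows where the dispatch time is going
and suggests the most profitable ordering.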
It would be *VERY* interesting (to me, at least) to know how much
memory management is an issue in the differences between the two
approaches that you tested. Specifically, what if there were a
callback option to LOAD/MARKUP similar to the hypothetical
load/markup/callback source my-callback-func
such that (instead of building a block of results which the client
code subsequently iterates over) MY-CALLBACK-FUNC would be invoked
directly for every value as encountered in LOAD/MARKUP ?
A quick experiment suggests that the memory management overhead
of returning a block versus using a callback may be worth some
consideration:
; reusable totalizer -- TOTAL is private to the USE block, while
; INIT, INCR, and CURR remain global and close over it
use [total] [
total: 0.0
init: func [] [total: 0.0]
incr: func [n [number!]] [total: total + n]
curr: func [] [total]
]
; filter that creates a block result
evens-only: func [b [block!] /local result] [
result: make block! length? b
foreach item b [if even? item [append result item]]
result
]
; function to operate on filter result block
sum-block: func [b [block!]] [
init
foreach item b [incr item]
curr
]
; function to perform work directly on original block
sum-evens: func [b [block!]] [
init
foreach item b [if even? item [incr item]]
curr
]
; function to "call back" supplied function on elements
; that would have passed the filter test
do-for-evens: func [b [block!] f [any-function!]] [
foreach item b [if even? item [f item]]
]
; now create a big block-o-numbers
bigblock: make block! 100000
repeat i 100000 [append bigblock i]
The resulting timings...
>> do [
[ t: now/time/precise
[ es: sum-evens bigblock
[ t: now/time/precise - t
[ print [es t]
[ ]
2500050000 0:00:01.48
>> do [
[ t: now/time/precise
[ es: sum-block evens-only bigblock
[ t: now/time/precise - t
[ print [es t]
[ ]
2500050000 0:00:03.35
>> do [
[ t: now/time/precise
[ init do-for-evens bigblock :incr es: curr
[ t: now/time/precise - t
[ print [es t]
[ ]
2500050000 0:00:01.59
...seem to indicate that the time penalty for a callback pattern
(versus doing all the work in line) is substantially less than
the time penalty for creating a result block with subsequent
equivalent processing.
Note that the penalty can actually be *MUCH* worse than the
above timings indicate. The first version of EVENS-ONLY was
allowed to burn memory (by pre-allocating the largest possible
result block) in the interest of time -- not always a good
strategy! A revised version of EVENS-ONLY that has to let the
result block grow as it may...
evens-only: func [b [block!] /local result] [
result: copy []
foreach item b [if even? item [append result item]]
result
]
...incurs *WAY* more run-time overhead:
>> do [
[ t: now/time/precise
[ es: sum-block evens-only bigblock
[ t: now/time/precise - t
[ print [es t]
[ ]
2500050000 0:00:17.3
>> do [
[ t: now/time/precise
[ es: sum-block evens-only bigblock
[ t: now/time/precise - t
[ print [es t]
[ ]
2500050000 0:00:16.69
I wonder what assumptions built-in functions such as LOAD/MARKUP
make about the memory needed for their results...?
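One crude way to probe that (in REBOL/Core, where STATS reports
the interpreter's memory use in bytes and RECYCLE forces a
garbage-collection pass) is to sample STATS around the call --
a rough sketch, untested:
recycle                    ; settle the allocator first
before: stats
result: evens-only bigblock
print ["block version grew memory by" stats - before "bytes"]
Comparing that figure for the pre-allocated, growing, and
callback versions should show how much of the time difference
above is really allocation.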
-jn-
--
; sub REBOL {}; sub head ($) {@_[0]}
REBOL []
# despam: func [e] [replace replace/all e ":" "." "#" "@"]
; sub despam {my ($e) = @_; $e =~ tr/:#/.@/; return "\n$e"}
print head reverse despam "moc:xedef#yleen:leoj" ;