[REBOL] Re: Sort by first part of line
From: joel:neely:fedex at: 7-Sep-2002 22:06
Hi, Romano, et alia,
Romano Paolo Tenca wrote:
> > 1. Parse is slow -- take it out of the critical path (as Gregg
> > and Romano do) or eliminate it (as Scott and Joel do).
> > 2. It's not the number of parse's (Greg has one per line; Romano
> > has one for the whole dataset) so much as the volume of data to
> > be parsed
>
> I do not agree. The greatest problems are in memory allocation.
>
...
> I think parse is one of the more fast constructs in Rebol.
>
Certainly memory management (especially when constructing blocks
in interpreted expressions as I did) becomes a very significant
component of the total run time as the size/number of blocks
continues to increase.
As for the rest, I must disagree with *both* you and Sunanda.
(Some trick, eh? ;-) I don't attach meaning to absolute phrases
such as "PARSE is slow" or "PARSE is fast", but instead think in
terms such as "PARSE is slower/faster than _____ for doing ____".
For example, if I replace the phrase
to-integer copy/part next item 3
in my original grouping algorithm with
to-integer first parse item none
two things happen:
1) The issues of leading whitespace and exact number of digits
in the numeric-looking prefix now go away.
2) The processing slows down by about 20% (takes about 20% longer
to run).
So, *in*this*case*, PARSE is slower than explicit copying for doing
the job of pulling out a fixed-length/position prefix. Generality
almost always has a price, and performance optimization almost
always increases brittleness.
I'd prefer to rephrase Sunanda's points as:
1) Move as much processing as possible out of inner loops.
2) Process as little data as possible to get the work done.
regardless of how the processing is invoked. In the second point,
doing a whitespace-driven parse of the entire line (or, worse, of
the entire file) is overkill, since we don't need the rest of the
line parsed/broken down for the task as specified.
Reminding myself to take a clue from the first point, we can use
the implicit looping of PARSE, and the fact that the prefixes are
only three digits long to come up with something like this:
8<--------
t0: now/time/precise
buffer: copy/deep array/initial 999 [[]]
digit: charset "0123456789"
parse/all read f [
some [
some " "
copy nr some digit
copy data [thru "^/" | to end] (
append pick buffer to-integer nr data
)
]
]
t1: now/time/precise
print to-decimal t1 - t0
8<--------
which, by my tests, actually runs a bit (~7%) faster than my
first version (which extended the buffer within the inner loop
whenever needed). Feeding that same idea back into the first
version produces something like this:
8<--------
t0: now/time/precise
buffer: copy/deep array/initial 999 [[]]
foreach item read/lines f [
nr: to-integer copy/part next item 3
append buffer/:nr item
]
t1: now/time/precise
print to-decimal t1 - t0
8<--------
which runs about 9% faster than the original.
-jn-
--
; Joel Neely joeldotneelyatfedexdotcom
REBOL [] do [ do func [s] [ foreach [a b] s [prin b] ] sort/skip
do function [s] [t] [ t: "" foreach [a b] s [repend t [b a]] t ] {
| e s m!zauafBpcvekexEohthjJakwLrngohOqrlryRnsctdtiub} 2 ]