[REBOL] Re: newlines
From: joel:neely:fedex at: 7-Nov-2001 18:09
Hello, again, Anton (and all)
> Anton Rolls wrote:
> >
> > As I wrote in my other post on this thread
> > the one line answer is:
> >
> > while [p: find s "^/^/^/"][remove p]
> >
>
> In a word, "speed". In a wide range of test scenarios, that is
> slower that some alternatives. (However, it has a cousin that
> is one of the fastest...)
>
> I'll post some benchmarks and explanations tomorrow.
>
OK, so I got done and decided to go ahead and post.
Quick summary:
There are several options for squishing runs of newlines down to
no more than two consecutive newlines. (For sake of visibility,
I've used dots in my tests instead of newlines, but the principle
is obviously the same.)
Which approach is fastest depends greatly on the length of the
string being processed and the probability of various run lengths.
However, some approaches are less sensitive to those variations
than others.
Candidates:
There are several ways to collapse dot-runs in S (a string).
The candidates are given short labels, used in the benchmark
results table later on.
FIND:
while [p: find s "..."][remove p]
Advantages - short, simple code.
Disadvantages - bogs down *badly* with longer strings and higher
frequency of collapsable runs.
REPL:
newlen: oldlen: length? s
until [
oldlen: newlen
oldlen = newlen: length? replace/all s "..." ".."
]
Advantages - OK performance unless many collapsable runs.
Disadvantages - Slightly longer code.
BINR:
newlen: length? s
until [
oldlen: newlen
oldlen = newlen: length? replace/all s ".................." ".."
]
replace/all s ".........." ".."
replace/all s "......" ".."
replace/all s "...." ".."
replace/all s "..." ".."
Advantages - Fastest on longer strings and/or more frequent
collapsable runs, but also competes well in
other cases.
Disadvantages - Design not obvious (until you think about it ;-).
COPY:
t: make string! length? s
nb: 0
foreach ch s [
either ch = #"." [
if 2 >= nb: nb + 1 [
insert tail t ch
]
][
insert tail t ch nb: 0
]
]
t
Advantages - Reasonably clear and easy to redevelop;
one of the faster choices when many collapsable
runs are present.
Disadvantages - Poor performance when little collapsing can be
done.
PASS:
p: s
while [p: find p "..."][remove p]
Advantages - Short and sweet, works best when few collapsing
can be done.
Disadvantages - Loses when lots of collapsing on long strings
is called for.
Benchmarks:
Randomly-generated strings containing only #"x" and #"." are
constructed and identical copies are given to each candidate for
processing. Lengths of 10000, 20000, and 30000 characters are
tested, with order 0 probability of dots (the collapsed character)
set at 10%, 50%, and 90%. Each mini-table is preceded by the
number of tests averaged together for that case (always 10), the
generated string length, and a string showing relative proportions
of #"x" and #"." characters.
The label for a candidate is followed by an average time and
a percentage relative to the longest time in that test.
10 10000 xxxxxxxxx.
pass 0.006044 1
repl 0.0089181 1.5
binr 0.0120582 2.1
find 0.015253 2.6
copy 0.5802009 100
10 10000 xxxxx.....
pass 0.1136382 7.2
binr 0.118034 7.5
repl 0.1985032 12.6
copy 0.5821541 36.9
find 1.5794435 100
10 10000 x.........
binr 0.1779586 4.4
copy 0.4265818 10.6
pass 0.6116223 15.2
repl 0.8901854 22.1
find 4.0279091 100
10 20000 xxxxxxxxx.
pass 0.0107016 1
repl 0.0152895 1.4
binr 0.0295117 2.6
find 0.0326937 2.9
copy 1.1235145 100
10 20000 xxxxx.....
binr 0.3812319 6.2
pass 0.4503699 7.4
repl 0.5350547 8.7
copy 1.161134 19
find 6.1152957 100
10 20000 x.........
binr 0.4514966 2.8
copy 0.7919637 4.8
repl 2.3939795 14.7
pass 2.6035071 15.9
find 16.3293515 100
10 30000 xxxxxxxxx.
pass 0.0153792 1
binr 0.0297831 1.9
repl 0.0326177 2.1
find 0.084752 5.3
copy 1.5871371 100
10 30000 xxxxx.....
binr 0.7744618 5.6
pass 0.957589 7
repl 1.1838233 8.6
copy 1.7033773 12.4
find 13.7269263 100
10 30000 x.........
binr 0.9456252 2.6
copy 1.1857443 3.3
repl 5.2805771 14.8
pass 5.7582673 16.1
find 35.7992586 100
Conclusions:
FIND is the worst of the lot. Most cases have BINR and PASS as
neck and neck for best, with COPY a close contender for large
strings with long runs. However, when PASS loses ground, it
loses big. Likewise (only more so) for REPL.
BINR is arguably the best all-purpose solution, as it does
reasonably well in a wide variety of scenarios, and even when not
the fastest, it doesn't bog down too badly.
-jn-
--
There is no such thing as exact type measurement.
-- Just van Rossum
joel]FIX]PUNCTUATION]dot]neely]at]fedex]dot]com