Mailing List Archive: 49091 messages
  • Home
  • Script library
  • AltME Archive
  • Mailing list
  • Articles Index
  • Site search
 

[REBOL] Re: newlines

From: joel:neely:fedex at: 7-Nov-2001 18:09

Hello, again, Anton (and all)
> Anton Rolls wrote: > > > > As I wrote in my other post on this thread > > the one line answer is: > > > > while [p: find s "^/^/^/"][remove p] > > > > In a word, "speed". In a wide range of test scenarios, that is > slower that some alternatives. (However, it has a cousin that > is one of the fastest...) > > I'll post some benchmarks and explanations tomorrow. >
OK, so I got done and decided to go ahead and post. Quick summary: There are several options for squishing runs of newlines down to no more than two consecutive newlines. (For sake of visibility, I've used dots in my tests instead of newlines, but the principle is obviously the same.) Which approach is fastest depends greatly on the length of the string being processed and the probability of various run lengths. However, some approaches are less sensitive to those variations than others. Candidates: There are several ways to collapse dot-runs in S (a string). The candidates are given short labels, used in the benchmark results table later on. FIND: while [p: find s "..."][remove p] Advantages - short, simple code. Disadvantages - bogs down *badly* with longer strings and higher frequency of collapsable runs. REPL: newlen: oldlen: length? s until [ oldlen: newlen oldlen = newlen: length? replace/all s "..." ".." ] Advantages - OK performance unless many collapsable runs. Disadvantages - Slightly longer code. BINR: newlen: length? s until [ oldlen: newlen oldlen = newlen: length? replace/all s ".................." ".." ] replace/all s ".........." ".." replace/all s "......" ".." replace/all s "...." ".." replace/all s "..." ".." Advantages - Fastest on longer strings and/or more frequent collapsable runs, but also competes well in other cases. Disadvantages - Design not obvious (until you think about it ;-). COPY: t: make string! length? s nb: 0 foreach ch s [ either ch = #"." [ if 2 >= nb: nb + 1 [ insert tail t ch ] ][ insert tail t ch nb: 0 ] ] t Advantages - Reasonably clear and easy to redevelop; one of the faster choices when many collapsable runs are present. Disadvantages - Poor performance when little collapsing can be done. PASS: p: s while [p: find p "..."][remove p] Advantages - Short and sweet, works best when few collapsing can be done. Disadvantages - Loses when lots of collapsing on long strings is called for. Benchmarks: Randomly-generated strings containing only #"x" and #"." are constructed and identical copies are given to each candidate for processing. Lengths of 10000, 20000, and 30000 characters are tested, with order 0 probability of dots (the collapsed character) set at 10%, 50%, and 90%. Each mini-table is preceded by the number of tests averaged together for that case (always 10), the generated string length, and a string showing relative proportions of #"x" and #"." characters. The label for a candidate is followed by an average time and a percentage relative to the longest time in that test. 10 10000 xxxxxxxxx. pass 0.006044 1 repl 0.0089181 1.5 binr 0.0120582 2.1 find 0.015253 2.6 copy 0.5802009 100 10 10000 xxxxx..... pass 0.1136382 7.2 binr 0.118034 7.5 repl 0.1985032 12.6 copy 0.5821541 36.9 find 1.5794435 100 10 10000 x......... binr 0.1779586 4.4 copy 0.4265818 10.6 pass 0.6116223 15.2 repl 0.8901854 22.1 find 4.0279091 100 10 20000 xxxxxxxxx. pass 0.0107016 1 repl 0.0152895 1.4 binr 0.0295117 2.6 find 0.0326937 2.9 copy 1.1235145 100 10 20000 xxxxx..... binr 0.3812319 6.2 pass 0.4503699 7.4 repl 0.5350547 8.7 copy 1.161134 19 find 6.1152957 100 10 20000 x......... binr 0.4514966 2.8 copy 0.7919637 4.8 repl 2.3939795 14.7 pass 2.6035071 15.9 find 16.3293515 100 10 30000 xxxxxxxxx. pass 0.0153792 1 binr 0.0297831 1.9 repl 0.0326177 2.1 find 0.084752 5.3 copy 1.5871371 100 10 30000 xxxxx..... binr 0.7744618 5.6 pass 0.957589 7 repl 1.1838233 8.6 copy 1.7033773 12.4 find 13.7269263 100 10 30000 x......... binr 0.9456252 2.6 copy 1.1857443 3.3 repl 5.2805771 14.8 pass 5.7582673 16.1 find 35.7992586 100 Conclusions: FIND is the worst of the lot. Most cases have BINR and PASS as neck and neck for best, with COPY a close contender for large strings with long runs. However, when PASS loses ground, it loses big. Likewise (only more so) for REPL. BINR is arguably the best all-purpose solution, as it does reasonably well in a wide variety of scenarios, and even when not the fastest, it doesn't bog down too badly. -jn- -- There is no such thing as exact type measurement. -- Just van Rossum joel]FIX]PUNCTUATION]dot]neely]at]fedex]dot]com