
[REBOL] Re: byte frequencies

From: joel:neely:fedex at: 6-Jul-2001 14:33

Joel Neely wrote:
> Jeff Kreis wrote:
> >
> > If you're handling very large files you'd want to use:
> >
> >     fi: open/binary/direct fi
> >
> >     while [ch: pick fi 1][
> >         ....
> >         fi: next fi
> >     ]
> >
> > That will be much faster over some minimum size.
> >
>
> Perhaps I'll have a chance to do some benchmarking to get a
> clue as to what qualifies as "some minimum size".
>
Well, having done the benchmarking, I am *really* clueless as to
how the above could ever help performance. Perhaps I misread the
suggestion...

I tested three variations of a function to read a file and tally
(all of) its characters:

    CFDIR - AFAICT is constructed per the recommendation for
            large files,
    CFCPY - is based on the technique in R/CUG 2.3, p 13-11
    CFBUF - is the straightforward read-everything-into-memory
            original version

except that all use Larry's redefine-the-variable trick (and CFCPY
left CH as a /LOCAL for consistency -- even though it's the target
of a FOREACH, since the other versions needed to declare it).

    totdir: array/initial 256 0
    totcpy: array/initial 256 0
    totbuf: array/initial 256 0

    cfdir: func [fn [file!] /local fi ch] [
        totdir: array/initial 256 0
        fi: open/binary/direct fn
        while [ch: pick fi 1] [
            ch: 1 + ch
            poke totdir ch 1 + pick totdir ch
            fi: next fi
        ]
        close fi
    ]

    cfcpy: func [fn [file!] /local fi mybuf ch] [
        totcpy: array/initial 256 0
        fi: open/binary/direct fn
        while [mybuf: copy/part fi 4096] [
            foreach ch mybuf [
                ch: 1 + ch
                poke totcpy ch 1 + pick totcpy ch
            ]
        ]
        close fi
    ]

    cfbuf: func [fn [file!] /local ch] [
        totbuf: array/initial 256 0
        foreach ch read/binary fn [
            ch: 1 + ch
            poke totbuf ch 1 + pick totbuf ch
        ]
    ]

The TOTxxx tallies were outside the functions so that I could
verify afterwards that the tallies were all equal.

I just happened to have a 32Mb core dump lying around, so it was
easy to dd some test files of different sizes for benchmarking.
The relative times, normalized to the fastest function, for
various test files are:

                ------ test file size ------
    Function    1 Mb    2 Mb    4 Mb    8 Mb
    --------    ----    ----    ----    ----
    cfdir       3.38    3.71    3.75    3.84
    cfcpy       1.00    1.00    1.00    1.00
    cfbuf       1.08    1.06    1.10    1.06

Anything below about the 10% level is probably noise, but the
trend seems consistent so far; CFCPY consistently wins over CFBUF
by a small but noticeable margin, while CFDIR gets worse as the
file grows.

Anyone have any light to shed?
-jn-

---------------------------------------------------------------
There are two types of science: physics and stamp collecting!
                                        -- Sir Arthur Eddington

joel-dot-neely-at-fedex-dot-com