Mailing List Archive: 49091 messages
 

byte frequencies

 [1/13] from: joel:neely:fedex at: 5-Jul-2001 11:12


Hello, all! WRT the issue of tallying occurrences of byte values in a file, the best I've thought of thus far is

    bytefreq: func [fi [file!] /local ft ch] [
        ft: array/initial 256 0
        foreach ch read to-file fi [
            poke ft 1 + ch 1 + pick ft 1 + ch
        ]
        for ch 0 255 1 [
            if 0 < pick ft 1 + ch [
                print [mold to-char ch tab pick ft 1 + ch]
            ]
        ]
    ]

(except for the obvious minor refactoring of using an additional temporary variable or two...) Applied to its own source, it gives
>> bytefreq %bytefreq.r
    #"^/"   16
    #" "    105
    #"!"    2
    #"#"    1
    #"+"    5
    #"-"    2
    #"/"    6
    #"0"    3
    #"1"    6
    #"2"    2
    #"5"    3
    #"6"    1
    #":"    2
    #"<"    1
    #"B"    1
    #"E"    1
    #"L"    1
    #"O"    1
    #"R"    1
    #"["    8
    #"]"    8
    #"a"    9
    #"b"    4
    #"c"    16
    #"d"    2
    #"e"    8
    #"f"    15
    #"h"    10
    #"i"    13
    #"k"    4
    #"l"    9
    #"m"    1
    #"n"    4
    #"o"    9
    #"p"    5
    #"q"    1
    #"r"    10
    #"s"    1
    #"t"    12
    #"u"    2
    #"y"    2

Any suggestions for improvement?

-jn-
--
------------------------------------------------------------
Programming languages: compact, powerful, simple ... Pick any two!
joel'dot'neely'at'fedex'dot'com

 [2/13] from: larry:ecotope at: 5-Jul-2001 21:13


Hi Joel, Good work. Here is my slightly more Rebolian revision:

    bytefreq2: func [fi [file!] /local ft cnt] [
        ft: array/initial 256 0
        foreach ch read fi [
            ch: 1 + ch
            poke ft ch 1 + pick ft ch
        ]
        repeat ch length? ft [
            if 0 < cnt: pick ft ch [
                print [mold to-char ch - 1 tab cnt]
            ]
        ]
    ]

A few quick comments. In your original the variable CH does not have to be declared local to BYTEFREQ because it is only used as an arg (which are always local to their functions) for the FOREACH and FOR functions. TO-FILE is superfluous because FI was declared as file! in the function spec.

Inside the code block of a function the args can be redefined as convenient; in particular, in the FOREACH function, CH can be given any appropriate value in the code block. It is reset to the correct value before each execution of the code block. I rearranged things to eliminate the redundant index calculations. The REPEAT construction I used guarantees that one automatically indexes through all elements of the series argument. I added CNT as a local variable, but we could also have used the variable LOCAL if we wished to avoid declaring another local, at the expense of readability.

Your basic idea of using implicit type conversion to allow characters to be used as indices is clever. But the reader should note that when adding characters and integers, the result is order dependent:
>> 1 + #"a"
== 98
>> #"a" + 1
== #"b"
>>
Cheers -Larry

 [3/13] from: sqlab::gmx::net at: 6-Jul-2001 9:31


And in order to get all bytes' frequencies:

    foreach ch to-string read/binary fi [
        ...
    ]

AR

 [4/13] from: joel:neely:fedex at: 5-Jul-2001 20:25


Hi, Larry, Good comments all; thanks! Larry Palmiter wrote:
> bytefreq2: func [fi [file!] /local ft cnt] [
>     ft: array/initial 256 0
<<quoted lines omitted: 8>>
>     ]
> ]
Nice!
> TO-FILE is superfluous because fi was declared as file! in
> the function spec.
>
I originally had

    func [fi [file! string!] ...

and forgot to remove TO-FILE when I pruned the types. Silly of me not to have cleaned out the leftovers!
> I added CNT as a local variable, but we could also have used
> the variable LOCAL if we wished to avoid declaring another
> local at the expense of readability.
>
I certainly think readability is The Right Thing, especially when the potential "savings" from sacrificing it are so small!
> ... the reader should note that when adding characters and
> integers, the result is order dependent:
>
Right, which is why I chose the order I did. I suppose this is another of those cases where an inconsistency occurred in favor of anticipated use, given that
>> type? 1 + 1.0    ;== decimal!
>> type? 1.0 + 1    ;== decimal!
Thanks for the helpful feedback!

-jn-
--
------------------------------------------------------------
Programming languages: compact, powerful, simple ... Pick any two!
joel'dot'neely'at'fedex'dot'com

 [5/13] from: joel:neely:fedex at: 5-Jul-2001 20:36


Ah, yes! Forgot about good ol' CP/M... Thanks!

-jn-

[sqlab--gmx--net] wrote:
> And in order to get all bytes' frequencies
>
> foreach ch to-string read/binary fi [
> ...
>
--
------------------------------------------------------------
Programming languages: compact, powerful, simple ... Pick any two!
joel'dot'neely'at'fedex'dot'com

 [6/13] from: joel:neely:fedex at: 5-Jul-2001 20:48


One other tidbit, Larry... Larry Palmiter wrote:
> bytefreq2: func [fi [file!] /local ft cnt] [
>     ft: array/initial 256 0
<<quoted lines omitted: 8>>
>     ]
> ]
...
> Your basic idea of using implicit type-conversion to allow
> characters to be used as indices...
<<quoted lines omitted: 11>>
> > ]
> >
Over the years, I've developed the practice of using each variable to mean exactly one thing (although, as with all good intentions, I'm not perfect at it yet! ;-) I've often found subtle bugs that result from redefining the meaning or role of a variable "just a little bit". Thus, my original is fairly obsessive about using CH to mean "the current character I'm dealing with". That habit explains my blind spot WRT flipping CH back and forth between current character and "index for current character". Of course, in 0-origin indexing those two are the same, so there's no arithmetic (or mental adjustment ;-) required.

Finally, to be consistent with your reminder to the reader about int + char vs. char + int, I should note that the loop

    for ch 0 255 1 [...]

actually *had* to be written using INTEGER! bounds; if written using CHAR! bounds, it never terminates! Fair warning!

-jn-
------------------------------------------------------------
Programming languages: compact, powerful, simple ... Pick any two!
joel'dot'neely'at'fedex'dot'com
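The non-termination Joel warns about is easy to demonstrate in any language by simulating single-byte arithmetic that wraps at 256; a hypothetical Python sketch (the function name is invented for illustration, and the wrap is modeled with modular arithmetic rather than an actual CHAR! type):

```python
# Why a loop bounded by the maximum byte value never terminates when the
# counter itself wraps like a single byte: "ch <= 255" is always true.
# This sketch simulates the wraparound and detects the repeat instead of
# spinning forever.  Illustrative code, not from the thread.
def distinct_values_before_wrap():
    ch, seen = 0, set()
    while ch <= 255:                   # always true for a byte-valued counter
        if ch in seen:                 # we have wrapped 255 -> 0: bail out
            return len(seen)
        seen.add(ch)
        ch = (ch + 1) % 256            # single-byte increment wraps at 256
    return len(seen)                   # unreachable with wrapping arithmetic
```

The counter visits all 256 values and then repeats, so the loop's exit condition can never become false on its own.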

 [7/13] from: jeff:rebol at: 6-Jul-2001 8:39


If you assume that NULL chars in your files are infrequent you can just use a 1-based index. If bytefreq3 generates an error, then fall back to one of the others. This version is faster than bf2.

    bytefreq3: func [fi [file!] /local ft cnt] [
        ft: array/initial 255 0
        foreach ch read/binary fi [
            change at ft ch 1 + ft/:ch  ;- <- err?
        ]
        repeat ch 255 [
            if 0 < cnt: ft/:ch [
                print [mold to-char ch tab cnt]
            ]
        ]
    ]

If you're handling very large files you'd want to use:

    fi: open/binary/direct fi
    while [ch: pick fi 1] [
        ....
        fi: next fi
    ]

That will be much faster over some minimum size.

-jeff

 [8/13] from: joel:neely:fedex at: 6-Jul-2001 13:38


Jeff Kreis wrote:
> If you assume that NULL chars in your files are infrequent you
> can just use a 1-based index.
>
It is not an option simply to ignore some character values.
> If you're handling very large files you'd want to use:
>
> fi: open/binary/direct fi
<<quoted lines omitted: 3>>
> ]
> That will be much faster over some minimum size.
Perhaps I'll have a chance to do some benchmarking to get a clue as to what qualifies as "some minimum size".

-jn-
___________________________________________________________________
The purpose of computing is insight, not numbers!  - R. W. Hamming
joel'dot'neely'at'fedex'dot'com

 [9/13] from: jeff:rebol at: 6-Jul-2001 12:08


> > If you assume that NULL chars in your files are
> > infrequent you can just use a 1-based index.
> >
>
> It is not an option simply to ignore some character values.
Unless they're not in your data set (log files). Or if you're considering the amortized case. :-) -jeff

 [10/13] from: joel:neely:fedex at: 6-Jul-2001 14:33


Joel Neely wrote:
> Jeff Kreis wrote:
> >
<<quoted lines omitted: 11>>
> Perhaps I'll have a chance to do some benchmarking to get a
> clue as to what qualifies as "some minimum size".
Well, having done the benchmarking, I am *really* clueless as to how the above could ever help performance. Perhaps I misread the suggestion...

I tested three variations of a function to read a file and tally (all of) its characters:

    CFDIR - AFAICT is constructed per the recommendation for large files
    CFCPY - is based on the technique in R/CUG 2.3, p 13-11
    CFBUF - is the straightforward read-everything-into-memory original version

except that all use Larry's redefine-the-variable trick (and CFCPY left CH as a /LOCAL for consistency -- even though it's the target of a FOREACH, since the other versions needed to declare it).

    totdir: array/initial 256 0
    totcpy: array/initial 256 0
    totbuf: array/initial 256 0

    cfdir: func [fn [file!] /local fi ch] [
        totdir: array/initial 256 0
        fi: open/binary/direct fn
        while [ch: pick fi 1] [
            ch: 1 + ch
            poke totdir ch 1 + pick totdir ch
            fi: next fi
        ]
        close fi
    ]

    cfcpy: func [fn [file!] /local fi mybuf ch] [
        totcpy: array/initial 256 0
        fi: open/binary/direct fn
        while [mybuf: copy/part fi 4096] [
            foreach ch mybuf [
                ch: 1 + ch
                poke totcpy ch 1 + pick totcpy ch
            ]
        ]
        close fi
    ]

    cfbuf: func [fn [file!] /local ch] [
        totbuf: array/initial 256 0
        foreach ch read/binary fn [
            ch: 1 + ch
            poke totbuf ch 1 + pick totbuf ch
        ]
    ]

The TOTxxx tallies were outside the functions so that I could verify afterwards that the tallies were all equal. I just happened to have a 32Mb core dump lying around, so it was easy to dd some test files of different sizes for benchmarking. The relative times, normalized to the fastest function, for various test files are:

                ------ test file size ------
    Function    1 Mb    2 Mb    4 Mb    8 Mb
    --------    ----    ----    ----    ----
    cfdir       3.38    3.71    3.75    3.84
    cfcpy       1.00    1.00    1.00    1.00
    cfbuf       1.08    1.06    1.10    1.06

Anything below about the 10% level is probably noise, but the trend seems consistent so far; CFCPY consistently wins over CFBUF by a small but noticeable margin, while CFDIR gets worse as the file grows. Anyone have any light to shed?
-jn-
---------------------------------------------------------------
There are two types of science: physics and stamp collecting!
  -- Sir Arthur Eddington
joel-dot-neely-at-fedex-dot-com

 [11/13] from: jeff:rebol at: 6-Jul-2001 23:26


> Joel Neely wrote:
> > Jeff Kreis wrote:
<<quoted lines omitted: 9>>
> as to how the above could ever help performance. Perhaps I
> misread the suggestion...
My apologies for not being robust above. You are correct. Using COPY on a direct port, as recommended by rcug, will be the fastest. I wasn't specific enough, but was alluding to cases where reading in the whole file would cause paging -- I just changed the file access and used the same algorithm. You apparently understood this issue clearly since you included the better algorithm for the direct port case.

-jeff

P.S. I think benchmarks that involve performance impacted by potential page faults are more meaningful when accompanied by the test system memory stats (available system ram, etc.) (-: No need to provide -- I believe your results. :-)

 [12/13] from: g:santilli:tiscalinet:it at: 7-Jul-2001 14:40


Hello Joel! On 06-Lug-01, you wrote:

JN>              ------ test file size ------
JN> Function    1 Mb    2 Mb    4 Mb    8 Mb
JN> --------    ----    ----    ----    ----
JN> cfdir       3.38    3.71    3.75    3.84
JN> cfcpy       1.00    1.00    1.00    1.00
JN> cfbuf       1.08    1.06    1.10    1.06

IMHO, CFBUF is likely to "explode" for files larger than currently available memory (because of swapping), but this is probably very platform dependent. CFDIR is probably slow because, with /DIRECT, it calls low-level OS functions to read from the file for each PICK, reading a char at a time. If the OS is not doing buffering internally, I'm not surprised it is really so slow. CFCPY is definitely the way to go to read from (very) large files IMHO.

Regards,
    Gabriele.
--
Gabriele Santilli <[giesse--writeme--com]> - Amigan - REBOL programmer
Amiga Group Italia sez. L'Aquila -- http://www.amyresource.it/AGI/

 [13/13] from: joel:neely:fedex at: 6-Jul-2001 23:17


Jeff Kreis wrote:
> I think benchmarks that involve performance impacted by > potential page faults are more meaningful when accompanied > by the test system memory stats (available system ram, etc.) >
Good point. Another reason for doing so is that I've found a few cases where relative benchmarks give different results on different platforms. I ran these on a PentiumII, 108Mb RAM, RedHat 7.1, View 1.2.0.4.2.

I typically check for performance issues separately from running the timings, to minimize Heisenberg, but re-ran a few checks for the following. Via another xterm during the 8Mb test, top showed rebol using 90%-95% CPU, 7.5% of memory, with over 13Mb mem free and swap doing basically nothing. Most of the time there were 2-4 processes running (REBOL, top, X, and xterm, in that order). Rerunning the tests with that extra load increased the wall time by a few percent, but the relative performances of the three versions stayed consistent with the earlier figures.

There's no rocket science here at all, but if anyone is curious enough to run identical tests on other platforms, the test rig was (function bodies deleted for space, they're in the earlier post):

    charfreq: make object! [
        totdir: array/initial 256 0
        totcpy: array/initial 256 0
        totbuf: array/initial 256 0
        cfdir: func [fn [file!] /local fi ch] [...]
        cfcpy: func [fn [file!] /local fi mybuf ch] [...]
        cfbuf: func [fn [file!] /local ch] [...]
        timetest: func [fn [file!] n [integer!] /local t0 t1 t2 t3] [
            t0: now/time
            loop n [cfdir fn]
            t1: now/time
            loop n [cfcpy fn]
            t2: now/time
            loop n [cfbuf fn]
            t3: now/time
            t3: t3 - t2
            t3: t3/hour * 60 + t3/minute * 60 + t3/second
            t2: t2 - t1
            t2: t2/hour * 60 + t2/minute * 60 + t2/second
            t1: t1 - t0
            t1: t1/hour * 60 + t1/minute * 60 + t1/second
            t0: min min t1 t2 t3
            print [
                t1 tab t1 / t0 newline
                t2 tab t2 / t0 newline
                t3 tab t3 / t0 newline
            ]
        ]
    ]

The second parameter N to CHARFREQ/TIMETEST can be left at 1 if you have a large enough file (or a slow enough CPU! ;-)

-jn-
--
---------------------------------------------------------------
There are two types of science: physics and stamp collecting!
  -- Sir Arthur Eddington
joel-dot-neely-at-fedex-dot-com
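TIMETEST's normalize-to-the-fastest idea has a direct analogue using a monotonic clock; a hypothetical Python sketch, with invented names, not part of the thread:

```python
import time

# Relative-timing harness in the spirit of TIMETEST above: run each
# variant n times, then report every elapsed time as a multiple of the
# fastest.  Hypothetical sketch; function and parameter names invented.
def timetest(funcs, path, n=1):
    elapsed = []
    for fn in funcs:
        t0 = time.perf_counter()       # monotonic, high-resolution clock
        for _ in range(n):
            fn(path)
        elapsed.append(time.perf_counter() - t0)
    fastest = min(elapsed)
    return [t / fastest for t in elapsed]
```

Reporting ratios rather than raw seconds is what makes results comparable across machines, which is exactly the point of the table earlier in the thread.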

Notes
  • Quoted lines have been omitted from some messages.