byte frequencies
[1/13] from: joel:neely:fedex at: 5-Jul-2001 11:12
Hello, all!
WRT the issue of tallying occurrences of byte values in
a file, the best I've thought of thus far is
bytefreq: func [fi [file!] /local ft ch] [
ft: array/initial 256 0
foreach ch read to-file fi [
poke ft 1 + ch 1 + pick ft 1 + ch
]
for ch 0 255 1 [
if 0 < pick ft 1 + ch [
print [mold to-char ch tab pick ft 1 + ch]
]
]
]
(except for the obvious minor refactoring of using an
additional temporary variable or two...)
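[Editor's note: for readers who don't speak REBOL, here is a minimal
Python sketch of the same byte-frequency tally. It is an illustration
added for comparison, not code from the original thread; the function
name `bytefreq` simply mirrors the REBOL version above.]

```python
# Minimal Python sketch of the byte-frequency tally above (an editor's
# illustration for comparison; not part of the original thread).
def bytefreq(path):
    counts = [0] * 256              # one counter per possible byte value
    with open(path, "rb") as f:
        for b in f.read():          # iterating over bytes yields ints 0..255
            counts[b] += 1
    for value, count in enumerate(counts):
        if count > 0:
            print(repr(chr(value)), count)
    return counts
```

Note that Python indexes from 0, so the `1 + ch` adjustment needed for
REBOL's 1-based series disappears here.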
Applied to its own source, it gives
>> bytefreq %bytefreq.r
#"^/" 16
#" " 105
#"!" 2
#"#" 1
#"+" 5
#"-" 2
#"/" 6
#"0" 3
#"1" 6
#"2" 2
#"5" 3
#"6" 1
#":" 2
#"<" 1
#"B" 1
#"E" 1
#"L" 1
#"O" 1
#"R" 1
#"[" 8
#"]" 8
#"a" 9
#"b" 4
#"c" 16
#"d" 2
#"e" 8
#"f" 15
#"h" 10
#"i" 13
#"k" 4
#"l" 9
#"m" 1
#"n" 4
#"o" 9
#"p" 5
#"q" 1
#"r" 10
#"s" 1
#"t" 12
#"u" 2
#"y" 2
Any suggestions for improvement?
-jn-
--
------------------------------------------------------------
Programming languages: compact, powerful, simple ...
Pick any two!
joel'dot'neely'at'fedex'dot'com
[2/13] from: larry:ecotope at: 5-Jul-2001 21:13
Hi Joel,
Good work. Here is my slightly more Rebolian revision:
bytefreq2: func [fi [file!] /local ft cnt] [
ft: array/initial 256 0
foreach ch read fi [
ch: 1 + ch
poke ft ch 1 + pick ft ch
]
repeat ch length? ft [
if 0 < cnt: pick ft ch [
print [mold to-char ch - 1 tab cnt]
]
]
]
A few quick comments. In your original, the variable CH does not have
to be declared local to BYTEFREQ, because it is used only as an
argument (which is always local) to the FOREACH and FOR functions.
TO-FILE is superfluous because FI was declared as file! in the
function spec.
Inside the code block of a function the args can be redefined as
convenient; in particular, in the FOREACH function, CH can be given any
appropriate value in the code block. It is reset to the correct value
before each execution of the code block.
I rearranged things to eliminate the redundant index calculations. The
REPEAT construction I used guarantees that one automatically indexes through
all elements of the series argument.
I added CNT as a local variable, but we could also have used the variable
LOCAL if we wished to avoid declaring another local at the expense of
readability.
Your basic idea of using implicit type-conversion to allow characters to be
used as indices is clever. But the reader should note that when adding
characters and integers, the result is order dependent:
>> 1 + #"a"
== 98
>> #"a" + 1
== #"b"
>>
Cheers
-Larry
[3/13] from: sqlab::gmx::net at: 6-Jul-2001 9:31
And in order to get all bytes' frequencies
foreach ch to-string read/binary fi [
..
...
AR
[4/13] from: joel:neely:fedex at: 5-Jul-2001 20:25
Hi, Larry,
Good comments all; thanks!
Larry Palmiter wrote:
> bytefreq2: func [fi [file!] /local ft cnt] [
> ft: array/initial 256 0
<<quoted lines omitted: 8>>
> ]
> ]
Nice!
> TO-FILE is superfluous because fi was declared as file! in
> the function spec.
>
I originally had
func [fi [file! string!] ...
and forgot to remove TO-FILE when I pruned the types. Silly
of me not to have cleaned out the leftovers!
> I added CNT as a local variable, but we could also have used
> the variable LOCAL if we wished to avoid declaring another
> local at the expense of readability.
>
I certainly think readability is The Right Thing, especially
when the potential "savings" from sacrificing it are so small!
> ... the reader should note that when adding characters and
> integers, the result is order dependent:
>
Right, which is why I chose the order I did. I suppose this
is another of those cases where an inconsistency occurred in
favor of anticipated use, given that
>> type? 1 + 1.0 ;== decimal!
>> type? 1.0 + 1 ;== decimal!
Thanks for the helpful feedback!
-jn-
--
------------------------------------------------------------
Programming languages: compact, powerful, simple ...
Pick any two!
joel'dot'neely'at'fedex'dot'com
[5/13] from: joel:neely:fedex at: 5-Jul-2001 20:36
Ah, yes!
Forgot about good ol' CP/M...
Thanks!
-jn-
[sqlab--gmx--net] wrote:
> And in order to get all bytes' frequencies
>
> foreach ch to-string read/binary fi [
> ...
>
--
------------------------------------------------------------
Programming languages: compact, powerful, simple ...
Pick any two!
joel'dot'neely'at'fedex'dot'com
[6/13] from: joel:neely:fedex at: 5-Jul-2001 20:48
One other tidbit, Larry...
Larry Palmiter wrote:
> bytefreq2: func [fi [file!] /local ft cnt] [
> ft: array/initial 256 0
<<quoted lines omitted: 8>>
> ]
> ]
...
> Your basic idea of using implicit type-conversion to allow
> characters to be used as indices...
<<quoted lines omitted: 11>>
> > ]
> >
Over the years, I've developed the practice of using each
variable to mean exactly one thing (although, as with all
good intentions, I'm not perfect at it yet! ;-) I've often
found subtle bugs that result from redefining the meaning
or role of a variable "just a little bit".
Thus, my original is fairly obsessive about using CH to
mean "the current character I'm dealing with". That habit
explains my blind spot WRT flipping CH back-and-forth between
"current character" and "index for current character". Of
course, in 0-origin indexing those two are the same, so there's
no arithmetic (or mental adjustment ;-) required.
Finally, to be consistent with your reminder to the reader
about int + char vs. char + int , I should note that the
loop
for ch 0 255 1 [...]
actually *had* to be written using INTEGER! bounds; if written
using CHAR! bounds, it never terminates!
Fair warning!
-jn-
------------------------------------------------------------
Programming languages: compact, powerful, simple ...
Pick any two!
joel'dot'neely'at'fedex'dot'com
[7/13] from: jeff:rebol at: 6-Jul-2001 8:39
If you assume that NULL chars in your files are infrequent, you
can just use a 1-based index.
If bytefreq3 generates an error, then fall back to one of the
others. This version is faster than bf2.
bytefreq3: func [fi [file!] /local ft cnt] [
ft: array/initial 255 0
foreach ch read/binary fi [
change at ft ch 1 + ft/:ch ;- <- err?
]
repeat ch 255 [
if 0 < cnt: ft/:ch [
print [mold to-char ch tab cnt]
]
]
]
If you're handling very large files you'd want to use:
fi: open/binary/direct fi
while [ch: pick fi 1][
....
fi: next fi
]
That will be much faster over some minimum size.
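[Editor's note: the idea of walking a file without loading it all at
once can be sketched in Python as follows. This is an added
illustration, not code from the thread; `bytefreq_chunked` and the 4096
buffer size are the editor's choices, echoing the COPY/PART 4096 used
later in the thread.]

```python
# Chunked-read variant: tallies bytes without loading the whole file
# into memory, roughly analogous to reading an open/binary/direct port
# piece by piece (an editor's sketch, not code from the original thread).
def bytefreq_chunked(path, bufsize=4096):
    counts = [0] * 256
    with open(path, "rb") as f:
        while True:
            chunk = f.read(bufsize)   # read at most bufsize bytes
            if not chunk:             # empty read means end of file
                break
            for b in chunk:
                counts[b] += 1
    return counts
```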
-jeff
[8/13] from: joel:neely:fedex at: 6-Jul-2001 13:38
Jeff Kreis wrote:
> If you assume that NULL chars in your files are infrequent you
> can just use a 1-based index.
>
It is not an option simply to ignore some character values.
> If you're handling very large files you'd want to use:
> fi: open/binary/direct fi
<<quoted lines omitted: 3>>
> ]
> That will be much faster over some minimum size.
Perhaps I'll have a chance to do some benchmarking to get a clue
as to what qualifies as "some minimum size".
-jn-
___________________________________________________________________
The purpose of computing is insight, not numbers!
- R. W. Hamming
joel'dot'neely'at'fedex'dot'com
[9/13] from: jeff:rebol at: 6-Jul-2001 12:08
> > If you assume that NULL chars in your files are
> > infrequent you can just use a 1-based index.
> >
>
> It is not an option simply to ignore some character values.
Unless they're not in your data set (log files). Or if you're
considering the amortized case. :-)
-jeff
[10/13] from: joel:neely:fedex at: 6-Jul-2001 14:33
Joel Neely wrote:
> Jeff Kreis wrote:
> >
<<quoted lines omitted: 11>>
> Perhaps I'll have a chance to do some benchmarking to get a
> clue as to what qualifies as "some minimum size".
Well, having done the benchmarking, I am *really* clueless as
to how the above could ever help performance. Perhaps I
misread the suggestion...
I tested three variations of a function to read a file and
tally (all of) its characters:
CFDIR - AFAICT is constructed per the recommendation for
large files;
CFCPY - is based on the technique in R/CUG 2.3, p. 13-11;
CFBUF - is the straightforward read-everything-into-memory
original version;
except that all use Larry's redefine-the-variable trick (and
CFCPY left CH as a /LOCAL for consistency -- even though it's
the target of a FOREACH, since the other versions needed to
declare it).
totdir: array/initial 256 0
totcpy: array/initial 256 0
totbuf: array/initial 256 0
cfdir: func [fn [file!] /local fi ch] [
totdir: array/initial 256 0
fi: open/binary/direct fn
while [ch: pick fi 1] [
ch: 1 + ch
poke totdir ch 1 + pick totdir ch
fi: next fi
]
close fi
]
cfcpy: func [fn [file!] /local fi mybuf ch] [
totcpy: array/initial 256 0
fi: open/binary/direct fn
while [mybuf: copy/part fi 4096] [
foreach ch mybuf [
ch: 1 + ch
poke totcpy ch 1 + pick totcpy ch
]
]
close fi
]
cfbuf: func [fn [file!] /local ch] [
totbuf: array/initial 256 0
foreach ch read/binary fn [
ch: 1 + ch
poke totbuf ch 1 + pick totbuf ch
]
]
The TOTxxx tallies were outside the functions so that I could
verify afterwards that the tallies were all equal.
I just happened to have a 32Mb core dump lying around, so it was
easy to dd some test files of different sizes for benchmarking.
The relative times, normalized to the fastest function, for
various test files are:
------ test file size ------
Function 1 Mb 2 Mb 4 Mb 8 Mb
-------- ---- ---- ---- ----
cfdir 3.38 3.71 3.75 3.84
cfcpy 1.00 1.00 1.00 1.00
cfbuf 1.08 1.06 1.10 1.06
Anything below about the 10% level is probably noise, but the
trend seems consistent so far; CFCPY consistently wins over
CFBUF by a small but noticeable margin, while CFDIR gets worse
as the file grows.
Anyone have any light to shed?
-jn-
---------------------------------------------------------------
There are two types of science: physics and stamp collecting!
-- Sir Arthur Eddington
joel-dot-neely-at-fedex-dot-com
[11/13] from: jeff:rebol at: 6-Jul-2001 23:26
> Joel Neely wrote:
> > Jeff Kreis wrote:
<<quoted lines omitted: 9>>
> as to how the above could ever help performance. Perhaps I
> misread the suggestion...
My apologies for not being robust above.
You are correct. Using COPY on a direct port, as
recommended by rcug, will be the fastest. I wasn't specific
enough, but was alluding to cases where reading in the whole
file would cause paging-- I just changed the file access and
used the same algorithm. You apparently understood this
issue clearly since you included the better algorithm for
the direct port case.
-jeff
P.S. I think benchmarks that involve performance impacted by
potential page faults are more meaningful when accompanied
by the test system memory stats (available system ram, etc.)
(-: No need to provide -- I believe your results. :-)
[12/13] from: g:santilli:tiscalinet:it at: 7-Jul-2001 14:40
Hello Joel!
On 06-Lug-01, you wrote:
JN> ------ test file size ------
JN> Function 1 Mb 2 Mb 4 Mb 8 Mb
JN> -------- ---- ---- ---- ----
JN> cfdir 3.38 3.71 3.75 3.84
JN> cfcpy 1.00 1.00 1.00 1.00
JN> cfbuf 1.08 1.06 1.10 1.06
IMHO, CFBUF is likely to "explode" for files larger than current
available memory (because of swapping), but this is probably very
platform dependent. CFDIR is probably slow because, with /DIRECT,
it calls low-level OS functions to read from the file for each pick,
reading a char at a time. If the OS is not doing buffering
internally, I'm not surprised it is really so slow.
CFCPY is definitely the way to go to read from (very) large files
IMHO.
Regards,
Gabriele.
--
Gabriele Santilli <[giesse--writeme--com]> - Amigan - REBOL programmer
Amiga Group Italia sez. L'Aquila -- http://www.amyresource.it/AGI/
[13/13] from: joel:neely:fedex at: 6-Jul-2001 23:17
Jeff Kreis wrote:
> I think benchmarks that involve performance impacted by
> potential page faults are more meaningful when accompanied
> by the test system memory stats (available system ram, etc.)
>
Good point. Another reason for doing so is that I've found a
few cases where relative benchmarks give different results on
different platforms.
I ran these on PentiumII, 108Mb RAM, RedHat 7.1, View 1.2.0.4.2.
I typically check for performance issues separately from running
the timings, to minimize Heisenberg, but re-ran a few checks for
the following.
Via another xterm during the 8Mb test, top showed rebol using
90%-95% CPU, 7.5% of memory, with over 13Mb mem free and swap
doing basically nothing. Most of the time there were 2-4
processes running (REBOL, top, X, and xterm, in that order).
Rerunning the tests with that extra load increased the wall time
by a few percent, but the relative performances of the three
versions stayed consistent with the earlier figures.
There's no rocket science here at all, but if anyone is curious
enough to run identical tests on other platforms, the test rig
was (function bodies deleted for space, they're in the earlier
post):
charfreq: make object! [
totdir: array/initial 256 0
totcpy: array/initial 256 0
totbuf: array/initial 256 0
cfdir: func [fn [file!] /local fi ch] [...]
cfcpy: func [fn [file!] /local fi mybuf ch] [...]
cfbuf: func [fn [file!] /local ch] [...]
timetest: func [fn [file!] n [integer!] /local t0 t1 t2 t3] [
t0: now/time
loop n [cfdir fn]
t1: now/time
loop n [cfcpy fn]
t2: now/time
loop n [cfbuf fn]
t3: now/time
t3: t3 - t2
t3: t3/hour * 60 + t3/minute * 60 + t3/second
t2: t2 - t1
t2: t2/hour * 60 + t2/minute * 60 + t2/second
t1: t1 - t0
t1: t1/hour * 60 + t1/minute * 60 + t1/second
t0: min min t1 t2 t3
print [
t1 tab t1 / t0 newline
t2 tab t2 / t0 newline
t3 tab t3 / t0 newline
]
]
]
The second parameter N to CHARFREQ/TIMETEST can be left at 1
if you have a large enough file (or a slow enough CPU! ;-)
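[Editor's note: the same normalize-to-the-fastest timing scheme can be
sketched in Python. This is an added illustration, not code from the
thread; `timetest` here takes a list of tally functions to compare and
reports each one's wall time and its ratio to the fastest, like the
relative-times table above.]

```python
import time

# Timing harness in the spirit of CHARFREQ/TIMETEST: run each function
# n times over the same file, then print absolute seconds and the ratio
# to the fastest (an editor's sketch; the functions passed in are
# whatever byte-tally variants you want to compare).
def timetest(funcs, path, n=1):
    times = []
    for fn in funcs:
        t0 = time.perf_counter()
        for _ in range(n):
            fn(path)
        times.append(time.perf_counter() - t0)
    fastest = min(times)
    for fn, t in zip(funcs, times):
        print(fn.__name__, round(t, 4), round(t / fastest, 2))
    return times
```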
-jn-
--
---------------------------------------------------------------
There are two types of science: physics and stamp collecting!
-- Sir Arthur Eddington
joel-dot-neely-at-fedex-dot-com
Notes
- Quoted lines have been omitted from some messages.
View the message alone to see the lines that have been omitted