Mailing List Archive: Re: byte frequencies

[REBOL] Re: byte frequencies

From: joel:neely:fedex at: 6-Jul-2001 14:33


Joel Neely wrote:
> Jeff Kreis wrote:
> >
> > If you're handling very large files you'd want to use:
> >
> >     fi: open/binary/direct fi
> >
> >     while [ch: pick fi 1][
> >         ....
> >         fi: next fi
> >     ]
> >
> > That will be much faster over some minimum size.
> >
>
> Perhaps I'll have a chance to do some benchmarking to get a
> clue as to what qualifies as "some minimum size".
>

Well, having done the benchmarking, I am *really* clueless as
to how the above could ever help performance.  Perhaps I
misread the suggestion...

I tested three variations of a function to read a file and
tally (all of) its characters:

    CFDIR - constructed (AFAICT) per the recommendation for
            large files
    CFCPY - based on the technique in R/CUG 2.3, p. 13-11
    CFBUF - the straightforward read-everything-into-memory
            original version
except that all three use Larry's redefine-the-variable trick
(and CFCPY keeps CH as a /LOCAL for consistency with the other
versions, which needed to declare it, even though in CFCPY it
is the target of a FOREACH).

    totdir: array/initial 256 0
    totcpy: array/initial 256 0
    totbuf: array/initial 256 0

    cfdir: func [fn [file!] /local fi ch] [
        totdir: array/initial 256 0
        fi: open/binary/direct fn
        while [ch: pick fi 1] [
            ch: 1 + ch
            poke totdir ch 1 + pick totdir ch
            fi: next fi
        ]
        close fi
    ]

    cfcpy: func [fn [file!] /local fi mybuf ch] [
        totcpy: array/initial 256 0
        fi: open/binary/direct fn
        while [mybuf: copy/part fi 4096] [
            foreach ch mybuf [
                ch: 1 + ch
                poke totcpy ch 1 + pick totcpy ch
            ]
        ]
        close fi
    ]

    cfbuf: func [fn [file!] /local ch] [
        totbuf: array/initial 256 0
        foreach ch read/binary fn [
            ch: 1 + ch
            poke totbuf ch 1 + pick totbuf ch
        ]
    ]

The TOTxxx tallies were outside the functions so that I could
verify afterwards that the tallies were all equal.
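
A minimal sketch of that check, assuming all three functions
have been run on the same file (%test.dat is a placeholder
name, not one of the actual test files):

    ; run each variant on the same (hypothetical) file
    cfdir %test.dat
    cfcpy %test.dat
    cfbuf %test.dat

    ; blocks compare element by element, so two comparisons
    ; cover all three tallies
    print either all [totdir = totcpy  totcpy = totbuf] [
        "tallies agree"
    ][
        "tallies differ"
    ]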

I just happened to have a 32 MB core dump lying around, so it
was easy to dd some test files of different sizes for
benchmarking.  The relative times, normalized to the fastest
function, for various test files are:

                ------ test file size ------
    Function    1 MB    2 MB    4 MB    8 MB
    --------    ----    ----    ----    ----
       cfdir    3.38    3.71    3.75    3.84
       cfcpy    1.00    1.00    1.00    1.00
       cfbuf    1.08    1.06    1.10    1.06

Differences below about the 10% level are probably noise, but
the trend seems consistent so far: CFCPY wins over CFBUF at
every size by a small but noticeable margin (6% to 10%), while
CFDIR's relative time gets worse as the file grows (from 3.38
at 1 MB to 3.84 at 8 MB).
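
For anyone who wants to repeat the comparison, here is a rough
sketch of a timing harness.  It is not necessarily how the
numbers above were obtained: TIME-IT and %test.dat are made-up
names, and it assumes a REBOL version with NOW/TIME/PRECISE, a
run that does not cross midnight, and a file large enough that
the fastest time is not zero.

    ; TIME-IT is a hypothetical helper: evaluate a block and
    ; return the elapsed wall-clock time as a time! value
    time-it: func [code [block!] /local t0] [
        t0: now/time/precise
        do code
        now/time/precise - t0
    ]

    tdir: time-it [cfdir %test.dat]
    tcpy: time-it [cfcpy %test.dat]
    tbuf: time-it [cfbuf %test.dat]

    ; normalize each time to the fastest of the three
    best: to-decimal min min tdir tcpy tbuf
    foreach [name t] reduce ['cfdir tdir 'cfcpy tcpy 'cfbuf tbuf] [
        print [name (to-decimal t) / best]
    ]

A single run per function is noisy; repeating each call a few
times and keeping the smallest time would give steadier ratios.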

Anyone have any light to shed?

-jn-

---------------------------------------------------------------
There are two types of science: physics and stamp collecting!
                                        -- Sir Arthur Eddington
joel-dot-neely-at-fedex-dot-com