[REBOL] Re: Line reduction
From: joel:neely:fedex at: 6-Jul-2001 13:17
Hi, Aaron,
Having had time to sleep on it (and clear my head of some other
distractions)...
[aroberts--swri--edu] wrote:
> I have a very large dataset comprised of numbers. The values
> come in a set of three, one set to a line. The ordering of
> the sets is arbitrary. I need a way of reducing the data down,
> so I can get a 'snap shot' of the full data set...
>
That being the case, another solution (that requires less
processing internally) would be:
source-file: to-file ask "Source file name: "
output-file: to-file ask "Output file name: "
sample-pct: 0.01 *
min 100 max 0 to-decimal ask "% of data to sample: "
line-count: length? all-the-data: read/lines source-file
write/lines output-file
at all-the-data
to-integer line-count - (line-count - 1 * sample-pct) + 0.5
Expressing the sampling rate as the percentage of the data you want
to keep seems fairly user-friendly, and lets you get the exact
level you want without multiple passes (e.g. 25% instead of half of
half).
The value of SAMPLE-PCT is limited to the range 0.0 through 100.0
to protect against bogus entries, keying errors, etc.
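For example, here's the clamp in isolation (untested, but this is
the idea -- remember MIN and MAX each grab the next full expression):

    0.01 * min 100 max 0 250.0    ; 250% is clamped to 100%, giving 1.0
    0.01 * min 100 max 0 -5.0     ; negative entries become 0, giving 0.0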
The input file is read into a block of lines, whose length is the
line count.
The last expression simply calculates where the *last* SAMPLE-PCT
of the lines begins, and writes from there to the end of the
block. Therefore, no copying or removing is required.
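To make the arithmetic concrete (remembering that REBOL evaluates
infix operators strictly left to right, so LINE-COUNT - 1 * SAMPLE-PCT
is really (LINE-COUNT - 1) * SAMPLE-PCT), suppose a 1000-line file
sampled at 25%:

    line-count: 1000
    sample-pct: 0.25
    to-integer line-count - (line-count - 1 * sample-pct) + 0.5
    ;== 750

so the write starts at position 750 and takes the remaining lines,
roughly a quarter of the data. The + 0.5 before TO-INTEGER is just
rounding by truncation.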
This version still reads the entire file into memory just to find
out the line count. If all of your lines were close enough
to the same length, you could modify the arithmetic to start with
the size of the input file, calculate the percentage of that total
size, then read and write only that much (ignoring the partial line
that might appear at the end).
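A rough sketch of that approach (untested; it assumes roughly
uniform line lengths, reads the file as one string, and simply
drops the probably-partial line at the start of the sample):

    file-size: size? source-file
    sample: skip read source-file to-integer file-size * (1 - sample-pct)
    ; skip past the (likely partial) first line, keep everything after it
    write output-file next find sample newline

You'd still read the whole file here, but with READ/PART and an
OPEN'ed port you could avoid even that.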
-jn-
___________________________________________________________________
The purpose of computing is insight, not numbers!
- R. W. Hamming
joel'dot'neely'at'fedex'dot'com