[REBOL] Re: Large Files and Binary Read
From: gscottjones:mchsi at: 20-Oct-2002 17:37
From: "James Marsden"
> Yeah now I found that bug.. grrrr.. it lets me
> get the first block of data then repeats endlessly.
>
> Anyone suggest a fix?
...
Hi, James,
I hoped you wouldn't be back (that would have been good news), but I
suspected that you would be.
:(
There is no direct working substitute that I am aware of for a true seek
(skip, in REBOLese). When you need to skip through data while using
/direct/binary in combination on a local file, the only approach I am
aware of is to open the file and then "waste" parts of it as a way to
simulate skipping. Because the port is in direct mode, memory is not being
eaten up by an ever-expanding buffer. However, you are, in essence, cycling
through *all* the data, which may be substantial at the file sizes you
have referred to.
Also, there is a good chance that the first block of info you got, the one
you thought was correct, was in fact the wrong block (it was probably the
beginning of the file, even though you used read/direct/binary/skip).
So that we "know" what we are dealing with, I created a very small file with
repeating data by column. Here is a ten-row matrix with one hex digit in each
column:
blk: copy []
;to-hex yields an 8-character issue! (e.g. #0000000F), so skipping 7
;keeps only the final digit; build 10 rows of the digits 0-F
loop 10 [repeat n 16 [append blk skip to-hex n - 1 7]]
write %//windows/desktop/test.txt rejoin blk
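(If I have the to-hex behavior right, the file ends up as ten back-to-back
copies of "0123456789ABCDEF": 160 bytes, one character per column, with no
line breaks.)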
When you practice with your actual algorithm against this file, you'll be
able to see whether you in fact have the correct columns. Now for one of
many, many variations to show how to pseudo-skip through your data:
rows: 10
cols: 16
data-length: 4
start-col: 3
data-slice: copy ""
data: open/direct/binary %//windows/desktop/test.txt
repeat r rows [
    ;skip to the proper column
    copy/part data start-col - 1
    ;collect some data
    append data-slice to-string copy/part data data-length
    ;skip past the rest of the row
    copy/part data cols - start-col - data-length + 1
]
close data
probe data-slice
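With these settings, data-slice should probe as ten repetitions of "2345"
(columns 3 through 6 of each row), which confirms the throw-away reads are
landing where they should.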
The most pertinent parts are the "throw-away" copy/part statements. The rest
was just my arbitrary control logic to cycle by rows (hey, it was a quick and
dirty hack! :-).
For huge row counts but nominal column counts, I suspect you will actually
want to read in a buffered row of data at a time, and then parse the proper
columns out of it. This would help reduce disk access while protecting
memory.
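Here is a minimal sketch of that row-buffered idea, reusing the test file
and control values from above and assuming fixed-width rows with no line
terminators (the variable names are just illustrative):

rows: 10
cols: 16
data-length: 4
start-col: 3
data-slice: copy ""
data: open/direct/binary %//windows/desktop/test.txt
repeat r rows [
    ;one disk read pulls in a whole row
    row: to-string copy/part data cols
    ;slice the wanted columns out of the in-memory row
    append data-slice copy/part at row start-col data-length
]
close data
probe data-slice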
If both the column count and the row count are huge, then I suspect grabbing
a sector of disk data at a time would be more efficient, though it means
more work controlling the column-access algorithm.
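A rough sketch of that chunked approach, assuming the chunk size is a
multiple of the row width so that no row straddles a chunk boundary
(%bigfile.dat and chunk-size are placeholders, not anything from your
setup):

cols: 16
data-length: 4
start-col: 3
chunk-size: 4096    ;one "sector-sized" read; assumed to be a multiple of cols
data-slice: copy ""
data: open/direct/binary %bigfile.dat
;copy/part on a direct port returns none at end of data, ending the loop
while [chunk: copy/part data chunk-size] [
    chunk: to-string chunk
    ;walk each fixed-width row packed inside the chunk
    pos: at chunk start-col
    while [data-length <= length? pos] [
        append data-slice copy/part pos data-length
        pos: skip pos cols
    ]
]
close data
probe data-slice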
Hope this makes some sense. Out of time. Good luck.
--Scott Jones