[REBOL] Re: Working with large files
From: brock:kalef:innovapost at: 12-Aug-2008 13:51
Thanks to everyone for their feedback/suggestions.
I seem to have a solution that backtracks to the starting point of any
incomplete record. It should work on any newline-terminated data, and you
can set the amount of data to grab on each read with the 'size word; the
number is the count of bytes to copy from the data file.
rebol []

port: open/seek %"Sample data/simplified.log"
size: 130    ; number of bytes to copy from the file on each read
cnt: 1

while [not tail? port] [
    data: copy/part port size
    working-data: copy data
    either (last working-data) = #"^/" [
        ; chunk ended exactly on a newline, so the last record is complete
        use-last-record?: true
        start-at: (index? data) + :size
    ][
        use-last-record?: false
        either not error? try [
            start-at: index? find/reverse tail data "^/"
        ][
            ; back up to the start of the incomplete record; the next read
            ; begins there, since this chunk didn't end on a full record
            start-at: index? find/reverse tail data "^/"
        ][
            ; no newline found anywhere in the chunk
            start-at: (index? data) + :size
        ]
    ]
    working-data: parse/all working-data "^/"
    record-cnt: length? working-data
    print ["Record Count " :cnt ": " record-cnt]
    print ["First Record:^/" first working-data]
    print ["Use last record?: " use-last-record?]
    print ["Last Record:^/" last working-data newline newline]
    ; advance the port to the start of the next read
    port: skip port (size + start-at - size)
    cnt: cnt + 1
]
close port
halt
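For readers comparing approaches, here is the same idea sketched in Python (an illustration only, not the REBOL code above): read fixed-size chunks and, when a chunk does not end on a newline, hold the partial tail back until the next read. Buffering the tail achieves the same effect as seeking backward to the last newline. The function name and chunk size are my own choices.

```python
CHUNK_SIZE = 130  # bytes per read, like the 'size word in the REBOL script

def read_records(path, chunk_size=CHUNK_SIZE):
    """Yield complete newline-terminated records from a large file."""
    with open(path, "rb") as f:
        leftover = b""
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                if leftover:
                    yield leftover  # file did not end with a newline
                return
            data = leftover + chunk
            # Split on the LAST newline: everything before it is complete,
            # everything after it is a partial record to carry forward.
            head, sep, leftover = data.rpartition(b"\n")
            if sep:
                for record in head.split(b"\n"):
                    yield record
            # else: no newline in this chunk yet; keep accumulating
```

Because only one chunk plus a partial record is ever in memory, this handles files far larger than RAM, the same property the REBOL /seek version relies on.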
If anyone wants to try this for themselves, here's a sample data file:
cut and paste it, save it to disk, then change the file path in the
script above. I used this data so you can quickly identify which record
you are in. If you save the file, make sure there is an empty line at
the end of the data file.
1 record1 record1recordonerecord1 end
2 recordtwo record2 record 2 record 2 end
3 rec3 recordthree record3 record 3 record3 end
4 record 4 record4 recordfour record14 end
5 recordfive record5 record 5 record 5 end
6 rec6 recordsix record6 record 6 record6 end
7 record 7 record7 recordseven record7 end
8 record 8 record8 recordeight record8 end
9 recordnine record9 record 9 record 9 end
10 rec10 recordten record10 record 10 record10 end
11 record 11 record11 recordeleven record11 end
12 recordtwelve record12 record 12 record 12 end
13 rec13 recordthirteen record13 record 13 record13 end
14 record 14 record14 recordfourteen record14 end
15 recordfifteen record15 record 15 record 15 end
16 rec16 recordsixteen record16 record 16 record16 end
I just finished running the above script on a 900+ MB file and it
processed through to the end with no problem.
Brock