[REBOL] Working with large files
From: brock::kalef::innovapost::com at: 11-Aug-2008 15:11
I'm looking to read 800+ MB web log files and process the log prior to
running through an analysis tool. I'm running into "Out of Memory"
errors and the odd REBOL crash when attempting to do this.
I started out simply reading the data directly into a word and looping
through the data. This worked great for the 45 MB sample data set, but
it then failed on a 430+ MB file, i.e. data: read/lines
I then changed the direct read to use a port, i.e. data-port:
open/lines %file-name.log. This worked for the 430+ MB file, but then I
started getting the errors again for the 800+ MB files.
It's now obvious that I will need to read in portions of the file at a
time. However, I am unsure how to do this while also ensuring I get all
the data. As you can see from my earlier example code, I'm interested
in reading a line at a time for simplicity in processing the records as
they are not fixed width (vary in length). My fear is that I will not
be able to properly handle the records that are truncated due to the
size of the data block I retrieve from the file. Or atleast not be able
to do this easily. Are there any suggestions?
My guess is that I will need to:
- pull in a fixed length block of data
- scan through the data until I reach the first occurrence of a newline
- track the index of the location of the newline
- continue reading the data until I reach the end of the data-block
- once reaching the end of the data retrieved, calculate where the last
complete record ended
- read the next data block from that point
- continue until reaching the end of file
Any other suggestions?
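For what it's worth, the steps above might look roughly like this in REBOL 2. This is only a sketch under some assumptions: open/direct/binary should read the file in chunks without buffering it all in memory, chunk-size is an arbitrary value, and process-line is a hypothetical handler standing in for whatever per-record processing you need.

```rebol
chunk-size: 1000000                      ; arbitrary block size, tune as needed
port: open/direct/binary %file-name.log  ; direct port: no whole-file buffering
buffer: copy ""                          ; holds the trailing partial record

while [data: copy/part port chunk-size] [
    append buffer to-string data
    ; peel off every complete line; the truncated last record
    ; stays in buffer and is completed by the next chunk
    while [pos: find buffer newline] [
        process-line copy/part buffer pos   ; hypothetical record handler
        remove/part buffer next pos         ; drop the line and its newline
    ]
]
if not empty? buffer [process-line buffer]  ; final line with no trailing newline
close port
```

Carrying the leftover bytes in buffer avoids having to track newline indexes and re-seek into the file by hand, since each chunk simply appends to whatever partial record the previous chunk left behind.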