[REBOL] Re: Working with large files
From: kpeters::otaksoft::com at: 11-Aug-2008 16:59
On Mon, 11 Aug 2008 15:11:56 -0400, Brock Kalef wrote:
> I'm looking to read 800+ MB web log files and process the log prior to
> running through an analysis tool. I'm running into "Out of Memory" errors
> and the odd REBOL crash in attempting to do this.
>
> I started out simply reading the data directly into a word and looping
> through the data. This worked great for the sample data set of 45 MB, but
> then failed on a 430+ MB file, i.e.:
>     data: read/lines %file-name.log
>
> I then changed the direct read to use a port, i.e.:
>     data-port: open/lines %file-name.log
> This worked for the 430+ MB file, but then I started getting the errors
> again for the 800+ MB files.
>
> It's now obvious that I will need to read in portions of the file at a
> time. However, I am unsure how to do this while also ensuring I get all
> the data. As you can see from my earlier example code, I'm interested in
> reading a line at a time for simplicity in processing the records, as
> they are not fixed width (they vary in length). My fear is that I will
> not be able to properly handle the records that are truncated due to the
> size of the data block I retrieve from the file, or at least not be able
> to do this easily. Are there any suggestions?
>
> My guess is that I will need to:
> - pull in a fixed-length block of data
> - read through the data until I reach the first occurrence of a newline,
>   tracking the index of the newline's location
> - continue reading the data until I reach the end of the data block
> - once reaching the end of the data retrieved, calculate where the last
>   processed record ended
> - read the next data block from that point
> - continue until reaching the end of the file
>
> Any other suggestions?
>
> Regards,
> Brock Kalef
Sounds like a plan to me. Just ran this on a 1.9 GB file and it was
surprisingly fast (kept my HD busy for sure):
port: open/seek %/c/apache.log
chunksize: 1'048'576  ; 1 MB chunks
forskip port chunksize [
    chunk: copy/part port chunksize
    ; process chunk here
]
close port
Do you really need to process it line by line, though? That would really
slow it down. Are you sure you cannot operate on the chunks in their
entirety somehow?
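If you do end up needing line-by-line handling, the carry-over scheme in
your plan works well: split each fixed-size block on newlines and prepend
the trailing partial line to the next block. Here is a sketch of that idea
(in Python rather than REBOL, purely for illustration; `process_line` is a
hypothetical stand-in for your per-record analysis):

```python
# Chunked, line-preserving read: fixed-size blocks, split on newlines,
# carry the trailing partial line over into the next block.
CHUNK_SIZE = 1_048_576  # 1 MB, matching the chunk size above


def process_line(line: bytes) -> None:
    pass  # placeholder for per-record analysis


def read_by_lines(path: str, chunk_size: int = CHUNK_SIZE) -> int:
    """Feed every complete line to process_line; return the line count."""
    count = 0
    carry = b""  # partial line left over from the previous block
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk_size)
            if not block:
                break
            block = carry + block
            lines = block.split(b"\n")
            carry = lines.pop()  # last element is a partial line (or b"")
            for line in lines:
                process_line(line)
                count += 1
    if carry:  # file did not end with a newline
        process_line(carry)
        count += 1
    return count
```

Because each block boundary only ever cuts one record, the carried
fragment is at most one line long, so no record is ever lost or processed
twice, regardless of chunk size.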
Cheers,
Kai