[REBOL] Re: Working with large files
From: kpeters::otaksoft::com at: 11-Aug-2008 16:59
On Mon, 11 Aug 2008 15:11:56 -0400, Brock Kalef wrote:
> I'm looking to read 800+ MB web log files and process the log prior to
> running through an analysis tool. I'm running into "Out of Memory" errors
> and the odd REBOL crash in attempting to do this.
>
> I started out simply reading the data directly into a word and looping
> through the data. This worked great for the sample data set of 45 MB, but
> then failed on a 430+ MB file, i.e.:
>     data: read/lines %file-name.log
>
> I then changed the direct read to use a port, i.e.:
>     data-port: open/lines %file-name.log
> This worked for the 430+ MB file, but then I started getting the errors
> again for the 800+ MB files.
>
> It's now obvious that I will need to read in portions of the file at a
> time. However, I am unsure how to do this while also ensuring I get all
> the data. As you can see from my earlier example code, I'm interested in
> reading a line at a time for simplicity in processing the records, as
> they are not fixed width (they vary in length). My fear is that I will
> not be able to properly handle the records that are truncated due to the
> size of the data block I retrieve from the file, or at least not be able
> to do this easily. Are there any suggestions?
>
> My guess is that I will need to:
> - pull in a fixed-length block of data
> - read through the data until I reach the first occurrence of a newline,
>   tracking the index of the newline's location
> - continue reading the data until I reach the end of the data block
> - once reaching the end of the data retrieved, calculate where the last
>   processed record ended
> - read the next data block from that point
> - continue until reaching the end of the file
>
> Any other suggestions?
>
> Regards,
> Brock Kalef
Sounds like a plan to me. Just ran this on a 1.9 GB file and it was
surprisingly fast (kept my HD busy for sure):
port: open/seek %/c/apache.log
chunksize: 1'048'576  ; 1 MB chunks
forskip port chunksize [
    chunk: copy/part port chunksize
    ; process chunk here
]
close port
Do you really need to process it line by line, though? That would really
slow it down. Are you sure you cannot operate on the chunks in their
entirety somehow?
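If you do end up needing line-by-line handling, the carry-over scheme in
your plan works well: split each fixed-size block on newlines and prepend
the trailing partial line to the next block. Here is a sketch of that idea
(in Python rather than REBOL, purely for illustration; `process_line` is a
hypothetical stand-in for your per-record analysis):

```python
# Chunked, line-preserving read: fixed-size blocks, split on newlines,
# carry the trailing partial line over into the next block.
CHUNK_SIZE = 1_048_576  # 1 MB, matching the chunk size above


def process_line(line: bytes) -> None:
    pass  # placeholder for per-record analysis


def read_by_lines(path: str, chunk_size: int = CHUNK_SIZE) -> int:
    """Feed every complete line to process_line; return the line count."""
    count = 0
    carry = b""  # partial line left over from the previous block
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk_size)
            if not block:
                break
            block = carry + block
            lines = block.split(b"\n")
            carry = lines.pop()  # last element is a partial line (or b"")
            for line in lines:
                process_line(line)
                count += 1
    if carry:  # file did not end with a newline
        process_line(carry)
        count += 1
    return count
```

Because each block boundary only ever cuts one record, the carried
fragment is at most one line long, so no record is ever lost or processed
twice, regardless of chunk size.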
Cheers,
Kai