[REBOL] Re: Working with large files

From: kpeters::otaksoft::com at: 11-Aug-2008 16:59

On Mon, 11 Aug 2008 15:11:56 -0400, Brock Kalef wrote:
> I'm looking to read 800+ MB web log files and process the log prior to running
> through an analysis tool. I'm running into "Out of Memory" errors and the odd
> REBOL crash in attempting to do this.
>
> I started out simply reading the data directly into a word and looping through
> the data. This worked great for the sample data set of 45 MB but then failed on
> a 430+ MB file, i.e. data: read/lines %file-name.log
>
> I then changed the direct read to use a port, i.e. data-port: open/lines
> %file-name.log. This worked for the 430+ MB file, but then I started getting
> the errors again for the 800+ MB files.
>
> It's now obvious that I will need to read in portions of the file at a time.
> However, I am unsure how to do this while also ensuring I get all the data. As
> you can see from my earlier example code, I'm interested in reading a line at a
> time for simplicity in processing the records, as they are not fixed width
> (they vary in length). My fear is that I will not be able to properly handle
> the records that are truncated by the size of the data block I retrieve from
> the file, or at least not be able to do this easily. Are there any suggestions?
>
> My guess is that I will need to:
> - pull in a fixed-length block of data
> - read through the data until I reach the first occurrence of a newline,
>   tracking the index of the newline's location
> - continue reading until I reach the end of the data block
> - once reaching the end of the data retrieved, calculate where the last
>   processed record ended
> - read the next data block from that point
> - continue until reaching the end of the file
>
> Any other suggestions?
>
> Regards,
> Brock Kalef
Sounds like a plan to me. Just ran this on a 1.9 GB file and it was surprisingly
fast (kept my HD busy for sure):

    port: open/seek %/c/apache.log
    chunksize: 1'048'576   ; 1 MB chunks
    forskip port chunksize [
        chunk: copy/part port chunksize
    ]
    close port

Do you really need to process it line by line, though? That would really slow it
down. Are you sure you cannot operate on the chunks in their entirety somehow?

Cheers,
Kai
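For completeness, a rough sketch of the whole-chunk style Kai suggests, under the assumption that the analysis boils down to pattern counting (here a hypothetical count of " 404 " status fields; occurrences that straddle a chunk boundary would be missed):

    port: open/seek %/c/apache.log
    chunksize: 1'048'576
    count: 0
    forskip port chunksize [
        chunk: copy/part port chunksize
        if binary? chunk [chunk: to string! chunk]
        ; count every occurrence of the pattern inside this chunk
        parse/all chunk [any [thru " 404 " (count: count + 1)]]
    ]
    close port
    print ["404 responses:" count]

Skipping the per-line split means one parse pass per chunk instead of one loop iteration per record, which is the speed-up Kai is hinting at.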