Working with large files

 [1/8] from: brock::kalef::innovapost::com at: 11-Aug-2008 15:11


I'm looking to read 800+ MB web log files and process each log prior to running it through an analysis tool. I'm running into "Out of Memory" errors and the odd Rebol crash in attempting to do this.

I started out simply reading the data directly into a word and looping through the data. This worked great for the sample data set of 45 MB, but then failed on a 430+ MB file, i.e.

    data: read/lines %file-name.log

I then changed the direct read to use a port, i.e.

    data-port: open/lines %file-name.log

This worked for the 430+ MB file, but then I started getting the errors again for the 800+ MB files.

It's now obvious that I will need to read in portions of the file at a time. However, I am unsure how to do this while also ensuring I get all the data. As you can see from my earlier example code, I'm interested in reading a line at a time for simplicity in processing the records, as they are not fixed width (they vary in length). My fear is that I will not be able to properly handle the records that are truncated by the size of the data block I retrieve from the file, or at least not be able to do this easily. Are there any suggestions? My guess is that I will need to:

- pull in a fixed-length block of data
- read through the data until I reach the first occurrence of a newline
- track the index of the location of the newline
- continue reading the data until I reach the end of the data block
- once reaching the end of the data retrieved, calculate where the last complete record ended
- read the next data block from that point
- continue until reaching the end of file

Any other suggestions?

Regards,
Brock Kalef
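A minimal sketch of the plan above, using the open/seek approach that comes up later in this thread. The file name %big.log, the chunk size, and the empty processing comments are assumptions for illustration; the key point is that only the text up to the last newline in each block is treated as complete records, and the remainder is carried over into the next read.

    rebol []

    ; illustrative sketch only: %big.log and chunk-size are assumptions
    port: open/seek %big.log
    chunk-size: 1'048'576                      ; bytes to pull per read
    buffer: copy ""                            ; carries a partial record between reads

    while [not tail? port] [
        chunk: copy/part port chunk-size
        port: skip port length? chunk
        append buffer chunk
        ; find the last newline; everything before it is complete records
        last-nl: find/reverse tail buffer "^/"
        if last-nl [
            foreach record parse/all copy/part buffer last-nl "^/" [
                ; process one complete record here
            ]
            buffer: copy next last-nl          ; keep the trailing partial record
        ]
        ; if a chunk contains no newline at all, the buffer simply keeps
        ; growing until one arrives in a later chunk
    ]
    if not empty? buffer [
        ; process the final record if the file doesn't end with a newline
    ]
    close port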

 [2/8] from: kpeters::otaksoft::com at: 11-Aug-2008 16:59


On Mon, 11 Aug 2008 15:11:56 -0400, Brock Kalef wrote:
> I'm looking to read 800+ MB web log files and process the log prior to running through an
> analysis tool. I'm running into "Out of Memory" errors and the odd Rebol Crash in attempting to
<<quoted lines omitted: 20>>
> Regards,
> Brock Kalef
Sounds like a plan to me. Just ran this on a 1.9 GB file and it was surprisingly fast (it kept my HD busy for sure):

    port: open/seek %/c/apache.log
    chunksize: 1'048'576   ; 1 MB chunks
    forskip port chunksize [
        chunk: copy/part port chunksize
    ]
    close port

Do you really need to process it line by line, though? That would really slow it down. Are you sure you cannot operate on the chunks in their entirety somehow?

Cheers,
Kai
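A minimal sketch of the chunk-level idea Kai raises, reusing his example path %/c/apache.log and chunk size; the newline-counting body is an illustrative assumption. Some per-file statistics, such as a simple line count, can be gathered on whole chunks without ever splitting them into lines, since a record straddling a chunk boundary still contributes exactly one newline overall.

    rebol []

    ; sketch of chunk-level processing: count newlines per chunk
    port: open/seek %/c/apache.log
    chunksize: 1'048'576                 ; 1 MB chunks, as in Kai's example
    line-count: 0

    forskip port chunksize [
        chunk: copy/part port chunksize
        ; count the newlines in this chunk
        pos: chunk
        while [pos: find pos "^/"] [
            line-count: line-count + 1
            pos: next pos
        ]
    ]
    close port
    print ["lines:" line-count]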

 [3/8] from: tim-johnsons::web::com at: 11-Aug-2008 12:36


Hi Brock:

Have you tried using 'open instead of 'read? I use open with the /direct refinement on large files. Example:

    inf: open/direct/lines file
    while [L: pick inf 1] [
        ;; do things with L
    ]
    close inf
> help read
USAGE:
    READ source /binary /string /direct /no-wait /lines /part size /with end-of-line /mode args /custom params /skip length

DESCRIPTION:
    Reads from a file, url, or port-spec (block or object).
    READ is a native value.

ARGUMENTS:
    source -- (Type: file url object block)

REFINEMENTS:
    /binary -- Preserves contents exactly.
    /string -- Translates all line terminators.
    /direct -- Opens the port without buffering.
    /no-wait -- Returns immediately without waiting if no data.
    /lines -- Handles data as lines.
    /part -- Reads a specified amount of data.
        size -- (Type: number)
    /with -- Specifies alternate line termination.
        end-of-line -- (Type: char string)
    /mode -- Block of above refinements.
        args -- (Type: block)
    /custom -- Allows special refinements.
        params -- (Type: block)
    /skip -- Skips a number of bytes.
        length -- (Type: number)

HTH
Tim

On Monday 11 August 2008, Brock Kalef wrote:
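A minimal sketch fleshing out Tim's skeleton, assuming a hypothetical file %big.log and an arbitrary " 404 " substring test. The /direct/lines port streams one line per pick, so memory use stays flat regardless of file size.

    rebol []

    ; illustrative only: %big.log and the " 404 " match are assumptions
    inf: open/direct/lines %big.log
    total: 0
    hits: 0

    while [line: pick inf 1] [           ; pick returns none at end of file
        total: total + 1
        if find line " 404 " [hits: hits + 1]
    ]
    close inf
    print [total "lines read," hits "contained ' 404 '"]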

 [4/8] from: brock:kalef:innovapost at: 12-Aug-2008 9:05


Kai,

Yes, I'm going to need to use the /seek option. I was trying to avoid it, but it looks like it is the only way to go.

The records that I am working with, although not fixed width, are tab delimited. I could likely come up with a way to work on a fixed record size using skip etc., but I think it may be just as easy to check whether the last character of the block is a #"^/"; if it isn't, I'd ignore that partial record and start the next block at the beginning of that record. I should be able to do that easily enough using 'index?. I've been playing with it a little and it looks very feasible to implement with minimal pain. Whether or not it slows things down isn't too big a concern.

Cheers, and thanks for your reply.
Brock

 [5/8] from: brock:kalef:innovapost at: 12-Aug-2008 9:06


Tim,

Thanks for your reply. Yes, I had been looking at Carl's Large Files examples and used open, but it wouldn't work on the really large files unless I used the /seek option. Using /seek, I am then forced to retrieve a block of content at a time. It seems like that's going to be the way I have to work with this file.

Thanks again.
Brock

 [6/8] from: jonwhispa::googlemail::com at: 12-Aug-2008 16:23


There is also a /with refinement to specify an additional line terminator:

    open/direct/lines/with %file ","

It seems that works on both the "," and newline. Using Tim's suggestion of checking the last char for a newline, then doing a remove, a second pick, and a rejoin should fix that.

Jon
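A minimal sketch of Jon's /with suggestion, assuming a hypothetical comma-and-newline delimited file %fields.log; per Jon's observation above, the port should then yield one value per pick, split on either terminator.

    rebol []

    ; illustrative only: %fields.log is a made-up file name
    inf: open/direct/lines/with %fields.log ","
    while [field: pick inf 1] [
        ; each value delimited by "," or newline arrives here
        print field
    ]
    close inf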

 [7/8] from: brock:kalef:innovapost at: 12-Aug-2008 13:51


Thanks to everyone for their feedback/suggestions.

I seem to have a solution that will backtrack to the starting point of any incomplete record. This should work on any data that is newline terminated. You can set the amount of data to grab in each call for a new batch of data using the 'size word; the number represents the number of bytes to copy from the data file.

    rebol []

    port: open/seek %"Sample data/simplified.log"
    size: 130
    cnt: 1

    while [not tail? port] [
        data: copy/part port size
        working-data: copy data
        either (last working-data) = #"^/" [
            use-last-record?: true
            start-at: (index? data) + :size
        ][
            use-last-record?: false
            either not error? try [start-at: (index? find/reverse tail data "^/")][
                ; new starting point of the next read, since the block didn't end in a full record
                start-at: (index? find/reverse tail data "^/")
            ][
                start-at: (index? data) + :size
            ]
        ]
        working-data: parse/all working-data "^/"
        record-cnt: length? working-data
        print ["Record Count " :cnt ": " record-cnt]
        print ["First Record:^/" first working-data]
        print ["Use last record?: " use-last-record?]
        print ["Last Record:^/" last working-data newline newline]
        port: skip port (size + start-at - size)
        cnt: cnt + 1
    ]
    close port
    halt

If anyone wants to try this for themselves, here's a sample data file that can be cut, saved to disk, and then used by changing the file path in the script above. I used this data to be able to quickly identify which record you are in. If you save the file, make sure there is an empty line at the end of the data file.

    1 record1 record1recordonerecord1 end
    2 recordtwo record2 record 2 record 2 end
    3 rec3 recordthree record3 record 3 record3 end
    4 record 4 record4 recordfour record14 end
    5 recordfive record5 record 5 record 5 end
    6 rec6 recordsix record6 record 6 record6 end
    7 record 7 record7 recordseven record7 end
    8 record 8 record8 recordeight record8 end
    9 recordnine record9 record 9 record 9 end
    10 rec10 recordten record10 record 10 record10 end
    11 record 11 record11 recordeleven record11 end
    12 recordtwelve record12 record 12 record 12 end
    13 rec13 recordthirteen record13 record 13 record13 end
    14 record 14 record14 recordfourteen record14 end
    15 recordfifteen record15 record 15 record 15 end
    16 rec16 recordsixteen record16 record 16 record16 end

I just finished running the above script on a 900+ MB file and it processed through to the end with no problem.

Brock

 [8/8] from: tim-johnsons::web::com at: 12-Aug-2008 9:59


On Tuesday 12 August 2008, CarleySimon wrote:
> There is also a /with refinement to specify additional line terminators
>
> open/direct/lines/with %file ","
>
> It seems that works on both the "," and newline.
> Using Tim's suggestion and checking the last char for a newline and doing a
> remove, second pick and a rejoin should fix that.
> Jon
And of course, results and methods could vary with the OS and the available memory. open/direct/lines works for me on files up to 1 GB on Linux with 3 GB of RAM and 3 GB of swap space.

tj

Notes
  • Quoted lines have been omitted from some messages.