Working with large files
[1/8] from: brock::kalef::innovapost::com at: 11-Aug-2008 15:11
I'm looking to read 800+ MB web log files and process the log prior to
running through an analysis tool. I'm running into "Out of Memory"
errors and the odd Rebol Crash in attempting to do this.
I started out simply reading the data directly into a word and looping
through the data. This worked great for the sample data set of 45 MB,
but then failed on a 430+ MB file, i.e. data: read/lines
%file-name.log
I then changed the direct read to use a port, i.e. data-port:
open/lines %file-name.log. This worked for the 430+ MB file, but then I
started getting the errors again for the 800+ MB files.
It's now obvious that I will need to read in portions of the file at a
time. However, I am unsure how to do this while also ensuring I get all
the data. As you can see from my earlier example code, I'm interested
in reading a line at a time for simplicity in processing the records as
they are not fixed width (vary in length). My fear is that I will not
be able to properly handle the records that are truncated due to the
size of the data block I retrieve from the file, or at least not be able
to do this easily. Are there any suggestions?
My guess is that I will need to:
- pull in a fixed length block of data
- read to the data until I reach the first occurrence of a newline
- track the index of the location of the newline
- continue reading the data until I reach the end of the data-block
- once reaching the end of the data retrieved, calculate where the last
processed record ended
- read the next data block from that point
- continue until reaching the end of file
Any other suggestions?
Regards,
Brock Kalef
[2/8] from: kpeters::otaksoft::com at: 11-Aug-2008 16:59
On Mon, 11 Aug 2008 15:11:56 -0400, Brock Kalef wrote:
> I'm looking to read 800+ MB web log files and process the log prior to running through an
> analysis tool. I'm running into "Out of Memory" errors and the odd Rebol Crash in attempting to
<<quoted lines omitted: 20>>
> Regards,
> Brock Kalef
Sounds like a plan to me. Just ran this on a 1.9 GB file and it was surprisingly fast
(kept my HD busy for sure):
port: open/seek %/c/apache.log
chunksize: 1'048'576 ; 1 MB chunks
forskip port chunksize [
    chunk: copy/part port chunksize
]
close port
Do you really need to process it line by line, though? That would really slow it down.
Are you sure you cannot operate on the chunks in their entirety somehow?
Cheers,
Kai
[3/8] from: tim-johnsons::web::com at: 11-Aug-2008 12:36
Hi Brock:
Have you tried using 'open instead of read?
I use open with the direct refinement on large files:
Example:
inf: open/direct/lines file
while [L: pick inf 1] [
    ;; do things with L
]
close inf
> help read
USAGE:
READ source /binary /string /direct /no-wait /lines /part size /with
end-of-line /mode args /custom params /skip length
DESCRIPTION:
Reads from a file, url, or port-spec (block or object).
READ is a native value.
ARGUMENTS:
source -- (Type: file url object block)
REFINEMENTS:
/binary -- Preserves contents exactly.
/string -- Translates all line terminators.
/direct -- Opens the port without buffering.
/no-wait -- Returns immediately without waiting if no data.
/lines -- Handles data as lines.
/part -- Reads a specified amount of data.
size -- (Type: number)
/with -- Specifies alternate line termination.
end-of-line -- (Type: char string)
/mode -- Block of above refinements.
args -- (Type: block)
/custom -- Allows special refinements.
params -- (Type: block)
/skip -- Skips a number of bytes.
length -- (Type: number)
HTH
Tim
On Monday 11 August 2008, Brock Kalef wrote:
[4/8] from: brock:kalef:innovapost at: 12-Aug-2008 9:05
Kai,
Yes, I'm going to need to use the /seek option. I was trying to avoid
it but it looks like it is the only way to go.
The records I am working with, although not fixed width, are tab
delimited. I could likely come up with a way to work on the fixed
record size using skip etc., but I think it may be just as easy to manage
by checking whether the last character of the block is a #"^/" and, if not,
ignoring that record, then starting the next block at the start of
this record. I should be able to do that easily enough using 'index?.
I've been playing with it a little and it looks very feasible to implement
with minimal pain. Whether it will slow things down or not isn't too big a
concern.
Cheers, and thanks for your reply.
Brock
[5/8] from: brock:kalef:innovapost at: 12-Aug-2008 9:06
Tim,
Thanks for your reply. Yes, I had been looking at Carl's Large Files
examples and used open, but it wouldn't work on the really large files
unless I used the /seek option. Using this, I am then forced to
retrieve a block of content at a time. It seems like it's going to be
the way I have to work with this file.
Thanks again.
Brock
[6/8] from: jonwhispa::googlemail::com at: 12-Aug-2008 16:23
There is also a /with refinement to specify additional line terminators:
open/direct/lines/with %file ","
It seems that works on both the "," and newline.
Using Tim's suggestion of checking the last char for a newline, a
remove, a second pick, and a rejoin should fix that.
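The behaviour described here, splitting on either "," or newline, can be mimicked in Python with a character-class regex. This is a sketch for experimenting with the idea, not a claim about REBOL's exact /with semantics:

```python
import re

def split_records(chunk):
    """Split a chunk on either comma or newline terminators,
    mirroring the effect attributed above to
    open/direct/lines/with %file ",". Empty pieces (e.g. from a
    trailing terminator) are dropped."""
    return [piece for piece in re.split(r"[,\n]", chunk) if piece]
```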
Jon
[7/8] from: brock:kalef:innovapost at: 12-Aug-2008 13:51
Thanks to everyone for their feedback/suggestions.
I seem to have a solution that backtracks to the starting point of
any incomplete record. It should work on any data that is newline
terminated, and you can set the amount of data to grab in each batch
via the 'size word; the number represents the number of bytes to copy
from the data file.
rebol []
port: open/seek %"Sample data/simplified.log"
size: 130
cnt: 1
while [not tail? port] [
    data: copy/part port size
    working-data: copy data
    either (last working-data) = #"^/" [
        use-last-record?: true
        start-at: (index? data) + :size
    ][
        use-last-record?: false
        either not error? try [start-at: (index? find/reverse tail data "^/")][
            ; new starting point of the next read, since the block
            ; didn't end in a full record
            start-at: (index? find/reverse tail data "^/")
        ][
            start-at: (index? data) + :size
        ]
    ]
    working-data: parse/all working-data "^/"
    record-cnt: length? working-data
    print ["Record Count " :cnt ": " record-cnt]
    print ["First Record:^/" first working-data]
    print ["Use last record?: " use-last-record?]
    print ["Last Record:^/" last working-data newline newline]
    port: skip port (size + start-at - size)
    cnt: cnt + 1
]
close port
halt
If anyone wants to try this for themselves, here's a sample data file
that can be copied and saved to disk; then change the file path in
the script above. I used this data to be able to quickly identify which
record you are in. If you save the file, make sure there is an empty
line at the end of the data file.
1 record1 record1recordonerecord1 end
2 recordtwo record2 record 2 record 2 end
3 rec3 recordthree record3 record 3 record3 end
4 record 4 record4 recordfour record14 end
5 recordfive record5 record 5 record 5 end
6 rec6 recordsix record6 record 6 record6 end
7 record 7 record7 recordseven record7 end
8 record 8 record8 recordeight record8 end
9 recordnine record9 record 9 record 9 end
10 rec10 recordten record10 record 10 record10 end
11 record 11 record11 recordeleven record11 end
12 recordtwelve record12 record 12 record 12 end
13 rec13 recordthirteen record13 record 13 record13 end
14 record 14 record14 recordfourteen record14 end
15 recordfifteen record15 record 15 record 15 end
16 rec16 recordsixteen record16 record 16 record16 end
I just finished running the above script on a 900+ MB file and it
processed through to the end with no problem.
Brock
[8/8] from: tim-johnsons::web::com at: 12-Aug-2008 9:59
On Tuesday 12 August 2008, CarleySimon wrote:
> There is also a /with refinement to specify additional line terminators
>
> open/direct/lines/with %file ","
>
> It seems that works on both the "," and newline.
> Using Tim's suggestion of checking the last char for a newline, a
> remove, a second pick, and a rejoin should fix that.
> Jon
And of course, results and methods could vary with the OS and the available
memory. open/direct/lines works for me on files up to 1GB on linux with 3GB
of RAM and 3GB of swap space.
tj