[REBOL] Re: Large file compare
From: gabriele::colellachiara::com at: 8-Jun-2005 16:32
On Wednesday, June 8, 2005, 3:48:14 PM, you wrote:
TM> good hints. So, i now use read/line instead of read and write out the
TM> result from the difference operation immediatly and remove it from
TM> memory. This drops the actual memory consumption to 140 MB during
TM> intersect and 120 MB during difference.
The comment about READ/LINES was not to reduce memory usage, but
because otherwise you are not intersecting two set of lines, but
two sets of characters. The result would be a string of unique
characters that are in both files, which I don't is what you
TM> But i still think it will become too big when operating on the whole
Maybe, but the other option is making it very slow. Since RAM is
quite cheap today while time never is... ;)
Anyway, to read a file line per line:
p: open/direct/lines %some-file.txt
while [line: pick p 1] [print mold line]
TM> As the file content is very trivial like "2348246864;PCINIT2"
TM> and can be sorted, i tink of something like stepping through
TM> the files line by line and compare the line content. The file
TM> which have the lead in the comparison will be alternating,
TM> depending, if there is a difference in the first or second
TM> column. This only works, when both files are sorted
Yes, if the files are sorted, you can follow this approach. It
will be reasonably fast and the memory used is not dependent on
the size of the files. But, if you have to sort the files first,
this may not be the optimal solution.
TM> There will be a minimum memory consumption. But, what i don't know is,
TM> what commands to use as they must remember the positions in the files.
See above. Use two ports.
TM> I will think on this further. Perhaps you have good idea how this could
TM> be implemented. What i don't know know is, if this will be fast enough.
Another possibility, if the files are not sorted, is to keep a
block of line checksums. This will use 16 * number-of-lines bytes
of memory, which is probably much less than the memory required to
load the whole file; and you need to do it for one of the two
The downside is, that since there could, in principle, be
collisions, you'd need to check lines that match with the real
file; if the differences between the two files are small this is
going to be very slow.
Another possibility, could be to process the two files in chunks.
You are after the intersection of two sets, A and B. Let me use u
for union and n for intersection. Assuming that B = B1 u B2,
A n B = (A n B1) u (A n B2)
applying the same for A, with A = A1 u A2,
A n B = ((A1 n B1) u (A2 n B1)) u ((A1 n B2) u (A2 n B2))
A n B = (A1 n B1) u (A2 n B1) u (A1 n B2) u (A2 n B2)
If you have A = A1 u A2 u ... u An and B = B1 u B2 u ... u Bn,
then all you have to do is to compute all the possible Ai n Bj
intersections, and make a union of all of them.
So, you open two ports in /DIRECT/LINES mode, use COPY/PART to get
chunks of N lines, and iterate over the files to generate your
intersection. Seems quite a good strategy to me.
Gabriele Santilli <[g--santilli--tiscalinet--it]> -- REBOL Programmer
Amiga Group Italia sez. L'Aquila --- SOON: http://www.rebol.it/