Mailing List Archive: Re: Large file compare

[REBOL] Re: Large file compare

From: gabriele::colellachiara::com at: 8-Jun-2005 16:32


Hi Thorsten,

On Wednesday, June 8, 2005, 3:48:14 PM, you wrote:

TM> good hints. So, i now use read/line instead of read and write out the
TM> result from the difference operation immediatly and remove it from
TM> memory. This drops the actual memory consumption to 140 MB during
TM> intersect and 120 MB during difference.

The  comment  about READ/LINES was not to reduce memory usage, but
because  otherwise  you are not intersecting two set of lines, but
two  sets  of  characters.  The result would be a string of unique
characters  that  are  in  both  files,  which I don't is what you
wanted.

TM> But i still think it will become too big when operating on the whole
TM> set.

Maybe,  but  the other option is making it very slow. Since RAM is
quite cheap today while time never is... ;)

Anyway, to read a file line per line:

    p: open/direct/lines %some-file.txt
    while [line: pick p 1] [print mold line]
    close p

TM> As  the file content is very trivial like "2348246864;PCINIT2"
TM> and  can  be sorted, i tink of something like stepping through
TM> the  files line by line and compare the line content. The file
TM> which  have  the  lead  in the comparison will be alternating,
TM> depending,  if  there  is  a difference in the first or second
TM> column.   This   only   works,  when  both  files  are  sorted
TM> identically.

Yes,  if  the  files  are sorted, you can follow this approach. It
will  be  reasonably  fast and the memory used is not dependent on
the  size  of the files. But, if you have to sort the files first,
this may not be the optimal solution.

TM> There will be a minimum memory consumption. But, what i don't know is,
TM> what commands to use as they must remember the positions in the files.

See above. Use two ports.

TM> I will think on this further. Perhaps you have good idea how this could
TM> be implemented. What i don't know know is, if this will be fast enough.

Another  possibility,  if  the  files are not sorted, is to keep a
block  of line checksums. This will use 16 * number-of-lines bytes
of memory, which is probably much less than the memory required to
load  the  whole  file;  and  you need to do it for one of the two
files only.

The  downside  is,  that  since  there  could,  in  principle,  be
collisions,  you'd  need  to  check lines that match with the real
file;  if  the differences between the two files are small this is
going to be very slow.

Another  possibility, could be to process the two files in chunks.
You  are after the intersection of two sets, A and B. Let me use u
for union and n for intersection. Assuming that B = B1 u B2,

   A n B = (A n B1) u (A n B2)

applying the same for A, with A = A1 u A2,

   A n B = ((A1 n B1) u (A2 n B1)) u ((A1 n B2) u (A2 n B2))

   A n B = (A1 n B1) u (A2 n B1) u (A1 n B2) u (A2 n B2)

If  you  have  A  = A1 u A2 u ... u An and B = B1 u B2 u ... u Bn,
then  all  you  have  to do is to compute all the possible Ai n Bj
intersections, and make a union of all of them.

So, you open two ports in /DIRECT/LINES mode, use COPY/PART to get
chunks  of  N  lines,  and iterate over the files to generate your
intersection. Seems quite a good strategy to me.

Regards,
   Gabriele.
--
Gabriele Santilli <[g--santilli--tiscalinet--it]>  --  REBOL Programmer
Amiga Group Italia sez. L'Aquila  ---   SOON: http://www.rebol.it/