Mailing List Archive: 49091 messages
  • Home
  • Script library
  • AltME Archive
  • Mailing list
  • Articles Index
  • Site search
 

[REBOL] Re: Large file compare

From: gabriele::colellachiara::com at: 8-Jun-2005 16:32

Hi Thorsten, On Wednesday, June 8, 2005, 3:48:14 PM, you wrote: TM> good hints. So, i now use read/line instead of read and write out the TM> result from the difference operation immediatly and remove it from TM> memory. This drops the actual memory consumption to 140 MB during TM> intersect and 120 MB during difference. The comment about READ/LINES was not to reduce memory usage, but because otherwise you are not intersecting two set of lines, but two sets of characters. The result would be a string of unique characters that are in both files, which I don't is what you wanted. TM> But i still think it will become too big when operating on the whole TM> set. Maybe, but the other option is making it very slow. Since RAM is quite cheap today while time never is... ;) Anyway, to read a file line per line: p: open/direct/lines %some-file.txt while [line: pick p 1] [print mold line] close p TM> As the file content is very trivial like "2348246864;PCINIT2" TM> and can be sorted, i tink of something like stepping through TM> the files line by line and compare the line content. The file TM> which have the lead in the comparison will be alternating, TM> depending, if there is a difference in the first or second TM> column. This only works, when both files are sorted TM> identically. Yes, if the files are sorted, you can follow this approach. It will be reasonably fast and the memory used is not dependent on the size of the files. But, if you have to sort the files first, this may not be the optimal solution. TM> There will be a minimum memory consumption. But, what i don't know is, TM> what commands to use as they must remember the positions in the files. See above. Use two ports. TM> I will think on this further. Perhaps you have good idea how this could TM> be implemented. What i don't know know is, if this will be fast enough. Another possibility, if the files are not sorted, is to keep a block of line checksums. This will use 16 * number-of-lines bytes of memory, which is probably much less than the memory required to load the whole file; and you need to do it for one of the two files only. The downside is, that since there could, in principle, be collisions, you'd need to check lines that match with the real file; if the differences between the two files are small this is going to be very slow. Another possibility, could be to process the two files in chunks. You are after the intersection of two sets, A and B. Let me use u for union and n for intersection. Assuming that B = B1 u B2, A n B = (A n B1) u (A n B2) applying the same for A, with A = A1 u A2, A n B = ((A1 n B1) u (A2 n B1)) u ((A1 n B2) u (A2 n B2)) A n B = (A1 n B1) u (A2 n B1) u (A1 n B2) u (A2 n B2) If you have A = A1 u A2 u ... u An and B = B1 u B2 u ... u Bn, then all you have to do is to compute all the possible Ai n Bj intersections, and make a union of all of them. So, you open two ports in /DIRECT/LINES mode, use COPY/PART to get chunks of N lines, and iterate over the files to generate your intersection. Seems quite a good strategy to me. Regards, Gabriele. -- Gabriele Santilli <[g--santilli--tiscalinet--it]> -- REBOL Programmer Amiga Group Italia sez. L'Aquila --- SOON: http://www.rebol.it/