Large file compare
[1/14] from: valleyroad::gmx::de at: 8-Jun-2005 11:53
Hi,
I want to do a large file compare with REBOL. I already wrote a script to do
this, which works fine and compares two test files, containing more than
800000 lines each, within 13 minutes. The only problem is that the process
consumes about 180 MB of memory and 95% CPU. The CPU is ok, but I think the
problem will be the memory, as the target volume is to compare about 6.7
million lines per file.
What I am doing is reading the two files (a: read %testfile1.txt), doing an
intersect and making a diff on the two files like this:
a: read %testfile1.txt
b: read %testfile2.txt
inboth: intersect a b
only_a: difference inboth a
only_b: difference inboth b
My question is whether there are better ways in REBOL to achieve the same
with less memory consumption.
Thanks for any hints, ideas......
Thorsten
[2/14] from: gabriele::colellachiara::com at: 8-Jun-2005 12:31
Hi Thorsten,
On Wednesday, June 8, 2005, 11:53:08 AM, you wrote:
TM> a: read %testfile1.txt
TM> b: read %testfile2.txt
Did you mean READ/LINES?
TM> inboth: intersect a b
TM> only_a: difference inboth a
TM> only_b: difference inboth b
TM> My question is whether there are better ways in REBOL to achieve the same
TM> with less memory consumption.
Yes - don't load the whole files in memory. :-)
Is the difference going to be big too? If so you may want to avoid
keeping it in memory too.
OTOH, if you have enough memory for the operation, doing it all in
memory is going to be much faster.
Regards,
Gabriele.
--
Gabriele Santilli <[g--santilli--tiscalinet--it]> -- REBOL Programmer
Amiga Group Italia sez. L'Aquila --- SOON: http://www.rebol.it/
[3/14] from: valleyroad:gmx at: 8-Jun-2005 15:48
Hi Gabriele,
good hints. So, I now use read/lines instead of read and write out the
result from the difference operation immediately and remove it from
memory. This drops the actual memory consumption to 140 MB during
intersect and 120 MB during difference.
But I still think it will become too big when operating on the whole
set.
As the file content is very trivial, like "2348246864;PCINIT2", and can be
sorted, I think of something like stepping through the files line by line
and comparing the line content. The file which has the lead in the
comparison will alternate, depending on whether there is a difference in
the first or second column. This only works when both files are sorted
identically.
There will be minimal memory consumption. But what I don't know is
what commands to use, as they must remember the positions in the files.
I will think on this further. Perhaps you have a good idea how this could
be implemented. What I also don't know is whether this will be fast enough.
Thanks
Thorsten
On Wed, 8 Jun 2005 12:31:56 +0200, "Gabriele Santilli"
<[gabriele--colellachiara--com]> said:
> Hi Thorsten,
> On Wednesday, June 8, 2005, 11:53:08 AM, you wrote:
<<quoted lines omitted: 20>>
> To unsubscribe from the list, just send an email to
> lists at rebol.com with unsubscribe as the subject.
--
Melian Solutions
Thorsten Moeller
Mail: [tmoeller--fastmail--fm]
[4/14] from: gabriele::colellachiara::com at: 8-Jun-2005 16:32
Hi Thorsten,
On Wednesday, June 8, 2005, 3:48:14 PM, you wrote:
TM> good hints. So, I now use read/lines instead of read and write out the
TM> result from the difference operation immediately and remove it from
TM> memory. This drops the actual memory consumption to 140 MB during
TM> intersect and 120 MB during difference.
The comment about READ/LINES was not to reduce memory usage, but
because otherwise you are not intersecting two sets of lines, but
two sets of characters. The result would be a string of the unique
characters that are in both files, which I don't think is what you
wanted.
TM> But I still think it will become too big when operating on the whole
TM> set.
Maybe, but the other option is making it very slow. Since RAM is
quite cheap today while time never is... ;)
Anyway, to read a file line per line:
p: open/direct/lines %some-file.txt
while [line: pick p 1] [print mold line]
close p
TM> As the file content is very trivial, like "2348246864;PCINIT2",
TM> and can be sorted, I think of something like stepping through
TM> the files line by line and comparing the line content. The file
TM> which has the lead in the comparison will alternate,
TM> depending on whether there is a difference in the first or
TM> second column. This only works when both files are sorted
TM> identically.
Yes, if the files are sorted, you can follow this approach. It
will be reasonably fast and the memory used is not dependent on
the size of the files. But, if you have to sort the files first,
this may not be the optimal solution.
TM> There will be minimal memory consumption. But what I don't know is
TM> what commands to use, as they must remember the positions in the files.
See above. Use two ports.
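For what it's worth, here is a rough, untested sketch of that
two-port merge, assuming both files are sorted the same way (the
file names are placeholders, and the differences are simply
collected in memory for simplicity):
pa: open/direct/lines %testfile1.txt
pb: open/direct/lines %testfile2.txt
only-a: copy []  only-b: copy []
la: pick pa 1
lb: pick pb 1
while [all [la lb]] [
    either la = lb [
        la: pick pa 1  lb: pick pb 1         ; line present in both files
    ][
        either la < lb [
            append only-a la  la: pick pa 1  ; only in the first file
        ][
            append only-b lb  lb: pick pb 1  ; only in the second file
        ]
    ]
]
while [la] [append only-a la  la: pick pa 1] ; drain whichever file is left
while [lb] [append only-b lb  lb: pick pb 1]
close pa  close pb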
TM> I will think on this further. Perhaps you have a good idea how this could
TM> be implemented. What I also don't know is whether this will be fast enough.
Another possibility, if the files are not sorted, is to keep a
block of line checksums. This will use 16 * number-of-lines bytes
of memory, which is probably much less than the memory required to
load the whole file; and you need to do it for one of the two
files only.
The downside is that since there could, in principle, be
collisions, you'd need to verify matching lines against the real
file; if the differences between the two files are small, this is
going to be very slow.
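A rough, untested sketch of this (it only finds lines of the first
file that are not in the second, and leaves out the collision check;
the hash! is only there to make the lookups fast):
sums: make hash! 800000
p: open/direct/lines %testfile2.txt
while [line: pick p 1] [append sums checksum/method line 'md5]
close p
only-a: copy []
p: open/direct/lines %testfile1.txt
while [line: pick p 1] [
    if not find sums checksum/method line 'md5 [append only-a line]
]
close p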
Another possibility could be to process the two files in chunks.
You are after the intersection of two sets, A and B. Let me use u
for union and n for intersection. Assuming that B = B1 u B2,
A n B = (A n B1) u (A n B2)
applying the same for A, with A = A1 u A2,
A n B = ((A1 n B1) u (A2 n B1)) u ((A1 n B2) u (A2 n B2))
A n B = (A1 n B1) u (A2 n B1) u (A1 n B2) u (A2 n B2)
If you have A = A1 u A2 u ... u An and B = B1 u B2 u ... u Bn,
then all you have to do is to compute all the possible Ai n Bj
intersections, and make a union of all of them.
So, you open two ports in /DIRECT/LINES mode, use COPY/PART to get
chunks of N lines, and iterate over the files to generate your
intersection. Seems quite a good strategy to me.
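Untested, but something along these lines, perhaps (note that the
second file gets re-read once per chunk of the first, and the
accumulated intersection still ends up in memory):
n: 100000   ; lines per chunk
inboth: copy []
pa: open/direct/lines %testfile1.txt
while [all [chunk-a: copy/part pa n  not empty? chunk-a]] [
    pb: open/direct/lines %testfile2.txt
    while [all [chunk-b: copy/part pb n  not empty? chunk-b]] [
        inboth: union inboth intersect chunk-a chunk-b
    ]
    close pb
]
close pa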
Regards,
Gabriele.
--
Gabriele Santilli <[g--santilli--tiscalinet--it]> -- REBOL Programmer
Amiga Group Italia sez. L'Aquila --- SOON: http://www.rebol.it/
[5/14] from: andreas::bolka::gmx::net at: 8-Jun-2005 17:25
Wednesday, June 8, 2005, 11:53:08 AM, Thorsten wrote:
> I want to do a large file compare with REBOL. I already wrote a script to do
> this, which works fine and compares two test files, containing more than
> 800000 lines each, within 13 minutes.
^^^^^^^^^^^^^^^^^
are you sure about that number? I just tried a
i: intersect read/lines %f1.txt read/lines %f2.txt
where f1.txt and f2.txt are both files with roughly 800000 lines
(~14 MB each). It took 0:02:47 on my machine (even that is still awful,
but 0:13:00 would be an order of magnitude more awful ;).
[tested on a p4-m 1.6ghz, 1GB ram, 5400rpm hdd]
--
Best regards,
Andreas
[6/14] from: tom:conlin:g:mail at: 8-Jun-2005 13:17
I don't have time to try/test any of this, so some of the logic may be
reversed, but might a single-pass approach help?
;;; a and b are sorted blocks of lines;
;;; in-both, only_a and only_b start out as empty blocks
while [all [not tail? a  not tail? b]] [
    either equal? first a first b [
        append in-both first a
        a: next a
        b: next b
    ][
        either greater? first a first b [
            ;; first b is smaller, so it cannot appear later in a
            append only_b first b  b: next b
        ][
            append only_a first a  a: next a
        ]
    ]
]
;;; in case one finishes before the other
while [not tail? a] [append only_a first a  a: next a]
while [not tail? b] [append only_b first b  b: next b]
On 6/8/05, Thorsten Moeller <[valleyroad--gmx--de]> wrote:
> Hi Gabriele,
> good hints. So, i now use read/line instead of read and write out the
<<quoted lines omitted: 67>>
> To unsubscribe from the list, just send an email to
> lists at rebol.com with unsubscribe as the subject.
--
... nice weather eh
[7/14] from: robert:muench:robertmuench at: 9-Jun-2005 9:51
On Wed, 08 Jun 2005 16:32:03 +0200, Gabriele Santilli
<[gabriele--colellachiara--com]> wrote:
> Another possibility, could be to process the two files in chunks.
Hi, that's the approach I take. Use a random number generator seeded
with the filename, for example. Pick some 100KB parts from those big
files and compare them with 'checksum. Repeat until you find the first
difference. The chances that you will hit a difference are quite high
if you always include the first and last 100KB.
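A simplified, untested variant of that (sequential 100KB chunks
rather than randomly picked ones, since a plain /direct port is read
front to back; it only tells you whether the files differ, not where):
chunk: 100 * 1024
chunk-sums: func [file /local p sums data] [
    p: open/binary/direct file
    sums: copy []
    while [all [data: copy/part p chunk  not empty? data]] [
        append sums checksum data   ; one checksum per 100KB chunk
    ]
    close p
    sums
]
print either (chunk-sums %testfile1.txt) = (chunk-sums %testfile2.txt)
    ["files appear identical"] ["files differ"]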
--
Robert M. Münch
Management & IT Freelancer
Mobile: +49 (177) 245 2802
http://www.robertmuench.de
[9/14] from: SunandaDH:aol at: 8-Jun-2005 11:18
Thorsten:
> As the file content is very trivial, like "2348246864;PCINIT2", and can be
> sorted, I think of something like stepping through the files line by line
> and comparing the line content
REBOL.org uses a compare utility to let script owners see changes in
different versions of a script they've contributed to the Library. It is not currently
publicly available.
I wrote it. I was concerned not to use recursion, as REBOL has a small
stack for that; this limited the possible approaches.
As scripts are line-oriented things, I used the approach you suggest: sort
both files, use a merge/compare to remove matching lines. On the rest, apply
some heuristics to distinguish inserts and deletions from block moves.
It works best where there are not a large number of identical lines. Which is
why, by default, it ignores blank lines when comparing ..... which, as a
happy side-effect, is usually what you'd want when comparing source code versions.
It wouldn't work on files of the size you have targeted, as it acts on blocks in
memory. But as most of its processing involves running up and down blocks
flagging things, there is no in-principle reason why the same logic couldn't work on
files -- given that read/direct works.
Is that enough hints for now?
Sunanda.
[10/14] from: robert::muench::robertmuench::de at: 11-Jun-2005 12:48
On Thu, 09 Jun 2005 09:51:35 +0200, Robert M. Münch
<[robert--muench--robertmuench--de]> wrote:
> Hi, that's the approach I take. Use a random number generator seeded
>
I just see that my posts have those =20 added. Anyone an idea where
this comes from? Robert
[11/14] from: SunandaDH:aol at: 11-Jun-2005 9:54
Robert:
> I just see that my posts have those =20 added. Anyone an idea
> where =20 this comes from?
No idea...... You'll probably see that Tom's post in the same thread has its
tabs replaced with =09, thus messing up the viewing or copying of his script.
A great many emails seem to get =20 codes added at the end of lines.
It seems if an email passes through enough layers of software, all of which
wish to escape, correct, or translate charsets, eventually what you'll receive
will be total gibberish.
Carl would probably call that another reason why email is dead.
The ML archive makes some attempts to reverse the damage, eg:
http://www.rebol.org/cgi-bin/cgiwrap/rebol/ml-display-thread.r?m=rmlSXVC
Sunanda.
[12/14] from: tom:conlin:gmai:l at: 11-Jun-2005 13:50
Hmmm
I see the escaped chars in Robert's message but not in mine,
so it may also be locale-dependent (and a reason to go back to pine ...)
On 6/11/05, [SunandaDH--aol--com] <[SunandaDH--aol--com]> wrote:
> Robert:
> > I just see that my posts have those =20 added. Anyone an idea
<<quoted lines omitted: 12>>
> To unsubscribe from the list, just send an email to
> lists at rebol.com with unsubscribe as the subject.
--
... nice weather eh
[13/14] from: SunandaDH::aol::com at: 11-Jun-2005 17:17
Tom:
> I see the escaped chars in Robert's message but not in mine,
> so it may also be locale-dependent (and a reason to go back to pine ...)
Or something at your end has cleaned them up before you see them.
Lots of =20 and =09 codes in your previous message as displayed on rebol.net
http://mail.rebol.net/cgi-bin/mail-list.r?msg=38549
Sunanda.
[14/14] from: gabriele:colellachiara at: 12-Jun-2005 13:43
Hi Robert,
On Saturday, June 11, 2005, 12:48:38 PM, you wrote:
RMM> I just see that my posts have those =20 added. Anyone an idea where
RMM> this comes from? Robert
It's the quoted-printable encoding. I think Ecartis does not
decode it, but it somewhat strips the Content-Transfer-Encoding
header. The solution is not to use it, as I don't think there are
still MTAs around that don't accept 8-bit characters.
In your mailer, check if you have an option like "use
quoted-printable" or "send email as 8 bit" or anything like that.
You should not use QP; send as 8 bit instead.
Also, any MTA in between could be translating your mail from 8 bit
to QP; if that's the case, you're out of luck, unless you manage
to convince the admin of that MTA that he should not really be
doing that. (Of course, he'll answer that this is Ecartis' fault,
and I actually agree with that, but since QP is 99.99% of the time
completely useless, one can safely just disable it.)
Regards,
Gabriele.
--
Gabriele Santilli <[g--santilli--tiscalinet--it]> -- REBOL Programmer
Amiga Group Italia sez. L'Aquila --- SOON: http://www.rebol.it/
Notes
- Quoted lines have been omitted from some messages.
View the message alone to see the lines that have been omitted