Large file compare
[1/14] from: valleyroad::gmx::de at: 8-Jun-2005 11:53
Hi,
I want to do a large file compare with REBOL. I already wrote a script to do
this, which works fine and compares two test files, containing more than
800000 lines each, within 13 minutes. The only problem is that the process
consumes about 180 MB of memory and 95% CPU. The CPU is ok, but I think the
problem will be the memory, as the target volume is to compare about 6.7
million lines per file.
What I am doing is reading the two files (a: read %testfile1.txt), doing an
intersect and making a diff on the two files like this:
a: read %testfile1.txt
b: read %testfile2.txt
inboth: intersect a b
only_a: difference inboth a
only_b: difference inboth b
My question is whether there are better ways in REBOL to achieve the same
with less memory consumption.
Thanks for any hints, ideas......
Thorsten
[2/14] from: gabriele::colellachiara::com at: 8-Jun-2005 12:31
Hi Thorsten,
On Wednesday, June 8, 2005, 11:53:08 AM, you wrote:
TM> a: read %testfile1.txt
TM> b: read %testfile2.txt
Did you mean READ/LINES?
TM> inboth: intersect a b
TM> only_a: difference inboth a
TM> only_b: difference inboth b
TM> My question is whether there are better ways in REBOL to achieve the same
TM> with less memory consumption.
Yes - don't load the whole files in memory. :-)
Is the difference going to be big too? If so you may want to avoid
keeping it in memory too.
OTOH, if you have enough memory for the operation, doing it all in
memory is going to be much faster.
Regards,
Gabriele.
--
Gabriele Santilli <[g--santilli--tiscalinet--it]> -- REBOL Programmer
Amiga Group Italia sez. L'Aquila --- SOON: http://www.rebol.it/
[3/14] from: valleyroad:gmx at: 8-Jun-2005 15:48
Hi Gabriele,
good hints. So, I now use read/lines instead of read and write out the
result from the difference operation immediately and remove it from
memory. This drops the actual memory consumption to 140 MB during
intersect and 120 MB during difference.
But I still think it will become too big when operating on the whole
set.
As the file content is very trivial, like "2348246864;PCINIT2", and can be
sorted, I think of something like stepping through the files line by line
and comparing the line content. The file which has the lead in the
comparison will alternate, depending on whether there is a difference in
the first or second column. This only works when both files are sorted
identically.
There will be minimal memory consumption. But what I don't know is
what commands to use, as they must remember the positions in the files.
I will think on this further. Perhaps you have a good idea how this could
be implemented. What I also don't know is whether this will be fast enough.
Thanks
Thorsten
On Wed, 8 Jun 2005 12:31:56 +0200, "Gabriele Santilli"
<[gabriele--colellachiara--com]> said:
> Hi Thorsten,
> On Wednesday, June 8, 2005, 11:53:08 AM, you wrote:
<<quoted lines omitted: 20>>
> To unsubscribe from the list, just send an email to
> lists at rebol.com with unsubscribe as the subject.
--
Melian Solutions
Thorsten Moeller
Mail: [tmoeller--fastmail--fm]
[4/14] from: gabriele::colellachiara::com at: 8-Jun-2005 16:32
Hi Thorsten,
On Wednesday, June 8, 2005, 3:48:14 PM, you wrote:
TM> good hints. So, I now use read/lines instead of read and write out the
TM> result from the difference operation immediately and remove it from
TM> memory. This drops the actual memory consumption to 140 MB during
TM> intersect and 120 MB during difference.
The comment about READ/LINES was not to reduce memory usage, but
because otherwise you are not intersecting two sets of lines, but
two sets of characters. The result would be a string of the unique
characters that are in both files, which I don't think is what you
wanted.
TM> But I still think it will become too big when operating on the whole
TM> set.
Maybe, but the other option is making it very slow. Since RAM is
quite cheap today while time never is... ;)
Anyway, to read a file line per line:
p: open/direct/lines %some-file.txt
while [line: pick p 1] [print mold line]
close p
TM> As the file content is very trivial, like "2348246864;PCINIT2",
TM> and can be sorted, I think of something like stepping through
TM> the files line by line and comparing the line content. The file
TM> which has the lead in the comparison will alternate,
TM> depending on whether there is a difference in the first or
TM> second column. This only works when both files are sorted
TM> identically.
Yes, if the files are sorted, you can follow this approach. It
will be reasonably fast and the memory used is not dependent on
the size of the files. But, if you have to sort the files first,
this may not be the optimal solution.
TM> There will be minimal memory consumption. But what I don't know is
TM> what commands to use, as they must remember the positions in the files.
See above. Use two ports.
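For what it's worth, here is a rough, untested sketch of that
two-port merge, assuming both files are sorted the same way (the
file names are placeholders, and the differences are simply
collected in memory for simplicity):
pa: open/direct/lines %testfile1.txt
pb: open/direct/lines %testfile2.txt
only-a: copy []  only-b: copy []
la: pick pa 1
lb: pick pb 1
while [all [la lb]] [
    either la = lb [
        la: pick pa 1  lb: pick pb 1         ; line present in both files
    ][
        either la < lb [
            append only-a la  la: pick pa 1  ; only in the first file
        ][
            append only-b lb  lb: pick pb 1  ; only in the second file
        ]
    ]
]
while [la] [append only-a la  la: pick pa 1] ; drain whichever file is left
while [lb] [append only-b lb  lb: pick pb 1]
close pa  close pb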
TM> I will think on this further. Perhaps you have a good idea how this could
TM> be implemented. What I also don't know is whether this will be fast enough.
Another possibility, if the files are not sorted, is to keep a
block of line checksums. This will use 16 * number-of-lines bytes
of memory, which is probably much less than the memory required to
load the whole file; and you need to do it for one of the two
files only.
The downside is that since there could, in principle, be
collisions, you'd need to verify matching lines against the real
file; if the differences between the two files are small, this is
going to be very slow.
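A rough, untested sketch of this (it only finds lines of the first
file that are not in the second, and leaves out the collision check;
the hash! is only there to make the lookups fast):
sums: make hash! 800000
p: open/direct/lines %testfile2.txt
while [line: pick p 1] [append sums checksum/method line 'md5]
close p
only-a: copy []
p: open/direct/lines %testfile1.txt
while [line: pick p 1] [
    if not find sums checksum/method line 'md5 [append only-a line]
]
close p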
Another possibility could be to process the two files in chunks.
You are after the intersection of two sets, A and B. Let me use u
for union and n for intersection. Assuming that B = B1 u B2,
A n B = (A n B1) u (A n B2)
applying the same for A, with A = A1 u A2,
A n B = ((A1 n B1) u (A2 n B1)) u ((A1 n B2) u (A2 n B2))
A n B = (A1 n B1) u (A2 n B1) u (A1 n B2) u (A2 n B2)
If you have A = A1 u A2 u ... u An and B = B1 u B2 u ... u Bn,
then all you have to do is to compute all the possible Ai n Bj
intersections, and make a union of all of them.
So, you open two ports in /DIRECT/LINES mode, use COPY/PART to get
chunks of N lines, and iterate over the files to generate your
intersection. Seems quite a good strategy to me.
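Untested, but something along these lines, perhaps (note that the
second file gets re-read once per chunk of the first, and the
accumulated intersection still ends up in memory):
n: 100000   ; lines per chunk
inboth: copy []
pa: open/direct/lines %testfile1.txt
while [all [chunk-a: copy/part pa n  not empty? chunk-a]] [
    pb: open/direct/lines %testfile2.txt
    while [all [chunk-b: copy/part pb n  not empty? chunk-b]] [
        inboth: union inboth intersect chunk-a chunk-b
    ]
    close pb
]
close pa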
Regards,
Gabriele.
--
Gabriele Santilli <[g--santilli--tiscalinet--it]> -- REBOL Programmer
Amiga Group Italia sez. L'Aquila --- SOON: http://www.rebol.it/
[5/14] from: andreas::bolka::gmx::net at: 8-Jun-2005 17:25
Wednesday, June 8, 2005, 11:53:08 AM, Thorsten wrote:
> I want to do a large file compare with REBOL. I already wrote a script to do
> this, which works fine and compares two test files, containing more than
> 800000 lines each, within 13 minutes.
^^^^^^^^^^^^^^^^^
are you sure about that number? I just tried a
i: intersect read/lines %f1.txt read/lines %f2.txt
where f1.txt and f2.txt are both files with roughly 800000 lines
(~14 MB each). It took 0:02:47 on my machine (even that is still awful,
but 0:13:00 would be an order of magnitude more awful ;).
[tested on a p4-m 1.6ghz, 1GB ram, 5400rpm hdd]
--
Best regards,
Andreas
[6/14] from: tom:conlin:g:mail at: 8-Jun-2005 13:17
I don't have time to try/test any of this, so some of the logic may be
reversed, but might a single-pass approach help?
;;; a and b are sorted blocks of lines;
;;; in-both, only_a and only_b start out as empty blocks
while [all [not tail? a  not tail? b]] [
    either equal? first a first b [
        append in-both first a
        a: next a
        b: next b
    ][
        either greater? first a first b [
            ;; first b is smaller, so it cannot appear later in a
            append only_b first b  b: next b
        ][
            append only_a first a  a: next a
        ]
    ]
]
;;; in case one finishes before the other
while [not tail? a] [append only_a first a  a: next a]
while [not tail? b] [append only_b first b  b: next b]
On 6/8/05, Thorsten Moeller <[valleyroad--gmx--de]> wrote:
> Hi Gabriele,
> good hints. So, i now use read/line instead of read and write out the
<<quoted lines omitted: 67>>
> To unsubscribe from the list, just send an email to
> lists at rebol.com with unsubscribe as the subject.
--
... nice weather eh
[7/14] from: robert:muench:robertmuench at: 9-Jun-2005 9:51
On Wed, 08 Jun 2005 16:32:03 +0200, Gabriele Santilli
<[gabriele--colellachiara--com]> wrote:
> Another possibility, could be to process the two files in chunks.
Hi, that's the approach I take. Use a random number generator seeded
with the filename, for example. Pick some 100KB parts from those big
files and compare them with 'checksum. Repeat until you find the first
difference. The chances that you will hit a difference are quite high
if you always include the first and last 100KB.
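A simplified, untested variant of that (sequential 100KB chunks
rather than randomly picked ones, since a plain /direct port is read
front to back; it only tells you whether the files differ, not where):
chunk: 100 * 1024
chunk-sums: func [file /local p sums data] [
    p: open/binary/direct file
    sums: copy []
    while [all [data: copy/part p chunk  not empty? data]] [
        append sums checksum data   ; one checksum per 100KB chunk
    ]
    close p
    sums
]
print either (chunk-sums %testfile1.txt) = (chunk-sums %testfile2.txt)
    ["files appear identical"] ["files differ"]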
--
Robert M. Münch
Management & IT Freelancer
Mobile: +49 (177) 245 2802
http://www.robertmuench.de
[9/14] from: SunandaDH:aol at: 8-Jun-2005 11:18
Thorsten:
> As the file content is very trivial, like "2348246864;PCINIT2", and can be
> sorted, I think of something like stepping through the files line by line
> and comparing the line content
REBOL.org uses a compare utility to let script owners see changes in
different versions of a script they've contributed to the Library. It is not currently
publicly available.
I wrote it. I was concerned not to use recursion, as REBOL has a small
stack for that; this limited the possible approaches.
As scripts are line-oriented things, I used the approach you suggest: sort
both files, use a merge/compare to remove matching lines. On the rest, apply
some heuristics to distinguish inserts and deletions from block moves.
It works best where there are not a large number of identical lines. Which is
why, by default, it ignores blank lines when comparing ..... which, as a
happy side-effect, is usually what you'd want when comparing source code versions.
It wouldn't work on files of the size you have targeted, as it acts on blocks in
memory. But as most of its processing involves running up and down blocks
flagging things, there is no in-principle reason why the same logic couldn't work on
files -- given that read/direct works.
Is that enough hints for now?
Sunanda.
[10/14] from: robert::muench::robertmuench::de at: 11-Jun-2005 12:48
On Thu, 09 Jun 2005 09:51:35 +0200, Robert M. Münch
<[robert--muench--robertmuench--de]> wrote:
> Hi, that's the approach I take. Use a random number generator seeded
>
I just see that my posts have those =20 added. Anyone an idea where
this comes from? Robert
[11/14] from: SunandaDH:aol at: 11-Jun-2005 9:54
Robert:
> I just see that my posts have those =20 added. Anyone an idea
> where =20 this comes from?
No idea...... You'll probably see that Tom's post in the same thread has its
tabs replaced with =09, thus messing up the viewing or copying of his script.
A great many emails seem to get =20 codes added at the end of lines.
It seems if an email passes through enough layers of software, all of which
wish to escape, correct, or translate charsets, eventually what you'll receive
will be total gibberish.
Carl would probably call that another reason why email is dead.
The ML archive makes some attempts to reverse the damage, eg:
http://www.rebol.org/cgi-bin/cgiwrap/rebol/ml-display-thread.r?m=rmlSXVC
Sunanda.
[12/14] from: tom:conlin:gmai:l at: 11-Jun-2005 13:50
Hmmm
I see the escaped chars in Robert's message but not in mine,
so it may also be locale-dependent (and a reason to go back to pine ...)
On 6/11/05, [SunandaDH--aol--com] <[SunandaDH--aol--com]> wrote:
> Robert:
> > I just see that my posts have those =20 added. Anyone an idea
<<quoted lines omitted: 12>>
> To unsubscribe from the list, just send an email to
> lists at rebol.com with unsubscribe as the subject.
--
... nice weather eh
[13/14] from: SunandaDH::aol::com at: 11-Jun-2005 17:17
Tom:
> I see the escaped chars in Robert's message but not in mine,
> so it may also be locale-dependent (and a reason to go back to pine ...)
Or something at your end has cleaned them up before you see them.
Lots of =20 and =09 codes in your previous message as displayed on rebol.net
http://mail.rebol.net/cgi-bin/mail-list.r?msg=38549
Sunanda.
[14/14] from: gabriele:colellachiara at: 12-Jun-2005 13:43
Hi Robert,
On Saturday, June 11, 2005, 12:48:38 PM, you wrote:
RMM> I just see that my posts have those =20 added. Anyone an idea where
RMM> this comes from? Robert
It's the quoted-printable encoding. I think Ecartis does not
decode it, but it somewhat strips the Content-Transfer-Encoding
header. The solution is not to use it, as I don't think there are
still MTAs around that don't accept 8-bit characters.
In your mailer, check if you have an option like "use
quoted-printable" or "send email as 8 bit" or anything like that.
You should not use QP; send as 8 bit instead.
Also, any MTA in between could be translating your mail from 8 bit
to QP; if that's the case, you're out of luck, unless you manage
to convince the admin of that MTA that he should not really be
doing that. (Of course, he'll answer that this is Ecartis' fault,
and I actually agree with that, but since QP is 99.99% of the time
completely useless, one can safely just disable it.)
Regards,
Gabriele.
--
Gabriele Santilli <[g--santilli--tiscalinet--it]> -- REBOL Programmer
Amiga Group Italia sez. L'Aquila --- SOON: http://www.rebol.it/
Notes
- Quoted lines have been omitted from some messages.
View the message alone to see the lines that have been omitted