Mailing List Archive: Re: Large file compare

[REBOL] Re: Large file compare

From: tom:conlin:gma:il at: 8-Jun-2005 13:17


I don't have time to try or test any of this, so some of the logic may be reversed,
but might a single-pass approach help?

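;;; a and b are sorted blocks (same sort order in both)

in-both: copy []
only_a: copy []
only_b: copy []

while [all [not tail? a not tail? b]] [
    either equal? first a first b [
        ;; the line is in both blocks: record it and advance both
        insert/only tail in-both first a
        a: next a
        b: next b
    ][
        either lesser? first a first b [
            ;; first a sorts before first b, so it cannot occur in b
            insert/only tail only_a first a
            a: next a
        ][
            ;; first b sorts before first a, so it cannot occur in a
            insert/only tail only_b first b
            b: next b
        ]
    ]
]
;;; in case one finishes before the other
while [not tail? a] [
    insert/only tail only_a first a
    a: next a
]
while [not tail? b] [
    insert/only tail only_b first b
    b: next b
]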
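
To try the single-pass idea on small inputs, it can be wrapped in a function. This is only a sketch: the name compare-sorted and the sample lines are made up for illustration (the lines just follow the "number;name" format mentioned in the thread), and it assumes both blocks are already sorted in the order that lesser? uses for strings.

```rebol
REBOL []

; Hypothetical helper: one pass over two sorted blocks.
; Returns a block of three blocks: [in-both only-a only-b].
compare-sorted: func [
    a [block!] b [block!]
    /local in-both only-a only-b
][
    in-both: copy []
    only-a: copy []
    only-b: copy []
    while [all [not tail? a not tail? b]] [
        either equal? first a first b [
            ; same line in both inputs: keep it, advance both
            append in-both first a
            a: next a
            b: next b
        ][
            either lesser? first a first b [
                ; first a sorts earlier, so b cannot contain it
                append only-a first a
                a: next a
            ][
                ; first b sorts earlier, so a cannot contain it
                append only-b first b
                b: next b
            ]
        ]
    ]
    ; whatever is left in either input has no partner in the other
    append only-a a
    append only-b b
    reduce [in-both only-a only-b]
]

; Made-up sample data
probe compare-sorted
    ["100;A" "200;B" "300;C"]
    ["200;B" "300;C" "400;D"]
; prints [["200;B" "300;C"] ["100;A"] ["400;D"]]
```

One thing to watch: the lines are compared as strings, so the files need to be sorted as strings too. If the numeric keys differ in length, string order and numeric order are not the same, and the comparison would go out of step.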

On 6/8/05, Thorsten Moeller <[valleyroad--gmx--de]> wrote:
> Hi Gabriele,
>
> Good hints. So I now use read/lines instead of read, write out the
> result of the difference operation immediately, and remove it from
> memory. This drops the actual memory consumption to 140 MB during
> intersect and 120 MB during difference.
>
> But I still think it will become too big when operating on the whole
> set.
>
> As the file content is very simple, like "2348246864;PCINIT2", and can be
> sorted, I am thinking of stepping through the files line by line
> and comparing the line content. The file that has the lead in the
> comparison will alternate, depending on whether the difference is in
> the first or second column. This only works when both files are sorted
> identically.
>
> Memory consumption would be minimal. What I don't know is which
> commands to use, as they must remember the positions in the files.
>
> I will think about this further. Perhaps you have a good idea how this could
> be implemented. What I also don't know is whether this will be fast enough.
>
> Thanks
>
> Thorsten
>
> On Wed, 8 Jun 2005 12:31:56 +0200, "Gabriele Santilli"
> <[gabriele--colellachiara--com]> said:
> >
> > Hi Thorsten,
> >
> > On Wednesday, June 8, 2005, 11:53:08 AM, you wrote:
> >
> > TM> a: read %testfile1.txt
> > TM> b: read %testfile2.txt
> >
> > Did you mean READ/LINES?
> >
> > TM> inboth: intersect a b
> > TM> only_a: difference inboth a
> > TM> only_b: difference inboth b
> >
> > TM> My question is whether there are better ways in REBOL to achieve the same
> > TM> with less memory consumption?
> >
> > Yes - don't load the whole files in memory. :-)
> >
> > Is the difference going to be big too? If so you may want to avoid
> > keeping it in memory too.
> >
> > OTOH, if you have enough memory for the operation, doing it all in
> > memory is going to be much faster.
> >
> > Regards,
> >    Gabriele.
> > --
> > Gabriele Santilli <[g--santilli--tiscalinet--it]>  --  REBOL Programmer
> > Amiga Group Italia sez. L'Aquila  ---   SOON: http://www.rebol.it/
> >
> > --
> > To unsubscribe from the list, just send an email to
> > lists at rebol.com with unsubscribe as the subject.
> >
> >
> --
>   Melian Solutions
>   Thorsten Moeller
>
>   Mail: [tmoeller--fastmail--fm]

--
   ... nice weather   eh