Another newbie problem
[1/5] from: kpeters::vu-ware::com at: 2-Feb-2005 15:53
Hi all ~
given a simple text file containing values like the ones below, how can I create a second file
containing each value only once?
Speed does matter as I have to handle millions of entries.
Thanks for any pointers,
Kai
G10T0O6030
G10T0O6030
1DY00Y6106
1DY00Y6106
2KN0I39987
R1Q40K8008
R1Q40K8008
2B1G6Y3626
2B1G6Y3626
R1H0V06485
1XF00W6582
M1F8N72518
M1F8N72518
P1Y41G0352
Q1B0X06440
[2/5] from: tomc:darkwing:uoregon at: 2-Feb-2005 16:12
my-set: unique read/lines file
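For comparison, here is a rough Python sketch of what that REBOL one-liner does: read the lines, keep the first occurrence of each value, and preserve order. (The function name is illustrative; this assumes the file fits in memory, as `read/lines` does.)

```python
def unique_lines(lines):
    """Return the lines with duplicates removed, keeping first-seen order."""
    seen = set()
    out = []
    for val in lines:
        if val not in seen:   # O(1) average-case membership test
            seen.add(val)
            out.append(val)
    return out

print(unique_lines(["G10T0O6030", "G10T0O6030", "1DY00Y6106"]))
# -> ['G10T0O6030', '1DY00Y6106']
```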
[3/5] from: Izkata:Comcast at: 2-Feb-2005 18:15
If it's all newline-delimited, I'd use something like:
Text: read/lines %WhateverTheFilesWas
NewList: []
foreach Val Text [
    if not find NewList Val [insert tail NewList Val]
]
write/lines %TheNewFileName NewList
So that would create a new file with all the values, none being repeated.
(Also, instead of using "insert tail", you can use "append" - but "insert
tail" is slightly faster)
(For anyone else: ) Did I miss anything? Can it be optimized more?
-Izzy Boy
[4/5] from: Izkata::Comcast::net at: 2-Feb-2005 18:18
----- Original Message -----
From: "Tom Conlin" <[tomc--darkwing--uoregon--edu]>
To: <[rebolist--rebol--com]>
Sent: Wednesday, February 02, 2005 6:12 PM
Subject: [REBOL] Re: Another newbie problem
> my-set: unique read/lines file
Dernit, I -knew- something like that existed... But I forgot the name...
-laughs-
-Izzy Boy
[5/5] from: greggirwin:mindspring at: 2-Feb-2005 22:58
Hi Izzy,
I> Can it be optimized more?
There ain't no such thing as the fastest code. :) (any Michael Abrash
readers out there? The old DDJ optimization challenges were so much fun!)
The more we know about the data, constraints, etc., the more we can
help. E.g. will the data all fit in memory? What is the approximate
percentage of duplicates? Are all values the same length? How many
unique values can be expected?
Think about this, FIND just does a linear search, so if you have a
large data set, and the match is at the end, or not there at all,
how will that affect you? The more times you call find, and the
larger your data set, the worse it's going to get, right?
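Not REBOL, but a quick Python sketch of that trade-off (names and data are illustrative): deduplicating with a linear `in` test over a growing list, versus the same loop with a hashed set lookup.

```python
import timeit

# Illustrative data set with many duplicates (each value repeated 4 times).
data = [str(i % 500) for i in range(2000)]

def dedup_list(values):
    out = []
    for v in values:
        if v not in out:      # linear scan of out: O(n) per lookup
            out.append(v)
    return out

def dedup_set(values):
    seen, out = set(), []
    for v in values:
        if v not in seen:     # hashed lookup: O(1) average per lookup
            seen.add(v)
            out.append(v)
    return out

# Both produce the same result; the set version scales far better.
print(timeit.timeit(lambda: dedup_list(data), number=10))
print(timeit.timeit(lambda: dedup_set(data), number=10))
```

The gap widens as the data grows: the list version does work proportional to (unique values) x (total values), the set version stays roughly linear.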
What about using a hash! for lookups or a BTree? What about sorting
the data (that would group the duplicates, right?)?
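A small Python sketch of the sorting idea (illustrative only): once the data is sorted, duplicates sit next to each other, so one pass comparing each value with its predecessor removes them with no lookup structure at all.

```python
def dedup_sorted(values):
    """Sort, then keep each value only when it differs from the previous one."""
    out = []
    for v in sorted(values):
        if not out or v != out[-1]:  # compare only with the neighbour
            out.append(v)
    return out

print(dedup_sorted(["b", "a", "b", "c", "a"]))  # -> ['a', 'b', 'c']
```

Note this gives up the original order, which `unique` and the loop-based versions preserve; whether that matters depends on the data.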
Profiling is important in optimizing, because otherwise you don't know
what to optimize. The same applies to information about the problem.
:)
-- Gregg