Mailing List Archive: 49091 messages

Another newbie problem

 [1/5] from: kpeters::vu-ware::com at: 2-Feb-2005 15:53


Hi all ~

Given a simple text file containing values like this, how can I create a second file containing every value only once? Speed does matter, as I have to handle millions of entries.

Thanks for any pointers,
Kai

G10T0O6030
G10T0O6030
1DY00Y6106
1DY00Y6106
2KN0I39987
R1Q40K8008
R1Q40K8008
2B1G6Y3626
2B1G6Y3626
R1H0V06485
1XF00W6582
M1F8N72518
M1F8N72518
P1Y41G0352
Q1B0X06440

 [2/5] from: tomc:darkwing:uoregon at: 2-Feb-2005 16:12


my-set: unique read/lines file
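
Since Kai also wants the result written out to a second file, a minimal sketch of the full round trip Tom's one-liner suggests might look like this (the file names are placeholders, and it assumes the whole file fits in memory):

data: read/lines %values.txt              ; one value per line, as a block of strings
my-set: unique data                       ; drop the duplicate values
write/lines %unique-values.txt my-set     ; write each remaining value on its own line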

 [3/5] from: Izkata:Comcast at: 2-Feb-2005 18:15


If it's all newline-delimited, I'd use something like:

Text: read/lines %WhateverTheFilesWas
NewList: []
foreach Val Text [
    if not find NewList Val [insert tail NewList join Val newline]
]
write %TheNewFileName rejoin NewList

So that would create a new file with all the values, none being repeated. (Also, instead of using "insert tail", you can use "append" - but "insert tail" is slightly faster.)

(For anyone else:) Did I miss anything? Can it be optimized more?

-Izzy Boy
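
One thing to watch in Izzy's sketch: the values stored in NewList have a newline joined on, while FIND is handed the bare value, so the duplicate check is unlikely to ever match. A variant that keeps the stored values plain and lets WRITE/LINES supply the newlines sidesteps that mismatch (same placeholder file names as in the post):

Text: read/lines %WhateverTheFilesWas
NewList: copy []                      ; COPY avoids reusing the same literal block
foreach Val Text [
    if not find NewList Val [insert tail NewList Val]
]
write/lines %TheNewFileName NewList   ; WRITE/LINES adds the newlines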

 [4/5] from: Izkata::Comcast::net at: 2-Feb-2005 18:18


----- Original Message -----
From: "Tom Conlin" <[tomc--darkwing--uoregon--edu]>
To: <[rebolist--rebol--com]>
Sent: Wednesday, February 02, 2005 6:12 PM
Subject: [REBOL] Re: Another newbie problem
> my-set: unique read/lines file
Dernit, I -knew- something like that existed... But I forgot the name... -laughs- -Izzy Boy

 [5/5] from: greggirwin:mindspring at: 2-Feb-2005 22:58


Hi Izzy,

I> Can it be optimized more?

There ain't no such thing as the fastest code. :) (Any Michael Abrash readers out there? The old DDJ optimization challenges were so much fun!)

The more we know about the data, constraints, etc., the more we can help. E.g. will the data all fit in memory? What is the approximate percentage of duplicates? Are all values the same length? How many unique values can be expected?

Think about this: FIND just does a linear search, so if you have a large data set and the match is at the end, or not there at all, how will that affect you? The more times you call FIND, and the larger your data set, the worse it's going to get, right? What about using a hash! for lookups, or a BTree? What about sorting the data (that would group the duplicates, right)?

Profiling is important in optimizing, because otherwise you don't know what to optimize. The same applies to information about the problem. :)

-- Gregg
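
For illustration, here is a rough sketch of the hash!-based lookup Gregg mentions, assuming REBOL 2 (where hash! is available) and made-up file names; it keeps the first occurrence of each value, in input order:

data: read/lines %values.txt       ; assumes the data fits in memory
seen: make hash! []                ; FIND on a hash! is much faster than on a plain block
result: copy []
foreach val data [
    if not find seen val [
        append seen val
        append result val          ; first occurrence only, input order preserved
    ]
]
write/lines %unique-values.txt result

Sorting the data first and comparing each value to the previous one would be another route, since sorting groups the duplicates together.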