[REBOL] Re: Sort by first part of line
From: gscottjones:mchsi at: 6-Sep-2002 10:22
Hi, everyone,
My two cents worth:
I recreated a similar data set as Louis (but mine being total nonsence, of
course :-) with the following algorithm:
loop 29688 [
write/append/lines %//windows/desktop/data1.txt rejoin [
" "
random/only {1234567890}
random/only {1234567890}
random/only {1234567890}
" "
random {eawk acef 32233}
]
]
It's a quick hack, so stop laughing! A sample of the data looks like:
...
977 3 2e33aewacf2k
638 e ak33we2c3a 2f
355 ak cefa233we3 2
216 e ec3w3aa2f2k3
422 3fc2eeaa33w k2
462 wa3 ae3e2fk32c
238 wea 3ack2 23f3e
616 2aa3 f3k3w2e ec
658 w 32 ckfa33a2ee
537 3e 3fw2e acak23
592 33 c3ef22we aka
858 e3w3 cea2k2fa 3
453 efca kw223ae33
...
and the resulting file size was 637kb.
As a rough comparison, to see if I was on track, I adapted Sunanda's code
(assuming I got the final version correctly):
;;;;;;;;;;;;;;;;;;;;;;;;
t: now/time
sort/compare sorted-data: copy data func [a b /local a-key b-key ] [
a-key: join second parse/all a " " ["-" index? find data a]
b-key: join second parse/all b " " ["-" index? find data b]
;print [a-key " " b-key]
return a-key < b-key
]
print now/time - t
;;;;;;;;;;;;;;;;;;;;;;;;
I didn't do dedicated run (I needed to do a few other things), but I
interrupted the run after 4 hours on my 500mhz Celeron 128MB Win98.
Next, I drew a few ideas from contributions, but I was determined to let in
run in memory. Here was the resulting code:
;;;;;;;;;;;;;;;;;;;;;;;;
rebol []
full-time: now/time
data: read/lines %//windows/desktop/data1.txt
data-blk: copy []
foreach datum data [
append data-blk second parse/all datum " "
append data-blk datum
]
sort-time: now/time
sort/skip data-blk 2
print now/time - sort-time
foreach [key value] data-blk [
write/lines/append %//windows/desktop/sorted-data.txt value
]
print now/time - full-time
print "done"
ask ""
;;;;;;;;;;;;;;;;;;;;;;;;
The idea was to do a parse through the code only once, and construct a new
key and value
pairing data block. Then I did a a simple sort/skip by 2.
Time for the sort (alone) was 1 to 2 seconds on my machine, and total
runtime about 33 seconds. Sample of the resulting data:
...
000 ae332ae ckw 3f2
000 ea a32c323w fek
000 3ac33 wee2 2afk
000 3 22kee33w acaf
000 2a2e3cak f e33w
000 ak 3e2fe3wa2c3
000 ae2 2efw3ca3 3k
000 ee2ca3w 3ka3f2
000 2 afewe333cak2
000 ce2ew3a3a3 k2f
000 kcaa3f23 w e32e
000 3e2kcwf33 2eaa
000 e3e 2waf cak323
000 332e ak2eca3fw
000 3fek2wa3c a23 e
000 32ck e ea23faw3
000 akf33 eace3w2 2
000 332ew2 ka3eca f
...
I am assuming that this is considered a stable sort based on the first
alphanumeric "word" only.
Comments welcome.
Respectfully,
--Scott Jones