Mailing List Archive: Re: Sort by first part of line

[REBOL] Re: Sort by first part of line

From: gscottjones:mchsi at: 6-Sep-2002 10:22


Hi, everyone,

My two cents worth:

I recreated a similar data set as Louis (but mine being total nonsence, of
course :-) with the following algorithm:

loop 29688 [
    write/append/lines %//windows/desktop/data1.txt rejoin [
        " "
        random/only {1234567890}
        random/only {1234567890}
        random/only {1234567890}
        " "
        random {eawk acef 32233}
    ]
]

It's a quick hack, so stop laughing!  A sample of the data looks like:
...
 977 3 2e33aewacf2k
 638 e ak33we2c3a 2f
 355 ak cefa233we3 2
 216 e  ec3w3aa2f2k3
 422 3fc2eeaa33w k2
 462 wa3  ae3e2fk32c
 238 wea 3ack2 23f3e
 616 2aa3 f3k3w2e ec
 658 w 32 ckfa33a2ee
 537 3e 3fw2e acak23
 592 33 c3ef22we aka
 858 e3w3 cea2k2fa 3
 453 efca  kw223ae33
...
and the resulting file size was 637kb.

As a rough comparison, to see if I was on track, I adapted Sunanda's code
(assuming I got the final version correctly):

;;;;;;;;;;;;;;;;;;;;;;;;
t: now/time
sort/compare  sorted-data: copy data func [a b /local  a-key b-key ] [
    a-key: join second parse/all a " "    ["-" index? find data a]
    b-key: join second parse/all b " "    ["-" index? find data b]
    ;print [a-key "   " b-key]
    return a-key < b-key
]
print now/time - t
;;;;;;;;;;;;;;;;;;;;;;;;

I didn't do dedicated run (I needed to do a few other things), but I
interrupted the run after 4 hours on my 500mhz Celeron 128MB Win98.

Next, I drew a few ideas from contributions, but I was determined to let in
run in memory.  Here was the resulting code:
;;;;;;;;;;;;;;;;;;;;;;;;
rebol []

full-time: now/time
data: read/lines %//windows/desktop/data1.txt

data-blk: copy []
foreach datum data [
    append data-blk second parse/all datum " "
    append data-blk datum
]

sort-time: now/time
sort/skip data-blk 2
print now/time - sort-time

foreach [key value] data-blk [
    write/lines/append %//windows/desktop/sorted-data.txt value
]

print now/time - full-time
print "done"
ask ""
;;;;;;;;;;;;;;;;;;;;;;;;

The idea was to do a parse through the code only once, and construct a new
key and value
 pairing data block.  Then I did a a simple sort/skip by 2.

Time for the sort (alone) was 1 to 2 seconds on my machine, and total
runtime about 33 seconds.  Sample of the resulting data:
...
 000 ae332ae ckw 3f2
 000 ea a32c323w fek
 000 3ac33 wee2 2afk
 000 3 22kee33w acaf
 000 2a2e3cak f e33w
 000 ak 3e2fe3wa2c3
 000 ae2 2efw3ca3 3k
 000 ee2ca3w 3ka3f2
 000 2 afewe333cak2
 000  ce2ew3a3a3 k2f
 000 kcaa3f23 w e32e
 000 3e2kcwf33 2eaa
 000 e3e 2waf cak323
 000 332e ak2eca3fw
 000 3fek2wa3c a23 e
 000 32ck e ea23faw3
 000 akf33 eace3w2 2
 000 332ew2 ka3eca f
...

I am assuming that this is considered a stable sort based on the first
alphanumeric "word" only.

Comments welcome.

Respectfully,
--Scott Jones