[REBOL] Re: Hungarian Alphabet Sort (was Re: Collation sequence - proper and eff
From: carl:cybercraft at: 17-May-2002 21:46
On 17-May-02, G. Scott Jones wrote:
> From: "Carl Read"
>> I've thoughts about how to speed it up - will be testing them out.
> Great!
Well, not so great, actually. The new version's faster, but not
markedly so. Perhaps 30% faster going by the single test of a long
list of random words I did, though it's still 7 or 8 times slower than
REBOL's sort. Maybe if it was all done with parsing it'd be faster,
but I'd have to re-think it all. (:
Anyway, here it is. See the end of the mail for how to handle
characters that are to be considered equal. ie "A" & "a" etc.
pattern-rule: func [
"Create a rule for use by pattern-sort."
pattern [string! block!] "An ordered pattern."
/local rule add-rule n
][
rule: copy []
add-rule: func [str][
str: to-string str
insert tail rule reduce [length? str reduce [
str to-paren reduce ['r n length? str] '|
]]
]
n: 1
foreach pos pattern [
either block? pos [
either 1 = length? pos [
foreach r pos/1 [add-rule r]
][
foreach r pos [add-rule r]
]
][
add-rule pos
]
n: n + 1
]
rule: extract next sort/reverse/skip rule 2 2
insert tail rule reduce ['skip to-paren reduce ['r n 1]]
reduce ['some rule]
]
pattern-sort: func [
{Sort a string or block of strings based on a rule created
by pattern-rule.}
series [string! block!] "Series to sort."
rule [block!] "Pattern rule."
/reverse "Reverse sort order."
/local new blk r pos
][
new: clear []
blk: clear []
r: func [n len][
insert tail blk n
if string? series [insert tail blk len]
]
bind rule 'r
either string? series [
parse/case series rule
pos: 0
foreach [n len] blk [
insert tail new reduce [
reduce [n] copy/part skip series pos len
]
pos: pos + len
]
][
foreach n series [
clear blk
parse/case n rule
insert tail new reduce [copy blk n]
]
]
either reverse [sort/skip/reverse new 2][sort/skip new 2]
clear series
insert tail series extract next new 2
series
]
Use is the same as before, though the rules that are generated are
different from before. ie...
>> rule-1: pattern-rule "aAbBcC"
== [some ["a" (r 1 1) | "A" (r 2 1) | "b" (r 3 1) | "B" (r 4 1) | "c"
(r 5 1) | "C" (r 6 1) | skip (r 7 1)]]
So...
>> pattern-sort "abcABC" rule-1
== "aAbBcC"
>> pattern-sort ["abc" "ABC" "aBc" "AbC"] rule-1
== ["abc" "aBc" "AbC" "ABC"]
>> rule-2: pattern-rule ["a" "b" "c" "h" "ch"]
== [some ["ch" (r 5 2) | "a" (r 1 1) | "b" (r 2 1) | "c" (r 3 1) | "h"
(r 4 1) | skip (r 6 1)]]
>> pattern-sort "ccchcc" rule-2
== "ccccch"
>> pattern-sort "hccchhh" rule-2
== "cchhhch"
>> pattern-sort ["c" "h" "ch" "h" "c"] rule-2
== ["c" "c" "h" "h" "ch"]
Now, to give the same weight to two or more characters, enclose them
in a block. They can either be a single string in the block, in
which case all the characters in the string are weighted the same,
else they can be group of strings which will all be weighted the
same. ie...
>> rule-3: pattern-rule [["aA"]["bB"]["cC"]]
== [some ["a" (r 1 1) | "A" (r 1 1) | "b" (r 2 1) | "B" (r 2 1) | "c"
(r 3 1) | "C" (r 3 1) | skip (r 4 1)]]
>> pattern-sort "BBbbBBcCcAaA" rule-3
== "AaABBbbBBcCc"
>> pattern-sort ["Bbb" "bBB" "aA" "Aa"] rule-3
== ["aA" "Aa" "Bbb" "bBB"]
>> rule-4: pattern-rule ["a" "b" ["cC"]["hH"]["ch" "CH"]]
== [some ["ch" (r 5 2) | "CH" (r 5 2) | "a" (r 1 1) | "b" (r 2 1) |
c
(r 3 1) | "C" (r 3 1) | "h" (r 4 1) | "H" (r 4 1) | skip (r...
>> pattern-sort "CHcCCcchbaHh" rule-4
== "abcCCcHhCHch"
>> pattern-sort ["hhCH" "ccCH" "hhch" "ccch" "hhCH"] rule-4
== ["ccCH" "ccch" "hhCH" "hhch" "hhCH"]
Also, I've allowed for characters not included in the rules, they
being treated as the last character in the rule. So this doesn't
generate an error...
>> pattern-sort ["rat" "hat" "cat"] rule-4
== ["cat" "hat" "rat"]
And the reverse refinement's still there...
>> pattern-sort/reverse ["rat" "hat" "cat"] rule-4
== ["rat" "hat" "cat"]
As before, no promises about how well this will perform with real
alphabets, but it should be a bit better than the last effort.
Hopefully. (;
--
Carl Read