Mailing List Archive: Re: Hungarian Alphabet Sort (was Re: Collation sequence

[REBOL] Re: Hungarian Alphabet Sort (was Re: Collation sequence - proper and eff

From: gscottjones:mchsi at: 15-May-2002 16:55


From: "G. Scott Jones"
> From: "Carl Read"
> > Anyway, I've played around with my idea for sorting according to a
> > pattern, and while I'm not sure if the following code's very fast (or
> > bug-free:), like Volker, I post.
> >
> > There's two functions:  One to take a pattern for creating a rule from
> > and another to use the rule to sort strings or blocks of strings
> > with.  First, the functions...
> <nifty code snipped ... see:
> http://www.escribe.com/internet/rebol/m22420.html
> >
>
> This is extremely promising.  I drew from the ISO-8859-2 character set to
> make a rule, and it initially seems to sort correctly.  The time through is
> roughly the same as my hack (but I've not really set up a clean time
> condition).  The only problem so far occurs when I run my word sample list
> through more than once.  It seems to magically have kept the original sort/s
> and continues to append new results to the block.  I cannot seem to find
> where the problem is occurring.  Furthermore, I'm out of computer
> "play-time" today, so it will have to wait.  :(

Responding to self (I talk to myself sometimes too!).

Wow, major latency in getting the posting (I'm not pointing fingers, so no
one get their "undies in a bundle" ;-).

Hi, Carl,

The idea did look promising, even for the "multi-letter graphemes" (like the
Czech "ch"), but then I believe we run into a limitation of 'parse.  The
longer phrase rule needs to come before the shorter one, so a rule like:

rule-4: pattern-rule ["a" "A" "b" "B" "c" "C" "h" "H" "ch" "Ch"]

will not correctly sort:

>> pattern-sort ["c" "ch" "h"]  rule-4
== ["ch" "c" "h"]
;should be "c" "h" "ch"
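
For what it's worth, here is a minimal sketch of the "longer rule first" idea. It is in Python rather than REBOL, and it is not Carl's pattern-rule/pattern-sort code: the names ORDER, RANK, BY_LENGTH and sort_key are made up for illustration. The order list is kept exactly as in rule-4, but the matching tries the longest grapheme first, so "ch" is never split into "c" plus "h".

```python
# Collation order copied from rule-4 above; "ch" ranks after "h".
ORDER = ["a", "A", "b", "B", "c", "C", "h", "H", "ch", "Ch"]
RANK = {g: i for i, g in enumerate(ORDER)}

# Matching order is separate from collation order: longest graphemes first.
BY_LENGTH = sorted(ORDER, key=len, reverse=True)

def sort_key(word):
    """Turn a word into a list of grapheme ranks (hypothetical helper)."""
    ranks, i = [], 0
    while i < len(word):
        for g in BY_LENGTH:
            if word.startswith(g, i):
                ranks.append(RANK[g])
                i += len(g)
                break
        else:
            # Character not in ORDER: rank it after everything that is.
            ranks.append(len(ORDER) + ord(word[i]))
            i += 1
    return ranks

print(sorted(["c", "ch", "h"], key=sort_key))  # ['c', 'h', 'ch']
```

The point is only that the order used for matching and the order used for ranking do not have to be the same list.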

At least one other person has mused over the desire to have a pattern sort
(in this case under the GNU/Linux sort) (look near the bottom):

http://budling.nytud.hu/~szigetva/etcetera/converters/README

In this case, the pattern has a bit more information:
a=á<b<c<cs<d<e=é<f<g<gy<h<i=í...<z<zs

where "a" can be told to sort the same as "a with acute", both of these sort
before "b" ... and "zs" sorts after "z"

Breaking apart this information might allow a parse rule to set up the
sequence to allow the longer phrase rules to come before the shorter ones.
At least I think it would work.

Back to Geza, ...

Geza, how important are these "multi-letter graphemes" (cs, dz, dzs, gy, ly,
ny, sz, ty and zs) in a sort algorithm?  At the same site, Péter Szigetvári
indicates that it can get very tricky:

Unfortunately, the task is not trivial: some sequences that look like
multi-letter graphemes are in fact not, e.g., bércsík may be ranked before
or after bérczerge depending on its morphology: bér+csík (after bérczerge)
or bérc+sík (before bérczerge). This can be decided only with a
morphological/semantic parser, which is probably not worth doing because the
problem practically never turns up.

http://budling.nytud.hu/~szigetva/etcetera/Hungarian/sorting.html
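
To make the ambiguity concrete, here is a small Python sketch (not REBOL) in which the morpheme boundary is supplied by hand as a "+", since no parser here can find it. The words are taken as bér+csík, bérc+sík and bérc+zerge, which is how I read the quoted example; PATTERN holds only the letters those words need, and the names are invented for illustration.

```python
# Fragment of the alphabet, just enough for the three words below (assumption).
PATTERN = "b<c<cs<e=é<g<i=í<k<r<s<z"
RANK = {g: level
        for level, group in enumerate(PATTERN.split("<"))
        for g in group.split("=")}
LONGEST_FIRST = sorted(RANK, key=len, reverse=True)

def key(segmented):
    """Rank list for a word whose morphemes are separated by '+'."""
    ranks = []
    for morpheme in segmented.split("+"):
        i = 0
        while i < len(morpheme):
            # Graphemes never span a '+' boundary.
            # Raises StopIteration for letters outside PATTERN; fine for a demo.
            g = next(g for g in LONGEST_FIRST if morpheme.startswith(g, i))
            ranks.append(RANK[g])
            i += len(g)
    return ranks

words = ["bér+csík", "bérc+zerge", "bérc+sík"]
print(sorted(words, key=key))
# ['bérc+sík', 'bérc+zerge', 'bér+csík']
```

Without the "+" a longest-match tokenizer always reads "cs" as one letter, so it can only ever produce the bér+csík ordering.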

--Scott Jones