Collation sequence - proper and efficient sorting of national accented c

[1/30] from: geza67::freestart::hu at: 12-May-2002 0:57

Hello REBOLers, Is there a way to set a national collation sequence for string SORTing in the SYSTEM object? Defining a per character comparator for SORT/COMPARE would be an overkill in terms of speed and efficacy. Thus, is there a way to use the native SORT with speed for ordering national accented characters (which would be otherwise at the very end of the "English" alphabet :-( ) ? -- Best regards, Geza Lakner MD mailto:[geza67--freestart--hu]

[2/30] from: gscottjones::mchsi::com at: 12-May-2002 6:31

Re: Collation sequence - proper and efficient sorting of national accent

From: "Geza Lakner MD"

> Hello REBOLers, > Is there a way to set a national collation sequence for string

<<quoted lines omitted: 3>>

> ordering national accented characters (which would be otherwise at > the very end of the "English" alphabet :-( ) ?

Hi, Geza,

As a special challenge for myself, I undertook the Czech alphabet last year. It was a special challenge both because I didn't know the Czech language and because the Czech alphabet had several special cases (namely "ch").

Here is a link to the first release I did for the Czech alphabet:
http://www.escribe.com/internet/rebol/m10477.html

A character was missing, so the following contained the final character map:
http://www.escribe.com/internet/rebol/m10493.html

This was just one approach that I could envision at the time. If I recall correctly, Volker had some other ideas, so you may want to check more of the various threads to see some of the ideas and pitfalls that were discovered.

Here was the post that began the challenge:
http://www.escribe.com/internet/rebol/m10350.html
Here was the post that began the next sequence:
http://www.escribe.com/internet/rebol/m10414.html
The final sequence began with the first link given in this current email.

Let me know if you have any questions. Good luck.
--Scott Jones

[3/30] from: geza67:freestart:hu at: 12-May-2002 16:23

Hello G.,

> Let me know if you have any questions. Good luck.

Huhh, pretty scary task NOT to be a REBOLer Englishman ;-) I will try some mapping - it seems the most obvious way to me. Besides, I would like to sort first names: as a "loose" (lousy, lazy etc ;-) ) solution it would suffice to map the starting special vowels. In that way the Hungarian language is not so exotic, having only accented characters as national specialities, like: á é í ó ö ú ü õ û (well, the last two _both_ should normally have double acute accents on top of them, not a tilde or circumflex :-( ) -- Best regards, Geza mailto:[geza67--freestart--hu]

[4/30] from: gscottjones::mchsi::com at: 12-May-2002 14:55

Hungarian Alphabet Sort (was Re: Collation sequence - proper and efficie

Hi, Geza,

The Czech sorter was alpha, and never moved beyond that to a more generic solution due to apparent lack of general interest and time. I was trying to remember how I had mapped the characters, and one thing led to another, and I suddenly had a Hungarian Alphabet sorter! Strange how that happens. It was a bit of a puzzle, because my neurofibrillary tangles have been getting worse and worse (warning for the non-medical types: medical internist humor alert).

Here is my alpha release of the Hungarian sorter. Watch for line breaks. The one thing that I was unable to be certain about was the sort order for the "diaeresis" versus "double acute" forms of "o" and "u". If I have selected the wrong order, it is a very simple matter for me to fix this.

Let me know how it works!
--Scott Jones

################################################
REBOL [
    Title: "Hungarian Language Sort Function"
    Date: 12-May-2002
    Version: 0.0.1
    Author: "G. Scott Jones, M.D."
    File: %hungarian-sort.r
    Purpose: {Sort support for Hungarian alphabet}
    Comment: {This is the first alpha release for the
        Hungarian language sort. It is based on the alpha
        of my Czech Sort of 2001. For these early versions,
        I've rolled the character sort list into this file
        for convenience. The routine is currently hard coded
        for Hungarian language only, but will readily be made
        more generic for other languages. The code is heavily
        commented for easy interpretation by others. The routine
        could also be rewritten to be a wrapper for REBOL 'sort,
        with a path refinement allowing for alternative language
        support. The to-do list is so long as to make it
        pointless for me to list at this stage. ;-)
        Now, I'll post to the list for review.

        USAGE:
        hungarian-sort series /case /reverse
    }
    History: [
        0.0.1 [12-May-2002 {First released for alpha review} "GSJ"]
    ]
]

char-list: {32 1 32  
33 2 33 !
34 3 34 "
35 4 35 #
36 5 36 $
37 6 37 %
38 7 38 &
39 8 39 '
40 9 40 (
41 10 41 )
42 11 42 *
43 12 43 +
44 13 44 ,
45 14 45 -
46 15 46 .
47 16 47 /
48 17 48 0
49 18 49 1
50 19 50 2
51 20 51 3
52 21 52 4
53 22 53 5
54 23 54 6
55 24 55 7
56 25 56 8
57 26 57 9
58 27 58 :
59 28 59 ;
60 29 60 <
61 30 61 =
62 31 62 >
63 32 63 ?
64 33 64 @
97 34 97 a 0061 LATIN SMALL LETTER A
225 35 225 a' 00e1 LATIN SMALL LETTER A WITH ACUTE
65 69 65 A 0041 LATIN CAPITAL LETTER A
193 70 193 A' 00c1 LATIN CAPITAL LETTER A WITH ACUTE
98 36 98 b 0062 LATIN SMALL LETTER B
66 71 66 B 0042 LATIN CAPITAL LETTER B
99 37 99 c 0063 LATIN SMALL LETTER C
67 72 67 C 0043 LATIN CAPITAL LETTER C
100 38 100 d 0064 LATIN SMALL LETTER D
68 73 68 D 0044 LATIN CAPITAL LETTER D
101 39 101 e 0065 LATIN SMALL LETTER E
233 40 233 e' 00e9 LATIN SMALL LETTER E WITH ACUTE
69 74 69 E 0045 LATIN CAPITAL LETTER E
201 75 201 E' 00c9 LATIN CAPITAL LETTER E WITH ACUTE
102 41 102 f 0066 LATIN SMALL LETTER F
70 76 70 F 0046 LATIN CAPITAL LETTER F
103 42 103 g 0067 LATIN SMALL LETTER G
71 77 71 G 0047 LATIN CAPITAL LETTER G
104 43 104 h 0068 LATIN SMALL LETTER H
72 78 72 H 0048 LATIN CAPITAL LETTER H
105 44 105 i 0069 LATIN SMALL LETTER I
237 45 237 i' 00ed LATIN SMALL LETTER I WITH ACUTE
73 79 73 I 0049 LATIN CAPITAL LETTER I
205 80 205 I' 00cd LATIN CAPITAL LETTER I WITH ACUTE
106 46 106 j 006a LATIN SMALL LETTER J
74 81 74 J 004a LATIN CAPITAL LETTER J
107 47 107 k 006b LATIN SMALL LETTER K
75 82 75 K 004b LATIN CAPITAL LETTER K
108 48 108 l 006c LATIN SMALL LETTER L
76 83 76 L 004c LATIN CAPITAL LETTER L
109 49 109 m 006d LATIN SMALL LETTER M
77 84 77 M 004d LATIN CAPITAL LETTER M
110 50 110 n 006e LATIN SMALL LETTER N
78 85 78 N 004e LATIN CAPITAL LETTER N
111 51 111 o 006f LATIN SMALL LETTER O
243 52 243 o' 00f3 LATIN SMALL LETTER O WITH ACUTE
245 53 245 o" 0151 LATIN SMALL LETTER O WITH DOUBLE ACUTE
246 54 246 o: 00f6 LATIN SMALL LETTER O WITH DIAERESIS
79 86 79 O 004f LATIN CAPITAL LETTER O
211 87 211 O' 00d3 LATIN CAPITAL LETTER O WITH ACUTE
213 88 213 O" 0150 LATIN CAPITAL LETTER O WITH DOUBLE ACUTE
214 89 214 O: 00d6 LATIN CAPITAL LETTER O WITH DIAERESIS
112 55 112 p 0070 LATIN SMALL LETTER P
80 90 80 P 0050 LATIN CAPITAL LETTER P
113 56 113 q 0071 LATIN SMALL LETTER Q
81 91 81 Q 0051 LATIN CAPITAL LETTER Q
114 57 114 r 0072 LATIN SMALL LETTER R
82 92 82 R 0052 LATIN CAPITAL LETTER R
115 58 115 s 0073 LATIN SMALL LETTER S
83 93 83 S 0053 LATIN CAPITAL LETTER S
116 59 116 t 0074 LATIN SMALL LETTER T
84 94 84 T 0054 LATIN CAPITAL LETTER T
117 60 117 u 0075 LATIN SMALL LETTER U
250 61 250 u' 00fa LATIN SMALL LETTER U WITH ACUTE
251 62 251 u" 0171 LATIN SMALL LETTER U WITH DOUBLE ACUTE
252 63 252 u: 00fc LATIN SMALL LETTER U WITH DIAERESIS
85 95 85 U 0055 LATIN CAPITAL LETTER U
218 96 218 U' 00da LATIN CAPITAL LETTER U WITH ACUTE
219 97 219 U" 0170 LATIN CAPITAL LETTER U WITH DOUBLE ACUTE
220 98 220 U: 00dc LATIN CAPITAL LETTER U WITH DIAERESIS
118 64 118 v 0076 LATIN SMALL LETTER V
86 99 86 V 0056 LATIN CAPITAL LETTER V
119 65 119 w 0077 LATIN SMALL LETTER W
87 100 87 W 0057 LATIN CAPITAL LETTER W
120 66 120 x 0078 LATIN SMALL LETTER X
88 101 88 X 0058 LATIN CAPITAL LETTER X
121 67 121 y 0079 LATIN SMALL LETTER Y
89 102 89 Y 0059 LATIN CAPITAL LETTER Y
122 68 122 z 007a LATIN SMALL LETTER Z
90 103 90 Z 005a LATIN CAPITAL LETTER Z
91 104 91 [
92 105 92 \
93 106 93 ]
94 107 94 ^
95 108 95 _
96 109 96 `
123 110 123 {
124 111 124 |
125 112 125 }
126 113 126 ~
133 114 133 a}

;;;;set up sort data structures
data: copy []
data: parse/all char-list "^/"

;make regular sort map
hu-reg: copy data
forall hu-reg [hu-reg/1: to-integer first parse hu-reg/1 none]
hu-reg: head hu-reg

;make case-sensitive sort map
hu-case: copy data
mysort: func [a b] [
    (to-integer pick parse a none 2) < (to-integer pick parse b none 2)
]
;rearrange the list based on second field
sort/compare hu-case :mysort
forall hu-case [hu-case/1: to-integer first parse hu-case/1 none]
hu-case: head hu-case

;;;;new sort function
;not all 'sort refinements yet supported
;local words have not been specified
;error condition roll-back of block to original not yet added
hungarian-sort: func [:blk /case /reverse][
    either case [order: hu-case][order: hu-reg]
    ;backup for future error checking and roll-back
    blk-backup: copy blk
    forall blk [
        ;swap index position for characters
        temp: copy []
        foreach b blk/1 [
            t: find order to-integer b
            append temp index? t
        ]
        blk/1: temp
    ]
    blk: head blk
    ;sort through REBOL 'sort
    either reverse [
        sort/reverse blk
    ][
        sort blk
    ]
    forall blk [
        temp: copy []
        ;change index integer back to characters
        foreach b blk/1 [append temp to-char order/:b]
        ;make a word out of characters
        blk/1: copy rejoin temp
    ]
    ;reset head and block returns changed
    blk: head blk
]

;;;;now for some testing
;these may not be official spellings - it is just what I had available
months: ["január" "február" "március" "április" "május" "június"
    "július" "augusztus" "szeptember" "október" "november" "december"]
hungarian-sort months
print ["Check month sort: " equal? months ["augusztus" "április"
    "december" "február" "január" "július" "június" "május" "március"
    "november" "október" "szeptember"]]
;foreach m months [print m]
hungarian-sort/case months
print ["Check month sort/case: " equal? months ["augusztus" "április"
    "december" "február" "január" "július" "június" "május" "március"
    "november" "október" "szeptember"]]
;foreach m months [print m]
hungarian-sort/reverse months
print ["Check month sort/reverse: " equal? months ["szeptember" "október"
    "november" "március" "május" "június" "július" "január" "február"
    "december" "április" "augusztus"]]
;foreach m months [print m]

days: ["hétfo" "kedd" "szerda" "csütörtök" "péntek" "szombat" "vasárnap"]
hungarian-sort days
print ["Check day sort: " equal? days ["csütörtök" "hétfo" "kedd"
    "péntek" "szerda" "szombat" "vasárnap"]]
;foreach d days [print d]
hungarian-sort/case days
print ["Check day sort/case: " equal? days ["csütörtök" "hétfo" "kedd"
    "péntek" "szerda" "szombat" "vasárnap"]]
;foreach d days [print d]
hungarian-sort/reverse days
print ["Check day sort/reverse: " equal? days ["vasárnap" "szombat"
    "szerda" "péntek" "kedd" "hétfo" "csütörtök"]]
;foreach d days [print d]

word-sample: ["január" "február" "március" "április" "május"
    "június" "július" "augusztus" "szeptember" "október" "november"
    "december" "hétfo" "kedd" "szerda" "csütörtök" "péntek"
    "szombat" "vasárnap" "nulla" "egy" "kettő" "három" "négy"
    "öt" "hat" "hét" "nyolc" "kilenc" "tíz" "tizenegy" "húsz"
    "huszonegy" "harmincegy" "negyvenegy" "ötvenegy" "hatvanegy"
    "hetvenegy" "nyolcvanegy" "kilencvenegy" "száz" "ezer"
    "ezeregyszáz" "tízezer" "ötvenezer" "százezer" "millió"
    "milliárd" "Igen" "Nem" "Kérem" "Köszönöm" "Szervusz"
    "Viszontlátásra" "Magyar" "Magyarország" "Hogy" "van"
    "vagy" "Mit" "csinálsz" "Bocsánat" "vagyok" "Hol" "szép"
    "ország" "Segítene" "Fáradt" "Segítség" "a" "Kanadai"
    "Amerikai" "Hany" "óra" "Merre" "kell" "menni" "Az"
    "EMERGE" "Európai" "Unió" "Információs" "Társadalom"
    "Technológiájával" "foglalkozó" "projektje" "amely"
    "aktívan" "támogatja" "és" "más" "közép" "kelet" "országok"
    "részvételét" "EU" "által" "finanszírozott" "IST"
    "projektekben" "Informál" "keretprogram" "műszaki"
    "megoldásaival" "projektjeiről" "tájékoztat" "ben" "induló"
    "keretprogramról" "Tanszékünk" "Budapesti"
    "Gazdaságtudományi" "Egyetem" "Távközlési" "Telematikai"
    "Tanszéke" "projekt" "hazai" "partnere" "vonatkozása" "fő"
    "fázisból" "áll" "konferencia" "megszervezése" "Budapesten"
    "melynek" "során" "munkája" "iránt" "érdeklődők"
    "személyesen" "is" "bemutatkozhatnak" "egymásnak"
    "nyújtása" "intézményeknek" "ahhoz" "csatlakozhassanak"
    "jelenleg" "futó" "projektekhez" "abban" "partnereket"
    "találjanak" "jövőbeliekhez" "Tájékoztatás" "arról"
    "milyen" "gazdasági" "helyzet" "projektben" "résztvevő"
    "úgynevezett" "iparában" "ezek" "javarészt" "ezen" "belül"
    "Magyarországon" "itt" "olvasható" "információk"
    "frissített" "változata" "elejére" "várható" "Szintén"
    "ehhez" "fázishoz" "tartozik" "elkövetkező" "évre"
    "vonatkozó" "Uniós" "kutatási" "programról" "első" "fázisa"
    "októberében" "második" "pedig" "befejeződött" "Mindezekről"
    "bővebb" "információt" "Archívum" "pont" "alatt" "találhat"
    "bal" "oldali" "menüben" "harmadik" "fázis" "erről" "többet"
    "távközlés" "helyzete" "országokban" "pontok" "tudhat" "meg"
    "újdonság" "Híradástechnika" "című" "folyóiratban" "hamarosan"
    "megjelenik" "fázisban" "megrendezett" "konferencián"
    "elhangzott" "előadásokból" "háromnak" "nyelvű" "írott"]

hungarian-sort word-sample
;foreach d word-sample [print d]
hungarian-sort/reverse word-sample
;foreach d word-sample [print d]
hungarian-sort/case word-sample
;foreach d word-sample [print d]

[5/30] from: geza67:freestart:hu at: 12-May-2002 23:34

Re: Hungarian Alphabet Sort (was Re: Collation sequence - proper and eff

Hello Scott,

Thanx, cute solution! Though I have some critical comments :-) :

- The right order for Hungarian vowels: actually the diaeresis characters come first and then the double acute ones (only o and u have double accents in the Hungarian alphabet):
oOöő
uUüű

- Unfortunately the case-insensitiveness does not work. Look:
hungarian-sort ["alom" "�lom" "�lom" "�llam"]
== ["alom" "�lom" "�llam" "�lom"]
Though it should read: alom �llam �lom �lom.

- The /case refinement gives the same result as the one without it :-( :

>> hungarian-sort/case ["alom" "�lom" "�lom" "�llam"]

== ["alom" "�lom" "�llam" "�lom"]

The case-sensitive collation sequence IMHO would be a bit different from the one you have defined, namely: aAáÁ...eEéÉ... Your order was: aáAÁ...eéEÉ... - and so on for all affected special accented chars. -- Best regards, Geza mailto:[geza67--freestart--hu]

[6/30] from: gscottjones:mchsi at: 12-May-2002 21:11

From: "Geza Lakner MD"

<snip> > The right order for Hungarian vowels: actually the diaeresis characters > come first and then the double acute ones (only o and u have double > accents in the Hungarian alphabet): > oOöő > uUüű

This was easy to fix.

> Unfortunately the case-insensitiveness does not work. Look: > hungarian-sort ["alom" "�lom" "�lom" "�llam"] > == ["alom" "�lom" "�llam" "�lom"] > > Though it should read: > alom �llam �lom �lom.

Yes, this is a problem. My current algorithm will not easily accommodate this change. I now can even remember thinking last year that the approach might cause a problem, but the test samples presented apparently did not detect this problem at that time. Hmmm. Time to go back to the drawing board. I already have an idea, but it may take a while before I have some time to create the new algorithm.
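One common way to get the behaviour Geza asks for, where case only breaks ties between otherwise identical words, is to give each word a two-level key: a primary part that ignores case, followed by a secondary part that encodes it. The sketch below only illustrates that general technique under stated assumptions; it is not the algorithm Scott goes on to develop. The names `order`, `two-level-key` and `tie-break-sort` are hypothetical, and the demo is limited to unaccented letters.

```rebol
REBOL [Title: "Two-level sort key sketch"]

; Build the interleaved order "aAbBcC...zZ" (small before capital).
order: copy ""
repeat i 26 [
    append order to-char 96 + i    ; small letter
    append order to-char 64 + i    ; capital letter
]

two-level-key: func [
    "Primary ranks (case ignored), a separator, then case-aware ranks."
    word [string!] /local prim sec pos idx
][
    prim: copy ""
    sec: copy ""
    foreach ch word [
        pos: find/case order ch
        idx: either pos [index? pos] [199]   ; unknown chars rank last
        append sec to-char idx + 1           ; secondary: a and A differ
        idx: to-integer (idx + 1) / 2        ; primary: a and A share a rank
        append prim to-char idx + 1
    ]
    ; every rank is 2 or more, so separator 1 puts shorter words first
    rejoin [prim to-char 1 sec]
]

tie-break-sort: func [words [block!] /local buf out] [
    buf: copy []
    foreach w words [
        append buf two-level-key w
        append buf w
    ]
    sort/case/skip buf 2    ; native sort on the key of each key/word pair
    out: copy []
    foreach [key w] buf [append out w]
    out
]

probe tie-break-sort ["Bob" "alom" "Alma" "bob" "alma"]
; intended order: ["alma" "Alma" "alom" "bob" "Bob"]
```

A single-level mapping with the same `order` string would place "alom" before "Alma", because every word starting with a small "a" would precede every word starting with a capital "A". Accented letters could be handled the same way by extending `order`, but which letters share a primary rank is a language-specific decision that this sketch does not attempt to make.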

> - The /case refinement results in the same result as the one without > it :-( :

<<quoted lines omitted: 5>>

> Your order was: > aáAÁ...eéEÉ...

There end up being two issues at work here. Having the order as aáAÁ...eéEÉ... was not my intention. What I was aiming to do was aá..eé..AÁ..EÉ... which may also not seem correct to you; however, this behavior mirrors REBOL's default behavior for the /case switch, but does differ in placing the little letters before the capital letters. Petr K. said that this was the more normal method in Eastern Europe (Czech language in his case). So I was trying to reflect this pattern, but did make the one ordering error. The REBOL 'sort /case switch will sort all the words first by whether the letter is capital or not. In fact, REBOL places all the words that begin in capital letters _before_ the words that begin in small letters (because of the ASCII number assigned to the letters). Maybe we need an additional switch that allows for the Eastern European desire to have smalls before capitals, and to interleave these together as you suggest. Sometimes it would be handy to have these options too here in the US. Just need a clever name or names for these switches (or paths in REBOLese). Any ideas are welcomed.

> - and so on for all affected special accented chars.

and so on for life in general! :-) I'll repost after I have a chance to develop the new algorithm that I have in mind. "Stay tuned" Thanks for your feedback! --Scott Jones

[7/30] from: geza67:freestart:hu at: 13-May-2002 20:29

Hello Scott!

>> The right order for Hungarian vowels: actually the diaeresis characters > This was easy to fix.

... as you have prospectively pointed it out in your first post :-)

> Time to go back to the drawing board. I already have an idea, but it may > take a while before I have some time to create the new algorithm.

Good luck to "braining out" the new enhanced algorithm. :-)

> There end up being two issues at work here. Having the order as > aáAÁ...eéEÉ... > was not my intention. What I was aiming to do was > aá..eé..AÁ..EÉ... > which may also not seem correct to you; however, this behavior mirrors

Ah, so! No, this is quite right: small letters first , then capitals. I just thought you were aiming at an "interwoven" collation sequence.

> REBOL's default behavior for the /case switch, but does differ in placing

REBOL seems (more and more to me) English-oriented which is very peculiar, Carl being a German fellow (do I know it right?) Has he forgotten the handling of his native language special characters - like the German-only a-umlaut ? ;-)

> the little letters before the capital letters. Petr K. said that this was > the more normal method in eastern europe (Czech language in his case). So I

It is the normal method in Hungarian, as well.

> letter is capital or not. In fact, REBOL places all the words that begin in > capital letters _before_ the words that begin in small letters (because of > the ascii number assigned to the letters).

The problem is - IMHO - that REBOL does not allow _really_ custom sorts: one can write a function for the /compare refinement, but this refinement is not as general-purpose as it first seems. Maybe mathematicians can use custom comparisons for e.g. complex numbers, but the refinement cannot easily be adapted to custom-ordered series values, as is the case with strings. Specifying collation order for strings is the first step to internationalization. Europe being a huge and linguistically non-homogeneous market, RT should adopt a "plugin"-style localization: the 'locale object seems to be the right place for this, i.e. putting custom collation sequences there.

> Maybe we need an additional switch that allows for the eastern european > desire to have smalls before capitals, and to interleave these together as

Maybe I missed this in the English class :-) but does NOT sort English this way, too? What is the proper sorting order for mixed capitalized English words?

> you suggest. Sometimes it would be handy to have these options too here in

On what occasion do you think it would be necessary for you (disregarding the special cases for writing custom softwares for Eastern Europe ;-) ) ?

> the US. Just need a clever name or names for these switches (or paths in > REBOLese). Any ideas are welcomed.

The most obvious (and highly uninspired ;-) ) naming would be: /international. Other ideas: /smallsfirst /capitalized

>> - and so on for all affected special accented chars. > and so on for life in general! > :-)

Do not stop generalization here: Life, Universe and everything ... :-))

> I'll repost after I have a chance to develop the new algorithm that I have > in mind. "Stay tuned"

Beep-beep :-) -- Best regards, Geza mailto:[geza67--freestart--hu]

[8/30] from: carl::cybercraft::co::nz at: 14-May-2002 9:21

Re: Hungarian Alphabet Sort (was Re: Collation sequence -proper and effi

On 14-May-02, Geza Lakner MD wrote:

> Maybe I missed this in the English class :-) but does NOT sort > English this way, too? What is the proper sorting order for mixed > capitalized English words?

Something I'd not thought about. This is what REBOL does...

>> sort "AabB"

== "AabB"

>> sort/case "AabB"

== "ABab" I expected sort/case to return "aAbB"... Would a sort/pattern be of use? ie... sort/pattern "AabB" "aAbBcC" ; == "aAbB" -- Carl Read
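Carl's proposed sort/pattern refinement does not exist in REBOL, but its effect on a single string can be approximated with ordinary series functions. The following is a minimal sketch under that assumption: `sort-by-pattern` is a hypothetical helper name, and each character is ranked by its position in the pattern before the native sort is applied to the ranks.

```rebol
REBOL [Title: "sort/pattern sketch"]

; sort-by-pattern is a hypothetical helper, not a built-in refinement.
sort-by-pattern: func [
    "Return a copy of str with its characters ordered as in pattern."
    str [string!] pattern [string!]
    /local keyed pos out
][
    keyed: copy []
    foreach ch str [
        ; find/case, because plain find treats "a" and "A" as equal
        pos: find/case pattern ch
        ; record = rank then char; unknown chars rank after the pattern
        append keyed either pos [index? pos] [1 + length? pattern]
        append keyed ch
    ]
    sort/skip keyed 2            ; native sort on the integer ranks
    out: copy ""
    foreach [rank ch] keyed [append out ch]
    out
]

probe sort-by-pattern "AabB" "aAbBcC"
; intended result, as in Carl's proposal: "aAbB"
```

Characters that share a rank (for example several characters missing from the pattern) may come out in any relative order, since nothing beyond the rank is compared.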

[9/30] from: sunandadh::aol::com at: 13-May-2002 19:19

Re: Hungarian Alphabet Sort (was Re: Collation sequence - proper ...

Geza:

> Specifying collation > order for strings is the first step to internationalization. Being Europe a > huge and linguistically not homogenous market, RT should adopt > a "plugin"-style localization: the 'locale object seems to be a right > place to this, i.e. putting custom collation sequences there

It's worth someone from RT taking a look at how MySQL handles adding new character sets and collating sequences -- it's pretty complete. Although it's worth pointing out that they don't handle all the subtleties needed across Europe. One tiny example. German names in phone books may have a different collating sequence to words in a dictionary, and Austrian phone books use a different ordering to German ones. Useful Mysql reference: http://www.unixtech.be/docs/mysql/manual_Server.html#String_collating Sunanda.

[10/30] from: gscottjones:mchsi at: 13-May-2002 18:32

Re: Hungarian Alphabet Sort (was Re: Collation sequence -proper and effi

From: "Carl Read"

> Would a sort/pattern be of use? ie... > > sort/pattern "AabB" "aAbBcC" ; == "aAbB"

Hi, Carl, Are you suggesting this as a prototype of a call, or is there already such a beast out there? At any rate, this is an interesting idea as a way of introducing new or different sort patterns. I'll have to think about it a bit. If this already exists, certainly let us know. Thanks! --Scott Jones

[11/30] from: gscottjones:mchsi at: 13-May-2002 21:43

Re: Hungarian Alphabet Sort (was Re: Collation sequence - proper and eff

From: "Geza Lakner MD"

>> Time to go back to the drawing board. I already have an idea, but it may >> take a while before I have some time to create the new algorithm. > Good luck to "braining out" the new enhanced algorithm. :-)

I think I've got it. In fact I'm expanding the idea to handle all the languages that use the ISO-8859-2 character set. Using the basic underlying technique that I used before, I can set up sorting orders to accomplish any desired goal. The biggest problem that I am running into is the actual sorting order. You've helped with the Hungarian language, and Petr/Cyphre helped with the Czech language. But I show that the following languages all (can) use the same character set: Albanian, Bosnian, Croatian, Czech, English, Finnish, Hungarian, Irish, German, Polish, Romanian, Serbian (Latin transcription), Slovak, Slovenian, Sorbian (Lusatian). I am collating *all* the characters/codes for each and I am making a few blind stabs at the sorting order, but it is not obvious to this chap from the US. Then there are the exceptions like "ch" from Czech, and the German "sharp s" (small letter only). Wow. I previously figured out how to manage the "ch" conundrum from the Czech language, but I guess in a grand unified scheme for managing the ISO-8859-2 character set, it would require a refinement/switch to instantiate this sort of exception. You know, someone ought to invent a unified character representation and maybe call it ... hmmm .. , let me see, maybe "Unicode" for example. ;) Seriously, I've had *no* experience with whether Unicode necessarily makes sorting any easier. My guess is "no".

>> The problem is - IMHO - that REBOL does not allow _really_ custom >> sorts: although one can write a /compare refinement function but this >> refinement is not so general-aimed as it seems first. Maybe >> mathematicians can use custom comparisons for e.g. complex numbers, >> but the refinement can not easily be accommodated to custom-order >> series values, as it is in the case of strings. ...

I've used the /compare refinement several times and have found that it is usable within its limits. But as Petr and I discussed last year, it does not appear to lend itself to the type of sorting problems that we are using here. Last year, I originally began to develop a complex /compare algorithm, until it dawned on me that I could develop a more generic solution using substitution, and then take advantage of the speed of the native!-level 'sort. If I recall correctly, some samples showed that the current method was significantly faster than a /compare function used alone. I may be mis-remembering this fact, so don't "go to the bank" on it (take it too seriously).
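For contrast with the substitution approach, this is roughly what a pure /compare solution looks like. It is a sketch only: `order`, `rank` and `before?` are hypothetical names, the order string is a small demo alphabet, and no speed measurements are implied. The comparator returns a logic value, as `mysort` does in the Hungarian script.

```rebol
REBOL [Title: "sort/compare collation sketch"]

; Assumed demo order (small before capital); extend it as needed.
order: "aAbBcCdD"

rank: func [ch [char!] /local pos] [
    pos: find/case order ch
    either pos [index? pos] [1 + length? order]   ; unknown chars rank last
]

before?: func [
    "True when word a sorts before word b under the order string."
    a [string!] b [string!] /local ra rb
][
    while [all [not tail? a  not tail? b]] [
        ra: rank first a
        rb: rank first b
        if ra <> rb [return ra < rb]
        a: next a
        b: next b
    ]
    ; one word is a prefix of the other: the shorter one goes first
    to-logic all [tail? a  not tail? b]
]

words: ["bab" "Ada" "ad" "aB"]
probe sort/compare words :before?
; intended order: ["aB" "ad" "Ada" "bab"]
```

The comparator is interpreted code that runs once per comparison, whereas a substitution scheme translates each word once and leaves all comparisons to the native sort. That difference is the likely reason for the speed gap Scott recalls, though the thread gives no timings.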

>> Specifying collation order for strings is the first step to internationalization. >> Being Europe a huge and linguistically not homogenous market, RT should >> adopt a "plugin"-style localization: the 'locale object seems to be a right >> place to this, i.e. putting custom collation sequences there.

I suspect that RT has already given this some thought, and has probably some general idea about the "right way" to go about it. (They seem to have done this about so many things that I doubt that they have neglected this important area.) My *guess* is that they need to make some money before they can make this next big step in making a truly internationalizable product. Tcl has supported Unicode for some time, so I know that it is certainly do-able at a base level. My ignorance begins in where to go from Unicode. I leave that speculation to the people that actually know what they are doing with computers! (I sleep better at night that way. You should too!)

>> Maybe we need an additional switch that allows for the eastern european >> desire to have smalls before capitals, and to interleave these together

> Maybe I missed this in the English class :-) but does NOT sort English this way, too? > What is the proper sorting order for mixed capitalized English words?

I hate to be the sole source in this area; I would much rather hear from someone who knew a great deal more about computer science, knew English (I just pretend to in order to make a living), and was infinitely more articulate than myself (Joel? Sunanda? et al. Is anyone else here?). However, I'm never overly embarrassed to make a complete fool of myself, so ... What must be distinguished is the difference between the proper sort and the way that computers have done it "easily" to date. I frankly don't know if there is considered to be a proper sort in *American* English (we are hardly proper about much at all except how to get in to a proper war! ;), specifically small letters before capital letters. I feel sure that others *do* know (I'm just a doctor AND I don't play one on television! Bad joke that requires being an avid watcher of US television advertisements ... no one ever gets it even here, so don't worry). What I do recall is that **computer** sorting has historically been most easily accomplished by using the ASCII character set representation of the alphabet. As you likely already know, "A" is 65, "B" is 66 ... , and "a" is 97, "b" is 98. Non-case-sensitive sorts will do an implicit reduction of the "small" cases to the capital cases by subtracting 32. (In the old days of assembler language, this only required a computationally cheap operation, clearing the single bit worth 32, for the letter bytes over 96.) Since the capital letters came (in ASCII) before the small letters, case-sensitive sorts placed the capital letters first. The legacy of the computer age then treats the "natural" sort as the one that places the capital letters first. Please, someone slap me down if I have this totally wrong.
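The ASCII arithmetic described above can be checked directly at the console. This is a small sketch; `fold-upper` is a hypothetical helper written only to show the subtraction, and it deliberately handles nothing beyond the unaccented letters a to z.

```rebol
REBOL [Title: "ASCII case arithmetic sketch"]

print to-integer #"A"    ; 65
print to-integer #"a"    ; 97

; fold-upper is a hypothetical helper, shown only to illustrate the arithmetic
fold-upper: func [ch [char!] /local n] [
    n: to-integer ch
    either all [n >= 97  n <= 122] [to-char n - 32] [ch]
]

print fold-upper #"b"              ; B
print fold-upper #"B"              ; B (already a capital, left alone)
print (to-integer #"b") and 223    ; 66, clearing the bit worth 32
```

The accented letters of ISO-8859-2 live above 127 and are not covered by the 97 to 122 range test, which is one reason a plain ASCII-based sort leaves them at the end of the alphabet.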

>> you suggest. Sometimes it would be handy to have these options too here

> On what occasion do you think it would be necessary for you > (disregarding the special cases for writing custom softwares for > Eastern Europe ;-) ) ?

Having a sort that went by case-insensitive letter with the option of placing one type before the other would seem convenient (and would look nicer), but I honestly can not tell you a specific time that this requirement happened. (Remember, I've been exposed to heavy levels of lead for too many years .... I've got to stop eating those lead paint chips!! Maybe it is time to switch to mercury... ;)

>> the US. Just need a clever name or names for these switches (or paths in >> REBOLese). Any ideas are welcomed.

<<quoted lines omitted: 3>>

> /smallsfirst > /capitalized

I think these are some great ideas! Thanks again for the feedback and stimulus. (Stimulus -> Response, Stimulus -> Response ... it works ... at least in the laboratory ;) --Scott Jones

[12/30] from: carl:cybercraft at: 14-May-2002 18:05

Re: Hungarian Alphabet Sort (was Re: Collation sequence -proper and effi

On 14-May-02, G. Scott Jones wrote:

> From: "Carl Read" >> Would a sort/pattern be of use? ie...

<<quoted lines omitted: 3>>

> Are you suggesting this as a prototype of a call, or is there > already such a beast out there?

As a prototype of a call - ie, as an extra refinement to 'sort. Obviously... sort/pattern "AabB" ["a" "A" "b" "B" "c" "C" "ch"] etc. should also be supported.

> At any rate, this is an interesting idea as a way of introducing new > or different sort patterns. I'll have to think about it a bit. > If this already exists, certainly let us know.

Not as far as I know. Send the idea to Feedback if you think it'd be useful. Who knows, it might be something that's quick and easy to add to REBOL. -- Carl Read

[13/30] from: nitsch-lists:netcologne at: 14-May-2002 12:09

Re: Hungarian Alphabet Sort (was Re: Collation sequence - proper and eff

Hi Scott, Geza, Carl,

not sure if this helps, but since i spent some time on it, i post ;)

rebol [title: "char-mapping"]
{
Hi Scott, Geza, Carl,
instead of creating the mapping fully by hand, i created a little dialect,
which creates a parse-rule. (not a very efficient one currently. contest! ;)
just a demo, lacks all special chars currently.
}
mapper: context [

    {===patch the default mapping with your local specialities}
    customize-ascii: [
        at "h" [+ "ch"]
        at "H" [+ "CH" = "Ch"]
    ]

    "===logical mapping to insert / change easily"
    logical-mapping: copy []
    'like [.. + "G" + "H" + "CH" = "Ch" + "I" ..]
    "fill with ascii (attention 0-based.. ;)"
    repeat i 128 [
        append logical-mapping compose [+ (to string! to char! i - 1)]
    ]
    {now one could, for example,
    exchange upper & lower chars with some rebol-moves}

    "===evaluate 'customize, insert custom strings"
    parse customize-ascii [some [
        'at set string string! set block block! (
            insert find/case/tail logical-mapping string block
        )
    ]]
    ? logical-mapping

    "===numbered mapping to have the translation-codes"
    numbered-mapping: copy []
    'like [.. "G" 71 "H" 72 "CH" 73 "Ch" 73 "I" 74 ..]
    next-char: -1
    parse logical-mapping [some [
        ['+ (next-char: next-char + 1) | '=] set string string!
        (repend numbered-mapping [string next-char])
    ]]
    ? numbered-mapping

    "===mapping rule to translate"
    mapping-rule: cp []
    'like [.. | "CH" (insert tail out #"I") | "H" (insert tail out #"H") | ..]
    {attention: parse needs the longest strings first, so we reverse!}
    parse head reverse numbered-mapping [some [
        set code integer! set string string! (
            append mapping-rule reduce [
                string to-paren compose [insert tail out (to-char code)] '|
            ]
        )
    ]]
    remove back tail mapping-rule
    ? mapping-rule

    "===and now the mapping-function"
    out: none
    map: func [string] [
        out: cp ""
        parse/all/case string [any mapping-rule]
        out
    ]
    mapped-sort: func [block /local buf] [
        buf: cp []
        foreach string block [repend buf [map string string]]
        sort/skip buf 2
        clear block
        forskip buf 2 [append block second buf]
        block
    ]

    "===test"
    probe mapped-sort [
        "A string with H mapped"
        "A string with I mapped"
        "A string with CH mapped"
    ]
]

[14/30] from: gscottjones:mchsi at: 14-May-2002 8:14

From: "Volker Nitsch" ...

> not sure if this helps, but since i spended some time to it, > i post ;)

<snipped code>

Hi, Volker,

Neat idea. Kind of like a good cut of beef, I'm going to have to chew on it a bit to fully understand its potential. Thanks for the trans-atlantic volleyball pass.

By the way, how should the small sharp s character (code 223 in ISO-8859-2) sort compared to a regular s?

--Scott Jones

[15/30] from: carl:cybercraft at: 16-May-2002 0:30

On 15-May-02, G. Scott Jones wrote:

> From: "Volker Nitsch" > ...

<<quoted lines omitted: 5>>

> chew on it a bit to fully understand its potential. Thanks for the > trans-atlantic volley ball pass.

Glad you could work it out, as I couldn't make head nor tail of it. (:

Anyway, I've played around with my idea for sorting according to a pattern, and while I'm not sure if the following code's very fast (or bug-free:), like Volker, I post.

There's two functions: One to take a pattern for creating a rule from and another to use the rule to sort strings or blocks of strings with. First, the functions...

    pattern-rule: func [
        "Create a rule for use by pattern-sort."
        pattern [string! block!] "An ordered pattern."
        /local rule n
    ][
        rule: copy []
        n: 1
        forall pattern [
            append rule reduce [pattern/1 to-paren reduce ['r n] '|]
            n: n + 1
        ]
        append rule reduce ['skip to-paren reduce ['r n]]
        reduce ['some rule]
    ]

    pattern-sort: func [
        {Sort a string or block of strings based on a rule created by pattern-rule.}
        series [string! block!] "Series to sort."
        rule [block!] "Pattern rule."
        /reverse "Reverse sort order."
        /local ptrs blk r pos val
    ][
        ptrs: copy []
        blk: copy []
        r: func [n][append/only blk n]
        bind rule 'r
        either string? series [
            parse/case series rule
            pos: 1
            foreach n blk [
                append/only ptrs reduce [
                    n pick rule/2 (n - 1) * 3 + 1
                ]
                val: next first back tail ptrs
                if 'skip = val/1 [change val pick series pos]
                pos: pos + either char? val/1 [1][length? val/1]
            ]
        ][
            forall series [
                clear blk
                parse/case series/1 rule
                append/only ptrs copy blk
                append last ptrs series/1
            ]
        ]
        either reverse [sort/reverse ptrs][sort ptrs]
        clear series
        forall ptrs [append series last ptrs/1]
        series
    ]

And some examples of use...

>> rule-1: pattern-rule "aAbBcC"

== [some [#"a" (r 1) | #"A" (r 2) | #"b" (r 3) | #"B" (r 4) | #"c" (r 5) | #"C" (r 6) | skip (r 7)]]

>> pattern-sort "AacCBb" rule-1

== "aAbBcC"

>> pattern-sort ["Abc" "abc" "aBC" "ABC"] rule-1

== ["abc" "aBC" "Abc" "ABC"]

>> pattern-sort/reverse ["Abc" "abc" "aBC" "ABC"] rule-1

== ["ABC" "Abc" "aBC" "abc"]

>> rule-2: pattern-rule "AaBbCc"

== [some [#"A" (r 1) | #"a" (r 2) | #"B" (r 3) | #"b" (r 4) | #"C" (r 5) | #"c" (r 6) | skip (r 7)]]

>> pattern-sort "AacCBb" rule-2

== "AaBbCc"

>> pattern-sort ["Abc" "abc" "aBC" "ABC"] rule-2

== ["ABC" "Abc" "aBC" "abc"]

>> rule-3: pattern-rule ["a" "A" "b" "B" "ch" "c" "C"]

== [some ["a" (r 1) | "A" (r 2) | "b" (r 3) | "B" (r 4) | "ch" (r 5) | "c" (r 6) | "C" (r 7) | skip (r 8)]]

>> pattern-sort "abcABCchCbA" rule-3

== "aAAbbBchcCC"

>> pattern-sort ["AabA" "chab" "chAB" "cchc" "achA"] rule-3

== ["achA" "AabA" "chab" "chAB" "cchc"]

It seems to work and might be of some use, but I'd test it well before trusting it. It's had no real-world tests at all...

-- Carl Read

[16/30] from: gscottjones:mchsi at: 15-May-2002 9:38

From: "Carl Read"

> On 15-May-02, G. Scott Jones wrote: > > From: "Volker Nitsch"

<<quoted lines omitted: 7>>

> > trans-atlantic volley ball pass. > Glad you could work it out, as I couldn't make head nor tail of it. (:

I'm still cooking the meat on the barbie before I can chew it! ;)

> Anyway, I've played around with my idea for sorting according to a > pattern, and while I'm not sure if the following code's very fast (or > bug-free:), like Volker, I post. > > There's two functions: One to take a pattern for creating a rule from > and another to use the rule to sort strings or blocks of strings > with. First, the functions...

<nifty code snipped ... see: http://www.escribe.com/internet/rebol/m22420.html

This is extremely promising. I drew from the ISO-8859-2 character set to make a rule, and it initially seems to sort correctly. The time through is roughly the same as my hack (but I've not really set up a clean timing condition). The only problem so far occurs when I run my word sample list through more than once. It seems to magically have kept the original sort/s and continues to append new results to the block. I cannot seem to find where the problem is occurring. Furthermore, I'm out of computer play-time today, so it will have to wait. :(

Meanwhile, I'm waiting for Volker's serving to slow cook, so I can fully savor it soon!

Thanks for the input. Very exciting because it is so clean!

--Scott Jones

[17/30] from: nitsch-lists:netcologne at: 15-May-2002 18:58

Hi Carl, Scott, Geza,

On Wednesday, 15 May 2002 14:30, Carl Read wrote:

> On 15-May-02, G. Scott Jones wrote: > > From: "Volker Nitsch"

<<quoted lines omitted: 11>>

> > trans-atlantic volley ball pass. > Glad you could work it out, as I couldn't make head nor tail of it. (:

Carl, my volley pass works similarly to yours, except i made it more complicated :)

your pattern-rule "aAbBcC" would look like

    [+ "a" + "A" + "b" + "B" + "c" + "C"]

because i use blocks with strings, i can also map multi-char codes, so [+ "CH"] maps to one char. IIRC "CH" is handled like one char? also i have [+ "CH" = "Ch"]. this says "CH" and "Ch" are the same (add a new code number for "CH" and use the same number for "Ch"). And in my telephone book "ö" is handled like "oe", so one char expands to two. So i need this kind of command?

(Scott, i found no "ss" in this book, because "ss" always has two chars before it and is rarely used. sorry..
http://www.uni-koeln.de/phil-fak/spinfo/lehre/java/kap23/collating.htm#3.3
duden 73: "ß" like "ss", and before it when the words are otherwise the same (argh!). changed since 96: "ß" after "ss", and "ä" the same as "a". telephone book is wrong? or duden? hmm.. back to script.)

first the block is initialized with ascii codes [.. + "@" + "A" ..]. then i could move whole char-blocks around, to mix "aAbB". then comes

    customize-ascii: [
        at "h" [+ "ch"]
        at "H" [+ "CH" = "Ch"]
    ]

which says {find "h" in block and insert [+ "ch"] behind}, same for "H". now i have [.. + "h" + "ch" + "i"]. in a second pass i give numbers to the strings in this order; in a third i create the parse-rule, which translates a string to the sort-encoding. for sorting i mix strings and their translations like [translation1 string1 translation2 string2], sort with sort/skip 2, and extract the strings back to the original block.

hmm, somehow i like your string more. if it could deal with multi-chars.
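The three passes described above (logical mapping, numbering, translation followed by a decorated sort) can be sketched outside REBOL as well. What follows is a minimal illustration in Python, not the code posted in this thread: the names `insert_after`, `build_codes`, `translate` and `mapped_sort` are made up for the example, and the fallback for characters missing from the mapping is an assumption.

```python
def insert_after(logical, anchor, block):
    """Insert block right after the entry whose string equals anchor."""
    idx = next(i for i, (_, text) in enumerate(logical) if text == anchor)
    logical[idx + 1:idx + 1] = block


def build_codes(logical):
    """Number the entries: '+' takes the next code, '=' reuses the current one."""
    codes, current = {}, -1
    for op, text in logical:
        if op == "+":
            current += 1
        codes[text] = current
    return codes


def translate(text, codes):
    """Map text to a tuple of codes, trying the longest entries first."""
    entries = sorted(codes, key=len, reverse=True)
    out, i = [], 0
    while i < len(text):
        for entry in entries:
            if text.startswith(entry, i):
                out.append(codes[entry])
                i += len(entry)
                break
        else:
            # Assumption: characters outside the mapping rank after all known ones.
            out.append(len(codes) + ord(text[i]))
            i += 1
    return tuple(out)


def mapped_sort(strings, codes):
    """Pair each string with its translation, sort the pairs, drop the translation."""
    decorated = sorted((translate(s, codes), s) for s in strings)
    return [s for _, s in decorated]


# Pass 1: ASCII in code order, then the local customisation.
logical = [("+", chr(i)) for i in range(128)]
insert_after(logical, "h", [("+", "ch")])
insert_after(logical, "H", [("+", "CH"), ("=", "Ch")])

# Pass 2: numbering. "H" stays 72, "CH" and "Ch" share 73, "I" moves to 74.
codes = build_codes(logical)

# Pass 3: translate and sort.
result = mapped_sort(
    [
        "A string with H mapped",
        "A string with I mapped",
        "A string with CH mapped",
    ],
    codes,
)
```

With this mapping the three test strings come out in the order H, CH, I, because "CH" is given a code between those of "H" and "I".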

> Anyway, I've played around with my idea for sorting according to a > pattern, and while I'm not sure if the following code's very fast (or > bug-free:), like Volker, I post. >

Good idea :-)

> There's two functions: One to take a pattern for creating a rule from > and another to use the rule to sort strings or blocks of strings

<<quoted lines omitted: 75>>

> It seems to work and might be of some use, but I'd test it well before > trusting it. It's had no real-world tests at all...

greetings volker

[18/30] from: gscottjones:mchsi at: 15-May-2002 16:55

From: "G. Scott Jones"

> From: "Carl Read" > > Anyway, I've played around with my idea for sorting according to a

<<quoted lines omitted: 9>>

> This is extremely promising. I drew from the ISO-8859-2 character set to
> make a rule, and it initially seems to sort correctly. The time through is
> roughly the same as my hack (but I've not really set-up a clean time
> condition). The only problem so far occurs when I run my word sample list
> through more than once. It seems to magically have kept the original sort/s
> and continues to append new results to the block. I cannot seem to find
> where the problem is occurring. Furthermore, I'm out of computer
> "play-time" today, so it will have to wait. :(

Responding to self (I talk to myself sometimes too!). Wow, major latency in getting the posting (I'm not pointing fingers, so no one get their "undies in a bundle" ;-).

Hi, Carl,

The idea did look promising, even for the "multi-letter graphemes" (like the Czech "ch"), but then I believe we run into a limitation of 'parse. The rule for a longer phrase needs to come before the rule for a shorter one that it starts with, so that:

    rule-4: pattern-rule ["a" "A" "b" "B" "c" "C" "h" "H" "ch" "Ch"]

will not correctly sort:

>> pattern-sort ["c" "ch" "h"] rule-4

== ["ch" "c" "h"] ;should be "c" "h" "ch"

At least one other person has mused over the desire to have a pattern sort (in this case under the GNU/Linux sort; look near the bottom):

http://budling.nytud.hu/~szigetva/etcetera/converters/README

In this case, the pattern has a bit more information:

    a=á<b<c<cs<d<e=é<f<g<gy<h<i=í...<z<zs

where "a" can be told to sort the same as "a with acute", both of these sort before "b" ... and "zs" sorts after "z".

Breaking apart this information might allow a parse rule to set up the sequence so that the longer phrase rules come before the shorter ones. At least I think it would work.

Back to Geza, ...

Geza, how important are these "multi-letter graphemes" (cs, dz, dzs, gy, ly, ny, sz, ty and zs) in a sort algorithm? At the same site, Péter Szigetvári indicates that it can get very tricky:

    Unfortunately, the task is not trivial: some sequences that look like
    multi-letter graphemes are in fact not, e.g., bércsík may be ranked
    before or after bérczerge depending on its morphology: bér+csík (after
    bérczerge) or bérc+sík (before bérczerge). This can be decided only
    with a morphological/semantic parser, which is probably not worth doing
    because the problem practically never turns up.

http://budling.nytud.hu/~szigetva/etcetera/Hungarian/sorting.html

--Scott Jones
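To make the "breaking apart" idea concrete, here is a small sketch in Python (chosen only for illustration; nothing like it was posted in the thread). It assumes the pattern uses `<` between ranks and `=` between graphemes of equal rank, as in the quoted line, and it works on a shortened fragment because the original pattern is elided with "...". The function names `parse_order` and `longest_first` are hypothetical.

```python
def parse_order(spec):
    """Split a collation spec into a grapheme -> rank table.

    '<' separates successive ranks, '=' joins graphemes that share a rank.
    """
    ranks = {}
    for rank, group in enumerate(spec.split("<")):
        for grapheme in group.split("="):
            ranks[grapheme] = rank
    return ranks


def longest_first(ranks):
    """Order the graphemes so that multi-letter ones are tried before single letters."""
    return sorted(ranks, key=len, reverse=True)


# A shortened fragment of the kind of pattern quoted above.
ranks = parse_order("a=á<b<c<cs<d")
order = longest_first(ranks)
```

The rank table keeps the collation order, while `longest_first` gives the order in which a matcher should try the graphemes, so "cs" is tested before "c" without changing where either one sorts.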

[19/30] from: carl:cybercraft at: 16-May-2002 18:58

On 16-May-02, G. Scott Jones wrote:

> This is extremely promising. I drew from the ISO-8859-2 character > set to make a rule, and it initially seems to sort correctly. The > time through is roughly the same as my hack (but I've not really > set-up a clean time condition).

I've thoughts about how to speed it up - will be testing them out.

> The only problem so far occurs when > I run my word sample list through more than once. It seems to > magically have kept the original sort/s and continues to append new > results to the block. I cannot seem to find where the problem is > occurring.

It's in here...

    forall series [
        clear blk
        parse/case series/1 rule
        append/only ptrs copy blk
        append last ptrs series/1
    ]

'forall leaves 'series at its tail, so the 'clear that follows doesn't clear it. Change it to the following and it should fix that problem. (Though not your other one. See my other post about that.)

    foreach s series [
        clear blk
        parse/case s rule
        append/only ptrs copy blk
        append last ptrs s
    ]

-- Carl Read

[20/30] from: carl:cybercraft at: 16-May-2002 19:35

On 16-May-02, G. Scott Jones wrote:

> Hi, Carl, > The idea did look promising, even for the "multi-letter graphemes"

<<quoted lines omitted: 17>>

> the sequence to allow the longer phrase rules to come before the > shorter ones. At least I think it would work.

My first thoughts are that it'd work too, but then we're talking about my coding here. (; Anyway, only the order of the parse rules should need to be changed. ie, this is what's currently generated...

>> probe rule-4

[some ["a" (r 1) | "A" (r 2) | "b" (r 3) | "B" (r 4) | "c" (r 5) | "C" (r 6) | "h" (r 7) | "H" (r 8) | "ch" (r 9) | "Ch" (r 10) | skip (r 11)]]

Moving the "ch"s to the front of the rule gives us this...

    rule-5: [some [
        "ch" (r 9) | "Ch" (r 10) | "a" (r 1) | "A" (r 2) | "b" (r 3) |
        "B" (r 4) | "c" (r 5) | "C" (r 6) | "h" (r 7) | "H" (r 8) |
        skip (r 11)
    ]]

Using that fixes your error above...

>> pattern-sort ["c" "ch" "h"] rule-5

== ["c" "h" "ch"]

though it screws up string sorting big-time...

>> pattern-sort "cchh" rule-5

== "bch"

(: Anyway, I'll see if I can get it to behave, and I'll try out the speed improvements I thought of as well.

-- Carl Read
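The ordering problem is not peculiar to 'parse: any matcher that takes the first alternative that fits behaves the same way. As a hedged illustration (Python's `re` module standing in for the parse rule; `tokenize` is a made-up helper, not part of the posted code), the sketch below splits a string with the alternatives in the listed order and again with the longest ones first.

```python
import re


def tokenize(text, graphemes):
    """Split text using an ordered alternation: the first alternative that matches wins."""
    pattern = re.compile("|".join(re.escape(g) for g in graphemes))
    return pattern.findall(text)


listed = ["c", "h", "ch"]                               # order in which they should sort
reordered = sorted(listed, key=len, reverse=True)       # order in which they should be tried

as_listed = tokenize("ch", listed)
as_reordered = tokenize("ch", reordered)
mixed = tokenize("cchh", reordered)
```

Tried in the listed order, "ch" is consumed as "c" followed by "h" and the digraph never matches; tried longest-first, it is seen as one unit. The sort position of each grapheme is a separate piece of information and does not have to change when the matching order does.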

[21/30] from: carl:cybercraft at: 16-May-2002 20:15

On 16-May-02, Volker Nitsch wrote:

> Carl, my volley pass works similar to yours, except i made it more > complicated :)

Simple things should be simple. (So I can understand them;)

> your pattern-rule "aAbBcC" would look like [+ "a" + "A" + "b" + "B" > + "c" + "C"] because i use blocks with strings, i can also map > multi-char-codes. so [+ "CH"] maps to one char.

I allowed for that. It accepts a string or a block of strings. Except it's bugged as it stands ): But it may be able to be fixed...

> IIRC "CH" is handled > like one char? also i have [+ "CH" = "Ch"]. this says, "CH" and "Ch" > are the same. (add a new code-number for "CH" and use the same > number for "Ch").

This I didn't allow for. Currently my rule-blocks look like this...

    ["a" "b" "c" "ch" "CH"]

and words with "ch" in would always precede ones with "CH" in after sorting. Placing equal ones in blocks would seem a nice solution...

    ["a" "b" "c" ["ch" "CH"]]

> And in my telephone-book "ö" is handled like > "oe", so one char expands to two. So i need this kind of commands?

I can't see why we would, as we're sorting something in just the one format, not changing the format. (I hope:)

> (Scott, i found no "ss" in this book, because "ss" has always two
> chars before it and is rarely used. sorry..
> http://www.uni-koeln.de/phil-fak/spinfo/lehre/java/kap23/collating.htm#3.3

> duden 73: "ß" like "ss", by same words before (argh!) since 96 > changed: "ß" after "ss". and "ä" the same as "a".

<<quoted lines omitted: 14>>

> for sorting i mix strings and their translations like [translation1 > string1 translation2 string2] sort with sort/skip 2,

I should've used sort/skip - it's one of the ways I'm hoping to speed things up.

> and extract the strings back to the original block. > hmm, somehow i like your string more. if it could deal with > multi-chars.

It can - just not correctly. (; 'rule-3 showed how it's meant to work...

>>>> rule-3: pattern-rule ["a" "A" "b" "B" "ch" "c" "C"] >> == [some ["a" (r 1) | "A" (r 2) | "b" (r 3) | "B" (r 4) | "ch" (r

<<quoted lines omitted: 4>>

>>>> pattern-sort ["AabA" "chab" "chAB" "cchc" "achA"] rule-3 >> == ["achA" "AabA" "chab" "chAB" "cchc"]

But, as Scott pointed out, it doesn't get this right...

---8<---
rule-4: pattern-rule ["a" "A" "b" "B" "c" "C" "h" "H" "ch" "Ch"]

will not correctly sort:

>> pattern-sort ["c" "ch" "h"] rule-4

== ["ch" "c" "h"] ;should be "c" "h" "ch"
---8<---

Back to the drawing-board...

-- Carl Read

[22/30] from: gscottjones:mchsi at: 16-May-2002 7:44

From: "Carl Read"

> On 16-May-02, G. Scott Jones wrote: > > > This is extremely promising. I drew from the ISO-8859-2 character > > set to make a rule, and it initially seems to sort correctly. The > > time through is roughly the same as my hack (but I've not really > > set-up a clean time condition). > > I've thoughts about how to speed it up - will be testing them out.

Great!

> > The only problem so far occurs when > > I run my word sample list through more than once. It seems to

<<quoted lines omitted: 17>>

> append last ptrs s > ]

Yep, needless to say, that fixed it.

    rule-4: pattern-rule { !"#$%&'()*+,-./0123456789:;<=>? @Aa�ᡱ��BbCc��Dd��Ee��Ff GgHhIi��JjKkLl�奵��MmNn��Oo��Pp QqRr��Ss��Tt��Uu��VvWwXxYy ��Zz��Zz[\]^_`{|}~}

Here is the rule pattern I **generated** from my table for the ISO-8859-2 character set. Currently, this is sorted big-uns before little-uns. If the character looks totally out of place, it is because this representation used the ISO-8859-1 implicit in REBOL.

Which brings me to the next "problem": there will be no way to generate the proper character that isn't already contained in the standard character set. So some sorted solutions will not appear to be correct, until the result is displayed in the correct character representation set. This does not appear to be a problem in Hungarian (so far), but will be in other languages. Hmmm.....

Keep up the good work. I look forward to seeing your ideas regarding the multi-letter graphemes in the separate post (which has not yet arrived here ... whoops .. just arrived!).

Later...
--Scott Jones

[23/30] from: carl:cybercraft at: 17-May-2002 21:46

On 17-May-02, G. Scott Jones wrote:

> From: "Carl Read" >> I've thoughts about how to speed it up - will be testing them out. > Great!

Well, not so great, actually. The new version's faster, but not markedly so. Perhaps 30% faster going by the single test of a long list of random words I did, though it's still 7 or 8 times slower than REBOL's sort. Maybe if it was all done with parsing it'd be faster, but I'd have to re-think it all. (:

Anyway, here it is. See the end of the mail for how to handle characters that are to be considered equal. ie "A" & "a" etc.

    pattern-rule: func [
        "Create a rule for use by pattern-sort."
        pattern [string! block!] "An ordered pattern."
        /local rule add-rule n
    ][
        rule: copy []
        add-rule: func [str][
            str: to-string str
            insert tail rule reduce [length? str reduce [
                str to-paren reduce ['r n length? str] '|
            ]]
        ]
        n: 1
        foreach pos pattern [
            either block? pos [
                either 1 = length? pos [
                    foreach r pos/1 [add-rule r]
                ][
                    foreach r pos [add-rule r]
                ]
            ][
                add-rule pos
            ]
            n: n + 1
        ]
        rule: extract next sort/reverse/skip rule 2 2
        insert tail rule reduce ['skip to-paren reduce ['r n 1]]
        reduce ['some rule]
    ]

    pattern-sort: func [
        {Sort a string or block of strings based on a rule created by pattern-rule.}
        series [string! block!] "Series to sort."
        rule [block!] "Pattern rule."
        /reverse "Reverse sort order."
        /local new blk r pos
    ][
        new: clear []
        blk: clear []
        r: func [n len][
            insert tail blk n
            if string? series [insert tail blk len]
        ]
        bind rule 'r
        either string? series [
            parse/case series rule
            pos: 0
            foreach [n len] blk [
                insert tail new reduce [
                    reduce [n] copy/part skip series pos len
                ]
                pos: pos + len
            ]
        ][
            foreach n series [
                clear blk
                parse/case n rule
                insert tail new reduce [copy blk n]
            ]
        ]
        either reverse [sort/skip/reverse new 2][sort/skip new 2]
        clear series
        insert tail series extract next new 2
        series
    ]

Use is the same as before, though the rules that are generated are different from before. ie...

>> rule-1: pattern-rule "aAbBcC"

== [some ["a" (r 1 1) | "A" (r 2 1) | "b" (r 3 1) | "B" (r 4 1) | "c" (r 5 1) | "C" (r 6 1) | skip (r 7 1)]]

So...

>> pattern-sort "abcABC" rule-1

== "aAbBcC"

>> pattern-sort ["abc" "ABC" "aBc" "AbC"] rule-1

== ["abc" "aBc" "AbC" "ABC"]

>> rule-2: pattern-rule ["a" "b" "c" "h" "ch"]

== [some ["ch" (r 5 2) | "a" (r 1 1) | "b" (r 2 1) | "c" (r 3 1) | "h" (r 4 1) | skip (r 6 1)]]

>> pattern-sort "ccchcc" rule-2

== "ccccch"

>> pattern-sort "hccchhh" rule-2

== "cchhhch"

>> pattern-sort ["c" "h" "ch" "h" "c"] rule-2

== ["c" "c" "h" "h" "ch"]

Now, to give the same weight to two or more characters, enclose them in a block. They can either be a single string in the block, in which case all the characters in the string are weighted the same, else they can be a group of strings which will all be weighted the same. ie...

>> rule-3: pattern-rule [["aA"]["bB"]["cC"]]

== [some ["a" (r 1 1) | "A" (r 1 1) | "b" (r 2 1) | "B" (r 2 1) | "c" (r 3 1) | "C" (r 3 1) | skip (r 4 1)]]

>> pattern-sort "BBbbBBcCcAaA" rule-3

== "AaABBbbBBcCc"

>> pattern-sort ["Bbb" "bBB" "aA" "Aa"] rule-3

== ["aA" "Aa" "Bbb" "bBB"]

>> rule-4: pattern-rule ["a" "b" ["cC"]["hH"]["ch" "CH"]]

== [some ["ch" (r 5 2) | "CH" (r 5 2) | "a" (r 1 1) | "b" (r 2 1) | "c" (r 3 1) | "C" (r 3 1) | "h" (r 4 1) | "H" (r 4 1) | skip (r...

>> pattern-sort "CHcCCcchbaHh" rule-4

== "abcCCcHhCHch"

>> pattern-sort ["hhCH" "ccCH" "hhch" "ccch" "hhCH"] rule-4

== ["ccCH" "ccch" "hhCH" "hhch" "hhCH"]

Also, I've allowed for characters not included in the rules, they being treated as the last character in the rule. So this doesn't generate an error...

>> pattern-sort ["rat" "hat" "cat"] rule-4

== ["cat" "hat" "rat"]

And the reverse refinement's still there...

>> pattern-sort/reverse ["rat" "hat" "cat"] rule-4

== ["rat" "hat" "cat"]

As before, no promises about how well this will perform with real alphabets, but it should be a bit better than the last effort. Hopefully. (;

-- Carl Read
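For comparison, the behaviour shown in the examples above (shared weights for grouped entries, multi-letter entries matched before single letters, unknown characters ranked last) can be approximated in a few lines of Python. This is a sketch under those stated assumptions, not a port of the REBOL functions: it builds a rank table and a sort key instead of a parse rule, and `pattern_ranks`, `sort_key` and the Python `pattern_sort` are names invented for the example.

```python
def pattern_ranks(pattern):
    """Build a grapheme -> rank table from a pattern.

    A plain string entry is one grapheme. A one-element list such as ["aA"]
    gives every character of its string the same rank. A longer list such as
    ["ch", "CH"] gives every string in it the same rank.
    """
    ranks = {}
    for rank, entry in enumerate(pattern, start=1):
        if isinstance(entry, str):
            members = [entry]
        elif len(entry) == 1:
            members = list(entry[0])
        else:
            members = entry
        for member in members:
            ranks[member] = rank
    return ranks


def sort_key(text, ranks):
    """Translate text into a list of ranks, matching the longest graphemes first."""
    graphemes = sorted(ranks, key=len, reverse=True)
    last = max(ranks.values()) + 1      # rank shared by all unlisted characters
    key, i = [], 0
    while i < len(text):
        for g in graphemes:
            if text.startswith(g, i):
                key.append(ranks[g])
                i += len(g)
                break
        else:
            key.append(last)
            i += 1
    return key


def pattern_sort(strings, pattern, reverse=False):
    """Return the strings sorted by the collation order the pattern describes."""
    ranks = pattern_ranks(pattern)
    return sorted(strings, key=lambda s: sort_key(s, ranks), reverse=reverse)


rule_2 = ["a", "b", "c", "h", "ch"]
rule_4 = ["a", "b", ["cC"], ["hH"], ["ch", "CH"]]

digraphs = pattern_sort(["c", "h", "ch", "h", "c"], rule_2)
unknowns = pattern_sort(["rat", "hat", "cat"], rule_4)
reversed_unknowns = pattern_sort(["rat", "hat", "cat"], rule_4, reverse=True)
mixed_case = pattern_sort(["abc", "ABC", "aBc", "AbC"], "aAbBcC")
```

On these inputs the sketch gives the same orderings as the console examples above. One difference in design: because Python's `sorted` is stable, strings whose keys are equal keep their input order, which may not match how the REBOL version breaks ties.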

[24/30] from: gscottjones:mchsi at: 17-May-2002 13:52

From: "Carl Read"

> The new version's faster, but not > markedly so. Perhaps 30% faster going by the single test of a long

<<quoted lines omitted: 17>>

> alphabets, but it should be a bit better than the last effort. > Hopefully. (;

By George, I think you've done it! At least it appears to sort the sample Hungarian word list correctly. That is a slick solution. It is about 35% faster than my original effort. Good job. I like the way you handled the characters of equivalent weight, although I've not put this aspect through any testing. Good job!

--Scott Jones

[25/30] from: geza67:freestart:hu at: 17-May-2002 22:59

Hello Scott

> In this case, the pattern has a bit more information:
> a=á<b<c<cs<d<e=é<f<g<gy<h<i=í...<z<zs
> where "a" can be told to sort the same as "a with acute", both of these sort
> before "b" ... and "zs" sorts after "z"

Actually a<>á and e<>é ... more clearly a<á and e<é. In some relaxed situations the equivalence could be stated, but Hungarian grammar is much too complex for me to be an "ex cathedra" judge about it.

> Geza, how important are these "multi-letter graphemes" (cs, dz, dzs, gy, ly,
> ny, sz, ty and zs) in a sort algorithm? At the same site, Péter Szigetvári

dz and dzs are used for transliteration, i.e. for transcribing foreign words. E.g. dzs stands for j (the Hungarian language is more phonetically oriented than the Indo-European or Latin-legacy languages). cs, gy, ly, ny, sz, ty and zs are "inborn" Hungarian specialities; many words have them as components.

How important are they? That's a very hard question, because in a mixed-language text (e.g. a Hungarian medical report interspersed with medical Latin terminology) one has to understand the word itself to determine its sorting order. Take the "ly" phoneme in a Hungarian word: it roughly corresponds to the English "y", and in Hungarian "j" is phonetically equivalent to "ly", but orthographically some words use the one and some the other. If you don't know the word you cannot even decide its hyphenation, as you wrote:

> "Unfortunately, the task is not trivial: some sequences that look like
> multi-letter graphemes are in fact not, e.g., bércsík may be ranked before
> or after bérczerge depending on its morphology: bér+csík (after bérczerge)
> or bérc+sík (before bérczerge). This can be decided only with a

bér-csík or bérc-sík: different sorting order and even different hyphenation. (Just to satisfy your presumed curiosity about what these words mean: the first could be translated as payment-stripe [not a logical word combination], the second as a geographical plane [a correct Hungarian word].) Without a dictionary, no program can get through this, not even a semantic parser.

Back to these di-graphemes: they are important, fundamental parts of our language, but personally I can live without sorting them correctly in a computer program. :-)

--
Best regards,
Geza mailto:[geza67--freestart--hu]

[26/30] from: gscottjones:mchsi at: 18-May-2002 6:51

From: "Geza Lakner MD"

> > In this case, the pattern has a bit more information:
> > a=á<b<c<cs<d<e=é<f<g<gy<h<i=í...<z<zs
> > where "a" can be told to sort the same as "a with acute", both of these sort
> > before "b" ... and "zs" sorts after "z"
> Actually a<>á and e<>é ... more clearly a<á and e<é. In some relaxed
> situations the equivalence could be stated but the Hungarian grammar
> is much more complex that I could be an "ex cathedra" judge about it.

How do you like Carl's representation?

<snip> > Back to these di-graphemes: they are important, fundamental parts of > our language but personally I can live without sorting them correctly > in a computer program. :-)

That was the final opinion of the Hungarian author (Péter Szigetvári) of the website I was using as a reference. By the way, he offers a number of format conversion tools that are Hungarian friendly. They are written in Perl.

http://budling.nytud.hu/~szigetva/etcetera/Hungarian.html

I almost have the ISO-8859-2 character set (for Central Europe) mapped based on the various sort orders that we discussed earlier. (I just remembered that I forgot Petr K's Czech "ch" -- darn!) If you would like to use Carl R's nifty sorting parser, I can transform the various sorting orders into patterns for easy use (that was a very clever idea).

What I do not have is any authoritative resource that tells me the best order that covers "all" the bases. My fear is that the letters with diacritics may sort differently in the various languages covered by the ISO-8859-2 character set: Albanian, Bosnian, Croatian, Czech, English, Finnish, Hungarian, Irish, German, Polish, Romanian, Serbian (Latin transcription), Slovak, Slovenian, and Sorbian (Lusatian). My master table can now handle any permutation, but it is the actual orders that are so hard to come across.

Thanks for the feedback on the "multi-letter graphemes."

--Scott Jones

[27/30] from: carl:cybercraft at: 19-May-2002 12:32

On 18-May-02, G. Scott Jones wrote:

> By George, I think you've done it! At least it appears to sort the > sample Hungarian word list correctly. That is a slick solution. It > is about 35% faster than my original effort. Good job. I like the > way you handled the characters of equivalent weight, although I've > not put this apsect through any testing. Good job! > --Scott Jones

Good to hear it seems to work, Scott. Be interesting to know what other languages it can work with.

One possible improvement in the creation of the rule would be to allow for some of the strings in blocks to be treated as a collection of separate characters, perhaps by using a different string datatype to string!, such as file!. So that instead of this...

    ["a" "b" "c" "ch" "d" "e"]

we could have...

    [%abc "ch" %de]

Though it'd probably be better round the other way. What would be the best string datatype for such a job?

-- Carl Read

[28/30] from: gscottjones:mchsi at: 19-May-2002 12:17

From: "Carl Read"

> One possible improvement in the creation of the rule would be to allow > for some of the strings in blocks to be treated as a collection of

<<quoted lines omitted: 5>>

> Though it'd probably be better round the other way. What would be the > best string datatype for such a job?

It's not immediately obvious to me. Maybe something will become obvious with some thought. --Scott Jones

[29/30] from: brett:codeconscious at: 20-May-2002 9:38

Interesting question. Tag! and Issue! might be useful for your design but both will not be able to handle certain characters. I figured I could write some code to show what they are:

    chars-in-form: function [
        example-form [block!]
    ] [all-chars useable-chars ch test-value] [
        all-chars: copy {}
        useable-chars: copy {}
        repeat i 255 [
            append all-chars ch: to-char i
            if all [
                not error? try [test-value: load rejoin example-form]
                1 = length? test-value
                ch = first test-value
            ] [append useable-chars ch]
        ]
        exclude all-chars useable-chars
    ]

    print mold chars-in-form [#"#" ch]
    print mold chars-in-form [#"<" ch #">"]

Regards,
Brett.

[30/30] from: carl:cybercraft at: 21-May-2002 21:56

On 20-May-02, Brett Handley wrote:

> Interesting question. > Tag! and Issue! might be useful for your design but both will not be > able to handle certain characters.

Actually, tags might be the best, as you can put strings in them and they could then be used to hold the multiple-letter characters, with plain strings being used for single-letter characters. ie...

    ["aAbBcC" <"ch" "CH"> "dDeE"]

(Yes, I know it's just one string in the tag, but to-block separates them.) And using blocks for same-value characters could still be used.

But I won't be changing the script till asked, since I'm not using it myself. (:

-- Carl Read

Notes

Quoted lines have been omitted from some messages.