Collation sequence - proper and efficient sorting of national accented c
[1/30] from: geza67::freestart::hu at: 12-May-2002 0:57
Hello REBOLers,
Is there a way to set a national collation sequence for string
SORTing in the SYSTEM object? Defining a per character comparator
for SORT/COMPARE would be an overkill in terms of speed and
efficacy. Thus, is there a way to use the native SORT with speed for
ordering national accented characters (which would be otherwise at
the very end of the "English" alphabet :-( ) ?
--
Best regards,
Geza Lakner MD mailto:[geza67--freestart--hu]
[2/30] from: gscottjones::mchsi::com at: 12-May-2002 6:31
Re: Collation sequence - proper and efficient sorting of national accent
From: "Geza Lakner MD"
> Hello REBOLers,
> Is there a way to set a national collation sequence for string
<<quoted lines omitted: 3>>
> ordering national accented characters (which would be otherwise at
> the very end of the "English" alphabet :-( ) ?
Hi, Geza,
As a special challenge for myself, I undertook the Czech alphabet last year.
It was a special challenge both because I didn't know the Czech language and
because the Czech alphabet had several special cases (namely "ch").
Here is a link to the first release I did for the Czech alphabet:
http://www.escribe.com/internet/rebol/m10477.html
A character was missing, so the following contained the final character map:
http://www.escribe.com/internet/rebol/m10493.html
This was just one approach that I could envision at the time. If I recall
correctly, Volker had some other ideas, so you may want to check more of the
various threads to see some of the ideas and pitfalls that were discovered:
Here was the post that began the challenge:
http://www.escribe.com/internet/rebol/m10350.html
Here was the post that began the next sequence:
http://www.escribe.com/internet/rebol/m10414.html
The final sequence began with the first link given in this current email.
Let me know if you have any questions. Good luck.
--Scott Jones
[3/30] from: geza67:freestart:hu at: 12-May-2002 16:23
Hello G.,
> Let me know if you have any questions. Good luck.
Huhh, pretty scary task NOT to be a REBOLer Englishman ;-) I will try
some mapping - it seems the most obvious way to me. Besides I would
like to sort first names: as a "loose" (lousy, lazy etc ;-) ) solution
it would suffice to map the starting special vowels. In that way
Hungarian language is not so exotic, having only accented characters
as national specialities, like: á é í ó ú ö ü ő ű (well, the last two
_both_ should normally have double acute accents on top of them, not
a circumflex or tilde :-( )
--
Best regards,
Geza mailto:[geza67--freestart--hu]
[4/30] from: gscottjones::mchsi::com at: 12-May-2002 14:55
Hungarian Alphabet Sort (was Re: Collation sequence - proper and efficie
Hi, Geza,
The Czech sorter was alpha, and never moved beyond to a more generic
solution due to apparent lack of general interest and time. I was trying to
remember how I had mapped the character, and one thing led to another, and I
suddenly had a Hungarian Alphabet sorter! Strange how that happens.
It was a bit of a puzzle, because my neurofibrillary tangles have been
getting worse and worse (warning for the non-medical types: medical
internist humor alert).
Here is my alpha release of the Hungarian sorter. Watch for line breaks.
The one thing that I was unable to be certain about was the sort order for
the "diaeresis" versus "double acute" forms of "o" and "u". If I have
selected the wrong order, it is a very simple matter for me to fix this.
Let me know how it works!
--Scott Jones
################################################
REBOL [
Title: "Hungarian Language Sort Function"
Date: 12-May-2002
Version: 0.0.1
Author: "G. Scott Jones, M.D."
File: %hungarian-sort.r
Purpose: {Sort support for Hungarian alphabet}
Comment: {This is the first alpha release for the Hungarian
language sort. It is based on the alpha of my Czech Sort
of 2001. For these early versions, I've rolled the
character sort list into this file for convenience. The
routine is currently hard coded for Hungarian language
only, but will readily be made more generic for other
languages. The code is heavily commented for easy
interpretation by others. The routine could also be
rewritten to be a wrapper for REBOL 'sort, with a path
refinement allowing for alternative language support.
The to-do list is so long as to make it pointless for me
to list at this stage. ;-) Now, I'll post to the list
for review.
USAGE: hungarian-sort series /case /reverse
}
History: [
0.0.1 [12-May-2002 {First released for alpha review} "GSJ"]
]
]
char-list: {32 1 32
33 2 33 !
34 3 34 "
35 4 35 #
36 5 36 $
37 6 37 %
38 7 38 &
39 8 39 '
40 9 40 (
41 10 41 )
42 11 42 *
43 12 43 +
44 13 44 ,
45 14 45 -
46 15 46 .
47 16 47 /
48 17 48 0
49 18 49 1
50 19 50 2
51 20 51 3
52 21 52 4
53 22 53 5
54 23 54 6
55 24 55 7
56 25 56 8
57 26 57 9
58 27 58 :
59 28 59 ;
60 29 60 <
61 30 61 =
62 31 62 >
63 32 63 ?
64 33 64 @
97 34 97 a 0061 LATIN SMALL LETTER A
225 35 225 a' 00e1 LATIN SMALL LETTER A WITH ACUTE
65 69 65 A 0041 LATIN CAPITAL LETTER A
193 70 193 A' 00c1 LATIN CAPITAL LETTER A WITH ACUTE
98 36 98 b 0062 LATIN SMALL LETTER B
66 71 66 B 0042 LATIN CAPITAL LETTER B
99 37 99 c 0063 LATIN SMALL LETTER C
67 72 67 C 0043 LATIN CAPITAL LETTER C
100 38 100 d 0064 LATIN SMALL LETTER D
68 73 68 D 0044 LATIN CAPITAL LETTER D
101 39 101 e 0065 LATIN SMALL LETTER E
233 40 233 e' 00e9 LATIN SMALL LETTER E WITH ACUTE
69 74 69 E 0045 LATIN CAPITAL LETTER E
201 75 201 E' 00c9 LATIN CAPITAL LETTER E WITH ACUTE
102 41 102 f 0066 LATIN SMALL LETTER F
70 76 70 F 0046 LATIN CAPITAL LETTER F
103 42 103 g 0067 LATIN SMALL LETTER G
71 77 71 G 0047 LATIN CAPITAL LETTER G
104 43 104 h 0068 LATIN SMALL LETTER H
72 78 72 H 0048 LATIN CAPITAL LETTER H
105 44 105 i 0069 LATIN SMALL LETTER I
237 45 237 i' 00ed LATIN SMALL LETTER I WITH ACUTE
73 79 73 I 0049 LATIN CAPITAL LETTER I
205 80 205 I' 00cd LATIN CAPITAL LETTER I WITH ACUTE
106 46 106 j 006a LATIN SMALL LETTER J
74 81 74 J 004a LATIN CAPITAL LETTER J
107 47 107 k 006b LATIN SMALL LETTER K
75 82 75 K 004b LATIN CAPITAL LETTER K
108 48 108 l 006c LATIN SMALL LETTER L
76 83 76 L 004c LATIN CAPITAL LETTER L
109 49 109 m 006d LATIN SMALL LETTER M
77 84 77 M 004d LATIN CAPITAL LETTER M
110 50 110 n 006e LATIN SMALL LETTER N
78 85 78 N 004e LATIN CAPITAL LETTER N
111 51 111 o 006f LATIN SMALL LETTER O
243 52 243 o' 00f3 LATIN SMALL LETTER O WITH ACUTE
245 53 245 o" 0151 LATIN SMALL LETTER O WITH DOUBLE ACUTE
246 54 246 o: 00f6 LATIN SMALL LETTER O WITH DIAERESIS
79 86 79 O 004f LATIN CAPITAL LETTER O
211 87 211 O' 00d3 LATIN CAPITAL LETTER O WITH ACUTE
213 88 213 O" 0150 LATIN CAPITAL LETTER O WITH DOUBLE ACUTE
214 89 214 O: 00d6 LATIN CAPITAL LETTER O WITH DIAERESIS
112 55 112 p 0070 LATIN SMALL LETTER P
80 90 80 P 0050 LATIN CAPITAL LETTER P
113 56 113 q 0071 LATIN SMALL LETTER Q
81 91 81 Q 0051 LATIN CAPITAL LETTER Q
114 57 114 r 0072 LATIN SMALL LETTER R
82 92 82 R 0052 LATIN CAPITAL LETTER R
115 58 115 s 0073 LATIN SMALL LETTER S
83 93 83 S 0053 LATIN CAPITAL LETTER S
116 59 116 t 0074 LATIN SMALL LETTER T
84 94 84 T 0054 LATIN CAPITAL LETTER T
117 60 117 u 0075 LATIN SMALL LETTER U
250 61 250 u' 00fa LATIN SMALL LETTER U WITH ACUTE
251 62 251 u" 0171 LATIN SMALL LETTER U WITH DOUBLE ACUTE
252 63 252 u: 00fc LATIN SMALL LETTER U WITH DIAERESIS
85 95 85 U 0055 LATIN CAPITAL LETTER U
218 96 218 U' 00da LATIN CAPITAL LETTER U WITH ACUTE
219 97 219 U" 0170 LATIN CAPITAL LETTER U WITH DOUBLE ACUTE
220 98 220 U: 00dc LATIN CAPITAL LETTER U WITH DIAERESIS
118 64 118 v 0076 LATIN SMALL LETTER V
86 99 86 V 0056 LATIN CAPITAL LETTER V
119 65 119 w 0077 LATIN SMALL LETTER W
87 100 87 W 0057 LATIN CAPITAL LETTER W
120 66 120 x 0078 LATIN SMALL LETTER X
88 101 88 X 0058 LATIN CAPITAL LETTER X
121 67 121 y 0079 LATIN SMALL LETTER Y
89 102 89 Y 0059 LATIN CAPITAL LETTER Y
122 68 122 z 007a LATIN SMALL LETTER Z
90 103 90 Z 005a LATIN CAPITAL LETTER Z
91 104 91 [
92 105 92 \
93 106 93 ]
94 107 94 ^
95 108 95 _
96 109 96 `
123 110 123 {
124 111 124 |
125 112 125 }
126 113 126 ~
133 114 133 a}
;;;;set up sort data structures
data: copy []
data: parse/all char-list "^/"
;make regular sort map
hu-reg: copy data
forall hu-reg [hu-reg/1: to-integer first parse hu-reg/1 none]
hu-reg: head hu-reg
;make case-sensitive sort map
hu-case: copy data
mysort: func [a b] [
(to-integer pick parse a none 2) < (to-integer pick parse b none 2)
]
;rearrange the list based on second field
sort/compare hu-case :mysort
forall hu-case [hu-case/1: to-integer first parse hu-case/1 none]
hu-case: head hu-case
;;;;new sort function
;not all 'sort refinements yet supported
;local words have not been specified
;error condition roll-back of block to original not yet added
hungarian-sort: func [:blk /case /reverse][
either case [order: hu-case][order: hu-reg]
;backup for future error checking and roll-back
blk-backup: copy blk
forall blk [
;swap index position for characters
temp: copy []
foreach b blk/1 [
t: find order to-integer b
append temp index? t
]
blk/1: temp
]
blk: head blk
;sort through REBOL 'sort
either reverse [
sort/reverse blk
][
sort blk
]
forall blk [
temp: copy []
;change index integer back to characters
foreach b blk/1 [append temp to-char order/:b]
;make a word out of characters
blk/1: copy rejoin temp
]
;reset head and block returns changed
blk: head blk
]
;;;;now for some testing
;these may not be official spellings - it is just what I had available
months: ["január" "február" "március" "április" "május" "június" "július"
"augusztus" "szeptember" "október" "november" "december"]
hungarian-sort months
print ["Check month sort: " equal? months ["augusztus" "április" "december"
"február" "január" "július" "június" "május" "március" "november"
"október" "szeptember"]]
;foreach m months [print m]
hungarian-sort/case months
print ["Check month sort/case: " equal? months ["augusztus" "április" "december"
"február" "január" "július" "június" "május" "március" "november"
"október" "szeptember"]]
;foreach m months [print m]
hungarian-sort/reverse months
print ["Check month sort/reverse: " equal? months ["szeptember" "október" "november"
"március" "május" "június" "július" "január" "február" "december"
"április" "augusztus"]]
;foreach m months [print m]
days: ["hétfo" "kedd" "szerda" "csütörtök" "péntek" "szombat" "vasárnap"]
hungarian-sort days
print ["Check day sort: " equal? days ["csütörtök" "hétfo" "kedd" "péntek"
"szerda" "szombat" "vasárnap"]]
;foreach d days [print d]
hungarian-sort/case days
print ["Check day sort/case: " equal? days ["csütörtök" "hétfo" "kedd" "péntek"
"szerda" "szombat" "vasárnap"]]
;foreach d days [print d]
hungarian-sort/reverse days
print ["Check day sort/reverse: " equal? days ["vasárnap" "szombat" "szerda"
"péntek" "kedd" "hétfo" "csütörtök"]]
;foreach d days [print d]
;foreach d days [print d]
word-sample: ["január" "február" "március" "április" "május" "június"
"július" "augusztus" "szeptember" "október" "november" "december"
"hétfo" "kedd" "szerda" "csütörtök" "péntek" "szombat"
"vasárnap" "nulla" "egy" "kettő" "három" "négy" "öt"
"hat" "hét" "nyolc" "kilenc" "tíz" "tizenegy" "húsz" "huszonegy"
"harmincegy" "negyvenegy" "ötvenegy" "hatvanegy" "hetvenegy"
"nyolcvanegy" "kilencvenegy" "száz" "ezer" "ezeregyszáz"
"tízezer" "ötvenezer" "százezer" "millió" "milliárd"
"Igen" "Nem" "Kérem" "Köszönöm" "Szervusz" "Viszontlátásra"
"Magyar" "Magyarország" "Hogy" "van" "vagy"
"Mit" "csinálsz" "Bocsánat" "vagyok" "Hol" "szép" "ország"
"Segítene" "Fáradt" "Segítség" "a" "Kanadai" "Amerikai"
"Hany" "óra" "Merre" "kell" "menni" "Az" "EMERGE"
"Európai" "Unió" "Információs" "Társadalom" "Technológiájával"
"foglalkozó" "projektje" "amely" "aktívan"
"támogatja" "és" "más" "közép" "kelet" "országok" "részvételét"
"EU" "által" "finanszírozott" "IST" "projektekben"
"Informál" "keretprogram" "műszaki" "megoldásaival"
"projektjeiről" "tájékoztat" "ben" "induló" "keretprogramról"
"Tanszékünk" "Budapesti" "Gazdaságtudományi"
"Egyetem" "Távközlési" "Telematikai" "Tanszéke"
"projekt" "hazai" "partnere" "vonatkozása" "fő" "fázisból"
"áll" "konferencia" "megszervezése" "Budapesten" "melynek"
"során" "munkája" "iránt" "érdeklődők" "személyesen"
"is" "bemutatkozhatnak" "egymásnak" "nyújtása"
"intézményeknek" "ahhoz" "csatlakozhassanak" "jelenleg"
"futó" "projektekhez" "abban" "partnereket" "találjanak"
"jövőbeliekhez" "Tájékoztatás" "arról" "milyen"
"gazdasági" "helyzet" "projektben" "résztvevő" "úgynevezett"
"iparában" "ezek" "javarészt" "ezen" "belül" "Magyarországon"
"itt" "olvasható" "információk" "frissített"
"változata" "elejére" "várható" "Szintén" "ehhez"
"fázishoz" "tartozik" "elkövetkező" "évre" "vonatkozó"
"Uniós" "kutatási" "programról" "első" "fázisa" "októberében"
"második" "pedig" "befejeződött" "Mindezekről" "bővebb"
"információt" "Archívum" "pont" "alatt" "találhat" "bal"
"oldali" "menüben" "harmadik" "fázis" "erről" "többet" "távközlés"
"helyzete" "országokban" "pontok" "tudhat" "meg" "Újdonság"
"Híradástechnika" "című" "folyóiratban" "hamarosan" "megjelenik"
"fázisban" "megrendezett" "konferencián" "elhangzott"
"előadásokból" "háromnak" "nyelvű" "írott"]
hungarian-sort word-sample
;foreach d word-sample [print d]
hungarian-sort/reverse word-sample
;foreach d word-sample [print d]
hungarian-sort/case word-sample
;foreach d word-sample [print d]
[5/30] from: geza67:freestart:hu at: 12-May-2002 23:34
Re: Hungarian Alphabet Sort (was Re: Collation sequence - proper and eff
Hello Scott,
Thanx, cute solution! Though my critical comments :-) :
The right order for Hungarian vowels: actually the diaeresis characters
come first and then the double acute ones (only o and u have double
accents in the Hungarian alphabet):
oOóÓöÖőŐ
uUúÚüÜűŰ
Unfortunately the case-insensitiveness does not work. Look:
hungarian-sort ["alom" "Álom" "álom" "Állam"]
== ["alom" "álom" "Állam" "Álom"]
Though it should read:
alom Állam Álom álom.
- The /case refinement results in the same result as the one without
it :-( :
>> hungarian-sort/case ["alom" "álom" "Álom" "Állam"]
== ["alom" "álom" "Állam" "Álom"]
The case-sensitive collation sequence IMHO would be a bit different than
you have defined, namely:
aAáÁ...eEéÉ...
Your order was:
aáAÁ...eéEÉ...
- and so on for all affected special accented chars.
--
Best regards,
Geza mailto:[geza67--freestart--hu]
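The ordering asked for above (letters compared without regard to case first, case used only to break ties) cannot come out of a single rank per character, because a capital in the first position then outweighs everything after it. One possible way around this is a two-level key. The following is only a minimal sketch, assuming REBOL/Core 2.x; the names `alphabet`, `make-key` and `collate` are hypothetical, and the plain A-Z alphabet is a stand-in that would need the accented letters inserted at their proper places. It relies on native `sort` ordering blocks of integers element by element, as the posted sorter does.

```rebol
REBOL [Title: "Two-level collation key (sketch)"]

; Hypothetical collation order: each lower/upper pair shares one
; primary rank; within a pair the lower-case letter comes first.
alphabet: "aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ"

; Build a sort key: the case-blind ranks of every character, a 0
; terminator (so a shorter word precedes its extensions), then the
; exact ranks as tie-breaker.
make-key: func [word [string!] /local primary exact pos] [
    primary: copy []
    exact: copy []
    foreach ch word [
        either pos: find/case alphabet ch [
            append primary to-integer ((index? pos) + 1) / 2
            append exact index? pos
        ][
            ; characters outside the alphabet sort after it, by code
            append primary 1000 + to-integer ch
            append exact 1000 + to-integer ch
        ]
    ]
    append primary 0
    append primary exact
]

; Sort a block of words through native 'sort on the keys.
; The word itself rides along as the last element of its key.
collate: func [words [block!] /local keys out] [
    keys: copy []
    foreach w words [append/only keys append make-key w w]
    sort keys
    out: copy []
    foreach key keys [append out last key]
    out
]

probe collate ["alom" "Alom" "Allam" "b" "B" "a"]
```

The aim is that "Allam" lands before both spellings of "alom", and that "alom" precedes "Alom" only because of the tie-breaker. Swapping each pair in `alphabet` (for example "Aa" instead of "aA") would put capitals first on ties.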
[6/30] from: gscottjones:mchsi at: 12-May-2002 21:11
From: "Geza Lakner MD"
<snip>
> The right order for Hungarian vowels: actually the diaeresis characters
> come first and then the double acute ones (only o and u have double
> accents in the Hungarian alphabet):
> oOóÓöÖőŐ
> uUúÚüÜűŰ
This was easy to fix.
> Unfortunately the case-insensitiveness does not work. Look:
> hungarian-sort ["alom" "Álom" "álom" "Állam"]
> == ["alom" "álom" "Állam" "Álom"]
>
> Though it should read:
> alom Állam Álom álom.
Yes, this is a problem. My current algorithm will not easily accommodate
this change. I now can even remember thinking last year that the approach
might cause a problem, but the test samples presented apparently did not
detect this problem at that time. Hmmm.
Time to go back to the drawing board. I already have an idea, but it may
take a while before I have some time to create the new algorithm.
> - The /case refinement results in the same result as the one without
> it :-( :
<<quoted lines omitted: 5>>
> Your order was:
> aáAÁ...eéEÉ...
There end up being two issues at work here. Having the order as
aáAÁ...eéEÉ...
was not my intention. What I was aiming to do was
aá..eé..AÁ..EÉ...
which may also not seem correct to you; however, this behavior mirrors
REBOL's default behavior for the /case switch, but does differ in placing
the little letters before the capital letters. Petr K. said that this was
the more normal method in eastern europe (Czech language in his case). So I
was trying to reflect this pattern, but did make the one ordering error.
The REBOL 'sort /case switch will sort all the words first by whether the
letter is capital or not. In fact, REBOL places all the words that begin in
capital letters _before_ the words that begin in small letters (because of
the ascii number assigned to the letters).
Maybe we need an additional switch that allows for the eastern european
desire to have smalls before capitals, and to interleave these together as
you suggest. Sometimes it would be handy to have these options too here in
the US. Just need a clever name or names for these switches (or paths in
REBOLese). Any ideas are welcomed.
> - and so on for all affected special accented chars.
and so on for life in general!
:-)
I'll repost after I have a chance to develop the new algorithm that I have
in mind. "Stay tuned"
Thanks for your feedback!
--Scott Jones
[7/30] from: geza67:freestart:hu at: 13-May-2002 20:29
Hello Scott!
>> The right order for Hungarian vowels: actually the diaeresis characters
> This was easy to fix.
... as you have prospectively pointed it out in your first post :-)
> Time to go back to the drawing board. I already have an idea, but it may
> take a while before I have some time to create the new algorithm.
Good luck to "braining out" the new enhanced algorithm. :-)
> There end up being two issues at work here. Having the order as
> aáAÁ...eéEÉ...
> was not my intention. What I was aiming to do was
> aá..eé..AÁ..EÉ...
> which may also not seem correct to you; however, this behavior mirrors
Ah, so! No, this is quite right: small letters first, then capitals.
I just thought you were aiming at an "interwoven" collation sequence.
> REBOL's default behavior for the /case switch, but does differ in placing
REBOL seems (more and more to me) English-oriented which is very
peculiar, Carl being a German fellow (do I know it right?) Has he
forgotten the handling of his native language special characters -
like the German-only a-umlaut ? ;-)
> the little letters before the capital letters. Petr K. said that this was
> the more normal method in eastern europe (Czech language in his case). So I
It is the normal method in Hungarian, as well.
> letter is capital or not. In fact, REBOL places all the words that begin in
> capital letters _before_ the words that begin in small letters (because of
> the ascii number assigned to the letters).
The problem is - IMHO - that REBOL does not allow _really_ custom
sorts: although one can write a /compare refinement function but this
refinement is not as general-purpose as it first seems. Maybe
mathematicians can use custom comparisons for e.g. complex numbers,
but the refinement cannot easily be adapted to custom-order
series values, as is the case with strings. Specifying collation
order for strings is the first step to internationalization. Being Europe a
huge and linguistically not homogenous market, RT should adopt
a "plugin"-style localization: the 'locale object seems to be a right
place to this, i.e. putting custom collation sequences there.
> Maybe we need an additional switch that allows for the eastern european
> desire to have smalls before capitals, and to interleave these together as
Maybe I missed this in the English class :-) but does NOT sort English this way, too?
What is the proper sorting order for mixed capitalized English words?
> you suggest. Sometimes it would be handy to have these options too here in
On what occasion do you think it would be necessary for you
(disregarding the special cases for writing custom softwares for
Eastern Europe ;-) ) ?
> the US. Just need a clever name or names for these switches (or paths in
> REBOLese). Any ideas are welcomed.
The most obvious (and highly uninspired ;-) ) naming would be:
/international.
Other ideas:
/smallsfirst
/capitalized
>> - and so on for all affected special accented chars.
> and so on for life in general!
> :-)
Do not stop generalization here: Life, Universe and everything ... :-))
> I'll repost after I have a chance to develop the new algorithm that I have
> in mind. "Stay tuned"
Beep-beep :-)
--
Best regards,
Geza mailto:[geza67--freestart--hu]
[8/30] from: carl::cybercraft::co::nz at: 14-May-2002 9:21
Re: Hungarian Alphabet Sort (was Re: Collation sequence -proper and effi
On 14-May-02, Geza Lakner MD wrote:
> Maybe I missed this in the English class :-) but does NOT sort
> English this way, too? What is the proper sorting order for mixed
> capitalized English words?
Something I'd not thought about. This is what REBOL does...
>> sort "AabB"
== "AabB"
>> sort/case "AabB"
== "ABab"
I expected sort/case to return "aAbB"...
Would a sort/pattern be of use? ie...
sort/pattern "AabB" "aAbBcC" ; == "aAbB"
--
Carl Read
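A `sort/pattern` refinement is a proposal here, not an existing feature. For single characters the behaviour can be approximated with `sort/compare`. The following is a minimal sketch only, assuming REBOL/Core 2.x; `pattern-rank` and `sort-by-pattern` are hypothetical names, and the choice to place characters missing from the pattern after all listed ones is an assumption, not part of the proposal.

```rebol
REBOL [Title: "sort/pattern emulation (sketch)"]

; Hypothetical helper: position of a character in the pattern, or a
; number beyond the pattern (ordered by character code) if absent.
pattern-rank: func [pattern [string!] ch [char!] /local pos] [
    either pos: find/case pattern ch [
        index? pos
    ][
        (length? pattern) + 1 + to-integer ch
    ]
]

; Return a new string whose characters follow the pattern's order.
sort-by-pattern: func [str [string!] pattern [string!] /local chars] [
    chars: copy []
    foreach ch str [append chars ch]
    sort/compare chars func [a b] [
        (pattern-rank pattern a) < (pattern-rank pattern b)
    ]
    rejoin chars
]

probe sort-by-pattern "AabB" "aAbBcC"
```

The comparator runs interpreted code for every comparison, which is the speed cost raised at the start of this thread; it is shown only to pin down what the proposed refinement would mean.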
[9/30] from: sunandadh::aol::com at: 13-May-2002 19:19
Re: Hungarian Alphabet Sort (was Re: Collation sequence - proper ...
Geza:
> Specifying collation
> order for strings is the first step to internationalization. Being Europe a
> huge and linguistically not homogenous market, RT should adopt
> a "plugin"-style localization: the 'locale object seems to be a right
> place to this, i.e. putting custom collation sequences there
It's worth someone from RT taking a look at how MySQL handles adding new
character sets and collating sequences -- it's pretty complete.
Although it's worth pointing out that they don't handle all the subtleties
needed across Europe. One tiny example. German names in phone books may have
a different collating sequence to words in a dictionary, and Austrian phone
books use a different ordering to German ones.
Useful Mysql reference:
http://www.unixtech.be/docs/mysql/manual_Server.html#String_collating
Sunanda.
[10/30] from: gscottjones:mchsi at: 13-May-2002 18:32
Re: Hungarian Alphabet Sort (was Re: Collation sequence -proper and effi
From: "Carl Read"
> Would a sort/pattern be of use? ie...
>
> sort/pattern "AabB" "aAbBcC" ; == "aAbB"
Hi, Carl,
Are you suggesting this as a prototype of a call, or is there already such a
beast out there?
At any rate, this is an interesting idea as a way of introducing new or
different sort patterns. I'll have to think about it a bit.
If this already exists, certainly let us know.
Thanks!
--Scott Jones
[11/30] from: gscottjones:mchsi at: 13-May-2002 21:43
Re: Hungarian Alphabet Sort (was Re: Collation sequence - proper and eff
From: "Geza Lakner MD"
>> Time to go back to the drawing board. I already have an idea, but it may
>> take a while before I have some time to create the new algorithm.
> Good luck to "braining out" the new enhanced algorithm. :-)
I think I've got it. In fact I'm expanding the idea to handle all the
languages that use the ISO-8859-2 character set. Using the basic underlying
technique that I used before, I can set up sorting orders to accomplish any
desired goal. The biggest problem that I am running into is the actual
sorting order. You've helped with the Hungarian language, and Petr/Cyphre
helped with the Czech language. But I show the following languages all
(can) use the same character set:
Albanian, Bosnian, Croatian, Czech, English, Finnish, Hungarian, Irish,
German, Polish, Romanian, Serbian (Latin transcription), Slovak,
Slovenian, Sorbian (Lusatian)
I am collating *all* the characters/codes for each and I am making a few
blind stabs at the sorting order, but it is not obvious to this chap from
the US. Then there are the exceptions like "ch" from Czech, and the German
sharp small s (ß). Wow. I previously figured out how to manage the "ch"
conundrum from the Czech language, but I guess in a grand unified scheme for
managing the ISO-8859-2 character set, it would require a refinement/switch
to instantiate this sort of exception.
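One possible way to treat a digraph such as "ch" as a single letter is to split each word into collation units before ranking it. The following is a minimal sketch only, assuming REBOL/Core 2.x; `units`, `split-units` and `unit-key` are hypothetical names, the alphabet is deliberately abbreviated, and the sample words are made up so that every letter is in it.

```rebol
REBOL [Title: "Digraph-aware collation units (sketch)"]

; Hypothetical, abbreviated alphabet in which "ch" follows "h".
units: ["a" "b" "c" "d" "e" "f" "g" "h" "ch" "i"]

; Split a word into collation units, trying "ch" before single
; characters. Parse matches strings without regard to case.
split-units: func [word [string!] /local out u] [
    out: copy []
    parse/all word [
        any [
            copy u "ch" (append out u)
            | copy u skip (append out u)
        ]
    ]
    out
]

; Key: unit positions (999 if unknown), a 0 terminator so that a
; shorter word sorts before its extensions, then the word itself.
unit-key: func [word [string!] /local key pos] [
    key: copy []
    foreach u split-units word [
        pos: find units u
        append key either pos [index? pos] [999]
    ]
    append key 0
    append key word
]

words: ["hid" "chab" "cab" "idea" "had"]
keys: copy []
foreach w words [append/only keys unit-key w]
sort keys
foreach k keys [print last k]
```

Because "ch" gets its own position after "h", "chab" is meant to land after "hid" rather than between "cab" and "had".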
You know, someone ought to invent a unified character representation and
maybe call it ... hmmm .. , let me see, maybe "Unicode" for example.
;)
Seriously, I've had *no* experience with whether Unicode necessarily makes
sorting any easier. My guess is "no".
>> The problem is - IMHO - that REBOL does not allow _really_ custom
>> sorts: although one can write a /compare refinement function but this
>> refinement is not as general-purpose as it first seems. Maybe
>> mathematicians can use custom comparisons for e.g. complex numbers,
>> but the refinement cannot easily be adapted to custom-order
>> series values, as is the case with strings. ...
I've used the /compare refinement several times and have found that it is
usable within its limits. But as Petr and I discussed last year, it does
not appear to lend itself to the type of sorting problems that we are using
here. Last year, I originally began to develop a complex /compare
algorithm, until it dawned on me that I could develop a more generic
solution using substitution, and then take advantage of the speed of the
native!-level 'sort. If I recall correctly, some samples showed that the
current method was significantly faster than a /compare function used alone.
I may be mis-remembering this fact, so don't "go to the bank" on it (take it
too seriously).
>> Specifying collation order for strings is the first step to
>> internationalization.
>> Being Europe a huge and linguistically not homogenous market, RT should
>> adopt a "plugin"-style localization: the 'locale object seems to be a
>> right place to this, i.e. putting custom collation sequences there.
I suspect that RT has already given this some thought, and has probably some
general idea about the "right way" to go about it. (They seem to have done
this about so many things that I doubt that they have neglected this
important area.) My *guess* is that they need to make some money before
they can make this next big step in making a truly internationalizable
product. Tcl has supported Unicode for some time, so I know that it is
certainly do-able at a base level. My ignorance begins in where to go from
Unicode. I leave that speculation to the people that actually know what
they are doing with computers! (I sleep better at night that way. You
should too!)
>> Maybe we need an additional switch that allows for the eastern european
>> desire to have smalls before capitals, and to interleave these together as
> Maybe I missed this in the English class :-) but does NOT sort English
> this way, too?
> What is the proper sorting order for mixed capitalized English words?
I hate to be the sole source in this area; I would much rather hear from
someone who knows a great deal more about computer science, knows English (I
just pretend to in order to make a living), and is infinitely more articulate
than myself (Joel? Sunanda? et al. Is anyone else here?). However, I'm never
overly embarrassed to make a complete fool of myself, so ...
What must be distinguished is the difference between the proper sort and the
way that computers have done it "easily" to date. I frankly don't know if
there is considered to be a proper sort in *American* English (we are hardly
proper about much at all except how to get in to a proper war! ;),
specifically small letters before capital letters. I feel sure that others
*do* know (I'm just a doctor AND I don't play one on television! Bad joke
that requires being an avid watcher of US television advertisements ... no
one ever gets it even here, so don't worry). What I do recall is that
**computer** sorting has historically been most easily accomplished by
using the ASCII character set representation of the alphabet. As you likely
already know, "A" is 65, "B" is 66 ... , and "a" is 97, "b" is 98. Non-case
sensitive sorts will do an implicit reduction of the "small" cases to the
capital cases by subtracting 32. (In the old days of Assembler language,
it only required a computationally cheap bit mask, clearing the bit worth 32,
for bytes over 96.) Since the capital letters came (in ASCII) before the
small letters, then case sensitive sorts placed the capital letters first.
The legacy of the computer age then places the "natural" sort as placing the
capital letters first. Please, someone slap me down if I have this totally
wrong.
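The character codes behind this can be checked at the console. The following is a minimal sketch, assuming REBOL/Core 2.x; the two `sort` lines repeat Carl Read's console session from earlier in the thread, which shows "AabB" and "ABab" as the results.

```rebol
REBOL [Title: "ASCII case codes (sketch)"]

print to-integer #"A"                          ; 65
print to-integer #"a"                          ; 97
print (to-integer #"a") - (to-integer #"A")    ; 32

; Clearing the bit worth 32 (mask 223 = 255 - 32) maps a lower-case
; ASCII letter onto its capital.
print to-char (to-integer #"a") and 223        ; A

; Default sort ignores case; /case compares by character code,
; so every capital precedes every small letter.
probe sort copy "AabB"
probe sort/case copy "AabB"
```

The mask only makes sense for the 26 unaccented letters; the accented letters discussed in this thread sit above code 127 and follow whatever layout the code page uses.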
>> you suggest. Sometimes it would be handy to have these options too here in
> On what occasion do you think it would be necessary for you
> (disregarding the special cases for writing custom softwares for
> Eastern Europe ;-) ) ?
Having a sort that went by case-insensitive letter with the option of
placing one type before the other would seem convenient (and would look
nicer), but I honestly can not tell you a specific time that this
requirement happened. (Remember, I've been exposed to heavy levels of lead
for too many years .... I've got to stop eating those lead paint chips!!
Maybe it is time to switch to mercury... ;)
>> the US. Just need a clever name or names for these switches (or paths in
>> REBOLese). Any ideas are welcomed.
<<quoted lines omitted: 3>>
> /smallsfirst
> /capitalized
I think these are some great ideas!
Thanks again for the feedback and stimulus. (Stimulus -> Response,
Stimulus -> Response ... it works ... at least in the laboratory ;)
--Scott Jones
[12/30] from: carl:cybercraft at: 14-May-2002 18:05
Re: Hungarian Alphabet Sort (was Re: Collation sequence -proper and effi
On 14-May-02, G. Scott Jones wrote:
> From: "Carl Read"
>> Would a sort/pattern be of use? ie...
<<quoted lines omitted: 3>>
> Are you suggesting this as a prototype of a call, or is there
> already such a beast out there?
As a prototype of a call - ie, as an extra refinement to 'sort.
Obviously...
sort/pattern "AabB" ["a" "A" "b" "B" "c" "C" "ch"]
etc. should also be supported.
> At any rate, this is an interesting idea as a way of introducing new
> or different sort patterns. I'll have to think about it a bit.
> If this already exists, certainly let us know.
Not as far as I know. Send the idea to Feedback if you think it'd be
useful. Who knows, it might be something that's quick and easy to
add to REBOL.
--
Carl Read
[13/30] from: nitsch-lists:netcologne at: 14-May-2002 12:09
Re: Hungarian Alphabet Sort (was Re: Collation sequence - proper and eff
Hi Scott, Geza, Carl,
not sure if this helps, but since I spent some time on it,
I post ;)
rebol [title: "char-mapping"]
{
Hi Scott, Geza, Carl,
instead of creating the mapping fully by hand,
i created a little dialect, which creates a parse-rule.
(not a very efficient one currently. contest! ;)
just a demo, lacks all special chars currently.
}
mapper: context [
cp: :copy ; shorthand used below; not a standard REBOL word, assumed to come from the author's own setup
{===patch the default mapping with your local specialities}
customize-ascii: [
at "h" [+ "ch"]
at "H" [+ "CH" = "Ch"]
]
"===logical mapping to insert / change easily"
logical-mapping: copy [] 'like [.. + "G" + "H" + "CH" = "Ch" + "I" ..]
"fill with ascii (attention 0-based.. ;)"
repeat i 128 [
append logical-mapping compose [+ (to string! to char! i - 1)]
]
{now one could, for example,
exchange upper & lower chars with some rebol-moves}
"===evaluate 'customize, insert custom strings"
parse customize-ascii [some [
'at set string string! set block block! (
insert find/case/tail logical-mapping string block
)
]]
? logical-mapping
"===numbered mapping to have the translation-codes"
numbered-mapping: copy [] 'like [.. "G" 71 "H" 72 "CH" 73 "Ch" 73 "I" 74 ..]
next-char: -1
parse logical-mapping [some [
['+ (next-char: next-char + 1) | '=]
set string string! (repend numbered-mapping [string next-char])
]]
? numbered-mapping
"===mapping rule to translate"
mapping-rule: cp [] 'like
[.. | "CH" (insert tail out #"I") | "H" (insert tail out #"H") | ..]
{attention: parse needs the longest strings first, so we reverse!}
parse head reverse numbered-mapping [some [
set code integer! set string string! (
append mapping-rule reduce [
string to-paren compose [insert tail out (to-char code)]
'|
]
)
]]
remove back tail mapping-rule
? mapping-rule
"===and now the mapping-function"
out: none
map: func [string] [
out: cp ""
parse/all/case string [any mapping-rule]
out
]
mapped-sort: func [block /local buf] [
buf: cp []
foreach string block [repend buf [map string string]]
sort/skip buf 2
clear block
forskip buf 2 [append block second buf]
block
]
"===test"
probe mapped-sort [
"A string with H mapped"
"A string with I mapped"
"A string with CH mapped"
]
]
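The sorting step in `mapped-sort` above (pair each string with a mapped key, sort the pairs with `sort/skip`, then keep only the originals) can be tried on its own with a much cruder key. The following is a minimal sketch only, assuming REBOL/Core 2.x; `make-key` and `keyed-sort` are hypothetical names, and rewriting "ch" as "h~" is just a stand-in for the generated parse rule: it pushes "ch" behind every "h" plus letter because "~" has a higher code than any letter.

```rebol
REBOL [Title: "Keyed sort with sort/skip (sketch)"]

; Stand-in key function: rewrite "ch" so that it sorts after "h".
make-key: func [word [string!]] [
    replace/all copy word "ch" "h~"
]

; Sort the block in place by key, as mapped-sort does.
keyed-sort: func [block [block!] /local buf] [
    buf: copy []
    ; build [key word key word ...]
    foreach word block [repend buf [make-key word word]]
    ; records of two values, ordered by the first one (the key)
    sort/skip buf 2
    clear block
    foreach [key word] buf [append block word]
    block
]

probe keyed-sort ["chata" "cena" "hrad" "had" "idea"]
```

Each key is computed once per word, so the comparisons themselves stay inside native `sort`.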
[14/30] from: gscottjones:mchsi at: 14-May-2002 8:14
From: "Volker Nitsch"
...
> not sure if this helps, but since I spent some time on it,
> I post ;)
<snipped code>
Hi, Volker,
Neat idea. Kind of like a good cut of beef, I'm going to have to chew on it
a bit to fully understand its potential. Thanks for the trans-atlantic
volley ball pass.
By the way, how should the small sharp s character (code 223 in
ISO-8859-2) sort compared to a regular s?
--Scott Jones
[15/30] from: carl:cybercraft at: 16-May-2002 0:30
On 15-May-02, G. Scott Jones wrote:
> From: "Volker Nitsch"
> ...
<<quoted lines omitted: 5>>
> chew on it a bit to fully understand its potential. Thanks for the
> trans-atlantic volley ball pass.
Glad you could work it out, as I couldn't make head nor tail of it. (:
Anyway, I've played around with my idea for sorting according to a
pattern, and while I'm not sure if the following code's very fast (or
bug-free:), like Volker, I post.
There are two functions: one to create a rule from a pattern, and
another to use the rule to sort strings or blocks of strings.
First, the functions...
pattern-rule: func [
"Create a rule for use by pattern-sort."
pattern [string! block!] "An ordered pattern."
/local rule n
][
rule: copy []
n: 1
forall pattern [
append rule reduce [pattern/1 to-paren reduce ['r n] '|]
n: n + 1
]
append rule reduce ['skip to-paren reduce ['r n]]
reduce ['some rule]
]
pattern-sort: func [
{Sort a string or block of strings based on a rule created
by pattern-rule.}
series [string! block!] "Series to sort."
rule [block!] "Pattern rule."
/reverse "Reverse sort order."
/local ptrs blk r pos val
][
ptrs: copy []
blk: copy []
r: func [n][append/only blk n]
bind rule 'r
either string? series [
parse/case series rule
pos: 1
foreach n blk [
append/only ptrs reduce [
n pick rule/2 (n - 1) * 3 + 1
]
val: next first back tail ptrs
if 'skip = val/1 [change val pick series pos]
pos: pos + either char? val/1 [1][length? val/1]
]
][
forall series [
clear blk
parse/case series/1 rule
append/only ptrs copy blk
append last ptrs series/1
]
]
either reverse [sort/reverse ptrs][sort ptrs]
clear series
forall ptrs [append series last ptrs/1]
series
]
And some examples of use...
>> rule-1: pattern-rule "aAbBcC"
== [some [#"a" (r 1) | #"A" (r 2) | #"b" (r 3) | #"B" (r 4) | #"c" (r
5) | #"C" (r 6) | skip (r 7)]]
>> pattern-sort "AacCBb" rule-1
== "aAbBcC"
>> pattern-sort ["Abc" "abc" "aBC" "ABC"] rule-1
== ["abc" "aBC" "Abc" "ABC"]
>> pattern-sort/reverse ["Abc" "abc" "aBC" "ABC"] rule-1
== ["ABC" "Abc" "aBC" "abc"]
>> rule-2: pattern-rule "AaBbCc"
== [some [#"A" (r 1) | #"a" (r 2) | #"B" (r 3) | #"b" (r 4) | #"C" (r
5) | #"c" (r 6) | skip (r 7)]]
>> pattern-sort "AacCBb" rule-2
== "AaBbCc"
>> pattern-sort ["Abc" "abc" "aBC" "ABC"] rule-2
== ["ABC" "Abc" "aBC" "abc"]
>> rule-3: pattern-rule ["a" "A" "b" "B" "ch" "c" "C"]
== [some ["a" (r 1) | "A" (r 2) | "b" (r 3) | "B" (r 4) | "ch" (r 5) |
"c" (r 6) | "C" (r 7) | skip (r 8)]]
>> pattern-sort "abcABCchCbA" rule-3
== "aAAbbBchcCC"
>> pattern-sort ["AabA" "chab" "chAB" "cchc" "achA"] rule-3
== ["achA" "AabA" "chab" "chAB" "cchc"]
It seems to work and might be of some use, but I'd test it well before
trusting it. It's had no real-world tests at all...
--
Carl Read
[16/30] from: gscottjones:mchsi at: 15-May-2002 9:38
From: "Carl Read"
> On 15-May-02, G. Scott Jones wrote:
> > From: "Volker Nitsch"
<<quoted lines omitted: 7>>
> > trans-atlantic volley ball pass.
> Glad you could work it out, as I couldn't make head nor tail of it. (:
I'm still cooking the meat on the barbie before I can chew it!
;)
> Anyway, I've played around with my idea for sorting according to a
> pattern, and while I'm not sure if the following code's very fast (or
> bug-free:), like Volker, I post.
>
> There's two functions: One to take a pattern for creating a rule from
> and another to use the rule to sort strings or blocks of strings
> with. First, the functions...
<nifty code snipped ... see:
http://www.escribe.com/internet/rebol/m22420.html
This is extremely promising. I drew from the ISO-8859-2 character set to
make a rule, and it initially seems to sort correctly. The time through is
roughly the same as my hack (but I've not really set-up a clean time
condition). The only problem so far occurs when I run my word sample list
through more than once. It seems to magically have kept the original sort/s
and continues to append new results to the block. I cannot seem to find
where the problem is occurring. Furthermore, I'm out of computer play-time
today, so it will have to wait. :(
Meanwhile, I'm waiting for Volker's serving to slow cook, so I can fully
savor it soon!
Thanks for the input. Very exciting because it is so clean!
--Scott Jones
[17/30] from: nitsch-lists:netcologne at: 15-May-2002 18:58
Hi Carl, Scott, Geza,
On Wednesday, 15 May 2002 at 14:30, Carl Read wrote:
> On 15-May-02, G. Scott Jones wrote:
> > From: "Volker Nitsch"
<<quoted lines omitted: 11>>
> > trans-atlantic volley ball pass.
> Glad you could work it out, as I couldn't make head nor tail of it. (:
Carl, my volley pass works similar to yours, except i made it more complicated
:)
your pattern-rule "aAbBcC" would look like
[+ "a" + "A" + "b" + "B" + "c" + "C"]
because i use blocks with strings, i can also map multi-char-codes.
so [+ "CH"] maps to one char. IIRC "CH" is handled like one char?
also i have [+ "CH" = "Ch"]. this says, "CH" and "Ch" are the same.
(add a new code-number for "CH" and use the same number for "Ch").
And in my telephone-book "ö" is handled like "oe", so one char
expands to two. So i need this kind of commands?
(Scott, i found no "ss" in this book, because "ss" has always two chars
before it and is rarely used. sorry..
http://www.uni-koeln.de/phil-fak/spinfo/lehre/java/kap23/collating.htm#3.3
duden 73: "ß" like "ss", by same words before (argh!)
since 96 changed: "ß" after "ss".
and "ä" the same as "a". telephone-book is wrong? or duden? hmm..
back to script.)
first the block is initialized with ascii-codes [.. + "@" + "A" ..]
then i could move whole char-blocks around, to mix "aAbB".
then comes
customize-ascii: [
at "h" [+ "ch"]
at "H" [+ "CH" = "Ch"]
]
which says {find "h" in block and insert [+ "ch"] behind}, same for "H".
now i have [.. + "h" + "ch" + "i"].
in a second pass i give numbers to the strings in this order,
in a third i create the parse-rule, which translates a string to the
sort-encoding.
for sorting i mix strings and their translations like
[translation1 string1 translation2 string2]
sort with sort/skip 2,
and extract the strings back to the original block.
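The decorate, sort, and extract steps just described can be sketched on their own, as a rough illustration rather than the script itself. Here `make-key` is a hypothetical stand-in for the real translation (it only lowercases a copy), and `key-sort` is likewise an illustrative name:

```rebol
REBOL []

; Hypothetical key function: a real collation key would come from
; a character mapping; this one just ignores case.
make-key: func [s [string!]] [lowercase copy s]

key-sort: func [block [block!] /local buf] [
    buf: copy []
    ; build [key1 string1 key2 string2 ...]
    foreach s block [repend buf [make-key s s]]
    ; native sort on records of two items, ordered by the key
    sort/skip buf 2
    ; write the original strings back in key order
    clear block
    foreach [key s] buf [append block s]
    block
]

probe key-sort ["pear" "Apple" "fig"]
```

The native sort only ever compares the translated keys, so the per-character work is done once per string rather than once per comparison.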
hmm, somehow i like your string more. if it could deal with multi-chars.
> Anyway, I've played around with my idea for sorting according to a
> pattern, and while I'm not sure if the following code's very fast (or
> bug-free:), like Volker, I post.
>
Good idea :-)
> There's two functions: One to take a pattern for creating a rule from
> and another to use the rule to sort strings or blocks of strings
<<quoted lines omitted: 75>>
> It seems to work and might be of some use, but I'd test it well before
> trusting it. It's had no real-world tests at all...
greetings
volker
[18/30] from: gscottjones:mchsi at: 15-May-2002 16:55
From: "G. Scott Jones"
> From: "Carl Read"
> > Anyway, I've played around with my idea for sorting according to a
<<quoted lines omitted: 9>>
> This is extremely promising. I drew from the ISO-8859-2 character set to
> make a rule, and it initially seems to sort correctly. The time through is
> roughly the same as my hack (but I've not really set-up a clean time
> condition). The only problem so far occurs when I run my word sample list
> through more than once. It seems to magically have kept the original sort/s
> and continues to append new results to the block. I cannot seem to find
> where the problem is occurring. Furthermore, I'm out of computer
> "play-time" today, so it will have to wait. :(
Responding to self (I talk to myself sometimes too!).
Wow, major latency in getting the posting (I'm not pointing fingers, so no
one get their "undies in a bundle" ;-).
Hi, Carl,
The idea did look promising, even for the "multi-letter graphemes" (like the
Czech "ch"), but then I believe we run into a limitation of 'parse. The
longer phrase rule needs to come before the shorter one, so that:
rule-4: pattern-rule ["a" "A" "b" "B" "c" "C" "h" "H" "ch" "Ch"]
will not correctly sort:
>> pattern-sort ["c" "ch" "h"] rule-4
== ["ch" "c" "h"]
;should be "c" "h" "ch"
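The limitation can be seen in isolation. In a parse rule the alternatives are tried from left to right and the first match wins, so a short string listed before a longer one that starts with it always shadows the longer one. A minimal sketch (the `tokens` helper is made up for illustration and is not part of pattern-sort):

```rebol
REBOL []

; Collect the pieces of input that the alternatives matched, in order.
tokens: func [input [string!] alternatives [block!] /local hits tok] [
    hits: copy []
    parse/all/case input [some [copy tok alternatives (append hits tok)]]
    hits
]

; "c" is listed before "ch", so "ch" is never reached
probe tokens "ch" ["c" | "h" | "ch"]

; longest alternative first: "ch" is matched as one unit
probe tokens "ch" ["ch" | "c" | "h"]
```

The first call splits the input into "c" and "h"; the second yields the single token "ch".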
At least one other person has mused over the desire to have a pattern sort
(in this case under GNU sort on Linux; look near the bottom):
http://budling.nytud.hu/~szigetva/etcetera/converters/README
In this case, the pattern has a bit more information:
a=á<b<c<cs<d<e=é<f<g<gy<h<i=í...<z<zs
where "a" can be told to sort the same as "a with acute", both of these sort
before "b" ... and "zs" sorts after "z"
Breaking apart this information might allow a parse rule to set-up the
sequence to allow the longer phrase rules to come before the shorter ones.
At least I think it would work.
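A rough sketch of how such a pattern might be broken apart, assuming "<" and "=" are the only separators. The function name `pattern-weights` and the ASCII-only sample pattern are made up here and are not taken from the converter mentioned above:

```rebol
REBOL []

; Turn a pattern such as "a=A<b<c<cs" into [string weight ...] pairs.
; "<" starts a new weight, "=" reuses the current one.
pattern-weights: func [pattern [string!] /local out n item sep letters] [
    out: copy []
    n: 1
    letters: complement charset "<="
    parse/all pattern [
        copy item some letters (repend out [item n])
        any [
            copy sep ["<" | "="] (if sep = "<" [n: n + 1])
            copy item some letters (repend out [item n])
        ]
    ]
    out
]

probe pattern-weights "a=A<b<c<cs<d"
```

For the sample pattern, "a" and "A" share weight 1 and "cs" gets its own weight after "c". The resulting pairs could then be ordered longest string first before a parse rule is built from them.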
Back to Geza, ...
Geza, how important are these "multi-letter graphemes" (cs, dz, dzs, gy, ly,
ny, sz, ty and zs) in a sort algorithm? At the same site, Péter Szigetvári
indicates that it can get very tricky:
Unfortunately, the task is not trivial: some sequences that look like
multi-letter graphemes are in fact not, e.g., bércsík may be ranked before
or after bérczerge depending on its morphology: bér+csík (after bérczerge)
or bérc+sík (before bérczerge). This can be decided only with a
morphological/semantic parser, which is probably not worth doing because the
problem practically never turns up.
http://budling.nytud.hu/~szigetva/etcetera/Hungarian/sorting.html
--Scott Jones
[19/30] from: carl:cybercraft at: 16-May-2002 18:58
On 16-May-02, G. Scott Jones wrote:
> This is extremely promising. I drew from the ISO-8859-2 character
> set to make a rule, and it initially seems to sort correctly. The
> time through is roughly the same as my hack (but I've not really
> set-up a clean time condition).
I've thoughts about how to speed it up - will be testing them out.
> The only problem so far occurs when
> I run my word sample list through more than once. It seems to
> magically have kept the original sort/s and continues to append new
> results to the block. I cannot seem to find where the problem is
> occurring.
It's in here...
forall series [
clear blk
parse/case series/1 rule
append/only ptrs copy blk
append last ptrs series/1
]
'forall leaves 'series at its tail, so the 'clear series that follows
clears nothing and the sorted results get appended to the old contents.
Change it to the following and it should fix that
problem. (Though not your other one. See my other post about that.)
foreach s series [
clear blk
parse/case s rule
append/only ptrs copy blk
append last ptrs s
]
--
Carl Read
[20/30] from: carl:cybercraft at: 16-May-2002 19:35
On 16-May-02, G. Scott Jones wrote:
> Hi, Carl,
> The idea did look promising, even for the "multi-letter graphemes"
<<quoted lines omitted: 17>>
> the sequence to allow the longer phrase rules to come before the
> shorter ones. At least I think it would work.
My first thoughts are that it'd work too, but then we're talking about
my coding here. (;
Anyway, only the order of the parse rules should need to be changed.
ie, this is what's currently generated...
>> probe rule-4
[some ["a" (r 1) | "A" (r 2) | "b" (r 3) | "B" (r 4) | "c" (r 5) | "C"
(r 6) | "h" (r 7) | "H" (r 8) | "ch" (r 9) | "Ch" (r 10) | skip (r
11)]]
Moving the "ch"s to the front of the rule gives us this...
rule-5: [some ["ch" (r 9) | "Ch" (r 10) | "a" (r 1) |
"A" (r 2) | "b" (r 3) | "B" (r 4) | "c" (r 5) |
"C" (r 6) | "h" (r 7) | "H" (r 8) | skip (r 11)
]]
Using that fixes your error above...
>> pattern-sort ["c" "ch" "h"] rule-5
== ["c" "h" "ch"]
though it screws up string sorting big-time...
>> pattern-sort "cchh" rule-5
== "bch"
(: Anyway, I'll see if I can get it to behave, and I'll try out the
speed improvements I thought of as well.
--
Carl Read
[21/30] from: carl:cybercraft at: 16-May-2002 20:15
On 16-May-02, Volker Nitsch wrote:
> Carl, my volley pass works similar to yours, except i made it more
> complicated :)
Simple things should be simple.
(So I can understand them;)
> your pattern-rule "aAbBcC" would look like [+ "a" + "A" + "b" + "B"
> + "c" + "C"] because i use blocks with strings, i can also map
> multi-char-codes. so [+ "CH"] maps to one char.
I allowed for that. It accepts a string or a block of strings.
Except it's bugged as it stands ): But it may be able to be fixed...
> IIRC "CH" is handled
> like one char? also i have [+ "CH" = "Ch"]. this says, "CH" and "Ch"
> are the same. (add a new code-number for "CH" and use the same
> number for "Ch").
This I didn't allow for. Currently my rule-blocks look like this...
["a" "b" "c" "ch" "CH"]
and words with "ch" in them would always precede ones with "CH" in them
after sorting. Placing equal ones in blocks would seem a nice solution...
["a" "b" "c" ["ch" "CH"]]
> And in my telephone-book "ö" is handled like
> "oe", so one char expands to two. So i need this kind of commands?
I can't see why we would, as we're sorting something in just the one
format, not changing the format. (I hope:)
> (Scott, i found no "ss" in this book, because "ss" has always two
> chars before it and is rarely used. sorry..
> http://www.uni-koeln.de/phil-fak/spinfo/lehre/java/kap23/collating.htm#3.3
> duden 73: "ß" like "ss", by same words before (argh!) since 96
> changed: "ß" after "ss". and "ä" the same as "a".
<<quoted lines omitted: 14>>
> for sorting i mix strings and their translations like [translation1
> string1 translation2 string2] sort with sort/skip 2,
I should've used sort/skip - it's one of the ways I'm hoping to speed
things up.
> and extract the strings back to the original block.
> hmm, somehow i like your string more. if it could deal with
> multi-chars.
It can - just not correctly. (; 'rule-3 showed how it's meant to
work...
>>>> rule-3: pattern-rule ["a" "A" "b" "B" "ch" "c" "C"]
>> == [some ["a" (r 1) | "A" (r 2) | "b" (r 3) | "B" (r 4) | "ch" (r
<<quoted lines omitted: 4>>
>>>> pattern-sort ["AabA" "chab" "chAB" "cchc" "achA"] rule-3
>> == ["achA" "AabA" "chab" "chAB" "cchc"]
But, as Scott pointed out, it doesn't get this right...
---8<---
rule-4: pattern-rule ["a" "A" "b" "B" "c" "C" "h" "H" "ch" "Ch"]
will not correctly sort:
>> pattern-sort ["c" "ch" "h"] rule-4
== ["ch" "c" "h"]
;should be "c" "h" "ch"
---8<---
Back to the drawing-board...
--
Carl Read
[22/30] from: gscottjones:mchsi at: 16-May-2002 7:44
From: "Carl Read"
> On 16-May-02, G. Scott Jones wrote:
>
> > This is extremely promising. I drew from the ISO-8859-2 character
> > set to make a rule, and it initially seems to sort correctly. The
> > time through is roughly the same as my hack (but I've not really
> > set-up a clean time condition).
>
> I've thoughts about how to speed it up - will be testing them out.
Great!
> > The only problem so far occurs when
> > I run my word sample list through more than once. It seems to
<<quoted lines omitted: 17>>
> append last ptrs s
> ]
Yep, needless to say, that fixed it.
rule-4: pattern-rule { !"#$%&'()*+,-./0123456789:;<=>?
@AaÁᡱÂâÄäĂăBbCcĆćČčÇçDdĎďĐđEeÉéĘęËëĚěFf
GgHhIiÍíÎîJjKkLlĹĺĄµŁłMmNnŃńŇňOoÓóÔôŐőÖöPp
QqRrŔŕŘřSs¦¶©ąŞşßTt«»ŢţUuŮůÚúÜüŰűVvWwXxYy
ÝýZz¬ĽŻżZz[\]^_`{|}~}
Here is the rule pattern I **generated** from my table for the ISO-8859-2
character set. Currently, this is sorted big-uns before little-uns. If a
character looks totally out of place, it is because this representation uses
the ISO-8859-1 encoding implicit in REBOL. Which brings me to the next
"problem": there will be no way to generate the proper character if it isn't
already contained in the standard character set. So some sorted solutions
will not appear to be correct until the result is displayed in the correct
character representation set. This does not appear to be a problem in
Hungarian (so far), but will be in other languages. Hmmm.....
Keep up the good work. I look forward to seeing your ideas regarding the
multi-letter graphemes in the separate post (which has not yet arrived
here ... whoops .. just arrived!).
Later...
--Scott Jones
[23/30] from: carl:cybercraft at: 17-May-2002 21:46
On 17-May-02, G. Scott Jones wrote:
> From: "Carl Read"
>> I've thoughts about how to speed it up - will be testing them out.
> Great!
Well, not so great, actually. The new version's faster, but not
markedly so. Perhaps 30% faster going by the single test of a long
list of random words I did, though it's still 7 or 8 times slower than
REBOL's sort. Maybe if it was all done with parsing it'd be faster,
but I'd have to re-think it all. (:
Anyway, here it is. See the end of the mail for how to handle
characters that are to be considered equal. ie "A" & "a" etc.
pattern-rule: func [
"Create a rule for use by pattern-sort."
pattern [string! block!] "An ordered pattern."
/local rule add-rule n
][
rule: copy []
add-rule: func [str][
str: to-string str
insert tail rule reduce [length? str reduce [
str to-paren reduce ['r n length? str] '|
]]
]
n: 1
foreach pos pattern [
either block? pos [
either 1 = length? pos [
foreach r pos/1 [add-rule r]
][
foreach r pos [add-rule r]
]
][
add-rule pos
]
n: n + 1
]
rule: extract next sort/reverse/skip rule 2 2
insert tail rule reduce ['skip to-paren reduce ['r n 1]]
reduce ['some rule]
]
pattern-sort: func [
{Sort a string or block of strings based on a rule created
by pattern-rule.}
series [string! block!] "Series to sort."
rule [block!] "Pattern rule."
/reverse "Reverse sort order."
/local new blk r pos
][
new: clear []
blk: clear []
r: func [n len][
insert tail blk n
if string? series [insert tail blk len]
]
bind rule 'r
either string? series [
parse/case series rule
pos: 0
foreach [n len] blk [
insert tail new reduce [
reduce [n] copy/part skip series pos len
]
pos: pos + len
]
][
foreach n series [
clear blk
parse/case n rule
insert tail new reduce [copy blk n]
]
]
either reverse [sort/skip/reverse new 2][sort/skip new 2]
clear series
insert tail series extract next new 2
series
]
Use is the same as before, though the rules that are generated are
different from before. ie...
>> rule-1: pattern-rule "aAbBcC"
== [some ["a" (r 1 1) | "A" (r 2 1) | "b" (r 3 1) | "B" (r 4 1) | "c"
(r 5 1) | "C" (r 6 1) | skip (r 7 1)]]
So...
>> pattern-sort "abcABC" rule-1
== "aAbBcC"
>> pattern-sort ["abc" "ABC" "aBc" "AbC"] rule-1
== ["abc" "aBc" "AbC" "ABC"]
>> rule-2: pattern-rule ["a" "b" "c" "h" "ch"]
== [some ["ch" (r 5 2) | "a" (r 1 1) | "b" (r 2 1) | "c" (r 3 1) | "h"
(r 4 1) | skip (r 6 1)]]
>> pattern-sort "ccchcc" rule-2
== "ccccch"
>> pattern-sort "hccchhh" rule-2
== "cchhhch"
>> pattern-sort ["c" "h" "ch" "h" "c"] rule-2
== ["c" "c" "h" "h" "ch"]
Now, to give the same weight to two or more characters, enclose them
in a block. The block can hold either a single string, in which case
all the characters in the string are weighted the same, or a group of
strings, which will all be weighted the same. ie...
>> rule-3: pattern-rule [["aA"]["bB"]["cC"]]
== [some ["a" (r 1 1) | "A" (r 1 1) | "b" (r 2 1) | "B" (r 2 1) | "c"
(r 3 1) | "C" (r 3 1) | skip (r 4 1)]]
>> pattern-sort "BBbbBBcCcAaA" rule-3
== "AaABBbbBBcCc"
>> pattern-sort ["Bbb" "bBB" "aA" "Aa"] rule-3
== ["aA" "Aa" "Bbb" "bBB"]
>> rule-4: pattern-rule ["a" "b" ["cC"]["hH"]["ch" "CH"]]
== [some ["ch" (r 5 2) | "CH" (r 5 2) | "a" (r 1 1) | "b" (r 2 1) |
"c" (r 3 1) | "C" (r 3 1) | "h" (r 4 1) | "H" (r 4 1) | skip (r...
>> pattern-sort "CHcCCcchbaHh" rule-4
== "abcCCcHhCHch"
>> pattern-sort ["hhCH" "ccCH" "hhch" "ccch" "hhCH"] rule-4
== ["ccCH" "ccch" "hhCH" "hhch" "hhCH"]
Also, I've allowed for characters not included in the rules: they
are given a weight one past the last entry in the rule. So this doesn't
generate an error...
>> pattern-sort ["rat" "hat" "cat"] rule-4
== ["cat" "hat" "rat"]
And the reverse refinement's still there...
>> pattern-sort/reverse ["rat" "hat" "cat"] rule-4
== ["rat" "hat" "cat"]
As before, no promises about how well this will perform with real
alphabets, but it should be a bit better than the last effort.
Hopefully. (;
--
Carl Read
[24/30] from: gscottjones:mchsi at: 17-May-2002 13:52
From: "Carl Read"
> The new version's faster, but not
> markedly so. Perhaps 30% faster going by the single test of a long
<<quoted lines omitted: 17>>
> alphabets, but it should be a bit better than the last effort.
> Hopefully. (;
By George, I think you've done it! At least it appears to sort the sample
Hungarian word list correctly. That is a slick solution. It is about 35%
faster than my original effort. Good job. I like the way you handled the
characters of equivalent weight, although I've not put this aspect through
any testing. Good job!
--Scott Jones
[25/30] from: geza67:freestart:hu at: 17-May-2002 22:59
Hello Scott
> In this case, the pattern has a bit more information:
> a=á<b<c<cs<d<e=é<f<g<gy<h<i=í...<z<zs
> where "a" can be told to sort the same as "a with acute", both of these sort
> before "b" ... and "zs" sorts after "z"
Actually a<>á and e<>é ... more precisely a<á and e<é. In some relaxed
situations the equivalence could be stated, but Hungarian grammar
is far too complex for me to be an "ex cathedra" judge of it.
> Geza, how important are these "multi-letter graphemes" (cs, dz, dzs, gy, ly,
> ny, sz, ty and zs) in a sort algorithm? At the same site, Péter Szigetvári
dz and dzs are used in transcriptive orthography, i.e. for transcribing
foreign words. E.g. dzs stands for j (the Hungarian language is more
phonetically oriented than any other Indo-European or Latin-legacy
language family). cs, gy, ly, ny, sz, ty and zs are "inborn"
Hungarian specialities; many words have them as components. How
important are they? That's a very hard question, because in a mixed
language text (e.g. a Hungarian medical report interspersed with medical
Latin terminology) one has to understand the word itself to determine
its sorting order: e.g. in a Hungarian word "ly" is a single phoneme
(it roughly corresponds to the English "y"; in Hungarian "j" is
phonetically equivalent to "ly", but orthographically some
words use one and some the other). If you don't know the word you
cannot even decide its hyphenation, as you wrote:
> "Unfortunately, the task is not trivial: some sequences that look like
> multi-letter graphemes are in fact not, e.g., bércsík may be ranked before
> or after bérczerge depending on its morphology: bér+csík (after bérczerge)
> or bérc+sík (before bérczerge). This can be decided only with a
bér-csík or bérc-sík - different sorting order and even different
hyphenation (just to satisfy your presumed curiosity about what these
words mean: the first could be translated as "payment stripe" [not a
logical word combination], the second as a geographical plane
[correct Hungarian word]). Without a dictionary, no program can get
through this, not even a semantic parser.
Back to these di-graphemes: they are important, fundamental parts of
our language but personally I can live without sorting them correctly
in a computer program. :-)
--
Best regards,
Geza mailto:[geza67--freestart--hu]
[26/30] from: gscottjones:mchsi at: 18-May-2002 6:51
From: "Geza Lakner MD"
> > In this case, the pattern has a bit more information:
> > a=á<b<c<cs<d<e=é<f<g<gy<h<i=í...<z<zs
> > where "a" can be told to sort the same as "a with acute", both of these sort
> > before "b" ... and "zs" sorts after "z"
> Actually a<>á and e<>é ... more precisely a<á and e<é. In some relaxed
> situations the equivalence could be stated, but Hungarian grammar
> is far too complex for me to be an "ex cathedra" judge of it.
How do you like Carl's representation?
<snip>
> Back to these di-graphemes: they are important, fundamental parts of
> our language but personally I can live without sorting them correctly
> in a computer program. :-)
That was the final opinion of the Hungarian author (Péter Szigetvári) of the
website I was using as a reference. By the way, he offers a number of
format conversion tools that are Hungarian friendly. They are written in
Perl.
http://budling.nytud.hu/~szigetva/etcetera/Hungarian.html
I almost have the ISO-8859-2 character set (for central Europe) mapped based
on the various sort orders that we discussed earlier. (I just remembered
that I forgot Petr K's Czech "ch" -- darn!) If you would like to use Carl
R's nifty sorting parser, I can transform the various sorting orders into
patterns for easy use (that was a very clever idea). What I do not have
is any authoritative resource that tells me the best order that covers "all"
the bases. My fear is that the letters with diacritics may sort differently
in the various languages covered by the ISO-8859-2 character set: Albanian,
Bosnian, Croatian, Czech, English, Finnish, Hungarian, Irish, German,
Polish, Romanian, Serbian (Latin transcription), Slovak, Slovenian, and
Sorbian (Lusatian). My master table can now handle any permutation, but it
is the actual orders that are so hard to come across.
Thanks for the feedback on the "multi-letter graphemes."
--Scott Jones
[27/30] from: carl:cybercraft at: 19-May-2002 12:32
On 18-May-02, G. Scott Jones wrote:
> By George, I think you've done it! At least it appears to sort the
> sample Hungarian word list correctly. That is a slick solution. It
> is about 35% faster than my original effort. Good job. I like the
> way you handled the characters of equivalent weight, although I've
> not put this aspect through any testing. Good job!
> --Scott Jones
Good to hear it seems to work Scott. Be interesting to know what
other languages it can work with.
One possible improvement in the creation of the rule would be to allow
for some of the strings in blocks to be treated as a collection of
separate characters, perhaps by using a different string datatype to
string!, such as file!. So that instead of this...
["a" "b" "c" "ch" "d" "e"]
we could have...
[%abc "ch" %de]
Though it'd probably be better round the other way. What would be the
best string datatype for such a job?
--
Carl Read
[28/30] from: gscottjones:mchsi at: 19-May-2002 12:17
From: "Carl Read"
> One possible improvement in the creation of the rule would be to allow
> for some of the strings in blocks to be treated as a collection of
<<quoted lines omitted: 5>>
> Though it'd probably be better round the other way. What would be the
> best string datatype for such a job?
It's not immediately obvious to me. Maybe something will become obvious
with some thought.
--Scott Jones
[29/30] from: brett:codeconscious at: 20-May-2002 9:38
Interesting question.
Tag! and Issue! might be useful for your design, but neither will be
able to handle certain characters. I figured I could write some code to
show what they are:
chars-in-form: function [
example-form [block!]
] [all-chars useable-chars ch test-value] [
all-chars: copy {}
useable-chars: copy {}
repeat i 255 [
append all-chars ch: to-char i
if all [
not error? try [test-value: load rejoin example-form]
1 = length? test-value
ch = first test-value
] [append useable-chars ch]
]
exclude all-chars useable-chars
]
print mold chars-in-form [#"#" ch]
print mold chars-in-form [#"<" ch #">"]
Regards,
Brett.
[30/30] from: carl:cybercraft at: 21-May-2002 21:56
On 20-May-02, Brett Handley wrote:
> Interesting question.
> Tag! and Issue! might be useful for your design but both will not be
> able to handle certain characters.
Actually, tags might be the best, as you can put strings in them and
they could then be used to hold the multiple-letter characters, with
plain strings being used for single-letter characters. ie...
["aAbBcC" <"ch" "CH"> "dDeE"]
(Yes, I know it's just one string in the tag, but to-block separates
them.)
And using blocks for same-value characters could still be used.
But I won't be changing the script till asked, since I'm not using it
myself. (:
--
Carl Read