Mailing List Archive: 49091 messages

[REBOL] Re: Hungarian Alphabet Sort (was Re: Collation sequence - proper and eff

From: gscottjones:mchsi at: 13-May-2002 21:43

From: "Geza Lakner MD"
>> Time to go back to the drawing board. I already have an idea, but it may
>> take a while before I have some time to create the new algorithm.
> Good luck to "braining out" the new enhanced algorithm. :-)
I think I've got it. In fact I'm expanding the idea to handle all the languages that use the ISO-8859-2 character set. Using the same basic underlying technique that I used before, I can set up sorting orders to accomplish any desired goal.

The biggest problem that I am running into is the actual sorting order. You've helped with the Hungarian language, and Petr/Cyphre helped with the Czech language. But I show that the following languages all (can) use the same character set: Albanian, Bosnian, Croatian, Czech, English, Finnish, Hungarian, Irish, German, Polish, Romanian, Serbian (Latin transcription), Slovak, Slovenian, Sorbian (Lusatian). I am collating *all* the characters/codes for each and I am making a few blind stabs at the sorting order, but it is not obvious to this chap from the US.

Then there are the exceptions like "ch" from Czech, and the German ß (sharp small s). Wow. I previously figured out how to manage the "ch" conundrum from the Czech language, but I guess in a grand unified scheme for managing the ISO-8859-2 character set, it would require a refinement/switch to enable this sort of exception.

You know, someone ought to invent a unified character representation and maybe call it ... hmmm ..., let me see, maybe "Unicode" for example. ;) Seriously, I've had *no* experience with whether Unicode necessarily makes sorting any easier. My guess is "no".
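To make the "ch" exception concrete, here is a minimal sketch in Python (not REBOL, and not the script discussed in this thread). It assumes a simplified table with only the letters a-z plus the Czech digraph "ch", which Czech collates as a single letter after "h"; accented letters are left out. The names `ELEMENTS`, `RANK`, `collation_key`, and the `digraphs` flag are hypothetical, with the flag standing in for the refinement/switch mentioned above.

```python
# Hypothetical, simplified collation table: plain a-z plus the Czech
# digraph "ch" placed after "h". Accented letters are deliberately omitted.
ELEMENTS = list("abcdefgh") + ["ch"] + list("ijklmnopqrstuvwxyz")
RANK = {element: position for position, element in enumerate(ELEMENTS)}

def collation_key(word, digraphs=True):
    """Turn a word into a list of ranks; optionally treat "ch" as one letter."""
    key = []
    text = word.lower()
    i = 0
    while i < len(text):
        if digraphs and text[i:i + 2] == "ch":
            key.append(RANK["ch"])   # one collation element, two characters
            i += 2
        else:
            key.append(RANK[text[i]])  # raises KeyError for letters not in the table
            i += 1
    return key

words = ["chata", "cihla", "hora", "duha"]
with_digraph = sorted(words, key=collation_key)
without_digraph = sorted(words, key=lambda w: collation_key(w, digraphs=False))
```

With the switch on, "chata" moves after "hora"; with it off, "chata" sorts under "c" as plain character-by-character ordering would have it.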
>> The problem is - IMHO - that REBOL does not allow _really_ custom
>> sorts: although one can write a /compare refinement function but this
>> refinement is not so general-aimed as it seems first. Maybe
>> mathematicians can use custom comparisons for e.g. complex numbers,
>> but the refinement can not easily be accommodated to custom-order
>> series values, as it is in the case of strings. ...
I've used the /compare refinement several times and have found that it is usable within its limits. But as Petr and I discussed last year, it does not appear to lend itself to the type of sorting problems that we are dealing with here. Last year, I began to develop a complex /compare algorithm, until it dawned on me that I could develop a more generic solution using substitution, and then take advantage of the speed of the native!-level 'sort. If I recall correctly, some samples showed that the current method was significantly faster than a /compare function used alone. I may be misremembering this fact, so don't "go to the bank" on it (take it too seriously).
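The contrast between the two approaches can be sketched as follows. This is an illustration in Python rather than REBOL, and it is not the code from this thread: the collation order (each small letter immediately before its capital) and the names `ORDER`, `RANK`, `substitute`, and `compare` are assumptions made for the example. Python's `key=` argument plays the role of the substitution step, and `functools.cmp_to_key` stands in for a /compare-style callback.

```python
import string
from functools import cmp_to_key

# Assumed collation order for illustration: aAbBcC...zZ
ORDER = "".join(lo + up for lo, up in
                zip(string.ascii_lowercase, string.ascii_uppercase))
RANK = {ch: position for position, ch in enumerate(ORDER)}

def substitute(word):
    """Substitution: replace each character by its rank, then let the
    built-in sort compare the resulting lists."""
    return [RANK[ch] for ch in word]

def compare(a, b):
    """The same ordering written as a pairwise comparison callback."""
    for x, y in zip(a, b):
        if RANK[x] != RANK[y]:
            return -1 if RANK[x] < RANK[y] else 1
    # All shared positions are equal: the shorter word sorts first.
    return (len(a) > len(b)) - (len(a) < len(b))

words = ["Bad", "abe", "Abe", "bad", "cab"]
by_substitution = sorted(words, key=substitute)
by_compare = sorted(words, key=cmp_to_key(compare))
```

Both calls produce the same order. The substitution version computes each key once per word, while the callback runs on every pairwise comparison, which is one plausible reason for a speed difference; no timing is claimed here.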
>> Specifying collation order for strings is the first step to
>> internationalization.
>> Being Europe a huge and linguistically not homogenous market, RT should
>> adopt a "plugin"-style localization: the 'locale object seems to be a
>> right place to this, i.e. putting custom collation sequences there.
I suspect that RT has already given this some thought, and probably has some general idea about the "right way" to go about it. (They seem to have done this about so many things that I doubt that they have neglected this important area.) My *guess* is that they need to make some money before they can take this next big step toward a truly internationalizable product. Tcl has supported Unicode for some time, so I know that it is certainly doable at a base level. My ignorance begins with where to go from Unicode. I leave that speculation to the people that actually know what they are doing with computers! (I sleep better at night that way. You should too!)
>> Maybe we need an additional switch that allows for the eastern european
>> desire to have smalls before capitals, and to interleave these together as
> Maybe I missed this in the English class :-) but does NOT sort English
> this way, too?
> What is the proper sorting order for mixed capitalized English words?
I hate to be the sole source in this area; I would much rather hear from someone who knew a great deal more about computer science, knew English (I just pretend to in order to make a living), and was infinitely more articulate than myself (Joel? Sunanda? et al. Is anyone else here?). However, I'm never overly embarrassed to make a complete fool of myself, so ...

What must be distinguished is the difference between the proper sort and the way that computers have done it "easily" to date. I frankly don't know if there is considered to be a proper sort in *American* English (we are hardly proper about much at all except how to get into a proper war! ;), specifically small letters before capital letters. I feel sure that others *do* know (I'm just a doctor AND I don't play one on television! Bad joke that requires being an avid watcher of US television advertisements ... no one ever gets it even here, so don't worry).

What I do recall is that **computer** sorting has historically been most easily accomplished by using the ASCII character set representation of the alphabet. As you likely already know, "A" is 65, "B" is 66 ..., and "a" is 97, "b" is 98. Case-insensitive sorts do an implicit reduction of the "small" letters to the capital letters by subtracting 32. (In the old days of Assembler language, it only required a computationally cheap bit mask, clearing the bit worth 32, for bytes from 97 through 122.) Since the capital letters come before the small letters in ASCII, case-sensitive sorts place the capital letters first. The legacy of the computer age is then that the "natural" sort places the capital letters first. Please, someone slap me down if I have this totally wrong.
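The ASCII arithmetic can be checked with a few lines of Python (used here only as a convenient calculator; the helper name `fold_upper` is hypothetical). For the bytes 97 through 122, subtracting 32 gives the same result as clearing the single bit worth 32, i.e. an AND with 0xDF.

```python
# Code points quoted in the message.
codes = {ch: ord(ch) for ch in "ABab"}

def fold_upper(byte):
    """Map an ASCII lowercase byte to its uppercase byte; leave others alone."""
    if 97 <= byte <= 122:      # 'a'..'z'
        return byte & 0xDF     # clear the bit worth 32; equals byte - 32 here
    return byte

def folded(word):
    """Case-insensitive sort key built from folded ASCII bytes."""
    return bytes(fold_upper(b) for b in word.encode("ascii"))

words = ["banana", "Apple", "apple", "Banana"]
case_sensitive = sorted(words)                # raw code points: capitals first
case_insensitive = sorted(words, key=folded)  # ties keep their input order
```

The case-sensitive result puts both capitalized words ahead of both lowercase ones, which is the "capitals first" legacy described above.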
>> you suggest. Sometimes it would be handy to have these options too here in
> On what occasion do you think it would be necessary for you
> (disregarding the special cases for writing custom softwares for
> Eastern Europe ;-) ) ?
Having a sort that went by case-insensitive letter with the option of placing one type before the other would seem convenient (and would look nicer), but I honestly cannot tell you a specific time when this requirement came up. (Remember, I've been exposed to heavy levels of lead for too many years .... I've got to stop eating those lead paint chips!! Maybe it is time to switch to mercury... ;)
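One way such an option could behave is sketched below in Python; this is an assumption about the desired behavior, not an existing REBOL feature. Words are ordered by their letters ignoring case, and case only breaks ties between otherwise identical words. The function names `smalls_first_key` and `capitals_first_key` are hypothetical.

```python
def smalls_first_key(word):
    """Primary key: letters ignoring case. Tie-break: lowercase before uppercase."""
    return (word.lower(), [ch.isupper() for ch in word])

def capitals_first_key(word):
    """Primary key: letters ignoring case. Tie-break: uppercase before lowercase."""
    return (word.lower(), [not ch.isupper() for ch in word])

words = ["apple", "Banana", "Apple", "banana"]
smalls_first = sorted(words, key=smalls_first_key)
capitals_first = sorted(words, key=capitals_first_key)
```

Either way the two spellings of each word stay next to each other instead of being split into a capitals block and a smalls block.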
>> the US. Just need a clever name or names for these switches (or paths in
>> REBOLese). Any ideas are welcomed.
> The most obvious (and highly uninspired ;-) ) naming would be:
> /international.
> Other ideas:
> /smallsfirst
> /capitalized
I think these are some great ideas! Thanks again for the feedback and stimulus. (Stimulus -> Response, Stimulus -> Response ... it works ... at least in the laboratory ;)

--Scott Jones