Simulating Unicode support

[1/1] from: jan::skibinski::sympatico::ca at: 2-Dec-2002 12:01

Hi All,

I just posted an updated version of UTF-8, which attempts to simulate Unicode support in the Rebol console. The additional functions require some explanation, since they are not as obvious as the previously designed basic 'encode and 'decode functions:

1. Collect a bunch of cross-mapping files from www.unicode.org and store them in a directory, 'unicode-dir. The source is at http://www.unicode.org/Public/MAPPINGS/ . Each file provides a mapping from a specific character set, such as ISO-8859-2 or CP1250, to Unicode. We will be using them in the reverse direction. Make sure that you override the value of my default 'unicode-dir.

2. I provide two blocks with the names of such files, 'charset-windows and 'charset-iso, but you can run similar experiments with the Apple or Adobe cross-mapping files if you wish.

3. In the simplest case, comment out all filenames but one. Later on you can repeat the experiment with larger sets. The idea is to play with a superset of several specific character sets (Code Pages, as MS calls them) which do not overlap. For example, CP1250, CP1252 and CP1251 (Central Europe, Western Europe and Cyrillic) do not seem to overlap, but I have not tested that very thoroughly. I do know that adding the Baltic page to such a set screws up the Polish mapping, for example. So start small, with only one file.

4. Run the following (with 'charset-iso, or a block of Apple files, in place of 'charset-windows if you prefer):

    pan: cross-map charset-windows

   This establishes a superset unicode->256-int-table. There will be some many-to-one mappings, because your superset might be much bigger than 256. But this is OK, this is just a simulation.

5. Now select a font from the menu bar if you run it on Windows. If not, then you might be able to do it programmatically on the View console. Make sure that you choose one of the charsets, such as Central Europe.

6. Get some unicode-16 data, or just decode it from the 'glass examples I provide:

    decode 2 glass/Czech

7. Now run the following:

    to-alias-string 2 pan (decode 2 glass/Czech) 128
    == ...

   You should see a properly formatted Czech sentence here if you have chosen a Central European font, or something not-so-Czech if your chosen font is Cyrillic. [The 128 at the end is a substitute fallback character - Euro if you experiment with the Windows stuff.]

   Wait a second, why all that fuss? Couldn't we just select a charset and copy a Czech text to the console? Well, no: the idea is to test the mapping UTF-8 -> unicode -> local charset, to get a feel for how all of this would work if we had direct access to the Unicode characters (which Windows provides, btw). But as far as I know, playing with 'font itself is not a solution here, because currently 'font still considers all characters to be narrow, 1-byte.

8. Oh, I am using some %hof.r functions, such as 'foldl1 or 'map. Get just what you need from there, or substitute 'cross-map with something of your own.

I wish I could say "Have fun!" but this stuff is not as nice as the VID demos. :-)

Jan
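
For anyone who wants to see the idea behind steps 3 to 7 outside Rebol, here is a minimal Python sketch. It is not the code from the posted script. The function names parse_mapping, cross_map and to_local_bytes are hypothetical, and the tables are cut down to a few rows in the layout of the unicode.org mapping files. Two simplifications are assumptions of the sketch: ASCII is passed through unchanged, and when two code pages map the same Unicode character, the later table wins.

```python
# Illustrative Python sketch, not the Rebol code from the posted script.
# All function names below are hypothetical.

# A few rows in the layout of the unicode.org mapping files:
# column 1 is the byte in the code page, column 2 the Unicode code point.
CP1250_ROWS = """\
0x80    0x20AC  #EURO SIGN
0x81            #UNDEFINED
0xB9    0x0105  #LATIN SMALL LETTER A WITH OGONEK
0xE8    0x010D  #LATIN SMALL LETTER C WITH CARON
0xF8    0x0159  #LATIN SMALL LETTER R WITH CARON
"""

CP1251_ROWS = """\
0xE8    0x0438  #CYRILLIC SMALL LETTER I
0xF8    0x0448  #CYRILLIC SMALL LETTER SHA
"""

CP1257_ROWS = """\
0xE0    0x0105  #LATIN SMALL LETTER A WITH OGONEK
"""


def parse_mapping(text):
    """Read one mapping table in reverse: {unicode code point: byte}."""
    reverse = {}
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop the trailing comment
        if len(fields) < 2:                     # blank or undefined entry
            continue
        byte, code_point = int(fields[0], 16), int(fields[1], 16)
        reverse[code_point] = byte
    return reverse


def cross_map(tables):
    """Merge several reverse tables into one superset table.

    Assumption of this sketch: a later table overrides an earlier one
    when both define the same Unicode code point.
    """
    merged = {}
    for text in tables:
        merged.update(parse_mapping(text))
    return merged


def to_local_bytes(code_points, table, fallback=128):
    """Map code points to single bytes, using `fallback` for unknown ones."""
    return bytes(
        cp if cp < 128 else table.get(cp, fallback)  # ASCII passes through
        for cp in code_points
    )


# UTF-8 -> unicode -> local charset, for the Czech word "rec" with carons.
utf8_data = "\u0159e\u010d".encode("utf-8")
code_points = [ord(ch) for ch in utf8_data.decode("utf-8")]
local = to_local_bytes(code_points, cross_map([CP1250_ROWS]))

as_central_european = local.decode("cp1250")  # the original Czech word
as_cyrillic = local.decode("cp1251")          # same bytes, Cyrillic glyphs

# Many-to-one: two different characters land on the same byte 0xE8.
superset = cross_map([CP1250_ROWS, CP1251_ROWS])
same_byte = superset[0x010D] == superset[0x0438]

# Overlap: the Baltic page moves the Polish a-ogonek from 0xB9 to 0xE0.
with_baltic = cross_map([CP1250_ROWS, CP1257_ROWS])
```

The two decode calls at the end play the role of the font choice in step 5: the bytes are identical, and only the code page used to display them differs. The with_baltic table shows one way the Polish mapping could get disturbed when a Baltic page is added, since both pages define the same letter at different byte positions.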