Simulating Unicode support
[1/1] from: jan::skibinski::sympatico::ca at: 2-Dec-2002 12:01
Hi All,
I just posted an updated version of UTF-8, which attempts
to simulate Unicode support in the Rebol console. The additional
functions require some explanation, since they are not
as obvious as the previously designed basic 'encode and
'decode functions:
1. Collect a bunch of cross-mapping files from www.unicode.org
and store them somewhere in a directory 'unicode-dir.
The source is at http://www.unicode.org/Public/MAPPINGS/ .
Each file provides a mapping from a specific character set,
such as ISO-8859-2 or CP1250, to Unicode.
We will be using them in the reverse direction, from Unicode
back to the character set.
Make sure that you override the value of my default 'unicode-dir.
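For readers who want to see what "reverse mode" means in practice, here is a rough sketch in Python (not the Rebol code from my script). It assumes the usual layout of the MAPPINGS files: a byte value and a Unicode value in hex, followed by a `#` comment. The SAMPLE text and the names parse_mapping and reverse_mapping are made up for illustration only.

```python
# Sketch only: parse the text of a unicode.org style mapping file.
# SAMPLE is a hypothetical excerpt, not a copy of a real file.
SAMPLE = """\
# byte\tunicode\t#name
0x80\t0x20AC\t#EURO SIGN
0x81\t      \t#UNDEFINED
0xA5\t0x0104\t#LATIN CAPITAL LETTER A WITH OGONEK
0xB9\t0x0105\t#LATIN SMALL LETTER A WITH OGONEK
"""

def parse_mapping(text):
    """Return {byte_value: code_point} for every defined entry."""
    table = {}
    for line in text.splitlines():
        # Drop the trailing comment, then split the remaining columns.
        fields = line.split("#", 1)[0].split()
        if len(fields) < 2:
            continue  # comment line, blank line, or undefined byte
        table[int(fields[0], 16)] = int(fields[1], 16)
    return table

def reverse_mapping(table):
    """Flip the table so it goes from code point to byte value."""
    return {code_point: byte for byte, code_point in table.items()}

forward = parse_mapping(SAMPLE)
reverse = reverse_mapping(forward)
print(hex(reverse[0x0105]))
```

The reversed table is the one that matters for the rest of the steps: given a Unicode code point, it answers which byte to put on the console.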
2. I provide two blocks with names of such files: 'charset-windows
and 'charset-iso, but you can run similar experiments
with Apple or Adobe cross-mapping files if you wish.
3. In the simplest case, comment out all filenames but one.
Later on you can repeat the experiment with larger sets.
The idea is to play with some superset of several specific
character sets (Code Pages as MS calls them), which
do not overlap. For example, CP1250, CP1252 and CP1251
(Central Europe, Western Europe, and Cyrillic) do not seem
to overlap, I think, but I have not tested that very
thoroughly. I do know that adding the Baltic page to such a set
screws up the Polish mapping, for example. So start small,
with only one file.
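If you want to check a candidate superset before building it, something along these lines might help. This is a Python sketch, and it assumes Python's built-in cp125x codecs are an acceptable stand-in for the downloaded mapping files. The helper names forward_table and conflicts are hypothetical. A "conflict" here means a character that exists in both code pages but sits at a different byte value in each.

```python
def forward_table(codec):
    """Map byte values 128..255 to characters, skipping undefined bytes."""
    table = {}
    for byte in range(128, 256):
        try:
            table[byte] = bytes([byte]).decode(codec)
        except UnicodeDecodeError:
            pass  # this byte is not assigned in the code page
    return table

def conflicts(codec_a, codec_b):
    """Characters present in both code pages at different byte values."""
    rev_a = {ch: b for b, ch in forward_table(codec_a).items()}
    rev_b = {ch: b for b, ch in forward_table(codec_b).items()}
    return {ch: (rev_a[ch], rev_b[ch])
            for ch in rev_a.keys() & rev_b.keys()
            if rev_a[ch] != rev_b[ch]}

# Assumption: cp1257 is the "Baltic page" mentioned above.
for pair in [("cp1250", "cp1252"), ("cp1250", "cp1251"), ("cp1250", "cp1257")]:
    clash = conflicts(*pair)
    print(pair, len(clash), "conflicting characters")
```

Any pair that reports conflicts cannot be merged into one 256-entry table without one of the two code pages losing out for those characters.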
4. Run
    pan: cross-map charset-windows  ; or ISO, or Apple
This establishes a superset unicode->256-int-table.
There will be some many-to-one mappings, because your
superset might be much bigger than 256. But this is
OK, this is just a simulation.
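To make the superset idea concrete, here is a small Python sketch of what such a table could look like. It is not the 'cross-map from my script. It again leans on Python's built-in codecs instead of the mapping files, and the rule "first code page wins on a conflict" is purely an assumption made for this example.

```python
def cross_map(codecs):
    """Merge several code pages into one {code_point: byte} table."""
    table = {cp: cp for cp in range(128)}  # ASCII maps to itself
    for codec in codecs:
        for byte in range(128, 256):
            try:
                code_point = ord(bytes([byte]).decode(codec))
            except UnicodeDecodeError:
                continue  # unassigned byte in this code page
            # Assumed policy: keep the first mapping seen for a code point.
            table.setdefault(code_point, byte)
    return table

pan = cross_map(["cp1250", "cp1251"])
print(len(pan), "code points share", len(set(pan.values())), "byte values")
```

The table has more keys than there are byte values, which is the many-to-one situation described above: a Czech letter and a Cyrillic letter can land on the same byte, and the selected font decides which glyph you actually see.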
5. Now select a font from the menu bar if you run it on Windows.
If not, then you might be able to do it programmatically
in the View console. Make sure that you choose one of the
charsets, such as Central Europe.
6. Get some 16-bit Unicode data, or just decode it from the 'glass
examples I provide:
    decode 2 glass/Czech
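For comparison, this is roughly what that decoding step amounts to, sketched in Python. It assumes the 2 asks for two bytes per character. The Czech phrase is a stand-in chosen for this example, not the contents of glass/Czech.

```python
# Stand-in UTF-8 input (hypothetical sample, not the author's data).
utf8_bytes = "Příliš žluťoučký kůň".encode("utf-8")

# UTF-8 -> characters -> two bytes per character, big-endian.
text = utf8_bytes.decode("utf-8")
wide = text.encode("utf-16-be")

# Read the 16-bit values back as integers (Unicode code points here,
# since every character in the sample is in the Basic Multilingual Plane).
code_points = [int.from_bytes(wide[i:i + 2], "big")
               for i in range(0, len(wide), 2)]
print([hex(cp) for cp in code_points[:4]])
```

Note that the UTF-8 input is longer than the character count, because the accented letters take two bytes each in UTF-8, while the 16-bit form is always exactly two bytes per character for this sample.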
7. Now run the following:
    to-alias-string 2 pan (decode 2 glass/Czech) 128
    == ...
You should see a properly formatted Czech sentence
here if you have chosen a Central European font, or
something not-so-Czech if your chosen font is Cyrillic.
[The 128 at the end is the fallback character substituted
for anything unmapped - the Euro if you experiment with
the Windows stuff].
Wait a second, why all that fuss? Couldn't we just
select a charset and copy a Czech text to the console?
Well no, the idea is to test the mapping
UTF-8 -> unicode -> local charset, to get a feel for
how all of this would work if we had direct access to
the unicode characters (which Windows provides, btw).
But as far as I know, playing with 'font itself is
not a solution here because currently 'font still
considers all characters to be narrow, 1-byte.
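The whole chain UTF-8 -> unicode -> local charset, including the fallback, can be sketched in Python like this. It is an illustration under assumptions, not a port of 'to-alias-string. The names reverse_table and to_alias_bytes are hypothetical, Python's cp1250 and cp1251 codecs stand in for the Central European and Cyrillic fonts, and the sample sentence is not the one from 'glass.

```python
def reverse_table(codec):
    """Build {code_point: byte} for one code page (ASCII maps to itself)."""
    table = {cp: cp for cp in range(128)}
    for byte in range(128, 256):
        try:
            table[ord(bytes([byte]).decode(codec))] = byte
        except UnicodeDecodeError:
            pass  # unassigned byte
    return table

def to_alias_bytes(table, text, fallback=128):
    """Replace each character by its local byte, or by the fallback."""
    return bytes(table.get(ord(ch), fallback) for ch in text)

utf8_bytes = "Příliš žluťoučký kůň".encode("utf-8")   # stand-in input
text = utf8_bytes.decode("utf-8")                      # UTF-8 -> unicode
local = to_alias_bytes(reverse_table("cp1250"), text)  # unicode -> charset

# The same bytes, "displayed" with two different fonts:
print(local.decode("cp1250"))
print(local.decode("cp1251", errors="replace"))
```

The first line comes out as the original sentence, and the second shows the same bytes read as Cyrillic, which is the not-so-Czech effect from step 7. A character that the table does not know, such as a Cyrillic letter with the cp1250 table, turns into byte 128.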
8. Oh, I am using some %hof.r functions, such as 'foldl1
or 'map. Get just what you need from there or substitute
the 'cross-map with something of your own.
I wish I could say "Have fun!" but this stuff is
not as nice as the VID demos. :-)
Jan