[REBOL] UTF-8 revisited
From: jan:skibinski:sympatico:ca at: 19-Nov-2002 12:26
Hello All,
I just posted a UTF-8 script to the RT library.
When browsing the Escribe archive I've noticed a recurring theme:
"When will we get support for Unicode in Rebol?"
and the standard RT answers of the sort: "It is on the list of
features to be implemented, but we are busy with something else now.
If you really need it now, pay for it."
It seems to me that before we get such a toy, professionally done,
robust and speedy, we could develop at least some sort of emulation
tools which, although possibly slow, could handle small tasks
at hand.
For example, a Unicode Rebol terminal seems likely to be a desired
Unicode gadget.
I am not by any means a Unicode expert, but a quick glance
at the very good page "UTF-8 and Unicode FAQ for Unix/Linux"
by Markus Kuhn, http://www.cl.cam.ac.uk/~mgk25/unicode.html,
convinced me that Unicode support for the Rebol console is doable.
After all, Python, Perl and other scripting languages already have it.
I am sure there are experts on this list, who know how to intercept
input/output streams and build a middle tier that would be able
to handle Unicode.
For a start I took upon myself the challenge of UTF-8 encoding
and decoding of 4-octet and 2-octet wide (UCS-4 and UCS-2)
representations of Unicode characters. For those unfamiliar with
UTF-8, the page I cited above explains well why such an encoding
is needed.
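
To make the encoding concrete, here is a minimal sketch in Python. It is
not the Rebol code from %utf-8.r; the function names are made up for
illustration. It deliberately uses only multiplication, division and
remainder instead of shifts (note that Python spells integer division
as // and remainder as %). It assumes the modern limit of U+10FFFF and
at most four octets per character; older descriptions of UTF-8 allowed
longer sequences. The decoder does no validation at all.

```python
def ucs4_to_utf8(code_point):
    """Encode one code point as UTF-8 octets, using arithmetic only.

    Illustrative sketch: surrogate values are not rejected here.
    """
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point out of range")
    if code_point < 0x80:
        return bytes([code_point])          # plain ASCII: one octet
    if code_point < 0x800:
        n_trailing, lead_mark = 1, 0xC0     # 110xxxxx
    elif code_point < 0x10000:
        n_trailing, lead_mark = 2, 0xE0     # 1110xxxx
    else:
        n_trailing, lead_mark = 3, 0xF0     # 11110xxx
    octets = []
    for _ in range(n_trailing):
        # The low six bits go into a trailing octet of the form 10xxxxxx.
        octets.append(0x80 + code_point % 64)
        code_point //= 64
    octets.append(lead_mark + code_point)   # what is left fits the lead octet
    octets.reverse()
    return bytes(octets)


def utf8_to_ucs4(octets):
    """Decode the UTF-8 octets of ONE character; assumes well-formed input."""
    first = octets[0]
    if first < 0x80:
        return first
    if first < 0xE0:
        value, n_trailing = first - 0xC0, 1
    elif first < 0xF0:
        value, n_trailing = first - 0xE0, 2
    else:
        value, n_trailing = first - 0xF0, 3
    for octet in octets[1:1 + n_trailing]:
        value = value * 64 + octet % 64     # append six payload bits
    return value


print(ucs4_to_utf8(0x20AC).hex())           # euro sign, prints e282ac
print(hex(utf8_to_ucs4(b"\xc3\xa9")))       # prints 0xe9
```
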
I was partially inspired by the %utf8-encode.r script, posted by
Oldes and Tenca. In essence, that script encodes ANSI (or Latin-1)
strings (1 character = 1 octet) into longer strings, where characters
from the upper part of the ANSI character table are represented by
two octets instead of just one.
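
The Latin-1 case described above is small enough to sketch in a few
lines of Python. This is only an illustration of the idea, not the
script by Oldes and Tenca, and the function name is hypothetical. It
relies on the fact that Latin-1 octet values coincide with the first
256 Unicode code points.

```python
def latin1_to_utf8(data):
    """Re-encode Latin-1 octets as UTF-8 (illustrative sketch)."""
    out = bytearray()
    for octet in data:
        if octet < 0x80:
            out.append(octet)                # lower half: copied unchanged
        else:
            out.append(0xC0 + octet // 64)   # lead octet, 110xxxxx
            out.append(0x80 + octet % 64)    # trailing octet, 10xxxxxx
    return bytes(out)


print(latin1_to_utf8(b"caf\xe9").hex())      # prints 636166c3a9
```

Every input octet maps to one or two output octets, so nothing outside
the 256 Latin-1 characters can ever be produced this way.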
Fine as it is for Latin-1 strings, it cannot handle anything of
real interest, where Unicode shines, such as Latin-2 characters,
CJK characters, mathematical symbols, and so on.
So if you want to play with Hungarian, Czech, Dutch, etc.
sequences you need something more general than utf8-encode.
Hopefully, the %utf-8.r I just posted will provide you with the
basic tools for such tasks.
A few disclaimers:
------------------
1. The code is not optimized at all. I've seen an elegant approach
taken by Paolo to improve on the original version of Oldes' script.
Possibly something of this sort could also be applied to %utf-8.r.
2. The implementation is based on the verbal descriptions of
algorithms found in some standards. (The references are provided
on the cited page and inside the script.) The algorithms use
several arithmetic operations: *, / and //. Although in principle
shifts could replace the first two, Rebol does not
support shift operations. The conversions are therefore not that
efficient, especially considering that TO-INTEGER is also
part of the game.
3. I do not handle the big/little endian signature of UCS as yet.
So if you copy some Unicode file to Rebol and notice that the
first two octets are FF FE or FE FF, that's the
signature. Remove it, but make sure that the byte order
is appropriate for your platform. I was assuming big endian
in the script.
4. The standards stress the importance of error handling
for malformed sequences. I have not done any of that
yet.
5. I have done very little testing so far.
6. I no longer enjoy implementing converters.
In a way I consider such activity a waste of a programmer's time.
Many formats have been devised, many of them are already
gone, and what became of all that programming effort?
So I programmed %utf-8.r in haste and with little joy.
But someone had to do it, and it's a starting point for real optimization.
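
Regarding disclaimer 3, the signature handling that %utf-8.r does not
do could look roughly like the Python sketch below. The function name
is hypothetical. FE FF marks big-endian data and FF FE little-endian
data; when no signature is present the sketch falls back to big endian,
which is an assumption chosen to match the script's default.

```python
def split_ucs2(data):
    """Return the 16-bit values in `data`, honouring a leading signature."""
    if data[:2] == b"\xff\xfe":
        big_endian, data = False, data[2:]   # little-endian signature
    elif data[:2] == b"\xfe\xff":
        big_endian, data = True, data[2:]    # big-endian signature
    else:
        big_endian = True                    # assumption: no signature means big endian
    if len(data) % 2:
        raise ValueError("odd number of octets")
    units = []
    for i in range(0, len(data), 2):
        if big_endian:
            high, low = data[i], data[i + 1]
        else:
            high, low = data[i + 1], data[i]
        units.append(high * 256 + low)       # arithmetic only, no shifts
    return units


print(split_ucs2(b"\xff\xfe\xac\x20"))       # prints [8364], i.e. U+20AC
```
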
Improvements and suggestions for improvements are obviously
welcome.
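
As a starting point for the error handling mentioned in disclaimer 4,
here is a hedged Python sketch of a well-formedness check. It is not
part of %utf-8.r and the function name is made up. It assumes the
modern rules: at most four octets per character, no overlong forms,
no surrogate values, and nothing above U+10FFFF.

```python
def is_well_formed_utf8(data):
    """Return True if `data` consists only of well-formed UTF-8 sequences."""
    i = 0
    while i < len(data):
        first = data[i]
        if first < 0x80:
            i += 1                           # plain ASCII octet
            continue
        if 0xC0 <= first < 0xE0:
            n_trailing, value, minimum = 1, first - 0xC0, 0x80
        elif 0xE0 <= first < 0xF0:
            n_trailing, value, minimum = 2, first - 0xE0, 0x800
        elif 0xF0 <= first < 0xF8:
            n_trailing, value, minimum = 3, first - 0xF0, 0x10000
        else:
            return False                     # stray trailing octet or invalid lead
        tail = data[i + 1:i + 1 + n_trailing]
        if len(tail) != n_trailing:
            return False                     # sequence cut short
        for octet in tail:
            if not 0x80 <= octet < 0xC0:
                return False                 # not of the form 10xxxxxx
            value = value * 64 + octet % 64
        if value < minimum or value > 0x10FFFF:
            return False                     # overlong form or out of range
        if 0xD800 <= value <= 0xDFFF:
            return False                     # surrogate value
        i += 1 + n_trailing
    return True
```

For example, the two octets C0 80 are rejected because they are an
overlong way of writing the value 0, and E2 82 alone is rejected
because its third octet is missing.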
Regards,
Jan