Mailing List Archive: 49091 messages

[REBOL] UTF-8 revisited

From: jan:skibinski:sympatico:ca at: 19-Nov-2002 12:26

Hello All,

I just posted a UTF-8 script to the RT library.

When browsing the Escribe archive I noticed a recurring theme: "When will we get support for Unicode in Rebol?", followed by the standard RT answers of the sort: "It is on the list of features to be implemented, but we are busy with something else now. If you really need it now, pay for it."

It seems to me that before we get such a toy, professionally done, robust and speedy, we could develop at least some sort of emulation tools which, although possibly slow, could handle the small tasks at hand. For example, a Unicode Rebol terminal seems likely to be a desired Unicode gadget.

I am by no means a Unicode expert, but a quick glance at the very good page "UTF-8 and Unicode FAQ for Unix/Linux" by Markus Kuhn,

http://www.cl.cam.ac.uk/~mgk25/unicode.html

convinced me that Unicode support for the Rebol console is doable. After all, Python, Perl and other scripting languages already have it. I am sure there are experts on this list who know how to intercept input/output streams and build a middle tier that would be able to handle Unicode.

For a start I took upon myself the challenge of UTF-8 encoding and decoding of the 4-octet and 2-octet wide (UCS-4 and UCS-2) representations of Unicode characters. For those unfamiliar with UTF-8, the page I cited above gives good motivation for why the encoding is needed.

I was partially inspired by the %utf8-encode.r script, posted by Oldes and Tenca. In essence, that script encodes ANSI (or Latin-1) strings (1 character = 1 octet) into longer strings, where characters from the upper half of the ANSI character table are represented by two octets instead of just one. Fine as it is for Latin-1 strings, it cannot handle anything of real interest, where Unicode shines, such as Latin-2 characters, CJK characters, mathematical symbols, and so on. So if you want to play with Hungarian, Czech, Dutch, etc. sequences you need something more general than utf8-encode.
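To make the Latin-1 case concrete, here is a minimal sketch in Python (not Rebol, and not the code of the Oldes and Tenca script, which is not reproduced in this message). The function name latin1_to_utf8 is made up for illustration. It shows the one-octet to two-octet expansion for the upper half of the table, and why input that is limited to one octet per character cannot express a Latin-2 letter such as U+0151.

```python
def latin1_to_utf8(data):
    """Re-encode Latin-1 octets as UTF-8 (hypothetical helper for illustration)."""
    out = bytearray()
    for octet in data:
        if octet < 0x80:
            # Lower half of the table: the octet is copied unchanged.
            out.append(octet)
        else:
            # Upper half: split the 8 bits into a 2-bit and a 6-bit group.
            out.append(0xC0 + octet // 64)   # lead octet  110000xx
            out.append(0x80 + octet % 64)    # trail octet 10xxxxxx
    return bytes(out)


if __name__ == "__main__":
    # 0xE9 (e with acute accent) grows from one octet to two.
    print(latin1_to_utf8(b"caf\xe9").hex())
    # U+0151 (o with double acute, used in Hungarian) has no Latin-1 octet at all,
    # so a Latin-1-only encoder never sees it.
    try:
        "\u0151".encode("latin-1")
    except UnicodeEncodeError:
        print("U+0151 is not representable in Latin-1")
```

The input here is a sequence of octets, so every value is at most 255 and the two-octet form always suffices. Anything beyond that range needs the more general scheme.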
Hopefully, the %utf-8.r I just posted will provide you with the basic tools for such tasks.

A few disclaimers:
------------------

1. The code is not optimized at all. I have seen an elegant approach taken by Paolo to improve on the original version of Oldes' script. Possibly something of this sort could be applied to %utf-8.r as well.

2. The implementation is based on the verbal descriptions of the algorithms found in some standards (the references are provided on the cited page and inside the script). The algorithms use several arithmetic operations: *, / and //. Although in principle shifts could replace the first two, Rebol does not support shift operations. The conversions are therefore not that efficient, especially considering that TO-INTEGER is also part of the game.

3. I do not handle the big/little endian signature of UCS as yet. So if you copy some Unicode file to Rebol and notice that the first two octets are FF FE or FE FF, that is the signature (FE FF marks big endian, FF FE little endian). Remove it, but make sure that the byte order is appropriate for your platform. The script assumes big endian.

4. The standards stress the importance of error handling for malformed sequences. I have not done any of that yet.

5. I have done very little testing so far.

6. I no longer enjoy implementing converters. In a way I consider such activity a waste of a programmer's time. Many formats have been devised, many of them are already gone, and what became of all that programming effort? So I programmed %utf-8.r in haste and with little joy. But someone had to do it, and it is a starting point for real optimization.

Improvements and suggestions for improvements are obviously welcome.

Regards,
Jan
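As an illustration of disclaimers 2 and 3, here is a rough Python sketch of the same idea: UTF-8 conversion done with multiplication, division and remainder only, plus a signature check for 2-octet input. This is not the code of %utf-8.r; the function names are invented for this example, the code-point range is capped at U+10FFFF as in current UTF-8, and, like the script, it does no validation of malformed input.

```python
def ucs4_to_utf8(cp):
    """Encode one code point as UTF-8 using only +, // and % (no shifts)."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        lead, trail_count = 0xC0, 1
    elif cp < 0x10000:
        lead, trail_count = 0xE0, 2
    elif cp < 0x110000:
        lead, trail_count = 0xF0, 3
    else:
        raise ValueError("code point out of range")
    trail = []
    for _ in range(trail_count):
        trail.append(0x80 + cp % 64)   # low 6 bits become a 10xxxxxx octet
        cp //= 64                      # same effect as shifting right by 6
    return bytes([lead + cp] + trail[::-1])


def utf8_to_ucs4(data):
    """Decode UTF-8 octets into a list of code points (no error handling)."""
    code_points = []
    i = 0
    while i < len(data):
        first = data[i]
        if first < 0x80:
            cp, trail_count = first, 0
        elif first < 0xE0:
            cp, trail_count = first - 0xC0, 1
        elif first < 0xF0:
            cp, trail_count = first - 0xE0, 2
        else:
            cp, trail_count = first - 0xF0, 3
        for k in range(1, trail_count + 1):
            cp = cp * 64 + (data[i + k] - 0x80)   # same as shifting left by 6
        code_points.append(cp)
        i += trail_count + 1
    return code_points


def ucs2_to_code_points(data):
    """Read 2-octet units, honouring an FE FF / FF FE signature if present."""
    big_endian = True                  # default when there is no signature
    if data[:2] == b"\xfe\xff":
        data = data[2:]
    elif data[:2] == b"\xff\xfe":
        big_endian = False
        data = data[2:]
    code_points = []
    for i in range(0, len(data) - 1, 2):
        a, b = data[i], data[i + 1]
        high, low = (a, b) if big_endian else (b, a)
        code_points.append(high * 256 + low)
    return code_points


if __name__ == "__main__":
    euro = ucs2_to_code_points(b"\xff\xfe\xac\x20")[0]   # little-endian U+20AC
    print(ucs4_to_utf8(euro).hex())
```

Dividing by 64 and taking the remainder modulo 64 is exactly the 6-bit shift and mask from the standard description, which is why a language without shift operators can still implement the conversion, only more slowly.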