Mailing List Archive: 49091 messages

[REBOL] Re: UTF-8 revisited

From: rebol-list2:seznam:cz at: 26-Nov-2002 12:47

Hello Jan,

Tuesday, November 19, 2002, 6:26:03 PM, you wrote:

JS> I was partially inspired by the %utf8-encode.r script, posted by
JS> Oldes and Tenca. In essence, that script encodes ANSI (or Latin-1)
JS> strings (1 character = 1 octet) into longer strings, where
JS> characters from the upper part of the ANSI character table are
JS> represented by two octets, instead of just one.
JS> Fine as it is for Latin-1 strings, it cannot handle anything of
JS> real interest, where Unicode shines, such as Latin-2 characters,
JS> CJK characters, mathematical symbols, and so on.
JS> So if you want to play with Hungarian, Czech, Dutch, etc.
JS> sequences you need something more general than utf8-encode.
JS> Hopefully, the %utf-8.r I just posted will provide you with the
JS> basic tools for such tasks.

Hmm... yes, utf8-encode was written for only one reason (in FlashMX
all extended characters must be encoded), so it does not cover all
tasks.... I'm just looking at your script....

JS> Few disclaimers:
JS> ----------------
JS> 1. The code is not optimized at all. I've seen an elegant approach
JS> taken by Paolo, to improve on the original version of Oldes' script.
JS> Possibly something of this sort could be applied to %utf-8.r
JS> as well.

I've taken just a quick look at it. As for the optimizations:

a) use 'to integer! instead of 'to-integer (it's a little bit quicker
   if you call it many times):

   t: now/time/precise loop 1000000 [to-integer "1"] now/time/precise - t
   ;== 0:00:01.282 (== 0:00:01.202 Rebol/Base)
   t: now/time/precise loop 1000000 [to integer! "1"] now/time/precise - t
   ;== 0:00:00.872 (== 0:00:00.801 Rebol/Base)

b) the 'for loop is slow (because it's not a native), so use another
   way if you can find one:

   t: now/time/precise loop 100000 [for k 1 6 1 []] now/time/precise - t
   ;== 0:00:03.425 (== 0:00:03.014 Rebol/Base)
   t: now/time/precise loop 100000 [k: 1 loop 6 [k: k + 1]] now/time/precise - t
   ;== 0:00:00.361 (== 0:00:00.371 Rebol/Base)

c) if you can, use: second us instead of us/2

   us: [0 192 224 240 248 252]
   t: now/time/precise loop 10000000 [us/2] now/time/precise - t
   ;== 0:00:06.599 (== 0:00:06.459 R/B)
   t: now/time/precise loop 10000000 [second us] now/time/precise - t
   ;== 0:00:04.076 (== 0:00:04.076 R/B)
   ;but the problem is that usually you need to use parentheses:(
   t: now/time/precise loop 10000000 [(second us)] now/time/precise - t
   ;== 0:00:04.928 (== 0:00:04.887 R/B)

d) in your functions encode-integer and decode-integer you are
   defining the functions 'f and the blocks 'us, 'vs and 'cases on
   every call. That's good for understanding how it works, as it's
   more readable, but in real use it's a big waste of time (what about
   making utf-8 an object with these blocks and functions defined only
   once?)

JS> 2. The implementation is based on the verbal descriptions of
JS> algorithms found in some standards (The references are provided
JS> on the cited page and inside the script). The algorithms use
JS> several arithmetic operations: *, / and //. Although in principle
JS> the shifts could replace the first two, Rebol does not
JS> support shift operations. The conversions are therefore not that
JS> efficient, especially considering that TO-INTEGER is also
JS> a part of the game.

Maybe a new native function for shift would be good in new Rebols:)
I hope that the reason we don't have shift is not that in Rebol
you cannot type << (although you can use >>)

JS> Improvements and suggestions for improvements are obviously
JS> welcome.
I'm testing: from RFC2279 sec.4:

-----
The UCS-2 sequence representing the Hangul characters for the Korean
word "hangugo" (D55C, AD6D, C5B4) may be encoded as follows:

    ED 95 9C EA B5 AD EC 96 B4
-----

It looks like it's working:

   x: #{ED959CEAB5ADEC96B4}
   decode-2 x
   ;== #{D55CAD6DC5B4}
   encode-2 decode-2 x
   ;== #{ED959CEAB5ADEC96B4}

I've made the speed improvements explained above:

Your version:

   do %/d/view/public/www.reboltech.com/library/scripts/utf-8.r
   t: now/time/precise loop 10000 [encode-2 decode-2 x] now/time/precise - t
   ;== 0:00:11.677 (== 0:00:07.551 with Rebol/Base)

Improved:

   do %/d/library/utf-8.r
   t: now/time/precise loop 10000 [utf-8/encode-2 utf-8/decode-2 x] now/time/precise - t
   ;== 0:00:05.819 (== 0:00:04.346 with Rebol/Base)

Hmm... Rebol/Base must have some other improvements, not only some
functions removed from Rebol/Core :)

You forgot to include the 'intersperse function used by 'to-ucs2, so
I've made my own to-ucs2 function. And I will try to make some Latin-2
to UCS-2 converter (if I can find some documentation on how to do it:)

=( Oliva David )=======================( [oliva--david--seznam--cz] )==
=( Earth/Europe/Czech_Republic/Brno )=============================
=( coords: [lat: 49.22 long: 16.67] )=============================

-- Attached file included as plaintext by Listar --
-- File: utf-8.r

REBOL [
    Title: "UTF-8"
    Date: 26-Nov-2002
    Name: "UTF-8"
    Version: 1.0.1
    File: %utf-8.r
    Author: "Jan Skibinski"
    Co-author: "Oldes"
    Purpose: {
        Encoding and decoding of UCS-4/UCS-2 binaries
        to and from UTF-8 binaries.
    }
    History: [
        1.0.1 26-Nov-2002 {
            Oldes: Speed optimizations (not so readable now:(
            + fixed to-ucs2 function}
        1.0.0 20-Nov-2002 {
            Jan: Basic UTF-8 encoding and decoding functions.
            Limitations: Does not handle big/little endian signatures yet.
            Needs thorough testing and algorithm optimizations.}
    ]
    Email: [jan--skibinski--sympatico--ca]
    Category: [crypt 4]
    Acknowledments: {
        Inspired by the script 'utf8-encode.r of Romano Paulo Tenca
        and Oldes, which encodes Latin-1 strings.
    }
]

comment {
    UCS means: Universal Character Set (or Unicode)
    UCS-2 means: 2-byte representation of a character in UCS.
    UCS-4 means: 4-byte representation of a character in UCS.
    UTF-8 means: UCS Transformation Format using 8-bit octets.

    The following excerpt from:
        UTF-8 and Unicode FAQ for Unix/Linux, by Markus Kuhn
        http://www.cl.cam.ac.uk/~mgk25/unicode.html
    provides motivations for using UTF-8.

    <<Using UCS-2 (or UCS-4) under Unix would lead to very severe
    problems. Strings with these encodings can contain as parts
    of many wide characters bytes like '\0' or '/' which have a
    special meaning in filenames and other C library function
    parameters. In addition, the majority of UNIX tools expects
    ASCII files and can't read 16-bit words as characters without
    major modifications. For these reasons, UCS-2 is not a suitable
    external encoding of Unicode in filenames, text files,
    environment variables, etc.

    The UTF-8 encoding defined in ISO 10646-1:2000 Annex D and
    also described in RFC 2279 as well as section 3.8 of the
    Unicode 3.0 standard does not have these problems. It is
    clearly the way to go for using Unicode under Unix-style
    operating systems.>>

    A copy of the aforementioned Annex D can be found on Markus's site:
    http://www.cl.cam.ac.uk/~mgk25/ucs/ISO-10646-UTF-8.html.
    Encoding and decoding functions implemented here are
    based on the descriptions of algorithms found in the Annex D.

    Testing:
    The page http://www.cl.cam.ac.uk/~mgk25/unicode.html has
    many pointers to a variety of test data. One of them is
    a UTF-8 sampler from the Kermit pages of Columbia University
    http://www.columbia.edu/kermit/utf8.html, where
    the phrase "I can eat glass and it doesn't hurt me."
    is produced in dozens of world languages.

    A shareware unicode editor 'EmEditor, from www.emurasoft.com
    can be used for copying, editing and saving unicode samples
    from web browsers. Since it saves its output in
    UCS-2, UCS-4 (big and little endians), UTF-8 and UTF-7 formats
    it is a very good tool for testing.
}

comment {
    ------------------------------------------------------------
    SUMMARY of script UTF-8.R
    ------------------------------------------------------------
    decode-2        (binary -> binary)
    encode-2        (binary -> binary)
    decode-4        (binary -> binary)
    encode-4        (binary -> binary)
    decode-integer  (binary -> [integer integer])
    encode-integer  (integer -> binary)
    to-ucs2         (string -> binary) ; auxiliary
}

utf-8: context [
    allchars: complement charset []

    to-ucs2: func [
        {
        Converts ANSI (Latin-1) string to UCS-2 octet string.
        This is an auxiliary function, just for testing.
        }
        ascii [string!]
        /local result [binary!]
    ][
        result: make binary! 2 * length? ascii
        parse/all ascii [
            any [copy c allchars (insert result join c #{00})]
        ]
        head reverse result
    ]

    encode-2: func [
        {
        Encode a binary string of UCS-2 octets into
        a UTF-8 encoded binary octet stream.
        }
        us [binary!]
        /local x result [binary!]
    ][
        result: copy #{}
        while [not tail? us][
            x: (256 * first us) + second us
            insert tail result encode-integer x
            us: skip us 2
        ]
        result
    ]

    encode-4: func [
        {
        Encode a binary string of UCS-4 octets into
        a UTF-8 encoded binary octet stream.
        }
        us [binary!]
        /local x result [binary!]
    ][
        result: copy #{}
        while [not tail? us][
            x: (16777216 * first us) + (65536 * second us)
                + (256 * third us) + fourth us
            insert tail result encode-integer x
            us: skip us 4
        ]
        result
    ]

    decode-2: func [
        {
        Decode a UTF-8 encoded binary string to a UCS-2 binary string
        }
        xs [binary!]
        /local z vs us result [binary!]
    ][
        result: copy #{}
        while [not tail? xs][
            us: decode-integer xs
            vs: copy []
            z: to integer! ((first us) / 256)
            insert vs z
            z: (first us) - (z * 256)
            insert tail vs z
            insert tail result to binary! vs
            xs: skip xs second us
        ]
        result
    ]

    decode-4: func [
        {
        Decode a UTF-8 encoded binary string to UCS-4 binary string
        }
        xs [binary!]
        /local z1 z vs us result [binary!]
    ][
        result: copy #{}
        while [not tail? xs][
            us: decode-integer xs
            vs: copy []
            z: us/1
            foreach k [16777216 65536 256][
                z1: to integer! (z / :k)
                insert tail vs z1
                z: z - (z1 * :k)
            ]
            insert tail vs z
            insert tail result to binary! vs
            xs: skip xs second us
        ]
        result
    ]

    encode-integer: func [
        {
        Encode 4-byte (32-bit) UCS-4 integer to
        a sequence of UTF-8 octets.
        }
        [throw]
        x [integer!]
        /local f k result [binary!]
    ][
        k: 1
        loop 6 [
            if x <= encases/:k [
                result: to binary! enf :k x
                break
            ]
            k: k + 1
        ]
        result
    ]

    decode-integer: func [
        {
        Decode sequence of 1-6 octets into 32-bit unsigned integer.
        Return a pair made of a decoded integer and a count
        of bytes used from the input string.
        }
        xs [binary!]
        /local f k result [block!]
    ][
        k: 1
        loop 6 [
            if (first xs) <= pick decases k [
                result: to block! def :k xs
                insert tail result :k
                break
            ]
            k: k + 1
        ]
        result
    ]

    ;-----functions and values extracted from the decode/encode integer
    enf: func [
        k x
        /local result
    ][
        result: to block! (us/:k + to integer! (x / vs/:k))
        if k > 1 [
            for z (k - 1) 1 -1 [
                insert tail result (
                    (to integer! (x / vs/:z)) // 64 + 128)
            ]
        ]
        result
    ]

    def: func [
        k xs
        /local m result
    ][
        result: ((first xs) - us/:k) * vs/:k
        if k >= 2 [
            for z 2 k 1 [
                m: k - :z + 1
                result: result + ((xs/:z - 128) * vs/:m)
            ]
        ]
        result
    ]

    us: [0 192 224 240 248 252]
    vs: [1 64 4096 262144 16777216 1073741824]
    encases: [
        127         ; 0000 007F
        2047        ; 0000 07FF
        65535       ; 0000 FFFF
        2097151     ; 001F FFFF
        67108863    ; 03FF FFFF
        2147483647  ; 7FFF FFFF
    ]
    decases: [127 223 239 247 251 253]
]
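
A minimal usage sketch for the attached script, assuming it has been saved as %utf-8.r in the current directory (the path is an assumption; adjust it to wherever the file lives). It repeats the RFC 2279 test from the message and calls the lower-level integer functions on the first character of that sample.

```rebol
; Assumption: the attachment was saved as %utf-8.r next to this snippet.
do %utf-8.r

; UTF-8 sample from RFC 2279 sec. 4 (Korean "hangugo")
x: #{ED959CEAB5ADEC96B4}

probe utf-8/decode-2 x                  ; reported in the message as #{D55CAD6DC5B4}
probe utf-8/encode-2 utf-8/decode-2 x   ; reported in the message as #{ED959CEAB5ADEC96B4}

; One code point at a time: 54620 is U+D55C, the first character above.
probe utf-8/encode-integer 54620

; decode-integer returns a block: [decoded-integer octets-consumed]
probe utf-8/decode-integer x
```

Because decode-integer also returns the number of octets it consumed, a caller can step through a UTF-8 binary with skip, which is how decode-2 and decode-4 walk their input.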