Mailing List Archive: Re: UTF-8 revisited

[REBOL] Re: UTF-8 revisited

From: rebol-list2:seznam:cz at: 26-Nov-2002 12:47


Hello Jan,

Tuesday, November 19, 2002, 6:26:03 PM, you wrote:

JS>     I was partially inspired by the %utf8-encode.r script, posted by
JS>     Oldes and Tenca. In essence, that script encodes ANSI (or Latin-1)
JS>     strings (1 character = 1 octet) into longer strings, where
JS>     characters from the upper part of the ANSI character table are
JS>     represented by two octets instead of just one.

JS>     Fine as it is for Latin-1 strings, it cannot handle anything of
JS>     real interest, where Unicode shines, such as Latin-2 characters,
JS>     CJK characters, mathematical symbols, and so on.

JS>     So if you want to play with Hungarian, Czech, Dutch, etc.
JS>     sequences you need something more general than utf8-encode.
JS>     Hopefully, the %utf-8.r I just posted will provide you with the
JS>     basic tools for such tasks.

Hmm... yes, utf8-encode was written for only one reason (in Flash MX
all extended characters must be encoded), so it does not cover all
tasks.... I'm just looking at your script....

JS>     A few disclaimers:
JS>     ----------------
JS>     1. The code is not optimized at all. I've seen an elegant approach
JS>     taken by Paolo to improve on the original version of Oldes' script.
JS>     Possibly something of this sort could also be applied to %utf-8.r.

I've taken just a quick look at it. As for optimizations:

a) use to integer! instead of 'to-integer (it's a little quicker if you
call it many times):
   t: now/time/precise loop 1000000 [to-integer "1"] now/time/precise - t
   ;== 0:00:01.282   (== 0:00:01.202 Rebol/Base)
   t: now/time/precise loop 1000000 [to integer! "1"] now/time/precise - t
   ;== 0:00:00.872   (== 0:00:00.801 Rebol/Base)

b) 'for is slow (because it's not a native), so use another way if you can find one:
   t: now/time/precise loop 100000 [for k 1 6 1 []] now/time/precise - t
   ;== 0:00:03.425    (== 0:00:03.014 Rebol/Base)
   t: now/time/precise loop 100000 [k: 1 loop 6 [k: k + 1]] now/time/precise - t
   ;== 0:00:00.361    (== 0:00:00.371 Rebol/Base)

c) if you can, use second us instead of us/2:
   us: [0 192 224 240 248 252]
   t: now/time/precise loop 10000000 [us/2] now/time/precise - t
   ;== 0:00:06.599    (== 0:00:06.459 R/B)
   t: now/time/precise loop 10000000 [second us] now/time/precise - t
   ;== 0:00:04.076    (== 0:00:04.076 R/B)
   ;but the problem is that usually you need to use parentheses :(
   t: now/time/precise loop 10000000 [(second us)] now/time/precise - t
   ;== 0:00:04.928    (== 0:00:04.887 R/B)

d) in your functions encode-integer and decode-integer you are defining
the functions 'f and the blocks 'us, 'vs and 'cases on every call. That's
good for understanding how it works, as it's more readable, but in use
it's a big waste of time (what about making utf-8 an object with these
blocks and functions defined only once?)
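
A minimal sketch of point d), not taken from utf-8.r: the words time-it,
slow-lead, fast and lead-byte are hypothetical names used only for
illustration. It contrasts a table rebuilt on every call with a table
built once inside an object, and times both with the same
now/time/precise pattern as the one-liners above.

```rebol
REBOL []

; Hypothetical timing helper: evaluates BODY N times, returns elapsed time.
time-it: func [n [integer!] body [block!] /local t] [
    t: now/time/precise
    loop n body
    now/time/precise - t
]

; Variant 1: the lookup table is copied again on every call.
slow-lead: func [k [integer!] /local us] [
    us: copy [0 192 224 240 248 252]
    pick us k
]

; Variant 2: the table is created once, when the object is made.
fast: context [
    us: [0 192 224 240 248 252]
    lead-byte: func [k [integer!]] [pick us k]
]

probe slow-lead 3        ; 224
probe fast/lead-byte 3   ; 224

; Timings depend on the machine, so no expected values are given.
print time-it 100000 [slow-lead 3]
print time-it 100000 [fast/lead-byte 3]
```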

JS>     2. The implementation is based on the verbal descriptions of
JS>     algorithms found in some standards (The references are provided
JS>     on the cited page and inside the script). The algorithms use
JS>     several arithmetic operations: *, / and //. Although in principle
JS>     the shifts could replace the first two, Rebol does not
JS>     support shift operations. The conversions are therefore not that
JS>     efficient, especially considering that TO-INTEGER is also
JS>     part of the game.

Maybe a new native function for shift would be good in new Rebols :)
I hope the reason we don't have shift is not that you cannot type <<
in Rebol (although you can use >>)
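
Until such a native exists, shifts on non-negative integers can be
emulated with the same * and / arithmetic the script already uses. A
minimal sketch; shift-left and shift-right are hypothetical helper
names, not part of utf-8.r or of Rebol itself:

```rebol
REBOL []

; Hypothetical helpers, meant for non-negative integers only.
; shift-left can overflow the integer range for large inputs.
shift-left: func [x [integer!] n [integer!]] [
    x * to integer! power 2 n
]
shift-right: func [x [integer!] n [integer!]] [
    to integer! x / power 2 n
]

probe shift-left 13 12        ; 13 * 4096 = 53248
probe shift-right 54620 12    ; 54620 / 4096, truncated = 13
```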

JS>     Improvements and suggestions for improvements are obviously
JS>     welcome.

I'm testing with the example from RFC 2279, sec. 4:
-----
   The UCS-2 sequence representing the Hangul characters for the Korean
   word "hangugo" (D55C, AD6D, C5B4) may be encoded as follows:

   ED 95 9C EA B5 AD EC 96 B4
-----
It looks like it's working:
 x: #{ED959CEAB5ADEC96B4}
 decode-2 x ;== #{D55CAD6DC5B4}
 encode-2 decode-2 x  ;== #{ED959CEAB5ADEC96B4}
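
The first character of that RFC example (D55C -> ED 95 9C) can also be
checked by hand with plain arithmetic. A sketch only; the words b1, b2
and b3 are illustrative and do not appear in utf-8.r:

```rebol
REBOL []

x: 54620                                   ; hex D55C
b1: 224 + to integer! (x / 4096)           ; 224 = hex E0 marks a 3-octet sequence
b2: 128 + ((to integer! (x / 64)) // 64)   ; middle 6 bits
b3: 128 + (x // 64)                        ; low 6 bits
probe to binary! reduce [b1 b2 b3]         ; #{ED959C}
```

The three values are 237, 149 and 156, i.e. ED, 95 and 9C.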

I've made the speed improvements explained above:
Your version:
     do %/d/view/public/www.reboltech.com/library/scripts/utf-8.r
     t: now/time/precise loop 10000 [encode-2 decode-2 x] now/time/precise - t
     ;== 0:00:11.677  (== 0:00:07.551 with Rebol/Base)

Improved:
     do %/d/library/utf-8.r

     t: now/time/precise loop 10000 [utf-8/encode-2 utf-8/decode-2 x] now/time/precise - t
     ;== 0:00:05.819  (== 0:00:04.346 with Rebol/Base)

Hmm... Rebol/Base must have some other improvements, not just some
functions removed from Rebol/Core :)

You forgot to include the 'intersperse function used in 'to-ucs2,
so I've made my own to-ucs2 function. I will also try to make a
Latin-2 to UCS-2 converter (if I can find some docs on how to do it :)

=( Oliva David )=======================( [oliva--david--seznam--cz] )==
=( Earth/Europe/Czech_Republic/Brno )=============================
=( coords: [lat: 49.22 long: 16.67] )=============================

-- Attached file included as plaintext by Listar --
-- File: utf-8.r

REBOL [
    Title: "UTF-8"
    Date: 26-Nov-2002
    Name: "UTF-8"
    Version: 1.0.1
    File: %utf-8.r
    Author: "Jan Skibinski"
	Co-author: "Oldes"
    Purpose: {Encoding and decoding of UCS-4/UCS-2 binaries
to and from UTF-8 binaries.
}
    History: [
		1.0.1 26-Nov-2002 {
		Oldes: Speed optimizations (not so readable now :(
		+ fixed to-ucs2 function}
		1.0.0 20-Nov-2002 {
		Jan: Basic UTF-8 encoding and decoding functions.
        Limitations: Does not handle big/little endian
        signatures yet. Needs thorough testing and algorithm
        optimizations.}
	]
    Email: [jan--skibinski--sympatico--ca]
    Category: [crypt 4]
    Acknowledgments: {
        Inspired by the script 'utf8-encode.r of Romano Paolo Tenca
        and Oldes, which encodes Latin-1 strings.
    }
]
comment {
    UCS means: Universal Character Set (or Unicode)
    UCS-2 means: 2-byte representation of a character in UCS.
    UCS-4 means: 4-byte representation of a character in UCS.
    UTF-8 means: UCS Transformation Format using 8-bit octets.

    The following excerpt from:
        UTF-8 and Unicode FAQ for Unix/Linux, by Markus Kuhn
        http://www.cl.cam.ac.uk/~mgk25/unicode.html
    provides motivations for using UTF-8.

    <<Using UCS-2 (or UCS-4) under Unix would lead to very severe
    problems. Strings with these encodings can contain as parts
    of many wide characters bytes like '\0' or '/' which have a
    special meaning in filenames and other C library function
    parameters. In addition, the majority of UNIX tools expects
    ASCII files and can't read 16-bit words as characters without
    major modifications. For these reasons, UCS-2 is not a suitable
    external encoding of Unicode in filenames, text files,
    environment variables, etc.

    The UTF-8 encoding defined in ISO 10646-1:2000 Annex D
    and also described in RFC 2279 as well as section 3.8
    of the Unicode 3.0 standard does not have these problems.
    It is clearly the way to go for using Unicode under Unix-style
    operating systems.>>

    A copy of the aforementioned Annex D can be found on Markus's site:
    http://www.cl.cam.ac.uk/~mgk25/ucs/ISO-10646-UTF-8.html.
    Encoding and decoding functions implemented here are
    based on the descriptions of algorithms found in Annex D.

    Testing: The page http://www.cl.cam.ac.uk/~mgk25/unicode.html
    has many pointers to a variety of test data. One of them
    is a UTF-8 sampler from Kermit pages of Columbia University
    http://www.columbia.edu/kermit/utf8.html, where the
    phrase "I can eat glass and it doesn't hurt me." is
    produced in dozens of world languages.

    A shareware unicode editor 'EmEditor, from www.emurasoft.com
    can be used for copying, editing and saving unicode samples
    from the web browsers. Since it saves its output in UCS-2,
    UCS-4 (big and little endians), UTF-8 and UTF-7 formats
    it is a very good tool for testing.

}

comment {
------------------------------------------------------------
SUMMARY of script UTF-8.R
------------------------------------------------------------
decode-2             (binary -> binary)
encode-2             (binary -> binary)

decode-4             (binary -> binary)
encode-4             (binary -> binary)

decode-integer       (binary -> [integer integer])
encode-integer       (integer -> binary)

to-ucs2              (string -> binary) ; auxiliary

}

utf-8: context [
    allchars: complement charset []
    to-ucs2: func [
        {
        Converts ANSI (Latin-1) string to UCS-2 octet string.
        This is an auxiliary function, just for testing.
        }
        ascii [string!]
        /local c result [binary!]
    ][
        result: make binary! 2 * length? ascii
        ; Each character is inserted at the head followed by a zero octet,
        ; so the final reverse yields big-endian pairs (00 xx) in input order.
        parse/all ascii [
            any [copy c allchars (insert result join c #{00})]
        ]
        head reverse result
    ]
    encode-2: func [
        {
        Encode a binary string of UCS-2 octets into a UTF-8
        encoded binary octet stream.
        }
        us [binary!]
        /local x result [binary!]
    ][
        result: copy #{}
        while [not tail? us][
            x: (256 * first us) + second us
            insert tail result encode-integer x
            us: skip us 2
        ]
        result
    ]

    encode-4: func [
        {
        Encode a binary string of UCS-4 octets into a UTF-8
        encoded binary octet stream.
        }
        us [binary!]
        /local x result [binary!]
    ][
        result: copy #{}
        while [not tail? us][

            x:  (16777216 * first us) + (65536 * second us) + (256 * third us) + fourth us
            insert tail result encode-integer x
            us: skip us 4
        ]
        result
    ]

    decode-2: func [
        {
        Decode a UTF-8 encoded binary string
        to a UCS-2 binary string
        }
        xs [binary!]
        /local z vs us result [binary!]
    ][
        result: copy #{}
        while [not tail? xs][
            us: decode-integer xs
            vs: copy []
            z: to integer! ((first us) / 256)
            insert vs z
            z: (first us) - (z * 256)
            insert tail vs z
            insert tail result to binary! vs
            xs: skip xs second us
        ]
        result
    ]

    decode-4: func [
        {
        Decode a UTF-8 encoded binary string
        to UCS-4 binary string
        }
        xs [binary!]
        /local z1 z vs us result [binary!]
    ][
        result: copy #{}
        while [not tail? xs][
            us: decode-integer xs
            vs: copy []
            z: us/1
            foreach k [16777216 65536 256][
                z1: to integer! (z / :k)
                insert tail vs z1
                z: z - (z1 * :k)
            ]
            insert tail vs z

            insert tail result to binary! vs
            xs: skip xs second us
        ]
        result
    ]
    encode-integer: func [
        {
        Encode 4-byte (32-bit) UCS-4 integer to a sequence
        of UTF-8 octets.
        }
        [throw]
        x [integer!]
        /local f k result [binary!]
    ][
        k: 1 loop 6 [
            if x <= encases/:k [
                result: to binary! enf :k x
                break
            ]
            k: k + 1
        ]
        result
    ]

    decode-integer: func [
        {
        Decode sequence of 1-6 octets into 32-bit unsigned
        integer. Return a pair made of a decoded integer
        and a count of bytes used from the input string.
        }
        xs [binary!]
        /local f k result [block!]
    ][
        k: 1 loop 6 [
           if (first xs) <= pick decases k [
                result: to block! def :k xs
                insert tail result :k
                break
           ]
           k: k + 1
        ]
        result

    ]
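    ;----- functions and values extracted from encode-integer and decode-integer
    enf: func [
        k x
        /local result
    ][
        ; lead octet: length marker us/:k plus the top bits of x
        result: to block! (us/:k + to integer! (x / vs/:k))
        if k > 1 [
            ; continuation octets: 128 plus successive 6-bit groups of x
            for z (k - 1) 1 -1 [
                insert tail result (
                    (to integer! (x / vs/:z)) // 64 + 128)
            ]
        ]
        result
    ]
    def: func [
        k xs
        /local m result
    ][
        ; strip the length marker from the lead octet and weight it
        result: ((first xs) - us/:k) * vs/:k
        if k >= 2 [
            ; add the 6-bit payload of each continuation octet
            for z 2 k 1 [
                m: k - :z + 1
                result: result + ((xs/:z - 128) * vs/:m)
            ]
        ]
        result
    ]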
    us: [0 192 224 240 248 252]
    vs: [1 64 4096 262144 16777216 1073741824]
    encases: [
             127         ; 0000 007F
             2047        ; 0000 07FF
             65535       ; 0000 FFFF
             2097151     ; 001F FFFF
             67108863    ; 03FF FFFF
             2147483647  ; 7FFF FFFF
    ]
    decases: [127 223 239 247 251 253]
]