UTF-8
[1/3] from: alain::goye::free::fr at: 17-Oct-2004 19:27
Hi all,
I got interested in manipulating Unicode with REBOL and tried the UTF-8 script by Jan
Skibinski.
It seems there is an error in the encode function: it did not correctly convert my
test case. The first letter of the Khmer alphabet, whose code is U+1780, should become #{E19E80}
in UTF-8, according to my understanding (based on http://www.zvon.org/tmRFC/RFC2279/Output/chapter2.html).
In case it is helpful to someone, this version should work (though it is not optimized and
was tested only with k=2 on U+1780 :-) :
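; fetch and udata are defined elsewhere in the utf-8.r script
encode: func [
    k [integer!]
    ucs [string!]
    /local c f m x result [string!]
][
    result: make string! length? ucs
    f: pick fetch k
    parse/all ucs [any [c: k skip (
        either 128 > x: f c [
            ; value below 128: appended as-is
            insert tail result x
        ][
            result: tail result
            m: 64
            until [
                ; prepend a continuation byte (10xxxxxx) holding the low 6 bits
                insert result to char! x and 63 or 128
                ; drop those 6 bits; stop once the rest fits in the lead byte
                (m: m / 2) > x: x and -64 / 64
            ]
            ; lead byte: remaining bits or'ed with the length marker from udata
            insert result to char! x or pick udata 1 + length? result
        ]
    )]]
    head result
]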
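
[Editor's sketch] To make the U+1780 test case easy to check independently, here is a minimal Python version of the same loop. It is not the script discussed in this thread; the function name utf8_encode and the lead_markers table are assumptions made for illustration (the table stands in for what udata appears to provide), and it is limited to sequences of up to 4 bytes.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point (U+0000..U+10FFFF) as UTF-8 bytes."""
    if code_point < 0x80:
        # 7-bit values are stored unchanged in a single byte.
        return bytes([code_point])

    # Hypothetical table: lead-byte marker for each sequence length.
    lead_markers = {2: 0xC0, 3: 0xE0, 4: 0xF0}

    tail = []      # continuation bytes, built from last to first
    limit = 64     # halves each round: payload capacity of the lead byte
    x = code_point
    while True:
        # Prepend a continuation byte (10xxxxxx) with the low 6 bits.
        tail.insert(0, (x & 0x3F) | 0x80)
        x >>= 6
        limit //= 2
        if x < limit:
            break  # the remaining bits fit in the lead byte

    length = len(tail) + 1
    return bytes([x | lead_markers[length]] + tail)


# The Khmer letter from the message above.
print(utf8_encode(0x1780).hex().upper())  # E19E80
```

The result can be compared against Python's built-in encoder with chr(0x1780).encode("utf-8").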
[2/3] from: rebol-list2:seznam:cz at: 20-Oct-2004 14:35
Hello Alain,
Hi, it looks like you were using an older version. My latest utf-8.r script
is available here:
http://oldes.multimedia.cz/rss/projects/utf-8_latest.rip (4kB)
I removed the to-ucs2 function as I'm using this ucs2.r script:
http://oldes.multimedia.cz/rss/projects/ucs2_latest.rip ( 2.5MB !!!)
The archive is pretty large because it includes all the charmaps I could
collect, together with pre-generated Rebol parsing rules for each of them.
I use only cp1250 and ISO-8859-2, so I'm not sure whether the others
work well, but they should if the included charmap sources are correct.
So if I need to encode a text written in 'cp1250' to utf-8, I do:
ucs2/load-rules "cp1250"
utf-8/encode-2 ucs2/encode "text with special char Š"
In theory I can also change the encoding of a text:
ucs2/load-rules "cp1250"
ucstext: ucs2/encode "text with special char Š"
ucs2/load-rules "iso-8859-2"
to-string ucs2/decode ucstext
== "text with special char ©"
(but I have never used this, so it's not tested at all, and there may be
a problem if the text contains Unicode chars which the decoder rules don't know)
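
[Editor's sketch] The two conversions above can be cross-checked outside REBOL. The Python below is only an illustration using Python's standard codecs, not the ucs2.r or utf-8.r scripts; the helper name transcode is an assumption. It also shows why the result ends in ©: Š is byte 0x8A in cp1250 but byte 0xA9 in ISO-8859-2, and 0xA9 is displayed as © when the bytes are viewed as Latin-1.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Re-encode bytes from one charset to another via Unicode."""
    # encode() raises UnicodeEncodeError if the target charset
    # has no byte for one of the decoded characters.
    return data.decode(source).encode(target)


cp1250_text = "text with special char Š".encode("cp1250")

# cp1250 -> UTF-8: Š (U+0160) becomes the two bytes C5 A0.
utf8_text = transcode(cp1250_text, "cp1250", "utf-8")
print(utf8_text[-2:].hex().upper())    # C5A0

# cp1250 -> ISO-8859-2: the byte for Š changes from 8A to A9.
latin2_text = transcode(cp1250_text, "cp1250", "iso-8859-2")
print(latin2_text[-1:].hex().upper())  # A9

# Viewed as Latin-1, byte A9 is the copyright sign.
print(latin2_text.decode("latin-1"))   # text with special char ©
```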
In the UCS2 archive there is also a script which creates PHP code for
ucs2 encoding (according to the charmap you need), as I was missing this in my
PHP build.
Isn't Rebol a great tool? :)
Feel free to let me know if you run into any trouble.
Cheers, Oldes
PS: I'm still a Unicode newbie! I just made a script which works
the way I need it to, that's all.
[3/3] from: alain::goye::free::fr at: 21-Oct-2004 9:44
Thank you Oldes,
That's much more than I expected!
Anyway, checking the old version was a chance to improve my understanding
(I'm also new to Unicode...).
Cheers,
Alain.