Mailing List Archive: 49091 messages
 

UTF-8

 [1/3] from: alain::goye::free::fr at: 17-Oct-2004 19:27


Hi all,

I got interested in manipulating Unicode with REBOL and tried the UTF-8 script by Jan Skibinski.

It seems there is an error in the encode function, which did not correctly convert my test case: the first letter of the Khmer alphabet, whose code is U+1780, should become #{E19E80} in UTF-8, according to my understanding (based on http://www.zvon.org/tmRFC/RFC2279/Output/chapter2.html).

In case it may be helpful to someone, this version should work (though not optimized, and tested only with k=2 on U+1780 :-) :

    encode: func [
        k [integer!]
        ucs [string!]
        /local c f m x result [string!]
    ][
        result: make string! length? ucs
        f: pick fetch k
        parse/all ucs [any [c: k skip (
            either 128 > x: f c [
                insert tail result x
            ][
                result: tail result
                m: 64
                until [
                    insert result to char! x and 63 or 128
                    (m: m / 2) > x: x and -64 / 64
                ]
                insert result to char! x or pick udata 1 + length? result
            ]
        )]]
        head result
    ]
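For readers who want to check the expected value independently of REBOL, here is a minimal Python sketch of the UTF-8 bit layout. The helper name utf8_encode is hypothetical and is not part of Jan Skibinski's or Oldes's scripts; it only illustrates why U+1780 should come out as the three bytes E1 9E 80.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (hypothetical helper, not from utf-8.r)."""
    # Reject values outside the Unicode range and the UTF-16 surrogate block.
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {code_point:#x}")
    if code_point < 0x80:
        return bytes([code_point])          # plain ASCII, one byte
    if code_point < 0x800:
        lead, trailing = 0xC0, 1            # 110xxxxx + 1 continuation byte
    elif code_point < 0x10000:
        lead, trailing = 0xE0, 2            # 1110xxxx + 2 continuation bytes
    else:
        lead, trailing = 0xF0, 3            # 11110xxx + 3 continuation bytes
    # The lead byte carries the highest bits of the code point.
    out = [lead | (code_point >> (6 * trailing))]
    # Each continuation byte is 10xxxxxx and carries the next 6 bits.
    for i in range(trailing - 1, -1, -1):
        out.append(0x80 | ((code_point >> (6 * i)) & 0x3F))
    return bytes(out)


khmer_ka = utf8_encode(0x1780)
print(khmer_ka.hex().upper())  # E19E80
```

U+1780 falls in the three-byte range, so its 16 bits are split 4 + 6 + 6 across the lead byte and two continuation bytes, which matches the #{E19E80} quoted above.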

 [2/3] from: rebol-list2:seznam:cz at: 20-Oct-2004 14:35


Hello Alain,

It looks like you were using an older version. My latest utf-8.r script is available here:

http://oldes.multimedia.cz/rss/projects/utf-8_latest.rip (4kB)

I removed the to-ucs2 function, as I'm now using this ucs2.r script:

http://oldes.multimedia.cz/rss/projects/ucs2_latest.rip (2.5MB !!!)

The archive is pretty large because it includes all the available charmaps I collected, with the appropriate Rebol parsing rules already pre-generated. I use only cp1250 and ISO-8859-2, so I'm not sure whether the others work well, but they should if the included charmap sources are correct.

So if I need to encode a text which was written using 'cp1250' to utf-8, I do:

    ucs2/load-rules "cp1250"
    utf-8/encode-2 ucs2/encode "text with special char "

Theoretically I can also change the encoding of the text:

    ucs2/load-rules "cp1250"
    ucstext: ucs2/encode "text with special char "
    ucs2/load-rules "iso-8859-2"
    to-string ucs2/decode ucstext
    == "text with special char "

(but I have never used this, so it's not tested at all, and there may be a problem if the text contains Unicode chars which the decoder rule doesn't know)

In the UCS2 archive there is also a script which creates PHP code for ucs2 encoding (according to the charmap you need), as I was missing this in my PHP build. Isn't Rebol a great tool? :)

Feel free to let me know if you run into any trouble.

Cheers, Oldes

PS: I'm still a unicode newbie! I just made a script which works the way I need it, that's all.
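The same two-step idea (decode from a legacy charmap into Unicode, then encode into the target) can be sketched with Python's built-in codecs as a cross-check. This is only an illustration under stated assumptions: the function name transcode and the sample bytes are made up for this example, and nothing here reflects how ucs2.r is implemented internally.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Re-encode bytes from one charset to another via Unicode (hypothetical helper)."""
    # Step 1: legacy bytes -> Unicode text; step 2: Unicode text -> target bytes.
    return data.decode(source).encode(target)


# The single byte 0x9A is "s with caron" (U+0161) in cp1250.
cp1250_text = b"\x9a"
print(transcode(cp1250_text, "cp1250", "utf-8").hex())       # c5a1
print(transcode(cp1250_text, "cp1250", "iso-8859-2").hex())  # b9

# The caveat about characters the target does not know: the euro sign exists
# in cp1250 (byte 0x80) but has no slot in ISO-8859-2, so encoding fails.
euro_failed = False
try:
    transcode(b"\x80", "cp1250", "iso-8859-2")
except UnicodeEncodeError:
    euro_failed = True
print("euro sign convertible to iso-8859-2:", not euro_failed)
```

Going through Unicode as the intermediate form means each charset needs only one mapping table, rather than one table per pair of charsets.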

 [3/3] from: alain::goye::free::fr at: 21-Oct-2004 9:44


Thank you Oldes,

That's much more than I expected! Anyway, checking the old version was a chance to improve my understanding (I'm also new to Unicode...).

Cheers, Alain.