Mailing List Archive: 49091 messages
 

[REBOL] Re: UTF-8 revisited

From: jan:skibinski:sympatico:ca at: 26-Nov-2002 14:51

Hello RebOldes,

Thank you for your response and suggestions. I have already done some optimization and discovered some of the things you just mentioned: some by trial and error, some by common sense. So the "for" loops have been replaced, subroutines removed, etc. But I would not even have thought about the other things you mention here, such as using to integer! instead of to-integer. So thank you so much for those.

---------------------------------------------------

Anyway, here is a status: I think my latest unpublished version is quite acceptable in speed. But I am too tired now to do the final cleanup and the comparison with what you posted today. This must wait until I get a few hours of sleep. But I'll post the encode function here for your revision.

I was using your little sentence of Latin-1 characters

    chars: =EC=9A=E8=F8=9E=FD=E1=ED=E9

looping 1000 times or so. While your version was doing it in about 0.20 s, my original non-optimized version was dragging its feet at about 13 s. That was clearly unacceptable.

To cut a long story short, my latest version does it in about 0.60 s in the general case and in about 0.35 s when specialized to Latin-1, while still keeping the same framework as for the other, more general cases. Further specialization would obviously degenerate to utf8-encode.

I am attaching a skeleton of one function only, so you will see what I am bragging about here. There are a few subtle points to be made:

+ If you can do all your arithmetic on chars, then you can be as fast as utf8-encode. First of all, the time-consuming to-integer is not required, because in this case the operation / behaves as integer division. That means that a / b gives you a character, which can be stored directly in the output string; no conversion is required. However, you also have to remember that multiplication works modulo 256 too. This is why I am fiddling with the function 'f (unfinished for the case k = 4) at the beginning of 'encode. Adding 0 and getting the order of multiplication right is important if one expects values to exceed 255.

+ To ensure that dividing two integers still gives me an integer, I use the little trick of 'and-ing with a negative number, as in "x and -64 / 64". This seems to be faster than to-integer. I have not tried your suggested to integer! yet.

The cascade below goes down from the fastest/cheapest case to some more elaborate cases: nothing much to do for ASCII characters, a bit more for ANSI (Latin-1), and quite a bit of work with to-char conversion for the most generic case.

Notice that, compared to the original version, I simplified the entire scheme as well.

Best wishes,
Jan

    encode: func [
        {
        Encode string of k-wide characters into UTF-8 string,
        where k: 1, 2 or 4.
        Case k = 1 could have been isolated for much improved speed.
        (integer -> string -> string)
        }
        k [integer!]
        ucs [string!]
        /local x m result [string!]
    ][
        result: make ucs 0
        ; f reads one k-wide character at the current position of ucs
        f: switch :k [
            1 [func [u][u/1]]
            2 [func [u][
                either u/1 > 0 [0 + u/2 + (256 * u/1)][u/2]
            ]]
            4 [func [u][u/4 + (256 * u/3) + (65536 * u/2) + (16777216 * u/1)]]
        ]
        while [not tail? ucs][
            x: f ucs
            ; each insert below goes in at this same position, so the
            ; bytes of one character are inserted last-to-first
            result: tail result
            either x < 128 [
                ; ASCII: stored as is
                insert result x
            ][
                either x < 256 [
                    ; Latin-1: two bytes, arithmetic done on chars
                    insert result x and 63 or 128
                    insert result x / 64 or 192
                ][
                    ; generic case: continuation bytes, then the lead byte
                    m: 1
                    while [x > 127][
                        insert result to-char (x and 63 or 128)
                        x: x and -64 / 64
                        m: m + 1
                    ]
                    insert result to-char (x or udata/3/:m)
                ]
            ]
            ucs: skip ucs k
        ]
        head result
    ]

Oh, you will need this too:

    udata/3 == [0 192 224 240 248 252]

I hope I did not miss anything here.
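For readers without a REBOL interpreter, the cascade described in the post can be modelled roughly as follows. This is a minimal sketch in Python, not the author's code: the names code_points, encode_one, encode and LEAD_MAX are hypothetical, and only the LEAD table comes from the post (it is the udata/3 block). One deliberate difference: the loop here stops when the remaining bits fit the lead byte for the current length (LEAD_MAX, an assumption of this sketch), rather than testing x > 127. With x > 127 alone, a value such as 2048 would leave the loop one step early.

```python
# Lead-byte markers for sequences of 1..6 bytes (the udata/3 table in the post).
LEAD = [0, 192, 224, 240, 248, 252]
# Largest payload the lead byte of an n-byte sequence can hold.
# Assumption of this sketch; the post does not have this table.
LEAD_MAX = [127, 31, 15, 7, 3, 1]


def code_points(data: bytes, k: int):
    """Combine each k-byte big-endian unit into one integer (k = 1, 2 or 4).

    A trailing partial unit, if any, is ignored.
    """
    for i in range(0, len(data) - k + 1, k):
        yield int.from_bytes(data[i:i + k], "big")


def encode_one(x: int) -> bytes:
    """Encode one code point (0 .. 0x10FFFF) as UTF-8."""
    if x < 128:
        return bytes([x])          # ASCII: stored as is
    tail = []
    m = 1
    while x > LEAD_MAX[m - 1]:
        tail.append((x & 63) | 128)   # continuation byte, lowest 6 bits first
        # Mirrors "x and -64 / 64" from the post: clear the low 6 bits, then
        # divide. Python's // is already integral, so the mask is only there
        # to show the shape of the trick.
        x = (x & -64) // 64
        m += 1
    # Lead byte first, then the continuation bytes in reverse order of creation.
    return bytes([x | LEAD[m - 1]] + tail[::-1])


def encode(k: int, data: bytes) -> bytes:
    """Encode a buffer of k-byte-wide characters into UTF-8."""
    return b"".join(encode_one(x) for x in code_points(data, k))
```

As a quick check, encode(2, "é€".encode("utf-16-be")) == "é€".encode("utf-8") should evaluate to True, and the same comparison can be made for k = 1 with "latin-1" input and for k = 4 with "utf-32-be" input.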