[REBOL] Re: UTF-8 revisited
From: jan:skibinski:sympatico:ca at: 26-Nov-2002 14:51
Hello RebOldes,
Thank you for your response and suggestions. I have already done
some optimization and discovered some of the things you just
mentioned: some by trial and error, some by common sense.
So the "for" loops have been replaced, subroutines removed, etc.
But I would not even have thought of the other things you mention
here, such as using to integer! instead of to-integer, etc.
So thank you so much for those.
---------------------------------------------------
Anyway, here is the status:
I think my latest unpublished version is quite acceptable in speed.
But I am too tired now to do the final cleanup and the comparison
with what you posted today. This must wait till I get a few
hours of sleep. But I'll post the encode function here for you to review.
I was using your little sentence of Latin-1 characters
chars: ěščřžýáíé
looping 1000 times or so. While your version did it in about
0.20 s, my original non-optimized version dragged its feet at
about 13 s. That was clearly unacceptable. To cut a long story
short, my latest version does it in about 0.60 s in the
general case and in about 0.35 s when specialized to Latin-1,
while still keeping the same framework as for the other, more
general cases. Further specialization would obviously
degenerate to utf8-encode.
I am attaching a skeleton of one function only, so you will
see what I am bragging about here. There are a few subtle points
to be made:
+ If you can do all your arithmetic on chars, then you can be as
fast as utf8-encode. First of all, the time-consuming
to-integer is not required, because in this case the
operation / behaves as integer division. That means that
a / b gives you a character, which can be stored directly
in an output string; no conversion is required.
However, you also have to remember that multiplication
works modulo 256 too. This is why I am fiddling with the function
'f (unfinished for the case k=4) at the beginning of 'encode.
Adding 0 and keeping the operands in the right order
(256 * u/1, not u/1 * 256) is important if one expects
values greater than 255.
+ To make sure that dividing two integers still gives me an
integer, I use a little trick of 'and-ing with a negative
number, as in "x and -64 / 64". This seems to be
faster than to-integer. I have not tried your suggested
to integer! yet.
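The masking trick in the second point can be checked at the console.
This is only a minimal sketch, assuming REBOL/Core 2.x, where dividing
two integers yields a decimal! unless the division is exact. The value
1000 is just an illustration, not something from the benchmark.

```rebol
REBOL []

x: 1000

; Expressions evaluate left to right, so "x and -64 / 64"
; means (x and -64) / 64.
probe x / 64                ; 15.625, inexact division gives a decimal!
probe x and -64 / 64        ; 15, the low six bits are cleared first
probe to integer! x / 64    ; 15, the conversion suggested in the thread
```

Since -64 has all bits set except the low six, the masked value is
always a multiple of 64 and the division is exact.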
The cascade below goes from the fastest/cheapest case down to
the more elaborate ones: nothing much to do for ASCII
characters, a bit more for ANSI (Latin-1), and quite a bit
of work with to-char conversion for the most generic case.
Notice that, compared to the original version,
I simplified the entire scheme as well.
Best wishes,
Jan
encode: func [
    {
    Encode string of k-wide characters into UTF-8 string,
    where k: 1, 2 or 4.
    Case k = 1 could have been isolated for much
    improved speed.
    (integer -> string -> string)
    }
    k [integer!]
    ucs [string!]
    /local x m f result [string!]
][
    result: make ucs 0
    ; f reads one k-byte character at the current position of ucs
    f: switch :k [
        1 [func[u][u/1]]
        2 [func[
            u
        ][
            either u/1 > 0 [0 + u/2 + (256 * u/1)][u/2]
        ]]
        4 [func[u][u/4 + (256 * u/3)
            + (65536 * u/2) + (16777216 * u/1)]]
    ]
    while [not tail? ucs][
        x: f ucs
        result: tail result
        either x < 128 [
            ; ASCII: copied through unchanged
            insert result x
        ][
            either x < 256 [
                ; Latin-1: two bytes, the lead byte is inserted in front
                insert result x and 63 or 128
                insert result x / 64 or 192
            ][
                ; generic case: continuation bytes first, lead byte last
                m: 1
                ; stop once x fits into the payload bits of an m-byte lead byte
                while [x > pick [127 31 15 7 3 1] m][
                    insert result to-char (x and 63 or 128)
                    x: x and -64 / 64
                    m: m + 1
                ]
                insert result to-char (x or udata/3/:m)
            ]
        ]
        ucs: skip ucs k
    ]
    head result
]
Oh, you will need this too:
udata/3
== [0 192 224 240 248 252]
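To see how this table is used in the generic branch, here is a small
standalone sketch that encodes a single code point given as an integer.
It assumes REBOL/Core 2.x (8-bit chars). The names encode-one, lead and
limit are made up for this illustration, and the limit table (the
largest value that still fits into a lead byte of each length) is an
assumption of the sketch, not something taken from the code above.

```rebol
REBOL []

lead:  [0 192 224 240 248 252]   ; same values as udata/3
limit: [127 31 15 7 3 1]         ; assumed: max payload of an m-byte lead byte

encode-one: func [x [integer!] /local out m] [
    out: copy ""
    m: 1
    while [x > pick limit m] [
        ; emit the low six bits as a continuation byte 10xxxxxx;
        ; inserting at the head keeps the bytes in the right order
        insert out to-char (x and 63 or 128)
        x: x and -64 / 64        ; exact shift right by six bits
        m: m + 1
    ]
    insert out to-char (x or pick lead m)   ; lead byte goes in front
    to-binary out
]

probe encode-one 233     ; #{C3A9}
probe encode-one 2048    ; #{E0A080}
probe encode-one 8364    ; #{E282AC}
```

For 8364 the loop emits AC, then 82, leaving x = 2 with m = 3, so the
lead byte is 2 or 224 = 226 (E2).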
I hope I did not miss anything here.