UTF-8 revisited
[1/12] from: jan:skibinski:sympatico:ca at: 19-Nov-2002 12:26
Hello All,
I just posted a UTF-8 script to the RT library.
When browsing the Escribe archive I noticed a recurring theme:
"When will we get support for Unicode in Rebol?"
and the standard RT answer of the sort: "It is on the list of
features to be implemented, but we are busy with something else
now. If you really need it now, pay for it."
It seems to me that before we get such a toy, professionally done,
a robust and speedy toy, we could develop at least some sort of
emulation tools which, although possibly slow, could handle small
tasks at hand. For example, a Unicode Rebol terminal seems likely
to be a desired Unicode gadget.
I am not by any means a Unicode expert, but a quick glance
at the very good page "UTF-8 and Unicode FAQ for Unix/Linux"
by Markus Kuhn, http://www.cl.cam.ac.uk/~mgk25/unicode.html,
convinced me that Unicode support for the Rebol console is doable.
After all, Python, Perl and other scripting languages already
have it.
I am sure there are experts on this list, who know how to intercept
input/output streams and build a middle tier that would be able
to handle Unicode.
For a start I took upon myself the challenge of UTF-8 encoding
and decoding of 4-octet and 2-octet wide (UCS-4 and UCS-2)
representations of Unicode characters. For those unfamiliar with
UTF-8, the page cited above provides good motivation for the
encoding.
I was partially inspired by the %utf8-encode.r script, posted by
Oldes and Tenca. In essence, that script encodes ANSI (or Latin-1)
strings (1 character = 1 octet) into longer strings, where
characters from the upper part of the ANSI character table are
represented by two octets instead of just one.
Fine as it is for Latin-1 strings, it cannot handle anything of
real interest, where Unicode shines, such as Latin-2 characters,
CJK characters, mathematical symbols, and so on.
So if you want to play with Hungarian, Czech, Dutch, etc.
sequences you need something more general than utf8-encode.
Hopefully, the %utf-8.r I just posted will provide you with the
basic tools for such tasks.
A few disclaimers:
----------------
1. The code is not optimized at all. I've seen an elegant approach
taken by Paolo to improve on the original version of Oldes' script.
Possibly something of this sort could be applied to %utf-8.r
as well.
2. The implementation is based on the verbal descriptions of the
algorithms found in some standards (the references are provided
on the cited page and inside the script). The algorithms use
several arithmetic operations: *, / and //. Although in principle
shifts could replace the first two, Rebol does not support shift
operations. The conversions are therefore not that efficient,
especially considering that TO-INTEGER is also part of the game.
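For reference, the shift-free arithmetic rests on the identities x << n = x * 2**n and x >> n = x / 2**n (integer division, for non-negative x), with // (modulo) extracting the low bits. A quick Python sketch of the equivalence, an illustration only, not part of the Rebol script:

```python
# Left shift by n bits is multiplication by 2**n;
# right shift is floor division by 2**n (for non-negative x).
def shl(x, n):
    return x * 2 ** n      # same as x << n

def shr(x, n):
    return x // 2 ** n     # same as x >> n

# Rebol's // (modulo) plays the role of a bit mask:
# x // 64 keeps the low 6 bits, just like x & 0x3F.
```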
3. I do not handle the big/little-endian signature of UCS yet.
So if you copy some Unicode file to Rebol and notice that the
first two octets are FF FE or FE FF, that's the signature.
Remove it, but make sure that the byte order is appropriate
for your platform. I assumed big endian in the script.
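For reference, that signature is the byte order mark (U+FEFF), and handling it amounts to checking the first two octets. A hypothetical helper, sketched in Python rather than Rebol (not part of %utf-8.r):

```python
def strip_bom(data: bytes):
    """Return (endianness, payload) for a UCS-2 stream with an optional BOM."""
    if data[:2] == b"\xFE\xFF":          # big-endian signature
        return "big", data[2:]
    if data[:2] == b"\xFF\xFE":          # little-endian signature
        return "little", data[2:]
    return "big", data                   # no signature: assume big endian
```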
4. The standards stress the importance of error handling for
malformed sequences. I have not done any of that yet.
5. I have done very little testing so far.
6. I no longer enjoy implementing converters.
In a way I consider such activity a waste of a programmer's time.
Many formats have been devised, many of them are already gone,
and what of all that programming effort?
So I programmed %utf-8.r in haste and with little joy.
But someone had to do it, and it's a start for a real optimization.
Improvements and suggestions for improvements are obviously
welcome.
Regards,
Jan
[2/12] from: rebol-list2:seznam:cz at: 26-Nov-2002 12:47
Hello Jan,
Tuesday, November 19, 2002, 6:26:03 PM, you wrote:
JS> I was partially inspired by the %utf8-encode.r script, posted by
JS> Oldes and Tenca. In essence, that script encodes ANSI (or Latin-1)
JS> strings (1 character = 1 octet) into longer strings, where
JS> characters from the upper part of the ANSI character table are
JS> represented by two octets instead of just one.
JS> Fine as it is for Latin-1 strings, it cannot handle anything of
JS> real interest, where Unicode shines, such as Latin-2 characters,
JS> CJK characters, mathematical symbols, and so on.
JS> So if you want to play with Hungarian, Czech, Dutch, etc.
JS> sequences you need something more general than utf8-encode.
JS> Hopefully, the %utf-8.r I just posted will provide you with the
JS> basic tools for such tasks.
Hmm... yes, utf8-encode was used for only one reason (in
Flash MX all extended characters must be encoded), so it does
not cover all tasks... I'm just looking at your script...
JS> Few disclaimers:
JS> ----------------
JS> 1. The code is not optimized at all. I've seen an elegant approach
JS> taken by Paolo to improve on the original version of Oldes' script.
JS> Possibly something of this sort could be applied to %utf-8.r
JS> as well.
I've taken just a quick look at it, and here are some
optimizations:
a) use to integer! instead of 'to-integer (it's a little bit
quicker if you use it many times):
t: now/time/precise loop 1000000 [to-integer "1"] now/time/precise - t
;== 0:00:01.282 (== 0:00:01.202 Rebol/Base)
t: now/time/precise loop 1000000 [to integer! "1"] now/time/precise - t
;== 0:00:00.872 (== 0:00:00.801 Rebol/Base)
b) the 'for loop is slow (because it's not a native), so find another way if you can:
t: now/time/precise loop 100000 [for k 1 6 1 []] now/time/precise - t
;== 0:00:03.425 (== 0:00:03.014 Rebol/Base)
t: now/time/precise loop 100000 [k: 1 loop 6 [k: k + 1]] now/time/precise - t
;== 0:00:00.361 (== 0:00:00.371 Rebol/Base)
c) if you can, use second us instead of us/2:
us: [0 192 224 240 248 252]
t: now/time/precise loop 10000000 [us/2] now/time/precise - t
;== 0:00:06.599 (== 0:00:06.459 R/B)
t: now/time/precise loop 10000000 [second us] now/time/precise - t
;== 0:00:04.076 (== 0:00:04.076 R/B)
;but the problem is that usually you need to use parentheses :(
t: now/time/precise loop 10000000 [(second us)] now/time/precise - t
;== 0:00:04.928 (== 0:00:04.887 R/B)
d) in your functions encode-integer and decode-integer you are
defining the function 'f and the blocks 'us, 'vs and 'cases.
That's good for understanding how it works, as it's more readable,
but for actual usage it's a big waste of time.
(What about making utf-8 an object with these blocks and
functions defined only once?)
JS> 2. The implementation is based on the verbal descriptions of the
JS> algorithms found in some standards (the references are provided
JS> on the cited page and inside the script). The algorithms use
JS> several arithmetic operations: *, / and //. Although in principle
JS> shifts could replace the first two, Rebol does not support shift
JS> operations. The conversions are therefore not that efficient,
JS> especially considering that TO-INTEGER is also part of the game.
Maybe a new native shift function would be good in new Rebols :)
I hope the reason we don't have shift is not that in Rebol
you cannot type << (although you can use >>).
JS> Improvements and suggestions for improvements are obviously
JS> welcome.
I'm testing. From RFC 2279, sec. 4:
-----
The UCS-2 sequence representing the Hangul characters for the Korean
word "hangugo" (D55C, AD6D, C5B4) may be encoded as follows:
ED 95 9C EA B5 AD EC 96 B4
-----
It looks like it's working:
x: #{ED959CEAB5ADEC96B4}
decode-2 x ;== #{D55CAD6DC5B4}
encode-2 decode-2 x ;== #{ED959CEAB5ADEC96B4}
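For cross-checking, the same RFC 2279 test vector can be verified against an independent UTF-8 implementation, for example Python's built-in codec (an illustration only, checking the test data rather than the Rebol code):

```python
# The three Hangul code points from RFC 2279, section 4 ("hangugo").
code_points = [0xD55C, 0xAD6D, 0xC5B4]

# UTF-8 encoding matches the expected octets ED 95 9C EA B5 AD EC 96 B4.
utf8 = "".join(map(chr, code_points)).encode("utf-8")
assert utf8 == bytes.fromhex("ED959CEAB5ADEC96B4")

# UCS-2 big-endian, matching decode-2's output #{D55CAD6DC5B4}.
ucs2 = b"".join(cp.to_bytes(2, "big") for cp in code_points)
assert ucs2 == bytes.fromhex("D55CAD6DC5B4")
```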
I've made the speed improvements explained above:
Your version:
do %/d/view/public/www.reboltech.com/library/scripts/utf-8.r
t: now/time/precise loop 10000 [encode-2 decode-2 x] now/time/precise - t
;== 0:00:11.677 (== 0:00:07.551 with Rebol/Base)
Improved:
do %/d/library/utf-8.r
t: now/time/precise loop 10000 [utf-8/encode-2 utf-8/decode-2 x] now/time/precise - t
;== 0:00:05.819 (== 0:00:04.346 with Rebol/Base)
Hmm... Rebol/Base must have some other improvements, not just
some functions removed from Rebol/Core :)
You forgot to include the 'intersperse function in 'to-ucs2,
so I've made my own to-ucs2 function. And I will try to make a
Latin-2 to UCS-2 converter (if I can find some doc on how to do it :)
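For what it's worth, a Latin-2 to UCS-2 conversion is essentially a codepage lookup. A sketch in Python, whose standard codecs already carry the ISO 8859-2 table (the sample word is just for illustration):

```python
# ISO 8859-2 (Latin-2) bytes for the Czech word "čeština":
latin2 = bytes([0xE8, 0x65, 0xB9, 0x74, 0x69, 0x6E, 0x61])

# Decode Latin-2 to code points, then write each as a 2-octet
# big-endian value -- the same shape of output to-ucs2 produces.
text = latin2.decode("iso8859-2")
ucs2 = b"".join(ord(ch).to_bytes(2, "big") for ch in text)
```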
=( Oliva David )=======================( [oliva--david--seznam--cz] )==
=( Earth/Europe/Czech_Republic/Brno )=============================
=( coords: [lat: 49.22 long: 16.67] )=============================
-- Attached file included as plaintext by Listar --
-- File: utf-8.r
REBOL [
Title: "UTF-8"
Date: 26-Nov-2002
Name: "UTF-8"
Version: 1.0.1
File: %utf-8.r
Author: "Jan Skibinski"
Co-author: "Oldes"
Purpose: {Encoding and decoding of UCS-4/UCS-2 binaries
to and from UTF-8 binaries.
}
History: [
1.0.1 26-Nov-2002 {
Oldes: Speed optimizations (not so readable now :(
+ fixed the to-ucs2 function}
1.0.0 20-Nov-2002 {
Jan: Basic UTF-8 encoding and decoding functions.
Limitations: Does not handle big/little-endian
signatures yet. Needs thorough testing and algorithm
optimizations.}
]
Email: [jan--skibinski--sympatico--ca]
Category: [crypt 4]
Acknowledgments: {
Inspired by the script 'utf8-encode.r of Romano Paulo Tenca
and Oldes, which encodes Latin-1 strings.
}
]
comment {
UCS means: Universal Character Set (or Unicode)
UCS-2 means: 2-byte representation of a character in UCS.
UCS-4 means: 4-byte representation of a character in UCS.
UTF-8 means: UCS Transformation Format using 8-bit octets.
The following excerpt from:
UTF-8 and Unicode FAQ for Unix/Linux, by Markus Kuhn
http://www.cl.cam.ac.uk/~mgk25/unicode.html
provides motivations for using UTF-8.
<<Using UCS-2 (or UCS-4) under Unix would lead to very severe
problems. Strings with these encodings can contain as parts
of many wide characters bytes like '\0' or '/' which have a
special meaning in filenames and other C library function
parameters. In addition, the majority of UNIX tools expects
ASCII files and can't read 16-bit words as characters without
major modifications. For these reasons, UCS-2 is not a suitable
external encoding of Unicode in filenames, text files,
environment variables, etc.
The UTF-8 encoding defined in ISO 10646-1:2000 Annex D
and also described in RFC 2279 as well as section 3.8
of the Unicode 3.0 standard does not have these problems.
It is clearly the way to go for using Unicode under Unix-style
operating systems.>>
A copy of the aforementioned Annex D can be found on Markus Kuhn's site:
http://www.cl.cam.ac.uk/~mgk25/ucs/ISO-10646-UTF-8.html.
The encoding and decoding functions implemented here are
based on the descriptions of the algorithms found in Annex D.
Testing: The page http://www.cl.cam.ac.uk/~mgk25/unicode.html
has many pointers to a variety of test data. One of them
is a UTF-8 sampler from the Kermit pages of Columbia University,
http://www.columbia.edu/kermit/utf8.html, where the
phrase "I can eat glass and it doesn't hurt me." is
produced in dozens of world languages.
A shareware Unicode editor, EmEditor, from www.emurasoft.com,
can be used for copying, editing and saving Unicode samples
from web browsers. Since it saves its output in UCS-2,
UCS-4 (big and little endian), UTF-8 and UTF-7 formats,
it is a very good tool for testing.
}
comment {
------------------------------------------------------------
SUMMARY of script UTF-8.R
------------------------------------------------------------
decode-2 (binary -> binary)
encode-2 (binary -> binary)
decode-4 (binary -> binary)
encode-4 (binary -> binary)
decode-integer (binary -> [integer integer])
encode-integer (integer -> binary)
to-ucs2 (string -> binary) ; auxiliary
}
utf-8: context [
allchars: complement charset []
to-ucs2: func [
{
Converts ANSI (Latin-1) string to UCS-2 octet string.
This is an auxiliary function, just for testing.
}
ascii [string!]
/local result [binary!]
][
result: make binary! 2 * length? ascii
parse/all ascii [
any [copy c allchars (insert result join c #{00})]
]
head reverse result
]
encode-2: func [
{
Encode a binary string of UCS-2 octets into a UTF-8
encoded binary octet stream.
}
us [binary!]
/local x result [binary!]
][
result: copy #{}
while [not tail? us][
x: (256 * first us) + second us
insert tail result encode-integer x
us: skip us 2
]
result
]
encode-4: func [
{
Encode a binary string of UCS-4 octets into a UTF-8
encoded binary octet stream.
}
us [binary!]
/local x result [binary!]
][
result: copy #{}
while [not tail? us][
x: (16777216 * first us) + (65536 * second us) + (256 * third us) + fourth us
insert tail result encode-integer x
us: skip us 4
]
result
]
decode-2: func [
{
Decode a UTF-8 encoded binary string
to a UCS-2 binary string
}
xs [binary!]
/local z vs us result [binary!]
][
result: copy #{}
while [not tail? xs][
us: decode-integer xs
vs: copy []
z: to integer! ((first us) / 256)
insert vs z
z: (first us) - (z * 256)
insert tail vs z
insert tail result to binary! vs
xs: skip xs second us
]
result
]
decode-4: func [
{
Decode a UTF-8 encoded binary string
to UCS-4 binary string
}
xs [binary!]
/local z1 z vs us result [binary!]
][
result: copy #{}
while [not tail? xs][
us: decode-integer xs
vs: copy []
z: us/1
foreach k [16777216 65536 256][
z1: to integer! (z / :k)
insert tail vs z1
z: z - (z1 * :k)
]
insert tail vs z
insert tail result to binary! vs
xs: skip xs second us
]
result
]
encode-integer: func [
{
Encode 4-byte (32-bit) UCS-4 integer to a sequence
of UTF-8 octets.
}
[throw]
x [integer!]
/local f k result [binary!]
][
k: 1 loop 6 [
if x <= encases/:k [
result: to binary! enf :k x
break
]
k: k + 1
]
result
]
decode-integer: func [
{
Decode a sequence of 1-6 octets into a 32-bit unsigned
integer. Return a block made of the decoded integer
and the count of bytes used from the input string.
}
xs [binary!]
/local f k result [block!]
][
k: 1 loop 6 [
if (first xs) <= pick decases k [
result: to block! def :k xs
insert tail result :k
break
]
k: k + 1
]
result
]
;-----functions and values extracted from the decode/encode integer
enf: func [
k x
/local result
][
result: to block! (us/:k + to integer! (x / vs/:k))
if k > 1 [
for z (k - 1) 1 -1 [
insert tail result (
(to integer! (x / vs/:z)) // 64 + 128)
]
]
result
]
def: func [
k xs
/local m result
][
result: ((first xs) - us/:k) * vs/:k
if k >= 2 [
for z 2 k 1 [
m: k - :z + 1
result: result + ((xs/:z - 128) * vs/:m)
]
]
result
]
us: [0 192 224 240 248 252]
vs: [1 64 4096 262144 16777216 1073741824]
encases: [
127 ; 0000 007F
2047 ; 0000 07FF
65535 ; 0000 FFFF
2097151 ; 0001 FFFF
67108863 ; 03FF FFFF
2147483647 ; 7FFF FFFF
]
decases: [127 223 239 247 251 253]
]
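The lookup tables 'us, 'encases and 'decases near the end of the script are not arbitrary; their values fall out of the UTF-8 format itself. A sketch in Python of where the numbers come from (an illustration only, not part of %utf-8.r):

```python
# us/:k is the lead-byte prefix for a k-octet sequence:
# 0xxxxxxx, 110xxxxx, 1110xxxx, 11110xxx, 111110xx, 1111110x.
us = [0, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC]
assert us == [0, 192, 224, 240, 248, 252]

# encases/:k is the largest code point representable in k octets:
# 7, 11, 16, 21, 26 and 31 payload bits respectively.
encases = [(1 << bits) - 1 for bits in (7, 11, 16, 21, 26, 31)]
assert encases == [127, 2047, 65535, 2097151, 67108863, 2147483647]

# decases/:k is the largest possible lead byte of a k-octet sequence:
# the prefix plus an all-ones payload in the lead byte.
payload_bits = [7, 5, 4, 3, 2, 1]
decases = [p + (1 << b) - 1 for p, b in zip(us, payload_bits)]
assert decases == [127, 223, 239, 247, 251, 253]
```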
[3/12] from: jan:skibinski:sympatico:ca at: 26-Nov-2002 14:51
Hello RebOldes,
Thank you for your response and suggestions. I had already done
some optimization, and discovered some of the things you just
mentioned: some by trial and error, some by common sense.
So the "for" loops have been replaced, subroutines removed, etc.
But I would not even have thought of other things you mention
here, such as using to integer! instead of to-integer.
So thank you so much for those.
---------------------------------------------------
Anyway, here is the status:
I think my latest unpublished version is quite acceptable in speed.
But I am too tired now to do the final cleanup and the comparison
with what you just posted today. This must wait till I get a few
hours of sleep. But I'll post the encode function here for your review.
I was using your little sentence of Latin-1 characters,
chars: "ìšèøžýáíé",
looping 1000 times or so. And while your
version was doing it in about 0.20 s, my original
non-optimized version was dragging its feet at about 13 s.
That was clearly unacceptable. To cut a long story
short, my latest version does it in about 0.60 s in
the general case and in about 0.35 s using specialization
to Latin-1, while still keeping the same framework as for the
other, more general cases. Further specialization would
obviously degenerate to utf8-encode.
I am attaching a skeleton of one function only, so you will
see what I am bragging about here. There are a few subtle
points to be made:
+ If you can do all your arithmetic on chars, then you can be as
fast as utf8-encode. First of all, the time-consuming
to-integer is not required, because in this case the
/ operation behaves as integer division. That means that
a / b gives you a character, which can be stored directly
in an output string; no conversion is required.
However, you also have to remember that the multiplication
works modulo 256 too. This is why I am fiddling with the
(unfinished for the case k=4) function 'f at the beginning
of 'encode. Adding 0 and assuming the correct order of
multiplication is important if one expects values greater than 256.
+ To assure that after the division of two integers I
still get an integer, I use the little trick of 'and-ing with
negative numbers, as in "x and -64 / 64". This seems to be
faster than to-integer. I have not tried your suggested
to integer! yet.
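The 'and-ing trick can be checked independently: masking with -64 clears the low six bits (two's complement -64 is ...11000000), so the division by 64 is exact and needs no separate truncation. A quick Python check, with & standing in for Rebol's and (non-negative x only):

```python
# (x and -64) / 64 is the same as x >> 6 for non-negative x:
# the mask zeroes the low six bits, making the division exact.
for x in range(1 << 16):
    assert (x & -64) // 64 == x >> 6

# Likewise x and 63 or 128 builds a 10xxxxxx continuation octet:
assert (0xD55C & 63) | 128 == 0x9C
```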
The cascade below goes down from the fastest/cheapest case to
the more elaborate ones: nothing much to do for ASCII
characters, a bit more for ANSI (Latin-1), and quite a bit
of work with the to-char conversion for the most generic case.
Notice that, compared to the original version,
I simplified the entire scheme as well.
Best wishes,
Jan
encode: func [
{
Encode string of k-wide characters into UTF-8 string,
where k: 1, 2 or 4.
Case k = 1 could have been isolated for much
improved speed.
(integer -> string -> string)
}
k [integer!]
ucs [string!]
/local f x m result [string!]
][
result: make ucs 0
f: switch :k [
1 [func[u][u/1]]
2 [func[
u
][
either u/1 > 0 [0 + u/2 + (256 * u/1)][u/2]
]]
4 [func[u][u/4 + (256 * u/3)
+ (65536 * u/2) + (16777216 * u/1)]]
]
while [not tail? ucs][
x: f ucs
result: tail result
either x < 128 [
insert result x
][
either x < 256 [
insert result x and 63 or 128
insert result x / 64 or 192
][
m: 1
while [x > 127 ][
insert result to-char (x and 63 or 128)
x: x and -64 / 64
m: m + 1
]
insert result to-char (x or udata/3/:m)
]
]
ucs: skip ucs k
]
head result
]
Oh, you will need this too:
udata/3
== [0 192 224 240 248 252]
I hope I did not miss anything here.
[4/12] from: rotenca:telvia:it at: 27-Nov-2002 16:13
Hi Jan,
perhaps this can help:
udata: [1 2 [0 192 224 240 248 252]]
table: reduce [
func[u][first u]
func[u][either 0 < first u [0 + (second u) + (256 * first u)][second u]]
3
func[u][to integer! to binary! u]
]
encode: func [
{
Encode string of k-wide characters into UTF-8 string,
where k: 1, 2 or 4.
Case k = 1 could have been isolated for much
improved speed.
(integer -> string -> string)
}
k [integer!]
ucs [string!]
/local c f x m result [string!]
][
result: make string! length? ucs
f: pick table k
parse/all ucs [
any [
c: k skip (
either 128 > x: f c [insert tail result x][
either x < 256 [
insert insert tail result x / 64 or 192 x and 63 or 128
][
result: tail result
m: 1
while [x > 127][
insert result to char! x and 63 or 128
x: x and -64 / 64
m: m + 1
]
insert result to char! x or udata/3/:m
]
]
)
]
]
head result
]
---
Ciao
Romano
[5/12] from: jan:skibinski:sympatico:ca at: 27-Nov-2002 11:54
Hi Romano,
Romano Paolo Tenca wrote:
> Hi Jan,
>
> perhaps this can help:
>
Thanks, that will certainly help! Although I have not
tested it yet, I can clearly see the advantages of your
changes. I already know that 'parse itself shaves quite
a bit of time compared to 'while. And there are other
goodies: the precompiled fetch, to char! instead of to-char,
and the inlining. I was using to-binary (your to binary!)
before, but I dropped it since it seemed slower to me
than the explicit construction of integers. But I probably
missed the case of 4-byte integers, where to binary!
might in fact be faster than the by-hand construction.
I'll report back after the final revision of the
decoder and the verification of whether or not
the latest simplified algorithms work properly for
all cases.
The final step will handle the little-endian vs.
big-endian issues.
All the best,
Jan
[6/12] from: rotenca:telvia:it at: 27-Nov-2002 18:22
Hi Jan,
try also this:
result: tail result
until [
insert result to char! x and 63 or 128
128 > x: x and -64 / 64
]
insert result to char! x or pick pick udata 3 1 + length? result
instead of:
> result: tail result
> m: 1
<<quoted lines omitted: 4>>
> ]
> insert result to char! x or udata/3/:m
and delete the now-unused 'm local word from the function.
---
Ciao
Romano
[7/12] from: jan:skibinski:sympatico:ca at: 27-Nov-2002 14:10
Hi Romano,
While the first set of changes reduces the timings to about 65%,
the second has a much smaller impact: 61% at best, which is about
0.34 s for 1000 loops on the "chars: ìšèøžýáíé" Latin-1 sequence
(case k=1), where utf8-encode gets its best timing of 0.18 s.
But heck, every single percent counts! :-)
Timings vary, so the above data is just for your orientation.
But I am sure you already know the results. :-)
I found it quite easy to get the first improvements in my original
version down to 5 s, but then I got stuck at 1.35 s. I was so
"desperate" that I even tried simulated bit registers, injecting
"10" bits in front of every six bits, travelling from the tail.
Amazingly, I was reaching similar timings of 1.5 s there, so do
not discard such approaches offhand if you ever need shifts and
other such manipulations.
But breaking the 1 s barrier happened only after I completely
revised the algorithm and started working from the least
significant bits up. This way I could get rid of most of the
tables and use the hardcoded magic integer "64" instead.
I was so caught up in the official algorithm description that I
missed the obvious, which is what 'utf8-encode is in fact based on.
The register simulation clearly helped me here.
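That least-significant-bits-up scheme can be sketched as follows in Python (a reconstruction of the idea for illustration; the names are hypothetical, and this is not Jan's Rebol code):

```python
LEAD = [0, 192, 224, 240, 248, 252]    # same table as udata/3

def encode_cp(x):
    """Encode one code point, peeling six bits at a time from the tail."""
    if x < 128:
        return bytes([x])              # ASCII passes through unchanged
    out = []
    while x > 127:                     # emit continuation octets first
        out.append((x & 63) | 128)     # 10xxxxxx from the low six bits
        x >>= 6                        # Rebol: x and -64 / 64
    out.append(x | LEAD[len(out)])     # finally the lead octet
    return bytes(reversed(out))        # octets were built tail-first
```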
Best regards,
Jan
Romano Paolo Tenca wrote:
[8/12] from: rotenca:telvia:it at: 27-Nov-2002 21:40
Hi Jan,
> While the first set of changes reduces the timings to about 65%
What exactly is the first?
> the second has much lesser impact - 61% at best, which is about
And the second?
> 0.34s for 1000 loops on "chars: ìšèøžýáíé" Latin-1 sequence (case k=1)
> where utf8-encode gets its best timing of 0.18.
But if you try utf8-encode with a long string (>6000 chars), it becomes
slower than encode. What is the most frequent case?
> But breaking of the 1s barrier happened only after I completely revised
> the algorithm and started working from the least significant bits up.
> This way I could get rid of most of the tables and use hardcoded
> magic "64" integer instead.
This is a smart approach. Thanks for the info on your work.
This is a smart approach. Thanks for info on your work.
---
Ciao
Romano
[9/12] from: jan:skibinski:sympatico:ca at: 28-Nov-2002 11:10
When doing some tests on Latin-2 scripts I came
across the term BIELARUSIAN "Lacinka", got somewhat
surprised (I was assuming the predominance of
the Cyrillic script over there), searched the web
and found this article entitled <<BIELARUSIAN "Lacinka">>:
[ http://www.cus.cam.ac.uk/~np214/lacin.htm ].
Written by a Belarusian, Mikalaj Packajeu
[the name is spelled with some hats and slashes].
I found it quite interesting, but that's me.
This is about the historical development of two systems
of writing in Belarusian: a Latin-based script,
Lacinka, and a Cyrillic script, Kirilica.
And this is also about political and cultural paranoia:
....
First of all, it is necessary to establish who is more
interested in the reform of alphabets on the basis of the
Latin script: the proletariat or the bourgeoisie?..
- insisted Uladzimier Dubouka, a prominent Soviet Belarusian
writer, in his "Kirylica ci Lacinika?" brochure of 1926.
....
The post-1933 Soviet Belarus publications categorically
declared that sympathies to Lacinka constituted
the highest degree of counter-revolutionary activity.
...
In the post-war USSR, the use of Belarusian Lacinka
would be regarded as something deeply subversive,
nationalist and anti-Soviet.
....
Since Belarus became independent in 1991, however,
some efforts were also made to revive Lacinka, this
original script of the modern Belarusian literary language.
....
Still, the revival of Lacinka lost some momentum after 1995,
when the regime of Lukashenka re-introduced the Russian
language as official and began effectively to expel the
Belarusian language, in any form, from every area of official
and public use in Belarus.
...
The same Lukashenka, who complains that EU countries
decided to start treating him as a persona non grata
and who threatens in response to unleash a flood of
illegal immigrants and drugs.
Jan
[10/12] from: jan:skibinski:sympatico:ca at: 29-Nov-2002 10:48
Hi Romano, RebOldes and All,
Version 1.0.1 of utf-8 has been posted to the library.
Of the three functions there, 'encode, 'decode and 'to-ucs,
only 'decode does not use 'parse. It differs from the other
two in that it works on a variable number of input bytes (1-6)
for every wide character to be decoded.
I would not know how to convert its 'while loop to a 'parse loop,
because the 'skip value would not be constant. But if you find
it doable and beneficial for speed, please take a shot at it.
Otherwise 'decode behaves quite well and is only about 20% slower
than 'encode.
The new version is completely redesigned, much simplified
and well documented. It also contains a sample of a simple
phrase in a bunch of languages from Latin-1, -2, -4 and -5.
If anyone is interested in hosting it I can provide a somewhat
bigger (7K) UTF-8 sample (lokomotywa.html), which is a
Polish-English side-by-side onomatopoeic poem for children.
Good for testing and fun for kids too. I did the UTF-8-ization;
someone else did the translation, which I found well done and
rhythmically superb.
Jan
[11/12] from: rotenca:telvia:it at: 29-Nov-2002 21:17
Hi Jan
>I would not know how to convert its 'while loop to the 'parse loop
>due to the fact that a 'skip value would not be constant. But if you find
>it doable and beneficial for speed improvement please take a shot at that.
Can be done, but with too little speed advantage.
---
Ciao
Romano
[12/12] from: rebol-list2:seznam:cz at: 30-Nov-2002 22:25
Hello Jan,
Tuesday, November 26, 2002, 8:51:52 PM, you wrote:
JS> Hello RebOldes,
JS> Thank you for your response and suggestions. I had already done
JS> some optimization, and discovered some of the things you just
JS> mentioned: some by trial and error, some by common sense.
JS> So the "for" loops have been replaced, subroutines removed, etc.
JS> But I would not even have thought of other things you mention
JS> here, such as using to integer! instead of to-integer.
JS> So thank you so much for those.
Not at all... I was not connected for a few days (as usual) so
I will have to read all the new mails in this thread now :)
=( Oliva David )=======================( [oliva--david--seznam--cz] )==
=( Earth/Europe/Czech_Republic/Brno )=============================
=( coords: [lat: 49.22 long: 16.67] )=============================
Notes
- Quoted lines have been omitted from some messages.
View the message alone to see the lines that have been omitted