Mailing List Archive: Re: UTF-8 revisited

[REBOL] Re: UTF-8 revisited

From: jan:skibinski:sympatico:ca at: 27-Nov-2002 14:10


    Hi Romano,

    While the first set of changes reduces the timings to about 65%,
    the second has a much smaller impact: 61% at best, which is about
    0.34s for 1000 loops on the "chars: =EC=9A=E8=F8=9E=FD=E1=ED=E9"
    Latin-1 sequence (case k=1), where utf8-encode gets its best
    timing of 0.18s.

    But heck, every single percent counts! :-)
    Timings vary, so the above data is just for your orientation. But I am
    sure you already know the results. :-)

    I found it quite easy to get the first improvements in my original
    version down to 5s, but then I got stuck at 1.35s. I was so "desperate"
    that I even tried simulated bit registers, injecting "10" bits in
    front of every six bits, travelling from the tail.
    Amazingly, I reached similar timings of 1.5s that way, so do not
    dismiss such approaches out of hand if you ever need shifts and
    other such manipulations.
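
    For illustration, here is a minimal sketch of the bit-register idea,
    written in Python rather than REBOL. It is not the original code: the
    function name and the use of a string of "0"/"1" characters as the
    simulated register are assumptions made for the example. It peels six
    bits at a time off the tail and puts "10" in front of each group.

```python
def utf8_encode_bitstring(cp: int) -> bytes:
    """Encode one code point using a string of '0'/'1' as a bit register.

    Hypothetical helper for illustration; assumes 0 <= cp <= 0x10FFFF.
    """
    if cp < 0x80:
        return bytes([cp])            # ASCII: single byte, no markers

    # Pick the lead-byte marker and the number of continuation bytes.
    if cp < 0x800:
        lead, n_cont = "110", 1
    elif cp < 0x10000:
        lead, n_cont = "1110", 2
    else:
        lead, n_cont = "11110", 3

    # Pad the register to the payload width: 11, 16 or 21 bits.
    width = (8 - len(lead)) + 6 * n_cont
    register = format(cp, "b").zfill(width)

    groups = []
    for _ in range(n_cont):
        # Travel from the tail: take six bits, inject "10" in front.
        groups.insert(0, "10" + register[-6:])
        register = register[:-6]
    groups.insert(0, lead + register)  # what is left fits the lead byte

    return bytes(int(g, 2) for g in groups)
```

    For example, utf8_encode_bitstring(0xE9) gives the two bytes C3 A9.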

    But I broke the 1s barrier only after I completely revised the
    algorithm and started working from the least significant bits up.
    This way I could get rid of most of the tables and use a hardcoded
    magic integer, 64, instead.
    I was so caught up in the official algorithm description that I missed
    the obvious - which is what 'utf8-encode is in fact based on.
    The register simulation clearly helped me here.
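
    A minimal sketch of working from the least significant bits up, again
    in Python and under stated assumptions: the function name and the
    way the lead byte is derived are illustrative choices, not the code
    of 'utf8-encode or of my own version. Dividing by 64 peels off six
    bits at a time, so no shift tables are needed.

```python
def utf8_encode_lsb_first(cp: int) -> bytes:
    """Encode one code point, peeling six bits at a time with divmod 64.

    Hypothetical helper for illustration; assumes 0 <= cp <= 0x10FFFF.
    """
    if cp < 128:
        return bytes([cp])            # ASCII: single byte

    out = []                          # bytes collected last-to-first
    lead_mark = 192                   # 110xxxxx marker for a 2-byte form
    room = 32                         # lead byte holds values below this

    while True:
        cp, low = divmod(cp, 64)      # low = least significant six bits
        out.append(128 + low)         # 10xxxxxx continuation byte
        if cp < room:
            break                     # the remainder fits the lead byte
        lead_mark += room             # 192 -> 224 -> 240
        room //= 2                    # 32 -> 16 -> 8

    out.append(lead_mark + cp)
    return bytes(reversed(out))
```

    The lead byte loses one payload bit for every continuation byte added,
    which is why the sketch halves "room" on each pass.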

    Best regards,
    Jan

Romano Paolo Tenca wrote: