r3wp [groups: 83 posts: 189283]

World: r3wp

[Script Library] REBOL.org: Script library and Mailing list archive

Sunanda
11-Mar-2009
[743]
Thanks for the script -- and for the problem report.

Looks to me like the script, as uploaded, contained non-UTF8 characters, 
and they are being treated as multi-byte characters.

REBOL and REBOL.org can really only handle ASCII... Unicode and 
suchlike is for R3.


If you email (preferably in a ZIP to prevent email software from 
chewing up the code) the original script, I'll take a look.
Oldes
11-Mar-2009
[744]
so you support utf8? good to know.. next time I will upload it as 
utf8
PeterWood
11-Mar-2009
[745x4]
The library doesn't support utf-8 yet. We have found that many people's 
browsers are set so that the browser renders the output from rebol.org 
as utf-8.
In this way, the library accidentally supports utf-8 in the sense 
that if you upload utf-8 and display it in a browser set to display 
utf-8 everything will be displayed properly.
The core of the library system is old enough that it was written 
without considering character encoding at all.
Supporting utf-8 will require a lot of changes ..... though probably 
not quite as many as moving to R3.
Chris
11-Mar-2009
[749x3]
If most of it is currently ascii, would it not just be a case of 
adding a few filters?
'it', being current content.
Preventing any new content from posting invalid sequences, for example...
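
A minimal Python sketch of the kind of filter discussed here, assuming the uploaded script is available as raw bytes. The function names are hypothetical and nothing below reflects REBOL.org's actual code; it only shows how a submission could be classified before it is accepted.

```python
def is_plain_ascii(data: bytes) -> bool:
    """True when every byte has the top bit clear (7-bit ASCII)."""
    return all(b < 0x80 for b in data)


def is_valid_utf8(data: bytes) -> bool:
    """True when the bytes form a well-formed UTF-8 sequence."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def classify_upload(data: bytes) -> str:
    """Hypothetical classification used to accept or turn back a contribution."""
    if is_plain_ascii(data):
        return "ascii"
    if is_valid_utf8(data):
        return "utf-8"
    return "unknown-8-bit"
```

Plain ASCII is also valid UTF-8, so the order of the checks matters only for reporting. An "unknown-8-bit" result could be Latin-1 or any other single-byte code page; the bytes alone do not say which.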
Sunanda
12-Mar-2009
[752]
Peter beat me to it, thanks.

Sorry Oldes, the Library does not support utf-8, despite my confused 
suggestion that it did.

Because we use a charset of UTF-8 in the browser header, it is _possible_ 
that we can more-or-less handle scripts with 2+ byte UTF-8 codings 
in REBOL strings! But that's not been tested.


Good point, Chris -- we already have such a filter, but it is not 
used to turn back contributions.
swall
12-Mar-2009
[753]
Sunanda, I have emailed the zipped script to you.
PeterWood
12-Mar-2009
[754]
Chris, one issue that we face with the mailing list archive is not 
knowing how the input is encoded. I think this is also true of scripts.
Sunanda
12-Mar-2009
[755x2]
Thanks, Scott.
-- The emailed script looks fine
-- it's identical to what is in the Library
-- Viewing the script works fine
-- Downloading it doesn't

.....Which is exactly what you reported. We are now both on the same 
page :-)
We've had a similar problem before (I've just checked the source 
code, and it's prompted my memory).


To solve it, we analyse the script for various extended ascii chars 
and then perform some messing around on HTTP content-type headers.


It's messy, and it's worked up till now......But obviously, we need 
some more analysing and messing around for this script.
Sunanda
13-Mar-2009
[757]
Results of a tiny bit of debugging on the ascii chars problem:
-- problem seems to be at the input stage:

     -- if you have extended ascii characters (top bit set, like the 1/4 
     used in the script) what we get from the webserver is bad (extra, 
     unexpected extended ascii chars)

    -- only download is (visibly) affected, although the extra extended 
    ascii chars are present in the text streams

     -- though there is some REBOL mezz code (decode-cgi) that may be 
     doing something I do not understand

    -- I can replicate the problem with both Apache and Xitami which 
    suggests the problem may be in REBOL rather than a given server.


-- the quick fix would be to add accept-charset="ISO-8859-1" to the 
<form ....> or <textarea ....>

    -- but that stops all extended ascii, including the ones we want. 
    So we won't do that.

-- the slower fix has yet to emerge from the available options.
Gabriele
14-Mar-2009
[758]
why not standardize everything on UTF-8?
Sunanda
14-Mar-2009
[759]
As far as I know, Core 2.5.6 (what the Library CGIs run on) does 
not support UTF-8.
Gabriele
14-Mar-2009
[760]
does not support UTF-8

 - what do you mean by "support"? if you mean having native encoders/decoders, 
 no, it does not. but, utf-8 is just 8 bit characters, and it is backwards 
 compatible with ascii. if you can handle ascii, and leave alone any 
 char > 127, you already support utf-8.
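
A small Python sketch of the point made here, assuming byte-level processing that only ever touches ASCII bytes. The helper `upper_ascii_only` is hypothetical; it stands in for any 8-bit-clean code that leaves bytes above 127 alone.

```python
text = "quarter: \u00bc"          # ends with the one-quarter sign
utf8 = text.encode("utf-8")       # b'quarter: \xc2\xbc'

# The ASCII part is byte-for-byte identical in ASCII and UTF-8.
ascii_prefix_matches = utf8[:9] == "quarter: ".encode("ascii")


def upper_ascii_only(data: bytes) -> bytes:
    """Uppercase a-z and pass every other byte through unchanged."""
    return bytes(b - 32 if 0x61 <= b <= 0x7A else b for b in data)


# The multi-byte sequence survives because no byte > 127 was modified.
round_tripped = upper_ascii_only(utf8).decode("utf-8")
```

This holds only while nothing splits, truncates, or re-encodes the high bytes. Operations that count characters or cut strings at arbitrary byte offsets can still break a multi-byte sequence.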
Sunanda
14-Mar-2009
[761]
Sadly, REBOL running as CGI under the two servers I've tested (Apache 
and Xitami) does not seem to support the whole range of ASCII -- 
ASCII chars with the 8th bit set seem to cause problems.

I don't know where the problem really is, but right now, we do not 
even support 8-bit ASCII, let alone anything more modern :-)
PeterWood
14-Mar-2009
[762]
At the moment, I'd be worried about standardising the Library on utf-8 
as the effect that multibyte characters would have during script and 
mail processing is not understood. It could well be that the system 
handles multibyte characters without a hitch but nobody knows yet.


I have started to write some scripts to try to help move to a consistent 
character encoding of the Library data but, due to time constraints, 
I have been very slow.
Anton
14-Mar-2009
[763x3]
Why worry? Just do it. :-P
What version of rebol is being used by rebol.org ?
Sunanda, can you publish some files with the 8-bit ascii and note 
what the problems are ?
Maxim
14-Mar-2009
[766x3]
sunanda, you can force the character encoding in the html page header... 
I've used that before and it worked for me.
note, I don't mean the http header, but the actual <HEAD> tag.
I had the same kind of issues on another system.  nowadays, the default 
encoding has become UTF-8 for many/most html handlers, so if it's 
not specified, many new browsers and tools will incorrectly break 
up the character data.
Sunanda
14-Mar-2009
[769x2]
Anton, REBOL.org uses 2.5.6.4.1

The obvious bad file is the one Scott added recently:
http://www.rebol.org/view-script.r?script=ascii-math.r

If you view it with that URL, all looks good.

If you click the [Download script] link you'll see many spurious 
high-ascii chars in the source.

Those high ascii _are_ actually in the source. But where they came 
from is a mystery.
Maxim, REBOL.org emits a header

   <meta http-equiv="Content-Type" content="text/html;charset=utf-8">

Yeah, I know we aren't utf-8 -- but experiment has shown that's the 
most acceptable charset.

Not sure what you are saying we could put in <head> -- can you be 
more specific?
Maxim
14-Mar-2009
[771]
there is a specific charset for Western ISO, which ensures the extra 
128 byte values are correct.


<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
Sunanda
14-Mar-2009
[772]
Thanks......We used to have that, but it created some other problems. 
I'll have to try to remember what and why :-)

And it does not solve the download problem (I know, I tried yesterday).
PeterWood
14-Mar-2009
[773]
I think the root of the problem is that when the Library system was 
first written, no account was taken of character encoding. As a result, 
not only is the data encoded as it was when originally submitted 
but the method of encoding is not even known.


Whatever charset is specified in the http header is not going to 
be correct for all scripts and messages. Using charset=utf8 seems 
to cause the least problems. Though for example, it will cause many 
ISO-8859-1 "high bit" characters to be incorrectly displayed.
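
A short Python sketch of that last case, assuming a stored message that was originally submitted as ISO-8859-1. The sample string is made up for illustration.

```python
# Bytes as they would have been stored from a Latin-1 submission.
latin1_bytes = "caf\u00e9 \u00a9".encode("latin-1")   # b'caf\xe9 \xa9'

# A UTF-8 consumer finds invalid sequences and substitutes U+FFFD,
# roughly what a browser does when the page is labelled charset=utf-8.
shown = latin1_bytes.decode("utf-8", errors="replace")
```

Both high-bit bytes are replaced here because neither forms a valid UTF-8 sequence on its own. The accented letter and the copyright sign are lost in the display, although the stored bytes are unchanged.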
Chris
14-Mar-2009
[774x3]
Do you have any stats on how many 'high bit' characters are now contained 
in Library content?
Or scope? - minimal; limited; too many to be trivial...
Re. ISO-8859-1 - the most obvious problem is the limitation - 256 
chars vs. UCS-1+
Sunanda
14-Mar-2009
[777]
No actual stats. Just from feel:
* Scripts -- very few
* Posts on the ML -- a few dozen
* AltME archive -- no idea
Gabriele
15-Mar-2009
[778]
Sunanda, I can tell you where those chars come from. if your page 
is set as utf-8, then the script has been uploaded by the browser 
as utf-8. when you view it in the browser, it shows correctly as utf-8. 
when you download it, it is still utf-8, but if you view it with 
something that believes it's latin1 (eg. the rebol 2 console on windows 
set as latin1), it won't show up correctly.
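
The explanation above can be reproduced in a few lines of Python. This is a sketch under the stated assumption that the browser uploaded the script as UTF-8; the sample text is illustrative, not taken from ascii-math.r.

```python
original = "\u00bc \u00bd"              # one-quarter and one-half signs
uploaded = original.encode("utf-8")     # b'\xc2\xbc \xc2\xbd'

# A UTF-8 aware viewer recovers the original text.
viewed_as_utf8 = uploaded.decode("utf-8")

# A tool that assumes Latin-1 shows each byte as its own character,
# so every symbol gains a spurious leading high-ascii char (0xC2).
viewed_as_latin1 = uploaded.decode("latin-1")
```

The extra characters are not added by the download. They are the first byte of each two-byte UTF-8 sequence, made visible by a reader that treats every byte as one character.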
Anton
15-Mar-2009
[779x6]
Sunanda, you're right about that ascii-math.r file. When I clicked 
the [Download script] link, the browser (konqueror) downloaded and 
directly opened it with the editor (SciTE). SciTE thought it was 
8-bit ascii, and showed the characters incorrectly. All I had to 
do was change the file encoding from 8-bit to utf-8 and the characters 
appeared correctly. I guess the editor had no way of determining 
the encoding, and incorrectly guessed 8-bit ascii.
The view-script.r html source for the page correctly advertises the 
encoding as utf-8, so the browser shows it correctly.
So I'm pretty happy with the way that script was handled by the software 
here.
Except for R2 console, of course.
R3 console seems to handle it better.
Any other scripts you can find showing problems ?
Sunanda
16-Mar-2009
[785x2]
Thanks Gabriele -- that's a clear explanation, and has helped me 
work out what is going on.


Anton and Gabriele -- I have tried changing the charset we emit on 
the download to say UTF-8. But that makes little difference. As both 
of you note, once the file has been saved then (without a Mac-type 
resource fork) there is no obvious indication of the encoding. And 
several editors I have tried get it wrong -- thus "revealing" the 
extra ASCII chars.


Not sure what the solution is other than to de-UTF-8 files on download.
Anton -- not yet run a crawl to check for other scripts with high 
ascii chars.
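
A hedged Python sketch of what "de-UTF-8 on download" might look like, assuming the stored script is valid UTF-8 and the target is Latin-1. The function name `de_utf8` is hypothetical, and this is not how the Library behaves.

```python
def de_utf8(data: bytes) -> bytes:
    """Re-encode UTF-8 bytes as Latin-1 when that is lossless.

    Returns the input unchanged if it is not valid UTF-8, or if it
    contains characters that Latin-1 cannot represent.
    """
    try:
        return data.decode("utf-8").encode("latin-1")
    except (UnicodeDecodeError, UnicodeEncodeError):
        return data
```

The fallback keeps the conversion from corrupting anything, but it also means two downloads can arrive in different encodings, which is the kind of confusion to weigh before adopting this approach.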
Anton
16-Mar-2009
[787x3]
Which editors?

I think most editors these days allow manually changing the encoding, 
so developers who notice strange characters can just change it themselves.

Maybe it would be helpful to add a rebol.org library script header 
advertising the encoding (when it is known, and when not).

I don't recommend 'de-UTF-8'ing files on download - that's just going 
to confuse things more, especially when the file is view-script.r'd 
as utf-8 just beforehand.
It seems the responsibility lies with the clients to interpret encodings 
properly. As we move to a unicode world, software that assumes any 8-bit 
text is some ASCII encoding should drop off. But until the 
transition is complete, there's not much we can do about client software 
guessing wrong like that, except stating the encoding in the script 
header, in the web page that provides the download link, and by helping 
confused newbies.
Are rebol.org uploaders asked to declare the encoding used?
swall
16-Mar-2009
[790]
If the offending downloaded script is executed in Rebol/Core, the 
extra ASCII chars are also present in the executed code.  The script 
defines ½ to be 0.5. If "help ½" is typed into the console, the result 
is "Found these words:   ½              decimal!  0.5". However, 
if the script is executed in Rebol/View, the result is "½ is a decimal 
of value: 0.5". It seems that View handles it correctly, while Core 
doesn't.
Sunanda
16-Mar-2009
[791]
Thanks guys.
Other scripts with the same problem.....there are a couple. 

About 10% of all scripts have at least one extended ASCII char....But 
most of them are acceptable in LATIN-1 code page / charset (eg copyright 
symbol, some accented letters). It's just a very few scripts that 
use 1/4 and similar symbols that cause the problem.


What other editors? Windows NOTEPAD is one example of a common one 
that gets this wrong.
swall
16-Mar-2009
[792]
Vim and Editor² display the chars incorrectly.
Notepad++ shows the chars correctly.