Rebol & XML encoding; use encoding="windows-1252"

[1/6] from: al::bri::xtra::co::nz at: 5-Jul-2002 12:31

After a long and exhausting day or two, I discovered that I've been using the wrong XML character encoding. For Rebol running on Windows PCs and creating XML or XHTML files, or driving a CGI program, from Rebol scripts or plain text files (like Windows .txt files), it's best to use this tag:

<?xml version="1.0" encoding="windows-1252"?>

The problem one gets from not using the above tag is that MS Internet Explorer (but not Opera or Netscape!) sometimes generates CGI query strings that look like Chinese characters or long strings of gibberish.

I tried the Unicode encodings "UTF-8" and "UTF-16", but ran into the problem that Rebol doesn't understand scripts written in Unicode. Rebol seems to read only 8-bit characters, not the 16 bits (I think?) of Unicode.

This site helped me the most:
http://www.w3schools.com/xml/xml_encoding.asp

Andrew Martin
ICQ: 26227169
http://valley.150m.com/
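The point of the declaration is that the encoding it names has to match the bytes actually written to the file. A minimal Python sketch of that idea follows; the function name `xml_document` and the sample element are hypothetical and not from the thread.

```python
def xml_document(body: str, encoding: str = "windows-1252") -> bytes:
    """Prefix an XML declaration and encode the whole document in the
    same encoding that the declaration names."""
    declaration = f'<?xml version="1.0" encoding="{encoding}"?>\n'
    return (declaration + body).encode(encoding)


# The same text produces different bytes under different encodings,
# so the declaration has to name the one that was really used.
doc_1252 = xml_document("<name>Café</name>")
doc_utf8 = xml_document("<name>Café</name>", "utf-8")

print(doc_1252)
print(doc_utf8)
```

In windows-1252 the "é" is the single byte E9, while in UTF-8 it is the two bytes C3 A9. A reader that trusts the wrong label will show gibberish for any character outside plain ASCII.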

[2/6] from: al:bri:xtra at: 5-Jul-2002 21:40

Actually, I'm fairly sure now that I'm partially wrong! I believe it's a bug in the MS operating system. I've been reading Ed Batutis' web site here:
http://www.batutis.com/i18n/papers/mlang/samples/
and I've been trying out his MLangDet on my Windows XP system (with all the latest upgrades from Microsoft) on a text file, and came across an interesting problem with the MLangDet software.

With a simple .txt file that contains just the following:

Telephone: +64-6-9748241

with one empty line before and after, the MLangDet program reports this .txt file as Unicode (UTF-7). If I simply replace both of the "-" with a space, like this:

Telephone: +64 6 9748241

then MLangDet reports the .txt file as US-ASCII.

I've also noticed that in MS Internet Explorer, when the first line of text is placed in XML/XHTML, the browser also declares that the page is now UTF-7 (instead of UTF-8) and shows the telephone number as:

6-9748241

instead of:

+64-6-9748241

I think this behaviour is because both MS Internet Explorer and MLangDet use the same operating system function to detect the various encoding schemes. When I turn off MS Internet Explorer's automatic detection, the correct telephone number is shown.

This is a very curious problem!

Andrew Martin
ICQ: 26227169
http://valley.150m.com/
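One possible reading of the telephone-number result: in UTF-7, a "+" opens a run of base64 characters and a "-" closes it, so "+64-" has the shape of a UTF-7 shift sequence while "+64 " lacks the closing "-". The Python sketch below only tests for that shape. It is an assumption about what a detector might key on, not a description of how MLangDet or Internet Explorer actually work, and the helper names are hypothetical.

```python
import re

# Explicitly terminated UTF-7 shift sequence: "+", base64 characters, then "-".
UTF7_SHIFT = re.compile(r"\+([A-Za-z0-9+/]+)-")


def utf7_shift_sequences(text):
    """Return the base64 payloads of substrings shaped like UTF-7 shift sequences."""
    return UTF7_SHIFT.findall(text)


def complete_code_units(payload):
    """Count the whole 16-bit UTF-16 code units a base64 payload can carry
    (each base64 character contributes 6 bits)."""
    return (len(payload) * 6) // 16


hyphenated = "Telephone: +64-6-9748241"
spaced = "Telephone: +64 6 9748241"

print(utf7_shift_sequences(hyphenated))  # ['64']
print(utf7_shift_sequences(spaced))      # []
print(complete_code_units("64"))         # 0
```

The payload "64" carries only 12 bits, which is less than one 16-bit code unit. A lenient UTF-7 decoder that silently drops such a sequence would turn "+64-6-9748241" into "6-9748241", which is consistent with what the browser displayed.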

[3/6] from: bry:itnisk at: 5-Jul-2002 14:51

Some questions: when you replaced "-" with a space, what did you replace it in? I mean, were you using a text editor to look at output from your program? Because then it could be that the text editor is saving as US-ASCII instead of Unicode.

Has anyone confirmed that Rebol won't write Unicode?

Can you post the xml?

You might at any rate consider writing ISO-8859-1 for the encoding, as windows-1252 is Windows-specific and ISO is cross-platform.

[4/6] from: al:bri:xtra at: 5-Jul-2002 14:56

Also (for MS Internet Explorer), be sure to turn *OFF* the menu option: View, Encoding, Auto-Select.

Andrew Martin
ICQ: 26227169
http://valley.150m.com/

[5/6] from: al::bri::xtra::co::nz at: 6-Jul-2002 11:23

Re: Rebol & XML encoding

bryan wrote:

> Has anyone confirmed that Rebol won't write Unicode?

I don't know about writing, but it's definite that Rebol can't read Unicode scripts. I wrote a Hello World program in Windows XP Notepad and successively saved it with each of the four encodings:

ANSI
Unicode
Unicode big endian
UTF-8

and Rebol only understood the first encoding. For example:

>> do %Test.r
Script: "Untitled" (none)
Hello World!

>> do %Test.r
** Syntax Error: Script is missing a REBOL header
** Near: do %Test.r

>> do %Test.r
** Syntax Error: Script is missing a REBOL header
** Near: do %Test.r

>> do %Test.r
** Syntax Error: Script is missing a REBOL header
** Near: do %Test.r

And then back to the beginning again:

>> do %Test.r
Script: "Untitled" (none)
Hello World!
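To see why only the ANSI save is recognised, it helps to look at the first bytes each Notepad option writes. The Python sketch below assumes the usual Notepad behaviour (a byte order mark for the three Unicode options) and a guessed two-line script body, since the thread does not show the contents of Test.r. That these leading bytes are what trips Rebol's header check is a plausible explanation, not something confirmed in the thread.

```python
import codecs

# Hypothetical contents of Test.r; the thread does not show the actual file.
script = 'REBOL []\nprint "Hello World!"\n'

# Bytes as each Notepad "Save As" encoding option is assumed to write them.
saved_as = {
    "ANSI": script.encode("cp1252"),
    "Unicode": codecs.BOM_UTF16_LE + script.encode("utf-16-le"),
    "Unicode big endian": codecs.BOM_UTF16_BE + script.encode("utf-16-be"),
    "UTF-8": codecs.BOM_UTF8 + script.encode("utf-8"),
}

for label, data in saved_as.items():
    # Show whether the file begins with the plain 8-bit text "REBOL".
    starts = data.startswith(b"REBOL")
    print(f"{label:20} starts with REBOL: {starts}  first bytes: {data[:8].hex(' ')}")
```

Only the ANSI file begins with the bytes of "REBOL". The two UTF-16 saves interleave a zero byte with every character, and the UTF-8 save puts three byte-order-mark bytes in front of the header.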

> You might at any rate consider writing ISO-8859-1 for the encoding as
> windows-1252 is windows specific, and ISO is cross-platform.

I've now done that. Thanks for the suggestion!

> when you replace "-" with a space what did you replace it in, I mean were
> you using a text editor to look at output from your program, cause then it
> can be that the text editor is saving as US-ASCII instead of Unicode.

I've used Windows XP Notepad (saving as ANSI) and Metapad.

> Can you post the xml?

I've got the latest XHTML on my site here:
http://valley.150m.com/Rebol/Telephone.html
http://valley.150m.com/Rebol/Telephone with blanks.html

The plain .txt files (which are used by Rebol to generate the above) are at:
http://valley.150m.com/Rebol/Telephone.txt
http://valley.150m.com/Rebol/Telephone with blanks.txt

Andrew Martin
ICQ: 26227169
http://valley.150m.com/

[6/6] from: bry:itnisk at: 7-Jul-2002 17:28

I think part of this problem may be solved by reference to a thread on a different list. Anyway, I'll quote from the xsl-mulberrytech.com list.

David:

> But to get the html browser to detect the right encoding you need to add
> a meta element to the head, the html output method does that
> automatically but in XML you need to do it by hand

Julian:

> IE either supports XML (+CSS) (so the HTML engine won't
> even look at it), *or* HTML. If you're producing HTML, the XML declaration
> is irrelevant (the only thing that counts are the encoding declaration from
> the HTTP response and/or the META tag in the HTML).

Ahaaa, so if I use the xml output method with xhtml doctypes, coupled with a hand-coded <meta> tag, I can get xhtml output that's decoded in unicode. After testing, this works fine for me. Is it ok?
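The hand-coded <meta> approach quoted above can be sketched in Python as follows. The function name `xhtml_page`, the XHTML 1.0 Strict doctype and the sample body are illustrative assumptions, not anything posted in the thread, and the sketch does no escaping of the title or body.

```python
def xhtml_page(title, body_markup, charset="utf-8"):
    """Build an XHTML page whose <meta> element names the same charset
    that is used to encode the returned bytes."""
    page = (
        f'<?xml version="1.0" encoding="{charset}"?>\n'
        '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"\n'
        '  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n'
        '<html xmlns="http://www.w3.org/1999/xhtml">\n'
        "<head>\n"
        # When the page is served as HTML, this is the declaration the browser reads.
        f'<meta http-equiv="Content-Type" content="text/html; charset={charset}" />\n'
        f"<title>{title}</title>\n"
        "</head>\n"
        f"<body>{body_markup}</body>\n"
        "</html>\n"
    )
    return page.encode(charset)


page = xhtml_page("Telephone", "<p>Telephone: +64-6-9748241</p>")
print(page.decode("utf-8"))
```

With an explicit charset in the <meta> element (or in the HTTP response), the browser has a declared encoding to use for the page.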