World: r3wp

[!REBOL3-OLD1]

Maxim
11-Sep-2009
[17482]
AFAIK unicode -> ASCII is possible in R3 but I don't know how... not having done it myself. IIRC it's on the R3 wiki or docs pages somewhere... googling it should give you some clues.
Pekr
11-Sep-2009
[17483x2]
REBOL 3.0 accepts UTF-8 encoded scripts, and because UTF-8 is a superset 
of ASCII, that standard is also accepted.

If you are not familiar with the UTF-8 Unicode standard, it is an 8-bit encoding that accepts ASCII directly (no special encoding is needed), but allows the full Unicode character set by encoding non-ASCII characters as byte sequences with values of 128 or greater.
It should accept ASCII directly ...
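A quick console sketch of that point (results as expected from the R3 builds of the time): to binary! encodes a string! as UTF-8, so ASCII-only text yields exactly its ASCII bytes, while a non-ASCII character becomes a multi-byte sequence.

 >> to binary! "hello"
 == #{68656C6C6F}    ;; plain ASCII bytes, unchanged under UTF-8
 >> to binary! "^(011B)"    ;; Czech e-with-caron code point
 == #{C49B}          ;; two-byte UTF-8 sequence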
Maxim
11-Sep-2009
[17485x4]
that's on input.
print spits out unicode.
AFAIK
string! printing, to be more precise.  UTF-8 and ASCII input are converted to two-byte strings, IIRC, which is why you must re-encode them before spitting them out via print.
Pekr
11-Sep-2009
[17489]
see the system/catalog/codecs for a list of loaded codecs

 - hmm, docs need an update. Dunno why the section was moved to system/codecs 
 ... will ask on R3 chat ...
PeterWood
11-Sep-2009
[17490]
Max - I believe that Carl has written some tricky string code and strings can be either single or double byte depending on their content.
Maxim
11-Sep-2009
[17491]
possible, but I've always seen them output as double byte...  this topic has come up a few times in the last few months
PeterWood
11-Sep-2009
[17492]
Running R3 from the Mac terminal, the output from the print function is definitely UTF-8 encoded.
Pekr
11-Sep-2009
[17493]
I tried to look up some codecs, but there are none for text encodings as of yet:

SYSTEM/CODECS is an object of value:
   bmp             object!   [entry title name type suffixes]
   gif             object!   [entry title name type suffixes]
   png             object!   [entry title name type suffixes]
   jpeg            object!   [entry title name type suffixes]
PeterWood
11-Sep-2009
[17494]
I think that to binary! will encode a Rebol string! as UTF-8:

>> to binary! "^(20ac)"  ;; Unicode code point for Euro sign     
== #{E282AC} ;; utf-8 character sequence for Euro sign
Maxim
11-Sep-2009
[17495x3]
maybe Peter's excellent encoding script on rebol.org could be used as a basis for converting between ASCII and UTF-8 when using an R3 binary as input, until R3 has this built in.
sort of like:

print to-ascii to-binary "some text"
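A minimal sketch of what such a helper might look like; to-ascii here is a hypothetical name, not an existing R3 function. It simply decodes a UTF-8 binary! back into a string!, which for ASCII-only data amounts to the plain ASCII text:

 to-ascii: func [
     "Decode a UTF-8 binary! into a string! (hypothetical helper)"
     data [binary!]
 ][
     to string! data    ;; R3's to string! decodes UTF-8 binaries
 ]

 print to-ascii to binary! "some text"    ;; round-trips unchanged for ASCII text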
Pekr
11-Sep-2009
[17498]
I don't want to encode anything for simple CGI purposes, gee ;-)
Maxim
11-Sep-2009
[17499x2]
but R3 is now fully encoded, which is REALLY nice.  You don't have a choice.  Resistance is futile  ;-)
and the fact that binary gives us the real byte array without any automatic conversion is also VERY nice for building TCP handlers... it would have made my life much simpler in the past, in fact.
Pekr
11-Sep-2009
[17501x2]
But this is some low-level issue I should not have to care about. It displays the Czech codepage correctly. Also, the script is said to be UTF-8 by default, which is a superset of ASCII. IIRC it was said that unless we use special chars, it will work transparently. If it works on input, it should also work on output, no?
OK, so we have HTTP headers, which are supposed to be in ASCII, and then HTML content, which can be encoded. Whose responsibility is it to provide the correct encoding? The coder, or the HTTP server? Hmm, maybe the coder, as I am issuing HTTP content headers in my scripts?
PeterWood
11-Sep-2009
[17503]
Pekr: Just try a quick test with: 
 print to binary! "Content-type: text/html^/"
 print to binary! get-env "REQUEST_METHOD"
 print to binary! get-env "QUERY_STRING"
 print to binary! get-env "REMOTE_ADDR"

to see if it is an encoding problem.
Pekr
11-Sep-2009
[17504x2]
I think I tried, but it printed binaries ...
#{436F6E74656E742D74797065 #{474 #{ #{3132372E3 #{0
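That output is what you would expect from a plain print of a binary!: R3 prints its molded #{...} form rather than the raw bytes, so the content is there but wrapped in binary syntax, e.g.:

 >> print to binary! "GET"
 #{474554}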
Maxim
11-Sep-2009
[17506]
but the loading actually does a re-encoding.  UTF-8 is compact, but it's slow because you cannot skip ahead without traversing the string char by char, which is why strings are internally converted to 8- or 16-bit Unicode chars... it seems strings become 16-bit a bit too often (maybe a change in later releases, where they are always converted to 16 bits for some reason).
PeterWood
11-Sep-2009
[17507x2]
The content of the binaries is fine but their format is a problem. Sorry, I forgot about that when I suggested trying them.
I tested your show.cgi with Apache on OS X. It runs fine and displays the expected output

GET 10.0.1.198
Pekr
11-Sep-2009
[17509]
Should I test with Apache too? I don't think Cheyenne is the problem 
though. But I already downloaded WAMP, so I will unpack it and check 
over the weekend ...
Maxim
11-Sep-2009
[17510x5]
possibly the Windows version defaults to 16 bits more quickly than the Linux and OSX versions...  :-/
because IIRC the Linux shell doesn't expect Unicode as much as the Windows console does.
(as per a past reading of the R3 blogs and previous discussions about this)
probably why people say that CGI isn't working on Windows.
or maybe the Windows console (or some versions of the OS) doesn't understand UTF-8 at all, just 8- or 16-bit Unicode... so that could explain why the Windows version is dumping to stdout in 16 bits all the time. :-(
PeterWood
11-Sep-2009
[17515]
As I understand it, the Windows console only handles single-byte encodings (i.e. Windows code pages).
BrianH
11-Sep-2009
[17516]
Windows Unicode works in UTF-16. Linux and OSX work in UTF-8.
PeterWood
11-Sep-2009
[17517]
Pekr: One difference when I ran the CGI was that I used the -c option, not the -q option. Perhaps you could try with the -c option, in case Carl has done something under the surface about character encoding.
Pekr
11-Sep-2009
[17518]
Peter - it is the same for both options, -c and -q ...
BrianH
11-Sep-2009
[17519]
When last I heard, CGI wasn't working on Windows yet. Thanks for 
the info - now I know why.
Maxim
11-Sep-2009
[17520x2]
yep, it's pretty clear now  :-)
maybe a CGI-specific version of print could be added as a mezz which handles the encoding properly, to make sure that console and CGI printing are both functional on all distros without needing to change the source.
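A rough sketch of what such a mezz might look like; write-stdout here is a hypothetical raw-byte output hook (no such function is assumed to exist in R3 at this point), and the only point is that the string! gets explicitly encoded before it reaches the console:

 cgi-print: func [
     "Print a value for CGI output, explicitly encoded as UTF-8 (sketch)"
     value
 ][
     ;; to binary! encodes an R3 string! as UTF-8;
     ;; write-stdout stands in for whatever raw stdout access the host exposes
     write-stdout to binary! either string? value [value] [form value]
 ]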
BrianH
11-Sep-2009
[17522]
Maybe there's a trick that --cgi could do with I/O ports.
Maxim
11-Sep-2009
[17523x4]
ah yes... --cgi could just tell the core to prevent the UTF-16 encoding being done on stdout...
but if we need to output Latin-1 afterwards (while dumping the HTML content, for example), the output encoding should be selectable as a "current default", and all --cgi would do is set that default to UTF-8, for example.
since AFAICT the internal string! representation is already encoded to whatever is needed by the host in the 'PRINT native.

Choosing what that is manually would simplify porting to other platforms, since the default host code would already have this flexibility covered.
and some systems pipe stdout to have it pushed remotely to other systems... which can expect a different encoding than what is being used by the local engine... I've had this situation in my render-farm management software, as a real-life example.
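Purely to illustrate the "current default" idea above; the name used here is invented, and nothing like it existed in R3 at the time:

 cgi-encoding: 'utf-8    ;; hypothetical "current default" the host would consult
 ;; --cgi would then just set this once, and print would encode accordingly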
BrianH
11-Sep-2009
[17527]
The trick is that the headers are pushed in ASCII, but the contents 
in whatever binary encoding the headers specify.
Maxim
11-Sep-2009
[17528x2]
yep... which is why it should be switchable, since REBOL now does the encoding for us.  :-)
some formats like RSS even support multiple encodings in the same XML document!
Pekr
11-Sep-2009
[17530]
how is it that Linux and OS X don't experience any problems? They do use UTF-8, but that is not ASCII either, no?
Maxim
11-Sep-2009
[17531]
UTF-8's lower 127 codes are the same as ASCII and single byte, so if you don't use special chars, or the null char, you are basically dumping ASCII... this is the reason for its existence.