AltME groups: search


results summary

world    hits
r4wp     115
r3wp     287
total    402

results window for this page: [start: 301 end: 400]

world-name: r3wp

Group: !AltME ... Discussion about AltME [web-public]
Kaj:
16-Jan-2011
I think they're each interpreting text according to their own native 
character set. Windows in UTF-16, OS X and Linux probably in UTF-8. 
AltME doesn't compensate
PeterWood:
16-Jan-2011
Yes, AltME simply ignores character encoding. It just regurgitates 
the text it receives from the client. I believe that REBOL/View uses 
the default codepage under Windows (not UTF-16) and MacRoman under 
OS/X. I suspect it uses ISO-8859-1 under Linux but am not sure.
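The cross-platform garbling Peter describes can be sketched in Python (the encodings named here follow his guesses, not confirmed AltME behavior):

```python
# The same bytes displayed under two different assumed character sets.
utf8_bytes = "café".encode("utf-8")     # what a Linux client might send

# A peer that assumes Latin-1 shows the raw bytes under the wrong
# encoding instead of compensating, so the text comes out garbled:
garbled = utf8_bytes.decode("latin-1")

print(garbled)  # cafÃ©
```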
Group: Parse ... Discussion of PARSE dialect [web-public]
Henrik:
5-Dec-2011
That's fine by me, as I read the file into memory once due to the 
need for one-time UTF-8 conversion, so that will happen outside LOAD-CSV.
Group: !REBOL3-OLD1 ... [web-public]
Maxim:
30-Oct-2009
I also think the "default" user text format should be configurable. 
  I have absolutely no desire to start using utf-8 for my code and 
data, especially when I have a lot of stuff that already is in iso 
latin-1 encoding.
Pekr:
30-Oct-2009
It is really not good that I can't load my own local codepage. How 
should I make my source file UTF-8? My Notepad probably will not 
add any BOM header for me automatically ...
Maxim:
30-Oct-2009
utf-8 needs no BOM... it's only used as a signature.
Maxim:
30-Oct-2009
since rebol will load files as UTF-8 by default code doesn't need 
it.
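Both points are easy to check; a minimal Python sketch:

```python
# A UTF-8 BOM (EF BB BF) is optional: plain UTF-8 decodes fine without it,
# and when present it is only a signature at the start of the data.
text = "řebol"  # any non-ASCII text
plain = text.encode("utf-8")
with_bom = b"\xef\xbb\xbf" + plain

assert plain.decode("utf-8") == text               # no BOM needed
assert with_bom.decode("utf-8-sig") == text        # BOM recognized and stripped
assert with_bom.decode("utf-8") == "\ufeff" + text # otherwise kept as U+FEFF
```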
PeterWood:
30-Oct-2009
..and sticking to the old ways means living with the old problems 
... like not knowing how to interpret characters properly ... like 
AltME for example ... it assumes that all text 
in messages is encoded as though it was entered on your own machine. 
So messages from Mac users are incorrectly displayed on Windows machines 
and vice-versa.


For me, moving to utf-8 is a much easier problem to live with than 
not being able to properly share text across different platforms. 
It may be different for you.
Henrik:
30-Oct-2009
REBOL3's philosophy should be simple: UTF-8 is default. Anything 
else is possible, but must be optionally selected.
sqlab:
30-Oct-2009
Then I would prefer that name and the string to compare have a 
unicode datatype, 
as in
>> type? name
== UTF-8.
Maxim:
30-Oct-2009
but utf-8 editors aren't rare nowadays, and using utf-8 sequences 
isn't hard... really, if you truly want to keep using an ASCII editor
Maxim:
30-Oct-2009
at least converging to utf-8, all scripts by all authors will work 
the same on all systems.
Maxim:
30-Oct-2009
I put a suggestion on the blog about allowing user-created encoding 
maps... otherwise, you can load it as binary in R3 and just convert 
the Czech chars to utf-8 multi-byte sequences and convert the binary 
to string using decode.
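Maxim's binary round-trip can be sketched in Python; cp1250 is an assumption here, since the thread doesn't name the exact Czech codepage:

```python
# Load legacy-encoded bytes, map the Czech chars to UTF-8 multi-byte
# sequences, then treat the result as a proper string.
legacy_bytes = "Příliš žluťoučký".encode("cp1250")  # stand-in for a file read as binary

utf8_bytes = legacy_bytes.decode("cp1250").encode("utf-8")
assert utf8_bytes.decode("utf-8") == "Příliš žluťoučký"
```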
Maxim:
30-Oct-2009
R3 will interpret literal strings and decode them using utf-8 (or 
the header encoding, if it's supported), so in this case no.


but if the data is stored within binaries (equivalent to R2 which 
doesn't handle encoding) then, yes, since the binary represents the 
sequence of bytes not chars.


if you use a utf-8 editor, and type characters above 127 and look 
at them in  notepad, you will then see the UTF-8 byte sequences (which 
will look like garbled text, obviously).
Pekr:
30-Oct-2009
Is there utf-8 version of notepad? :-)
PeterWood:
30-Oct-2009
Notepad can apparently handle both UTF-8 and UTF-16 http://en.wikipedia.org/wiki/Notepad_(Windows)
Maxim:
30-Oct-2009
it tries to detect UTF based on text content... broken up until vista.
http://en.wikipedia.org/wiki/Notepad_%28Windows%29
Gabriele:
31-Oct-2009
Max: maybe you should start using a real operating system. But, that 
aside, if you have any software that does not handle utf-8, simply 
trash it. guys, really, this is crazy, we are in 2009, let's put 
an end to this codepage crap!
Gabriele:
31-Oct-2009
sqlab: what you say would make some sense if converting files was 
in any way difficult. (apart from the fact that you should have stopped 
using latin1 almost 10 years ago...). I've been using utf-8 with 
R2 for years...
Gabriele:
31-Oct-2009
Petr: notepad, as most windows stuff, uses utf-16. much easier to 
detect though, and R3 could do that (actually, didn't Carl just add 
that recently?) most "real" editors allow you to use whatever encoding 
you want, and definitely support utf-8.
Pekr:
31-Oct-2009
Aha, I just realised that I have to use Save-as and choose UTF-8 
or Unicode, instead of Notepad's default ANSI preset
Gabriele:
1-Nov-2009
Max, maybe i was not clear. If your rebol scripts are latin1 by default, 
while my rebol scripts are utf-8 by default, when i send you a rebol 
script IT WILL NOT WORK in the same way in your machine. the *script*'s 
encoding *must* be a standard everyone agrees on. then, the script 
can do whatever it wants with the data, it's your fault if you make 
it so data cannot be exchanged easily among systems.
Pekr:
1-Nov-2009
jocko - the same happened to me here under Windows. The problem is, 
that I used plain Notepad, which by default stores in ANSI compatible 
charset. Then I realised, that on a Save-as dialog, there is a button, 
where I can change ANSI to UTF-8 unicode. Then my strings loaded 
correctly. So - you have to be sure that your editor by default saves 
in UTF-8.
jocko:
1-Nov-2009
Yes, that was the problem ... and I already had it. But it will really 
be a trap for many non english users, from many countries. Another 
point to consider is that we may have difficulties reading normal 
(non-UTF-8) text files coming from other environments. In R2, this 
constraint did not exist.
Maxim:
1-Nov-2009
actually, it is a problem in R2.  if you store your code, and I open 
it with a different codepage version of windows... some letters will 
be skewed. 


In an application I wrote, I couldn't write out proper strings for 
the netherlands, as an example.


unicode is slowly becoming the standard for text... especially utf-8. 
 but yes, users have to be educated.  


within your apps, though, you can handle the encoding as you want... 
only the rebol sources have to be UTF-8.  As R3 matures, more encodings 
will most probably be included in string codecs to support 8-bit 
extended ASCII from different areas of the world.


and even high-profile applications like Apple's iweb have issues 
with text encoding... so this is a problem for the whole industry 
& users to adapt to.
BrianH:
1-Nov-2009
One interesting thing about R3 scripts is that they are UTF-8 *binary*, 
not converted strings. A header setting would just require R3 to 
convert the script to string! and then back to UTF-8 binary before 
reading the file. This is why we recommend that people DO [1 + 1] 
instead of DO "1 + 1", because that string needs to be converted 
to binary before it can be parsed.
BrianH:
1-Nov-2009
Even if we had a text encoding header for R3, it would be a *bad* 
idea to ever use encodings other than UTF-8. So don't.
Maxim:
14-Dec-2009
My only problem with R3 right now is that there is no codec for text 
reading .  This means I can't properly import C files for example, 
unless I convert them to utf-8 with something else first.


Has anyone done (or started to work on) a simple character mapping 
engine?
Group: !Cheyenne ... Discussions about the Cheyenne Web Server [web-public]
Graham:
19-Aug-2009
this is the request


GET /md/creategoogledoc.rsp?gdoc=simple-letter.rtf&patientid=2832&encounter=none 
HTTP/1.1
Host: gchiu.no-ip.biz:8000

User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.2) 
Gecko/20090729 Firefox/3.5.2 (.NET CLR 3.5.30729)

Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Referer: http://gchiu.no-ip.biz:8000/md/Listgoogledocs.rsp
Cookie: RSPSID=QZPTPCZIWWMMYBKWHWRQETGM
Will:
19-Aug-2009
answer from the redirection:
HTTP/1.1 302 Moved Temporarily
Content-Type: text/html; charset=UTF-8
Cache-Control: no-cache, no-store, max-age=0, must-revalidate
Pragma: no-cache
Expires: Fri, 01 Jan 1990 00:00:00 GMT
Date: Wed, 19 Aug 2009 21:43:58 GMT

Set-Cookie: WRITELY_UID=001dfpwvx2b|928b9de9e7bf56448b665282fc69988b; 
Path=/; HttpOnly

Set-Cookie: GDS_PREF=hl=en;Expires=Sat, 17-Aug-2019 21:43:58 GMT;HttpOnly

Set-Cookie: SID=DQAAAHcAAAB0kldc4zZSC_0FoiL6efkWE11k9SQkAIn-N3WfAzIOVe1cM-remnLUtV3Z4M-BFRf5eknz7hr_U3YzW94nECo0-aDnpxrLGiBglWGN4VkfLr5Hh7t2XNyRCA3VWd005SfCmZ9D8-1MUltjRI8X56VLde5Wy8HD92gh-8YkJBJxQA;Domain=.google.com;Path=/;Expires=Sat, 
17-Aug-2019 21:43:58 GMT

Location: https://www.google.com/accounts/ServiceLogin?service=writely&passive=true&nui=1&continue=http%3A%2F%2Fdocs.google.com%2FDoc%3Fdocid%3D0AcdrOHdpKfrWZGZwd3Z4MmJfMnNxcDJkNmZu%26amp%3Bhl%3Den&followup=http%3A%2F%2Fdocs.google.com%2FDoc%3Fdocid%3D0AcdrOHdpKfrWZGZwd3Z4MmJfMnNxcDJkNmZu%26amp%3Bhl%3Den&ltmpl=homepage&rm=false
Content-Encoding: gzip
X-Content-Type-Options: nosniff
Content-Length: 325
Server: GFE/2.0
Will:
19-Aug-2009
more redirection:
HTTP/1.1 302 Moved Temporarily

Set-Cookie: WRITELY_SID=DQAAAHoAAADh80lBIw7e5Hg06TLEBgCY33XQGJ1aUH5OrCF_ir1xLwffKNaCqNdUL6qYfvgjNppDBI4lTNBSTjJWMG_Ze0_qJnveBCAtihBDFwBlOb-H7RlkfgJwM7pBbyKV7bm4M3mqUivD1emtpxgl32vG8CEP1poQ2479HQXrlobsp7Egzw;Domain=docs.google.com;Path=/;Expires=Thu, 
03-Sep-2009 21:43:59 GMT

Location: http://docs.google.com/Doc?docid=0AcdrOHdpKfrWZGZwd3Z4MmJfMnNxcDJkNmZu&amp%3Bhl=en&pli=1
Content-Type: text/html; charset=UTF-8
Content-Encoding: gzip
Date: Wed, 19 Aug 2009 21:43:59 GMT
Expires: Wed, 19 Aug 2009 21:43:59 GMT
Cache-Control: private, max-age=0
X-Content-Type-Options: nosniff
Content-Length: 232
Server: GFE/2.0
Will:
19-Aug-2009
and then the target page:
HTTP/1.1 200 OK

Set-Cookie: WRITELY_SID=DQAAAHoAAADh80lBIw7e5Hg06TLEBgCY33XQGJ1aUH5OrCF_ir1xLwffKNaCqNdUL6qYfvgjNppDBI4lTNBSTjJWMG_Ze0_qJnveBCAtihBDFwBlOb-H7RlkfgJwM7pBbyKV7bm4M3mqUivD1emtpxgl32vG8CEP1poQ2479HQXrlobsp7Egzw;Domain=docs.google.com;Path=/;Expires=Thu, 
03-Sep-2009 21:43:59 GMT

Set-Cookie: GDS_PREF=hl=en;Expires=Sat, 17-Aug-2019 21:43:59 GMT;HttpOnly

Set-Cookie: user=; Expires=Tue, 18-Aug-2009 21:43:59 GMT; Path=/; 
HttpOnly

Set-Cookie: login=; Expires=Tue, 18-Aug-2009 21:43:59 GMT; Path=/; 
HttpOnly
Content-Type: text/html; charset=UTF-8
Cache-Control: no-cache, no-store, max-age=0, must-revalidate
Pragma: no-cache
Expires: Fri, 01 Jan 1990 00:00:00 GMT
Date: Wed, 19 Aug 2009 21:43:59 GMT
Content-Encoding: gzip
Transfer-Encoding: chunked
X-Content-Type-Options: nosniff
Server: GFE/2.0
Will:
21-Aug-2009
Worth noting (in the headers posted in the Cheyenne group): the first 
redirect sends:
Cache-Control: no-cache, no-store, max-age=0, must-revalidate
Pragma: no-cache
Expires: Fri, 01 Jan 1990 00:00:00 GMT
the second:
Expires: Wed, 19 Aug 2009 21:43:59 GMT
Cache-Control: private, max-age=0
also note, the first one also sends:
Content-Type: text/html; charset=UTF-8
but no Content-Length
the second sends:
Content-Encoding: gzip
Content-Length: 232
and no Content-Type

.. a real mess.. normally I trust Google, they are very picky, 
but here I don't understand
Dockimbel:
25-Dec-2009
Important notice wrt web sockets : IIRC, all data sent on both sides 
have to be UTF-8 encoded. The current Cheyenne implementation doesn't 
enforce that encoding, so it's up to the developer to send the right 
data format.
Terry:
25-Dec-2009
UTF-8 support is icing on the cake.
Graham:
25-Dec-2009
Not using the default config .. but I get this

26/12-10:17:23.838-[RSP] ##RSP Script Error: 

	URL  = /ws.rsp
	File = www/ws.rsp

	** Script Error : Invalid path value: data 
	** Where: rsp-script 
	** Near:  [prin request/content/data] 


Request  = make object! [

    headers: [Host "localhost:8000" Connection "keep-alive" User-Agent 
    {Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 
    (KHTML, like Gecko) Chrome/4.0.249.43 Safari/532.5} Accept {application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5} 
    Accept-Encoding "gzip,deflate" Accept-Language "en-GB,en-US;q=0.8,en;q=0.6" 
    Accept-Charset "ISO-8859-1,utf-8;q=0.7,*;q=0.3"]
    status-line: #{474554202F77732E72737020485454502F312E310D0A}
    method: 'GET
    url: "/ws.rsp"
    content: none
    path: "/"
    target: "ws.rsp"
    arg: none
    ext: '.rsp
    version: none
    file: %www/ws.rsp
    script-name: none
    ws?: none
]
Graham:
13-Feb-2010
POST /cgi-bin/rebdev HTTP/1.0
Accept: */*
Accept-Charset: utf-8
Host: host.rebol.net
User-Agent: REBOL
Content-Type: application/x-www-form-urlencoded; charset=utf-8
Content-Length: 56


[0.4.0 "Graham" password  login]HTTP/1.1 500 Internal error in REBOL 
CGI Proxy
Date: Sun, 14 Feb 2010 02:05:45 GMT
Server: Apache/2.0.53 (Fedora)
Content-Type: text/plain; charset=UTF-8
Via: 1.1 bc7
Connection: close

Cannot connect (3)
Dockimbel:
9-Dec-2011
Just something to keep in mind when working on websockets: the transfer 
mode used by Cheyenne to reply to clients is "text" mode. This mode 
requires UTF-8 encoding and IIRC, the browser is allowed to reject 
your response and close the connection if the response is wrongly 
encoded.
Endo:
9-Dec-2011
Is it UTF-8 in your chat example? Cheyenne converts text to UTF-8?
Text mode is ok to me.


By the way, I tested ws.html in Cheyenne sources on my XP/Home yesterday 
with Chrome, it closes the connection immediately.
But it works here now, on XP/Pro with Chrome.
Dockimbel:
9-Dec-2011
Chat demo: no conversion, it's UTF-8 as long as everyone talks in 
english. ;-)
Dockimbel:
9-Dec-2011
Chat demo: in fact, it should work ok in all cases, because the UTF-8 
encoding is done by the browser and the chat back-end just broadcasts 
it as is to others.
Dockimbel:
21-Dec-2011
Sounds like a good idea (making INCLUDE remove UTF-8 BOM, if found).
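The suggested INCLUDE behavior (strip a leading UTF-8 BOM if found) can be sketched in Python; `strip_utf8_bom` is a hypothetical helper, not Cheyenne's actual code:

```python
def strip_utf8_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 BOM (EF BB BF) if present; pass through otherwise."""
    bom = b"\xef\xbb\xbf"
    return data[len(bom):] if data.startswith(bom) else data

assert strip_utf8_bom(b"\xef\xbb\xbfREBOL []") == b"REBOL []"
assert strip_utf8_bom(b"REBOL []") == b"REBOL []"
```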
Group: !REBOL2 Releases ... Discuss 2.x releases [web-public]
BrianH:
23-Jan-2010
Yes, this means that we have fully working SINGLE?, COLLECT-WORDS, 
INVALID-UTF? and RESOLVE. Even R3 doesn't have a fully working INVALID-UTF? 
yet; since R2/Forward is mezzanine we can fix bugs that are still 
pending in the R3 natives.
BrianH:
30-Jan-2010
Posted mezzanine changes for 2.7.8, ported from R2/Forward 2.100.80.0:

- Added COLLECT-WORDS, RESOLVE, SINGLE?, IMMEDIATE!, INTERNAL!, INVALID-UTF?,
  CD, MORE, and the convenience words LS, PWD, RM and MKDIR.

- Removed buggy binary! support from ASCII? and LATIN1?, as done 
in 2.100.60.


See mezz-control.r #6763, mezz-file.r #6776, mezz-series.r #6772, 
mezz-string.r

#6773 and mezz-reflect.r #6771 for the relevant changes. Details 
in R3's docs.


Note: The APPEND and REMOLD rewrites are too awkward to incorporate 
without a

native APPLY function. UNBIND hasn't been written yet (hoping for 
a native).
BrianH:
26-Mar-2010
2.7.8 additions from R2/Forward: RESOLVE, CD, MORE, LS, PWD, RM, 
MKDIR, SINGLE?, COLLECT-WORDS, INVALID-UTF?, and some compatibility 
fixes to ASCII? and LATIN1?.
BrianH:
31-Dec-2010
Some of what is coming in 2.7.8:

- Bug fixes and enhancements to improve Cheyenne, and other apps 
that have to do similar stuff.

- Some native fixes for non-Windows platforms, particularly Linux.

- Environment variable stuff: GET-ENV expansion on Windows, SET-ENV, 
LIST-ENV

- Function fixes: RUN enabled, LIST-REG/values, possibly TO-LOCAL-FILE

- R2/Forward: FUNCT/extern, LAST?, COLLECT-WORDS, EXTRACT fixes, 
ASCII? fixes, LATIN1? fixes, INVALID-UTF?, CD, LS, MORE, PWD, RM

- (Still pending) Natives: ASSERT, APPLY, RESOLVE, FOREACH set-word 
support
BrianH:
2-Jan-2011
What we got in 2.7.8, that I know of:

- Bug fixes and enhancements to improve Cheyenne, and other apps 
that have to do similar stuff.

- Some native fixes for non-Windows platforms, particularly Linux. 
See ACCESS-OS.

- Environment variable stuff: GET-ENV expansion on Windows, SET-ENV, 
LIST-ENV

- Function fixes: SELECT object!, FIND object!, RUN enabled, LIST-REG/values

- R2/Forward: FUNCT/extern, LAST?, COLLECT-WORDS, RESOLVE, APPLY 
fixes, EXTRACT fixes, ASCII? fixes, LATIN1? fixes, INVALID-UTF?, 
CD, LS, MORE, PWD, RM
Group: !REBOL3 Extensions ... REBOL 3 Extensions discussions [web-public]
Robert:
8-Dec-2009
If the c-level side uses UTF-8 strings as well, can I just use the 
Rebol series as is? get_string returns a decoded string.
PeterWood:
11-Nov-2010
Oldes: Have you tested the function with a string including a unicode 
code point which translates to a three-byte utf-8 character? The 
size of utf8str appears to be only twice the number of codepoints 
in the REBOL string.


A good example of a three-byte utf-8 character is the Euro sign - 
Unicode 20AC UTF-8 E2 82 AC
PeterWood:
11-Nov-2010
The maximum length of a utf-8 translation of a UCS-2 string would 
be 1.5 times the length of the string. So if wcslen returns the number 
of codepoints in a string, the length of the utf-8 should be the 
length of the str multiplied by 3 integer divided by 2 plus 1.
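Peter's sizing concern is easy to confirm; a Python check (illustrative, not the extension code under review):

```python
# U+20AC (Euro sign) needs three UTF-8 bytes, so a buffer sized at two
# bytes per codepoint can be overrun.
euro = "\u20ac"
assert euro.encode("utf-8") == b"\xe2\x82\xac"  # E2 82 AC, 3 bytes

# Worst case for BMP codepoints is 3 bytes each, matching the *3/2
# bound on the UCS-2 byte length (2 bytes per codepoint) given above.
s = euro * 8
assert len(s.encode("utf-8")) == 3 * len(s)
assert 3 * len(s) == (2 * len(s)) * 3 // 2
```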
Maxim:
11-Nov-2010
Oldes, thanks for that UTF-8 function converter  :-)
Oldes:
12-Nov-2010
Again with Cyphre's help, here is a function which converts MultiByte 
(utf-8) string from C side to REBSER used to return the string to 
REBOL side:

REBSER* MultiByteToRebser(char* mbStr) {

    // len is the length of the string including the null terminator
    int len = MultiByteToWideChar(CP_UTF8, 0, mbStr, -1, NULL, 0);
    wchar_t *wcStr = malloc(len * sizeof(wchar_t));

    // pass -1 again so the conversion includes the null terminator
    // (passing strlen(mbStr) would leave wcStr unterminated for wcscpy below)
    int result = MultiByteToWideChar(CP_UTF8, 0, mbStr, -1, wcStr, len);
    if (result == 0) {
        int err = GetLastError();
        RL->print("ERROR: MultiByteToWideChar -> %d\n", err);
        exit(-1); //how to throw ERROR on REBOL side?
    }
    REBSER *ser = RL_MAKE_STRING(len-1, TRUE);
    REBUNI *dst;

    //hack! - will set the tail to len-1
    REBINT *s = (REBINT*)ser;
    s[1] = len-1;

    RL_GET_STRING(ser, 0, (void**)&dst);
    wcscpy(dst, wcStr);

    free(wcStr);
    wcStr = NULL;
    return ser;
}

I'm not sure how safe it is, but it seems to be working.
To return the string value I use:

RXA_TYPE(frm, 1) = RXT_STRING;
RXA_SERIES(frm, 1) = (REBSER *)MultiByteToRebser(utf8str);
return RXR_VALUE;
Group: !REBOL3 Schemes ... Implementors guide [web-public]
Graham:
5-Jan-2010
read and write are very similar ... can we do this?

		read: func [
			port [port!]
			/write data
		] [
			either any-function? :port/awake [
				unless open? port [cause-error 'Access 'not-open port/spec/ref]
				if port/state/state <> 'ready [http-error "Port not ready"]
				port/state/awake: :port/awake
				do-request port
				port
			] [
				sync-op port either write [data] [[]]
			]
		]
		write: func [
			port [port!]
			value
		] [
			unless any [block? :value any-string? :value] [value: form :value]
			unless block? value [
				value: reduce [
					[Content-Type: "application/x-www-form-urlencoded; charset=utf-8"]
					value
				]
			]
			read/write port value
		]
Graham:
5-Jan-2010
spec/headers: body-of make make object! [
		Accept: "*/*" 
		Accept-Charset: "utf-8" 
		Host: either spec/port-id <> 80 [
			rejoin [form spec/host #":" spec/port-id]
		] [
			form spec/host
		] 
		User-Agent: "REBOL"
	] spec/headers 
what exactly is this code doing?
Graham:
5-Jan-2010
I wonder why he can't do this 

spec/headers: make spec/headers [
		Accept: "*/*" 
		Accept-Charset: "utf-8" 
		Host: either spec/port-id <> 80 [
			rejoin [form spec/host #":" spec/port-id]
		] [
			form spec/host
		] 
		User-Agent: "REBOL"
]
Graham:
6-Jan-2010
HEAD / HTTP/1.0
Accept: */*
Accept-Charset: utf-8
Host: www.rebol.com
User-Agent: REBOL

HTTP/1.1 200 OK
Date: Wed, 06 Jan 2010 07:28:08 GMT

Server: Apache/1.3.37 (Unix) mod_auth_passthrough/1.8 mod_log_bytes/1.2 
mod_bwlimited/1.4 PHP/4.4.7 FrontPage/5.0.2.2635.SR1.2 mod_ssl/2.8.28 
OpenSSL/0.9.7a
Last-Modified: Fri, 01 Jan 2010 21:19:01 GMT
ETag: "3f44376-2667-4b3e66c5"
Accept-Ranges: bytes
Content-Type: text/html
Via: 1.1 bc1
Content-Length: 9831
Connection: close
Graham:
6-Jan-2010
You're sending 

HEAD www.rebol.com HTTP/1.0
Accept: */*
Accept-Charset: utf-8
Host: www.rebol.com
User-Agent: REBOL

which is invalid
Gabriele:
6-Jan-2010
read returns binary if it can't convert the content to string (ie. 
content-type is not text/* and charset is not UTF-8.) this was a 
quick addition after the unicode changes, and needs much more work.
Graham:
24-Jan-2010
payload: create-sdb-message "ListDomains" listDomains 10


result: to-string write http://sdb.amazonaws.com compose [
	POST [Content-Type: {text/xml; charset="utf-8"} SOAPaction: "ListDomains"]
	(payload)
]
Group: !REBOL3 GUI ... [web-public]
Cyphre:
12-Aug-2010
There is no charset selection. You just provide valid UTF-8 codes 
for the appropriate unicode chars, that's all. Also of course you need 
to have a font that contains those chars. I was using the 'Arial Unicode 
MS' font in the test screens, which is a huge font containing a big 
chunk of all the unicode pages.
Group: !REBOL3 ... [web-public]
joannak:
26-Jan-2010
Returns ..  Utf-8 encoded string. 
== #{C3A441424344}
Andreas:
14-Feb-2010
performance will be less of an issue once we have support for a fast 
codec (utf-32/ucs-4), leaving mostly the extra function call(s)
Claude:
9-May-2010
BrianH, in /etc/default/locale I have LANG="fr_BE.UTF-8" (Ubuntu 
Lucid 10.04)
Maxim:
26-May-2010
if all you do is:


rebol_source = "PARSE {.... UTF-8 data from scintilla ...} parse-rules 
";
do_string (rebol_source);


probably very fast, enough for real time,  if script isn't huge :-)
Robert:
20-Aug-2010
Added it as a codec so you can access it:

>> ml: decode 'markup read http://www.rebol.com
>> foreach tag ml [probe tag]
<!doctype html>
^/
<html>
<head>
^/
<meta name="generator" content="REBOL WIP Wiki"/>
^/
<meta name="date" content="10-Aug-2010/12:18:33-7:00"/>
^/
<meta name="rebol-version" content="2.100.97.4.2"/>
^/

<meta http-equiv="Content-Type" content="text/html;charset=UTF-8"/>
BrianH:
21-Sep-2010
Now for the other binding stuff:


* SET is a low-level function that would be slowed down immensely 
by adding any refinements.

* SET does handle the unbound scenario: It triggers an error. You 
can then handle the error.

* R2 and R3 get their speed from the direct binding model. The core 
speedup of that model is that SET doesn't bind.


* LOAD in R3 is a high-level mezzanine function. It is meant to be 
as fast as possible given its purpose, but being fast is not its 
main goal; user-level flexibility is. Most of the overhead of LOAD 
is in handling all of its various options, as refinements, file extensions 
and script header settings. If you know what you are doing, you can 
always optimize your code by doing it directly instead of having 
LOAD try to figure out that is what you want to do. LOAD is not meant 
for use in tight loops.


* Henrik, ChristianE, the R3 standard answer to the question of how 
to make BIND TO-WORD "a" more efficient or friendly in R3 is this: 
You are encouraged to not program that way in R3. Converting strings 
to words is something you should do once, not all the time in tight 
loops. Your code will be much more efficient if you work in REBOL 
data rather than storing your code in strings and converting at runtime. 
Strings are for data, or script source, not for containing scripts 
at runtime. This is a good rule for all REBOL versions, but especially 
for R3 with its Unicode strings vs. shared UTF-8 words.


* I have recently refactored LOAD so that it is broken into smaller, 
more efficient functions. You might find that those functions would 
work better for you in lower-level code. But this was done to let 
us make LOAD *more* powerful, not less, so the same advice I gave 
above about not using LOAD in tight loops still applies. I don't 
yet know if the new LOAD is faster or slower, but it is safer and 
easier to understand and modify, and you can make your own LOAD replacement 
that calls the same low-level functions if you like. Plus, you get 
compressed scripts :)
BrianH:
7-Oct-2010
Here's a low-level function to parse and process script headers, 
which shows how many features are built into the base script model 
in R3:

load-script: funct [
	"Decode a script into [header-obj script-ref body-ref]"
	source [binary! string!] "Source code (string will be UTF-8 encoded)"
	/header "Return the header object only, no script processing"
	;/check "Calculate checksum and assign it to the header checksum field"
	/original "Use original source for Content header if possible"
] compose [
	data: either string? source [to-binary source] [
		unless find [0 8] tmp: utf? source [ ; Not UTF-8
			cause-error 'script 'no-decode ajoin ["UTF-" abs tmp]
		]
		source
	]
	; Checksum all the data, even that before the header or outside the block
	;sum: if check [checksum/secure data]  ; saved for later
	
	if tmp: script? data [data: tmp] ; Find the start of the script
	
	; Check for a REBOL header
	set/any [hdr: rst:] transcode/only data
	unless case [
		:hdr = 'rebol [ ; Possible REBOL header
			set/any [hdr rst] transcode/next/error rst
			block? :hdr ; If true, hdr is header spec
		]
		:hdr = [rebol] [ ; Possible script-in-a-block
			set/any [hdr rst] transcode/next/error rst
			if block? :hdr [ ; Is script-in-a-block
				unless header [ ; Don't decode the rest if /header
					data: first transcode/next data
					rst: skip data 2
				]
				true
			] ; If true, hdr is header spec
		]
	] [ ; No REBOL header, use default
		hdr: [] rst: data
	]
	; hdr is the header spec block, rst the position afterwards
	;assert/type [hdr block! data [binary! block!] rst [binary! block!]]
	;assert [same? head data head rst]
	
	; Make the header object, or fail if we can't
	unless hdr: attempt [construct/with :hdr system/standard/header] [
		cause-error 'syntax 'no-header data
	]
	; hdr is a correct header object! here, or you don't get here
	;if check [append hdr 'checksum  hdr/checksum: sum]  ; calculated earlier
	;assert [sum =? select hdr 'checksum]  ; Should hdr/checksum be reserved?
	
	if header [return hdr] ; If /header, no further processing necessary
	; Note: Some fields may not be final because post-processing is not done.
	
	; Skip any whitespace after the header
	ws: (charset [1 - 32]) ; For whitespace skipping (DEL not included)
	if binary? rst [parse rst [any ws rst:]] ; Skip any whitespace
	
	; Check for compressed data and decompress if necessary
	case [
		; Magic autodetection of compressed binary
		tmp: attempt [decompress rst] [
			data: rst: tmp  ; Use decompressed data (no header source)
			append hdr 'compressed  hdr/compressed: true ; Just in case
		]
		; Else not directly compressed (without encoding)
		(select hdr 'compressed) != true [] ; Not declared, do nothing
		; Else it's declared to be compressed, thus should be
		binary? rst [ ; Regular script, check for encoded binary
			set/any [tmp rst] transcode/next/error rst
			either tmp: attempt [decompress :tmp] [
				data: rst: tmp  ; Use the decoded binary (no header source)
				hdr/compressed: 'script  ; So it saves the same way
				; Anything after the first binary! is ignored
			] [cause-error 'script 'bad-press -3] ; Else failure
		]
		; Else it's a block, check for script-encoded compressed binary
		tmp: attempt [decompress first rst] [
			data: rst: tmp  hdr/compressed: 'script  ; It's binary again now
		]
		; Else declared compressed but not compressed, so fail
		'else [cause-error 'script 'bad-press -3]
	]
	
	; Save the script content in the header if specified
	if :hdr/content = true [
		hdr/content: either original [source] [copy source]
	]
	
	;assert/type [hdr object! data [binary! block!] rst [binary! block!]]
	;assert [same? head data head rst]
	reduce [hdr data rst]  ; Header object, start of source, start of body
]


Note all the commented assert statements: they are for testing (when 
uncommented) and documentation. Also, I later removed the checksum 
calculation from this code because it was the wrong place to put 
it, at least as far as modules are concerned. However, Carl didn't 
know this because he was working on it while I was offline for a 
few days.
BrianH:
7-Oct-2010
Here is the corresponding function in the code reorg, renamed. The 
friendly empty lines and comments haven't been added yet.

load-header: funct/with [
	"Loads script header object and body binary (not loaded)."
	source [binary! string!] "Source code (a string! will get UTF-8 encoded)"
	no-decompress [logic!] "Skip decompression of body (because we want to look at header mainly)"
][
	; This function decodes the script header from the script body.
	; It checks the 'checksum, 'compress and 'content fields of the header.
	; It will set the 'content field to the binary source if 'content is true.
	; It will set the 'compress field to 'script for compressed embedded scripts.
	; If body is compressed, it will be decompressed (header required).
	; Normally, returns the header object and the body text (as binary).
	; If no-decompress is false and the script is embedded and not compressed
	; then the body text will be a decoded block instead of binary.
	; Errors are returned as words:
	;    no-header
	;    bad-header
	;    bad-checksum
	;    bad-compress
	; Note: set/any and :var used - prevent malicious code errors.
	case/all [
		binary? source [data: assert-utf8 source]
		string? source [data: to binary! source]
		not data: script? data [return reduce [none data]] ; no header
		set/any [key: rest:] transcode/only data none ; get 'rebol keyword
		set/any [hdr: rest:] transcode/next/error data none ; get header block
		not block? :hdr [return 'no-header] ; header block is incomplete
		not attempt [hdr: construct/with :hdr system/standard/header] [return 'bad-header]
		:hdr/content = true [hdr/content: data] ; as of start of header (??correct position??)
		:key = 'rebol [ ; regular script
			rest: any [find rest non-ws rest] ; skip whitespace after header
			;rest: any [find rest #[bitset! [not bits #{7FFFFFFF80}]] rest] ; skip whitespace
			case/all [
				all [:hdr/checksum :hdr/checksum != checksum/secure rest] [return 'bad-checksum]
				no-decompress [return reduce [hdr rest]] ; decompress not done
				:hdr/compress = 'script [set/any 'rest first transcode/next rest]
			] ; rest is now suspect, use :rest
		]
		:key = [rebol] [ ; embedded script, only 'script compression supported
			case/all [
				:hdr/checksum [return 'bad-checksum] ; checksum not supported
				no-decompress [return reduce [hdr rest]] ; decompress not done
				rest: skip first transcode/next data 2 none ; decode embedded script
				:hdr/compress [hdr/compress: unbind 'script  set/any 'rest first rest]
			] ; rest is now suspect, use :rest
		]
		:hdr/compress [rest: attempt [decompress :rest]] ; :rest type-checked by decompress
		not :rest [return 'bad-compress] ; only happens if above decompress failed
	]
	;assert/type [hdr object! rest [binary! block!]] ; just for documentation
	reduce [hdr rest]
][
	non-ws: charset [not 1 - 32]
]

Notes:

- The other half of the CASE/all style is a lot of explicit shortcut 
RETURN statements, whenever the normal flow differs.

- Errors are returned as a word from the error catalog, which is 
later passed to CAUSE-ERROR.

- Carl redid the checksum calculation so that scripts can verify 
against a checksum in their header, to detect corruption.

- The checksum in the header probably can't be used for the module 
checksum because the header itself matters for modules.

- Compressed scripts lost a couple minor, unimportant features that 
we are likely better without. Quiz: What features?

- Part, but not all, of the reason the code is shorter is that the 
doc comments haven't been added yet. The CASE/all style helps, though.
ChristianE:
13-Oct-2010
IIRC, READ at one point only returned the data read as a binary stream, 
forcing you to DELINE TO STRING! READ ... because of the transition 
to UTF-8, but /STRING was added back later. Found nothing in the 
change log, though.
BrianH:
18-Nov-2010
One thing will definitely be easier though: JSON and Javascript define 
that they have Unicode source, but don't have a way to specify the 
encoding (they are text standards, not binary). They can be handled 
easily in R3 once the source is converted to a string though, since 
that conversion will handle the encoding issues. In R2 you'd have 
to either stick to ASCII data or use Gabriele's text codecs and then 
parse the UTF-8.
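[Editor's note: Brian's point, that JSON and JavaScript define Unicode source text without an in-band way to state the encoding, is handled the same way by modern decoders in other languages: the encoding is detected from the byte stream itself. A minimal Python sketch, purely for illustration (the chat itself is about R2/R3):]

```python
import json

# UTF-8 encoded JSON containing a non-ASCII character;
# no BOM, no declared encoding anywhere in the data
data = '{"name": "Müller"}'.encode("utf-8")

# json.loads accepts bytes and detects the Unicode encoding
# (UTF-8/16/32) from the byte pattern before parsing
assert json.loads(data) == {"name": "Müller"}
```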
BrianH:
11-Jan-2011
Some *? functions that might be better off as *-OF: ENCODING?, FILE-TYPE?, 
INDEX?, LENGTH?, SIGN? and SIZE?. Except for the first two the old 
names would need to stick around because of the legacy naming rules. 
Strangely enough, UTF? is OK because it is short for "UTF what?". 
The series contents functions have an implicit -OF :)
PeterWood:
17-Feb-2011
That sounds both very worrying and a challenge - how big were the 
XML files? Were they utf-8 encoded? Did you verify the utf-8 encoding 
in the XML or could it have contained invalid utf-8 sequences?
PeterWood:
20-Apr-2011
So, if I understand correctly, I would write something like:

iso-ch: union #"^(40)" utf-ch-2

and 

utf-ch: rejoin [#"^(C3)" difference #"^(40)" iso-ch]
Andreas:
12-Oct-2011
The only function in R3 that operates that way is TRANSCODE, so as 
long as it doesn't choke on overlong combinations

#{c0ae} is an overlong encoding for #"." (#{2e}).

>> invalid-utf? #{c0ae}
== #{C0AE}

>> transcode #{c0ae}
== [® #{}]

>> transcode #{2e}
== [. #{}]
BrianH:
12-Oct-2011
So, on R3 INVALID-UTF? flags overlong encodings? Sorry I missed that. 
Better fix the R2/Forward version accordingly.
BrianH:
12-Oct-2011
INVALID-UTF? returns the series at the position of the first invalid 
sequence. If it doesn't flag it returns none.
Andreas:
12-Oct-2011
Ok. R2's invalid-utf? catches all 2-byte overlong forms, but not 
all 3- or 4-byte overlong forms.
Andreas:
12-Oct-2011
R2>> invalid-utf? #{e080af}
R2== none

R3>> invalid-utf? #{e080af}
R3== #{e080af}
Andreas:
12-Oct-2011
So, R3's invalid-utf? seems to flag overlong encodings in general. 
R2(/Forward)'s invalid-utf? only catches overlong forms for 2-byte 
sequences, but not for 3- or 4-byte sequences.
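[Editor's note: the overlong-form behaviour Andreas demonstrates can be reproduced with any strict UTF-8 decoder. A hypothetical Python helper mirroring what R3's INVALID-UTF? does, returning the offset of the first invalid sequence (CPython's strict decoder rejects overlong forms, including the 3-byte case R2 missed):]

```python
def first_invalid_utf8(data: bytes):
    """Return the offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8", errors="strict")  # strict mode rejects overlong forms
        return None
    except UnicodeDecodeError as e:
        return e.start

# 2-byte overlong form of "." (#{2E}), as in the R3 transcript above
assert first_invalid_utf8(b"\xc0\xae") == 0
# 3-byte overlong form, which R2's INVALID-UTF? failed to flag
assert first_invalid_utf8(b"\xe0\x80\xaf") == 0
# a plain "." is valid
assert first_invalid_utf8(b"\x2e") is None
```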
Group: Power Mezz ... Discussions of the Power Mezz [web-public]
Gabriele:
27-Jan-2010
Will: I tend to have strings as UTF-8, but char! values need to be 
latin1 for R2. The .r files are the result of MOLD so although I 
have #"^(XX)" in the RLP source, you get the actual latin1 char in 
the .r.
Janko:
22-Sep-2010
I can use decode-email-field to decode various encodings of subject. 
But I wasn't able to figure out how can I decode the content of an 
email which is in my case encoded with quoted-printable / utf8 . 


I found to-netmsg library on codeconscius.com code that loads the 
email text and parses it into structure. It doesn't decode the subject 
=?UTF-8?B...?= but it does the content. I could use that and power 
mezz to get the effect I want. If there is a way to encode content 
in power-mezz I would rather use it alone.
Group: !REBOL3 Host Kit ... [web-public]
Kaj:
2-Jan-2011
I'd be surprised if AGG couldn't work with UTF-8, and that wouldn't 
be the default on Unix
Kaj:
2-Jan-2011
They probably use UTF-8
BrianH:
2-Jan-2011
That might also require some conversion, but at least then the conversion 
would be there to use. R3 uses UCS for strings internally for speed 
and code simplicity, though strangely enough words are stored in 
UTF-8 internally, since you don't have to access and change words 
on a character basis.
BrianH:
2-Jan-2011
Windows uses UTF-16 for its APIs, not UCS-2, so by using UCS-2 R3 
is limited to the BMP codepoints.
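[Editor's note: the BMP limitation Brian mentions is easy to see in any language: in UTF-16, codepoints above U+FFFF take a surrogate pair, i.e. two 16-bit code units, which UCS-2 cannot represent. A small Python illustration:]

```python
bmp_char = "\u00e9"         # é: inside the Basic Multilingual Plane
astral_char = "\U0001F600"  # emoji: outside the BMP, needs surrogates

assert len(bmp_char.encode("utf-16-le")) == 2     # one 16-bit code unit
assert len(astral_char.encode("utf-16-le")) == 4  # surrogate pair: two units
```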
BrianH:
2-Jan-2011
If it is UCS-2 or UTF-16, then all that would need to be done is 
to convert UCS-1 model R3 strings to UCS-2 mode somewhere before 
rendering. (He says glibly, having not analyzed the AGG sources or 
APIs.)
Kaj:
2-Jan-2011
I'm still guessing this only applies to AGG on Windows, using UTF-16. 
On other platforms, AGG uses FreeType, and I guess that would accept 
UTF-8
Oldes:
2-Jan-2011
Also I'm not sure REBOL is using UTF-8 internally, I think it has 
only ANSI or UCS2
Kaj:
2-Jan-2011
No, as Brian says, it's using fixed width vectors internally. You 
get UTF-8 only from conversions
Group: Core ... Discuss core issues [web-public]
Ashley:
11-Apr-2011
OK, this is freaky:

>> system/version
== 2.7.8.2.5
>> a: list-env
== [
    "TERM_PROGRAM" "Apple_Terminal" 
    "TERM" "xterm-color" 
    "SHELL" "/bin/bash" 
    "TMPDIR" "/var/folders/6O/6OnXy9XG...
>> help a
A is a block of value: [
    "TERM_PROGRAM" "Apple_Terminal" 
    "TERM" "xterm-color" 
    "SHELL" "/bin/bash" 

    "TMPDIR" "/var/folders/6O/6OnXy9XGEjiDp3wDqfCJo++++TI/-Tmp-/" 
    "Apple_PubSub_Socket_Render" "/tmp/launch-BrITkG/Render" 
    "TERM_PROGRAM_VERSION" "273.1" 
    "USER" "Ash" 
    "COMMAND_MODE" "legacy" 
    "SSH_AUTH_SOCK" "/tmp/launch-HlnoPI/Listeners" 
    "__CF_USER_TEXT_ENCODING" "0x1F5:0:0" 

    "PATH" {/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin} 
    "PWD" "/Users/Ash" 
    "LANG" "en_AU.UTF-8" 
    "SHLVL" "1" 
    "HOME" "/Users/Ash" 
    "LOGNAME" "Ash" 
    "DISPLAY" "/tmp/launch-U0Gaqw/org.x:0" 
    "_" "/Users/Ash/REBOL/rebol"
]
>> length? a    
== 18
>> select a "USER"
== "Ash"
>> select a "HOME"
== none
Oldes:
10-Nov-2011
I can imagine LENGTH? on a char! value in a Unicode context - it could 
return the number of bytes needed to store the char in UTF-8 encoding :) 
But I'm sure I can live without it. It would just add overhead to 
the LENGTH? action.
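[Editor's note: the byte count Oldes imagines here is just the UTF-8 encoded length of a single character, which any language with UTF-8 support can report. A Python illustration:]

```python
# UTF-8 is variable-width: 1 to 4 bytes per character
assert len("a".encode("utf-8")) == 1  # ASCII: 1 byte
assert len("é".encode("utf-8")) == 2  # U+0080..U+07FF: 2 bytes
assert len("€".encode("utf-8")) == 3  # U+0800..U+FFFF: 3 bytes
```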
Ladislav:
11-Nov-2011
I want to share with you an "interoperability problem" I encountered. 
In Windows (at least in not too old versions) there are two versions 
of string-handling functions:

- ANSI (in fact using a codepage for latin charset)
- widechar (in fact UNICODE, restricted to 16 bits, I think)


It looks like Apple OS X "prefers" to use decomposed UNICODE, also 
known as UTF-8-MAC, I guess. That means that, for example, for Robert's 
file it generates a filename looking (in a transcription) as follows:

%"Mu^(combining-umlaut)nch.r"

As far as the UNICODE goes, this is canonically equivalent to

%"M^(u-with-umlaut)nch.r"

, but:


- Windows doesn't consider these file names equivalent, i.e. you can 
have both in one directory

- When using the former, the ANSI versions of Windows system functions 
"translate" the name to: %"Mu^(umlaut)nch.r"

-- the %"Mu^(umlaut)nch.r" is a third file name, distinct from both 
of the above, so, if R2 reads it in a directory, it is unable 
to open it
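[Editor's note: the decomposed-vs-precomposed problem Ladislav describes is Unicode canonical equivalence. A Python sketch using the standard `unicodedata` module, with "Münch.r" standing in for the filename in the transcription above:]

```python
import unicodedata

# OS X (HFS+) stores file names decomposed (NFD); most other
# systems use the precomposed (NFC) form
nfd = "Mu\u0308nch.r"                    # "u" + combining diaeresis
nfc = unicodedata.normalize("NFC", nfd)  # "ü" as a single codepoint

assert nfc == "M\u00fcnch.r"
assert nfd != nfc  # distinct code-unit sequences, so naive comparison fails...
assert unicodedata.normalize("NFD", nfc) == nfd  # ...yet canonically equivalent
```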
Group: Red ... Red language group [web-public]
Dockimbel:
28-Feb-2011
I plan to support UTF-8 scripts for both Red & Red/System. The memory 
storage model is not yet decided, could be UTF-8 or UCS-2.
Dockimbel:
29-Mar-2011
Brian: right, but I'm not sure that Red/System needs to be Unicode-aware, 
at least not to implement UTF-8 Red's sources parsing.
Dockimbel:
29-Mar-2011
Well, by default, Red/System could be transparent to UTF-8 (that's 
what will be used in Red for strings I/O), as is string! in R2. Will 
add char! as unicode codepoint to possible evolutions anyway.
Dockimbel:
11-Oct-2011
Anyone knows where to find exhaustive lists of invalid UTF-8 encoding 
ranges?
Andreas:
11-Oct-2011
C0, C1, F5-FF must never occur in UTF-8.
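[Editor's note: Andreas's list matches the Unicode well-formedness rules: 0xC0/0xC1 could only start overlong 2-byte forms, and 0xF5 and above would encode codepoints beyond U+10FFFF. A quick Python check that a strict decoder rejects every one of these lead bytes:]

```python
# Bytes that can never appear in well-formed UTF-8
never_valid = [0xC0, 0xC1, *range(0xF5, 0x100)]

for b in never_valid:
    try:
        bytes([b, 0x80]).decode("utf-8")  # give it a continuation byte
        raise AssertionError(f"{b:#x} should have been rejected")
    except UnicodeDecodeError:
        pass  # rejected, as expected
```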
BrianH:
11-Oct-2011
You might also consider looking at the source of INVALID-UTF? in 
R2, which is MIT licensed from R2/Forward.
BrianH:
11-Oct-2011
It would still be a good idea to review the Unicode standard to determine 
which of the characters should be treated as spaces, but that would 
still be a problem for R3 because all of the delimiters it currently 
supports are one byte in UTF-8 for efficiency. If other delimiters 
are supported, R3's parser will be much slower.
Andreas:
12-Oct-2011
Completely forgot about INVALID-UTF? :)