AltME groups: search results
world-name: r3wp

Group: All ... except covered in other channels [web-public]
BrianH:
31-Mar-2009
Processing UTF-8 to parse REBOL data, yes.
BrianH:
31-Mar-2009
I say UTF-8 because this is R3 LOAD we are talking about - R2's LOAD 
won't change again.
Group: Core ... Discuss core issues [web-public]
Jerry:
19-Oct-2006
About the out-of-memory error, the story is ...


I am trying to compare two complete Windows Registries, which are both 
huge. I export them into files (in little-endian 16-bit Unicode), 
which are both 300+ MB. To save space and make them easier for REBOL 
to handle, I encode these files as UTF-8; they are now 150+ 
MB. I try to load these two UTF-8 files into memory:

>> lines1: read/lines %/c/reg1.reg

== ["Windows Registry Editor Version 5.00" "" "[HKEY_LOCAL_MACHINE]" 
"" ...
>> lines3: read/lines %/c/reg2.reg
== ** Script Error: Not enough memory
** Where: halt-view
** Near: halt 
>> rebol/version
== 1.3.2.3.1
Rebolek:
20-Oct-2006
Jerry: For conversion from/to UTF/UCS... you can use Oldes' unicode 
tools; they handle it very well (unfortunately you have to look around 
AltME for a link, because Oldes does not upload to rebol.org and 
has his files all around the web - shame on you, Oldes! ;)
DanielSz:
14-Nov-2007
There is a nice script that encodes strings to utf-8, by Romano 
Paolo & Oldes. I'd like the reverse: decoding utf-8 strings. I found 
a script by Jan Skibinski proposing to do that, but the script doesn't 
load in rebol, exiting with an error ('map has no value). What's 
next?
DanielSz:
14-Nov-2007
BTW, I noticed that rebol.org serves pages in utf-8 encoding, but 
the scripts themselves are latin-1. This is not a problem for the 
code, but it is a problem for the comments, which may contain accented 
characters, for example the names of authors (hint: Robert Müench), 
which consequently appear garbled. I'm not saying pages should 
be served as latin-1; on the contrary, I am a utf-8 enthusiast. 
I think rebol scripts themselves should be encoded as utf-8 (it 
is possible with python, for example). I hope Rebol3 will be an all-encompassing 
utf-8 system (am I dreaming?).
btiffin:
14-Nov-2007
UTF-8 is being discussed as part of R3 Unicode support.  All encompassing? 
 Dunno.  Well thought out and robust?  I'd bet on that one.
DanielSz:
14-Nov-2007
That is to say, not only should there be utf-8 string datatype, but 
words themselves should be utf-8 compliant.
Gabriele:
15-Nov-2007
Daniel, afaik, R2 words are already utf-8 compliant. (thanks to the 
way utf-8 is designed)
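Gabriele's aside about UTF-8's design can be made concrete. Every byte of a multi-byte UTF-8 sequence has its high bit set, so a byte-oriented word scanner that only treats ASCII characters as delimiters passes multi-byte characters through untouched. A small sketch in Python (the byte mechanics are language-independent; the word name is hypothetical):

```python
# Every byte of a multi-byte UTF-8 sequence has its high bit set, so a
# byte-oriented scanner that only treats ASCII characters such as "[",
# "]" or space as delimiters can never split a multi-byte character.
delimiters = set(b'[](){}" ')          # typical REBOL-ish delimiters
word = "müench"                        # hypothetical word name
assert word.encode("utf-8").decode("utf-8") == word
assert all(b >= 0x80 for b in "ü".encode("utf-8"))
assert not any(b in delimiters for b in "ü".encode("utf-8"))
```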
Brock:
3-Sep-2008
Any way to get a copy of the R3 Alpha?  I have a very simple script 
that has to deal with utf-8 text that this would really help with.
Louis:
23-Sep-2008
Henrik, from what you say, I think I see what has happened. I'm copying 
the string from a utf-8 encoded file to an ASCII encoded file. The 
copy converts the string to different characters. But how do I get 
around this?
BrianH:
5-Mar-2009
kib2: "Does that mean that we can use unicode encoding with the help 
of r2-forward ?"

No, I can only spoof datatypes that don't exist in R2, and R2 
already has a string! type. The code should be equivalent if the characters 
in the string are limited to the first 256 codepoints of Unicode 
(aka Latin-1), though only the first 128 codepoints (aka ASCII) can 
be converted from binary! to string! and have the binary data be the 
same as minimized UTF-8.
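BrianH's distinction can be checked byte by byte (sketched here in Python, since the encoding rules are language-independent): for ASCII codepoints the raw byte and the UTF-8 encoding coincide, while Latin-1 codepoints 128-255 take two bytes in UTF-8.

```python
# ASCII (codepoints 0..127): the raw byte equals its UTF-8 encoding
assert "a".encode("utf-8") == b"a"
# Latin-1 (codepoints 128..255): the raw byte and UTF-8 differ
assert "é".encode("latin-1") == b"\xe9"      # one byte as Latin-1
assert "é".encode("utf-8") == b"\xc3\xa9"    # two bytes as UTF-8
```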
Gabriele:
10-Apr-2009
if you can wait for it (the release does not depend on me), i have 
any-charset to utf-8 and utf-8 to html (and vice versa, with support 
for all known named entities as well)
Henrik:
29-May-2009
http://www.openldap.org/lists/openldap-devel/200304/msg00123.html


Anyone made a REBOL version of this? It's a UTF-8 <-> ISO-8859-1 
converter in C.
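No REBOL port is shown in the thread, but the conversion itself is mechanical, because ISO-8859-1 bytes map one-to-one onto Unicode codepoints U+0000..U+00FF. A minimal sketch in Python (function names are my own, for illustration):

```python
def latin1_to_utf8(data: bytes) -> bytes:
    # each ISO-8859-1 byte IS its Unicode codepoint, so this never fails
    return data.decode("iso-8859-1").encode("utf-8")

def utf8_to_latin1(data: bytes) -> bytes:
    # fails (by design) for codepoints above U+00FF, which Latin-1
    # cannot represent
    return data.decode("utf-8").encode("iso-8859-1")
```

For example, `latin1_to_utf8(b"\xf8")` yields `b"\xc3\xb8"`, the "ø" that appears in Henrik's iso-8859 examples later in this thread.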
Graham:
8-Aug-2009
But if I do a wireshark trace, I see this

GET /20090806.7z HTTP/1.0
Accept: */*
Connection: close
User-Agent: REBOL View 2.7.6.3.1
Host: remr.s3.amazonaws.com

HTTP/1.0 403 Forbidden
Date: Sat, 08 Aug 2009 21:08:07 GMT
Content-Type: application/xml
x-amz-request-id: D03B3FA12CC875D5

x-amz-id-2: u3b7TkPzJc5NBwvov4HRQuMsCsosD7le9xfRMSGiCN2BXgeae6kKMVQAbhzqRDwY
Server: AmazonS3
Via: 1.1 nc1 (NetCache NetApp/6.0.5P1)

<?xml version="1.0" encoding="UTF-8"?>

<Error><Code>AccessDenied</Code><Message>Access Denied</Message><RequestId>D03B3FA12CC875D5</RequestId><HostId>u3b7TkPzJc5NBwvov4HRQuMsCsosD7le9xfRMSGiCN2BXgeae6kKMVQAbhzqRDwY</HostId></Error>
BrianH:
30-Jan-2010
invalid-utf?: funct [
	"Checks for proper UTF encoding and returns NONE if correct or position where the error occurred."
	data [binary!]
	/utf "Check encodings other than UTF-8"
	num [integer!] "Bit size - positive for BE, negative for LE"
] compose [
	ascii: (charset [#"^(00)" - #"^(7F)"])
	utf8+1: (charset [#"^(C2)" - #"^(DF)"])
	utf8+2: (charset [#"^(E0)" - #"^(EF)"])
	utf8+3: (charset [#"^(F0)" - #"^(F4)"])
	utf8rest: (charset [#"^(80)" - #"^(BF)"])
	switch/default any [num 8] [
		8 [ ; UTF-8
			unless parse/all/case data [(pos: none) any [
				pos: ascii | utf8+1 utf8rest |
				utf8+2 2 utf8rest | utf8+3 3 utf8rest
			]] [as-binary pos]
		]
		16 [ ; UTF-16BE
			pos: data
			while [not tail? pos] [
				hi: first pos
				case [
					none? lo: pick pos 2 [break/return pos]
					55296 > w: hi * 256 + lo [pos: skip pos 2]  ; #{D800}
					57343 < w [pos: skip pos 2]  ; #{DFFF}
					56319 < w [break/return pos]  ; #{DBFF}
					none? hi: pick pos 3 [break/return pos]
					none? lo: pick pos 4 [break/return pos]
					56320 > w: hi * 256 + lo [break/return pos]  ; #{DC00}
					57343 >= w [pos: skip pos 4]  ; #{DFFF}
				]
				none
			] ; none = valid, break/return pos = invalid
		]
		-16 [ ; UTF-16LE
			pos: data
			while [not tail? pos] [
				lo: first pos
				case [
					none? hi: pick pos 2 [break/return pos]
					55296 > w: hi * 256 + lo [pos: skip pos 2]  ; #{D800}
					57343 < w [pos: skip pos 2]  ; #{DFFF}
					56319 < w [break/return pos]  ; #{DBFF}
					none? lo: pick pos 3 [break/return pos]
					none? hi: pick pos 4 [break/return pos]
					56320 > w: hi * 256 + lo [break/return pos]  ; #{DC00}
					57343 >= w [pos: skip pos 4]  ; #{DFFF}
				]
				none
			] ; none = valid, break/return pos = invalid
		]
		32 [ ; UTF-32BE
			pos: data
			while [not tail? pos] [
				if any [
					4 > length? pos
					negative? c: to-integer pos
					1114111 < c  ; to-integer #{10FFFF}
				] [break/return pos]
				pos: skip pos 4  ; advance past the valid code unit
			]
		]
		-32 [ ; UTF-32LE
			pos: data
			while [not tail? pos] [
				if any [
					4 > length? pos
					negative? c: also to-integer reverse/part pos 4 reverse/part pos 4
					1114111 < c  ; to-integer #{10FFFF}
				] [break/return pos]
				pos: skip pos 4  ; advance past the valid code unit
			]
		]
	] [
		throw-error 'script 'invalid-arg num
	]
]

; Note: Native in R3, which doesn't support or screen the /utf option yet.

; See http://en.wikipedia.org/wiki/Unicode for charset/value explanations.
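The surrogate arithmetic in the UTF-16 branches above (a high surrogate 0xD800-0xDBFF must be followed by a low surrogate 0xDC00-0xDFFF; a lone low surrogate is invalid) is language-independent. Here is a minimal sketch of the same UTF-16BE check in Python, for comparison (the function name is mine):

```python
def invalid_utf16be(data: bytes):
    """Return None if data is valid UTF-16BE, else the offset of the error."""
    i = 0
    while i < len(data):
        if i + 2 > len(data):
            return i                      # truncated code unit
        w = data[i] << 8 | data[i + 1]
        if w < 0xD800 or w > 0xDFFF:
            i += 2                        # ordinary BMP code point
        elif w > 0xDBFF:
            return i                      # lone low surrogate
        else:                             # high surrogate: need a low one
            if i + 4 > len(data):
                return i                  # truncated surrogate pair
            w2 = data[i + 2] << 8 | data[i + 3]
            if not (0xDC00 <= w2 <= 0xDFFF):
                return i                  # high surrogate not followed by low
            i += 4
    return None
```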
Geomol:
25-May-2010
This can be even more complicated when talking about UTF encoding. Hm, 
who knows how R3 does this...
Henrik:
13-Jun-2010
>> str-enc-utils/iso-8859-15-to-utf-8 "aø"
== "" ; bad
>> str-enc-utils/iso-8859-15-to-utf-8 "ø"
== "ø" ; good
>> str-enc-utils/iso-8859-1-to-utf-8 "aø" ; hangs
Group: View ... discuss view related issues [web-public]
Jerry:
9-Dec-2006
Gabriele, 

Actually, Oldes is right. Showing two-byte characters is good enough. 
An IME is not necessary for REBOL/View, because every Chinese/Japanese/Korean 
OS has proper IMEs installed. An IME sends the codes, encoded in the 
OS codepage, to the focused window. For example, if the codepage used 
by Windows XP is Big5 and I type the character which means one 
( #{A440} in Big5, #{4E00} in Unicode, see http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=4E00
 ), my REBOL/View program will get two key events sequentially, 
#{A4} and #{40}. REBOL/View shows them as two characters instead 
of one. I hope that REBOL/View can let the OS do the text drawing, 
like the REBOL/core console does. The REBOL/core console doesn't have 
the Chinese-character-showing issue, because it basically sends 
#{A4} and #{40} to the console and lets the OS do the text drawing. 
The OS knows that #{A4} and #{40} should be combined into one Big5 character, 
so it shows them as one character. Of course, if I type two 
ASCII characters, the OS is smart enough not to combine them into 
one "non-existing" Big5 character. CJK encodings are supersets of 
ASCII, just like UTF-8 is.


It has nothing to do with Unicode, so it is not too difficult to fix, 
I guess. Please fix this in 2.7.5 or 2.7.6 ...

It's on my wish list for Santa Claus this year.
PeterWood:
30-Oct-2008
I've come across what seems to be an oddity with View on the Mac. It 
seems that the REBOL/View console is using UTF-8 encoding but that 
View is using MacRoman.
Gabriele:
31-Oct-2008
the "console" on Mac and Linux is just a terminal (OS provided), 
and they are usually UTF-8. That has nothing to do with View.
ChristianE:
29-Apr-2010
A hard space can be encoded in UTF-8 as 0xC2 0xA0; 0xC2 on its own is 
#"Â".
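A quick check of those byte values (in Python, since the bytes are the same everywhere):

```python
# U+00A0 (no-break space) encodes to the two bytes C2 A0 in UTF-8
assert "\u00a0".encode("utf-8") == b"\xc2\xa0"
# reading those two bytes as Latin-1 instead yields the familiar mojibake:
# "Â" followed by the (invisible) no-break space
assert b"\xc2\xa0".decode("latin-1") == "\u00c2\u00a0"
```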
Group: I'm new ... Ask any question, and a helpful person will try to answer. [web-public]
Gabriele:
26-Jan-2010
You never notice this (and in fact, don't need to know), because 
strings are always converted on i/o. to binary! string will convert 
it to UTF-8. other i/o will convert it to the platform's standard 
encoding (UTF-8 on Mac and Linux, UTF-16 on Windows)
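Gabriele's point, sketched in Python: the same string yields different bytes depending on the target encoding, which is why the conversion has to happen at the i/o boundary.

```python
s = "café"
# the "to binary!" analogue: always UTF-8
assert s.encode("utf-8") == b"caf\xc3\xa9"
# the Windows-style platform encoding: UTF-16, two bytes per BMP character
assert s.encode("utf-16-le") == b"c\x00a\x00f\x00\xe9\x00"
```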
jack-ort:
2-Jul-2010
Hello - hope someone can find the newbie mistake I'm making here. 
I wanted to use REBOL to tackle a need to get data from Salesforce 
using their SOAP API. I'm new to SOAP, WSDL and Salesforce, but using 
SoapUI I managed to do this POST (edited only to hide personal info):

POST https://login.salesforce.com/services/Soap/u/19.0 HTTP/1.1
Accept-Encoding: gzip,deflate
Content-Type: text/xml;charset=UTF-8
SOAPAction: ""
User-Agent: Jakarta Commons-HttpClient/3.1
Host: login.salesforce.com
Content-Length: 525


<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"
xmlns:urn="urn:partner.soap.sforce.com">
   <soapenv:Header>
      <urn:CallOptions>
         <urn:client></urn:client>
         <urn:defaultNamespace></urn:defaultNamespace>
      </urn:CallOptions>
   </soapenv:Header>
   <soapenv:Body>
      <urn:login>
         <urn:username>[jort-:-xxxxxxxxxxxxx-:-com]</urn:username>

         <urn:password>xxxxxxxxxx78l6g7iFac5uaviDnJLFxxxxx</urn:password>
      </urn:login>
   </soapenv:Body>
</soapenv:Envelope>

and get the desired response:

HTTP/1.1 200 OK
Server: 
Content-Encoding: gzip
Content-Type: text/xml; charset=utf-8
Content-Length: 736
Date: Fri, 02 Jul 2010 20:32:14 GMT


<?xml version="1.0" encoding="UTF-8"?><soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"
xmlns="urn:partner.soap.sforce.com" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><soapenv:Body><loginResponse> 
......

Then using SoapUI I am able to send a successful Logout message.


Using REBOL 2.7.7.3.1, I created an "upload" string containing the 
POST block above without the "POST " at the beginning, set my url 
to:

>> url
== https://login.salesforce.com/services/Soap/u/19.0

and tried this:

>> response: read/custom url reduce ['POST upload]

but consistently get a Server 500 error:


** User Error: Error.  Target url: https://login.salesforce.com:443/services/Soap/u/19.0 
could not be retrieved.  Se
rver response: HTTP...
** Near: response: read/custom url reduce ['POST upload]

For completeness, here's the upload value:

>> print mold upload
{https://login.salesforce.com/services/Soap/u/19.0 HTTP/1.1
Accept-Encoding: gzip,deflate
Content-Type: text/xml;charset=UTF-8
SOAPAction: ""
User-Agent: Jakarta Commons-HttpClient/3.1
Host: login.salesforce.com
Content-Length: 525


<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"
xmlns:urn="urn:partner.soap.sforce.com">

   <soapenv:Header>
      <urn:CallOptions>
         <urn:client></urn:client>
         <urn:defaultNamespace></urn:defaultNamespace>
      </urn:CallOptions>
   </soapenv:Header>
   <soapenv:Body>
      <urn:login>
         <urn:username>[jort-:-researchpoint-:-com]</urn:username>

         <urn:password>metrics12378l6g7iFac5uaviDnJLFVprDl</urn:password>
      </urn:login>
   </soapenv:Body>
</soapenv:Envelope>}

Would appreciate any help you can give!
Group: Tech News ... Interesting technology [web-public]
onetom:
30-Aug-2011
i tried gedit recently on a mac too. luckily there was a binary version, 
because the compilation segfaulted... well, it's quite nice. i could 
see it as an open source alternative, but despite the fact it's 
supposed to support utf-8, it didn't...
Dockimbel:
22-Sep-2011
That should be doable, with the "wearable" version of Arduino boards 
(the Lilypad): http://www.arduino.cc/en/Main/ArduinoBoardLilyPad


There are also some wrist-watch-level Arduino-based prototypes (often 
using an OLED display): http://www.google.fr/search?gcx=w&q=wrist+watch+arduino&um=1&ie=UTF-8&tbm=isch&source=og&sa=N&hl=fr&tab=wi&biw=1113&bih=1036


A few more very creative Arduino watches: http://hackaday.com/tag/watch/

There's even one you can already buy: http://www.getinpulse.com
Group: !REBOL3-OLD1 ... [web-public]
Jerry:
13-Jul-2007
According to http://www.rebol.net/r3blogs/0076.html, in REBOL 3, 
CHAR! is both an 8-bit and a 16-bit character.


This could be problematic, I guess. Why don't we have two different 
datatypes instead: a 16-bit CHAR! and an 8-bit BYTE!? The 16-bit CHAR! 
is in UTF-16, just like Java.

STRING! is a BYTE! string.
UNICODE! is a CHAR! string.

What do you think about that?
Jerry:
14-Jul-2007
Kai, my point is, I don't want an atomic type (which is CHAR!) to 
come in two different sizes. I don't want to write my code like the 
following:

myfunc: func [ ch [ char!] ] 
[
    either ( size? ch ) = 8 [
        ; do something about 8-bit char 
    ] [ ;
        ; do something about UTF-16 char
    ]
]
Gabriele:
14-Jul-2007
currently, unicode is not there yet, so this has not been defined 
yet (i think char! is still 8 bit). but in principle, append a-string 
char-gt-255 will either error out or automatically encode to utf-8 
(the latter would be nice, but it must be done for values greater than 
127, so it would be a problem if you don't want utf-8)
PeterWood:
14-Dec-2007
Louis: From what I can tell from DocBase; initially the unicode support 
will be that the Rebol source will be UTF-8 encoded. The next step 
seems to be changing string! to UTF-8 encoding.


It looks as though work hasn't yet started on the unicode! datatype.
BrianH:
14-Dec-2007
UTF-8 is a strict extension of ASCII, but ASCII is only defined between 
0 and 127. Characters 128+ are not ASCII, they are extensions, and 
their meaning depends on the codepage. The codepage of an 8-bit string 
is unknown, unless you specify it externally (or do a lot of statistical 
calculations). Strings or scripts with characters extended by a codepage 
will have to be translated by a codepage-to-utf-8 function or process 
specific to the particular codepage, ahead of time. Fortunately, 
such a process can be fast and even implemented in a byte-oriented 
language easily.
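The codepage-to-UTF-8 translation BrianH describes is indeed just a per-byte table lookup; a sketch in Python, with cp1252 chosen as an example codepage (the sample text echoes the "Müench" example earlier in this page):

```python
# Bytes 128..255 mean different characters in different codepages; only
# external knowledge ("this file is cp1252") tells us which table to use.
cp1252_bytes = b"M\xfcench"                       # "Müench" in cp1252
utf8_bytes = cp1252_bytes.decode("cp1252").encode("utf-8")
assert utf8_bytes == b"M\xc3\xbcench"             # ü is now two bytes, C3 BC
```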
PeterWood:
14-Dec-2007
BrianH: I understood that UTF-8 encoding can be multi-byte, depending 
on the Unicode codepoint of the character being represented.
PeterWood:
14-Dec-2007
Is my reading of Docbase correct that string! values will be UTF-8 
encoded?
BrianH:
25-Jul-2008
So to answer Louis' question: Not yet, as far as we know. The data 
structures for Unicode strings are there, as are UTF-8 word! values, 
but binary encoding and decoding is not yet there, and there are 
some limits to Unicode input and output (mostly due to the Windows 
console). The encoding/decoding work seems likely to get done as 
a part of Carl's GUI work, as that will probably include text display. 
The console IO limits are likely to remain until the written-in-REBOL 
GUI console is adopted.
PeterWood:
27-Oct-2008
I'm confused by these statements in Docbase:


 "1. The READ-STRING function is a temporary function used to read 
 files and convert them from binary (and possibly in Unicode format) 
 into a string datatype."

I thought that the string datatype was now UTF-8 encoded.
Anton:
28-Oct-2008
Peter, binary mode is the default for READ.

READ-STRING looks at the binary and tries to interpret it, checking 
for unicode format (and maybe other formats), before converting to 
rebol string, which internally is UTF-8.
PeterWood:
28-Oct-2008
So does this mean that the graphics library is still treating a string 
as being 8-bit encoded?  No doubt according to the current Windows 
codepage?


does READ-STRING convert  utf-8 to whatever 8-bit encoding the graphics 
library is using?
Gabriele:
28-Oct-2008
string! internally is NOT utf-8 in R3.
Anton:
28-Oct-2008
Oops. Isn't it utf-16, at least when necessary ?
BrianH:
28-Oct-2008
As far as your code is concerned, a string! will be a series of Unicode 
codepoints. Internally, who cares? The implementation of string! 
is likely to be the same as the native implementation on the platform 
it is running on, or whatever is more efficient. I think that string! 
is now UTF-16 on Windows, and the symbols behind word! values are 
internally UTF-8.


Still, it doesn't matter what strings are internally because AS-STRING 
and AS-BINARY are gone. All string-to-binary conversions will need 
encoding. REBOL scripts are going to be UTF-8 encoded though, as 
I recall.
Gabriele:
29-Oct-2008
string! internals are not OS dependent afaik, and technically it's 
not UTF-16 either. currently, R3 switches automatically between an 
array of 8-bit unsigned values, and an array of 16-bit unsigned values. 
i assume a 32-bit mode will be added in the future as not all codepoints 
will fit 16 bits, though those that don't are very rare.
BrianH:
29-Oct-2008
Peter, the array of unsigned values would effectively be UCS-2 if 
it behaves the way Gabriele says. This would mean it would be faster, 
but use more memory for those with characters outside the BMP. It 
would also cause a problem on Windows because Windows >= 2000 is 
internally UTF-16, as are all of its Unicode APIs.
BrianH:
29-Oct-2008
You could store UTF-16 in an array of unsigned 16-bit values as long 
as your length, insertion and deletion routines are UTF-16 aware.
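The distinction BrianH draws (code units versus codepoints) can be seen by counting both for a string containing a non-BMP character; sketched in Python, where the UTF-16 mechanics are the same:

```python
s = "a\U0001D11E"                 # "a" plus U+1D11E, outside the BMP
units = s.encode("utf-16-le")
# "a" takes one 16-bit code unit; the non-BMP character takes two
# (a surrogate pair), so a naive unit count over-counts by one
assert len(units) // 2 == 3
assert len(s) == 2                # actual codepoint count
```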
PeterWood:
29-Oct-2008
BrianH: As I understand it, UCS-2 cannot be used to encode characters 
outside the BMP. It is a strict subset of UTF-16. It should not cause 
problems with the Windows Unicode APIs, except that it would not be able 
to display characters outside the BMP. (It would instead include 
a non-displayable character for each 2 bytes returned by Windows.)
Gabriele:
31-Oct-2008
Brian: same as there is a conversion between the Linux UTF-8 APIs 
and the internal 16 bit array, there is a conversion between the 
Windows UTF-16 APIs and the internal 16 bit array. In the same way, 
we can eventually support 32 bit arrays and convert those UTF-16 
strings that can fit into it to just 8 bit arrays.
BrianH:
31-Dec-2008
I would not trust non-ascii characters for now. With any luck the 
server saves the messages as binary UTF-8, don't know yet.
Gabriele:
2-Jan-2009
considering that R3 will probably just send everything as UTF-8, 
I don't think the server has any reason at all to worry about encodings.
Gabriele:
2-Jan-2009
you have to worry about encodings when you do conversions. i don't 
see where the R2 server is doing any of that. Also, with UTF-8 there 
is no need to worry about encodings on searches and things like that. 
The only issue could be sorting, but that is also region specific 
so it's a completely different issue that R3 cannot solve globally 
either.
PeterWood:
2-Jan-2009
As you say, if all the input is UTF-8 everything will be fine. I 
mistakenly thought that the inputs from different environments would 
have been differently encoded, as they are with R2.
BrianH:
2-Jan-2009
That would have to be the case with R2 clients, as the client is 
the part that handles character encoding. However, there are no R2 
clients yet. The messages appear to be UTF-8 encoded end-to-end, 
stored in binary on the server, which is encoding agnostic. Once 
we have R2 clients, they will have to handle the codepage-to-UTF-8 
encoding, or just stick to ASCII.
Sunanda:
3-Jan-2009
REBOL.org shows a ? because it blindly emits all AltME pages as charset=utf-8.

If (this works in Firefox) you change your default encoding for the page 
(View / Character Encoding / Western ISO-8859-1) then:
-- Peter's post shows a GBP sign [for his char 163]
-- Chris' post shows a 1/2 [for his char 189]
btiffin:
3-Jan-2009
If I was a betting man, by 2020 UTF-8 will reign and compsci grads 
will need a history book to learn about ASCII.
PeterWood:
4-Jan-2009
Reichart... you are right, the problem is one of encoding. My point 
is that because Rebol/View uses different encoding systems on different 
platforms, it is left to the application to either ignore the encoding 
differences or handle them.


This may be quite difficult if, as Chris indicated, it is not possible 
to determine which Windows codepage is in use from Rebol/View. 


There is a single unified character system (Unicode), but there 
are at least five different ways of representing it (UTF-8, UTF-16LE, 
UTF-16BE, UTF-32LE & UTF-32BE). Standardisation is a long way off.
Gabriele:
4-Jan-2009
Reichart, what I mean is that you don't even need tools, as long 
as the server software properly emits only utf-8 and reports that 
it accepts only utf-8... after doing that, if there are still browsers 
that do not comply, then we can start talking about tools (which 
are trivial, most of the time, by the way).
Sunanda:
4-Jan-2009
Another part of the problem, at least from the webpage viewpoint, 
is that each of us could be posting AltME messages in different charsets.


All the HTML emitters for AltME worlds that I know of (AltME's own, 
REBOL.org, REBOL.net) emit a single webpage file, so it can only 
have one charset.


To do it right, each post should be emitted as a separate document/frame 
item. Then they'll each have their own charset.....That's a lot of 
extra work. Let's hope Gabriele's solution (a utf-8 universe) happens 
before that becomes essential.
Chris:
4-Jan-2009
Brian -- ASCII is a subset of UTF-8...
Chris:
4-Jan-2009
With QM, I try to assume (and enforce) UTF-8 (declaring on forms, 
html escaping everything ASCII+), but it's definitely a chore.
BrianH:
7-Jan-2009
Here's the current source for LOAD:

load: func [
	{Loads a file, URL, or string.}
	source [file! url! string! any-block! binary!]
	/header  {Includes REBOL header object if present. Preempts /all.}
;	/next    {Load the next value only. Return block with value and new position.}
;	/library {Force file to be a dynamic library. (Command version)}
;	/markup  {Convert HTML and XML to a block of tags and strings.}
	/all     {Load all values. Does not evaluate REBOL header.}
	/unbound {Do not bind the block.}
	/local data tmp
][
	; Note: Avoid use of ALL func, because of /all option
	if any-block? :source [return :source]

	data: case [
		string? source [to-binary source]
		binary? source [source]
		; Check for special media load cases: (temporary code)
		find [%.jpg %.jpeg %.jpe] suffix? source [
			return load-jpeg read/binary source
		]

		url? source [read source] ; can this possibly return not binary!?
		file? source [read source] ; binary! or block of file!
	]

	; At this point, data is binary!, a block of file!, or something weird.

	if binary? :data [
		unless find [0 8] tmp: utf? data [
			cause-error 'script 'no-decode ajoin ['UTF tmp]
		]

		; Only load script data:
		if any [header not all] [ ; Note: refinement /all
			if tmp: script? data [data: tmp]
		]
	]

	unless block? :data [data: to block! :data] ; reduce overhead

	; data is a block! here, unless something really weird is going on
	tmp: none
	
	; Is there a REBOL script header:
	if any [header not all] [ ; /header preempts /all
		tmp: unless any [
			;not any [file? source url? source] ; removed: hdr in string is same
			unset? first data ; because <> doesn't work with unset!
			'rebol <> first data
			not block? second data
		][ ; Process header:
			attempt [construct/with second data system/standard/script]
		]
		; tmp is header object or none here
		case [
			tmp [
				remove data
				either header [change data tmp][remove data]
				tmp: tmp/type = 'module ; tmp true if module
			]
			header [cause-error 'syntax 'no-header data]
		]
	]
	; tmp is true if module, false or none if not

	; data is a block!, with possible header object in first position

	; Bind to current global context if not a module:
	unless any [
		unbound
		tmp ; not a module
	][
		bind/new data system/contexts/current
	]

	; data is a block! here, unless something really weird is going on

	; If appropriate and possible, return singular data value:
	unless any [ ; avoid use of ALL
		all
		header ; This fixes a design flaw in R2's LOAD
		;not block? :data ; can this ever happen?
		empty? data ; R2 compatibility
		not tail? next data
	][data: first data]
	; If /all or /header, data is a block here

	:data
]
Steeve:
9-Feb-2009
hum, or you can pass a header block to the write function as is:
>> write [ url!  [ User-Agent: "TOTO" ... ]  #{...data...}]

but it's buggy; you have to add some missing header properties to 
the block yourself to generate a valid request,

like Content-Type: "application/x-www-form-urlencoded; charset=utf-8"
BrianH:
15-Feb-2009
Kib2, likely when the chat server is finished being ported to R3 
on Linux. We were running into problems with Unicode user names on 
R2, since R2 can't do case-insensitive comparisons of Unicode strings, 
even when encoded in UTF-8.
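The limitation BrianH mentions is easy to demonstrate: byte-wise comparison of UTF-8 data can only case-fold the ASCII range, and even per-character lowercasing misses one-to-many folds. Sketched in Python, with German ß as the classic example:

```python
a, b = "STRASSE", "straße"
# ASCII-only lowercasing of the raw UTF-8 bytes cannot match these
assert a.encode("utf-8").lower() != b.encode("utf-8").lower()
# simple per-character lowercasing still misses the fold ß -> ss
assert a.lower() != b.lower()
# full Unicode case folding is what a case-insensitive compare needs
assert a.casefold() == b.casefold()
```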
DideC:
26-Feb-2009
R3 assumes text files are UTF-8 encoded. Probably yours is ANSI or 
something else.
Gabriele:
27-Feb-2009
Brian, are you sure that R3 handles case-insensitiveness in Unicode 
text better than R2 with UTF-8?
BrianH:
28-Feb-2009
Gabriele, I know that R3 handles case-insensitiveness of Latin1 characters 
in Unicode text better than R2 with UTF-8, but beyond that I don't 
know. There is still some stuff to do relative to Unicode. The implications 
of Unicode support on the rest of the system are mostly implemented 
though, with the glaring exception of PARSE.
Gabriele:
1-Mar-2009
Brian: if it's just latin1, does it really change things? it won't 
always work anyway. but there's plenty of code to convert utf-8 
to latin1 in R2, so why not just do that, if it's really important 
to have case-insensitive accented characters in the user names?
BrianH:
1-Mar-2009
Gabriele, Unicode isn't done, so I don't understand this: "does it 
really change things?". I was just answering your question about 
R3. Whether R2 can handle case insensitivity of UTF-8 when third-party 
libraries are added is another issue.


I can't say what Carl's priorities are beyond where he had me mark 
CureCode tickets as urgent, so I can't say when or if Unicode will 
be "finished". User account creation is reenabled, so the whole reason 
this conversation started is moot now.
BrianH:
3-Apr-2009
load: func [
	{Loads a file, URL, or string.}

	source [file! url! string! binary! block!] {Source or block of sources}
	/header  {Includes REBOL header object if present. Preempts /all.}
	/next    {Load the next value only. Return block with value and new position.}
;	/library {Force file to be a dynamic library. (Command version)}
;	/markup  {Convert HTML and XML to a block of tags and strings.}
	/all     {Load all values. Does not evaluate REBOL header.}
	/unbound {Do not bind the block.}
	/local data content val rst tmp
][	; Note: Avoid use of ALL and NEXT funcs, because of /all and /next options
	content: val: rst: tmp: none ; In case people call LOAD/local
	
	; Retrieve the script data
	data: case [
		block? source [ ; Load all in block
			return map x source [apply :load [:x header next all unbound]]
		]
		string? source [source] ; Will convert to binary! later
		binary? source [source]
		; Otherwise source is file or url
		'else [
			; See if a codec exists for this file type
			tmp: find find system/catalog/file-types suffix? source word!
			; Get the data, script required if /header
			content: read source  ; Must be a value, not unset
			case [
				binary? :content [content] ; Assumed script or decodable
				string? :content [content] ; Assumed script or decodable
				header [cause-error 'syntax 'no-header source]
				block? :content [content]
				'else [content: reduce [:content]]
			] ; Don't LOAD/header non-script data from urls and files.

		] ; content is data if content doesn't need copying, or none if it does
	]
	;print [1 "data type?" type? :data 'content true? :content]
	if string? :data [data: to-binary data] ; REBOL script is UTF-8

	assert/type [data [binary! block!] content [binary! string! block! none!]]
	assert [any [binary? :data not header]]
	if tmp [ ; Use a codec if found earlier
		set/any 'data decode first tmp :data

		; See if we can shortcut return the value, or fake a script if we can't
		case [
			block? :data [if header [insert data val: make system/standard/script []]]
			header [data: reduce [val: make system/standard/script [] :data]]
			(to logic! unbound) and not next [return :data] ; Shortcut return
			any [next any-block? :data any-word? :data] [data: reduce [:data]]
			'else [return :data] ; No binding needed, shortcut return
		]
		assert/type [data block!] ; If we get this far
	]
	;print [2 'data mold to-string :data]
	
	if binary? :data [ ; It's a script
		unless find [0 8] tmp: utf? data [ ; Not UTF-8
			cause-error 'script 'no-decode ajoin ["UTF-" abs tmp]
		]
		; Process the header if necessary
		either any [header not all] [
			if tmp: script? data [data: tmp] ; Load script data
			; Check for a REBOL header
			set/any [val rst] transcode/only data
			unless case [
				:val = [rebol] [ ; Possible script-in-a-block
					set/any [val rst] transcode/next/error rst
					if block? :val [ ; Is script-in-a-block
						data: first transcode/next data
						rst: skip data 2
					] ; If true, val is header spec
				]
				:val = 'rebol [ ; Possible REBOL header
					set/any [val rst] transcode/next/error rst
					block? :val ; If true, val is header spec
				]
			] [ ; No REBOL header, use default
				val: [] rst: data
			]
			; val is the header spec block, rst the position afterwards

			assert/type [val block! rst [binary! block!] data [binary! block!]]
			assert [same? head data head rst]
			; Make the header object
			either val: attempt [construct/with :val system/standard/script] [
				if (select val 'content) = true [
					val/content: any [:content copy source]
				]
			] [cause-error 'syntax 'no-header data]
			; val is correct header object! here, or you don't get here
			; Convert the rest of the data if necessary and not /next
			unless any [next block? data] [data: rst: to block! rst]
			if block? data [ ; Script-in-a-block or not /next
				case [

					header [change/part data val rst] ; Replace the header with the object
					not all [remove/part data rst] ; Remove the header from the data
				]
				rst: none ; Determined later
			]
		] [rst: data] ; /all and not /header
	]

	; val is the header object or none, rst is the binary position after or none
	assert/type [val [object! none!] rst [binary! none!] data [binary! block!]]
	assert [any [none? rst same? head data head rst] any [val not header]]
	;print [3 'val mold/all :val 'data mold/all :data "type?" type? :data]
	
	; LOAD/next or convert data to block - block either way
	assert [block? data: case [
		not next [ ; Not /next
			unless any [block? data not binary? rst] [data: to block! rst]
			data
		]
		; Otherwise /next
		block? data [reduce pick [[data] [first+ data data]] empty? data]
		header [reduce [val rst]] ; Already transcoded above
		binary? rst [transcode/next rst]
	]]
	
	; Bind to current global context if not a module
	unless any [ ; Note: NOT ANY instead of ALL because of /all
		unbound
		(select val 'type) = 'module
	][
		bind/new data system/contexts/current
	]
	;print [6 'data mold/all :data 'tmp mold/all :tmp]
	
	; If appropriate and possible, return singular data value
	unless any [
		all header next  ; /all /header /next
		empty? data
		1 < length? data
	][set/any 'data first data]
	;print [7 'data mold/all :data]
	
	:data
]
shadwolf:
9-Apr-2009
and since text editing is related to UTF-8, better to get all these 
things finished so we don't constantly have to redo part of the job
Gabriele:
21-Apr-2009
Geomol, the difference I'm pointing out is the following: suppose 
you have an array of unicode code points. each element in the array 
is an integer that represents a character. you can "encode" it to 
UTF-8. there is no magic, for each integer you have a corresponding 
sequence of bytes.
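Gabriele's description of encoding can be illustrated in Python (a stand-in here, not REBOL's own mechanism): each code point in the array maps to a fixed byte sequence, with no magic involved.

```python
# An array of Unicode code points: each integer represents one character.
codepoints = [0x48, 0x65, 0x6A, 0x20AC]  # 'H', 'e', 'j', '€'

# "Encoding" is a pure mapping: each integer yields a fixed byte sequence.
utf8 = "".join(chr(cp) for cp in codepoints).encode("utf-8")
print(utf8)  # b'Hej\xe2\x82\xac' - the Euro sign becomes three bytes
```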
Gabriele:
21-Apr-2009
Now, if your array was representing a url, you could encode it to 
UTF-8 using the % encoding as well to stay in the ascii subset. This 
is encoding, but still, it will not solve your @ problem. each @ 
in the array of integers will become an @ (which is an ascii char) 
in the final string.
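The same observation sketched in Python (`urllib.parse.quote` is just an illustrative percent-encoder, not anything R3 would use): percent-encoding the non-ASCII characters still leaves a literal `@`, which is already ASCII, untouched.

```python
from urllib.parse import quote

# Percent-encode everything except '@': the non-ASCII 'é' becomes its
# UTF-8 bytes %C3%A9, but '@' - already an ASCII char - passes through.
print(quote("café@example", safe="@"))  # caf%C3%A9@example
```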
Geomol:
21-Apr-2009
Maybe we got unicode encoding and escape encoding confused.


As I see it, given correct rules, auto converting of user input to 
correct url can be achieved. I made this function to illustrate, 
what I mean (it's not optimized, but should be easy to read):

encode-url: func [input /local url components host] [
	components: parse input "@"
	host: back tail components

	url: copy ""  ; copy, not clear: a cleared literal is shared across calls
	append url components/1
	components: next components

	forall components [
		either components = host [
			append url "@"
			append url components/1
		][
			append url "%40"
			append url components/1
		]
	]
	url
]


I can use it both with and without specifying %40 for the first @ 
in the url:

>> encode-url "ftp://name@home.net:pass@server.net"
== "ftp://name%40home.net:pass@server.net"
>> encode-url "ftp://name%40home.net:pass@server.net"
== "ftp://name%40home.net:pass@server.net"


It will give the correct result in both cases (I use strings, but 
of course it should be the url! datatype in REBOL). Now comes Unicode: 
given precise rules for how that should happen, I see no problem 
with encoding this in e.g. UTF-8.


So I think it's possible to do this correctly. But maybe it's better 
to keep it simple and not do such auto conversions. In any case, 
the behaviour needs to be well documented, so users can figure out 
how to create a valid url. I had the same problem as Pekr years ago, 
and I missed documentation of that.
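Geomol's rule (keep the last `@` as the user-info/host separator, escape the earlier ones) can be sketched in Python; `encode_url` here is a hypothetical helper mirroring his REBOL function, not part of any library.

```python
def encode_url(s: str) -> str:
    """Escape every '@' except the last, which separates user info from host."""
    head, sep, host = s.rpartition("@")
    if not sep:  # no '@' at all: nothing to encode
        return s
    return head.replace("@", "%40") + "@" + host

# Works with or without a pre-escaped %40, like the REBOL version:
print(encode_url("ftp://name@home.net:pass@server.net"))
print(encode_url("ftp://name%40home.net:pass@server.net"))  # idempotent
```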
Pekr:
9-Jul-2009
Brian - I don't understand proposal for invalid-utf-8 function. What 
is it good for? Is it about some binary code not being able to be 
converted to regular char?
BrianH:
9-Jul-2009
It's about finding UTF-8 encoding errors, particularly the overlong 
forms that are used for security breaches. We can't do that check 
in TO-STRING because of the overhead (+50%), but it can still be 
a good idea to check in some cases, and the code is better written 
in C than REBOL.
BrianH:
9-Jul-2009
TO-STRING is the primary decoder of UTF-8 in REBOL. TO-CHAR is the 
other one, and it complains about invalid UTF because it can.
BrianH:
31-Jul-2009
Except in binary. TRANSCODE works on UTF-8 binaries now. I need to 
adjust that ticket accordingly.
BrianH:
31-Jul-2009
All standard functions and syntax in REBOL fit within 7-bit ASCII, 
which is why R3 source is UTF-8.
BrianH:
31-Jul-2009
UTF-8 encoded binary!
PeterWood:
4-Aug-2009
Pekr - it is actually an a with a grave accent over it in UTF-8
Paul:
16-Aug-2009
I see Carl is going to add Read/Text functionality with UTF conversion. 
 That is going to be sweet.  That alone should begin to make REBOL3 
useful.
Pekr:
11-Sep-2009
REBOL 3.0 accepts UTF-8 encoded scripts, and because UTF-8 is a superset 
of ASCII, that standard is also accepted.

If you are not familiar 
with the UTF-8 Unicode standard, it is an 8 bit encoding that accepts 
ASCII directly (no special encoding is needed), but allows the full 
Unicode character set by encoding them with characters that have 
values 128 or greater.
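The superset property Pekr describes is easy to check in Python (purely as an illustration): pure-ASCII text encodes to byte-identical output, while characters with values 128 or greater expand to multi-byte sequences.

```python
ascii_text = "REBOL 3.0"
# Pure ASCII needs no special encoding: the UTF-8 bytes are identical.
assert ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# Characters with values 128 or greater use multi-byte sequences.
print("á".encode("utf-8"))  # b'\xc3\xa1' - two bytes for one character
```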
Maxim:
11-Sep-2009
string! printing, to be more precise.  UTF and ASCII are  converted 
to two byte strings IIRC.  which is why you must re-encode them before 
spitting them via print.
PeterWood:
11-Sep-2009
Running R3 from the Mac terminal the output from the print function 
is definitely utf-8 encoded.
PeterWood:
11-Sep-2009
I think that to binary! will decode a Rebol string! to utf-8 :

>> to binary! "^(20ac)"  ;; Unicode code point for Euro sign     
== #{E282AC} ;; utf-8 character sequence for Euro sign
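PeterWood's result can be cross-checked outside REBOL; in Python the same code point yields the same three bytes as R3's `#{E282AC}`:

```python
euro = "\u20ac"  # Unicode code point for the Euro sign
print(euro.encode("utf-8"))  # b'\xe2\x82\xac' - matches REBOL's #{E282AC}
```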
Pekr:
11-Sep-2009
But this is some low-level issue I should not care about. It displays 
the Czech codepage correctly. Also the script is said to be UTF-8 
by default, which is a superset of ASCII. IIRC it was said that unless 
we use special chars, it will work transparently. If it works on 
input, it should work also on output, no?
Maxim:
11-Sep-2009
but the loading actually does a re-encoding. UTF-8 is compact, but 
it's slow because you cannot skip ahead without traversing the string 
char by char. Which is why they are internally converted to 8- or 
16-bit unicode chars... it seems strings become 16 bits a bit too 
often (maybe a change in later releases, where they are always converted 
to 16 bits for some reason).
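Maxim's point about skipping: in UTF-8 the byte offset of the n-th character is unknown until you scan, because continuation bytes (`10xxxxxx`) don't start characters. A minimal Python sketch of that linear scan (`nth_char_offset` is a hypothetical helper, not R3's internal routine):

```python
def nth_char_offset(buf: bytes, n: int) -> int:
    """Byte offset of the n-th code point - found only by a linear scan."""
    count = 0
    for i, byte in enumerate(buf):
        if byte & 0xC0 != 0x80:  # skip continuation bytes (0b10xxxxxx)
            if count == n:
                return i
            count += 1
    raise IndexError(n)

data = "€abc".encode("utf-8")    # b'\xe2\x82\xacabc'
print(nth_char_offset(data, 1))  # 3 - the '€' occupied bytes 0..2
```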
BrianH:
11-Sep-2009
Windows Unicode works in UTF-16. Linux and OSX work in UTF-8.
Maxim:
11-Sep-2009
ah yess.. --cgi could just tell the core to prevent the UTF-16 encoding 
being done on stdout...
Maxim:
11-Sep-2009
but if we need to output latin-1 afterwards (while dumping the html 
content, for example), the output encoding  should be selectable 
as a "current default", and all the --cgi would do is set that default 
to UTF-8 for example.
Pekr:
11-Sep-2009
How is it that Linux and OS X don't experience any problems? They 
do use UTF-8, but that is not ASCII either, no?
Maxim:
11-Sep-2009
UTF-8's codes below 128 are the same as ASCII, and single byte. So 
if you don't use special chars, or the null char, you are basically 
dumping ASCII... this is the reason for its existence.
Maxim:
11-Sep-2009
(UTF-8)
Pekr:
11-Sep-2009
hmm, and why does Windows use UTF-16? Is it because the Windows console 
defaults to UTF-16?
Maxim:
11-Sep-2009
probably it doesn't even support UTF-8 in any way.
Maxim:
11-Sep-2009
IIRC the whole windows API is either ASCII or UTF-16.
Pekr:
8-Oct-2009
I am curious about HOW we actually fix the unicode issues. This might 
be a deeper problem than it seems. Because if I am not able to print 
in UTF-8, I need to first print the header using some conversion, 
and then the content = the code is not easily cross-platform ...
BrianH:
8-Oct-2009
CGI output should be binary, and the headers output in 7bit ASCII 
(not UTF-8) through that binary output.
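BrianH's suggestion sketched in Python (a stand-in for an R3 CGI script, under the assumption that the CGI channel is a raw byte stream): emit headers as 7-bit ASCII bytes and the body as UTF-8, both through the binary output so no platform re-encoding interferes.

```python
import sys

headers = "Content-Type: text/html; charset=utf-8\r\n\r\n"
body = "<p>Price: €10</p>"

# Headers must stay within 7-bit ASCII; the body may be any UTF-8.
out = headers.encode("ascii") + body.encode("utf-8")
sys.stdout.buffer.write(out)  # binary output, no text-layer conversion
```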
Henrik:
22-Oct-2009
A91 released with some UTF-16 support
BrianH:
26-Oct-2009
Chris: "Is 'load/next supposed to return binary as the second part 
of the result?"

Yes. R3 source is defined as binary encoded in UTF-8, not as a string. 
LOAD/next of a dir or url which returns a block on read, or of a 
script-in-a-block will return a block reference as the next though.
Carl:
26-Oct-2009
Note that the header would remain clear text, UTF-8.
Pekr:
29-Oct-2009
hmm, interesting. R3 scripts should be UTF-8 by default, but dunno 
if it should, or should not work ...
BrianH:
29-Oct-2009
Showing the correct character in a string might be a console font 
thing. Is %test encoded in UTF-8?