AltME groups: search results
world-name: r3wp

Group: All ... except covered in other channels [web-public]
BrianH:
31-Mar-2009
Processing UTF-8 to parse REBOL data, yes.
BrianH:
31-Mar-2009
I say UTF-8 because this is R3 LOAD we are talking about - R2's LOAD 
won't change again.
Group: Core ... Discuss core issues [web-public]
Jerry:
19-Oct-2006
About the out-of-memory error, the story is ...


I am trying to compare two complete Windows Registries, which are both 
huge. I export them into files (in little-endian 16-bit Unicode), 
which are both 300+ MB. To save space and make them easier for REBOL 
to handle, I encode these files as UTF-8; they are now 150+ 
MB. I try to load these two UTF-8 files into memory:

>> lines1: read/lines %/c/reg1.reg

== ["Windows Registry Editor Version 5.00" "" "[HKEY_LOCAL_MACHINE]" 
"" ...
>> lines3: read/lines %/c/reg2.reg
== ** Script Error: Not enough memory
** Where: halt-view
** Near: halt 
>> rebol/version
== 1.3.2.3.1
Rebolek:
20-Oct-2006
Jerry: For conversion from/to UTF/UCS... you can use Oldes' unicode 
tools; they handle it very well (unfortunately you have to look around 
AltME for a link, because Oldes does not upload to rebol.org and 
has his files all around the web - shame on you, Oldes! ;)
DanielSz:
14-Nov-2007
There is a nice script that encodes strings to utf-8, by Romano 
Paolo & Oldes. I'd like the reverse: decoding utf-8 strings. I found 
a script by Jan Skibinski proposing to do that, but the script doesn't 
load in rebol, exiting with an error ('map has no value). What's 
next?
DanielSz:
14-Nov-2007
BTW, I noticed that rebol.org serves pages in utf-8 encoding, but 
the scripts themselves are latin-1. This is not a problem for the 
code, but it is a problem for the comments, which may contain accented 
characters, for example the names of authors (hint: Robert Müench), 
which consequently appear garbled. I'm not saying pages should 
be served as latin-1; on the contrary, I am a utf-8 enthusiast. 
I think rebol scripts themselves should be encoded as utf-8 (it 
is possible with python, for example). I hope Rebol3 will be an all-encompassing 
utf-8 system (am I dreaming?).
btiffin:
14-Nov-2007
UTF-8 is being discussed as part of R3 Unicode support.  All encompassing? 
 Dunno.  Well thought out and robust?  I'd bet on that one.
DanielSz:
14-Nov-2007
That is to say, not only should there be utf-8 string datatype, but 
words themselves should be utf-8 compliant.
Gabriele:
15-Nov-2007
Daniel, afaik, R2 words are already utf-8 compliant. (thanks to the 
way utf-8 is designed)
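Gabriele's aside about UTF-8's design can be made concrete. Every byte of a multi-byte UTF-8 sequence has its high bit set, so a byte-oriented word scanner that only treats ASCII characters as delimiters passes multi-byte characters through untouched. A small sketch in Python (the byte mechanics are language-independent; the word name is hypothetical):

```python
# Every byte of a multi-byte UTF-8 sequence has its high bit set, so a
# byte-oriented scanner that only treats ASCII characters such as "[",
# "]" or space as delimiters can never split a multi-byte character.
delimiters = set(b'[](){}" ')          # typical REBOL-ish delimiters
word = "müench"                        # hypothetical word name
assert word.encode("utf-8").decode("utf-8") == word
assert all(b >= 0x80 for b in "ü".encode("utf-8"))
assert not any(b in delimiters for b in "ü".encode("utf-8"))
```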
Brock:
3-Sep-2008
Any way to get a copy of the R3 Alpha?  I have a very simple script 
that has to deal with utf-8 text that this would really help with.
Louis:
23-Sep-2008
Henrik, from what you say, I think I see what has happened. I'm copying 
the string from a utf-8 encoded file to an ASCII encoded file. The 
copy converts the string to different characters. But how do I get 
around this?
BrianH:
5-Mar-2009
kib2: "Does that mean that we can use unicode encoding with the help 
of r2-forward ?"

No, I can only spoof datatypes that don't exist in R2, and R2 
already has a string! type. The code should be equivalent if the characters 
in the string are limited to the first 256 codepoints of Unicode 
(aka Latin-1), though only the first 128 codepoints (aka ASCII) can 
be converted from binary! to string! and have the binary data be the 
same as minimized UTF-8.
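BrianH's distinction can be checked byte by byte (sketched here in Python, since the encoding rules are language-independent): for ASCII codepoints the raw byte and the UTF-8 encoding coincide, while Latin-1 codepoints 128-255 take two bytes in UTF-8.

```python
# ASCII (codepoints 0..127): the raw byte equals its UTF-8 encoding
assert "a".encode("utf-8") == b"a"
# Latin-1 (codepoints 128..255): the raw byte and UTF-8 differ
assert "é".encode("latin-1") == b"\xe9"      # one byte as Latin-1
assert "é".encode("utf-8") == b"\xc3\xa9"    # two bytes as UTF-8
```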
Gabriele:
10-Apr-2009
if you can wait for it (the release does not depend on me), i have 
any-charset to utf-8 and utf-8 to html (and vice versa, with support 
for all known named entities as well)
Henrik:
29-May-2009
http://www.openldap.org/lists/openldap-devel/200304/msg00123.html


Anyone made a REBOL version of this? It's a UTF-8 <-> ISO-8859-1 
converter in C.
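No REBOL port is shown in the thread, but the conversion itself is mechanical, because ISO-8859-1 bytes map one-to-one onto Unicode codepoints U+0000..U+00FF. A minimal sketch in Python (function names are my own, for illustration):

```python
def latin1_to_utf8(data: bytes) -> bytes:
    # each ISO-8859-1 byte IS its Unicode codepoint, so this never fails
    return data.decode("iso-8859-1").encode("utf-8")

def utf8_to_latin1(data: bytes) -> bytes:
    # fails (by design) for codepoints above U+00FF, which Latin-1
    # cannot represent
    return data.decode("utf-8").encode("iso-8859-1")
```

For example, `latin1_to_utf8(b"\xf8")` yields `b"\xc3\xb8"`, the "ø" that appears in Henrik's iso-8859 examples later in this thread.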
Graham:
8-Aug-2009
But if I do a wireshark trace, I see this

GET /20090806.7z HTTP/1.0
Accept: */*
Connection: close
User-Agent: REBOL View 2.7.6.3.1
Host: remr.s3.amazonaws.com

HTTP/1.0 403 Forbidden
Date: Sat, 08 Aug 2009 21:08:07 GMT
Content-Type: application/xml
x-amz-request-id: D03B3FA12CC875D5

x-amz-id-2: u3b7TkPzJc5NBwvov4HRQuMsCsosD7le9xfRMSGiCN2BXgeae6kKMVQAbhzqRDwY
Server: AmazonS3
Via: 1.1 nc1 (NetCache NetApp/6.0.5P1)

<?xml version="1.0" encoding="UTF-8"?>

<Error><Code>AccessDenied</Code><Message>Access Denied</Message><RequestId>D03B3FA12CC875D5</RequestId><HostId>u3b7TkPzJc5NBwvov4HRQuMsCsosD7le9xfRMSGiCN2BXgeae6kKMVQAbhzqRDwY</HostId></Error>
BrianH:
30-Jan-2010
invalid-utf?: funct [
	"Checks for proper UTF encoding and returns NONE if correct or position where the error occurred."
	data [binary!]
	/utf "Check encodings other than UTF-8"
	num [integer!] "Bit size - positive for BE, negative for LE"
] compose [
	ascii: (charset [#"^(00)" - #"^(7F)"])
	utf8+1: (charset [#"^(C2)" - #"^(DF)"])
	utf8+2: (charset [#"^(E0)" - #"^(EF)"])
	utf8+3: (charset [#"^(F0)" - #"^(F4)"])
	utf8rest: (charset [#"^(80)" - #"^(BF)"])
	switch/default any [num 8] [
		8 [ ; UTF-8
			unless parse/all/case data [(pos: none) any [
				pos: ascii | utf8+1 utf8rest |
				utf8+2 2 utf8rest | utf8+3 3 utf8rest
			]] [as-binary pos]
		]
		16 [ ; UTF-16BE
			pos: data
			while [not tail? pos] [
				hi: first pos
				case [
					none? lo: pick pos 2 [break/return pos]
					55296 > w: hi * 256 + lo [pos: skip pos 2]  ; #{D800}
					57343 < w [pos: skip pos 2]  ; #{DFFF}
					56319 < w [break/return pos]  ; #{DBFF}
					none? hi: pick pos 3 [break/return pos]
					none? lo: pick pos 4 [break/return pos]
					56320 > w: hi * 256 + lo [break/return pos]  ; #{DC00}
					57343 >= w [pos: skip pos 4]  ; #{DFFF}
				]
				none
			] ; none = valid, break/return pos = invalid
		]
		-16 [ ; UTF-16LE
			pos: data
			while [not tail? pos] [
				lo: first pos
				case [
					none? hi: pick pos 2 [break/return pos]
					55296 > w: hi * 256 + lo [pos: skip pos 2]  ; #{D800}
					57343 < w [pos: skip pos 2]  ; #{DFFF}
					56319 < w [break/return pos]  ; #{DBFF}
					none? lo: pick pos 3 [break/return pos]
					none? hi: pick pos 4 [break/return pos]
					56320 > w: hi * 256 + lo [break/return pos]  ; #{DC00}
					57343 >= w [pos: skip pos 4]  ; #{DFFF}
				]
				none
			] ; none = valid, break/return pos = invalid
		]
		32 [ ; UTF-32BE
			pos: data
			while [not tail? pos] [
				if any [
					4 > length? pos
					negative? c: to-integer pos
					1114111 < c  ; to-integer #{10FFFF}
				] [break/return pos]
				pos: skip pos 4  ; advance past the valid code unit
			]
		]
		-32 [ ; UTF-32LE
			pos: data
			while [not tail? pos] [
				if any [
					4 > length? pos
					negative? c: also to-integer reverse/part pos 4 reverse/part pos 4
					1114111 < c  ; to-integer #{10FFFF}
				] [break/return pos]
				pos: skip pos 4  ; advance past the valid code unit
			]
		]
	] [
		throw-error 'script 'invalid-arg num
	]
]

; Note: Native in R3, which doesn't support or screen the /utf option yet.

; See http://en.wikipedia.org/wiki/Unicode for charset/value explanations.
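The surrogate arithmetic in the UTF-16 branches above (a high surrogate 0xD800-0xDBFF must be followed by a low surrogate 0xDC00-0xDFFF; a lone low surrogate is invalid) is language-independent. Here is a minimal sketch of the same UTF-16BE check in Python, for comparison (the function name is mine):

```python
def invalid_utf16be(data: bytes):
    """Return None if data is valid UTF-16BE, else the offset of the error."""
    i = 0
    while i < len(data):
        if i + 2 > len(data):
            return i                      # truncated code unit
        w = data[i] << 8 | data[i + 1]
        if w < 0xD800 or w > 0xDFFF:
            i += 2                        # ordinary BMP code point
        elif w > 0xDBFF:
            return i                      # lone low surrogate
        else:                             # high surrogate: need a low one
            if i + 4 > len(data):
                return i                  # truncated surrogate pair
            w2 = data[i + 2] << 8 | data[i + 3]
            if not (0xDC00 <= w2 <= 0xDFFF):
                return i                  # high surrogate not followed by low
            i += 4
    return None
```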
Geomol:
25-May-2010
This can be even more complicated when talking about UTF encoding. Hm, 
who knows how R3 does this...
Henrik:
13-Jun-2010
>> str-enc-utils/iso-8859-15-to-utf-8 "aø"
== "" ; bad
>> str-enc-utils/iso-8859-15-to-utf-8 "ø"
== "ø" ; good
>> str-enc-utils/iso-8859-1-to-utf-8 "aø" ; hangs
Group: View ... discuss view related issues [web-public]
Jerry:
9-Dec-2006
Gabriele, 

Actually, Oldes is right. Showing two-byte characters is good enough. 
An IME is not necessary for REBOL/View, because every Chinese/Japanese/Korean 
OS has proper IMEs installed. An IME sends the codes, encoded in the 
OS codepage, to the focused window. For example, if the codepage used 
by Windows XP is Big5 and I type the character which means one 
( #{A440} in Big5, #{4E00} in Unicode, see http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=4E00
 ), my REBOL/View program will get two key events sequentially, 
#{A4} and #{40}. REBOL/View shows them as two characters instead 
of one. I hope that REBOL/View can let the OS do the text drawing, 
like the REBOL/core console does. The REBOL/core console doesn't have 
the Chinese-character-showing issue, because it basically sends 
#{A4} and #{40} to the console and lets the OS do the text drawing. 
The OS knows that #{A4} and #{40} should be combined into one Big5 character, 
so it shows them as one character. Of course, if I type two 
ASCII characters, the OS is smart enough not to combine them into 
one "non-existing" Big5 character. CJK encodings are supersets of 
ASCII, just like UTF-8 is.


It has nothing to do with Unicode, so it is not too difficult to fix, 
I guess. Please fix this in 2.7.5 or 2.7.6 ...

It's on my wish list for Santa Claus this year.
PeterWood:
30-Oct-2008
I've come across what seems to be an oddity with View on the Mac. It 
seems that the REBOL/View console is using UTF-8 encoding but that 
View is using MacRoman.
Gabriele:
31-Oct-2008
the "console" on Mac and Linux is just a terminal (OS provided), 
and they are usually UTF-8. That has nothing to do with View.
ChristianE:
29-Apr-2010
A hard space can be encoded in UTF-8 as 0xC2 0xA0; 0xC2 on its own is 
#"Â".
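A quick check of those byte values (in Python, since the bytes are the same everywhere):

```python
# U+00A0 (no-break space) encodes to the two bytes C2 A0 in UTF-8
assert "\u00a0".encode("utf-8") == b"\xc2\xa0"
# reading those two bytes as Latin-1 instead yields the familiar mojibake:
# "Â" followed by the (invisible) no-break space
assert b"\xc2\xa0".decode("latin-1") == "\u00c2\u00a0"
```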
Group: I'm new ... Ask any question, and a helpful person will try to answer. [web-public]
Gabriele:
26-Jan-2010
You never notice this (and in fact, don't need to know), because 
strings are always converted on i/o. to binary! string will convert 
it to UTF-8. other i/o will convert it to the platform's standard 
encoding (UTF-8 on Mac and Linux, UTF-16 on Windows)
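Gabriele's point, sketched in Python: the same string yields different bytes depending on the target encoding, which is why the conversion has to happen at the i/o boundary.

```python
s = "café"
# the "to binary!" analogue: always UTF-8
assert s.encode("utf-8") == b"caf\xc3\xa9"
# the Windows-style platform encoding: UTF-16, two bytes per BMP character
assert s.encode("utf-16-le") == b"c\x00a\x00f\x00\xe9\x00"
```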
jack-ort:
2-Jul-2010
Hello - hope someone can find the newbie mistake I'm making here. 
I wanted to use REBOL to tackle a need to get data from Salesforce 
using their SOAP API. I'm new to SOAP, WSDL and Salesforce, but using 
SoapUI I managed to do this POST (edited only to hide personal info):

POST https://login.salesforce.com/services/Soap/u/19.0 HTTP/1.1
Accept-Encoding: gzip,deflate
Content-Type: text/xml;charset=UTF-8
SOAPAction: ""
User-Agent: Jakarta Commons-HttpClient/3.1
Host: login.salesforce.com
Content-Length: 525


<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"
xmlns:urn="urn:partner.soap.sforce.com">
   <soapenv:Header>
      <urn:CallOptions>
         <urn:client></urn:client>
         <urn:defaultNamespace></urn:defaultNamespace>
      </urn:CallOptions>
   </soapenv:Header>
   <soapenv:Body>
      <urn:login>
         <urn:username>[jort-:-xxxxxxxxxxxxx-:-com]</urn:username>

         <urn:password>xxxxxxxxxx78l6g7iFac5uaviDnJLFxxxxx</urn:password>
      </urn:login>
   </soapenv:Body>
</soapenv:Envelope>

and get the desired response:

HTTP/1.1 200 OK
Server: 
Content-Encoding: gzip
Content-Type: text/xml; charset=utf-8
Content-Length: 736
Date: Fri, 02 Jul 2010 20:32:14 GMT


<?xml version="1.0" encoding="UTF-8"?><soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"
xmlns="urn:partner.soap.sforce.com" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><soapenv:Body><loginResponse> 
......

Then using SoapUI I am able to send a successful Logout message.


Using REBOL 2.7.7.3.1, I created an "upload" string containing the 
POST block above without the "POST " at the beginning, set my url 
to:

>> url
== https://login.salesforce.com/services/Soap/u/19.0

and tried this:

>> response: read/custom url reduce ['POST upload]

but consistently get a Server 500 error:


** User Error: Error.  Target url: https://login.salesforce.com:443/services/Soap/u/19.0 
could not be retrieved.  Se
rver response: HTTP...
** Near: response: read/custom url reduce ['POST upload]

For completeness, here's the upload value:

>> print mold upload
{https://login.salesforce.com/services/Soap/u/19.0 HTTP/1.1
Accept-Encoding: gzip,deflate
Content-Type: text/xml;charset=UTF-8
SOAPAction: ""
User-Agent: Jakarta Commons-HttpClient/3.1
Host: login.salesforce.com
Content-Length: 525


<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"
xmlns:urn="urn:partner.soap.sforce.com">

   <soapenv:Header>
      <urn:CallOptions>
         <urn:client></urn:client>
         <urn:defaultNamespace></urn:defaultNamespace>
      </urn:CallOptions>
   </soapenv:Header>
   <soapenv:Body>
      <urn:login>
         <urn:username>[jort-:-researchpoint-:-com]</urn:username>

         <urn:password>metrics12378l6g7iFac5uaviDnJLFVprDl</urn:password>
      </urn:login>
   </soapenv:Body>
</soapenv:Envelope>}

Would appreciate any help you can give!
Group: Tech News ... Interesting technology [web-public]
onetom:
30-Aug-2011
i tried gedit recently on a mac too. luckily there was a binary version, 
because the compilation segfaulted... well, it's quite nice. i could 
see it as an open source alternative, but despite the fact it's 
supposed to support utf-8, it didn't...
Dockimbel:
22-Sep-2011
That should be doable, with the "wearable" version of Arduino boards 
(the Lilypad): http://www.arduino.cc/en/Main/ArduinoBoardLilyPad


There are also some wrist-watch-level Arduino-based prototypes (often 
using an OLED display): http://www.google.fr/search?gcx=w&q=wrist+watch+arduino&um=1&ie=UTF-8&tbm=isch&source=og&sa=N&hl=fr&tab=wi&biw=1113&bih=1036


A few more very creative Arduino watches: http://hackaday.com/tag/watch/

There's even one you can already buy: http://www.getinpulse.com
Group: !REBOL3-OLD1 ... [web-public]
Jerry:
13-Jul-2007
According to http://www.rebol.net/r3blogs/0076.html, in REBOL 3, 
CHAR! is both an 8-bit and a 16-bit character.


This could be problematic, I guess. Why don't we have two different 
datatypes instead: a 16-bit CHAR! and an 8-bit BYTE!? The 16-bit CHAR! 
is in UTF-16, just like Java.

STRING! is a BYTE! string.
UNICODE! is a CHAR! string.

What do you think about that?
Jerry:
14-Jul-2007
Kai, my point is, I don't want an atomic type (which is CHAR!) to 
come in two different sizes. I don't want to write my code like the 
following:

myfunc: func [ ch [ char!] ] 
[
    either ( size? ch ) = 8 [
        ; do something about 8-bit char 
    ] [ ;
        ; do something about UTF-16 char
    ]
]
Gabriele:
14-Jul-2007
currently, unicode is not there yet, so this has not been defined 
yet (i think char! is still 8 bit). but in principle, append a-string 
char-gt-255 will either error out or automatically encode to utf-8 
(the latter would be nice, but it must be done for values greater than 
127, so it would be a problem if you don't want utf-8)
PeterWood:
14-Dec-2007
Louis: From what I can tell from DocBase; initially the unicode support 
will be that the Rebol source will be UTF-8 encoded. The next step 
seems to be changing string! to UTF-8 encoding.


It looks as though work hasn't yet started on the unicode! datatype.
BrianH:
14-Dec-2007
UTF-8 is a strict extension of ASCII, but ASCII is only defined between 
0 and 127. Characters 128+ are not ASCII, they are extensions, and 
their meaning depends on the codepage. The codepage of an 8-bit string 
is unknown, unless you specify it externally (or do a lot of statistical 
calculations). Strings or scripts with characters extended by a codepage 
will have to be translated by a codepage-to-utf-8 function or process 
specific to the particular codepage, ahead of time. Fortunately, 
such a process can be fast and even implemented in a byte-oriented 
language easily.
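The codepage-to-UTF-8 translation BrianH describes is indeed just a per-byte table lookup; a sketch in Python, with cp1252 chosen as an example codepage (the sample text echoes the "Müench" example earlier in this page):

```python
# Bytes 128..255 mean different characters in different codepages; only
# external knowledge ("this file is cp1252") tells us which table to use.
cp1252_bytes = b"M\xfcench"                       # "Müench" in cp1252
utf8_bytes = cp1252_bytes.decode("cp1252").encode("utf-8")
assert utf8_bytes == b"M\xc3\xbcench"             # ü is now two bytes, C3 BC
```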
PeterWood:
14-Dec-2007
BrianH: I understood that UTF-8 encoding can be multi-byte, depending 
on the Unicode codepoint of the character being represented.
PeterWood:
14-Dec-2007
Is my reading of Docbase correct that string! values will be UTF-8 
encoded?
BrianH:
25-Jul-2008
So to answer Louis' question: Not yet, as far as we know. The data 
structures for Unicode strings are there, as are UTF-8 word! values, 
but binary encoding and decoding is not yet there, and there are 
some limits to Unicode input and output (mostly due to the Windows 
console). The encoding/decoding work seems likely to get done as 
a part of Carl's GUI work, as that will probably include text display. 
The console IO limits are likely to remain until the written-in-REBOL 
GUI console is adopted.
PeterWood:
27-Oct-2008
I'm confused by these statements in Docbase:


 "1. The READ-STRING function is a temporary function used to read 
 files and convert them from binary (and possibly in Unicode format) 
 into a string datatype."

I thought that the string datatype was now UTF-8 encoded.
Anton:
28-Oct-2008
Peter, binary mode is the default for READ.

READ-STRING looks at the binary and tries to interpret it, checking 
for unicode format (and maybe other formats), before converting to 
rebol string, which internally is UTF-8.
PeterWood:
28-Oct-2008
So does this mean that the graphics library is still treating a string 
as being 8-bit encoded?  No doubt according to the current Windows 
codepage?


does READ-STRING convert  utf-8 to whatever 8-bit encoding the graphics 
library is using?
Gabriele:
28-Oct-2008
string! internally is NOT utf-8 in R3.
Anton:
28-Oct-2008
Oops. Isn't it utf-16, at least when necessary ?
BrianH:
28-Oct-2008
As far as your code is concerned, a string! will be a series of Unicode 
codepoints. Internally, who cares? The implementation of string! 
is likely to be the same as the native implementation on the platform 
it is running on, or whatever is more efficient. I think that string! 
is now UTF-16 on Windows, and the symbols behind word! values are 
internally UTF-8.


Still, it doesn't matter what strings are internally because AS-STRING 
and AS-BINARY are gone. All string-to-binary conversions will need 
encoding. REBOL scripts are going to be UTF-8 encoded though, as 
I recall.
Gabriele:
29-Oct-2008
string! internals are not OS dependent afaik, and technically it's 
not UTF-16 either. currently, R3 switches automatically between an 
array of 8-bit unsigned values, and an array of 16-bit unsigned values. 
i assume a 32-bit mode will be added in the future as not all codepoints 
will fit 16 bits, though those that don't are very rare.
BrianH:
29-Oct-2008
Peter, the array of unsigned values would effectively be UCS-2 if 
it behaves the way Gabriele says. This would mean it would be faster, 
but use more memory for those with characters outside the BMP. It 
would also cause a problem on Windows because Windows >= 2000 is 
internally UTF-16, as are all of its Unicode APIs.
BrianH:
29-Oct-2008
You could store UTF-16 in an array of unsigned 16-bit values as long 
as your length, insertion and deletion routines are UTF-16 aware.
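The distinction BrianH draws (code units versus codepoints) can be seen by counting both for a string containing a non-BMP character; sketched in Python, where the UTF-16 mechanics are the same:

```python
s = "a\U0001D11E"                 # "a" plus U+1D11E, outside the BMP
units = s.encode("utf-16-le")
# "a" takes one 16-bit code unit; the non-BMP character takes two
# (a surrogate pair), so a naive unit count over-counts by one
assert len(units) // 2 == 3
assert len(s) == 2                # actual codepoint count
```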
PeterWood:
29-Oct-2008
BrianH: As I understand it, UCS-2 cannot be used to encode characters 
outside the BMP. It is a strict subset of UTF-16. It should not cause 
problems with the Windows Unicode APIs, except that it would not be able 
to display characters outside the BMP. (It would instead include 
a non-displayable character for each 2 bytes returned by Windows.)
Gabriele:
31-Oct-2008
Brian: same as there is a conversion between the Linux UTF-8 APIs 
and the internal 16 bit array, there is a conversion between the 
Windows UTF-16 APIs and the internal 16 bit array. In the same way, 
we can eventually support 32 bit arrays and convert those UTF-16 
strings that can fit into it to just 8 bit arrays.
BrianH:
31-Dec-2008
I would not trust non-ascii characters for now. With any luck the 
server saves the messages as binary UTF-8, don't know yet.
Gabriele:
2-Jan-2009
considering that R3 will probably just send everything as UTF-8, 
I don't think the server has any reason at all to worry about encodings.
Gabriele:
2-Jan-2009
you have to worry about encodings when you do conversions. i don't 
see where the R2 server is doing any of that. Also, with UTF-8 there 
is no need to worry about encodings on searches and things like that. 
The only issue could be sorting, but that is also region specific 
so it's a completely different issue that R3 cannot solve globally 
either.
PeterWood:
2-Jan-2009
As you say, if all the input is UTF-8 everything will be fine. I 
mistakenly thought that the inputs from different environments would 
have been differently encoded, as they are with R2.
BrianH:
2-Jan-2009
That would have to be the case with R2 clients, as the client is 
the part that handles character encoding. However, there are no R2 
clients yet. The messages appear to be UTF-8 encoded end-to-end, 
stored in binary on the server, which is encoding agnostic. Once 
we have R2 clients, they will have to handle the codepage-to-UTF-8 
encoding, or just stick to ASCII.
Sunanda:
3-Jan-2009
REBOL.org shows a ? because it blindly emits all AltME pages as charset=utf-8.

If (this works in Firefox) you change your default encoding for the page 
(View / Character Encoding / Western ISO-8859-1) then:
-- Peter's post shows a GBP sign [for his char 163]
-- Chris' post shows a 1/2 [for his char 189]
btiffin:
3-Jan-2009
If I was a betting man, by 2020 UTF-8 will reign and compsci grads 
will need a history book to learn about ASCII.
PeterWood:
4-Jan-2009
Reichart... you are right, the problem is one of encoding. My point 
is that because Rebol/View uses different encoding systems on different 
platforms, it is left to the application to either ignore the encoding 
differences or handle them.


This may be quite difficult if, as Chris indicated, it is not possible 
to determine which Windows codepage is in use from Rebol/View. 


There is a single unified character system (Unicode), but there 
are at least five different ways of representing it (UTF-8, UTF-16LE, 
UTF-16BE, UTF-32LE & UTF-32BE). Standardisation is a long way off.
Gabriele:
4-Jan-2009
Reichart, what I mean is that you don't even need tools, as long 
as the server software properly emits only utf-8 and reports that 
it accepts only utf-8... after doing that, if there are still browsers 
that do not comply, then we can start talking about tools (which 
are trivial, most of the time, by the way).
Sunanda:
4-Jan-2009
Another part of the problem, at least from the webpage viewpoint, 
is that each of us could be posting AltME messages in different charsets.


All the HTML emitters for AltME worlds that I know of (AltME's own, 
REBOL.org, REBOL.net) emit a single webpage file, so it can only 
have one charset.


To do it right, each post should be emitted as a separate document/frame 
item. Then they'll each have their own charset.....That's a lot of 
extra work. Let's hope Gabriele's solution (a utf-8 universe) happens 
before that becomes essential.
Chris:
4-Jan-2009
Brian -- ASCII is a subset of UTF-8...
Chris:
4-Jan-2009
With QM, I try to assume (and enforce) UTF-8 (declaring on forms, 
html escaping everything ASCII+), but it's definitely a chore.
BrianH:
7-Jan-2009
Here's the current source for LOAD:

load: func [
	{Loads a file, URL, or string.}
	source [file! url! string! any-block! binary!]
	/header  {Includes REBOL header object if present. Preempts /all.}
;	/next    {Load the next value only. Return block with value and new position.}
;	/library {Force file to be a dynamic library. (Command version)}
;	/markup  {Convert HTML and XML to a block of tags and strings.}
	/all     {Load all values. Does not evaluate REBOL header.}
	/unbound {Do not bind the block.}
	/local data tmp
][
	; Note: Avoid use of ALL func, because of /all option
	if any-block? :source [return :source]

	data: case [
		string? source [to-binary source]
		binary? source [source]
		; Check for special media load cases: (temporary code)
		find [%.jpg %.jpeg %.jpe] suffix? source [
			return load-jpeg read/binary source
		]

		url? source [read source] ; can this possibly return not binary!?
		file? source [read source] ; binary! or block of file!
	]

	; At this point, data is binary!, a block of file!, or something weird.

	if binary? :data [
		unless find [0 8] tmp: utf? data [
			cause-error 'script 'no-decode ajoin ['UTF tmp]
		]

		; Only load script data:
		if any [header not all] [ ; Note: refinement /all
			if tmp: script? data [data: tmp]
		]
	]

	unless block? :data [data: to block! :data] ; reduce overhead

	; data is a block! here, unless something really weird is going on
	tmp: none
	
	; Is there a REBOL script header:
	if any [header not all] [ ; /header preempts /all
		tmp: unless any [
			;not any [file? source url? source] ; removed: hdr in string is same
			unset? first data ; because <> doesn't work with unset!
			'rebol <> first data
			not block? second data
		][ ; Process header:
			attempt [construct/with second data system/standard/script]
		]
		; tmp is header object or none here
		case [
			tmp [
				remove data
				either header [change data tmp][remove data]
				tmp: tmp/type = 'module ; tmp true if module
			]
			header [cause-error 'syntax 'no-header data]
		]
	]
	; tmp is true if module, false or none if not

	; data is a block!, with possible header object in first position

	; Bind to current global context if not a module:
	unless any [
		unbound
		tmp ; not a module
	][
		bind/new data system/contexts/current
	]

	; data is a block! here, unless something really weird is going on

	; If appropriate and possible, return singular data value:
	unless any [ ; avoid use of ALL
		all
		header ; This fixes a design flaw in R2's LOAD
		;not block? :data ; can this ever happen?
		empty? data ; R2 compatibility
		not tail? next data
	][data: first data]
	; If /all or /header, data is a block here

	:data
]
Steeve:
9-Feb-2009
hum, or you can pass a header block to the write function as is:
>> write [ url!  [ User-Agent: "TOTO" ... ]  #{...data...}]

but it's buggy; you have to add some missing header properties to 
the block yourself to generate a valid request,

like Content-Type: "application/x-www-form-urlencoded; charset=utf-8"
BrianH:
15-Feb-2009
Kib2, likely when the chat server is finished being ported to R3 
on Linux. We were running into problems with Unicode user names on 
R2, since R2 can't do case-insensitive comparisons of Unicode strings, 
even when encoded in UTF-8.
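The limitation BrianH mentions is easy to demonstrate: byte-wise comparison of UTF-8 data can only case-fold the ASCII range, and even per-character lowercasing misses one-to-many folds. Sketched in Python, with German ß as the classic example:

```python
a, b = "STRASSE", "straße"
# ASCII-only lowercasing of the raw UTF-8 bytes cannot match these
assert a.encode("utf-8").lower() != b.encode("utf-8").lower()
# simple per-character lowercasing still misses the fold ß -> ss
assert a.lower() != b.lower()
# full Unicode case folding is what a case-insensitive compare needs
assert a.casefold() == b.casefold()
```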
DideC:
26-Feb-2009
R3 assumes text files are UTF-8 encoded. Probably yours is ANSI or 
something else.
Gabriele:
27-Feb-2009
Brian, are you sure that R3 handles case-insensitiveness in Unicode 
text better than R2 with UTF-8?
BrianH:
28-Feb-2009
Gabriele, I know that R3 handles case-insensitiveness of Latin1 characters 
in Unicode text better than R2 with UTF-8, but beyond that I don't 
know. There is still some stuff to do relative to Unicode. The implications 
of Unicode support on the rest of the system are mostly implemented 
though, with the glaring exception of PARSE.
Gabriele:
1-Mar-2009
Brian: if it's just latin1, does it really change things? it won't 
always work anyway. but there's plenty of code to convert utf-8 
to latin1 in R2, so why not just do that, if it's really important 
to have case-insensitive accented characters in the user names?
BrianH:
1-Mar-2009
Gabriele, Unicode isn't done, so I don't understand this: "does it 
really change things?". I was just answering your question about 
R3. Whether R2 can handle case insensitivity of UTF-8 when third-party 
libraries are added is another issue.


I can't say what Carl's priorities are beyond where he had me mark 
CureCode tickets as urgent, so I can't say when or if Unicode will 
be "finished". User account creation is reenabled, so the whole reason 
this conversation started is moot now.
BrianH:
3-Apr-2009
load: func [
	{Loads a file, URL, or string.}

	source [file! url! string! binary! block!] {Source or block of sources}
	/header  {Includes REBOL header object if present. Preempts /all.}
	/next    {Load the next value only. Return block with value and new position.}
;	/library {Force file to be a dynamic library. (Command version)}
;	/markup  {Convert HTML and XML to a block of tags and strings.}
	/all     {Load all values. Does not evaluate REBOL header.}
	/unbound {Do not bind the block.}
	/local data content val rst tmp
][	; Note: Avoid use of ALL and NEXT funcs, because of /all and /next options
	content: val: rst: tmp: none ; In case people call LOAD/local
	
	; Retrieve the script data
	data: case [
		block? source [ ; Load all in block
			return map x source [apply :load [:x header next all unbound]]
		]
		string? source [source] ; Will convert to binary! later
		binary? source [source]
		; Otherwise source is file or url
		'else [
			; See if a codec exists for this file type
			tmp: find find system/catalog/file-types suffix? source word!
			; Get the data, script required if /header
			content: read source  ; Must be a value, not unset
			case [
				binary? :content [content] ; Assumed script or decodable
				string? :content [content] ; Assumed script or decodable
				header [cause-error 'syntax 'no-header source]
				block? :content [content]
				'else [content: reduce [:content]]
			] ; Don't LOAD/header non-script data from urls and files.

		] ; content is data if content doesn't need copying, or none if it does
	]
	;print [1 "data type?" type? :data 'content true? :content]
	if string? :data [data: to-binary data] ; REBOL script is UTF-8

	assert/type [data [binary! block!] content [binary! string! block! none!]]
	assert [any [binary? :data not header]]
	if tmp [ ; Use a codec if found earlier
		set/any 'data decode first tmp :data

		; See if we can shortcut return the value, or fake a script if we can't
		case [
			block? :data [if header [insert data val: make system/standard/script []]]
			header [data: reduce [val: make system/standard/script [] :data]]
			(to logic! unbound) and not next [return :data] ; Shortcut return
			any [next any-block? :data any-word? :data] [data: reduce [:data]]
			'else [return :data] ; No binding needed, shortcut return
		]
		assert/type [data block!] ; If we get this far
	]
	;print [2 'data mold to-string :data]
	
	if binary? :data [ ; It's a script
		unless find [0 8] tmp: utf? data [ ; Not UTF-8
			cause-error 'script 'no-decode ajoin ["UTF-" abs tmp]
		]
		; Process the header if necessary
		either any [header not all] [
			if tmp: script? data [data: tmp] ; Load script data
			; Check for a REBOL header
			set/any [val rst] transcode/only data
			unless case [
				:val = [rebol] [ ; Possible script-in-a-block
					set/any [val rst] transcode/next/error rst
					if block? :val [ ; Is script-in-a-block
						data: first transcode/next data
						rst: skip data 2
					] ; If true, val is header spec
				]
				:val = 'rebol [ ; Possible REBOL header
					set/any [val rst] transcode/next/error rst
					block? :val ; If true, val is header spec
				]
			] [ ; No REBOL header, use default
				val: [] rst: data
			]
			; val is the header spec block, rst the position afterwards

			assert/type [val block! rst [binary! block!] data [binary! block!]]
			assert [same? head data head rst]
			; Make the header object
			either val: attempt [construct/with :val system/standard/script] [
				if (select val 'content) = true [
					val/content: any [:content copy source]
				]
			] [cause-error 'syntax 'no-header data]
			; val is correct header object! here, or you don't get here
			; Convert the rest of the data if necessary and not /next
			unless any [next block? data] [data: rst: to block! rst]
			if block? data [ ; Script-in-a-block or not /next
				case [

					header [change/part data val rst] ; Replace the header with the object
					not all [remove/part data rst] ; Remove the header from the data
				]
				rst: none ; Determined later
			]
		] [rst: data] ; /all and not /header
	]

	; val is the header object or none, rst is the binary position after or none
	assert/type [val [object! none!] rst [binary! none!] data [binary! block!]]
	assert [any [none? rst same? head data head rst] any [val not header]]
	;print [3 'val mold/all :val 'data mold/all :data "type?" type? :data]
	
	; LOAD/next or convert data to block - block either way
	assert [block? data: case [
		not next [ ; Not /next
			unless any [block? data not binary? rst] [data: to block! rst]
			data
		]
		; Otherwise /next
		block? data [reduce pick [[data] [first+ data data]] empty? data]
		header [reduce [val rst]] ; Already transcoded above
		binary? rst [transcode/next rst]
	]]
	
	; Bind to current global context if not a module
	unless any [ ; Note: NOT ANY instead of ALL because of /all
		unbound
		(select val 'type) = 'module
	][
		bind/new data system/contexts/current
	]
	;print [6 'data mold/all :data 'tmp mold/all :tmp]
	
	; If appropriate and possible, return singular data value
	unless any [
		all header next  ; /all /header /next
		empty? data
		1 < length? data
	][set/any 'data first data]
	;print [7 'data mold/all :data]
	
	:data
]
shadwolf:
9-Apr-2009
and since text editing is related to UTF-8, better to get all these 
things finished so we don't constantly have to redo part of the job
Gabriele:
21-Apr-2009
Geomol, the difference I'm pointing out is the following: suppose 
you have an array of unicode code points. each element in the array 
is an integer that represents a character. you can "encode" it to 
UTF-8. there is no magic, for each integer you have a corresponding 
sequence of bytes.
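Gabriele's description of encoding can be illustrated in Python (a stand-in here, not REBOL's own mechanism): each code point in the array maps to a fixed byte sequence, with no magic involved.

```python
# An array of Unicode code points: each integer represents one character.
codepoints = [0x48, 0x65, 0x6A, 0x20AC]  # 'H', 'e', 'j', '€'

# "Encoding" is a pure mapping: each integer yields a fixed byte sequence.
utf8 = "".join(chr(cp) for cp in codepoints).encode("utf-8")
print(utf8)  # b'Hej\xe2\x82\xac' - the Euro sign becomes three bytes
```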
Gabriele:
21-Apr-2009
Now, if your array was representing a url, you could encode it to 
UTF-8 using the % encoding as well to stay in the ascii subset. This 
is encoding, but still, it will not solve your @ problem. each @ 
in the array of integers will become an @ (which is an ascii char) 
in the final string.
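The same observation sketched in Python (`urllib.parse.quote` is just an illustrative percent-encoder, not anything R3 would use): percent-encoding the non-ASCII characters still leaves a literal `@`, which is already ASCII, untouched.

```python
from urllib.parse import quote

# Percent-encode everything except '@': the non-ASCII 'é' becomes its
# UTF-8 bytes %C3%A9, but '@' - already an ASCII char - passes through.
print(quote("café@example", safe="@"))  # caf%C3%A9@example
```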
Geomol:
21-Apr-2009
Maybe we got unicode encoding and escape encoding confused.


As I see it, given correct rules, auto converting of user input to 
correct url can be achieved. I made this function to illustrate, 
what I mean (it's not optimized, but should be easy to read):

encode-url: func [input /local url components host] [
	components: parse input "@"
	host: back tail components

	url: copy ""  ; copy, not clear: a cleared literal is shared across calls
	append url components/1
	components: next components

	forall components [
		either components = host [
			append url "@"
			append url components/1
		][
			append url "%40"
			append url components/1
		]
	]
	url
]


I can use it both with and without specifying %40 for the first @ 
in the url:

>> encode-url "ftp://name@home.net:pass@server.net"
== "ftp://name%40home.net:pass@server.net"
>> encode-url "ftp://name%40home.net:pass@server.net"
== "ftp://name%40home.net:pass@server.net"


It will give the correct result in both cases (I use strings, but 
of course it should be the url! datatype in REBOL). Now comes Unicode: 
given precise rules for how that should happen, I see no problem 
with encoding this in e.g. UTF-8.


So I think it's possible to do this correctly. But maybe it's better 
to keep it simple and not do such auto conversions. In any case, 
the behaviour needs to be well documented, so users can figure out 
how to create a valid url. I had the same problem as Pekr years ago, 
and I missed documentation of that.
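Geomol's rule (keep the last `@` as the user-info/host separator, escape the earlier ones) can be sketched in Python; `encode_url` here is a hypothetical helper mirroring his REBOL function, not part of any library.

```python
def encode_url(s: str) -> str:
    """Escape every '@' except the last, which separates user info from host."""
    head, sep, host = s.rpartition("@")
    if not sep:  # no '@' at all: nothing to encode
        return s
    return head.replace("@", "%40") + "@" + host

# Works with or without a pre-escaped %40, like the REBOL version:
print(encode_url("ftp://name@home.net:pass@server.net"))
print(encode_url("ftp://name%40home.net:pass@server.net"))  # idempotent
```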
Pekr:
9-Jul-2009
Brian - I don't understand proposal for invalid-utf-8 function. What 
is it good for? Is it about some binary code not being able to be 
converted to regular char?
BrianH:
9-Jul-2009
It's about finding UTF-8 encoding errors, particularly the overlong 
forms that are used for security breaches. We can't do that check 
in TO-STRING because of the overhead (+50%), but it can still be 
a good idea to check in some cases, and the code is better written 
in C than REBOL.
BrianH:
9-Jul-2009
TO-STRING is the primary decoder of UTF-8 in REBOL. TO-CHAR is the 
other one, and it complains about invalid UTF because it can.
BrianH:
31-Jul-2009
Except in binary. TRANSCODE works on UTF-8 binaries now. I need to 
adjust that ticket accordingly.
BrianH:
31-Jul-2009
All standard functions and syntax in REBOL fit within 7-bit ASCII, 
which is why R3 source is UTF-8.
BrianH:
31-Jul-2009
UTF-8 encoded binary!
PeterWood:
4-Aug-2009
Pekr - it is actually an a with a grave accent over it in UTF-8
Paul:
16-Aug-2009
I see Carl is going to add Read/Text functionality with UTF conversion. 
 That is going to be sweet.  That alone should begin to make REBOL3 
useful.
Pekr:
11-Sep-2009
REBOL 3.0 accepts UTF-8 encoded scripts, and because UTF-8 is a superset 
of ASCII, that standard is also accepted.

If you are not familiar 
with the UTF-8 Unicode standard, it is an 8 bit encoding that accepts 
ASCII directly (no special encoding is needed), but allows the full 
Unicode character set by encoding them with characters that have 
values 128 or greater.
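The superset property Pekr describes is easy to check in Python (purely as an illustration): pure-ASCII text encodes to byte-identical output, while characters with values 128 or greater expand to multi-byte sequences.

```python
ascii_text = "REBOL 3.0"
# Pure ASCII needs no special encoding: the UTF-8 bytes are identical.
assert ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# Characters with values 128 or greater use multi-byte sequences.
print("á".encode("utf-8"))  # b'\xc3\xa1' - two bytes for one character
```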
Maxim:
11-Sep-2009
string! printing, to be more precise.  UTF and ASCII are  converted 
to two byte strings IIRC.  which is why you must re-encode them before 
spitting them via print.
PeterWood:
11-Sep-2009
Running R3 from the Mac terminal the output from the print function 
is definitely utf-8 encoded.
PeterWood:
11-Sep-2009
I think that to binary! will decode a Rebol string! to utf-8 :

>> to binary! "^(20ac)"  ;; Unicode code point for Euro sign     
== #{E282AC} ;; utf-8 character sequence for Euro sign
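PeterWood's result can be cross-checked outside REBOL; in Python the same code point yields the same three bytes as R3's `#{E282AC}`:

```python
euro = "\u20ac"  # Unicode code point for the Euro sign
print(euro.encode("utf-8"))  # b'\xe2\x82\xac' - matches REBOL's #{E282AC}
```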
Pekr:
11-Sep-2009
But this is some low-level issue I should not care about. It displays 
the Czech codepage correctly. Also the script is said to be UTF-8 
by default, which is a superset of ASCII. IIRC it was said that unless 
we use special chars, it will work transparently. If it works on 
input, it should work also on output, no?
Maxim:
11-Sep-2009
but the loading actually does a re-encoding. UTF-8 is compact, but 
it's slow because you cannot skip ahead without traversing the string 
char by char. Which is why they are internally converted to 8- or 
16-bit unicode chars... it seems strings become 16 bits a bit too 
often (maybe a change in later releases, where they are always converted 
to 16 bits for some reason).
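Maxim's point about skipping: in UTF-8 the byte offset of the n-th character is unknown until you scan, because continuation bytes (`10xxxxxx`) don't start characters. A minimal Python sketch of that linear scan (`nth_char_offset` is a hypothetical helper, not R3's internal routine):

```python
def nth_char_offset(buf: bytes, n: int) -> int:
    """Byte offset of the n-th code point - found only by a linear scan."""
    count = 0
    for i, byte in enumerate(buf):
        if byte & 0xC0 != 0x80:  # skip continuation bytes (0b10xxxxxx)
            if count == n:
                return i
            count += 1
    raise IndexError(n)

data = "€abc".encode("utf-8")    # b'\xe2\x82\xacabc'
print(nth_char_offset(data, 1))  # 3 - the '€' occupied bytes 0..2
```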
BrianH:
11-Sep-2009
Windows Unicode works in UTF-16. Linux and OSX work in UTF-8.
Maxim:
11-Sep-2009
ah yess.. --cgi could just tell the core to prevent the UTF-16 encoding 
being done on stdout...
Maxim:
11-Sep-2009
but if we need to output latin-1 afterwards (while dumping the html 
content, for example), the output encoding  should be selectable 
as a "current default", and all the --cgi would do is set that default 
to UTF-8 for example.
Pekr:
11-Sep-2009
How is it that Linux and OS X don't experience any problems? They 
do use UTF-8, but that is not ASCII either, no?
Maxim:
11-Sep-2009
UTF-8's codes below 128 are the same as ASCII, and single byte. So 
if you don't use special chars, or the null char, you are basically 
dumping ASCII... this is the reason for its existence.
Maxim:
11-Sep-2009
(UTF-8)
Pekr:
11-Sep-2009
hmm, and why does Windows use UTF-16? Is it because the Windows console 
defaults to UTF-16?
Maxim:
11-Sep-2009
probably it doesn't even support UTF-8 in any way.
Maxim:
11-Sep-2009
IIRC the whole windows API is either ASCII or UTF-16.
Pekr:
8-Oct-2009
I am curious about HOW we actually fix the unicode issues. This might 
be a deeper problem than it seems. Because if I am not able to print 
in UTF-8, I need to first print the header using some conversion, 
and then the content = the code is not easily cross-platform ...
BrianH:
8-Oct-2009
CGI output should be binary, and the headers output in 7bit ASCII 
(not UTF-8) through that binary output.
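BrianH's suggestion sketched in Python (a stand-in for an R3 CGI script, under the assumption that the CGI channel is a raw byte stream): emit headers as 7-bit ASCII bytes and the body as UTF-8, both through the binary output so no platform re-encoding interferes.

```python
import sys

headers = "Content-Type: text/html; charset=utf-8\r\n\r\n"
body = "<p>Price: €10</p>"

# Headers must stay within 7-bit ASCII; the body may be any UTF-8.
out = headers.encode("ascii") + body.encode("utf-8")
sys.stdout.buffer.write(out)  # binary output, no text-layer conversion
```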
Henrik:
22-Oct-2009
A91 released with some UTF-16 support
BrianH:
26-Oct-2009
Chris: "Is 'load/next supposed to return binary as the second part 
of the result?"

Yes. R3 source is defined as binary encoded in UTF-8, not as a string. 
LOAD/next of a dir or url which returns a block on read, or of a 
script-in-a-block will return a block reference as the next though.
Carl:
26-Oct-2009
Note that the header would remain clear text, UTF-8.
Pekr:
29-Oct-2009
hmm, interesting. R3 scripts should be UTF-8 by default, but dunno 
if it should, or should not work ...
BrianH:
29-Oct-2009
Showing the correct character in a string might be a console font 
thing. Is %test encoded in UTF-8?