AltME groups: search
results summary
world | hits |
r4wp | 115 |
r3wp | 287 |
total: | 402 |
results window for this page: [start: 301 end: 400]
world-name: r3wp
Group: !AltME ... Discussion about AltME [web-public] | ||
Kaj: 16-Jan-2011 | I think they're each interpreting text according to their own native character set. Windows in UTF-16, OS X and Linux probably in UTF-8. AltME doesn't compensate | |
PeterWood: 16-Jan-2011 | Yes, AltME simply ignores character encoding. It simply regurgitates the text it receives from the client. I believe that REBOL/View uses the default codepage under Windows (not UTF-16) and MacRoman under OS X. I suspect it uses ISO-8859-1 under Linux but am not sure. | |
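[Editor's note] The round-trip failure described above is easy to reproduce outside AltME. A minimal sketch, in Python purely for illustration (the codec names are Python's, and the exact code pages involved would depend on each client's locale):

```python
# Bytes written under one platform's native encoding, read under another's:
# this is the mojibake AltME users saw, since the server relays raw bytes.
text = "café"
win_bytes = text.encode("cp1252")          # as a Windows (Western Europe) client sends it
misread = win_bytes.decode("mac_roman")    # as a Mac client interprets the same bytes
assert misread != text                     # the accented character is garbled
assert win_bytes.decode("cp1252") == text  # same-platform readers are unaffected
```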
Group: Parse ... Discussion of PARSE dialect [web-public] | ||
Henrik: 5-Dec-2011 | That's fine by me, as I read the file into memory once due to the need for one-time UTF-8 conversion, so that will happen outside LOAD-CSV. | |
Group: !REBOL3-OLD1 ... [web-public] | ||
Maxim: 30-Oct-2009 | I also think the "default" user text format should be configurable. I have absolutely no desire to start using utf-8 for my code and data, especially when I have a lot of stuff that already is in iso latin-1 encoding. | |
Pekr: 30-Oct-2009 | It is really not good that I can't load my own local codepage. How should I make my source file UTF-8? My Notepad will probably not add any BOM header for me automatically ... | |
Maxim: 30-Oct-2009 | utf-8 needs no BOM... it's only used as a signature. | |
Maxim: 30-Oct-2009 | since rebol will load files as UTF-8 by default, code doesn't need it. | |
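[Editor's note] For reference, the BOM under discussion is just the three-byte signature EF BB BF, and UTF-8 text decodes the same with or without it. An illustrative Python sketch:

```python
import codecs

body = 'REBOL [Title: "demo"]'.encode("utf-8")
with_bom = codecs.BOM_UTF8 + body            # EF BB BF prepended as a signature
assert with_bom[:3] == b"\xef\xbb\xbf"
# A BOM-unaware decoder sees a leading U+FEFF; the "utf-8-sig" codec strips it.
assert with_bom.decode("utf-8")[0] == "\ufeff"
assert with_bom.decode("utf-8-sig") == body.decode("utf-8")
```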
PeterWood: 30-Oct-2009 | ..and sticking to the old ways means living with the old problems ... like not knowing how to interpret characters properly ... like AltME for example ... it assumes that all text in messages is encoded as though it was entered on your own machine. So messages from Mac users are incorrectly displayed on Windows machines and vice-versa. For me, moving to utf-8 is a much easier problem to live with than not being able to properly share text across different platforms. It may be different for you. | |
Henrik: 30-Oct-2009 | REBOL3's philosophy should be simple: UTF-8 is default. Anything else is possible, but must be optionally selected. | |
sqlab: 30-Oct-2009 | Then I would prefer that name and the string to compare have a unicode datatype, as in >> type? name == UTF-8. | |
Maxim: 30-Oct-2009 | but utf-8 editors aren't rare nowadays, and using utf-8 sequences isn't hard... really, if you truly want to keep using an ASCII editor | |
Maxim: 30-Oct-2009 | at least converging to utf-8, all scripts by all authors will work the same on all systems. | |
Maxim: 30-Oct-2009 | I put a suggestion on the blog about allowing user-created encoding maps... otherwise, you can load it as binary in R3 and just convert the czech chars to utf-8 multi-byte sequences and convert the binary to string using decode. | |
Maxim: 30-Oct-2009 | R3 will interpret literal strings and decode them using utf-8 (or the header encoding, if it's supported), so in this case no. but if the data is stored within binaries (equivalent to R2, which doesn't handle encoding) then, yes, since the binary represents the sequence of bytes, not chars. if you use a utf-8 editor, and type characters above 127 and look at them in notepad, you will then see the UTF-8 byte sequences (which will look like garbled text, obviously). | |
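[Editor's note] The multi-byte sequences Maxim mentions are easy to inspect; a small Python illustration (the Czech character is just an example):

```python
# U+010D (č) is one character but two bytes in UTF-8; a Latin-1 viewer
# shows those two bytes as two unrelated characters, i.e. "garbled text".
ch = "\u010d"                             # č
utf8 = ch.encode("utf-8")
assert utf8 == b"\xc4\x8d"                # the two-byte UTF-8 sequence
assert len(utf8.decode("latin-1")) == 2   # misread as two separate chars
```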
Pekr: 30-Oct-2009 | Is there utf-8 version of notepad? :-) | |
PeterWood: 30-Oct-2009 | Notepad can apparently handle both UTF-8 and UTF-16 http://en.wikipedia.org/wiki/Notepad_(Windows) | |
Maxim: 30-Oct-2009 | it tries to detect UTF based on text content... broken up until vista. http://en.wikipedia.org/wiki/Notepad_%28Windows%29 | |
Gabriele: 31-Oct-2009 | Max: maybe you should start using a real operating system. But, that aside, if you have any software that does not handle utf-8, simply trash it. guys, really, this is crazy, we are in 2009, let's put an end to this codepage crap! | |
Gabriele: 31-Oct-2009 | sqlab: what you say would make some sense if converting files was in any way difficult. (apart from the fact that you should have stopped using latin1 almost 10 years ago...). I've been using utf-8 with R2 for years... | |
Gabriele: 31-Oct-2009 | Petr: notepad, as most windows stuff, uses utf-16. much easier to detect though, and R3 could do that (actually, didn't Carl just add that recently?) most "real" editors allow you to use whatever encoding you want, and definitely support utf-8. | |
Pekr: 31-Oct-2009 | Aha, I just realised that I have to use Save-as and choose UTF-8 or Unicode, instead of the default ANSI preset of Notepad | |
Gabriele: 1-Nov-2009 | Max, maybe i was not clear. If your rebol scripts are latin1 by default, while my rebol scripts are utf-8 by default, when i send you a rebol script IT WILL NOT WORK in the same way in your machine. the *script*'s encoding *must* be a standard everyone agrees on. then, the script can do whatever it wants with the data, it's your fault if you make it so data cannot be exchanged easily among systems. | |
Pekr: 1-Nov-2009 | jocko - the same happened to me here under Windows. The problem is, that I used plain Notepad, which by default stores in ANSI compatible charset. Then I realised, that on a Save-as dialog, there is a button, where I can change ANSI to UTF-8 unicode. Then my strings loaded correctly. So - you have to be sure that your editor by default saves in UTF-8. | |
jocko: 1-Nov-2009 | Yes, that was the problem ... and I already had it. But it will really be a trap for many non english users, from many countries. Another point to consider is that we may have difficulties reading normal (non-UTF-8) text files coming from other environments. In R2, this constraint did not exist. | |
Maxim: 1-Nov-2009 | actually, it is a problem in R2. if you store your code, and I open it with a different codepage version of windows... some letters will be skewed. In an application I wrote, I couldn't write out proper strings for the Netherlands, as an example. unicode is slowly becoming the standard for text... especially utf-8. but yes, users have to be educated. within your apps, though, you can handle the encoding as you want... only the rebol sources have to be UTF-8. as R3 matures, more encodings will most probably be included in string codecs to support 8-bit extended ASCII from different areas of the world. and even high-profile applications like Apple's iWeb have issues with text encoding... so this is a problem for the whole industry & users to adapt to. | |
BrianH: 1-Nov-2009 | One interesting thing about R3 scripts is that they are UTF-8 *binary*, not converted strings. A header setting would just require R3 to convert the script to string! and then back to UTF-8 binary before reading the file. This is why we recommend that people DO [1 + 1] instead of DO "1 + 1", because that string needs to be converted to binary before it can be parsed. | |
BrianH: 1-Nov-2009 | Even if we had a text encoding header for R3, it would be a *bad* idea to ever use encodings other than UTF-8. So don't. | |
Maxim: 14-Dec-2009 | My only problem with R3 right now is that there is no codec for text reading. This means I can't properly import C files, for example, unless I convert them to utf-8 with something else first. Has anyone done (or started to work on) a simple character mapping engine? | |
Group: !Cheyenne ... Discussions about the Cheyenne Web Server [web-public] | ||
Graham: 19-Aug-2009 | this is the request GET /md/creategoogledoc.rsp?gdoc=simple-letter.rtf&patientid=2832&encounter=none HTTP/1.1 Host: gchiu.no-ip.biz:8000 User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 (.NET CLR 3.5.30729) Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8 Accept-Language: en-us,en;q=0.5 Accept-Encoding: gzip,deflate Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7 Keep-Alive: 300 Connection: keep-alive Referer: http://gchiu.no-ip.biz:8000/md/Listgoogledocs.rsp Cookie: RSPSID=QZPTPCZIWWMMYBKWHWRQETGM | |
Will: 19-Aug-2009 | answer from the redirection: HTTP/1.1 302 Moved Temporarily Content-Type: text/html; charset=UTF-8 Cache-Control: no-cache, no-store, max-age=0, must-revalidate Pragma: no-cache Expires: Fri, 01 Jan 1990 00:00:00 GMT Date: Wed, 19 Aug 2009 21:43:58 GMT Set-Cookie: WRITELY_UID=001dfpwvx2b|928b9de9e7bf56448b665282fc69988b; Path=/; HttpOnly Set-Cookie: GDS_PREF=hl=en;Expires=Sat, 17-Aug-2019 21:43:58 GMT;HttpOnly Set-Cookie: SID=DQAAAHcAAAB0kldc4zZSC_0FoiL6efkWE11k9SQkAIn-N3WfAzIOVe1cM-remnLUtV3Z4M-BFRf5eknz7hr_U3YzW94nECo0-aDnpxrLGiBglWGN4VkfLr5Hh7t2XNyRCA3VWd005SfCmZ9D8-1MUltjRI8X56VLde5Wy8HD92gh-8YkJBJxQA;Domain=.google.com;Path=/;Expires=Sat, 17-Aug-2019 21:43:58 GMT Location: https://www.google.com/accounts/ServiceLogin?service=writely&passive=true&nui=1&continue=http%3A%2F%2Fdocs.google.com%2FDoc%3Fdocid%3D0AcdrOHdpKfrWZGZwd3Z4MmJfMnNxcDJkNmZu%26amp%3Bhl%3Den&followup=http%3A%2F%2Fdocs.google.com%2FDoc%3Fdocid%3D0AcdrOHdpKfrWZGZwd3Z4MmJfMnNxcDJkNmZu%26amp%3Bhl%3Den<mpl=homepage&rm=false Content-Encoding: gzip X-Content-Type-Options: nosniff Content-Length: 325 Server: GFE/2.0 | |
Will: 19-Aug-2009 | more redirection: HTTP/1.1 302 Moved Temporarily Set-Cookie: WRITELY_SID=DQAAAHoAAADh80lBIw7e5Hg06TLEBgCY33XQGJ1aUH5OrCF_ir1xLwffKNaCqNdUL6qYfvgjNppDBI4lTNBSTjJWMG_Ze0_qJnveBCAtihBDFwBlOb-H7RlkfgJwM7pBbyKV7bm4M3mqUivD1emtpxgl32vG8CEP1poQ2479HQXrlobsp7Egzw;Domain=docs.google.com;Path=/;Expires=Thu, 03-Sep-2009 21:43:59 GMT Location: http://docs.google.com/Doc?docid=0AcdrOHdpKfrWZGZwd3Z4MmJfMnNxcDJkNmZu&%3Bhl=en&pli=1 Content-Type: text/html; charset=UTF-8 Content-Encoding: gzip Date: Wed, 19 Aug 2009 21:43:59 GMT Expires: Wed, 19 Aug 2009 21:43:59 GMT Cache-Control: private, max-age=0 X-Content-Type-Options: nosniff Content-Length: 232 Server: GFE/2.0 | |
Will: 19-Aug-2009 | and the target page: HTTP/1.1 200 OK Set-Cookie: WRITELY_SID=DQAAAHoAAADh80lBIw7e5Hg06TLEBgCY33XQGJ1aUH5OrCF_ir1xLwffKNaCqNdUL6qYfvgjNppDBI4lTNBSTjJWMG_Ze0_qJnveBCAtihBDFwBlOb-H7RlkfgJwM7pBbyKV7bm4M3mqUivD1emtpxgl32vG8CEP1poQ2479HQXrlobsp7Egzw;Domain=docs.google.com;Path=/;Expires=Thu, 03-Sep-2009 21:43:59 GMT Set-Cookie: GDS_PREF=hl=en;Expires=Sat, 17-Aug-2019 21:43:59 GMT;HttpOnly Set-Cookie: user=; Expires=Tue, 18-Aug-2009 21:43:59 GMT; Path=/; HttpOnly Set-Cookie: login=; Expires=Tue, 18-Aug-2009 21:43:59 GMT; Path=/; HttpOnly Content-Type: text/html; charset=UTF-8 Cache-Control: no-cache, no-store, max-age=0, must-revalidate Pragma: no-cache Expires: Fri, 01 Jan 1990 00:00:00 GMT Date: Wed, 19 Aug 2009 21:43:59 GMT Content-Encoding: gzip Transfer-Encoding: chunked X-Content-Type-Options: nosniff Server: GFE/2.0 | |
Will: 21-Aug-2009 | worth noting (in the headers posted in the Cheyenne group): the first redirect sends: Cache-Control: no-cache, no-store, max-age=0, must-revalidate Pragma: no-cache Expires: Fri, 01 Jan 1990 00:00:00 GMT the second one: Expires: Wed, 19 Aug 2009 21:43:59 GMT Cache-Control: private, max-age=0 also worth noting, the first one also sends: Content-Type: text/html; charset=UTF-8 but no Content-Length the second sends: Content-Encoding: gzip Content-Length: 232 and no Content-Type .. a real mess.. normally I trust Google, they are very picky, but here I don't understand | |
Dockimbel: 25-Dec-2009 | Important notice wrt web sockets : IIRC, all data sent on both sides have to be UTF-8 encoded. The current Cheyenne implementation doesn't enforce that encoding, so it's up to the developer to send the right data format. | |
Terry: 25-Dec-2009 | UTF-8 support is icing on the cake. | |
Graham: 25-Dec-2009 | Not using the default config .. but I get this 26/12-10:17:23.838-[RSP] ##RSP Script Error: URL = /ws.rsp File = www/ws.rsp ** Script Error : Invalid path value: data ** Where: rsp-script ** Near: [prin request/content/data] Request = make object! [ headers: [Host "localhost:8000" Connection "keep-alive" User-Agent {Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.43 Safari/532.5} Accept {application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5} Accept-Encoding "gzip,deflate" Accept-Language "en-GB,en-US;q=0.8,en;q=0.6" Accept-Charset "ISO-8859-1,utf-8;q=0.7,*;q=0.3"] status-line: #{474554202F77732E72737020485454502F312E310D0A} method: 'GET url: "/ws.rsp" content: none path: "/" target: "ws.rsp" arg: none ext: '.rsp version: none file: %www/ws.rsp script-name: none ws?: none ] | |
Graham: 13-Feb-2010 | POST /cgi-bin/rebdev HTTP/1.0 Accept: */* Accept-Charset: utf-8 Host: host.rebol.net User-Agent: REBOL Content-Type: application/x-www-form-urlencoded; charset=utf-8 Content-Length: 56 [0.4.0 "Graham" password login]HTTP/1.1 500 Internal error in REBOL CGI Proxy Date: Sun, 14 Feb 2010 02:05:45 GMT Server: Apache/2.0.53 (Fedora) Content-Type: text/plain; charset=UTF-8 Via: 1.1 bc7 Connection: close Cannot connect (3) | |
Dockimbel: 9-Dec-2011 | Just something to keep in mind when working on websockets: the transfer mode used by Cheyenne to reply to clients is "text" mode. This mode requires UTF-8 encoding and IIRC, the browser is allowed to reject your response and close the connection if the response is wrongly encoded. | |
Endo: 9-Dec-2011 | Is it UTF-8 in your chat example? Cheyenne converts text to UTF-8? Text mode is ok to me. By the way, I tested ws.html in Cheyenne sources on my XP/Home yesterday with Chrome, it closes the connection immediately. But it works here now, on XP/Pro with Chrome. | |
Dockimbel: 9-Dec-2011 | Chat demo: no conversion, it's UTF-8 as long as everyone talks in english. ;-) | |
Dockimbel: 9-Dec-2011 | Chat demo: in fact, it should work ok in all cases, because the UTF-8 encoding is done by the browser and the chat back-end just broadcasts it as is to the others. | |
Dockimbel: 21-Dec-2011 | Sounds like a good idea (making INCLUDE remove UTF-8 BOM, if found). | |
Group: !REBOL2 Releases ... Discuss 2.x releases [web-public] | ||
BrianH: 23-Jan-2010 | Yes, this means that we have fully working SINGLE?, COLLECT-WORDS, INVALID-UTF? and RESOLVE. Even R3 doesn't have a fully working INVALID-UTF? yet; since R2/Forward is mezzanine we can fix bugs that are still pending in the R3 natives. | |
BrianH: 30-Jan-2010 | Posted mezzanine changes for 2.7.8, ported from R2/Forward 2.100.80.0: - Added COLLECT-WORDS, RESOLVE, SINGLE?, IMMEDIATE!, INTERNAL!, INVALID-UTF?, CD, MORE, and the convenience words LS, PWD, RM and MKDIR. - Removed buggy binary! support from ASCII? and LATIN1?, as done in 2.100.60. See mezz-control.r #6763, mezz-file.r #6776, mezz-series.r #6772, mezz-string.r #6773 and mezz-reflect.r #6771 for the relevant changes. Details in R3's docs. Note: The APPEND and REMOLD rewrites are too awkward to incorporate without a native APPLY function. UNBIND hasn't been written yet (hoping for a native). | |
BrianH: 26-Mar-2010 | 2.7.8 additions from R2/Forward: RESOLVE, CD, MORE, LS, PWD, RM, MKDIR, SINGLE?, COLLECT-WORDS, INVALID-UTF?, and some compatibility fixes to ASCII? and LATIN1?. | |
BrianH: 31-Dec-2010 | Some of what is coming in 2.7.8: - Bug fixes and enhancements to improve Cheyenne, and other apps that have to do similar stuff. - Some native fixes for non-Windows platforms, particularly Linux. - Environment variable stuff: GET-ENV expansion on Windows, SET-ENV, LIST-ENV - Function fixes: RUN enabled, LIST-REG/values, possibly TO-LOCAL-FILE - R2/Forward: FUNCT/extern, LAST?, COLLECT-WORDS, EXTRACT fixes, ASCII? fixes, LATIN1? fixes, INVALID-UTF?, CD, LS, MORE, PWD, RM - (Still pending) Natives: ASSERT, APPLY, RESOLVE, FOREACH set-word support | |
BrianH: 2-Jan-2011 | What we got in 2.7.8, that I know of: - Bug fixes and enhancements to improve Cheyenne, and other apps that have to do similar stuff. - Some native fixes for non-Windows platforms, particularly Linux. See ACCESS-OS. - Environment variable stuff: GET-ENV expansion on Windows, SET-ENV, LIST-ENV - Function fixes: SELECT object!, FIND object!, RUN enabled, LIST-REG/values - R2/Forward: FUNCT/extern, LAST?, COLLECT-WORDS, RESOLVE, APPLY fixes, EXTRACT fixes, ASCII? fixes, LATIN1? fixes, INVALID-UTF?, CD, LS, MORE, PWD, RM | |
Group: !REBOL3 Extensions ... REBOL 3 Extensions discussions [web-public] | ||
Robert: 8-Dec-2009 | If the c-level side uses UTF-8 strings as well, can I just use the Rebol series as is? get_string returns a decoded string. | |
PeterWood: 11-Nov-2010 | Oldes: Have you tested the function with a string including a Unicode code point which translates to a three-byte utf-8 character? The size of utf8str appears to be only twice the number of codepoints in the REBOL string. A good example of a three-byte utf-8 character is the Euro sign - Unicode 20AC UTF-8 E2 82 AC | |
PeterWood: 11-Nov-2010 | The maximum length of a utf-8 translation of a UCS-2 string would be 1.5 times the length of the string. So if wcslen returns the number of codepoints in a string, the length of the utf-8 should be the length of the str multiplied by 3 integer divided by 2 plus 1. | |
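[Editor's note] Both figures above check out; a quick Python verification (illustrative only):

```python
# U+20AC (Euro sign) takes three UTF-8 bytes, and three bytes is the
# worst case for any UCS-2 code unit, so 3 * codepoints (+1 for a C NUL
# terminator) always suffices -- i.e. 1.5x the UCS-2 byte length.
euro = "\u20ac"
assert euro.encode("utf-8") == b"\xe2\x82\xac"

def max_utf8_bytes(ucs2_units: int) -> int:
    return 3 * ucs2_units          # worst case, excluding a terminator

s = "\u20ac" * 10 + "abc"
assert len(s.encode("utf-8")) <= max_utf8_bytes(len(s))
```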
Maxim: 11-Nov-2010 | Oldes, thanks for that UTF-8 function converter :-) | |
Oldes: 12-Nov-2010 | Again with Cyphre's help, here is a function which converts a MultiByte (utf-8) string from the C side to a REBSER used to return the string to the REBOL side:
REBSER* MultiByteToRebser(char* mbStr)
{
    int len = MultiByteToWideChar(CP_UTF8, 0, mbStr, -1, NULL, 0); // the len is length of the string + null terminator
    wchar_t *wcStr = malloc(len * sizeof(wchar_t));
    int result = MultiByteToWideChar(CP_UTF8, 0, mbStr, strlen(mbStr), wcStr, len);
    if (result == 0) {
        int err = GetLastError();
        RL->print("ERROR: MultiByteToWideChar -> %d\n", err);
        exit(-1); // how to throw ERROR on REBOL side?
    }
    REBSER *ser = RL_MAKE_STRING(len-1, TRUE);
    REBUNI *dst;
    // hack! - will set the tail to len
    REBINT *s = (REBINT*)ser;
    s[1] = len-1;
    RL_GET_STRING(ser, 0, (void**)&dst);
    wcscpy(dst, wcStr);
    free(wcStr);
    wcStr = NULL;
    return ser;
}
I'm not sure how safe it is, but it seems to be working. To return the string value I use:
RXA_TYPE(frm, 1) = RXT_STRING;
RXA_SERIES(frm, 1) = (REBSER *)MultiByteToRebser(utf8str);
return RXR_VALUE; | |
Group: !REBOL3 Schemes ... Implementors guide [web-public] | ||
Graham: 5-Jan-2010 | read and write are very similar ... can we do this? read: func [ port [port!] /write data ] [ either any-function? :port/awake [ unless open? port [cause-error 'Access 'not-open port/spec/ref] if port/state/state <> 'ready [http-error "Port not ready"] port/state/awake: :port/awake do-request port port ] [ sync-op port either write [ data ] [[]] ] ] write: func [ port [port!] value ] [ unless any [block? :value any-string? :value] [value: form :value] unless block? value [value: reduce [[Content-Type: "application/x-www-form-urlencoded; charset=utf-8"] value]] read/write port data ] | |
Graham: 5-Jan-2010 | spec/headers: body-of make make object! [ Accept: "*/*" Accept-Charset: "utf-8" Host: either spec/port-id <> 80 [ rejoin [form spec/host #":" spec/port-id] ] [ form spec/host ] User-Agent: "REBOL" ] spec/headers what exactly is this code doing? | |
Graham: 5-Jan-2010 | I wonder why he can't do this spec/headers: make spec/headers [ Accept: "*/*" Accept-Charset: "utf-8" Host: either spec/port-id <> 80 [ rejoin [form spec/host #":" spec/port-id] ] [ form spec/host ] User-Agent: "REBOL" ] | |
Graham: 6-Jan-2010 | HEAD / HTTP/1.0 Accept: */* Accept-Charset: utf-8 Host: www.rebol.com User-Agent: REBOL HTTP/1.1 200 OK Date: Wed, 06 Jan 2010 07:28:08 GMT Server: Apache/1.3.37 (Unix) mod_auth_passthrough/1.8 mod_log_bytes/1.2 mod_bwlimited/1.4 PHP/4.4.7 FrontPage/5.0.2.2635.SR1.2 mod_ssl/2.8.28 OpenSSL/0.9.7a Last-Modified: Fri, 01 Jan 2010 21:19:01 GMT ETag: "3f44376-2667-4b3e66c5" Accept-Ranges: bytes Content-Type: text/html Via: 1.1 bc1 Content-Length: 9831 Connection: close | |
Graham: 6-Jan-2010 | You're sending HEAD www.rebol.com HTTP/1.0 Accept: */* Accept-Charset: utf-8 Host: www.rebol.com User-Agent: REBOL which is invalid | |
Gabriele: 6-Jan-2010 | read returns binary if it can't convert the content to string (ie. content-type is not text/* and charset is not UTF-8.) this was a quick addition after the unicode changes, and needs much more work. | |
Graham: 24-Jan-2010 | payload: create-sdb-message "ListDomains" listDomains 10 result: to-string write http://sdb.amazonaws.com compose [ POST [ Content-Type: {text/xml; charset="utf-8"} SOAPaction: "ListDomains"] (payload) ] | |
Group: !REBOL3 GUI ... [web-public] | ||
Cyphre: 12-Aug-2010 | There is no charset selection. You just provide valid UTF-8 codes for the appropriate unicode chars, that's all. Also of course you need to have a font that contains those chars. I was using the 'Arial Unicode MS' font in the test screens, which is a huge font containing a big chunk of all the unicode pages. | |
Group: !REBOL3 ... [web-public] | ||
joannak: 26-Jan-2010 | Returns .. Utf-8 encoded string. == #{C3A441424344} | |
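[Editor's note] Assuming the input string was "äABCD", the binary shown above can be verified (Python sketch):

```python
# C3 A4 is the two-byte UTF-8 sequence for U+00E4 (ä); the rest is ASCII.
data = bytes.fromhex("C3A441424344")
assert data.decode("utf-8") == "\u00e4ABCD"
assert "\u00e4ABCD".encode("utf-8") == data
```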
Andreas: 14-Feb-2010 | performance will be less of an issue once we have support for a fast codec (utf-32/ucs-4), leaving mostly the extra function call(s) | |
Claude: 9-May-2010 | brianH in /etc/default/local i have LANG="fr_BE.UTF-8" (ubuntu lucid 10.4) | |
Maxim: 26-May-2010 | if all you do is: rebol_source = "PARSE {.... UTF-8 data from scintilla ...} parse-rules "; do_string (rebol_source); probably very fast, enough for real time, if script isn't huge :-) | |
Robert: 20-Aug-2010 | Added it as a codec so you can access it: >> ml: decode 'markup read http://www.rebol.com >> foreach tag ml [probe tag] <!doctype html> ^/ <html> <head> ^/ <meta name="generator" content="REBOL WIP Wiki"/> ^/ <meta name="date" content="10-Aug-2010/12:18:33-7:00"/> ^/ <meta name="rebol-version" content="2.100.97.4.2"/> ^/ <meta http-equiv="Content-Type" content="text/html;charset=UTF-8"/> | |
BrianH: 21-Sep-2010 | Now for the other binding stuff:
* SET is a low-level function that would be slowed down immensely by adding any refinements.
* SET does handle the unbound scenario: It triggers an error. You can then handle the error.
* R2 and R3 get their speed from the direct binding model. The core speedup of that model is that SET doesn't bind.
* LOAD in R3 is a high-level mezzanine function. It is meant to be as fast as possible given its purpose, but being fast is not its main goal; user-level flexibility is. Most of the overhead of LOAD is in handling all of its various options, as refinements, file extensions and script header settings. If you know what you are doing, you can always optimize your code by doing it directly instead of having LOAD try to figure out that is what you want to do. LOAD is not meant for use in tight loops.
* Henrik, ChristianE, the R3 standard answer to the question of how to make BIND TO-WORD "a" more efficient or friendly in R3 is this: You are encouraged to not program that way in R3. Converting strings to words is something you should do once, not all the time in tight loops. Your code will be much more efficient if you work in REBOL data rather than storing your code in strings and converting at runtime. Strings are for data, or script source, not for containing scripts at runtime. This is a good rule for all REBOL versions, but especially for R3 with its Unicode strings vs. shared UTF-8 words.
* I have recently refactored LOAD so that it is broken into smaller, more efficient functions. You might find that those functions would work better for you in lower-level code. But this was done to let us make LOAD *more* powerful, not less, so the same advice I gave above about not using LOAD in tight loops still applies. I don't yet know if the new LOAD is faster or slower, but it is safer and easier to understand and modify, and you can make your own LOAD replacement that calls the same low-level functions if you like.
Plus, you get compressed scripts :) | |
BrianH: 7-Oct-2010 | Here's a low-level function to parse and process script headers, which shows how many features are built into the base script model in R3:
load-script: funct [
    "Decode a script into [header-obj script-ref body-ref]"
    source [binary! string!] "Source code (string will be UTF-8 encoded)"
    /header "Return the header object only, no script processing"
    ;/check "Calculate checksum and assign it to the header checksum field"
    /original "Use original source for Content header if possible"
] compose [
    data: either string? source [to-binary source] [
        unless find [0 8] tmp: utf? source [ ; Not UTF-8
            cause-error 'script 'no-decode ajoin ["UTF-" abs tmp]
        ]
        source
    ]
    ; Checksum all the data, even that before the header or outside the block
    ;sum: if check [checksum/secure data] ; saved for later
    if tmp: script? data [data: tmp] ; Find the start of the script
    ; Check for a REBOL header
    set/any [hdr: rst:] transcode/only data
    unless case [
        :hdr = 'rebol [ ; Possible REBOL header
            set/any [hdr rst] transcode/next/error rst
            block? :hdr ; If true, hdr is header spec
        ]
        :hdr = [rebol] [ ; Possible script-in-a-block
            set/any [hdr rst] transcode/next/error rst
            if block? :hdr [ ; Is script-in-a-block
                unless header [ ; Don't decode the rest if /header
                    data: first transcode/next data
                    rst: skip data 2
                ]
                true
            ] ; If true, hdr is header spec
        ]
    ] [ ; No REBOL header, use default
        hdr: []
        rst: data
    ]
    ; hdr is the header spec block, rst the position afterwards
    ;assert/type [hdr block! data [binary! block!] rst [binary! block!]]
    ;assert [same? head data head rst]
    ; Make the header object, or fail if we can't
    unless hdr: attempt [construct/with :hdr system/standard/header] [
        cause-error 'syntax 'no-header data
    ]
    ; hdr is a correct header object! here, or you don't get here
    ;if check [append hdr 'checksum hdr/checksum: sum] ; calculated earlier
    ;assert [sum =? select hdr 'checksum] ; Should hdr/checksum be reserved?
    if header [return hdr] ; If /header, no further processing necessary
    ; Note: Some fields may not be final because post-processing is not done.
    ; Skip any whitespace after the header
    ws: (charset [1 - 32]) ; For whitespace skipping (DEL not included)
    if binary? rst [parse rst [any ws rst:]] ; Skip any whitespace
    ; Check for compressed data and decompress if necessary
    case [
        ; Magic autodetection of compressed binary
        tmp: attempt [decompress rst] [
            data: rst: tmp ; Use decompressed data (no header source)
            append hdr 'compressed hdr/compressed: true ; Just in case
        ]
        ; Else not directly compressed (without encoding)
        (select hdr 'compressed) != true [] ; Not declared, do nothing
        ; Else it's declared to be compressed, thus should be
        binary? rst [ ; Regular script, check for encoded binary
            set/any [tmp rst] transcode/next/error rst
            either tmp: attempt [decompress :tmp] [
                data: rst: tmp ; Use the decoded binary (no header source)
                hdr/compressed: 'script ; So it saves the same way
                ; Anything after the first binary! is ignored
            ] [cause-error 'script 'bad-press -3] ; Else failure
        ]
        ; Else it's a block, check for script-encoded compressed binary
        tmp: attempt [decompress first rst] [
            data: rst: tmp
            hdr/compressed: 'script ; It's binary again now
        ]
        ; Else declared compressed but not compressed, so fail
        'else [cause-error 'script 'bad-press -3]
    ]
    ; Save the script content in the header if specified
    if :hdr/content = true [
        hdr/content: either original [source] [copy source]
    ]
    ;assert/type [hdr object! data [binary! block!] rst [binary! block!]]
    ;assert [same? head data head rst]
    reduce [hdr data rst] ; Header object, start of source, start of body
]
Note all the commented assert statements: they are for testing (when uncommented) and documentation. Also, I later removed the checksum calculation from this code because it was the wrong place to put it, at least as far as modules are concerned. However, Carl didn't know this because he was working on it while I was offline for a few days. | |
BrianH: 7-Oct-2010 | Here is the corresponding function in the code reorg, renamed. The friendly empty lines and comments haven't been added yet.
load-header: funct/with [
    "Loads script header object and body binary (not loaded)."
    source [binary! string!] "Source code (a string! will get UTF-8 encoded)"
    no-decompress [logic!] "Skip decompression of body (because we want to look at header mainly)"
][
    ; This function decodes the script header from the script body.
    ; It checks the 'checksum, 'compress and 'content fields of the header.
    ; It will set the 'content field to the binary source if 'content is true.
    ; It will set the 'compress field to 'script for compressed embedded scripts.
    ; If body is compressed, it will be decompressed (header required).
    ; Normally, returns the header object and the body text (as binary).
    ; If no-decompress is false and the script is embedded and not compressed
    ; then the body text will be a decoded block instead of binary.
    ; Errors are returned as words:
    ;     no-header
    ;     bad-header
    ;     bad-checksum
    ;     bad-compress
    ; Note: set/any and :var used - prevent malicious code errors.
    case/all [
        binary? source [data: assert-utf8 source]
        string? source [data: to binary! source]
        not data: script? data [return reduce [none data]] ; no header
        set/any [key: rest:] transcode/only data none ; get 'rebol keyword
        set/any [hdr: rest:] transcode/next/error data none ; get header block
        not block? :hdr [return 'no-header] ; header block is incomplete
        not attempt [hdr: construct/with :hdr system/standard/header] [return 'bad-header]
        :hdr/content = true [hdr/content: data] ; as of start of header (??correct position??)
        :key = 'rebol [ ; regular script
            rest: any [find rest non-ws rest] ; skip whitespace after header
            ;rest: any [find rest #[bitset! [not bits #{7FFFFFFF80}]] rest] ; skip whitespace
            case/all [
                all [:hdr/checksum :hdr/checksum != checksum/secure rest] [return 'bad-checksum]
                no-decompress [return reduce [hdr rest]] ; decompress not done
                :hdr/compress = 'script [set/any 'rest first transcode/next rest]
            ] ; rest is now suspect, use :rest
        ]
        :key = [rebol] [ ; embedded script, only 'script compression supported
            case/all [
                :hdr/checksum [return 'bad-checksum] ; checksum not supported
                no-decompress [return reduce [hdr rest]] ; decompress not done
                rest: skip first transcode/next data 2 none ; decode embedded script
                :hdr/compress [hdr/compress: unbind 'script set/any 'rest first rest]
            ] ; rest is now suspect, use :rest
        ]
        :hdr/compress [rest: attempt [decompress :rest]] ; :rest type-checked by decompress
        not :rest [return 'bad-compress] ; only happens if above decompress failed
    ]
    ;assert/type [hdr object! rest [binary! block!]] ; just for documentation
    reduce [hdr rest]
][
    non-ws: charset [not 1 - 32]
]
Notes:
- The other half of the CASE/all style is a lot of explicit shortcut RETURN statements, whenever the normal flow differs.
- Errors are returned as a word from the error catalog, which is later passed to CAUSE-ERROR.
- Carl redid the checksum calculation so that scripts can verify against a checksum in their header, to detect corruption.
- The checksum in the header probably can't be used for the module checksum because the header itself matters for modules.
- Compressed scripts lost a couple minor, unimportant features that we are likely better without. Quiz: What features?
- Part, but not all of the reason the code is shorter is because the doc comments haven't been added yet. The CASE/all style helps though. | |
ChristianE: 13-Oct-2010 | IIRC, READ at one point only returned the data read as a binary stream, forcing you to DELINE TO STRING! READ ... because of the transition to UTF-8, but /STRING was added back later. Found nothing in the change log, though. | |
BrianH: 18-Nov-2010 | One thing will definitely be easier though: JSON and Javascript define that they have Unicode source, but don't have a way to specify the encoding (they are text standards, not binary). They can be handled easily in R3 once the source is converted to a string though, since that conversion will handle the encoding issues. In R2 you'd have to either stick to ASCII data or use Gabriele's text codecs and then parse the UTF-8. | |
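[Editor's note] The point about JSON generalizes: the parser resolves the encoding at the byte boundary, after which the data is plain Unicode strings. A Python sketch (Python 3.6+, where json.loads accepts bytes directly):

```python
import json

# JSON is Unicode text with no in-band encoding declaration; json.loads
# accepts UTF-8 (or UTF-16/32) bytes and detects the encoding itself.
raw = '{"name": "Kaj", "city": "Z\u00fcrich"}'.encode("utf-8")
obj = json.loads(raw)
assert obj["city"] == "Z\u00fcrich"
```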
BrianH: 11-Jan-2011 | Some *? functions that might be better off as *-OF: ENCODING?, FILE-TYPE?, INDEX?, LENGTH?, SIGN? and SIZE?. Except for the first two the old names would need to stick around because of the legacy naming rules. Strangely enough, UTF? is OK because it is short for "UTF what?". The series contents functions have an implicit -OF :) | |
PeterWood: 17-Feb-2011 | That sounds both very worrying and a challenge - how big were the XML files? Were they utf-8 encoded? Did you verify the utf-8 encoding in the XML or could it have contained invalid utf-8 sequences? | |
PeterWood: 20-Apr-2011 | So, if I understand correctly, I would write something like: iso-ch: union #"^(40) utf-ch-2 and utf-ch: rejoin [#"^{C3}" difference #"^(40)" iso-ch] | |
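[Editor's note: a Python sketch, not from the thread, of the mapping Peter is reaching for: ISO-8859-1 code points at or above 0x80 become two-byte UTF-8 sequences with a 0xC2 or 0xC3 lead byte.]

```python
# ISO-8859-1 code points map 1:1 onto U+0000..U+00FF, so the UTF-8 form
# is mechanical: 0x00-0x7F stay one byte; 0x80-0xBF get lead byte 0xC2;
# 0xC0-0xFF get lead byte 0xC3, with the low six bits in the trail byte.
for ch in ("@", "\xe9", "\xff"):   # "@" (0x40), "é" (0xE9), "ÿ" (0xFF)
    iso = ch.encode("iso-8859-1")
    utf = ch.encode("utf-8")
    print(hex(ord(ch)), iso.hex(), utf.hex())
# 0x40 -> 40 / 40 ; 0xe9 -> e9 / c3a9 ; 0xff -> ff / c3bf
```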
Andreas: 12-Oct-2011 | The only function in R3 that operates that way is TRANSCODE, so as long as it doesn't choke on overlong combinations. #{c0ae} is an overlong encoding for #"." (#{2e}):
>> invalid-utf? #{c0ae}
== #{C0AE}
>> transcode #{c0ae}
== [® #{}]
>> transcode #{2e}
== [. #{}] | |
BrianH: 12-Oct-2011 | So, on R3 INVALID-UTF? flags overlong encodings? Sorry I missed that. Better fix the R2/Forward version accordingly. | |
BrianH: 12-Oct-2011 | INVALID-UTF? returns the series at the position of the first invalid sequence. If it doesn't flag it returns none. | |
Andreas: 12-Oct-2011 | Ok. R2's invalid-utf? catches all 2-byte overlong forms, but not all 3 or 4-byte overlong forms. | |
Andreas: 12-Oct-2011 |
R2>> invalid-utf? #{e080af}
R2== none
R3>> invalid-utf? #{e080af}
R3== #{e080af} | |
Andreas: 12-Oct-2011 | So, R3's invalid-utf? seems to flag overlong encodings in general. R2(/Forward)'s invalid-utf? only catches overlong forms for 2-byte sequences, but not for 3- or 4-byte sequences. | |
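[Editor's note: a Python illustration, not from the thread, of what "flagging overlong encodings" means. Python's strict UTF-8 decoder behaves like the R3 INVALID-UTF? described above: it rejects both the 2-byte and the 3-byte overlong forms tested in this exchange.]

```python
# A strict UTF-8 validator. Overlong forms encode a code point in more
# bytes than necessary; the standard requires rejecting them.
def is_valid_utf8(data: bytes) -> bool:
    try:
        data.decode("utf-8")   # Python's decoder is strict
        return True
    except UnicodeDecodeError:
        return False

print(is_valid_utf8(b"\x2e"))          # True:  ordinary "."
print(is_valid_utf8(b"\xc0\xae"))      # False: 2-byte overlong of "."
print(is_valid_utf8(b"\xe0\x80\xaf"))  # False: 3-byte overlong (the #{e080af} case)
```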
Group: Power Mezz ... Discussions of the Power Mezz [web-public] | ||
Gabriele: 27-Jan-2010 | Will: I tend to have strings as UTF-8, but char! values need to be latin1 for R2. The .r files are the result of MOLD so although I have #"^(XX)" in the RLP source, you get the actual latin1 char in the .r. | |
Janko: 22-Sep-2010 | I can use decode-email-field to decode the various encodings of the subject, but I wasn't able to figure out how to decode the content of an email, which in my case is encoded with quoted-printable / UTF-8. I found the to-netmsg library on codeconscius.com, which loads the email text and parses it into a structure. It doesn't decode the subject (=?UTF-8?B...?=) but it does decode the content. I could use that together with the Power Mezz to get the effect I want, but if there is a way to decode content in the Power Mezz I would rather use it alone. | |
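[Editor's note: an illustrative Python stdlib sketch, not from the thread, of the two decodings Janko needs: quoted-printable body content, and RFC 2047 "=?UTF-8?B?...?=" encoded-word subjects.]

```python
# Two separate layers: the body is quoted-printable over UTF-8 bytes;
# the subject uses MIME encoded-words (here base64, the "B" form).
import quopri
from email.header import decode_header

body = b"na=C5=A1 test"                           # quoted-printable, UTF-8 underneath
print(quopri.decodestring(body).decode("utf-8"))  # -> "naš test"

subject = "=?UTF-8?B?w6k=?="                      # base64 encoded-word for "é"
text, charset = decode_header(subject)[0]
print(text.decode(charset))                       # -> "é"
```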
Group: !REBOL3 Host Kit ... [web-public] | ||
Kaj: 2-Jan-2011 | I'd be surprised if AGG couldn't work with UTF-8, and that wouldn't be the default on Unix | |
Kaj: 2-Jan-2011 | They probably use UTF-8 | |
BrianH: 2-Jan-2011 | That might also require some conversion, but at least then the conversion would be there to use. R3 uses UCS for strings internally for speed and code simplicity, though strangely enough words are stored in UTF-8 internally, since you don't have to access and change words on a character basis. | |
BrianH: 2-Jan-2011 | Windows uses UTF-16 for its APIs, not UCS-2, so by using UCS-2 R3 is limited to the BMP codepoints. | |
BrianH: 2-Jan-2011 | If it is UCS-2 or UTF-16, then all that would need to be done is to convert UCS-1 model R3 strings to UCS-2 mode somewhere before rendering. (He says glibly, having not analyzed the AGG sources or APIs.) | |
Kaj: 2-Jan-2011 | I'm still guessing this only applies to AGG on Windows, using UTF-16. On other platforms, AGG uses FreeType, and I guess that would accept UTF-8 | |
Oldes: 2-Jan-2011 | Also I'm not sure REBOL is using UTF-8 internally, I think it has only ANSI or UCS2 | |
Kaj: 2-Jan-2011 | No, as Brian says, it's using fixed width vectors internally. You get UTF-8 only from conversions | |
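[Editor's note: a Python sketch, not from the thread, of why UCS-2 storage limits R3 to the BMP, as BrianH notes above: code points beyond U+FFFF need a UTF-16 surrogate pair, which a fixed 16-bit cell cannot hold.]

```python
# A BMP character fits in one 16-bit UTF-16 code unit; a non-BMP
# character (e.g. an emoji) needs a surrogate pair, i.e. two units.
bmp = "A"               # U+0041, inside the Basic Multilingual Plane
astral = "\U0001F600"   # U+1F600, outside the BMP

print(len(bmp.encode("utf-16-le")))     # 2 bytes: one 16-bit unit
print(len(astral.encode("utf-16-le")))  # 4 bytes: a surrogate pair
```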
Group: Core ... Discuss core issues [web-public] | ||
Ashley: 11-Apr-2011 | OK, this is freaky: >> system/version == 2.7.8.2.5 >> a: list-env == [ "TERM_PROGRAM" "Apple_Terminal" "TERM" "xterm-color" "SHELL" "/bin/bash" "TMPDIR" "/var/folders/6O/6OnXy9XG... >> help a A is a block of value: [ "TERM_PROGRAM" "Apple_Terminal" "TERM" "xterm-color" "SHELL" "/bin/bash" "TMPDIR" "/var/folders/6O/6OnXy9XGEjiDp3wDqfCJo++++TI/-Tmp-/" "Apple_PubSub_Socket_Render" "/tmp/launch-BrITkG/Render" "TERM_PROGRAM_VERSION" "273.1" "USER" "Ash" "COMMAND_MODE" "legacy" "SSH_AUTH_SOCK" "/tmp/launch-HlnoPI/Listeners" "__CF_USER_TEXT_ENCODING" "0x1F5:0:0" "PATH" {/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin} "PWD" "/Users/Ash" "LANG" "en_AU.UTF-8" "SHLVL" "1" "HOME" "/Users/Ash" "LOGNAME" "Ash" "DISPLAY" "/tmp/launch-U0Gaqw/org.x:0" "_" "/Users/Ash/REBOL/rebol" ] >> length? a == 18 >> select a "USER" == "Ash" >> select a "HOME" == none | |
Oldes: 10-Nov-2011 | I can imagine LENGTH? on a char! value in a Unicode context - it could return the number of bytes needed to store the char in the UTF-8 encoding :) But I'm sure I can live without it. It would just add overhead to the LENGTH? action. | |
Ladislav: 11-Nov-2011 | I want to share with you an "interoperability problem" I encountered. In Windows (at least in not-too-old versions) there are two versions of the string-handling functions:
- ANSI (in fact using a codepage for a Latin charset)
- widechar (in fact Unicode, restricted to 16 bits, I think)
Apple OS X, it seems, "prefers" decomposed Unicode, also known as UTF-8-MAC, I guess. That means that, e.g. for one of Robert's files, it generates a filename looking (in transcription) as follows: %"Mu^(combining-umlaut)nch.r". As far as Unicode goes, this is canonically equivalent to %"M^(u-with-umlaut)nch.r", but:
- Windows doesn't consider these file names equivalent, i.e. you can have both in one directory
- When using the former, the ANSI versions of the Windows system functions "translate" the name to %"Mu^(umlaut)nch.r", which is a third file name, distinct from both of the above; so, if R2 reads it in a directory, it is unable to open the file | |
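[Editor's note: Ladislav's filename problem in miniature, as an illustrative Python sketch (not from the thread). OS X stores decomposed (NFD) names, Windows composed (NFC); the two spellings are canonically equivalent but byte-for-byte different.]

```python
# Canonical equivalence: "u" + combining diaeresis vs. precomposed "ü".
# Equal only after normalizing both to the same form.
import unicodedata

nfd = "Mu\u0308nch.r"   # decomposed, as HFS+ stores it
nfc = "M\u00fcnch.r"    # precomposed, as Windows typically stores it

print(nfd == nfc)                                # False: different code points
print(unicodedata.normalize("NFC", nfd) == nfc)  # True: same after NFC
```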
Group: Red ... Red language group [web-public] | ||
Dockimbel: 28-Feb-2011 | I plan to support UTF-8 scripts for both Red & Red/System. The memory storage model is not yet decided, could be UTF-8 or UCS-2. | |
Dockimbel: 29-Mar-2011 | Brian: right, but I'm not sure that Red/System needs to be Unicode-aware, at least not to implement UTF-8 Red's sources parsing. | |
Dockimbel: 29-Mar-2011 | Well, by default, Red/System could be transparent to UTF-8 (that's what will be used in Red for strings I/O), as is string! in R2. Will add char! as unicode codepoint to possible evolutions anyway. | |
Dockimbel: 11-Oct-2011 | Anyone knows where to find exhaustive lists of invalid UTF-8 encoding ranges? | |
Andreas: 11-Oct-2011 | C0, C1, F5-FF must never occur in UTF-8. | |
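[Editor's note: an illustrative Python check, not from the thread, of the byte values Andreas lists. C0 and C1 could only ever produce overlong forms, and F5-FF would encode code points above U+10FFFF, so none of them may appear anywhere in valid UTF-8.]

```python
# Verify that sequences starting with any of these lead bytes are
# rejected by a strict UTF-8 decoder.
def decodes(b: bytes) -> bool:
    try:
        b.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

for lead in (0xC0, 0xC1, 0xF5, 0xFF):
    assert not decodes(bytes([lead, 0x80, 0x80, 0x80]))
print("C0, C1, F5, FF all rejected")
```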
BrianH: 11-Oct-2011 | You might also consider looking at the source of INVALID-UTF? in R2, which is MIT licensed from R2/Forward. | |
BrianH: 11-Oct-2011 | It would still be a good idea to review the Unicode standard to determine which of the characters should be treated as spaces, but that would still be a problem for R3 because all of the delimiters it currently supports are one byte in UTF-8 for efficiency. If other delimiters are supported, R3's parser will be much slower. | |
Andreas: 12-Oct-2011 | Completely forgot about INVALID-UTF? :) |