AltME groups: search
results summary
world | hits |
r4wp | 115 |
r3wp | 287 |
total: | 402 |
results window for this page: [start: 1 end: 100]
world-name: r4wp
Group: #Red ... Red language group [web-public] | ||
DocKimbel: 5-Aug-2012 | Red: I'm still working on both the compiler and the minimal runtime required to run simple Red programs. I have only the very basic datatypes working for now, no objects (so no ports) yet. I'm not yet at the point where I can give an accurate ETA for the first alpha, but I hope to be able to provide that ETA in a week. Red string! datatype will support Unicode (UTF-8 and UTF-16 encoding internally). I haven't implemented Unicode yet, so if some of you are willing to provide efficient code for supporting Unicode, that would greatly speed up Red progress. The following functions would be needed (coded in Red/System): - UTF-8 <=> UTF-16 LE conversion routines - (by order of importance) length?, compare (two strings), compare-case, pick, poke, at, find, find-case - optionally: uppercase, lowercase, sort All the above functions should be coded both for UTF-8 and UTF-16 LE. | |
DocKimbel: 5-Aug-2012 | In case you wonder why Red needs both UTF formats, well, it's simple, Windows and UNIX worlds use different encodings, so we need to support both. Red will use by default UTF-8 for string values, but on Windows platform, it will convert the string to UTF-16 on first call to an OS API, and will keep that encoding later on (and avoid the overhead of converting it each time). We might want to make the UTF-16 related code platform-dependent and not include it for other platforms, but I think that some text processing algorithms might benefit from a fixed-size encoding, so for now, I'm for including both encodings for all targets. It will be also possible for users to check and change the encoding of a Red string! value at runtime. | |
BrianH: 5-Aug-2012 | Keep in mind that even UTF-16 is not a fixed-size encoding. Each codepoint either takes 2 or 4 bytes. | |
BrianH: 5-Aug-2012 | UTF-32 (aka UCS4) is a fixed-size encoding. It's rarely used though. | |
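The point about fixed-size versus variable-size encodings can be checked with a few lines of code. The following is a minimal Python sketch (not Red/System, and not part of the original discussion); the helper name `encoded_sizes` is hypothetical and only Python's built-in codecs are used.

```python
# Count how many bytes a single codepoint occupies in each encoding.
# `encoded_sizes` is a hypothetical helper name used for illustration only.
def encoded_sizes(ch: str) -> dict:
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16-le": len(ch.encode("utf-16-le")),
        "utf-32-le": len(ch.encode("utf-32-le")),
    }

# ASCII, Latin-1 range, rest of the BMP, and a codepoint outside the BMP.
for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

Only the UTF-32 column stays constant at 4 bytes; UTF-16 needs 4 bytes (a surrogate pair) as soon as a codepoint lies outside the Basic Multilingual Plane.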
BrianH: 4-Sep-2012 | There is a bit that is worth learning from R3's Unicode transition that would help Red. First, make sure that strings are logically series of codepoints. Don't expose the internal structure of strings to code that uses them. Different underlying platforms do their Unicode APIs using different formats, so on different platforms you might need to implement strings differently. You don't want these differences affecting the Red code that uses these strings. Don't have direct equivalence between binary! and string! - require conversion between them. No AS-STRING and AS-BINARY functions. Don't export the underlying binary data. If you do, the code that uses strings would come to depend on a particular underlying format, and would then break on platforms where the underlying format is different. Also, if you provide access to the underlying binary data to Red code, you have to assume that the format of that data can be corrupted at any moment, so you'll have to add a lot of verification code, and your compiler won't be able to get rid of it. Work in codepoints, not characters. Unicode characters are complicated and can involve multiple codepoints, or not, but until you display it none of that matters. R3 uses fixed-length encodings of strings internally in order to speed things up, but that can cause problems when running on underlying platforms that use variable-length encodings in their APIs, like Linux (UTF-8) and Windows/Java/.NET/OSX? (UTF-16). This makes sense for R3 because the underlying code is compiled, but the outer code is not, and there's no way to break that barrier. With Red the string API could be logical, with the optimizer making the distinction go away, so you might be able to get away with using variable-length encodings internally if that makes sense to you. Length and index would be slower, but there'd be less overhead when calling external API functions, so make the tradeoff that works best for you. | |
BrianH: 4-Sep-2012 | That's not as hard as it sounds. There are only 3 API models in wide use: UTF-16, UTF-8, and no Unicode support at all. A given port of Red would only have to support one of those on a given platform. | |
DocKimbel: 4-Sep-2012 | So far, my short-list of encodings to support is UTF-8 and UTF-16LE. UTF-32 might be needed at some point in the future, but for now, I'm not aware of any system that uses it? The Unicode standard by itself is not the problem (having just one encoding would have helped, though). The issue lies in different OSes supporting different encodings, so it makes the choice for an internal x-platform encoding hard. It's a matter of Red internal trade-offs, so I need to study the possible internal resources usage for each one and decide which one is the more appropriate. So far, I was inclined to support both UTF-8 and UTF-16LE fully, but I'm not sure yet that's the best choice. To avoid surprising users with inconsistent string operation performance, I thought to give users explicit control over string format, if they need such control (by default, Red would handle all automatically internally). For example, on Windows: s: "hello" ;-- UTF-8 literal string print s ;-- string converted to UCS2 for printing through win32 API write %file s ;-- string converted back to UTF-8 set-modes s 'encoding 'UTF-16 ;-- user deciding on format or s/encoding: 'UTF-16 print length? s ;-- Length? then runs in O(1), no surprise. Supporting ANSI as internal encoding seems useless, being able to just export/import it should suffice. BTW, Brian, IIRC, OS X relies on UTF-8 internally not UTF-16. | |
DocKimbel: 4-Sep-2012 | set-modes s 'encoding 'UTF-16 should rather be: set-modes s [encoding: UTF-16] | |
BrianH: 4-Sep-2012 | Be sure to not forget the difference between UTF-16 (variable-length encoding of all of Unicode) and UCS2 (fixed-length encoding of a subset of Unicode). Windows, Java and .NET support UTF-16 (barring the occasional buggy code that assumes fixed-length encoding). R3's current underlying implementation is UCS2, with its character set limitations, but its logical model is codepoint-series. | |
BrianH: 4-Sep-2012 | IIRC Python 3 uses UCS4 internally for its Unicode strings, with all of the overhead that implies. UCS4 and UTF-32 are the same thing, both fixed-length. | |
BrianH: 4-Sep-2012 | If you support different internal string encodings on a given platform, be sure to not give logical access to the underlying binary data to Red code. The get/set-modes model is good for that kind of thing. If the end developer knows that the string will be grabbed from something that provides UTF-8 and passed along to something that takes UTF-8, they might be better off choosing UTF-8 as an underlying encoding. However, that should just be a mode - their interaction with the string should follow the codepoint model. If the end developer will be working directly with encoded data, they should be working with binary! values. | |
BrianH: 4-Sep-2012 | Btw, in this code above: s/encoding: 'UTF-16 print length? s ;-- Length? then runs in O(1), no surprise. Length is not O(1) for UTF-16, it's O(n). Length is only O(1) for the fixed-length encodings. | |
BrianH: 4-Sep-2012 | Ah, but length is even O(n) for BMP characters in a UTF-16 string, because figuring out that there are only BMP characters in there is an O(n) operation. To be O(1) you'd have to mark some flag in the string when you add the characters in there in the first place. | |
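To make the O(n) versus O(1) argument concrete, here is a minimal Python sketch of both length computations. The function names are hypothetical, the input is assumed to be well-formed UTF-16LE, and nothing here reflects how Red or R3 actually store strings.

```python
def utf16le_codepoint_count(data: bytes) -> int:
    """Count codepoints in well-formed UTF-16LE data (hypothetical helper).

    Every 16-bit unit has to be inspected, so the cost is O(n).
    """
    if len(data) % 2:
        raise ValueError("UTF-16 data must have an even number of bytes")
    count = 0
    for i in range(0, len(data), 2):
        unit = data[i] | (data[i + 1] << 8)
        # A low (trailing) surrogate belongs to the previous codepoint,
        # so it is not counted a second time.
        if not 0xDC00 <= unit <= 0xDFFF:
            count += 1
    return count


def ucs2_codepoint_count(data: bytes) -> int:
    """With a fixed two-byte unit the length is a single division: O(1)."""
    return len(data) // 2


sample = "a\U0001F600b".encode("utf-16-le")
print(utf16le_codepoint_count(sample))  # scans all four 16-bit units
print(ucs2_codepoint_count(sample))     # miscounts: the data is not UCS-2
```

The second function is only correct when the data is guaranteed to contain no surrogate pairs, which is exactly the UCS-2 versus UTF-16 distinction made above.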
DocKimbel: 4-Sep-2012 | Ok, if you really want to be nitpicking, replace UTF-16 with UCS-2. ;-) | |
BrianH: 4-Sep-2012 | If you are ensuring that only BMP characters are in there then you have UCS2, not UTF-16 :) | |
BrianH: 4-Sep-2012 | Don't worry, I'm only nitpicking to make things better. There's a lot of buggy code out there that assumes UTF-16 is UCS2, so we're better off making that distinction right away :) | |
DocKimbel: 7-Sep-2012 | Brian: I was wrong for OS X, it uses UTF-16 internally according to http://en.wikipedia.org/wiki/UTF-16 | |
DocKimbel: 24-Sep-2012 | Conversion for printing in UTF-16 done on-the-fly (no additional buffer needed) | |
BrianH: 24-Sep-2012 | Will you eventually be doing the same trick R3 does of keeping its symbols in UTF-8 format internally, for binary hashing? Of course you might be handling symbols completely differently... | |
DocKimbel: 24-Sep-2012 | Yes, I currently keep an UTF-8 version in cache for each small string, but I'm not sure I will keep it. | |
PeterWood: 26-Sep-2012 | Is the source file of your Czech version UTF-8 encoded? | |
DocKimbel: 26-Sep-2012 | (just select UTF-8 when saving) | |
Pekr: 26-Sep-2012 | hello.red is already UTF-8, I just added one line and saved ... | |
DocKimbel: 26-Sep-2012 | Be sure you've saved it in UTF-8. | |
Pekr: 26-Sep-2012 | well, anyway - how is R2 able to read UTF-8 anyway? | |
DocKimbel: 26-Sep-2012 | It reads it as a stream of bytes. As UTF-8 doesn't use null bytes in its encoding (except for codepoint 0), it can be fully loaded as string! or binary! in R2 (but you'll see garbage for non-ASCII characters). | |
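The "garbage for non-ASCII characters" effect can be reproduced outside REBOL. The sketch below is Python and uses Latin-1 as a stand-in for whatever single-byte codepage R2 happens to display with; that choice is an assumption made for illustration, as is the sample string.

```python
# A sample string with two non-ASCII characters (a-ring and a-umlaut).
text = "Hall\u00e5 V\u00e4rlden!"

# The UTF-8 bytes as they would sit in the file.
raw = text.encode("utf-8")

# Reading the same bytes one byte per character (Latin-1 used as a
# stand-in for a single-byte codepage) gives two characters per accent.
as_single_bytes = raw.decode("latin-1")

print(len(text), len(raw))  # 14 characters, 16 bytes
print(as_single_bytes)

# No information is lost: the bytes round-trip back to the original text.
assert as_single_bytes.encode("latin-1").decode("utf-8") == text
# UTF-8 never produces a zero byte unless the text contains codepoint 0.
assert 0 not in raw
```

Because the byte stream survives intact, a script can still be loaded and saved without damage; only the on-screen rendering of the accented characters is wrong.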
PeterWood: 26-Sep-2012 | If anybody can provide the UTF-8 chars (hex values) for Hello World in Czech. I'll run a test. | |
DocKimbel: 26-Sep-2012 | The above string doesn't work as-is in Red though, you should pass the codepoints escaped instead of the UTF-8 encoding. | |
Pekr: 26-Sep-2012 | Above works ... but when I write it directly in Notepad (and the file claims it is UTF-8), it does not work ... strange then ... | |
Henrik: 26-Sep-2012 | Not sure if Notepad is the best for UTF-8 work... | |
DocKimbel: 26-Sep-2012 | Pekr: try to set the "encoding" field to UTF-8 in the saving panel (Save as...). | |
Pekr: 26-Sep-2012 | it is set to UTF-8 already .... | |
Pekr: 26-Sep-2012 | In R3, if script is in the UTF-8 format, I can imo directly type it in Notepad ... | |
MagnussonC: 26-Sep-2012 | Tested "Hallå Världen!" on Win 7 (UTF-8) and it works. Saving the file as Notepads "Unicode" doesn't work, but I understand "Unicode" isn't supposed to be UTF. | |
DocKimbel: 26-Sep-2012 | I guess that "Unicode" mode of Notepad is UTF-16. Red accepts only UTF-8 input scripts. | |
DocKimbel: 26-Sep-2012 | You need to change the encoding selector when saving with Notepad to UTF-8. | |
MagnussonC: 26-Sep-2012 | Yes, I tested with UTF-8 encoded file | |
Andreas: 26-Sep-2012 | I noticed that the red/tests/hello.red file is UTF-8 with a BOM -- I'd suggest dropping the BOM, as using a BOM with UTF8 is not recommended. | |
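Checking a source file for a UTF-8 byte order mark takes only a prefix comparison. This is a minimal Python sketch; `strip_utf8_bom` is a hypothetical helper name and the sample bytes are made up for the example.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return `data` without a leading UTF-8 BOM (EF BB BF), if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


with_bom = codecs.BOM_UTF8 + b"Red []"
print(strip_utf8_bom(with_bom))   # b'Red []'
print(strip_utf8_bom(b"Red []"))  # unchanged
```

Python's `utf-8-sig` codec performs the same stripping while decoding, so `with_bom.decode("utf-8-sig")` is an alternative when text rather than bytes is wanted.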
BrianH: 20-Oct-2012 | Note that if you specify the length, it applies to the length of the script after the header and an optional newline after it (cr, crlf or lf). Same goes for the checksum. Both apply to binary data, meaning the source in UTF-8 encoding and with newlines in the style that they are specified in the file. | |
PeterWood: 30-Oct-2012 | AFAIK, Windows consoles only support Windows 8-bit codepages or UTF-16. Red/System can print the full range of UTF-8 characters (as can REBOL) but the console can't display them. | |
Kaj: 30-Oct-2012 | Ah, right, I'd have to use UTF-16 source text | |
PeterWood: 30-Oct-2012 | You would need to check that the Windows console is set to display UTF-16B. This commit ( https://github.com/dockimbel/Red/commit/be271889ff03e44bdb55af04b60ea2bb280cb18f ) shows how. | |
PeterWood: 30-Oct-2012 | The other way is to convert the UTF-8 c-string! to UTF-16LE integers on the fly and feed them into libc putwchar yourself. More work upfront but may be easier in the long term. The code in red/runtime/platform/win32.reds is a pretty clear example of how to do it but you would still need to write the UTF-8 to UTF-16LE on-the-fly conversion yourself. (That one is UCS-4 to UTF-16LE). | |
DocKimbel: 31-Oct-2012 | Kaj: you can switch the Windows console to an UTF-8 compatible mode using _setmode(): http://msdn.microsoft.com/en-us/library/tw4k6df8.aspx I haven't tested it but it should work. Windows uses natively UTF-16LE, so you would probably have a speed penalty using that mode. | |
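The on-the-fly UTF-8 to UTF-16LE conversion mentioned above can be sketched as follows. This is Python rather than Red/System, all function names are hypothetical, and the decoder assumes well-formed UTF-8 (it performs no validation), so it is an illustration of the technique rather than the code in win32.reds.

```python
def utf8_codepoints(data: bytes):
    """Yield codepoints from well-formed UTF-8, one at a time (no validation)."""
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:        # 0xxxxxxx: single byte
            cp, extra = lead, 0
        elif lead < 0xE0:      # 110xxxxx: one continuation byte
            cp, extra = lead & 0x1F, 1
        elif lead < 0xF0:      # 1110xxxx: two continuation bytes
            cp, extra = lead & 0x0F, 2
        else:                  # 11110xxx: three continuation bytes
            cp, extra = lead & 0x07, 3
        for k in range(1, extra + 1):
            cp = (cp << 6) | (data[i + k] & 0x3F)
        i += extra + 1
        yield cp


def utf16_units(cp: int):
    """Yield the one or two 16-bit code units that encode `cp` in UTF-16."""
    if cp < 0x10000:
        yield cp
    else:
        cp -= 0x10000
        yield 0xD800 | (cp >> 10)    # high surrogate
        yield 0xDC00 | (cp & 0x3FF)  # low surrogate


def utf8_to_utf16le(data: bytes) -> bytes:
    """Convert without building an intermediate codepoint buffer."""
    out = bytearray()
    for cp in utf8_codepoints(data):
        for unit in utf16_units(cp):
            out += unit.to_bytes(2, "little")
    return bytes(out)


sample = "Hall\u00e5 \U0001F600"
print(utf8_to_utf16le(sample.encode("utf-8")) == sample.encode("utf-16-le"))
```

In a console-printing setting each 16-bit unit would presumably be handed to the output call as it is produced instead of being appended to `out`, which is what makes the approach buffer-free.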
Kaj: 1-Nov-2012 | hello-Unicode is because the program source is UTF-8 instead of UTF-16 | |
DocKimbel: 1-Nov-2012 | Kaj: Red source scripts should always be UTF-8 encoded regardless of the platform. | |
DocKimbel: 8-Nov-2012 | A series buffer has a header, with OFFSET and TAIL pointers that define respectively the beginning and end of series slots. The OFFSET pointer allows reserving space at the head of the series for optimizing insertions at head. Series slot size can be 1 (binary/UTF-8/Latin-1), 2 (UCS-2), 4 (UCS-4) or 16 (value!) bytes wide. | |
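The 1, 2 and 4 byte slot widths map naturally onto the highest codepoint a string contains. The Python sketch below picks the narrowest width that fits and packs the text into a fixed-width buffer; choosing the narrowest width is an assumption made for illustration, and `unit_size` and `pack` are hypothetical names, not Red runtime functions.

```python
def unit_size(text: str) -> int:
    """Pick the narrowest fixed width (in bytes) that can hold every codepoint."""
    highest = max(map(ord, text), default=0)
    if highest <= 0xFF:
        return 1   # Latin-1 range
    if highest <= 0xFFFF:
        return 2   # UCS-2 range
    return 4       # UCS-4


def pack(text: str):
    """Return (unit, buffer) with each codepoint stored little-endian in `unit` bytes."""
    unit = unit_size(text)
    return unit, b"".join(ord(ch).to_bytes(unit, "little") for ch in text)


for sample in ("hello", "h\u00e9", "\u20ac", "\U0001F600"):
    unit, buf = pack(sample)
    # With a fixed width, the length is just a division.
    print(unit, len(buf) // unit)
```

With fixed-width slots, indexing and length are constant-time, at the cost of widening the whole buffer when a single wide codepoint is present.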
DocKimbel: 10-Nov-2012 | Red should provide an UTF-8 codec. For national encodings, we would probably proceed by offering on-demand online codecs for the most used ones. That could be a shared resource with R3. | |
DocKimbel: 10-Nov-2012 | BTW, we already have a UTF-8 binary parser in the Red compiler. | |
DocKimbel: 27-Dec-2012 | _setmode call is used to properly set the DOS console to UTF-16 (Unicode mode). | |
DocKimbel: 29-Dec-2012 | You should wait for me to add the marshalling and unmarshalling functions (that will be used everywhere Red needs to interface with non-Red code). In your code example, it should be: 1 + length? version (as it needs to account for terminal NUL character). Also, you need to make sure that the source c-string! buffer is always available or make a copy of it (a pointer to it is stored as a UTF-8 cache, unused yet, but intended for speeding up I/O, still experimental, not sure it will stay for v1.0). | |
Kaj: 10-Apr-2013 | It was my understanding that string/rs-head returns a UTF-8 cache of a string. How can I get this value? I'm trying to get UTF-8 back that I fed in. The problem I'm having is the following: write %syllable.org.html read "http://syllable.org" This writes out just one character instead of the expected file. | |
Kaj: 15-Apr-2013 | Doc, any idea how I can convert a string! passed into a routine! to UTF-8, or access a cached UTF-8 value? | |
PeterWood: 16-Apr-2013 | The answer is no, you can't, as mold doesn't output a UTF-8 string. | |
DocKimbel: 16-Apr-2013 | Kaj: cached UTF-8 string is available using str/cache if str is a red-string! value. | |
DocKimbel: 16-Apr-2013 | We haven't yet implemented UTF-8 encoding functions in the standard library. It will be done during the I/O implementation (unless you have a strong need for it, then I'll have a look at it). | |
Kaj: 17-Apr-2013 | UTF-8 encoding would be very welcome. My I/O frameworks are of little use without it | |
DocKimbel: 17-Apr-2013 | Red/System c-strings are UTF-8 compatible. | |
Kaj: 17-Apr-2013 | I mean my ongoing request for getting UTF-8 in routines | |
DocKimbel: 17-Apr-2013 | But someone could contribute string! <=> UTF-8 conversion routines in the meantime. | |
DocKimbel: 17-Apr-2013 | I think that those UTF-8 conversion routines would take at least two days of work to get implemented and debugged. I'll see once I get shared libs done if I can afford them before working on the other urgent tasks. | |
DocKimbel: 17-Apr-2013 | Well, as I said, someone could contribute those UTF-8 conversion routines. | |
DocKimbel: 17-Apr-2013 | For Android, java uses UTF-16, so the conversion from string! is (almost) trivial. | |
PeterWood: 17-Apr-2013 | I'd be happy to look at a UCS-2 to UTF-8 conversion function but I don't have the time to do it at the moment. | |
PeterWood: 17-Apr-2013 | I've written a quick function that will take a Red char (UCS4) and output the equivalent UTF-8 as bytes stored in a struct!. It can be used as the base for converting a Red string to UTF-8. What is needed is to extract Red char!s from the Red string, call the function and then append the UTF-8 to a c-string! | |
PeterWood: 17-Apr-2013 | You can find it at: https://github.com/PeterWAWood/Red-System-Libs/blob/master/UTF-8/ucs4-utf8.reds | |
PeterWood: 18-Apr-2013 | For me the big issue of turning the function into the utf-8 string that Kaj's wants is "How to allocate a c-string! using the Red Memory Manager rather than malloc" Any suggestions appreciated. | |
DocKimbel: 18-Apr-2013 | Here's what your main loop would look like for retrieving every codepoint from a string! value: head: string/rs-head str tail: string/rs-tail str s: GET_BUFFER(str) unit: GET_UNIT(s) while [head < tail][ cp: switch unit [ Latin1 [as-integer p/value] UCS-2 [(as-integer p/2) << 8 + p/1] UCS-4 [p4: as int-ptr! p p4/value] ] ...emit UTF-8 char... head: head + unit ] | |
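The same loop, including the "emit UTF-8 char" step that the message leaves open, can be sketched in Python. It walks a fixed-width little-endian buffer in steps of `unit` bytes and encodes each codepoint by hand. The names `encode_utf8` and `buffer_to_utf8` are hypothetical, and this is an illustration of the technique, not the Red runtime code.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one codepoint as 1 to 4 UTF-8 bytes."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def buffer_to_utf8(buf: bytes, unit: int) -> bytes:
    """Walk a fixed-width buffer (unit = 1, 2 or 4 bytes) and emit UTF-8."""
    out = bytearray()
    for pos in range(0, len(buf), unit):
        # Little-endian read of one slot, the counterpart of the
        # Latin1 / UCS-2 / UCS-4 switch in the loop above.
        cp = int.from_bytes(buf[pos:pos + unit], "little")
        out += encode_utf8(cp)
    return bytes(out)


text = "h\u00e9\u20ac"
print(buffer_to_utf8(text.encode("utf-16-le"), 2) == text.encode("utf-8"))
```

Because every slot holds exactly one codepoint, no surrogate handling is needed on the input side; that would only be required if the source buffer were UTF-16 rather than UCS-2.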
PeterWood: 18-Apr-2013 | I should be able to turn this into a function for Kaj to include in his routine! where he needs UTF-8 | |
DocKimbel: 18-Apr-2013 | Kaj is working on Linux and Syllable only. Also that API provides UTF-16 to UTF-8 support, but we need also UCS-4 to UTF-8 (UCS-2 being a subset of UTF-16). | |
PeterWood: 19-Apr-2013 | Kaj - You can find a rough and ready red-string! to c-string! function at: https://github.com/PeterWAWood/Red-System-Libs/blob/master/UTF-8/string-c-string.reds it #includes the UCS4 character to UTF8 convertor which you will need in the same directory as the string-c-string func. | |
PeterWood: 19-Apr-2013 | The ucs4 -> utf8 char convertor: https://github.com/PeterWAWood/Red-System-Libs/blob/master/UTF-8/ucs4-utf8.reds | |
PeterWood: 19-Apr-2013 | I haven't really tested it as you can see from : https://github.com/PeterWAWood/Red-System-Libs/blob/master/UTF-8/Tests/string-c-string-test.red | |
DocKimbel: 19-Apr-2013 | Peter, maybe you could use ALLOCATE function from Red/System and let Kaj's code call FREE on UTF-8 buffers after usage? | |
PeterWood: 24-Apr-2013 | Being written in REBOL/View, ALTME encodes characters in the Windows codepage under Windows, MacRoman under OS X and UTF-8 (I think, it may be ISO-8859-1) under Linux. So if you use any character other than standard ASCII characters, it will appear differently on different systems. | |
Kaj: 26-Apr-2013 | I've been working on adding UTF-8 support for the past week, so you'll see construction soon | |
Kaj: 26-Apr-2013 | Here's what happens when I paste UTF-8 in the console on Linux: | |
DocKimbel: 26-Apr-2013 | Red input sources must be UTF-8 encoded. | |
DocKimbel: 26-Apr-2013 | You can't paste UTF-8 in the console, it supports only Latin-1. | |
DocKimbel: 26-Apr-2013 | Are you sure you're pasting Latin-1 and not UTF-8? | |
Kaj: 26-Apr-2013 | string/load can only load UTF-8, so only ASCII and UTF-8 files can be read, not Latin-1 | |
DocKimbel: 26-Apr-2013 | For: print read "http://syllable.org", do you feed string/load with an UTF-8 input even on Windows? | |
Kaj: 26-Apr-2013 | Actually, I did one test that confirms Andreas' statement. The only way to get 8-bit data in is to compile a UTF-8 string literal that fits into Latin-1 | |
Kaj: 26-Apr-2013 | No, the console says you can input Latin-1, and you can't, not even through UTF-8 | |
Group: Announce ... Announcements only - use Ann-reply to chat [web-public] | ||
Kaj: 27-Apr-2013 | I implemented UTF-8 output support for Red. I ended up writing optimised versions based more on the Red print backend. I integrated them in my I/O routines and made heavy performance optimisations. Thanks to Peter for leading the way. There are the following Red/System encoders embedded in %common.red: http://red.esperconsultancy.nl/Red-common/dir?ci=tip to-UTF8: encodes a Red string into UTF-8 Red/System c-string! format. to-local-file: encodes a Red string into Latin-1 Red/System c-string! format on Windows, and into UTF-8 on other systems. This yields a string suitable for the local file name APIs. Latin-1 can be output as long as it was input into Red via UTF-8. Non-Latin-1 code points cannot be encoded in Latin-1 and yield a NULL for the entire result. These encoders make use of the Latin1-to-UTF8, UCS2-to-UTF8 and UCS4-to-UTF8 encoding functions. An example of their use in the Red READ and WRITE functions is in %input-output.red | |
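The described behaviour of the Latin-1 branch (a single non-Latin-1 codepoint makes the whole result NULL) can be modelled in a few lines. This is a Python sketch under that reading of the announcement; `to_latin1` is a hypothetical name, `None` stands in for NULL, and the trailing zero byte models the c-string! terminator.

```python
from typing import Optional


def to_latin1(text: str) -> Optional[bytes]:
    """Encode `text` as a NUL-terminated Latin-1 byte string.

    Returns None (standing in for NULL) if any codepoint is above U+00FF,
    mirroring the all-or-nothing behaviour described above.
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFF:
            return None
        out.append(cp)
    out.append(0)  # terminating NUL, as in a c-string!
    return bytes(out)


print(to_latin1("caf\u00e9"))  # b'caf\xe9\x00'
print(to_latin1("\u20ac"))     # None: the euro sign is not in Latin-1
```

Failing the whole conversion, rather than substituting a placeholder character, means a caller can never end up opening a differently named file by accident.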
Kaj: 27-Apr-2013 | I used the new encoding functions in all my Red bindings: those for the C library, input/output via files and cURL, 0MQ, SQLite and GTK+. In as many places as possible, data marshalled to the external libraries now supports UTF-8. File names on Windows support Latin-1. Files and URLs are always read and written as UTF-8, including on Windows. Red does not support loading Latin-1 strings. | |
Kaj: 27-Apr-2013 | I've updated the binary downloads. The red console interpreters and all the Red examples include the above encoding support now, and all the latest Red features: http://red.esperconsultancy.nl/Red-test/dir?ci=tip For example, the Red/GTK-text-editor now supports writing UTF-8 files with UTF-8 or Latin-1 names. I've added an MSDOS\Red\red-core.exe for Windows 2000, because the GTK+ libraries in red.exe require Windows XP+. | |
Kaj: 27-Apr-2013 | I can't test the encoding on Mac, so I would be interested to hear if it works there, especially UTF-8 file names | |
Kaj: 19-Jun-2013 | I changed the Red 0MQ interface to optimise the memory use during receiving of messages: http://red.esperconsultancy.nl/Red-ZeroMQ-binding/info/2a1541af57 SEND and RECEIVE have been renamed to send-string and receive-string, because they currently handle messages as UTF-8 text. When Red gets a binary! type, versions for binary messages will be added, and there will probably be type agnostic SEND and RECEIVE wrappers again. Previously, you used message: receive socket to receive a string message. Now you pass a premade string! (similar to call/output in R2): message: "" receive-string socket message This means that you can choose between creating new strings for each message (with COPY) or reusing the same string. In the latter case, some Red/System code in receive-string makes sure that no extra Red memory is used, and that all used system and 0MQ memory is freed again. By optimising memory use, this also improves performance of message throughput. | |
Group: Ann-Reply ... Reply to Announce group [web-public] | ||
Kaj: 20-Feb-2013 | Fossil standardises on UTF-8 and standard line endings in text files. I suppose I should not link to single files anymore. From a folder in Fossil's web UI, you can at least view those files | |
Group: Rebol School ... REBOL School [web-public] | ||
Pekr: 20-Jun-2012 | I use Artisteer to prototype web pages, and it saves content in UTF-8. Later on, I need to do a few adaptations to such generated pages, so I opened it in R2, reparsed, inserted some stuff, deleted other, but it did not work out .... | |
Pekr: 20-Jun-2012 | Use some external tool to convert it to ANSI, do adaptations, and convert it back to UTF-8? | |
Pekr: 20-Jun-2012 | I mean - text I need to input into the resulting file (UTF-8) is ANSI. I do print to-string read %text-slider.html, and in R3 console, Czech text is not correct .... | |
Kaj: 20-Jun-2012 | So you're saying the input file is not UTF-8? | |
Pekr: 20-Jun-2012 | Yes, ANSI. I solved it by re-saving the same source file as UTF-8 instead of ANSI. Still a bad complication, as by default, Windows sets Notepad to ANSI, so it is a bit inconvenient ... | |
BrianH: 20-Jun-2012 | Petr, R3 can't decode any 8bit encodings with its built-in code, just ASCII (which is 7bit) and UTF-8. However, its binary handling is better so it should be easy to write your own converters. For R2, I would suggest looking at Gabriele's PowerMezz package; it has some great text converters. Of course you lose out on R3's PARSE if you use R2. | |
Arnold: 21-Jun-2012 | On my Mac, the international characters in a script I made on Windows are also displayed wrong: "Nederlands" "English" "Deutsch" "Français" "Español" "Italiano" "Português". When I saved it as UTF-8 I hoped my problems would be resolved, but then REBOL complained that my script had no REBOL header. :-( | |
Arnold: 22-Jun-2012 | And knowing that even this small community has fewer members than the diacritics they use in everyday life, it is a requirement to deal with UTF-8, UCS or other encodings. | |
Group: !REBOL3 ... General discussion about REBOL 3 [web-public] | ||
GrahamC: 9-Jan-2013 | since this is the trace HEAD /index.html HTTP/1.0 Accept: */* Accept-Charset: utf-8 Host: www.rebol.com User-Agent: REBOL HTTP/1.1 200 OK Date: Wed, 09 Jan 2013 09:03:18 GMT Server: Apache Last-Modified: Sat, 15 Dec 2012 07:02:21 GMT Accept-Ranges: bytes Content-Type: text/html Via: 1.1 BC5-ACLD Content-Length: 7407 Connection: close | |
Andreas: 26-Feb-2013 | No bug, READ no longer automatically decodes binary to strings. Use READ/string to obtain a Unicode string by decoding the binary with UTF-8. |