World: r4wp
[#Red] Red language group
Jerry 4-Sep-2012 [1538] | I am glad that you are doing the Unicode part now. Better support it sooner than later. Back to 2008, I was one of the three Unicode testers for Carl, and I found many bugs and reported them back to Carl before he released it to the public. |
BrianH 4-Sep-2012 [1539x4] | There is a bit that is worth learning from R3's Unicode transition that would help Red. First, make sure that strings are logically series of codepoints. Don't expose the internal structure of strings to code that uses them. Different underlying platforms do their Unicode APIs using different formats, so on different platforms you might need to implement strings differently. You don't want these differences affecting the Red code that uses these strings. Don't have direct equivalence between binary! and string! - require conversion between them. No AS-STRING and AS-BINARY functions. Don't export the underlying binary data. If you do, the code that uses strings would come to depend on a particular underlying format, and would then break on platforms where the underlying format is different. Also, if you provide access to the underlying binary data to Red code, you have to assume that the format of that data can be corrupted at any moment, so you'll have to add a lot of verification code, and your compiler won't be able to get rid of it. Work in codepoints, not characters. Unicode characters are complicated and can involve multiple codepoints, or not, but until you display it none of that matters. R3 uses fixed-length encodings of strings internally in order to speed things up, but that can cause problems when running on underlying platforms that use variable-length encodings in their APIs, like Linux (UTF-8) and Windows/Java/.NET/OSX? (UTF-16). This makes sense for R3 because the underlying code is compiled, but the outer code is not, and there's no way to break that barrier. With Red the string API could be logical, with the optimizer making the distinction go away, so you might be able to get away with using variable-length encodings internally if that makes sense to you. Length and index would be slower, but there'd be less overhead when calling external API functions, so make the tradeoff that works best for you. |
If there are parts of R2 or R3 that you like or don't like, don't assume that they are part of the design. There's a lot of stuff in there that doesn't match the design, is buggy or unfinished. Also, for R3, don't assume that only Carl knows the design. He worked with others, discussed his design with the other contributors. There's some stuff which only he can answer though, and some design decisions that weren't resolved, let alone implemented. | |
The concurrency model was not fully designed, for instance, and almost completely not implemented. | |
However, the part of the concurrency model that was designed so far affected the design and implementation of the system model and module system. You'd be surprised how much the module system was affected by the system, binding and interpretation model of R3; very little of its design and implementation was arbitrary. You might be able to get the syntax the same for Red's module system, but given the different system/binding/execution model there wouldn't be much of the implementation in common. | |
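Two of BrianH's points above (work in codepoints rather than characters, and require an explicit conversion between string! and binary!) can be illustrated outside Red. The following is only a sketch in Python, whose `str` and `bytes` types happen to follow the same separation; it is not Red code and says nothing about how Red implements strings.

```python
import unicodedata

# "é" written two ways: one precomposed codepoint, or "e" plus a combining accent.
precomposed = "\u00e9"
decomposed = "e\u0301"

# Both display as the same character, but they are different codepoint series.
print(len(precomposed), len(decomposed))   # 1 2
print(precomposed == decomposed)           # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True

# Text and bytes are distinct types; moving between them needs an explicit
# conversion that names the encoding.
as_utf8 = precomposed.encode("utf-8")
as_utf16 = precomposed.encode("utf-16-le")
print(as_utf8, as_utf16)                   # b'\xc3\xa9' b'\xe9\x00'
print(as_utf8.decode("utf-8") == as_utf16.decode("utf-16-le"))  # True
```

Code that only ever sees the codepoint series cannot tell which byte layout sits underneath, which is the property being argued for.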
sqlab 4-Sep-2012 [1543] | I am for sure no expert regarding Unicode, but as Red is a compiler and open source, why not add flags that let the user choose which Unicode/string support he wants: either flexibility at the cost of speed, or no Unicode support, in which case he has to do the hard work himself? |
BrianH 4-Sep-2012 [1544x2] | One hypothetical advantage you have with Red is that you can make the logical behavior fairly high-level and have the compiler/optimizer get rid of that at runtime. REBOL, being interpreted, is effectively a lower-level language requiring hand optimization, the kind of hand optimization that you'd want to prohibit in Red because it would interfere with the machine optimization. This means that, for strings at least, it would make sense to have the logical model have a lot of the same constraints as that of R3 (because those constraints were inherent in the design of Unicode), but make the compiler aware of the model so it can translate things to a much lower level. If you break the logical model though, you remove the power the compiler has to optimize things. |
sqlab, it would make sense to have the user choose the underlying model if you are doing Red on bare metal and implementing everything yourself, or running on a system with no Unicode support at all. If you are running a Red program on an existing system with Unicode support, the choice of which model is best has already been made for you. In those cases choosing the best underlying model would best be made by the Red porter, not the end developer. | |
sqlab 4-Sep-2012 [1546] | But that means that Red has to support all Unicode models on all the systems it can be compiled for. |
BrianH 4-Sep-2012 [1547x2] | That's not as hard as it sounds. There are only 3 API models in wide use: UTF-16, UTF-8, and no Unicode support at all. A given port of Red would only have to support one of those on a given platform. |
Red user code would only need to support the codepoint-series model; Red would translate that into the system's preferred underlying model. More encodings would need to be supported for conversion during I/O, of course, but not for API or internal use. | |
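The idea that user code sees one codepoint-series model while each port translates to the platform's preferred encoding can be sketched as a single boundary function. This is a Python illustration, not Red: the `API_ENCODING` table, the platform names and `to_api_bytes` are hypothetical, and mapping "no Unicode support" to ASCII with replacement characters is an assumption made only for the example.

```python
# Hypothetical table: which encoding each platform's API expects.
API_ENCODING = {
    "windows": "utf-16-le",   # UTF-16 API, as discussed above
    "linux": "utf-8",         # UTF-8 API, as discussed above
    "none": "ascii",          # assumed stand-in for a system with no Unicode support
}

def to_api_bytes(text: str, platform: str) -> bytes:
    """Encode a codepoint string only at the point where it leaves for the OS."""
    # errors="replace" only matters for the ASCII case, where it yields "?".
    return text.encode(API_ENCODING[platform], errors="replace")

message = "Red \u2665"   # 5 codepoints, the last one outside ASCII
for platform in API_ENCODING:
    print(platform, len(to_api_bytes(message, platform)))
```

The caller never branches on the platform; only the port-specific table changes.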
DocKimbel 4-Sep-2012 [1549] | So far, my short-list of encodings to support is UTF-8 and UTF-16LE. UTF-32 might be needed at some point in the future, but for now, I'm not aware of any system that uses it. The Unicode standard by itself is not the problem (having just one encoding would have helped, though). The issue lies in different OSes supporting different encodings, which makes the choice of an internal cross-platform encoding hard. It's a matter of Red internal trade-offs, so I need to study the possible internal resource usage for each one and decide which one is more appropriate. So far, I was inclined to support both UTF-8 and UTF-16LE fully, but I'm not sure yet that's the best choice. To avoid surprising users with inconsistent string operation performance, I thought to give users explicit control over string format, if they need such control (by default, Red would handle everything automatically internally). For example, on Windows:
    s: "hello"                       ;-- UTF-8 literal string
    print s                          ;-- string converted to UCS2 for printing through win32 API
    write %file s                    ;-- string converted back to UTF-8
    set-modes s 'encoding 'UTF-16    ;-- user deciding on format
    ;-- or
    s/encoding: 'UTF-16
    print length? s                  ;-- Length? then runs in O(1), no surprize.
Supporting ANSI as internal encoding seems useless; being able to just export/import it should suffice. BTW, Brian, IIRC, OS X relies on UTF-8 internally, not UTF-16. |
BrianH 4-Sep-2012 [1550] | Thanks, I don't know much about OSX's Unicode support. |
DocKimbel 4-Sep-2012 [1551] | set-modes s 'encoding 'UTF-16 should rather be: set-modes s [encoding: UTF-16] |
BrianH 4-Sep-2012 [1552x4] | Be sure to not forget the difference between UTF-16 (variable-length encoding of all of Unicode) and UCS2 (fixed-length encoding of a subset of Unicode). Windows, Java and .NET support UTF-16 (barring the occasional buggy code that assumes fixed-length encoding). R3's current underlying implementation is UCS2, with its character set limitations, but its logical model is codepoint-series. |
IIRC Python 3 uses UCS4 internally for its Unicode strings, with all of the overhead that implies. UCS4 and UTF-32 are the same thing, both fixed-length. | |
If you support different internal string encodings on a given platform, be sure to not give logical access to the underlying binary data to Red code. The get/set-modes model is good for that kind of thing. If the end developer knows that the string will be grabbed from something that provides UTF-8 and passed along to something that takes UTF-8, they might be better off choosing UTF-8 as an underlying encoding. However, that should just be a mode - their interaction with the string should follow the codepoint model. If the end developer will be working directly with encoded data, they should be working with binary! values. | |
Btw, in this code above: s/encoding: 'UTF-16 print length? s ;-- Length? then runs in O(1), no surprize. Length is not O(1) for UTF-16, it's O(n). Length is only O(1) for the fixed-length encodings. | |
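The UTF-16 versus UCS2 distinction made above comes down to whether a length query has to look at the data. Here is a minimal Python sketch, assuming well-formed little-endian input; both function names are made up for the example.

```python
def utf16_codepoint_count(data: bytes) -> int:
    """Count codepoints in UTF-16LE data; this has to visit every code unit."""
    count = 0
    for i in range(0, len(data), 2):
        unit = int.from_bytes(data[i:i + 2], "little")
        # A low surrogate is the second half of a pair that was already
        # counted when its high surrogate was seen, so skip it.
        if not 0xDC00 <= unit <= 0xDFFF:
            count += 1
    return count

def ucs2_codepoint_count(data: bytes) -> int:
    """UCS2 is fixed-width, so the count is just the byte length divided by two."""
    return len(data) // 2

text = "a\U0001D11E"            # "a" plus MUSICAL SYMBOL G CLEF (outside the BMP)
data = text.encode("utf-16-le")
print(ucs2_codepoint_count(data))    # 3: treating the data as UCS2 miscounts
print(utf16_codepoint_count(data))   # 2: the scan gives the real codepoint count
```

The fixed-width shortcut is only correct when the data is guaranteed to contain no surrogate pairs, which is exactly the UCS2 restriction.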
DocKimbel 4-Sep-2012 [1556x2] | Since Python 3.3, things have changed: http://www.python.org/dev/peps/pep-0393/ |
Brian: right, my claim is valid for BMP characters only. | |
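The change in Python 3.3 referenced by the PEP 393 link can be observed from Python itself. This sketch assumes CPython 3.3 or later; `sys.getsizeof` reports implementation-specific sizes, so no exact numbers are given, only that storage grows with the widest codepoint in the string.

```python
import sys

n = 1000
ascii_text = "a" * n             # every codepoint fits in 1 byte
bmp_text = "\u4e2d" * n          # BMP codepoints need 2 bytes each
astral_text = "\U0001F600" * n   # codepoints above U+FFFF need 4 bytes each

for label, text in [("ASCII", ascii_text), ("BMP", bmp_text), ("astral", astral_text)]:
    # len() counts codepoints in all three cases; only the storage width differs.
    print(label, len(text), sys.getsizeof(text))
```

Each string keeps a fixed width per codepoint, chosen per string, so length and indexing stay constant-time without paying 4 bytes per codepoint everywhere.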
BrianH 4-Sep-2012 [1558] | Ah, but length is even O(n) for BMP characters in a UTF-16 string, because figuring out that there are only BMP characters in there is an O(n) operation. To be O(1) you'd have to mark some flag in the string when you add the characters in there in the first place. |
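The flag BrianH describes can be sketched as follows: record at insertion time whether anything outside the BMP was added, so that a length query can take a constant-time path while the flag holds. This is a Python illustration; the `Utf16Buffer` class and its attribute names are hypothetical and are not taken from R3 or Red.

```python
class Utf16Buffer:
    """Append-only UTF-16LE buffer that remembers whether it is still BMP-only."""

    def __init__(self) -> None:
        self._units = bytearray()
        self.bmp_only = True

    def append(self, text: str) -> None:
        # The O(n) check is paid once, over the text being added.
        if any(ord(ch) > 0xFFFF for ch in text):
            self.bmp_only = False
        self._units += text.encode("utf-16-le")

    def __len__(self) -> int:
        if self.bmp_only:
            # Fast path: one code unit per codepoint, no scan needed.
            return len(self._units) // 2
        # Slow path: skip low surrogates (high byte 0xDC..0xDF in little-endian).
        return sum(
            1
            for i in range(0, len(self._units), 2)
            if not 0xDC <= self._units[i + 1] <= 0xDF
        )

buf = Utf16Buffer()
buf.append("hello")
print(len(buf), buf.bmp_only)   # 5 True
buf.append("\U0001F600")
print(len(buf), buf.bmp_only)   # 6 False
```

Caching the codepoint count itself on each append would make both paths constant-time, at the cost of one more field to keep consistent.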
DocKimbel 4-Sep-2012 [1559] | Ok, if you really want to be nitpicking, replace UTF-16 with UCS-2. ;-) |
BrianH 4-Sep-2012 [1560x3] | If you are ensuring that only BMP characters are in there then you have UCS2, not UTF-16 :) |
Python 3.3 seems to finally be following the R3 model, good for them. Even better for them because it's actually implemented. | |
Don't worry, I'm only nitpicking to make things better. There's a lot of buggy code out there that assumes UTF-16 is UCS2, so we're better off making that distinction right away :) | |
DocKimbel 4-Sep-2012 [1563] | Well, then I'm sure you'll be glad to write string unit tests for Red in order to ensure things are done in the proper way. ;-) |
BrianH 4-Sep-2012 [1564] | Doc, pardon me because I don't know what the intended datatype model is for Red. Something like the REBOL datatype/action model could be used to implement the different underlying string encodings that you want in-memory support for. Each supported encoding would have its own set of action handlers, which would all have the same external interface. Swapping the encoding would be as simple as swapping the handler set. Resolving the difference at compile time could be similar to generic type instantiation, or C++ template generation. |
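The handler-set idea above (one external interface, one set of handlers per in-memory encoding, and swapping the encoding by swapping the set) might look like this in outline. It is a Python sketch under stated assumptions: `Handlers`, `HANDLERS`, `Text` and `set_encoding` are invented for illustration and do not reflect the REBOL or Red datatype internals. Indexes are zero-based and assumed to be in range.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Handlers:
    codec: str
    length: Callable[[bytes], int]
    pick: Callable[[bytes, int], str]

def variable_width(codec: str) -> Handlers:
    # Variable-width storage: decode first, so length and pick are O(n).
    return Handlers(
        codec,
        lambda data: len(data.decode(codec)),
        lambda data, i: data.decode(codec)[i],
    )

def fixed_width_utf32() -> Handlers:
    # Fixed-width storage: both operations are simple arithmetic on offsets.
    return Handlers(
        "utf-32-le",
        lambda data: len(data) // 4,
        lambda data, i: data[4 * i:4 * i + 4].decode("utf-32-le"),
    )

HANDLERS = {
    "utf-8": variable_width("utf-8"),
    "utf-16": variable_width("utf-16-le"),
    "utf-32": fixed_width_utf32(),
}

class Text:
    """Codepoint-series interface; the encoding is a mode, not part of the API."""

    def __init__(self, value: str, encoding: str = "utf-8") -> None:
        self._handlers = HANDLERS[encoding]
        self._data = value.encode(self._handlers.codec)

    def set_encoding(self, encoding: str) -> None:
        value = self._data.decode(self._handlers.codec)
        self._handlers = HANDLERS[encoding]
        self._data = value.encode(self._handlers.codec)

    def __len__(self) -> int:
        return self._handlers.length(self._data)

    def __getitem__(self, index: int) -> str:
        return self._handlers.pick(self._data, index)

t = Text("a\U0001F600b")
for name in HANDLERS:
    t.set_encoding(name)
    print(name, len(t), t[1] == "\U0001F600")
```

Every encoding reports the same length and the same codepoint at each index; a compiler that knows the encoding statically could select the handler set at compile time instead of dispatching at run time.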
Kaj 4-Sep-2012 [1565] | The datatype code that's committed so far seems to go that way |
DocKimbel 4-Sep-2012 [1566] | Brian: implementing an abstraction layer over string encodings is a trivial task. The intended datatype model is very similar to REBOL's, even more so since recently, as I moved to a hybrid dynamic/static type system. I'll commit the new code in a few days, so you'll see how close to REBOL it can be. I hope that this hybrid model will help us get the best of both worlds. |
DocKimbel 5-Sep-2012 [1568] | Got my first real Red program working: Red [] print 1 outputs: 1 It doesn't look like much, but it validates the compiler + runtime from end to end, and at this point, it's really cool! FYI, the native PRINT here triggers a FORM (action) on the passed argument. No REDUCE yet (not implemented). |
james_nak 5-Sep-2012 [1569] | Congratulations Doc. That is so cool. Really! |
DocKimbel 5-Sep-2012 [1570] | Thanks James! |
Janko 5-Sep-2012 [1571] | Wow!!! Awesome .. from print "hello world" to print "<html>...</html>" to make webapps is not that long a way. I was just afraid you would never get to RED. Awesome! |
AdrianS 5-Sep-2012 [1572] | Yeah! Way to go! |
Gregg 5-Sep-2012 [1573] | Congratulations Doc! This is a great day. |
GrahamC 5-Sep-2012 [1574x2] | A milestone |
( kilometre in France ) | |
Sunanda 5-Sep-2012 [1576] | +1 for Red :) |
Henrik 6-Sep-2012 [1577] | Congratulations, doc. :-) |
sqlab 6-Sep-2012 [1578] | This gives hope |
Endo 6-Sep-2012 [1579] | Cool! Another 30€ for the good news :) |
Jerry 6-Sep-2012 [1580x2] | Wooow.. Wooow... Did I hear the crying of a baby... I think this is the birthday of Red. The past 1.5 years are just pregnancy. |
:-) Can't wait to show Red to the Chinese people. | |
Pekr 6-Sep-2012 [1582] | From now on, the child is going to grow day by day :-) |
Jerry 6-Sep-2012 [1583] | I can tell, it's gonna be a super star :-) |
Cyphre 6-Sep-2012 [1584] | Congrats, Doc! |
DocKimbel 6-Sep-2012 [1585x2] | Endo: thank you! Thanks to all for your support! |
Pekr: right, from now on, you can expect daily progress on Red layer. I will push the new code soon, I still need to complete it a bit and clean it up. Jerry: the baby looks nice, we'll just have to keep it away from junk food and it will grow up well. ;-) | |
PeterWood 6-Sep-2012 [1587] | Many congratulations Nenad. |