AltME groups: search

results summary

world    hits
r4wp     115
r3wp     287
total:   402

results window for this page: [start: 1 end: 100]

world-name: r4wp

Group: #Red ... Red language group [web-public]
DocKimbel:
5-Aug-2012
Red: I'm still working on both the compiler and the minimal runtime 
required to run simple Red programs. I have only the very basic datatypes 
working for now, no objects (so no ports) yet. I'm not yet at the point 
where I can give an accurate ETA for the first alpha, but I hope 
to be able to provide that ETA in a week.


Red string! datatype will support Unicode (UTF-8 and UTF-16 encoding 
internally). I haven't implemented Unicode yet, so if some of you 
are willing to provide efficient code for supporting Unicode, that 
would greatly speed up Red progress. 

The following functions would be needed (coded in Red/System):

- UTF-8 <=> UTF-16 LE conversion routines

- (by order of importance) length?, compare (two strings), compare-case, 
pick, poke, at, find, find-case
- optionally: uppercase, lowercase, sort


All the above functions should be coded both for UTF-8 and UTF-16 
LE.
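
A minimal sketch of the UTF-8 to UTF-16 LE conversion requested above, written in Python rather than Red/System so that it can be run as-is. The function names are hypothetical, and the decoder assumes well-formed input: it does not reject overlong forms or encoded surrogates.

```python
def utf8_to_codepoints(data: bytes) -> list:
    """Decode UTF-8 bytes into a list of integer codepoints."""
    codepoints = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:                      # 0xxxxxxx: 1 byte
            cp, size = lead, 1
        elif lead >> 5 == 0b110:             # 110xxxxx: 2 bytes
            cp, size = lead & 0x1F, 2
        elif lead >> 4 == 0b1110:            # 1110xxxx: 3 bytes
            cp, size = lead & 0x0F, 3
        elif lead >> 3 == 0b11110:           # 11110xxx: 4 bytes
            cp, size = lead & 0x07, 4
        else:
            raise ValueError("invalid lead byte at offset %d" % i)
        if i + size > len(data):
            raise ValueError("truncated sequence at offset %d" % i)
        for k in range(1, size):
            cont = data[i + k]
            if cont >> 6 != 0b10:            # continuation bytes are 10xxxxxx
                raise ValueError("invalid continuation byte at offset %d" % (i + k))
            cp = (cp << 6) | (cont & 0x3F)
        codepoints.append(cp)
        i += size
    return codepoints


def codepoints_to_utf16le(codepoints) -> bytes:
    """Encode codepoints as UTF-16 LE, using surrogate pairs above U+FFFF."""
    out = bytearray()
    for cp in codepoints:
        if cp >= 0x10000:
            v = cp - 0x10000
            units = (0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF))
        else:
            units = (cp,)
        for unit in units:
            out += unit.to_bytes(2, "little")
    return bytes(out)


def utf8_to_utf16le(data: bytes) -> bytes:
    return codepoints_to_utf16le(utf8_to_codepoints(data))
```

With this split, a codepoint-based length? is simply the length of the list returned by `utf8_to_codepoints`.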
DocKimbel:
5-Aug-2012
In case you wonder why Red needs both UTF formats, well, it's simple: 
Windows and UNIX worlds use different encodings, so we need to support 
both. Red will use by default UTF-8 for string values, but on Windows 
platform, it will convert the string to UTF-16 on first call to an 
OS API, and will keep that encoding later on (and avoid the overhead 
of converting it each time). 


We might want to make the UTF-16 related code platform-dependent and 
not include it for other platforms, but I think that some text processing 
algorithms might benefit from a fixed-size encoding, so for now, 
I'm for including both encodings for all targets.


It will be also possible for users to check and change the encoding 
of a Red string! value at runtime.
BrianH:
5-Aug-2012
Keep in mind that even UTF-16 is not a fixed-size encoding. Each 
codepoint either takes 2 or 4 bytes.
BrianH:
5-Aug-2012
UTF-32 (aka UCS4) is a fixed-size encoding. It's rarely used though.
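
The size difference is easy to check. The sketch below (Python, with a hypothetical helper name) reports how many bytes a single codepoint takes in each encoding, using the standard codecs.

```python
def encoded_sizes(ch: str) -> tuple:
    """Return (UTF-8, UTF-16, UTF-32) byte counts for one codepoint."""
    if len(ch) != 1:
        raise ValueError("expected exactly one codepoint")
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),   # 2 bytes in the BMP, 4 beyond it
        len(ch.encode("utf-32-le")),   # always 4 bytes
    )
```

For example, `encoded_sizes("A")` gives `(1, 2, 4)`, while a codepoint outside the BMP such as U+1F600 gives `(4, 4, 4)`: only the UTF-32 column never changes.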
BrianH:
4-Sep-2012
There is a bit that is worth learning from R3's Unicode transition 
that would help Red.


First, make sure that strings are logically series of codepoints. 
Don't expose the internal structure of strings to code that uses 
them. Different underlying platforms do their Unicode APIs using 
different formats, so on different platforms you might need to implement 
strings differently. You don't want these differences affecting the 
Red code that uses these strings.


Don't have direct equivalence between binary! and string! - require 
conversion between them. No AS-STRING and AS-BINARY functions. Don't 
export the underlying binary data. If you do, the code that uses 
strings would come to depend on a particular underlying format, and 
would then break on platforms where the underlying format is different. 
Also, if you provide access to the underlying binary data to Red 
code, you have to assume that the format of that data can be corrupted 
at any moment, so you'll have to add a lot of verification code, 
and your compiler won't be able to get rid of it.


Work in codepoints, not characters. Unicode characters are complicated 
and can involve multiple codepoints, or not, but until you display 
it none of that matters.


R3 uses fixed-length encodings of strings internally in order to 
speed things up, but that can cause problems when running on underlying 
platforms that use variable-length encodings in their APIs, like 
Linux (UTF-8) and Windows/Java/.NET/OSX? (UTF-16). This makes sense 
for R3 because the underlying code is compiled, but the outer code 
is not, and there's no way to break that barrier. With Red the string 
API could be logical, with the optimizer making the distinction go 
away, so you might be able to get away with using variable-length 
encodings internally if that makes sense to you. Length and index 
would be slower, but there'd be less overhead when calling external 
API functions, so make the tradeoff that works best for you.
BrianH:
4-Sep-2012
That's not as hard as it sounds. There are only 3 API models in wide 
use: UTF-16, UTF-8, and no Unicode support at all. A given port of 
Red would only have to support one of those on a given platform.
DocKimbel:
4-Sep-2012
So far, my short-list of encodings to support are UTF-8 and UTF-16LE. 
UTF-32 might be needed at some point in the future, but for now, 
I'm not aware of any system that uses it?


The Unicode standard by itself is not the problem (having just one 
encoding would have helped, though). The issue lies in different 
OSes supporting different encodings, so it makes the choice for an 
internal x-platform encoding hard. It's a matter of Red internal 
trade-offs, so I need to study the possible internal resources usage 
for each one and decide which one is the more appropriate. So far, 
I was inclined to support both UTF-8 and UTF-16LE fully, but I'm 
not sure yet that's the best choice. To avoid surprising users with 
inconsistent string operation performance, I thought to give users 
explicit control over string format, if they need such control (by 
default, Red would handle all automatically internally). For example, 
on Windows:

    s: "hello"		;-- UTF-8 literal string

    print s		;-- string converted to UCS2 for printing through win32 API
    write %file s	;-- string converted back to UTF-8

    set-modes s 'encoding 'UTF-16 ;-- user deciding on format
or
    s/encoding: 'UTF-16

    print length? s	;-- Length? then runs in O(1), no surprise.



Supporting ANSI as internal encoding seems useless, being able to 
just export/import it should suffice.

BTW, Brian, IIRC, OS X relies on UTF-8 internally not UTF-16.
DocKimbel:
4-Sep-2012
set-modes s 'encoding 'UTF-16
should rather be:
    set-modes s [encoding: UTF-16]
BrianH:
4-Sep-2012
Be sure to not forget the difference between UTF-16 (variable-length 
encoding of all of Unicode) and UCS2 (fixed-length encoding of a 
subset of Unicode). Windows, Java and .NET support UTF-16 (barring 
the occasional buggy code that assumes fixed-length encoding). R3's 
current underlying implementation is UCS2, with its character set 
limitations, but its logical model is codepoint-series.
BrianH:
4-Sep-2012
IIRC Python 3 uses UCS4 internally for its Unicode strings, with 
all of the overhead that implies. UCS4 and UTF-32 are the same thing, 
both fixed-length.
BrianH:
4-Sep-2012
If you support different internal string encodings on a given platform, 
be sure to not give logical access to the underlying binary data 
to Red code. The get/set-modes model is good for that kind of thing. 
If the end developer knows that the string will be grabbed from something 
that provides UTF-8 and passed along to something that takes UTF-8, 
they might be better off choosing UTF-8 as an underlying encoding. 
However, that should just be a mode - their interaction with the 
string should follow the codepoint model. If the end developer will 
be working directly with encoded data, they should be working with 
binary! values.
BrianH:
4-Sep-2012
Btw, in this code above:
    s/encoding: 'UTF-16
    print length? s	;-- Length? then runs in O(1), no surprise.


Length is not O(1) for UTF-16, it's O(n). Length is only O(1) for 
the fixed-length encodings.
BrianH:
4-Sep-2012
Ah, but length is even O(n) for BMP characters in a UTF-16 string, 
because figuring out that there are only BMP characters in there 
is an O(n) operation. To be O(1) you'd have to mark some flag in 
the string when you add the characters in there in the first place.
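
A rough sketch of the flag idea, in Python with a hypothetical class name: the codepoint count and a BMP-only flag are maintained on every append, so length becomes O(1) and indexing is O(1) as long as no surrogate pair has been stored. This is one possible design, not how Red or R3 implement strings.

```python
class Utf16String:
    """Toy UTF-16 string that tracks its codepoint count on append."""

    def __init__(self):
        self.units = []        # 16-bit code units
        self.length = 0        # number of codepoints, kept up to date
        self.bmp_only = True   # cleared once a surrogate pair is stored

    def append(self, cp: int) -> None:
        if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError("not a Unicode scalar value: %r" % cp)
        if cp >= 0x10000:
            v = cp - 0x10000
            self.units += [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]
            self.bmp_only = False
        else:
            self.units.append(cp)
        self.length += 1

    def pick(self, index: int) -> int:
        """Return the codepoint at a 0-based codepoint index."""
        if not 0 <= index < self.length:
            raise IndexError(index)
        if self.bmp_only:
            return self.units[index]           # O(1): one unit per codepoint
        i = 0
        for _ in range(index):                 # O(n): skip over surrogate pairs
            i += 2 if 0xD800 <= self.units[i] <= 0xDBFF else 1
        unit = self.units[i]
        if 0xD800 <= unit <= 0xDBFF:
            low = self.units[i + 1]
            return 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
        return unit
```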
DocKimbel:
4-Sep-2012
Ok, if you really want to be nitpicking, replace UTF-16 with UCS-2. 
;-)
BrianH:
4-Sep-2012
If you are ensuring that only BMP characters are in there then you 
have UCS2, not UTF-16 :)
BrianH:
4-Sep-2012
Don't worry, I'm only nitpicking to make things better. There's a 
lot of buggy code out there that assumes UTF-16 is UCS2, so we're 
better off making that distinction right away :)
DocKimbel:
7-Sep-2012
Brian: I was wrong for OS X, it uses UTF-16 internally according 
to http://en.wikipedia.org/wiki/UTF-16
DocKimbel:
24-Sep-2012
Conversion for printing in UTF-16 done on-the-fly (no additional 
buffer needed)
BrianH:
24-Sep-2012
Will you eventually be doing the same trick R3 does of keeping its 
symbols in UTF-8 format internally, for binary hashing? Of course 
you might be handling symbols completely differently...
DocKimbel:
24-Sep-2012
Yes, I currently keep a UTF-8 version in cache for each small string, 
but I'm not sure I will keep it.
PeterWood:
26-Sep-2012
Is the source file of your Czech version UTF-8 encoded?
DocKimbel:
26-Sep-2012
(just select UTF-8 when saving)
Pekr:
26-Sep-2012
hello.red is already UTF-8, I just added one line and saved ...
DocKimbel:
26-Sep-2012
Be sure you've saved it in UTF-8.
Pekr:
26-Sep-2012
well, anyway - how is R2 being able to read utf-8 anyway?
DocKimbel:
26-Sep-2012
It reads it as a stream of bytes. As UTF-8 doesn't use null bytes 
in its encoding (except for codepoint 0), it can be fully loaded 
as string! or binary! in R2 (but you'll see garbage for non-ASCII 
characters).
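
The effect can be reproduced with a short Python sketch. It assumes, for illustration only, that an 8-bit string maps each byte to one Latin-1 character (R2 actually uses the platform codepage); the helper name is hypothetical.

```python
def read_as_8bit(raw: bytes) -> str:
    """Interpret every byte as one character, the way an 8-bit string would.

    Latin-1 is an assumption made for this sketch.
    """
    return raw.decode("latin-1")


raw = "Hallå Världen!".encode("utf-8")
garbled = read_as_8bit(raw)   # each non-ASCII character shows up as two characters
```

No byte is lost: `garbled.encode("latin-1")` gives back the original UTF-8 bytes, which is why the data survives loading even though it displays as garbage.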
PeterWood:
26-Sep-2012
If anybody can provide the UTF-8 chars (hex values)  for Hello World 
in Czech. I'll run a test.
DocKimbel:
26-Sep-2012
The above string doesn't work as-is in Red though, you should pass 
the codepoints escaped instead of the UTF-8 encoding.
Pekr:
26-Sep-2012
Above works ... but when I write it directly in Notepad (and the 
file claims it is UTF-8), it does not work ... strange then ...
Henrik:
26-Sep-2012
Not sure if Notepad is the best for UTF-8 work...
DocKimbel:
26-Sep-2012
Pekr: try to set the "encoding" field to UTF-8 in the saving panel 
(Save as...).
Pekr:
26-Sep-2012
it is set to UTF-8 already ....
Pekr:
26-Sep-2012
In R3, if script is in the UTF-8 format, I can imo directly type 
it in Notepad ...
MagnussonC:
26-Sep-2012
Tested "Hallå Världen!" on Win 7 (UTF-8) and it works. Saving the 
file as Notepad's "Unicode" doesn't work, but I understand "Unicode" 
isn't supposed to be UTF.
DocKimbel:
26-Sep-2012
I guess that "Unicode" mode of Notepad is UTF-16. Red accepts only 
UTF-8 input scripts.
DocKimbel:
26-Sep-2012
You need to change the encoding selector when saving with Notepad 
to UTF-8.
MagnussonC:
26-Sep-2012
Yes, I tested with a UTF-8 encoded file
Andreas:
26-Sep-2012
I noticed that the red/tests/hello.red file is UTF-8 with a BOM -- 
I'd suggest dropping the BOM, as using a BOM with UTF8 is not recommended.
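
Dropping the BOM from an existing file is a three-byte check. A minimal sketch in Python (the helper name is hypothetical):

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 BOM (EF BB BF) if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```

Python's `utf-8-sig` codec does the same thing while decoding, for readers that cannot control how the file was saved.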
BrianH:
20-Oct-2012
Note that if you specify the length, it applies to the length of 
the script after the header and an optional newline after it (cr, 
crlf or lf). Same goes for the checksum. Both apply to binary data, 
meaning the source in UTF-8 encoding and with newlines in the style 
that they are specified in the file.
PeterWood:
30-Oct-2012
AFAIK, Windows consoles only support Windows 8-bit codepages or 
UTF-16. Red/System can print the full range of UTF-8 characters (as 
can REBOL) but the console can't display them.
Kaj:
30-Oct-2012
Ah, right, I'd have to use UTF-16 source text
PeterWood:
30-Oct-2012
You would need to check that the Windows console is set to display 
UTF-16B.


This commit ( https://github.com/dockimbel/Red/commit/be271889ff03e44bdb55af04b60ea2bb280cb18f
) shows how.
PeterWood:
30-Oct-2012
The other way is to convert the UTF-8 c-string! to UTF-16LE integers 
on the fly and feed them into libc putwchar yourself. More work 
upfront but may be easier in the long term.


The code in red/runtime/platform/win32.reds is a pretty clear example 
of how to do it but you would still need to write the UTF-8 to UTF-16LE 
on the fly conversion yourself. (That one is UCS-4 to UTF-16LE).
DocKimbel:
31-Oct-2012
Kaj: you can switch the Windows console to a UTF-8 compatible mode 
using _setmode():
http://msdn.microsoft.com/en-us/library/tw4k6df8.aspx


I haven't tested it but it should work. Windows natively uses UTF-16LE, 
so you would probably have a speed penalty using that mode.
Kaj:
1-Nov-2012
hello-Unicode is because the program source is UTF-8 instead of UTF-16
DocKimbel:
1-Nov-2012
Kaj: Red source scripts should always be UTF-8 encoded regardless 
of the platform.
DocKimbel:
8-Nov-2012
A series buffer has a header, with OFFSET and TAIL pointers that define 
respectively the beginning and end of series slots. The OFFSET pointer 
allows reserving space at the head of the series for optimizing insertions 
at head. Series slot size can be 1 (binary/UTF-8/Latin-1), 2 (UCS-2), 
4 (UCS-4) or 16 (value!) bytes wide.
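
To illustrate the slot widths for strings, here is a Python sketch that picks the narrowest unit able to hold every codepoint and lays the text out in fixed-size slots. The selection rule, the little-endian layout and the function names are assumptions made for the example, not a description of the Red runtime.

```python
def unit_for(text: str) -> int:
    """Return the slot width in bytes: 1 (Latin-1), 2 (UCS-2) or 4 (UCS-4)."""
    highest = max(map(ord, text), default=0)
    if highest <= 0xFF:
        return 1
    if highest <= 0xFFFF:
        return 2
    return 4


def to_slots(text: str) -> tuple:
    """Encode text as fixed-width little-endian slots; returns (unit, bytes)."""
    unit = unit_for(text)
    return unit, b"".join(ord(ch).to_bytes(unit, "little") for ch in text)
```

Because every slot has the same width, the number of codepoints is just the buffer size divided by the unit, which is the property that makes length and indexing O(1).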
DocKimbel:
10-Nov-2012
Red should provide a UTF-8 codec. For national encodings, we would 
probably proceed by offering on-demand online codecs for the most 
used ones. That could be a shared resource with R3.
DocKimbel:
10-Nov-2012
BTW, we already have a UTF-8 binary parser in the Red compiler.
DocKimbel:
27-Dec-2012
_setmode call is used to properly set the DOS console to UTF-16 (Unicode 
mode).
DocKimbel:
29-Dec-2012
You should wait for me to add the marshalling and unmarshalling functions 
(that will be used everywhere Red needs to interface with non-Red 
code). In your code example, it should be: 1 + length? version (as 
it needs to account for the terminal NUL character). Also, you need to 
make sure that the source c-string! buffer is always available or 
make a copy of it (a pointer to it is stored as a UTF-8 cache, unused 
yet, but intended for speeding up I/O, still experimental, not sure 
it will stay for v1.0).
Kaj:
10-Apr-2013
It was my understanding that string/rs-head returns a UTF-8 cache 
of a string. How can I get this value?


I'm trying to get UTF-8 back that I fed in. The problem I'm having 
is the following:

write %syllable.org.html read "http://syllable.org"

This writes out just one character instead of the expected file.
Kaj:
15-Apr-2013
Doc, any idea how I can convert a string! passed into a routine! 
to UTF-8, or access a cached UTF-8 value?
PeterWood:
16-Apr-2013
The answer is no, you can't, as mold doesn't output a UTF-8 string.
DocKimbel:
16-Apr-2013
Kaj: cached UTF-8 string is available using str/cache if str is a 
red-string! value.
DocKimbel:
16-Apr-2013
We haven't yet implemented UTF-8 encoding functions in the standard 
library. It will be done during the I/O implementation (unless you 
have a strong need for it, then I'll have a look at it).
Kaj:
17-Apr-2013
UTF-8 encoding would be very welcome. My I/O frameworks are of little 
use without it
DocKimbel:
17-Apr-2013
Red/System c-strings are UTF-8 compatible.
Kaj:
17-Apr-2013
I mean my ongoing request for getting UTF-8 in routines
DocKimbel:
17-Apr-2013
But someone could contribute string! <=> UTF-8 conversion routines 
in the meantime.
DocKimbel:
17-Apr-2013
I think that those UTF-8 conversion routines would take at least 
two days of work to get implemented and debugged. I'll see once 
I get shared libs done if I can afford them before working on the 
other urgent tasks.
DocKimbel:
17-Apr-2013
Well, as I said, someone could contribute those UTF-8 conversion 
routines.
DocKimbel:
17-Apr-2013
For Android, Java uses UTF-16, so the conversion from string! is 
(almost) trivial.
PeterWood:
17-Apr-2013
I'd be happy to look at a UCS-2 to UTF-8 conversion function but 
I don't have the time to do it at the moment.
PeterWood:
17-Apr-2013
I've written a quick function that will take a Red char (UCS4) and 
output the equivalent UTF-8 as bytes stored in a struct!.


It can be used as the base for converting a Red string to UTF-8. What 
is needed is to extract Red char! values from the Red string, call the 
function and then append the UTF-8 to a c-string!
PeterWood:
17-Apr-2013
You can find it at:


https://github.com/PeterWAWood/Red-System-Libs/blob/master/UTF-8/ucs4-utf8.reds
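
For readers who want the encoding rule without opening the link, here is an independent Python sketch of a single-codepoint UCS-4 to UTF-8 encoder. It is not Peter's Red/System code, and the function name is hypothetical.

```python
def ucs4_to_utf8(cp: int) -> bytes:
    """Encode one codepoint as 1 to 4 UTF-8 bytes."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value: %r" % cp)
    if cp < 0x80:                                  # 7 bits: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                                 # 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                               # 16 bits: 1110xxxx + 2 continuation bytes
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),               # 21 bits: 11110xxx + 3 continuation bytes
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])
```

Converting a whole string then amounts to calling this once per codepoint and concatenating the results.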
PeterWood:
18-Apr-2013
For me the big issue of turning the function into the UTF-8 string 
that Kaj wants is "How to allocate a c-string! using the Red Memory 
Manager rather than malloc"

Any suggestions appreciated.
DocKimbel:
18-Apr-2013
Here's what your main loop would look like for retrieving every codepoint 
from a string! value:

	p:    string/rs-head str		;-- pointer to the first slot
	tail: string/rs-tail str		;-- pointer just past the last slot
		
	s: GET_BUFFER(str)
	unit: GET_UNIT(s)			;-- slot width in bytes
		
	while [p < tail][
		cp: switch unit [
			Latin1 [as-integer p/value]
			UCS-2  [(as-integer p/2) << 8 + p/1]
			UCS-4  [p4: as int-ptr! p p4/value]
		]
		...emit UTF-8 char...
		p: p + unit			;-- advance to the next slot
	]
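
The same loop can be sketched in Python for experimentation. It assumes the buffer holds fixed-width little-endian slots of 1, 2 or 4 bytes, and it leans on Python's built-in encoder for the "emit UTF-8 char" step; the function names are hypothetical.

```python
def codepoints(buf: bytes, unit: int):
    """Yield one integer codepoint per slot of `unit` bytes (little-endian)."""
    if unit not in (1, 2, 4):
        raise ValueError("unit must be 1 (Latin-1), 2 (UCS-2) or 4 (UCS-4)")
    if len(buf) % unit:
        raise ValueError("buffer size is not a multiple of the unit")
    for offset in range(0, len(buf), unit):
        yield int.from_bytes(buf[offset:offset + unit], "little")


def buffer_to_utf8(buf: bytes, unit: int) -> bytes:
    """Walk the slots and emit the UTF-8 encoding of every codepoint."""
    return "".join(chr(cp) for cp in codepoints(buf, unit)).encode("utf-8")
```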
PeterWood:
18-Apr-2013
I should be able to turn this into a function for Kaj to include 
in his routine! where he needs UTF-8
DocKimbel:
18-Apr-2013
Kaj is working on Linux and Syllable only. Also that API provides 
UTF-16 to UTF-8 support, but we need also UCS-4 to UTF-8 (UCS-2 being 
a subset of UTF-16).
PeterWood:
19-Apr-2013
Kaj - You can find a rough and ready  red-string! to  c-string! function 
at:


https://github.com/PeterWAWood/Red-System-Libs/blob/master/UTF-8/string-c-string.reds


it #includes the UCS4 character to UTF8 convertor which you will 
need in the same directory as the string-c-string func.
PeterWood:
19-Apr-2013
The ucs4 -> utf8 char convertor:


https://github.com/PeterWAWood/Red-System-Libs/blob/master/UTF-8/ucs4-utf8.reds
PeterWood:
19-Apr-2013
I haven't really tested it as you can see from :


https://github.com/PeterWAWood/Red-System-Libs/blob/master/UTF-8/Tests/string-c-string-test.red
DocKimbel:
19-Apr-2013
Peter, maybe you could use the ALLOCATE function from Red/System and 
let Kaj's code call FREE on UTF-8 buffers after usage?
PeterWood:
24-Apr-2013
Being written in REBOL/View, AltME encodes characters in the Windows 
codepage under Windows, MacRoman under OS X and UTF-8 (I think, it 
may be ISO-8859-1) under Linux. So if you use any character other 
than standard ASCII characters, it will appear differently on different 
systems.
Kaj:
26-Apr-2013
I've been working on adding UTF-8 support for the past week, so you'll 
see construction soon
Kaj:
26-Apr-2013
Here's what happens when I paste UTF-8 in the console on Linux:
DocKimbel:
26-Apr-2013
Red input sources must be UTF-8 encoded.
DocKimbel:
26-Apr-2013
You can't paste UTF-8 in the console, it supports only Latin-1.
DocKimbel:
26-Apr-2013
Are you sure you're pasting Latin-1 and not UTF-8?
Kaj:
26-Apr-2013
string/load can only load UTF-8, so only ASCII and UTF-8 files can 
be read, not Latin-1
DocKimbel:
26-Apr-2013
For: print read "http://syllable.org", do you feed string/load with 
a UTF-8 input even on Windows?
Kaj:
26-Apr-2013
Actually, I did one test that confirms Andreas' statement. The only 
way to get 8-bit data in is to compile a UTF-8 string literal that 
fits into Latin-1
Kaj:
26-Apr-2013
No, the console says you can input Latin-1, and you can't, not even 
through UTF-8
Group: Announce ... Announcements only - use Ann-reply to chat [web-public]
Kaj:
27-Apr-2013
I implemented UTF-8 output support for Red. I ended up writing optimised 
versions based more on the Red print backend. I integrated them in 
my I/O routines and made heavy performance optimisations. Thanks 
to Peter for leading the way. There are the following Red/System 
encoders embedded in %common.red:

http://red.esperconsultancy.nl/Red-common/dir?ci=tip


to-UTF8: encodes a Red string into UTF-8 Red/System c-string! format.

to-local-file: encodes a Red string into Latin-1 Red/System c-string! 
format on Windows, and into UTF-8 on other systems. This yields a 
string suitable for the local file name APIs. Latin-1 can be output 
as long as it was input into Red via UTF-8. Non-Latin-1 code points 
cannot be encoded in Latin-1 and yield a NULL for the entire result.


These encoders make use of the Latin1-to-UTF8, UCS2-to-UTF8 and UCS4-to-UTF8 
encoding functions. An example of their use in the Red READ and WRITE 
functions is in %input-output.red
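
The behaviour described for to-local-file can be modelled in a few lines of Python. This is only a sketch of the rule as stated above (Latin-1 on Windows, UTF-8 elsewhere, no result when a code point does not fit), not Kaj's Red/System implementation; the function and parameter names are hypothetical, and `None` stands in for NULL.

```python
import sys


def to_local_file(name: str, windows: bool = (sys.platform == "win32")):
    """Encode a file name for the local file APIs.

    Returns Latin-1 bytes on Windows and UTF-8 bytes elsewhere. If the name
    contains a code point outside Latin-1 on Windows, the whole result is None.
    """
    if not windows:
        return name.encode("utf-8")
    try:
        return name.encode("latin-1")
    except UnicodeEncodeError:
        return None
```

Returning no result at all, rather than a partially encoded name, means the caller cannot silently open or create a file under the wrong name.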
Kaj:
27-Apr-2013
I used the new encoding functions in all my Red bindings: those for 
the C library, input/output via files and cURL, 0MQ, SQLite and GTK+. 
In as many places as possible, data marshalled to the external libraries 
now supports UTF-8. File names on Windows support Latin-1. Files 
and URLs are always read and written as UTF-8, including on Windows. 
Red does not support loading Latin-1 strings.
Kaj:
27-Apr-2013
I've updated the binary downloads. The red console interpreters and 
all the Red examples include the above encoding support now, and 
all the latest Red features:

http://red.esperconsultancy.nl/Red-test/dir?ci=tip


For example, the Red/GTK-text-editor now supports writing UTF-8 files 
with UTF-8 or Latin-1 names.


I've added an MSDOS\Red\red-core.exe for Windows 2000, because the 
GTK+ libraries in red.exe require Windows XP+.
Kaj:
27-Apr-2013
I can't test the encoding on Mac, so I would be interested to hear 
if it works there, especially UTF-8 file names
Kaj:
19-Jun-2013
I changed the Red 0MQ interface to optimise the memory use during 
receiving of messages:

http://red.esperconsultancy.nl/Red-ZeroMQ-binding/info/2a1541af57


SEND and RECEIVE have been renamed to send-string and receive-string, 
because they currently handle messages as UTF-8 text. When Red gets 
a binary! type, versions for binary messages will be added, and there 
will probably be type agnostic SEND and RECEIVE wrappers again. Previously, 
you used

message: receive socket


to receive a string message. Now you pass a premade string! (similar 
to call/output in R2):

message: ""
receive-string socket message


This means that you can choose between creating new strings for each 
message (with COPY) or reusing the same string. In the latter case, 
some Red/System code in receive-string makes sure that no extra Red 
memory is used, and that all used system and 0MQ memory is freed 
again. By optimising memory use, this also improves performance of 
message throughput.
Group: Ann-Reply ... Reply to Announce group [web-public]
Kaj:
20-Feb-2013
Fossil standardises on UTF-8 and standard line endings in text files. 
I suppose I should not link to single files anymore. From a folder 
in Fossil's web UI, you can at least view those files
Group: Rebol School ... REBOL School [web-public]
Pekr:
20-Jun-2012
I use Artisteer to prototype web pages, and it saves content in UTF-8. 
Later on, I need to do few adaptations to such generated pages, so 
I opened it in R2, reparsed, inserted some stuff, deleted other, 
but it did not work out ....
Pekr:
20-Jun-2012
Use some external tool to convert it to ANSI, do adaptations, and 
convert it back to UTF-8?
Pekr:
20-Jun-2012
I mean - text I need to input into the resulting file (UTF-8) is 
ANSI. I do print to-string read %text-slider.html, and in R3 console, 
Czech text is not correct ....
Kaj:
20-Jun-2012
So you're saying the input file is not UTF-8?
Pekr:
20-Jun-2012
Yes, ANSI. I solved it by re-saving the same source file as UTF-8 
instead of ANSI. Still a bad complication, as by default, Windows 
sets Notepad to ANSI, so it is a bit inconvenient ...
BrianH:
20-Jun-2012
Petr, R3 can't decode any 8bit encodings with its built-in code, 
just ASCII (which is 7bit) and UTF-8. However, its binary handling 
is better so it should be easy to write your own converters. For 
R2, I would suggest looking at Gabriele's PowerMezz package; it has 
some great text converters. Of course you lose out on R3's PARSE 
if you use R2.
Arnold:
21-Jun-2012
On my Mac, the script I made on Windows using a couple of international 
characters also displays the chars wrong: "Nederlands" "English" 
"Deutsch" "Français" "Español" "Italiano" "Português". When I saved 
as UTF-8 I hoped my problems would be resolved, but then REBOL 
complained my script had no REBOL header. :-(
Arnold:
22-Jun-2012
And knowing that even this small community has fewer members than the 
diacritics they use in everyday life, it is a requirement to deal with 
UTF-8, UCS or other encodings.
Group: !REBOL3 ... General discussion about REBOL 3 [web-public]
GrahamC:
9-Jan-2013
since this is the trace

HEAD /index.html HTTP/1.0
Accept: */*
Accept-Charset: utf-8
Host: www.rebol.com
User-Agent: REBOL

HTTP/1.1 200 OK
Date: Wed, 09 Jan 2013 09:03:18 GMT
Server: Apache
Last-Modified: Sat, 15 Dec 2012 07:02:21 GMT
Accept-Ranges: bytes
Content-Type: text/html
Via: 1.1 BC5-ACLD
Content-Length: 7407
Connection: close
Andreas:
26-Feb-2013
No bug, READ no longer automatically decodes binary to strings. 
Use READ/string to obtain a Unicode string by decoding 
the binary with UTF-8.