r3wp [groups: 83 posts: 189283]

World: r3wp

[REBOL Syntax] Discussions about REBOL syntax

Steeve
23-Feb-2012
[317x2]
hum... still wrong
url-syntax: [
	not [digit | #"'" | #"." digit | sign] word-char
	any [escape-uri | not termination-char not #":" skip]
	#":"
	any [escape-uri | #"/" | not termination-char skip]
]
BrianH
23-Feb-2012
[319x3]
That's a good start! I'm really curious about whether urls and emails 
deal with chars over 127, especially in R3. As far as I know, the 
URI standards don't support them directly, but various internationalization 
extensions add recodings for these non-ASCII characters. It would 
be good to know exactly which chars are supported in the data model, 
so we can hack the code that supports that data to match.
When last I checked, R3 considers all chars over 127 to be word-chars. 
It is considered to be non of REBOL's business whether a printer 
or display would show the character, so that even includes the additional 
Unicode space and control characters beyond ASCII. R3 has a binary 
parser, you see.
non of -> none of
Steeve
23-Feb-2012
[322]
yeah
BrianH
23-Feb-2012
[323]
Do you know if the REBOL syntax parser (LOAD and TRANSCODE) handles 
the unescaping and puts the decoded data into the url! structure, 
or if that is handled by the DECODE-URL mezzanine code? I'm hoping 
it's handled by the mezzanine, because it's broken in both R2 and 
R3 and mezzanine changes are the only kind we can make at the moment.
Maxim
23-Feb-2012
[324x3]
AFAICT  it's part of the datatype... since a space will go back and 
forth when you go to/from URL! and other types like string

(in R2 at least):
>> to-url "gogo://a.com/space here"
== gogo://a.com/space here
>> to-string gogo://a.com/space here
== "gogo://a.com/space here"
or did I get you wron?
wrong
Steeve
23-Feb-2012
[327]
Brian, Can you show me what is broken ? I'm a bit unsettled by your 
concern
BrianH
23-Feb-2012
[328x3]
The escape decoding gets done too early. The decoding should not 
be done until after the URI structure has been parsed. If you do 
the escape decoding too early, characters that are escaped so that 
they won't be treated as syntax characters (like /) are treated as 
syntax characters erroneously. This is a bad problem for schemes 
like HTTP or FTP that can use usernames and passwords, because the 
passwords in particular either get corrupted or have inappropriately 
restricted character sets. IDN encoding should be put off until the 
last minute too, once we add support for Unicode to the url handlers 
of HTTP, plus any others that should support that standard.
Given that the URI structure is parsed by DECODE-URL (or the R3 equivalent), 
that means that any unescaping should be done in that function, or 
in the scheme handler itself, not in the native code that runs before 
the mezzanine code is called.
Re-escaping in MOLD is OK though. It's the input that's the problem, 
not the output.
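
A minimal sketch of the ordering problem described above, written in Python rather than REBOL, so it says nothing about how LOAD or DECODE-URL are actually implemented. The URL, the host name, and both function names are hypothetical; `urlsplit` and `unquote` come from Python's standard `urllib.parse` module.

```python
from urllib.parse import urlsplit, unquote

# Hypothetical FTP URL whose password is "p@ss/word", with the
# syntax characters @ and / escaped as %40 and %2F.
RAW = "ftp://user:p%40ss%2Fword@host.example/dir/file"

def decode_then_parse(url):
    """Unescape first, then parse the structure (the problematic order)."""
    parts = urlsplit(unquote(url))
    return parts.username, parts.password, parts.hostname, parts.path

def parse_then_decode(url):
    """Parse the structure first, then unescape each component."""
    parts = urlsplit(url)
    return (
        unquote(parts.username or ""),
        unquote(parts.password or ""),
        parts.hostname,
        unquote(parts.path),
    )

print(decode_then_parse(RAW))
# ('user', 'p', 'ss', '/word@host.example/dir/file')
print(parse_then_decode(RAW))
# ('user', 'p@ss/word', 'host.example', '/dir/file')
```

With early decoding the unescaped @ and / are taken as delimiters, so the password is truncated and the host is wrong. With late decoding every component survives intact.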
Maxim
23-Feb-2012
[331]
yep... and I've lost hours trying to get some ftp code to work because 
it had strange urls (with passwds)... which the interpreter would 
break all the time. 

At some point you are mystified by what is the actual URL being sent 
to the server.


Once you see what is going on, you can get it to work, but realizing 
that you didn't actually send the url you expected can take quite 
a long time, as can properly fixing it once you've got a whole app 
expecting/playing with urls.
BrianH
23-Feb-2012
[332]
I've been hoping to fix that. I can load a hot-patch into R2, and 
include a patch in a host kit build in R3 or replace functions from 
%rebol.r if necessary.
Steeve
23-Feb-2012
[333x5]
OK, I'll try to summarize our concerns.

The url! and email! syntax is more permissive than a valid URI. It's 
not a problem nor a design flaw.

Escape decoding should not be done at all when a value is loaded as 
a url! or email!. Right, but that will not be corrected until Carl 
does it.

DECODE-URL can be rewritten (used by schemes). The parser is too 
strict and can't deal with complex forms.
Lots of inconsistencies with the file! datatype between R2 and R3.
Escaping notation = huge mess
you can use 2 forms for file! :
in R2
- %"*"  quoted string form, with ^ escape notation allowed
- %*  unquoted form, with %ff escape notation allowed
in R3
- quoted string file works fine

- in the %* form, the % escape notation works fine but the ^ char 
messes things up in some cases without issuing an error
In the %* form, R3 should recognise the ^ char as a normal char (not 
an escape notation), as R2 does.
So for the moment, I think it's better to reject the ^ char in the 
R3 syntax
Maxim
23-Feb-2012
[338]
yeah, it's surely some leftover copy/paste code from the string loader, 
left in the file loader by error.
BrianH
23-Feb-2012
[339x3]
Worse than being a huge mess, R2 and R3 have different messes. R2 
MOLD fails to encode the % character properly. R3 chokes on the ^ 
character in unquoted mode, and allows both ^ and % escaping in quoted 
mode, and MOLDs the ^ character without encoding it (a problem because 
it chokes on that character). Overall the R2 MOLD problem is worse 
than all of the R3 problems put together because % is a more common 
character in filenames than ^, but both need fixing. I wish it just 
did one escaping method for files, % escaping, or did only % escaping 
for unquoted files and only ^ escaping for quoted files. % escaping 
doesn't support Unicode characters over 255, but no characters like 
that need to be escaped anyways - they can be written directly.
R2 file! syntax may have more problems that I'm not aware of though.
I guess that I just want the escaping behavior Steeve described for 
R2, but with the MOLD of %%25 fix from R3, along with % by itself 
being interpreted as and molding as %"".
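
A rough Python sketch of why a MOLD that fails to encode the % character is a round-trip problem, and of the "only % escaping" idea for unquoted files. This is an analogy, not REBOL's actual MOLD or LOAD: the three function names are hypothetical, and `quote`/`unquote` are from Python's standard `urllib.parse` module.

```python
from urllib.parse import quote, unquote

def mold_without_escaping(name):
    """Emit %name verbatim, leaving a literal % in the name untouched."""
    return "%" + name

def mold_with_percent_escaping(name):
    """Emit %name with % (and other unsafe chars such as ^) percent-encoded."""
    return "%" + quote(name, safe="/")

def load_file(text):
    """Drop the leading % sigil and decode %xx escapes once."""
    return unquote(text[1:])

name = "50%20off"  # a file name containing the literal characters % 2 0

print(load_file(mold_without_escaping(name)))       # 50 off
print(mold_with_percent_escaping(name))             # %50%2520off
print(load_file(mold_with_percent_escaping(name)))  # 50%20off
print(mold_with_percent_escaping("a^b"))            # %a%5Eb
```

The unescaped form silently loads back as a different name, while encoding % as %25 restores the original. Under a single escaping method, ^ needs no special treatment: it is just another character to percent-encode.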
Steeve
24-Feb-2012
[342x4]
file-char: complement union charset {%:@} termination-char
file-char/#"/": true	;** #"/" added
file-syntax: [
	#"%" [
		quoted-string
		| any [file-char | escape-uri] ;** fail on ^ char
	] termination
]
;** alternative syntax (R2)
file-syntax: [
	#"%" [
		quoted-string
		| some [file-char | escape-uri | #"^^"]  ;** ^ valid char
	] termination
]
Missing rules...
path! refinement! date! time! 
Anything else ???
pair!
Sources
https://github.com/rebolsource/rebol-syntax
Maxim
24-Feb-2012
[346]
I don't see rules to recognise the serialized versions of the few 
datatypes which have them...
 #[true] #[false] #[none] [#function [][] ]  [#object [] ]
Steeve
24-Feb-2012
[347]
yep
Cyphre
24-Feb-2012
[348]
image!
Maxim
24-Feb-2012
[349]
#[list![]]  #[hash![]]
Steeve
24-Feb-2012
[350]
Okkkkk, there is a huge list for the serialized ones ;-)
Maxim
24-Feb-2012
[351]
money!    1.00%
Steeve
24-Feb-2012
[352]
percent!
Cyphre
24-Feb-2012
[353x2]
date! time!...
bitset!...
Maxim
24-Feb-2012
[355]
path! set-path! lit-path!
Steeve
24-Feb-2012
[356]
Well...
Cyphre
24-Feb-2012
[357]
just type ? datatype! in the console to get a list
Steeve
24-Feb-2012
[358]
I will focus on the annoying ones for now
Maxim
24-Feb-2012
[359]
date has many variations, it's probably the most complex one left
Steeve
24-Feb-2012
[360]
yep
Maxim
24-Feb-2012
[361]
actually,  path! also has a few quirks, like allowing parens and 
the use of a  get-set-word at the end
Steeve
24-Feb-2012
[362]
but path! needs all the other datatypes to be finished first
Maxim
24-Feb-2012
[363x2]
no, afaik,  just paren!, word and its own additional format quirks. 
    as the global block definition expacts, so too will parens, and 
thus the path.
expacts... expands
Steeve
24-Feb-2012
[365]
So, path! is not complex in that regard (values separated by '/')
Maxim
24-Feb-2012
[366]
yeah, just have to find the values which are valid in a path (not 
all types are valid, at least in R2)