World: r3wp

[REBOL Syntax] Discussions about REBOL syntax

Steeve
19-Feb-2012
[274x2]
Introducing email! datatype next.

form: '?[*-:-*'] 
':' may be in the first position only
'<' can't be in the first position
'%FF' escapes chars in hex notation
Both R2, R3

escape-uri: [#"%" 2 hex-digit]
email-char: complement union charset {%@:} termination-char
email-syntax: [
	[#":" | not #"<" email-char | escape-uri]
	any [email-char | escape-uri]
	#"@"
	any [email-char | escape-uri]
	termination
]
Andreas
19-Feb-2012
[276x2]
Hmm, when : is in the first position, a : can occur anywhere afterwards 
as well.
For example, [:a:@:b:]
Steeve
19-Feb-2012
[278]
Then it's no longer an email! but a url!
Andreas
19-Feb-2012
[279]
Not in R3.
Steeve
19-Feb-2012
[280x5]
right
right
right
good catch, true in R2 also
Argh, it will be hard to keep the rule tight
BrianH
19-Feb-2012
[285]
I figure that we should look at the email formatting standard, then 
subtract support for any syntax that would conflict with something 
else in REBOL, especially if that doesn't commonly show up in actual 
email addresses. We've already made some tradeoffs in favor of email 
(i.e. no @ in issues or words), maybe we want to make more.
Andreas
19-Feb-2012
[286]
Where would we "want" to do that?
BrianH
19-Feb-2012
[287]
Doesn't work for R2 though - that syntax just needs to be documented, 
it can't be changed.
Andreas
19-Feb-2012
[288x2]
Or how would such a desire reflect?
In filing CC issues?
BrianH
19-Feb-2012
[290x2]
When I was trying to replicate the R3 word syntax, it was partly 
to document R3, partly to serve as the basis of a more flexible TRANSCODE 
that would allow people to handle more sloppy syntax without removing 
the valuable errors from the regular TRANSCODE, but mostly it served 
to generate new CC tickets for syntax bugs that we weren't aware 
of because the syntax wasn't well enough documented, and they hadn't 
come up in practice yet.
There is a large, unknown number of such bugs in URL syntax, for 
instance. I wouldn't be surprised if that is the case with email 
too.
Andreas
19-Feb-2012
[292x2]
If it's obvious bugs, that's comparatively easy, yes.
Your initial message above sounded more like wishes towards a more 
restricted email!.
BrianH
19-Feb-2012
[294x2]
A more thorough examination of the syntax makes more of these bugs 
obvious.
I don't necessarily want a more restricted email! than it is already, 
but if we are expanding what is possible with email!, it will still 
likely need to be restricted relative to the email standard.
Andreas
19-Feb-2012
[296]
We are not expanding anything :) We are just describing what syntactical 
rules the REBOL email! literal syntax follows.
BrianH
19-Feb-2012
[297]
I'm a little more concerned with R3 URL syntax though, since in that 
case there are real bugs that have already affected people in real 
cases, and because hypothetically a lot of the bugs are fixable in 
mezzanine code.
Andreas
19-Feb-2012
[298]
And as the email! datatype can be used for many a purpose within 
dialects, it does not necessarily have to match RFC822 (or rather 
5322) exactly.
Steeve
19-Feb-2012
[299]
But the syntax checking can't be corrected with mezzanines, right?
Andreas
19-Feb-2012
[300]
(Which would be a relatively complex problem anyway ...

http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html)
BrianH
19-Feb-2012
[301x2]
Steeve: For emails, no. For urls, yes.
For url! the syntax checking is mostly done by the DECODE-URL mezzanine. 
We can't change what is recognized as a url! by REBOL, but we can 
change how the data is treated once it's recognized. There are errors 
in escape handling, for instance.
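
To make "escape handling" concrete, here is a minimal Python sketch of strict percent-unescaping, following the '%FF' form mentioned earlier in the thread. It is not a description of how DECODE-URL behaves; the function name `unescape` and the choice to leave malformed escapes untouched are assumptions made for illustration only.

```python
import re

# "%" followed by exactly two hex digits, as in the escape-uri rule.
ESCAPE = re.compile(r'%([0-9A-Fa-f]{2})')

def unescape(text):
    """Decode well-formed %HH escapes; leave anything else as-is (assumed policy)."""
    return ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

for sample in ["a%20b", "%41%42", "100%", "%zz"]:
    print(repr(sample), "->", repr(unescape(sample)))
```

A stray "%" or a "%" followed by non-hex characters is passed through unchanged here; a real implementation might prefer to raise an error instead.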
Steeve
19-Feb-2012
[303]
Corrected version, works with R2 and R3:

escape-uri: [#"%" 2 hex-digit]
email-char: complement union charset {%@:} termination-char
email-esc: [email-char | escape-uri]
email-syntax: [
	[
		#":" any [email-esc | #":" ] #"@" any [email-esc | #":" ]
		| not #"<" some email-esc #"@" any email-esc
	]
	termination
]
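
For readers who want to experiment outside REBOL, the corrected rule above can be roughly transliterated into a Python regular expression. This is a sketch under stated assumptions: termination-char and hex-digit are not defined in this part of the thread, so the `TERM` set below is hypothetical, hex digits are assumed to be case-insensitive, and `termination` is approximated by requiring the whole token to match.

```python
import re

# ASSUMPTION: hypothetical stand-in for termination-char (not given in the thread).
TERM = r'\s\[\](){}";/'
ESC = r'%[0-9A-Fa-f]{2}'                       # escape-uri: "%" plus 2 hex digits
EMAIL_CHAR = '[^%@:' + TERM + ']'              # complement of {%@:} and TERM
EMAIL_ESC = '(?:' + EMAIL_CHAR + '|' + ESC + ')'

# First alternative: a leading ":" permits further ":" on both sides of "@".
COLON_FORM = ':(?:' + EMAIL_ESC + '|:)*@(?:' + EMAIL_ESC + '|:)*'
# Second alternative: must not start with "<", and no ":" anywhere.
PLAIN_FORM = '(?!<)' + EMAIL_ESC + '+@' + EMAIL_ESC + '*'

EMAIL_RE = re.compile('(?:' + COLON_FORM + '|' + PLAIN_FORM + ')')

def is_email(token):
    """True if the whole token matches the transliterated email-syntax rule."""
    return EMAIL_RE.fullmatch(token) is not None

for token in [":a:@:b:", "a@b", "a:b@c", "<a@b"]:
    print(token, is_email(token))
```

Under these assumptions the sketch accepts Andreas's `:a:@:b:` example and rejects `a:b@c`, where the colon appears without a leading colon.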
Andreas
19-Feb-2012
[304]
Ah, was wondering. So we can't change the syntax of url!s in R3 as 
well, we can only improve/bugfix url! handling.
BrianH
19-Feb-2012
[305]
You'd be surprised at how flexible the syntax of url! is in R3 :)
Andreas
19-Feb-2012
[306]
I don't think I would.
BrianH
19-Feb-2012
[307x2]
Fair enough. But if you can figure out exactly hor MOLD handles escaping 
of urls, that would help narrow down what bugs we can fix in DECODE-URL.
hor -> how
Andreas
19-Feb-2012
[309]
I would be slightly surprised if it is more flexible than string 
syntax, but I somehow doubt that :)
BrianH
19-Feb-2012
[310]
Fewer escaping methods, so no. What's weird is that some kinds of 
string escaping work for the file! type.
Steeve
20-Feb-2012
[311]
It's calm here
Ladislav
20-Feb-2012
[312x2]
committed a couple of 1903-5 additions. You were right that #1905 
is ugly, Steeve.
Caught up with the code posted above.
Steeve
23-Feb-2012
[314x5]
url! syntax (both R2,R3)
I've not created specific charsets, so the rule is more verbose.

- The first char! same as for word! (less "+-")
- Must contain at least one ':'
- "/" Allowed only after the first ":"
- Escape-uri allowed like in email!

url-syntax: [
	not digit not #"'" not sign word-char
	any [escape-uri | not termination-char not #":" skip]
	#":"
	any [escape-uri | #"/" | not termination-char skip]
]
Forgot the case when it begins with '.'
I should have stuck much closer to the word-syntax
url-syntax: [
	[#"." not digit | not digit not #"'" not sign word-char]
	any [escape-uri | not termination-char not #":" skip]
	#":"
	any [escape-uri | #"/" | not termination-char skip]
]
hum... still wrong
url-syntax: [
	not [digit | #"'" | #"." digit | sign] word-char
	any [escape-uri | not termination-char not #":" skip]
	#":"
	any [escape-uri | #"/" | not termination-char skip]
]
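
The last url-syntax rule can be approximated in Python in the same spirit. This is only a sketch: word-char and termination-char are defined elsewhere, so `WORD_CHAR` and `TERM` below are hypothetical ASCII stand-ins. "/" is assumed to be a termination character, which is one way to read the note that "/" is allowed only after the first ":".

```python
import re

# ASSUMPTIONS: both sets are hypothetical; the thread does not list them.
TERM = r'\s\[\](){}";/'
WORD_CHAR = r'[A-Za-z0-9!&*+\-.=?^_`|~]'
ESC = r'%[0-9A-Fa-f]{2}'                       # escape-uri

URL_RE = re.compile(
    r"(?![0-9]|'|\.[0-9]|[+-])" + WORD_CHAR    # first char: word-like, no digit/'/sign
    + '(?:' + ESC + '|[^' + TERM + ':])*'      # scheme part: no ":" and no TERM
    + ':'                                      # at least one ":" is required
    + '(?:' + ESC + '|/|[^' + TERM + '])*'     # rest: "/" and ":" now allowed
)

def is_url(token):
    """True if the whole token matches the transliterated url-syntax rule."""
    return URL_RE.fullmatch(token) is not None

for token in ["http://rebol.com", ".a:b", ".5:b", "a/b:c", "abc"]:
    print(token, is_url(token))
```

The negative lookahead mirrors the `not [digit | #"'" | #"." digit | sign]` guard, so a leading "." is accepted only when it is not followed by a digit.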
BrianH
23-Feb-2012
[319x3]
That's a good start! I'm really curious about whether urls and emails 
deal with chars over 127, especially in R3. As far as I know, the 
URI standards don't support them directly, but various internationalization 
extensions add recodings for these non-ASCII characters. It would 
be good to know exactly which chars are supported in the data model, 
so we can hack the code that supports that data to match.
When last I checked, R3 considers all chars over 127 to be word-chars. 
It is considered to be non of REBOL's business whether a printer 
or display would show the character, so that even includes the additional 
Unicode space and control characters beyond ASCII. R3 has a binary 
parser, you see.
non of -> none of
Steeve
23-Feb-2012
[322]
yeah
BrianH
23-Feb-2012
[323]
Do you know if the REBOL syntax parser (LOAD and TRANSCODE) handles 
the unescaping and puts the decoded data into the url! structure, 
or if that is handled by the DECODE-URL mezzanine code? I'm hoping 
it's handled by the mezzanine, because it's broken in both R2 and 
R3 and mezzanine changes are the only kind we can make at the moment.