Parse versus Regular Expressions
[1/42] from: lmecir:mbox:vol:cz at: 3-Apr-2003 9:39
Hi all,
sorry, if this is too trivial for the Regular Expressions (RE) experts, I am certainly
not one (I am not using RE at all, neither I knew the definition of RE).
I've heard, that PARSE was more general, than RE, although I didn't know why. Only recently
I found an example:
anbn: [#"a" anbn #"b" | none]
parse "aaaaabbbbb" anbn ; == true
The following is an RE equation ("." means concatenation and "+" means something like
or):
X = a.X + b
, which can be trivially rewritten for PARSE as follows:
x: [[#"a" x] | #"b"]
and the result:
parse "b" x ; == true
parse "ab" x ; == true
parse "aaab" x ; == true
Of course, the result of the RE equation is known, it is ("*" means repetition):
X = a*b
, which can be rewritten for PARSE as:
x: [any #"a" #"b"]
, but the fact, that both approaches worked has been appreciated by a local teacher of
the subject. Then he asked me, whether the following equation could be easily rewritten
for PARSE too:
Y = Y.a + b
The result of the equation is known, it is:
Y = ba*
, i.e. in PARSE notation:
y: [#"b" any #"a"]
Noticeably, the "dumb" approach to the equation rewriting:
y: [[y #"a"] | #"b"]
didn't "fire" any error, stack overflow, or any unusual exception. Nevertheless, it didn't
work:
parse "b" y ; == true
parse "ba" y ; == false
Any ideas?
Regards
-L
[2/42] from: g:santilli:tiscalinet:it at: 3-Apr-2003 11:10
Hi Ladislav,
On Thursday, April 3, 2003, 9:39:43 AM, you wrote:
LM> Noticeably, the "dumb" approach to the equation rewriting:
LM> y: [[y #"a"] | #"b"]
LM> didn't "fire" any error, stack overflow, or any unusual exception. Nevertheless,
it didn't work:
LM> parse "b" y ; == true
LM> parse "ba" y ; == false
LM> Any ideas?
>> r: [r]
== [r]
>> parse "abc" r
== false
It looks like, that PARSE notices the endless loop, and makes the
rule fail.
Regards,
Gabriele.
--
Gabriele Santilli <[g--santilli--tiscalinet--it]> -- REBOL Programmer
Amigan -- AGI L'Aquila -- REB: http://web.tiscali.it/rebol/index.r
[3/42] from: brett:codeconscious at: 3-Apr-2003 18:34
Silent stack limit?
y: [ [ (?? y) [y | (print "Didn't match Y")] #"a"] | #"b"]
parse "ba" y
Regards,
Brett.
Ladislav wrote: ---------------
.....
The result of the equation is known, it is:
Y = ba*
, i.e. in PARSE notation:
y: [#"b" any #"a"]
Noticeably, the "dumb" approach to the equation rewriting:
y: [[y #"a"] | #"b"]
didn't "fire" any error, stack overflow, or any unusual exception.
Nevertheless, it didn't work:
parse "b" y ; == true
parse "ba" y ; == false
Any ideas?
Regards
-L
[4/42] from: andreas:bolka:gmx at: 3-Apr-2003 10:56
Thursday, April 3, 2003, 9:39:43 AM, Ladislav wrote:
> Then he asked me, whether the following equation could be easily
> rewritten for PARSE too:
<<quoted lines omitted: 5>>
> Noticeably, the "dumb" approach to the equation rewriting:
> y: [[y #"a"] | #"b"]
I just want to add, that this is not necessarily a "dump" approach but
rather a left-recursive approach. Wether a given parser can handle
left-recursive grammars depends on the parser's internal
implementation.
There are certainly situations where left-recursive grammars are by
far more natural and elegant when compared to their non left-recursive
aequivalents.
> didn't "fire" any error, stack overflow, or any unusual exception.
> Nevertheless, it didn't work:
> parse "b" y ; == true
> parse "ba" y ; == false
> Any ideas?
Not really, sorry (unless you count my conclusion below as "ideas" :).
The problem here is that we (at least I :) do not know how 'parse is
implemented internally (and I have not attempted to gain more
knowledge by experimentation).
The only things I dare to conclude from my experiences (or from your
examples):
- 'parse is not able to handle left-recursive grammars
- 'parse does not tell you, when it encounters left recursion, it
"silently" pretends that the sentence (the "ba" in your example)
does not match the grammar. Not a very desireable behaviour, imho.
--
Best regards,
Andreas mailto:[andreas--bolka--gmx--net]
[5/42] from: joel:neely:fedex at: 3-Apr-2003 6:22
Hi, Ladislav,
Ladislav Mecir wrote:
> I've heard, that PARSE was more general, than RE, although I
> didn't know why...
>
I've yet to see any demonstration of that claim. My perspective is
that each one has some things that it expresses conveniently, and
those two sets overlap (but are not identical). However, convenience
is not the same as generality.
PARSE seems to partake of the same issue as Prolog; its seems to be
implemented in a way that makes it a hybrid between declarative and
procedural programming, with the result that sometimes things have
side effects or behaviors that can be difficult to figure out.
> Noticeably, the "dumb" approach to the equation rewriting:
> y: [[y #"a"] | #"b"]
<<quoted lines omitted: 4>>
> Any ideas?
>> trace on parse "ba" y trace off
Result: (unset)
Trace: parse (word)
Trace: "ba" (string)
Trace: y (word)
Match: [y #"a"]
Input: ba
Match: [[y #"a"] | #"b"]
Input: ba
Match: [y #"a"]
Input: ba
Match: [[y #"a"] | #"b"]
Input: ba
Match: [y #"a"]
... after many more repetitions, followed by
Match: [y #"a"]
Input: ba
Match: [[y #"a"] | #"b"]
Input: ba
Match: [y #"a"]
Input: ba
recover: #"b"
Match: #"b"
Input: ba
Match: #"a"
Input: aa
Match: #"a"
Input: aa
recover: #"b"
... and many more repetitions of the last cycle, then
Input: aa
Match: #"a"
Input: aa
recover: #"b"
Match: #"b"
Input: ba
Result: false (logic)
Trace: trace (word)
Trace: off (word)
>>
Which leads to a couple of observations:
1) The seemingly equivalent
>> y: [#"b" | [y #"a"]]
== [#"b" | [y #"a"]]
doesn't work either.
>> parse "b" y ; == true
== true
>> parse "ba" y ; == false
== false
>> trace on parse "ba" y trace off
Result: (unset)
Trace: parse (word)
Trace: "ba" (string)
Trace: y (word)
Match: #"b"
Input: ba
Result: false (logic)
Trace: trace (word)
Trace: off (word)
which leads to the suspicion that backtracking in the case of
failure (a standard technique in RE) isn't supported so well
in (the current implementation of) PARSE . IOW, an imaginary
solution would traverse the tree below, allowing each use of
Y to start a fresh backtrackable branch:
y --+-- #"b" failure (target string not exausted)
|
+-- [y #"a"] alternative, expand Y
|
+-- #"b" success, continue most recent alternative
|
+-- #"a" success, end reached
2) The fact that the nested occurrence of Y is taken as an
imperative command instead of an alternative to be reconsidered
later if necessary seems to support the PARSE/PROLOG analogy
above.
3) The trace of the earlier definition appears to descend and then
"recover" without trying all possible alternatives. The limit
on this cycling is not documented AFAIK. It would be useful to
know about such limits.
-jn-
--
Polonius: ... What do you read, my lord?
Hamlet: Words, words, words.
_Hamlet_, Act II, Scene 2
[6/42] from: rotenca:telvia:it at: 3-Apr-2003 20:15
Hi,
> y: [[y #"a"] | #"b"]
>didn't "fire" any error, stack overflow, or any unusual exception.
Nevertheless, it didn't work:
> parse "b" y ; == true
> parse "ba" y ; == false
>Any ideas?
http://www.escribe.com/internet/rebol/m25516.html about recursion limits of
parse
this makes your rule to match this string:
y: [y #"a" | #"b"]
parse head insert/dup tail "b" "a" 200 y ; == true
but also:
parse head insert/dup tail "b" "a" 66 y ; == true
parse "baa" y ; == true
parse "b" y ; == true
it matches "ba..." string where the number of "a" is (200 // (n + 1)) - n = 0
and given the recursion limits the result is the right one.
---
Ciao
Romano
[7/42] from: rotenca:telvia:it at: 4-Apr-2003 2:36
Hi Joel,
> which leads to the suspicion that backtracking in the case of
> failure (a standard technique in RE) isn't supported so well
> in (the current implementation of) PARSE . IOW, an imaginary
> solution would traverse the tree below, allowing each use of
> Y to start a fresh backtrackable branch:
parse "ba" ["b" | "ba"] ;== false
I think that this result is right, because the rule ["b" | "ba"] is matched by
b
, the failure occurs after the rule matched, when parse find "a", so no
backtracing happens here, because there are not open alternative of failed
rule.
Do you think that parse should recover to any alternative rule also in
correctly matched ones?
---
Ciao
Romano
[8/42] from: joel:neely:fedex at: 4-Apr-2003 6:57
Hi, Romano,
Romano Paolo Tenca wrote:
> parse "ba" ["b" | "ba"] ;== false
> I think that this result is right, because the rule ["b" | "ba"]
<<quoted lines omitted: 3>>
> Do you think that parse should recover to any alternative rule
> also in correctly matched ones?
Yes.
How would you read the rule above (aloud)? wouldn't you read
["b" | "ba"]
as
The string containing "b" or the string containing "ba".
or some equivalent variation? Wouldn't most people assume that the
specific string "ba" matches that description? The above (false)
result is troubling to me for several reasons:
1) It makes the PARSE dialect a strange hybrid between declarative
and procedural descriptions (as mentioned in my earlier post).
2) It totally breaks normal usage that "or" is symmetrical, even
though we read the | as "or".
3) There are many aspects of REBOL design that are justified or
defended in discussion on the basis that "REBOL was not designed
for computing scientists or programmers, but for human beings",
yet if you verbalize the above expression
parse "ba" ["b" | "ba"]
in "everyday people language" e.g. something like
Is the sequence of letters "BA" accepted by a rule that
will match "B" or "BA"?
I believe that most people would say "Of course!"
4) It breaks the obvious, intuitive attempts to describe the way
the parts of the PARSE dialect interact. I think most folks
would find it convenient to be able to "reason by analogy"
using such familar rules as
a * (b + c) == (a * b) + (a * c)
about the relationsips between, e.g. concatenation and choice.
Notice, however, the differences in behavior of the following
attempts at parse rules.
parse "ba" ["b" | "ba"] ; == false
parse "ba" ["b" ["" | "a"]] ; == false
parse "ba" ["b" [end | "a"]] ; == true
parse "ba" ["b" end | "ba"] ; == true
parse "ba" ["b" end | "ba" end] ; == true
parse "ba" [["b" | "ba"] end] ; == false
AFAIK the only answers to "Why are these different?" would be
a) a complicated description of the implementation, or
b) a rationale based on emphasizing some other detail than
the obvious "what would most people expect" issue.
-jn-
--
Polonius: ... What do you read, my lord?
Hamlet: Words, words, words.
_Hamlet_, Act II, Scene 2
[9/42] from: g:santilli:tiscalinet:it at: 4-Apr-2003 15:54
Hi Joel,
On Friday, April 4, 2003, 2:57:22 PM, you wrote:
JN> The string containing "b" or the string containing "ba".
Well, indeed it is not very intuitive, but the real meaning is:
Try to match the rule "b". If id does not match, try to
match the rule "ba".
PARSE then returns TRUE or FALSE depending on whether the rule
matched the whole string or not (i.e. whether the current position
after the end of the matching is the tail of the string or block).
JN> 1) It makes the PARSE dialect a strange hybrid between declarative
JN> and procedural descriptions (as mentioned in my earlier post).
The PARSE dialect is something that is matched to the input string
or block from left to right. I'm not sure if this makes it an
hybrid; probably it does, and I admit I like it this way. (Much
better than RE, that is.)
JN> 2) It totally breaks normal usage that "or" is symmetrical, even
JN> though we read the | as "or".
Indeed, | has not the intuitive meaning of OR. It is just a
marker: if PARSE reaches it, then it stops matching the current
rule block. If a matching fails, then PARSE looks to see if
there's a | in the current rule block and restarts matching from
after the |.
JN> Is the sequence of letters "BA" accepted by a rule that
JN> will match "B" or "BA"?
JN> I believe that most people would say "Of course!"
A rule that returns TRUE for "B" or "BA" is:
["b" end | "ba" end]
or, if you want to avoid the END,
["ba" | "b"]
(i.e. having the longest-matching rule first).
JN> AFAIK the only answers to "Why are these different?" would be
JN> a) a complicated description of the implementation, or
I don't think so. It is a very simple description of the
implementation, because the implementation is very simple. (On
contrast, RE parsers are usually very complex.)
Regards,
Gabriele.
--
Gabriele Santilli <[g--santilli--tiscalinet--it]> -- REBOL Programmer
Amigan -- AGI L'Aquila -- REB: http://web.tiscali.it/rebol/index.r
[10/42] from: greggirwin:mindspring at: 4-Apr-2003 10:15
Hi Joel,
<< 2) It totally breaks normal usage that "or" is symmetrical, even
though we read the | as "or". >>
It's like the ANY function, so I don't know that I'd say it "totally
breaks" normal usage. Isn't OR often optimized to work this way; i.e.
shortcut evaluation?
I think the most important thing, barring actual "bugs" in PARSE's
design is documenting how these things work so grammars can be designed
correctly (e.g. longest match first; greedy grammars? :).
-- Gregg
[11/42] from: lmecir:mbox:vol:cz at: 4-Apr-2003 22:14
Hi Joel,
> Ladislav Mecir wrote:
> >
> > I've heard, that PARSE was more general, than RE, although I
> > didn't know why...
> >
>
> I've yet to see any demonstration of that claim.
As I mentioned here, it is:
z: [#a z #b | none]
, which is a rule, that cannot be written using Regular Expressions.
...
> 1) It makes the PARSE dialect a strange hybrid between declarative
> and procedural descriptions (as mentioned in my earlier post).
...
Yes, PARSE dialect looks procedural. Nevertheless, all the "language stuff"
is a hybrid between procedural and declarative descriptions: e.g. Regular
Expressions and FSM's describe the same - Regular Languages.
Similarly Grammars (more declarative) and Turing Machines (procedural)
describe the same languages.
> 2) It totally breaks normal usage that "or" is symmetrical, even
> though we read the | as "or".
In the Regular Languages article I am trying to read has been stated, that
the Regular Languages can be described using the following minimal set of
operator symbols
used in addition to the usual characters of the base
alphabet:
0 - an empty language symbol. The meaning of the symbol is, that no word is
compatible with it. Analogical PARSE rule:
[end skip]
1 - an empty word symbol. The meaning of the symbol is, that an empty word
is compatible with it. Analogical PARSE rule:
[]
+ - the addition operator. The meaning of it is, that a word is compatible
with the A+B rule, if it is compatible with either the A or the B rule. This
operator is symmetrical as opposed to analogical '| in a PARSE rule.
. - the concatenation operator. While this operator looks like
multiplication, it isn't symmetrical. The meaning: a word is compatible with
the A.B rule, if it can be written as a concatenation of two words, where
the first one is compatible with the A rule and the second one is compatible
with the B rule. Analogical PARSE rule:
[a b]
* - the repetition operator. This is a unary operator. The meaning: a word
is compatible with A*, if it is consists of any number of words, each of
which is compatible with A. Analogical PARSE rule:
[any a]
( ) - the parentheses to specify the priority of operators. The meaning:
(A+B).C differs from A+(B.C) as usual. We can use square brackets in PARSE
analogically.
Why am I writing it here? These declarative rules have got only one
symmetrical operator. I don't think, that many Rebol users miss its
symmetry. The shape | of the analogical PARSE symbol is pointing at the
difference. If somebody doesn't like the fact, that the word "or" hints,
that it should be symmetrical, then he can use a different word "else",
which is more procedural, less symmetrical and it is a better match.
If you really want a symmetrical OR parse rule, it can be programmed:
or-rule: func [
{Generate an A or B symmetrical PARSE rule}
a
b
] [
use [f g ma mb a' b' s a'' b''] copy/deep [
a': a
b': b
a'': [a' f: (ma: true)]
b'': [b' g: (mb: true)]
[
f: g: s: (ma: mb: false) opt a'' :s opt b''
(
if not ma [f: g]
if not mb [g: f]
f: either (index? f) <= (index? g) [f] [g]
ma: either ma or mb [[:f]] [[end skip]]
)
ma
]
]
]
A test:
oab: or-rule "a" "ab"
oba: or-rule "ab" "a"
parse "ab" oab ; == false
parse "ab" oba ; == false
parse "a" oab ; == true
parse "a" oba ; == true
Summary: PARSE rules are quite intuitive procedural description.
Exceptions:
1) The stack overflow shall be handled as an exception, because it is highly
unprobable, that PARSE will be able to produce a "reliable" result.
2) A simple thing, that "bothers" me is:
parse "aa" [["a"] (print "match")]
Every beginner reads it as follows: the rule ["a"] is matched against the
string and the match is found. It is hard to "explain" to a beginner, why
(despite of the match) the rule fails.
3) The fact, that "past tail positions mean failure" is an unintuitive
property. It is even related to some ugly crashes.
4) A Garbage Collector bug causes PARSE crashes.
5) The TO and THRU words in PARSE should accept any rules.
6) There can be a NOT PARSE word.
7) There should be a LIT PARSE word for block parsing.
8) PARSE should behave like the normal natives with respect to RETURN and
THROW.
Regards
-Ladislav
[12/42] from: joel:neely:fedex at: 4-Apr-2003 17:13
Hi, Gregg,
Gregg Irwin wrote:
> << 2) It totally breaks normal usage that "or" is symmetrical, even
> though we read the | as "or". >>
>
> It's like the ANY function, so I don't know that I'd say it "totally
> breaks" normal usage. Isn't OR often optimized to work this way; i.e.
> shortcut evaluation?
>
In REBOL (at least) OR does *not* shortcircuit, but ANY does:
>> any [do [print "1" true] do [print "2" true]]
1
== true
versus
>> (do [print "1" true]) or (do [print "2" true])
1
2
== true
However, that's beside the point if we're talking about symmetry; in
REBOL *both* OR and ANY are symmetric (over pure logic values),
in the sense that the result is independent of the order of the
arguments:
foreach a reduce [true false] [
foreach b reduce [true false] [
print [a b equal? any [a b] any [b a] equal? (a or b) (b or
a)]
]
]
yields
true true true true
true false true true
false true true true
false false true true
So it's not like ANY in that respect.
> I think the most important thing, barring actual "bugs" in PARSE's
> design is documenting how these things work so grammars can be designed
> correctly (e.g. longest match first; greedy grammars? :).
>
I totally agree with that statement.
-jn-
--
----------------------------------------------------------------------
Joel Neely joelDOTneelyATfedexDOTcom 901-263-4446
Counting lines of code is to software development as
counting bricks is to urban development.
[13/42] from: joel:neely:fedex at: 4-Apr-2003 17:28
Hi, Gabriele,
Gabriele Santilli wrote:
> JN> 1) It makes the PARSE dialect a strange hybrid between declarative
> JN> and procedural descriptions (as mentioned in my earlier post).
>
> The PARSE dialect is something that is matched to the input string
> or block from left to right. I'm not sure if this makes it an
> hybrid; probably it does, and I admit I like it this way. (Much
> better than RE, that is.)
>
My point (regarding declarative-vs-procedural) was simply that RE
notation purely describes the pattern one is looking for, and the
RE engine manages all the complexity for you, *including* the cases
where backtracking is necessary to find a complete match after a
false start. It appears to me that PARSE requires the programmer to
do more work when backtracking may be needed.
The specific example at hand required the programmer to take action
to make the first alternative fail before PARSE would consider the
second. It would be nice if failure at the end of the *entire* rule
also caused backtracking so the programmer didn't have to do any
extra work.
-jn-
--
----------------------------------------------------------------------
Joel Neely joelDOTneelyATfedexDOTcom 901-263-4446
Counting lines of code is to software development as
counting bricks is to urban development.
[14/42] from: joel:neely:fedex at: 4-Apr-2003 17:01
Hi, Ladislav,
Ladislav Mecir wrote:
> As I mentioned here, it is:
>
> z: [#a z #b | none]
>
> , which is a rule, that cannot be written using Regular Expressions.
>
OK. The old context-free-vs-context-sensitive thing. However, this
little example doesn't scale up very well AFAICT. An equivalent to
parse somestring z: [#a z #b | none]
in Perl would be something like
$somestring =~ /^(a*)(b*)$/ and length ($1) == length ($2)
While the content identification and the length equality take two
different sub-expressions, I'm wondering what happens if we make the
problem just a little larger, and try to match any of
""
"abc"
"aabbcc"
"aaabbbccc"
and so on. Using an RE to get content, and then checking lengths,
is easy to write and understand (and the pattern scales up to more
than two pieces of content), as follows:
$somestring =~ /^(a*)(b*)(c*)$/ and
length ($1) == length ($2) and
length ($2) == length ($3)
and so on for four or more "pieces".
How would you attack that with the PARSE dialect?
-jn-
--
----------------------------------------------------------------------
Joel Neely joelDOTneelyATfedexDOTcom 901-263-4446
Counting lines of code is to software development as
counting bricks is to urban development.
[15/42] from: joel:neely:fedex at: 4-Apr-2003 17:51
Hi, Ladislav,
As usual, you've provided much food for thought.
Ladislav Mecir wrote:
> Yes, PARSE dialect looks procedural. Nevertheless, all the
> "language stuff" is a hybrid between procedural and declarative
> descriptions: e.g. Regular Expressions and FSM's describe the
> same - Regular Languages.
>
> Similarly Grammars (more declarative) and Turing Machines
> (procedural) describe the same languages.
>
...
> If you really want a symmetrical OR parse rule, it can be programmed:
>
I'm quite willing to be educated, so here's another example. I'll
state the problem, then wait for suggestions on how to use PARSE
before I show the solution (using Python's RE engine). Lest I be
accused of theory, I'll point out that this is a disguised and
simplified version of a program I've recently written for Real Work.
GIVEN:
A file of lines, each of which is 80 characters, and contains:
1) a six-digit leading sequence number,
2) a 66-character body area,
3) an eight-digit trailing sequence number.
The body area contains sentences, which end in a period followed
by whitespace. A sentence may spread across the body areas of
one or more lines, but if a sentence ends on one line, the rest
of that body will be blank and the next sentence will begin
in a subsequent line.
If the body area begins with an asterisk, it is to be ignored.
Consecutive lines should have both leading and trailing sequence
numbers that are in order.
OUTPUT:
If any line has an out-of-order leading or trailing sequence
number, echo that line to output as an error.
Output whole sentences (with redundant whitespace removed) as
individual lines of output.
To illustrate (although my lines here aren't 80 bytes, to avoid
email line wrap and save typing):
00001 This is a sentence. 20030301
00002 So 20030302
00003 is 20030302
00004 this. 20030302
00005* this will disappear 20030303
00006* but this won't because of sequence order 20030101
00007 The last 20030304
00008 sentence. 20030304
should get output like this
This is a sentence.
So is this.
ERROR: 000006* but this won't because of sequenc...
The last sentence.
Thanks in advance to anyone who offers PARSE solutions!
-jn-
--
----------------------------------------------------------------------
Joel Neely joelDOTneelyATfedexDOTcom 901-263-4446
Counting lines of code is to software development as
counting bricks is to urban development.
[16/42] from: andreas:bolka:gmx at: 5-Apr-2003 4:03
Hi Joel, Ladislav,
Saturday, April 5, 2003, 1:01:10 AM, Joel wrote:
> Ladislav Mecir wrote:
>>
<<quoted lines omitted: 5>>
>> Expressions.
>>
IMHO 'parse _should_ be a context-free grammar (CFG) parser. As the '|
debate shows 'parse does not (per default) behave in a completely CFG
conform way, it (at least) misses the symmetrical OR.
However, CFGs are a strict superset of Regular Expressions (RE) -
AFAIK every possible RE pattern can be expressed as CFG.
> OK. The old context-free-vs-context-sensitive thing.
The interesting thing is, that this classic example is basically not
context sensitive
but only a symmetrical production. If symmetry is
considered as an instance of "context-sensitivity" is up to the
reader, nevertheless, those symmetrical productions are something REs
cannot express.
> I'm wondering what happens if we make the problem just a little
> larger, and try to match any of
<<quoted lines omitted: 8>>
> and so on for four or more "pieces".
> How would you attack that with the PARSE dialect?
Are you not mixing the grammar description with "external" (at least
from the grammar's point of view) programming language facilities?
This mixing is of course possible with 'parse as well:
parse-r: func [ s /local r p1 p2 p3 ] [
r: [ [copy p1 any "a"] [copy p2 any "b"] [copy p3 any "c"] ]
all [ (parse s r) (= length? p1 length? p2) (= length? p2 length? p3) ]
]
This does not account for empty sentences, but I think this is really
not the point :) The example you gave is indeed context-sensitive and
as both, REs and CFGs have no real sense of context, "external"
facilities are required to establish the required context.
--
Best regards,
Andreas mailto:[andreas--bolka--gmx--net]
[17/42] from: brett:codeconscious at: 5-Apr-2003 13:50
Hi Joel,
> I'm quite willing to be educated, so here's another example. I'll
> state the problem, then wait for suggestions on how to use PARSE
> before I show the solution (using Python's RE engine). Lest I be
> accused of theory, I'll point out that this is a disguised and
> simplified version of a program I've recently written for Real Work.
Well I had a go. I decided to just attack it using the first approach that
came to mind.
My prelinary thinking was that there is two themes here, control
(sequencing, errors) and character stream (accumulating sentences). So the
plan was (1) break into lines (2) break into fields (3) filter (4) process
the character stream. That is basically what I've carried into PARSE rules.
To accomplish this I've had to make Parse jump around a bit by changing the
input pointer a few times and by rewriting the rules as things proceed.
If this was for real work, then:
EITHER one-off? [
use it after more testing
][ rewrite it with a more robust solution]
Regards,
Brett.
REBOL [
Title: "Joel's PARSE problem."
Author: "Brett Handley"
Comments: {
1. First "blunt parse attack" approach.
2. I don't like the way I am stepping through character data.
3. More of a concern, in this approach it is difficult to see
any "value add" from PARSE over just writing a non-parse
parser.
4. I posted it because despite being ugly, I've expended the
effort, so may as well offer it as a target of critique! :)
5. Way to sensitive to a change in character positions!
6. Not tested thoroughly - only on the data in the code.
}
]
data: decompress #{
789CA5D25D0AC3200C00E0ABE46DD087E24FF7738FED02B6CD9820DA55A5D75F
EA2C8C958DD9459F8C7C2644C628385C6EDA036D051E6D40DB610D45211893B4
F8CC31910FCFAE0C597B2279F2254145FEED356FC940DD6FEBF7E9EDAB44C0A4
8D815E7B350CA8C62D9E4CDEA18236866C3ABB0BD062A7A24770571AD03DCE03
0237F6F8F191E4F13C8F23CD17C1281FCA9A5CD7D724EFF4FDEE2F5F68F11EAA
35717B80020000
}
; -------------------------------------------------
; PARSE RULES
; -------------------------------------------------
r_digit: charset {0123456789}
r_char: complement charset { ^-}
r_seq1: [
copy v_lineseq1 6 r_digit
(v_lineseq1: to integer! v_lineseq1)
]
r_seq2: [
copy v_lineseq2 8 r_digit
(v_lineseq2: to integer! v_lineseq2)
]
r_body: [
:v_linebody
66 [
#"." (end-of-sentence) |
copy v_char r_char (character v_char) |
skip (whitespace)
]
8 skip
]
r_ignore: [none]
r_error: [
(emit join "ERROR: " copy/part v_linestart v_lineend)
none
]
; Splits each line into fields, checks sequencing.
r_line: [
v_linestart:
r_seq1 v_linebody: 66 skip r_seq2 v_lineend:
(r_filterdynamic: dynamic-filter-rule?)
r_filterdynamic
]
; -------------------------------------------------
; Functions call during parse process
; -------------------------------------------------
dynamic-filter-rule?: does [
either all [
any [none? seq1 v_lineseq1 >= seq1]
any [none? seq2 v_lineseq2 >= seq2]
] [
seq1: v_lineseq1
seq2: v_lineseq2
either v_linebody/1 = #"*" [[none]] [r_body]
] [r_error]
]
character: func [ch] [insert tail out ch]
whitespace: does [
if all [not empty? out #" " <> last out] [
insert tail out #" "
]
]
end-of-sentence: does [
character #"."
emit out out: copy {}
]
emit: :print
; -------------------------------------------------
; HERE WE GO
; -------------------------------------------------
;
; Initialise
;
v_lineseq1: v_lineseq2:
v_linebody: v_linestart: v_lineend:
seq1: seq2: r_filterdynamic: none
out: copy {}
either parse/all data [any r_line] [
if not empty? out [print "ERROR last sentence incomplete."]
] [print "ERROR The file does not conform to the expected format."]
HALT
[18/42] from: lmecir:mbox:vol:cz at: 5-Apr-2003 8:59
Hi,
> in Perl would be something like
>
> $somestring =~ /^(a*)(b*)$/ and length ($1) == length ($2)
, which is Perl, not RE...
> While the content identification and the length equality take two
> different sub-expressions, I'm wondering what happens if we make the
<<quoted lines omitted: 10>>
> length ($2) == length ($3)
> and so on for four or more "pieces".
Again, Perl, not RE. Parse dialect:
x: [start: any #"a" end: (n: offset? start end) n #"b" n #"c"]
Regards
-Ladislav
[19/42] from: brett:codeconscious at: 5-Apr-2003 17:19
I noted in my last email that I didn't think I was getting much value add
from PARSE in that solution so I thought I would accept that idea and see
what a less PARSE reliant solution might look like. I know that this is
tangential to the request but it could be useful to the discussion so here
it is.
I'm interested to see if there is an elegant PARSE solution for this problem
but I'm not sure. In solving the problem, it might have been useful to
invoke PARSE from inside a PARSE rule but I cannot remember if this is safe
or not.
REBOL [
Title: "Joel's PARSE problem."
Author: "Brett Handley"
Comments: {}
]
data: decompress #{
789CA5D25D0AC3200C00E0ABE46DD087E24FF7738FED02B6CD9820DA55A5D75F
EA2C8C958DD9459F8C7C2644C628385C6EDA036D051E6D40DB610D45211893B4
F8CC31910FCFAE0C597B2279F2254145FEED356FC940DD6FEBF7E9EDAB44C0A4
8D815E7B350CA8C62D9E4CDEA18236866C3ABB0BD062A7A24770571AD03DCE03
0237F6F8F191E4F13C8F23CD17C1281FCA9A5CD7D724EFF4FDEE2F5F68F11EAA
35717B80020000
}
emit: :print
parse-file: function [data] [
digit out
seq1 seq2 v_lineseq1 v_lineseq2 v_linebody
] [
digit: charset {0123456789}
out: copy []
forskip data 80 [
if not parse/all data [
copy v_lineseq1 6 digit
copy v_linebody 66 skip (trim v_linebody)
copy v_lineseq2 8 digit
to end
] [make error! "Bad file format - sequence numbers must be digits."]
either all [
any [none? seq1 v_lineseq1 >= seq1]
any [none? seq2 v_lineseq2 >= seq2]
] [
seq1: v_lineseq1 seq2: v_lineseq2
if data/7 <> #"*" [
insert tail out v_linebody
if #"." = last v_linebody [
emit reform out
clear out
]
]
] [emit join "ERROR: " copy/part data 80]
]
if not empty? out [emit "ERROR: Last sentence is incomplete."]
]
; Wrap the invocation in some error handling.
if error? set/any 'result try [parse-file data] [
error: disarm result
if block? msg: get in get in system/error error/type error/id [
msg: bind/copy msg in error 'self
]
reform msg
emit join "ERROR: " reform msg
]
HALT
[20/42] from: g:santilli:tiscalinet:it at: 5-Apr-2003 10:39
Hi Joel,
On Saturday, April 5, 2003, 1:01:10 AM, you wrote:
JN> How would you attack that with the PARSE dialect?
You can use the same technique you use with REs, except that you
can avoid useless copying.
>> parse "abc" [start: any #a end: (n: offset? start end) n #b n #c]
== true
>> parse "aabbcc" [start: any #a end: (n: offset? start end) n #b n #c]
== true
>> parse "aaabbbccc" [start: any #a end: (n: offset? start end) n #b n #c]
== true
>> parse "aaabbccc" [start: any #a end: (n: offset? start end) n #b n #c]
== false
Regards,
Gabriele.
--
Gabriele Santilli <[g--santilli--tiscalinet--it]> -- REBOL Programmer
Amigan -- AGI L'Aquila -- REB: http://web.tiscali.it/rebol/index.r
[21/42] from: g:santilli:tiscalinet:it at: 5-Apr-2003 11:13
Hi Joel,
On Saturday, April 5, 2003, 1:51:32 AM, you wrote:
JN> Thanks in advance to anyone who offers PARSE solutions!
line-rule: [copy leading 6 digit copy body 66 skip copy trailing 8 digit]
digit: charset "0123456789"
last-leading: last-trailing: 0
buffer: ""
foreach line read/lines %test.txt [
if parse/all line line-rule [
leading: to integer! leading
trailing: to integer! trailing
either any [leading <= last-leading trailing < last-trailing] [
print ["ERROR:" line]
] [
last-leading: leading last-trailing: trailing
if #"*" <> pick trim body 1 [
insert insert tail buffer " " body
]
if all [not empty? buffer #"." = last buffer] [
print trim buffer clear buffer
]
]
]
]
Which gives:
This is a sentence.
So is this.
ERROR: 000006* but this won't because of sequence order 20030101
The last sentence.
Regards,
Gabriele.
--
Gabriele Santilli <[g--santilli--tiscalinet--it]> -- REBOL Programmer
Amigan -- AGI L'Aquila -- REB: http://web.tiscali.it/rebol/index.r
[22/42] from: g:santilli:tiscalinet:it at: 5-Apr-2003 10:48
Hi Joel,
On Saturday, April 5, 2003, 1:28:26 AM, you wrote:
JN> The specific example at hand required the programmer to take action
JN> to make the first alternative fail before PARSE would consider the
JN> second. It would be nice if failure at the end of the *entire* rule
JN> also caused backtracking so the programmer didn't have to do any
JN> extra work.
That would mean that we can no more use subblocks for parsing.
I.e., if PARSE worked as you suggest, how would one write this?
rule1: ["a" | "b"]
rule2: ["c" rule1 "d"]
On "cad", based on your suggestion, rule1 would not reach the end
of the string and thus backtrack.
Notice that you can't backtrack the whole rule because it can be
composed of a complex network of subrules, each one having '|
alternatives. Which one should backtrack? What happens to the code
in parens?
With REs you can do this, mainly because they are just patterns
(and not grammars with possibly embedded code).
Regards,
Gabriele.
--
Gabriele Santilli <[g--santilli--tiscalinet--it]> -- REBOL Programmer
Amigan -- AGI L'Aquila -- REB: http://web.tiscali.it/rebol/index.r
[23/42] from: joel:neely:fedex at: 5-Apr-2003 7:46
Hi, Ladislav,
Thanks for the enlightening discussion!
Ladislav Mecir wrote:
> > in Perl would be something like
> >
> > $somestring =~ /^(a*)(b*)$/ and length ($1) == length ($2)
>
> , which is Perl, not RE...
>
Well, since there is no "pure RE" language, I have to use RE as
embedded in *some* programming language. I could have just as well
used Python, Ruby, Java, awk, etc... (The thought of using c for
this is just too painful to contemplate! ;-)
> >
> > $somestring =~ /^(a*)(b*)(c*)$/ and
<<quoted lines omitted: 3>>
> > and so on for four or more "pieces".
> Again, Perl, not RE. Parse dialect:
Again, I had to use some programming language if I wanted to show
a fragment of a program. The point was that most languages that use
RE (with which I'm familiar) make it easy to specify a declarative
pattern, then ask additional questions about the data if that pattern
is successfully matched (in this case, comparing the lengths of the
runs of "a", "b", and "c").
> x: [start: any #"a" end: (n: offset? start end) n #"b" n #"c"]
>
Very nice! Thanks for the solution!
Now let me scale differently: suppose I want to match consecutive,
equal-length runs of those three letters anywhere within the target
string? For example, all of the targets
"my dog has aaabbbccc fleas"
"aaadddeeeabc"
"abcccccccc"
"aabbaaaaabbcccc"
meet that criterion.
The previous solution transforms easily, as follows:
1) allow matching anywhere in the target -- this is implemented by
removing the BOS/EOS anchors (^ and $) from the pattern;
2) require at least one of each character (since zero of each is
an empty string that can be found anywhere an any target!) --
this is implemented by changing the * qualifier ("any") to
+ ("some") on all subpatterns;
3) recognize that extra "a"s at the beginning and/or extra "c"s at
the end don't disqualify the group -- this is implemented by
requiring only that there are at least as many "a" as "b" and
at least as many "c" as "b".
I'm "thinking out loud" here to show the thought process involve
in moving from one problem/solution to the next. The changes above
give us:
$somestring =~ /(a+)(b+)(c+)/ and
length ($1) >= length ($2) and
length ($2) <= length ($3)
So, let me ask here, how would you go about solving this next
variation on the theme? Would you transform the definition of X
above, or would you address it as a fresh problem with a different
strategy for solving?
-jn-
--
Polonius: ... What do you read, my lord?
Hamlet: Words, words, words.
_Hamlet_, Act II, Scene 2
[24/42] from: joel:neely:fedex at: 5-Apr-2003 8:44
Hi, Gabriele,
Gabriele Santilli wrote:
> You can use the same technique you use with REs, except that you
> can avoid useless copying.
<<quoted lines omitted: 6>>
> >> parse "aaabbccc" [start: any #a end: (n: offset? start end) n #b n #c]
> == false
I'm not clear on which part you're referring to as "useless copying".
Can you help me understand what you meant?
How would you solve the variation I posed in my reply to Ladislav?
(allow the run of equal-length a, b, c segments to occur anywhere
inside the target string?)
-jn-
--
Polonius: ... What do you read, my lord?
Hamlet: Words, words, words.
_Hamlet_, Act II, Scene 2
[25/42] from: joel:neely:fedex at: 5-Apr-2003 8:40
Hi, Gabriele,
Before we get too far down this road (and have to backtrack ;-), let
me state clearly that I'm not agitating for PARSE to be redesigned.
I am simply suggesting that each paradigm has its own "sweet spot".
I've always found that laying alternative strategies down side-by-
side and figuring out exactly where/why/how they differ has helped
me use both more effectively. That's my intent in this conversation.
Now let me try to reply to your questions/objections.
Gabriele Santilli wrote:
> JN> The specific example at hand required the programmer to take
> JN> action to make the first alternative fail before PARSE would
<<quoted lines omitted: 7>>
> On "cad", based on your suggestion, rule1 would not reach the end
> of the string and thus backtrack.
By "*entire* rule" I meant the top-level one invoked by PARSE itself.
In other words, when failure occurs, try a different alternative.
Under this view, saying
parse "cad" rule2
would be equivalent to saying
Find a path from start to finish through
(start)--- "c" ---+--- "a" ---+--- "d" ---(finish)
\ /
'- "b" -'
that spells "cad".
This doesn't require any backtracking over RULE1, since failure only
occurs when there is no viable alternative for the present position
(and REBOL defines being at the end of the rule with more input
remaining to be matched as not viable).
Another way of saying this is that your RULE1 and RULE2 could be
regarded (under a pure, declarative, backtrackable model) as being
a little grammer that matches/generates the set of strings that
are constructed by following all possible paths from start to
finish:
"cad"
"cbd"
in this case.
> Notice that you can't backtrack the whole rule because it can be
> composed of a complex network of subrules, each one having '|
> alternatives. Which one should backtrack?
>
The standard answer is "the most recent". For example, nested rules
such as
["a" ["b" | "bc"] ["d" | "de"] "f"]
that contain more than one decision point can still be "explored" in
a systematic way. At the risk of the illustration being more bulky
than I'd like, we can trace the attempt to match a string to the
above rule using underlining to show which portions of the string and
rule have been matched:
String Rule Comment
"abcdef" ["a" ["b" | "bc"] ["d" | "de"] "f"] just beginning
"abcdef" ["a" ["b" | "bc"] ["d" | "de"] "f"] match, no choice
- -
"abcdef" ["a" ["b" | "bc"] ["d" | "de"] "f"] match 1st choice
-- - - in 1st group
"abcdef" ["a" ["b" | "bc"] ["d" | "de"] "f"] fail 1st choice
--* - - * in 2nd group
"abcdef" ["a" ["b" | "bc"] ["d" | "de"] "f"] fail 2nd choice
--* - - * in 2nd group
...at which point there are no alternatives left, so back up and
reconsider the previous decision...
"abcdef" ["a" ["b" | "bc"] ["d" | "de"] "f"] match 2nd choice
--- - -- in 1st group
"abcdef" ["a" ["b" | "bc"] ["d" | "de"] "f"] match 1st choice
---- - -- - in 2nd group
"abcdef" ["a" ["b" | "bc"] ["d" | "de"] "f"] fail next pattern
----* - -- - * after 2nd group
"abcdef" ["a" ["b" | "bc"] ["d" | "de"] "f"] match 2nd choice
----- - -- -- in 2nd group
"abcdef" ["a" ["b" | "bc"] ["d" | "de"] "f"] match last part
------ - -- -- - SUCCESS!
I hesitate to draw the picture of that rule (for fear of adding more
bulk) but feel it may be helpful to clarify:
+- "b" --+ +- "d" --+
/ \ / \
(start)--- "a" -+ +--+ +- "e" ---(finish)
\ / \ /
+- "bc" -+ +- "de" -+
so the tracing laid out (in painful detail ;-) above is just exploring
systematically (left-to-right, top-to-bottom) through the above map
searching for a road from start to finish. Hitting a dead-end just
causes the last decision to be reconsidered.
> What happens to the code in parens?
>
That's my point. If we mix in the possibility of side-effects that
may occur while moving forward, then backtracking is *certainly*
more complicated to automate. One could either:
1) save the state of anything modified by a procedural action, and
restore that state upon backtracking across the action, or
2) explore the declarative pattern looking for a successful path,
and then take/commit all included actions at the end once a
successful path is located.
For those of us who use databases, this amounts to the same kind of
commit/rollback pattern that is used with a database transaction.
All I'm trying to point out is that in a procedural model, the
programmer has to do more work to implement backtracking, and it is
harder to do that if other procedural effects are intertwined within
the pure pattern matching.
> With REs you can do this, mainly because they are just patterns
> (and not grammars with possibly embedded code).
>
Agreed. Identification is done first, and then any actions are taken
after a match has been achieved.
Thanks for the interesting discussion!
-jn-
--
Polonius: ... What do you read, my lord?
Hamlet: Words, words, words.
_Hamlet_, Act II, Scene 2
[26/42] from: joel:neely:fedex at: 5-Apr-2003 9:00
Hi, again, Gabriele,
I've just tripped over a tidy example of a task that's trivial to
solve if backtracking is available.
Find a repeated group of characters at the end of a string (one or more
characters, repeated two or more times consecutively). For example,
"my doggyoggy"
11112222
"has as as as as "
111222333444555
"fleasssssss"
1234567
Note that some cases can be solved in multiple ways:
"fleassssss"
111222
"fleassssss"
112233
"fleassssss"
123456
in which case any solution is OK.
The Perl/Python/etc-style regular expression which tests for this
situation and provides the first occurrence of the repeated group
of characters is
/(.+)\1+$/
which, for those not familar with RE "line noise" ;-) can be read
(.+) a group (the parens) of one or more (the plus)
arbitrary characters (the dot)
\1 ... followed by the same group previously matched
+ ... one or more times
$ ... at the end of the string
I'm really interested in how a PARSE expert would approach the same
task. Any takers?
-jn-
--
Polonius: ... What do you read, my lord?
Hamlet: Words, words, words.
_Hamlet_, Act II, Scene 2
[27/42] from: g:santilli:tiscalinet:it at: 5-Apr-2003 21:27
Hi Joel,
On Saturday, April 5, 2003, 4:40:01 PM, you wrote:
JN> Before we get too far down this road (and have to backtrack ;-), let
JN> me state clearly that I'm not agitating for PARSE to be redesigned.
JN> I am simply suggesting that each paradigm has its own "sweet spot".
I know that, however, we should not ignore the underlying
complexity of the tools we use. Sometimes is a blessing to have
some nice abstracted layer over everything, sometimes it is a
disaster. Most of the times, things can be automated efficiently;
but there are cases where you expect the computer to do the right
thing, but it doesn't. (As an example, consider SQL; I like it and
use it a lot, but today a simple DELETE query on MySQL took three
hours when I was expecting it to take twenty minutes at most. I
had to kill the process because I could not wait more. Evidently
MySQL had no clue how to optimize the operation in that case, but
I could have done better if I wasn't expecting it to behave
correctly on a simple (for my eyes!) DELETE FROM table WHERE col 'value'.)
JN> By "*entire* rule" I meant the top-level one invoked by PARSE itself.
I know, but that requires PARSE to be much more complex that it is
now. And I'm not sure that it would be so much useful to be worth
it, especially considering that I have yet to find a case in real
work were PARSE does not work well.
JN> In other words, when failure occurs, try a different alternative.
You have to know all the alternatives! (As you show below.) But
PARSE doesn't. It just knows the current position in the string or
block and the current rule being matched.
JN> All I'm trying to point out is that in a procedural model, the
JN> programmer has to do more work to implement backtracking,
You just need to place the longest-matching rules first. It's much
less work that it seems, and to me it is almost natural now.
JN> Thanks for the interesting discussion!
All this said, I'd like to say that I'd welcome a pattern matching
parser with backtracking and so on, just because I'm curious to
see how much effort it is to write one. My bet is "too much", but
of course the rest of the world seems to have a different POV. ;-)
Wanna write one?
Regards,
Gabriele.
--
Gabriele Santilli <[g--santilli--tiscalinet--it]> -- REBOL Programmer
Amigan -- AGI L'Aquila -- REB: http://web.tiscali.it/rebol/index.r
[28/42] from: g:santilli:tiscalinet:it at: 5-Apr-2003 21:48
Hi Joel,
On Saturday, April 5, 2003, 4:44:11 PM, you wrote:
JN> I'm not clear on which part you're referring to as "useless copying".
When you say (some expression) and then refer to the matched
substring with $1, it means that the interpreter has created a new
variable and copied a part of the input string in it. With the
approach I (and Ladislav) used you do not need to allocate any
extra space or copy parts of the input string. (I.e. you are sure
that the are no copies, independently on the "sophisticatedness"
of the implementation.)
JN> How would you solve the variation I posed in my reply to Ladislav?
JN> (allow the run of equal-length a, b, c segments to occur anywhere
JN> inside the target string?)
Using an approach similar to the one you used for the RE, i.e.
(using copies this time, for the sake of simplicity):
any [
here:
copy As some #"a"
copy Bs some #"b"
copy Cs some #"c"
(if all [
greater-or-equal? length? As length? Bs
lesser-or-equal? length? Bs length? Cs
] [print ["Found match at position" index? here]])
| skip
]
It is admittedly quite slow, but I doubt that the RE would be
faster.
>> parse/all "my dog has aaabbbccc fleas" rule
Found match at position 12
== true
>> parse/all "aaadddeeeabc" rule
Found match at position 10
== true
>> parse/all "abcccccccc" rule
Found match at position 1
== true
>> parse/all "aabbaaaaabbcccc" rule
Found match at position 5
== true
(The last result is arguably wrong, however I think it is on par
with the one you get with your RE.)
Regards,
Gabriele.
--
Gabriele Santilli <[g--santilli--tiscalinet--it]> -- REBOL Programmer
Amigan -- AGI L'Aquila -- REB: http://web.tiscali.it/rebol/index.r
[29/42] from: g:santilli:tiscalinet:it at: 5-Apr-2003 22:00
Hi Joel,
On Saturday, April 5, 2003, 5:00:08 PM, you wrote:
JN> /(.+)\1+$/
I wonder what is the RE parser doing under the hood. What is the
complexity of such a innocent-looking RE?
JN> I'm really interested in how a PARSE expert would approach the same
JN> task. Any takers?
Well, I wouldn't use PARSE probably. I would do it this way: for
all the divisors of the length of the input string, check to see
if the string is made of a repetition of a substring that has the
length of the divisor. If no solution is found, apply the same
test starting at the next position of the string, recursively.
Regards,
Gabriele.
--
Gabriele Santilli <[g--santilli--tiscalinet--it]> -- REBOL Programmer
Amigan -- AGI L'Aquila -- REB: http://web.tiscali.it/rebol/index.r
[30/42] from: lmecir:mbox:vol:cz at: 5-Apr-2003 23:56
Sorry, In my previous post I forgot to change the definition of X to:
x: [start: some #"a" end: (n: offset? start end) n #"b" n #"c"]
sorry for any inconveniences
Regards
-Ladislav
[31/42] from: lmecir:mbox:vol:cz at: 6-Apr-2003 0:08
Hi Joel,
> I've just tripped over a tidy example of a task that's trivial to
> solve if backtracking is available.
<<quoted lines omitted: 27>>
> task. Any takers?
> -jn-
like this:
x1: [end: (x2: copy/part start end) x2]
x3: to-rule x1
x4: [start: skip x3 to end]
parse head reverse "my doggyoggy" x4
Regards
-Ladislav
[32/42] from: lmecir:mbox:vol:cz at: 5-Apr-2003 23:50
Hi Joel,
> Thanks for the enlightening discussion!
I enjoy it too, although I am a little bit busy.
> > > in Perl would be something like
> > >
<<quoted lines omitted: 6>>
> used Python, Ruby, Java, awk, etc... (The thought of using c for
> this is just too painful to contemplate! ;-)
I just wanted to underline the fact (not obvious from your explanations),
that RE provably cannot do such things. Only programming languages not using
the proper definition of RE can.
> Now let me scale differently: suppose I want to match consecutive,
> equal-length runs of those three letters anywhere within the target
<<quoted lines omitted: 4>>
> "aabbaaaaabbcccc"
> meet that criterion.
I would use my TO-RULE function and the previous rule X as follows:
x: [start: any #"a" end: (n: offset? start end) n #"b" n #"c"]
z: to-rule x
parse "my dog has aaabbbccc fleas" [z to end]
The function:
fail: [end skip]
to-rule: function [
{generate a to A parse rule}
a [block!]
] [c f] [
compose/deep [
any [
[
(reduce [a]) ([(c: fail f: none) | (c: none f: fail) skip])
] c
] f
]
]
> The previous solution transforms easily, as follows:
> 1) allow matching anywhere in the target -- this is implemented by
<<quoted lines omitted: 18>>
> strategy for solving?
> -jn-
Let me ask you a question. What would be the result of your expression for:
"aabbbcccabc"
Regards
-Ladislav
[33/42] from: lmecir:mbox:vol:cz at: 6-Apr-2003 0:23
> > The Perl/Python/etc-style regular expression which tests for this
> > situation and provides the first occurrence of the repeated group
> > of characters is
> >
> > /(.+)\1+$/
The only trouble with this "Regular expression" is, that it isn't a Regular
Expression. I don't want to forbid its usage, but the name is incorrect and
misleading.
Regards
-Ladislav
[34/42] from: rotenca:telvia:it at: 6-Apr-2003 2:58
Hi all,
> JN> How would you solve the variation I posed in my reply to Ladislav?
> JN> (allow the run of equal-length a, b, c segments to occur anywhere
<<quoted lines omitted: 12>>
> | skip
> ]
I think that it can be done also with a variation of the previous rule, if I
understand well the target:
parse/all str [
some [
s: some #"a" e: (n: offset? s e)
n #"b" n #"c" (print index? s)
to end
| skip
]
]
---
Ciao
Romano
[35/42] from: greggirwin:mindspring at: 5-Apr-2003 18:15
Hi Joel et al,
Sorry I don't have time to think deeply (or dink theeply) about this
right now, but it's great topic.
<< Well, since there is no "pure RE" language, I have to use RE as
embedded in *some* programming language. I could have just as well
used Python, Ruby, Java, awk, etc... >>
Are you sure about AWK? I didn't think it supported backreferences,
and a quick check in the Friedl RE book seems to bear that out (though
a couple modern AWK's might).
OK so taking AWK out isn't that important, or why I'm writing. What
seems important to me is that - as you pointed out - RE needs a host
language and there are many flavors of RE out there. The biggest
differences will be between the NFA and DFA engines (DFA engines not
supporting backtracking AFAIK).
Now, I'm going to be in way over my head if I continue down this path,
but it strikes me that we can probably look at the differences in that
context rather than on a case-by-case basis with examples. I.e. come
up with a general solution, or model, about how to design PARSE rules
for these kinds of grammars/situations. This thread has already provided
some good examples.
For those with the time and desire, the dragon book has a section on
how to construct a DFA from an NFA which might provide some pointers.
-- Gregg
[36/42] from: rotenca:telvia:it at: 6-Apr-2003 3:06
Hi Joel,
> The Perl/Python/etc-style regular expression which tests for this
> situation and provides the first occurrence of the repeated group
> of characters is
>
> /(.+)\1+$/
This is a solution with an error! jump out parse statement, can be done also
with another stop rule, it it seems too much tricky (BTW, reversing the string
before parsing it, leads to a more direct solution):
stop-true: [end skip]
if error? set/any 'res try [
parse/all "my doggyoggy" [
some [
s: (len: (length? s) / 2 n: 0 stop: none)
some [
(if len < n: n + 1 [stop: stop-true])
stop
:s opt [copy x n skip some x end (make error! s)]
]
skip
]
]
][
res: disarm res print res/arg1
]
---
Ciao
Romano
[37/42] from: ingo:h-o-h at: 7-Apr-2003 15:03
Hi Joel,
Joel Neely wrote:
<...>
> GIVEN:
>
> A file of lines, each of which is 80 characters, and contains:
<...>
> Thanks in advance to anyone who offers PARSE solutions!
<...>
This is one possible solution,
Kind regards,
Ingo
[REBOL [
Title: {Rebol solution to:
Re: Parse versus Regular Expressions (Joel Nealy)}
Author: "Ingo Hohmann"
]
data: {00001 This is a sentence. 20030301
00002 So 20030302
00003 is 20030302
00004 this. 20030302
00004 Error in this line. 20030302
00005* this will disappear 20030303
00006* but this won't because of sequence order 20030101
00007 The last 20030304
00008 sentence. 20030304
00009*And this will show. 20030301}
char: complement charset ""
non-point: complement charset "."
old-pre: old-post: "" ; I need no copy here
sentence: ""
parse/all data [
any [
copy pre 5 char
copy line [48 non-point | thru "."]
copy trailing any " "
copy post 8 char
opt skip ; In case there's no newline in the last line
(
either all [pre > old-pre post >= old-post][
if not #"*" = first line [
insert tail sentence line
if #"." = last sentence [
print trim/lines sentence
clear sentence
]
]
][
print rejoin ["ERROR: " pre line
either none? trailing [""][trailing] post
]
]
old-pre: pre
old-post: post
)
]
]
]
[38/42] from: tomc:darkwing:uoregon at: 7-Apr-2003 23:47
Hi Joel
char: charset {abcdefghijklmnopwrstuvwxyz}
hit: [mark: copy ch char :mark]
context-sensitive: [
hit
copy str some ch (len: length? str)
any[hit copy str len ch]
]
parse string contect-sensitive
that should scale to any number of "pieces" of any size
(within machine limits ofcourse)
On Fri, 4 Apr 2003, Joel Neely wrote:
[39/42] from: tomc:darkwing:uoregon at: 8-Apr-2003 1:10
not as bad it it looks at first
digit: charset {0123456789}
digits: [some digit]
cur-seq: copy/part find buf "^/" -8
sentence: copy ""
parse buf [
some[ (flag: false)
digits opt["*" (flag: true)] copy string to newline newline
( seq: copy/part tail string -8
string: trim head clear find/last string seq
either seq = cur-seq
[either flag [sentence: copy ""][append sentence join " " [string]]]
[either seq > cur-seq
[print sentence sentence: string cur-seq: seq]
[print rejoin["ERROR " seq if flag ["* "] string] sentence: copy ""]
]
)
](print sentence)
]
On Fri, 4 Apr 2003, Joel Neely wrote:
[40/42] from: lmecir:mbox:vol:cz at: 9-Apr-2003 17:12
Hi all,
Joel wrote:
> PARSE seems to partake of the same issue as Prolog; its seems to be
> implemented in a way that makes it a hybrid between declarative and
> procedural programming, with the result that sometimes things have
> side effects or behaviors that can be difficult to figure out.
I have read some grammar articles and came to an opinion, that PARSE the most closely
resembles lambda calculus, which I would call *purely procedural*. It looks so natural,
that it confused not just Joel, I think.
> I've just tripped over a tidy example of a task that's trivial to
> solve if backtracking is available.
<<quoted lines omitted: 26>>
> I'm really interested in how a PARSE expert would approach the same
> task. Any takers?
To not start any flame wars, I would call the above expression an Enhanced
Regular Expression, because as we will see below, it isn't a Regular Expression at all.
We have seen the PARSE solution. I used my freshly acquired "knowledge" of grammar stuff
and tried to analyse the grammar. Here are the rules I came to(supposing, that the base
alphabet is just "ab", to simplify the expressions, S is the starting symbol):
First, some Regular Rules:
S->aS
S->bS
S->T
T->aa
T->bb
Context Free Rules:
T->aVA
T->bVB
V->Ca
V->Db
V->aVa
V->bVb
Context Preserving Rules:
Ca->CE
Cb->CF
Da->DE
Db->DF
And, at last, Context Changing Rules:
Ea->aE
Eb->bE
Fa->aF
Fb->bF
EA->Aa
EB->Ba
FA->Ab
FB->Bb
CA->aa
CB->ba
DA->ab
DB->bb
Uff! This points to the fact, that the discussed grammar is neither Regular, nor Context
Free, and not even Context Preserving and it can hardly be called "simple".
As opposed to this, the earlier PARSE rule example:
z: [#"a" z #"b" | none]
can be expressed as:
Regular Rule:
S->""
Context Free Rule:
S->aSb
i.e. it is a Context Free Grammar.
Regards
-Ladislav
[41/42] from: lmecir:mbox:vol:cz at: 9-Apr-2003 19:00
Hi all,
Joel wrote:
> 1) It makes the PARSE dialect a strange hybrid between declarative
> and procedural descriptions (as mentioned in my earlier post).
Inspired by this POV I decided to read some articles and found out, that PARSE probably
isn't a hybrid at all, it is a purely procedural thing based on the lambda calculus.
> 2) It totally breaks normal usage that "or" is symmetrical, even
> though we read the | as "or".
...
> Notice, however, the differences in behavior of the following
> attempts at parse rules.
<<quoted lines omitted: 8>>
> b) a rationale based on emphasizing some other detail than
> the obvious "what would most people expect" issue.
My reading revealed, that the "and" and "or" names are inappropriate and that in the
lambda calculus it is usual to call the operators "andalso" and "orelse".
Regards
-Ladislav
[42/42] from: lmecir:mbox:vol:cz at: 12-Apr-2003 10:51
Hi all, I uploaded a new article to RiT
http://www.compkarori.com/cgi-local/vanilla.r?selector=display&snip=PARSE-Ve
rsus-Regexs summarizing my findings. Feel free
to edit/improve it and correct my errors.
-Ladislav
Notes
- Quoted lines have been omitted from some messages.
View the message alone to see the lines that have been omitted