Parse limitation ?
[1/16] from: patrick::philipot::laposte::net at: 8-Oct-2003 12:10
Hi List,
I'd like to parse a string searching for two things at the same time.
it seems to me that this is impossible.
For example, a text from which I want to extract the HREF and the SRC target.
    myText: {<A HREF="#section1"><IMG SRC="foobar.gif"><A HREF="#section1">}
    parse myText [
        any [ thru "HREF=" copy target to ">" (print target) |
              thru "SRC=" copy target to ">" (print target)
        ] ; any
    ] ; parse
    "#section1"
    "#section1"
    parse myText [
        any [ thru "SRC=" copy target to ">" (print target) |
              thru "HREF=" copy target to ">" (print target)
        ] ; any
    ] ; parse
    "foobar.gif"
    "#section1"
The result is different depending on which rule comes first. The only workaround I see
is to parse the text twice. Is there a better (smarter) way?
Regards
Patrick
[2/16] from: ingo:2b1 at: 8-Oct-2003 12:50
Hi Patrick,
patrick à la poste wrote:
> Hi List,
>
> I'd like to parse a string searching for two things at the same time.
> it seems to me that this is impossible.
One trick is to find something that the two strings have in common, and
work from there ...
    REBOL []
    myText: {<A HREF="#section1"><IMG SRC="foobar.gif"><A HREF="#section2">}
    parse/all myText [
        any [
            to "=" here: (there: at here -4) :there [
                [ "HREF=" | " SRC=" ]
                copy target to ">" (print target) |
                thru "="
            ]
        ]
    ] ; parse
In this example I used the "=" which is common to both strings, checked
whether what I have _before_ this sign is one of the two strings I'm
interested in, and then start to copy, or just go thru the "=" to start
again ...
I hope that helps,
Ingo
[3/16] from: petr:krenzelok:trz:cz at: 8-Oct-2003 14:06
patrick à la poste wrote:
>Hi List,
>I'd like to parse a string searching for two things at the same time.
<<quoted lines omitted: 16>>
>"#section1"
>The result is different depending which rule comes first. The only way I see as a workaround
>is to parse the text twice. Is there a better (smarter) way?
I would just like to point out that a 'first directive, or to/thru [a | b
| c], was proposed as a parse enhancement some time ago, but then some
parse gurus (e.g. Gabriele) admitted that parse would have to work
another way internally and that it is not easily achievable (am I right,
Gabriele?)
OTOH - your example is just one of those which we meet often enough in
real life, but have no easy/elegant solution for, at least not one that
a novice would be able to come up with ....
-pekr-
[4/16] from: lmecir:mbox:vol:cz at: 8-Oct-2003 14:23
Hi Pat,
----- Original Message -----
From: "patrick à la poste"
> Hi List,
> I'd like to parse a string searching for two things at the same time.
<<quoted lines omitted: 18>>
> Regards
> Patrick
This is possible with PARSE. You can use my parse enhancements e.g. Have a look at: http://www.fm.vslib.cz/~ladislav/rebol/parseen.r
Ladislav
[5/16] from: g:santilli:tiscalinet:it at: 8-Oct-2003 16:03
Hi Petr,
On Wednesday, October 8, 2003, 2:06:58 PM, you wrote:
PK> I would just like to point out that a 'first directive, or to/thru [a | b
PK> | c], was proposed as a parse enhancement some time ago, but then some
PK> parse gurus (e.g. Gabriele) admitted that parse would have to work
PK> another way internally and that it is not easily achievable (am I right,
PK> Gabriele?)
The point is that internally PARSE would be forced to do the
equivalent of:
    [any [a | b | c | skip]]
so even if it could be a bit faster than the above I don't think
it would be of great help. More readable, maybe... so it's
something I could add to compile-rules, if I get some time to work
on it.
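A minimal sketch of that loop applied to the HREF/SRC problem from the start of the thread. The word `targets` is an illustrative name that does not appear in the thread, and the sample string is the one Ingo posted. Each alternative is tried at the current position, and `skip` advances one character when neither keyword matches, so the matches come out in document order:

```rebol
REBOL []

myText: {<A HREF="#section1"><IMG SRC="foobar.gif"><A HREF="#section2">}
targets: copy []    ; hypothetical accumulator, not part of the thread

parse/all myText [
    any [
        ; try both keywords at the current position ...
        "HREF=" copy target to ">" (append targets target)
        | "SRC=" copy target to ">" (append targets target)
        ; ... and move on by one character if neither is here
        | skip
    ]
]

probe targets
```

Each collected value still carries its surrounding double quotes, because the copy runs up to the closing ">".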
In this particular case, I wouldn't use this construct at all,
since it's much better to have a more complete grammar (one that can
distinguish between href= inside a tag and outside of a tag,
etc.), IMHO.
Regards,
Gabriele.
--
Gabriele Santilli <[g--santilli--tiscalinet--it]> -- REBOL Programmer
Amiga Group Italia sez. L'Aquila --- SOON: http://www.rebol.it/
[6/16] from: petr:krenzelok:trz:cz at: 8-Oct-2003 16:31
Petr Krenzelok wrote:
>patrick à la poste wrote:
>>Hi List,
<<quoted lines omitted: 38>>
>real life, but have no easy/elegant solution for, at least not for
>novice being able to solve it ....
Well, I just played a bit and the following hack appeared in my notepad :-)
    reposition: func [str blk /local res tmp][
        res: copy []
        foreach item blk [
            if not none? tmp: find str item [append res reduce [index? tmp item]]
        ]
        sort/skip res 2
        either empty? res [str][at str (first res) - (index? str) + 1]
    ]
    myText: {
    <A HREF="#section1"><IMG SRC="foobar.gif"><A HREF="#section2">
    <A HREF="#section1"><IMG SRC="foobar.gif"><A HREF="#section2">
    <A HREF="#section1"><IMG SRC="foobar.gif"><A HREF="#section2">
    <A HREF="#section1"><IMG SRC="foobar.gif"><A HREF="#section2">
    }
    src-rule: ["SRC=" copy target to ">" (print target)]
    href-rule: ["HREF=" copy target to ">" (print target)]
    parse/all mytext [
        any [
            mark: (mark: reposition mark ["HREF=" "SRC="]) :mark
            [src-rule | href-rule]
        ]
        to end
    ]
You can call the 'reposition function with a block containing any number of
options, and it decides which one comes first. It does a plain search for
each option, records its position, sorts the resulting block and "repositions"
your parse input string so that the parser pointer points to the first of
the options. You can then directly apply "HREF=", "SRC=" etc. and be
sure one of them is there ...
Well, I don't know how robust it is, but I tried it with mytext: read
http://www.rebol.com and it seems it needs further tuning :-) ....
The following might get you better results:
    mytext: read http://www.rebol.com
    src-rule: [{SRC="} copy target to {"} (print target)]
    href-rule: [{HREF="} copy target to {"} (print target)]
    parse/all mytext [
        any [
            mark: (mark: reposition mark [{HREF="} {SRC="}]) :mark
            [src-rule | href-rule]
        ]
        to end
    ]
Anyway ... you've got some inspiration ...
-pekr-
[7/16] from: petr:krenzelok:trz:cz at: 8-Oct-2003 16:39
Gabriele Santilli wrote:
>Hi Petr,
>On Wednesday, October 8, 2003, 2:06:58 PM, you wrote:
<<quoted lines omitted: 6>>
>equivalent of:
> [any [a | b | c | skip]]
ah, but that is char-by-char execution ...
>so even if it could be a bit faster than the above I don't think
>it would be of great help. More readable, maybe... so it's
>something I could add to compile-rules, if I get some time to work
>on it.
>
>In this particular case, I wouldn't use this construct at all,
>since it's much better to have a more complete grammar
>
yes, exactly - but I think such grammar to simply achieve what was
requested will not be easy for novices. The tool (REBOL) should support
our thinking pattern - and the easiest one is to "skip" "to | thru"
certain string - no matter what is in between.
If someone is up-to writing complete html parser, building DOM object,
then maybe we are near seeing rebol based web-browser? :-)
-pekr-
[8/16] from: g:santilli:tiscalinet:it at: 8-Oct-2003 17:42
Hi Petr,
On Wednesday, October 8, 2003, 4:39:01 PM, you wrote:
PK> ah, but that is char-by-char execution ...
Do you know any other way to do that? (Your example is using FIND
multiple times, and in a big string that would be many times
slower.)
PK> yes, exactly - but I think such grammar to simply achieve what was
PK> requested will not be easy for novices. The tool (REBOL) should support
PK> our thinking pattern - and the easiest one is to "skip" "to | thru"
PK> certain string - no matter what is in between.
I think that it is better to think of the problem in a different
way, because it allows you to understand things much better. If
you switch to think about grammars instead of patterns you'll find
out that your problems get simpler, not more complicated. IMHO.
PK> If someone is up-to writing complete html parser, building DOM object,
PK> then maybe we are near seeing rebol based web-browser? :-)
Well, the 74-line [X]HTML parser built into Temple is far from
being complete, but has been able to parse all the HTML files I've
fed into it until now. I don't think this is that complicated;
you just need to avoid that brain-dead way of doing things that
seems to pervade the world. ;-)
Regards,
Gabriele.
[9/16] from: maximo:meteorstudios at: 8-Oct-2003 12:29
> -----Original Message-----
> From: Gabriele Santilli [mailto:[g--santilli--tiscalinet--it]]
<<quoted lines omitted: 5>>
> you switch to think about grammars instead of patterns you'll find
> out that your problems get simpler, not more complicated. IMHO.
can you give a short example of a grammar that would extract the text from
<tag! tag content <subtag! its content?> <p> paragraph info</p>content end>
and returns a block such as:
[
tag! [
"tag content"
subtag! [
"its content?"
]
p [
"paragraph info"
]
"content end"
]
]
I have no idea How I would approach this!
this could be a nice tutorial for us "less gifted" parsers.
-MAx
[10/16] from: petr:krenzelok:trz:cz at: 8-Oct-2003 18:57
Gabriele Santilli wrote:
>Hi Petr,
>
>On Wednesday, October 8, 2003, 4:39:01 PM, you wrote:
>
>PK> ah, but that is char-by-char execution ...
>
>Do you know any other way to do that? (Your example is using FIND
>multiple times, and in a big string that would be many times
>slower.)
>
Well - I am not sure my example will be any slower, except for the penalty
of an extra function call. First, I pass it the string at a certain position
and it returns the string at the position where parse rule a) or b)
can be applied directly. Second, it is 2 direct searches in the string and a
decision about which index comes first, vs. probably recursive char-by-char
rules (whose penalty I am not able to estimate :-)
>PK> yes, exactly - but I think such grammar to simply achieve what was
>PK> requested will not be easy for novices. The tool (REBOL) should support
<<quoted lines omitted: 4>>
>you switch to think about grammars instead of patterns you'll find
>out that your problems get simpler, not more complicated. IMHO.
Yes, I can imagine it, really. The problem is (at least for me), that I
am able to understand such grammar once someone creates it, but am not
able to come up with it to solve problem at hand. Will you blame us
little bit underskilled rebol programmers now? :-)
>PK> If someone is up-to writing complete html parser, building DOM object,
>PK> then maybe we are near seeing rebol based web-browser? :-)
<<quoted lines omitted: 3>>
>you just need to avoid that brain-dead way of doing things that
>seems to pervade the world. ;-)
Sounds interesting. I am just curious, if e.g. html only (not trying to
complicate things with java-script for now :-) browser would be possible
with Rebol? IIRC Python has web browser. Just curious.
-pekr-
[11/16] from: patrick:philipot:laposte at: 8-Oct-2003 21:01
Hello Ingo,
Wednesday, October 8, 2003, 12:50:20 PM, you wrote:
IH> Hi Patrick,
IH> patrick à la poste wrote:
>> Hi List,
>>
>> I'd like to parse a string searching for two things at the same time.
>> it seems to me that this is impossible.
IH> One trick is to find something that the two strings have in common, and
IH> work from there ...
IH> REBOL []
IH> myText: {<A HREF="#section1"><IMG SRC="foobar.gif"><A HREF="#section2">}
IH> parse/all myText [
IH> any [
IH> to "=" here: (there: at here -4) :there [
IH> [ "HREF=" | " SRC=" ]
IH> copy target to ">" (print target) |
IH> thru "="
IH> ]
IH> ]
IH> ] ; parse
IH> In this example I used the "=" which is common to both strings, checked
IH> whether what I have _before_ this sign is one of the two strings I'm
IH> interested in, and then start to copy, or just go thru the "=" to start
IH> again ...
IH> I hope that helps,
IH> Ingo
This is brilliant!
Thank you Ingo.
--
Best regards,
Patrick
[12/16] from: g:santilli:tiscalinet:it at: 8-Oct-2003 21:47
Hi Maxim,
On Wednesday, October 8, 2003, 6:29:03 PM, you wrote:
MOA> can you give a short example of a grammar that would extract the text from
MOA> <tag! tag content <subtag! its content?> <p> paragraph info</p>content end>
MOA> and returns a block such as:
[...]
Well, nested tags are not valid HTML so this does not handle them,
but maybe it could be of some inspiration. (Sorry for Joel-style
indentation. ;-)
    tag-rule:
      [ "<" m1:
        [ "/" word thru ">" (end-tag to word! word-res)
        | "!--" thru "-->" m2: (add-contents to tag! copy/part m1 back m2)
        | "!DOCTYPE" thru ">" m2: (add-contents to tag! copy/part m1 back m2)
        | "?xml" thru "?>" m2: (add-contents to tag! copy/part m1 back m2)
        | word any space (clear attributes) any attribute
          ["/" (content: no) | none (content: yes)] ">"
          (open-tag to word! word-res attributes content)
      ] ]
    chars: complement charset {<>"'= ^/^-/}
    value-chars: union chars charset "/"
    word: [copy word-res some chars]
    space: charset { ^/^-}
    attributes: [ ]
    attribute:
      [ (wrs: word-res) word any space
        [ "=" any space
          [ {"} copy value any dquoted-chars {"}
          | {'} copy value any squoted-chars {'}
          | copy value any value-chars
          ] any space
        | (value: yes)
        ] (insert insert tail attributes to word! word-res any [value copy ""] word-res: wrs)
      ]
    dquoted-chars: complement charset {"}
    squoted-chars: complement charset {'}
    document-rule:
      [ some
        [ copy contents to "<" (add-contents contents) tag-rule
        | copy contents to end (add-contents contents) break
      ] ]
    stack: [ ]
    parsed: none
    no-content-tags:
      [ basefont br area link img param hr input col frame base meta]
    open-tag:
      func [tagname attributes content? /local tag]
      [ if find no-content-tags tagname [content?: no]
        either content?
          [ tag: compose/deep [[(tagname) (attributes)]]
            insert/only tail last stack tag
            insert/only tail stack tag
          ]
          [ tag: compose [(tagname) (attributes)]
            insert/only tail last stack tag
      ] ]
    end-tag:
      func [tagname]
      [ stack: back tail stack
        if head? stack [exit] ; unmatched close tag
        while [tagname <> tagname-of stack/1]
          [ stack: back stack
            if head? stack [exit] ; unmatched close tag
          ]
        stack: head clear stack
      ]
    add-contents:
      func [contents]
      [ if contents
          [ insert tail last stack contents
      ] ]
    parse-document:
      func [document]
      [ stack: clear head stack
        insert/only stack parsed: make block! 10
        parse/all document document-rule
        parsed
      ]
This is extracted from other code so it is possible that something
is missing. Example:
>> parse-document "<html><head><title>Title</title></head><body>This is a<br>test</body></html>"
== [[[html] [[head] [[title] "Title"]] [[body] "This is a" [br] "test"]]]
>> parse-document read http://www.rebol.com
== [[[HTML] "^/" [[HEAD] "^/" [META HTTP-EQUIV "Content-Type" CONTENT "text/html;CHARSET=iso-8859-1"]
"^/" [META NAME "KEYWORDS" CO...
Regards,
Gabriele.
[13/16] from: g:santilli:tiscalinet:it at: 8-Oct-2003 22:04
Hi Petr,
On Wednesday, October 8, 2003, 6:57:38 PM, you wrote:
PK> Well - I am not sure my example will be any slower, except the penalty
PK> of extra function call. First, I pass it string at certain position and
First of all, FIND searches char by char too. It's just way faster
because it's native; but, if you end up searching the string n
times, you get n*m complexity (where m is the size of the string),
and this scales up so badly that in the end it gets slower than
using a PARSE loop.
Probably FIND is still faster for two or three alternatives. We'd
have to test it. When the alternatives are just strings, you could
speed up the PARSE loop using a charset, and I have the feeling
that PARSE is as fast as FIND in such a case, so the PARSE
solution would be n times faster for n alternatives.
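A hedged sketch of the charset idea, assuming the alternatives are the plain strings "HREF=" and "SRC=". The words `other` and `targets` are hypothetical names, not from the thread. `other` matches any character that cannot start either keyword, so `some other` consumes an uninteresting run in one step and the keyword alternatives are only tried at candidate positions. Whether this really matches FIND for speed would still have to be measured:

```rebol
REBOL []

myText: {<A HREF="#section1"><IMG SRC="foobar.gif"><A HREF="#section2">}
targets: copy []    ; hypothetical accumulator

; every character except the possible first letters of the keywords
; (both cases are listed so the sketch does not rely on case folding)
other: complement charset "HhSs"

parse/all myText [
    any [
        some other      ; skip a whole run that cannot start a keyword
        | "HREF=" copy target to ">" (append targets target)
        | "SRC=" copy target to ">" (append targets target)
        | skip          ; an H or S that did not start a keyword
    ]
]

probe targets
```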
PK> Yes, I can imagine it, really. The problem is (at least for me), that I
PK> am able to understand such grammar once someone creates it, but am not
PK> able to come up with it to solve problem at hand. Will you blame us
PK> little bit underskilled rebol programmers now? :-)
Not at all, but you are underestimating yourself. ;-)
PK> Sounds interesting. I am just curious, if e.g. html only (not trying to
PK> complicate things with java-script for now :-) browser would be possible
PK> with Rebol? IIRC Python has web browser. Just curious.
The problem for a web browser is not HTML parsing, it's rendering.
In my dream-future, I will finish the PDF Maker 2 and then write a
HTML2PDF translator. Rendering in View would be possible too, but
I'd like RT to offer us some kind of native rich text handling
first... you see, I'm too lazy to do all of that myself. ;-)
Who needs a REBOL web browser? I'd much rather have an email
client.
Regards,
Gabriele.
[14/16] from: greggirwin:mindspring at: 8-Oct-2003 12:00
Hi Petr,
PK> Yes, I can imagine it, really. The problem is (at least for me), that I
PK> am able to understand such grammar once someone creates it, but am not
PK> able to come up with it to solve problem at hand. Will you blame us
PK> little bit underskilled rebol programmers now? :-)
It's often a challenge for me as well, but I think it's because of
what Gabriele said; I don't think in the right terms. Once I do that,
it seems to be much easier. The problem, though, isn't with REBOL or
PARSE, it has to do with grammar design, which most of us don't have
much (or any) experience with.
-- Gregg
[15/16] from: greggirwin:mindspring at: 8-Oct-2003 12:03
Hi Patrick,
pàlp> I'd like to parse a string searching for two things at the same time.
pàlp> it seems to me that this is impossible.
...
pàlp> parse myText [
pàlp> any [ thru "HREF=" copy target to ">" (print target) |
pàlp> thru "SRC=" copy target to ">" (print target)
pàlp> ] ; any
pàlp> ] ; parse
I'm pretty sure this same thing came up not too long ago on the list.
See if rebol.net/list has it, or if you've been around for at least a
couple months, you should have it too (the solution that is). If you
can't find it, let me know and I'll see if I can dig it up.
The issue has to do with wanting the THRU rule to be smarter than it
is. PARSE doesn't do backtracking, so it will keep going forward
until it finds the next occurrence of the first rule you give it,
which isn't what you want, but it isn't wrong either. :)
-- Gregg
[16/16] from: robert:muench:robertmuench at: 9-Oct-2003 11:07
On Wed, 8 Oct 2003 12:10:42 +0200, patrick à la poste
<[patrick--philipot--laposte--net]> wrote:
> myText: {<A HREF="#section1"><IMG SRC="foobar.gif"><A HREF="#section1">}
Hi, one other trick besides doing by-hand backtracking (which is very
powerful) is to define more than one rule set and use parse several times.
Why try to write just one rule set at all? No one tries to solve a programming
problem with one function.
So, what could be done:
1. We could parse for < and > and copy everything from < up to >.
2. The copied string can then be parsed again with another rule set.
    parse myText [ some [
        to "<" copy sub-parse to ">" ( parse sub-parse [
            ; THRU is needed: sub-parse starts at "<", not at the attribute
            thru "HREF=" (print "href")
            | thru "SRC=" (print "src")
        ])
      ]
    ]
What needs to be remembered is that a rule which uses | only hits once. The
first alternative that makes it to the end will terminate further evaluation.
The logic is clear: the rule did its job, why continue?
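A small sketch of that behaviour, using a made-up input string and a hypothetical `hits` block (neither is from the thread). Both alternatives could match at the start of "abc", but only the first one is recorded, because the alternation is satisfied as soon as one branch succeeds:

```rebol
REBOL []

hits: copy []   ; hypothetical log of which branch fired

parse/all "abc" [
    ; "a" succeeds first, so the "ab" branch is never tried
    ["a" (append hits 'first) | "ab" (append hits 'second)]
    to end
]

probe hits  ; prints [first]
```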
While doing make-doc-pro I have used this approach at several places,
where parse rules would get very complicated otherwise.
--
Robert M. Münch
Management & IT Freelancer
Mobile: +49 (177) 245 2802
http://www.robertmuench.de
Notes
- Quoted lines have been omitted from some messages.