Parse limitation ?

[1/16] from: patrick::philipot::laposte::net at: 8-Oct-2003 12:10

Hi List,

I'd like to parse a string searching for two things at the same time. It seems to me that this is impossible.

For example, a text from which I want to extract the HREF and the SRC target.

    myText: {<A HREF="#section1"><IMG SRC="foobar.gif"><A HREF="#section1">}

    parse myText [
        any [
            thru "HREF=" copy target to ">" (print target) |
            thru "SRC=" copy target to ">" (print target)
        ] ; any
    ] ; parse

    #section1
    #section1

    parse myText [
        any [
            thru "SRC=" copy target to ">" (print target) |
            thru "HREF=" copy target to ">" (print target)
        ] ; any
    ] ; parse

    foobar.gif
    #section1

The result is different depending on which rule comes first. The only way I see as a workaround is to parse the text twice. Is there a better (smarter) way?

Regards
Patrick

[2/16] from: ingo:2b1 at: 8-Oct-2003 12:50

Hi Patrick,

patrick à la poste wrote:

> Hi List,
>
> I'd like to parse a string searching for two things at the same time.
> it seems to me that this is impossible.

One trick is, to find something that is equal between the two strings, and work from there ...

    REBOL []

    myText: {<A HREF="#section1"><IMG SRC="foobar.gif"><A HREF="#section2">}

    parse/all myText [
        any [
            to "=" here: (there: at here -4) :there [
                [ "HREF=" | " SRC=" ] copy target to ">" (print target) |
                thru "="
            ]
        ]
    ] ; parse

In this example I used the "=" which is common to both strings, checked whether what I have _before_ this sign is one of the two strings I'm interested in, and then start to copy, or just go thru the "=" to start again ...

I hope that helps,

Ingo

[3/16] from: petr:krenzelok:trz:cz at: 8-Oct-2003 14:06

patrick à la poste wrote:

>Hi List,
>I'd like to parse a string searching for two things at the same time.

<<quoted lines omitted: 16>>

>"#section1"
>The result is different depending which rule comes first. The only way I see as a workaround is to parse the text twice. Is there a better (smarter) way?

I would just like to point out, that 'first directive or to/thru [a | b | c] was proposed for parse enhancement some time ago, but then some parse gurus (e.g. Gabriele) admitted, that parse would have to work other way internally and that it is not easily achievable (am I right, Gabriele?)

OTOH - your example is just one of those which we often enough meet in real life, but have no easy/elegant solution for, at least not for novice being able to solve it ....

-pekr-

[4/16] from: lmecir:mbox:vol:cz at: 8-Oct-2003 14:23

Hi Pat,

----- Original Message -----
From: "patrick à la poste"

> Hi List,
> I'd like to parse a string searching for two things at the same time.

<<quoted lines omitted: 18>>

> Regards
> Patrick

This is possible with PARSE. You can use my parse enhancements e.g. Have a look at:

http://www.fm.vslib.cz/~ladislav/rebol/parseen.r

Ladislav

[5/16] from: g:santilli:tiscalinet:it at: 8-Oct-2003 16:03

Hi Petr,

On Wednesday, October 8, 2003, 2:06:58 PM, you wrote:

PK> I would just like to point out, that 'first directive or to/thru [a | b
PK> | c] was proposed for parse enhancement some time ago, but then some
PK> parse gurus (e.g. Gabriele) admitted, that parse would have to work
PK> other way internally and that it is not easily achievable (am I right,
PK> Gabriele?)

The point is, that internally PARSE would be forced to do the equivalent of:

    [any [a | b | c | skip]]

so even if it could be a bit faster than the above I don't think it would be of great help. More readable, maybe... so it's something I could add to compile-rules, if I get some time to work on it.

In this particular case, I wouldn't use this construct at all, since it's much better to have a more complete grammar (that can make distinction between href= in a tag and outside of a tag etc.), IMHO.

Regards,
   Gabriele.
--
Gabriele Santilli <[g--santilli--tiscalinet--it]> -- REBOL Programmer
Amiga Group Italia sez. L'Aquila --- SOON: http://www.rebol.it/
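The `[any [a | b | c | skip]]` form mentioned above can be applied directly to Patrick's example. The following is only a minimal sketch, assuming REBOL 2 `parse` semantics; the `results` block is a hypothetical name introduced here for illustration and does not appear in the thread.

```rebol
REBOL []

myText: {<A HREF="#section1"><IMG SRC="foobar.gif"><A HREF="#section2">}

; Hypothetical collector, so the matches can be inspected after parsing.
results: copy []

parse/all myText [
    any [
        ; Try each alternative at the current position ...
        "HREF=" copy target to ">" (append results target) |
        "SRC="  copy target to ">" (append results target) |
        ; ... and advance by one character when neither matches.
        skip
    ]
]

foreach item results [print item]
```

Because no alternative uses THRU, neither rule can jump over an occurrence of the other, so the targets should come out in document order regardless of which alternative is listed first. Each copied target still includes its surrounding double quotes, since the copy runs up to ">".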

[6/16] from: petr:krenzelok:trz:cz at: 8-Oct-2003 16:31

Petr Krenzelok wrote:

>patrick à la poste wrote:
>>Hi List,

<<quoted lines omitted: 38>>

>real life, but have no easy/elegant solution for, at least not for
>novice being able to solve it ....

Well, I just played a bit and following hack appeared in my notepad :-)

    reposition: func [str blk /local res tmp][
        res: copy []
        foreach item blk [
            if not none? tmp: find str item [append res reduce [index? tmp item]]
        ]
        sort/skip res 2
        either empty? res [str][at str (first res) - (index? str) + 1]
    ]

    myText: {
    <A HREF="#section1"><IMG SRC="foobar.gif"><A HREF="#section2">
    <A HREF="#section1"><IMG SRC="foobar.gif"><A HREF="#section2">
    <A HREF="#section1"><IMG SRC="foobar.gif"><A HREF="#section2">
    <A HREF="#section1"><IMG SRC="foobar.gif"><A HREF="#section2">
    }

    src-rule: ["SRC=" copy target to ">" (print target)]
    href-rule: ["HREF=" copy target to ">" (print target)]

    parse/all mytext [
        any [
            mark: (mark: reposition mark ["HREF=" "SRC="]) :mark
            [src-rule | href-rule]
        ]
        to end
    ]

You can call the 'reposition function with a block containing any number of options, and it decides which one comes first. It will just do a plain search for each option, record its position, sort the resulting block and "reposition" your parse input string so that the parser pointer points to the first of the options, so you can directly apply "HREF=", "SRC=" etc. and you can be sure one of them is there ...

Well, I don't know how robust it is, but I tried it with mytext: read http://www.rebol.com and it seems it needs further tuning :-) .... following might get you better results:

    mytext: read http://www.rebol.com

    src-rule: [{SRC="} copy target to {"} (print target)]
    href-rule: [{HREF="} copy target to {"} (print target)]

    parse/all mytext [
        any [
            mark: (mark: reposition mark [{HREF="} {SRC="}]) :mark
            [src-rule | href-rule]
        ]
        to end
    ]

Anyway ... you've got some inspiration ...

-pekr-

[7/16] from: petr:krenzelok:trz:cz at: 8-Oct-2003 16:39

Gabriele Santilli wrote:

>Hi Petr,
>On Wednesday, October 8, 2003, 2:06:58 PM, you wrote:

<<quoted lines omitted: 6>>

>equivalent of:
>    [any [a | b | c | skip]]

ah, but that is char-by-char execution ...

>so even if it could be a bit faster than the above I don't think
>it would be of great help. More readable, maybe... so it's
>something I could add to compile-rules, if I get some time to work
>on it.
>
>In this particular case, I wouldn't use this construct at all,
>since it's much better to have a more complete grammar
>

yes, exactly - but I think such grammar to simply achieve what was requested will not be easy for novices. The tool (REBOL) should support our thinking pattern - and the easiest one is to "skip" "to | thru" certain string - no matter what is in between.

If someone is up-to writing complete html parser, building DOM object, then maybe we are near seeing rebol based web-browser? :-)

-pekr-

[8/16] from: g:santilli:tiscalinet:it at: 8-Oct-2003 17:42

Hi Petr,

On Wednesday, October 8, 2003, 4:39:01 PM, you wrote:

PK> ah, but that is char-by-char execution ...

Do you know any other way to do that? (Your example is using FIND multiple times, and in a big string that would be many times slower.)

PK> yes, exactly - but I think such grammar to simply achieve what was
PK> requested will not be easy for novices. The tool (REBOL) should support
PK> our thinking pattern - and the easiest one is to "skip" "to | thru"
PK> certain string - no matter what is in between.

I think that it is better to think of the problem in a different way, because it allows you to understand things much better. If you switch to think about grammars instead of patterns you'll find out that your problems get simpler, not more complicated. IMHO.

PK> If someone is up-to writing complete html parser, building DOM object,
PK> then maybe we are near seeing rebol based web-browser? :-)

Well, the 74-lines [X]HTML parser built into Temple is far from being complete, but has been able to parse all the HTML files I've fed into it until now. I don't think this is so much complicated, you just need to avoid that brain-dead way of doing things that seems to pervade the world. ;-)

Regards,
   Gabriele.
--
Gabriele Santilli <[g--santilli--tiscalinet--it]> -- REBOL Programmer
Amiga Group Italia sez. L'Aquila --- SOON: http://www.rebol.it/

[9/16] from: maximo:meteorstudios at: 8-Oct-2003 12:29

> -----Original Message-----
> From: Gabriele Santilli [mailto:[g--santilli--tiscalinet--it]]

<<quoted lines omitted: 5>>

> you switch to think about grammars instead of patterns you'll find
> out that your problems get simpler, not more complicated. IMHO.

can you give a short example of a grammar that would extract the text from

    <tag! tag content <subtag! its content?> paragraph infocontent end>

and returns a block such as:

    [
        tag! [
            "tag content"
            subtag! [
                "its content?"
            ]
            p [
                "paragraph info"
            ]
            "content end"
        ]
    ]

I have no idea How I would approach this! this could be a nice tutorial for us "less gifted" parsers.

-MAx

[10/16] from: petr:krenzelok:trz:cz at: 8-Oct-2003 18:57

Gabriele Santilli wrote:

>Hi Petr,
>
>On Wednesday, October 8, 2003, 4:39:01 PM, you wrote:
>
>PK> ah, but that is char-by-char execution ...
>
>Do you know any other way to do that? (Your example is using FIND
>multiple times, and in a big string that would be many times
>slower.)
>

Well - I am not sure my example will be any slower, except the penalty of extra function call. First, I pass it string at certain position and it then returns the string at a position where further parse rule a) or b) can be applied directly. Second - it is two direct searches in the string and a decision about which index came first, vs. probably recursive char-by-char rules (whose penalty I am not able to estimate :-)

>PK> yes, exactly - but I think such grammar to simply achieve what was
>PK> requested will not be easy for novices. The tool (REBOL) should support

<<quoted lines omitted: 4>>

>you switch to think about grammars instead of patterns you'll find
>out that your problems get simpler, not more complicated. IMHO.

Yes, I can imagine it, really. The problem is (at least for me), that I am able to understand such grammar once someone creates it, but am not able to come up with it to solve problem at hand. Will you blame us little bit underskilled rebol programmers now? :-)

>PK> If someone is up-to writing complete html parser, building DOM object,
>PK> then maybe we are near seeing rebol based web-browser? :-)

<<quoted lines omitted: 3>>

>you just need to avoid that brain-dead way of doing things that
>seems to pervade the world. ;-)

Sounds interesting. I am just curious, if e.g. html only (not trying to complicate things with java-script for now :-) browser would be possible with Rebol? IIRC Python has web browser. Just curious. -pekr-

[11/16] from: patrick:philipot:laposte at: 8-Oct-2003 21:01

Hello Ingo,

Wednesday, October 8, 2003, 12:50:20 PM, you wrote:

IH> Hi Patrick,

IH> patrick à la poste wrote:

>> Hi List,
>>
>> I'd like to parse a string searching for two things at the same time.
>> it seems to me that this is impossible.

IH> One trick is, to find something that is equal between the two strings, and
IH> work from there ...

IH> REBOL []

IH> myText: {<A HREF="#section1"><IMG SRC="foobar.gif"><A HREF="#section2">}

IH> parse/all myText [
IH>     any [
IH>         to "=" here: (there: at here -4) :there [
IH>             [ "HREF=" | " SRC=" ] copy target to ">" (print target) |
IH>             thru "="
IH>         ]
IH>     ]
IH> ] ; parse

IH> In this example I used the "=" which is common to both strings, checked
IH> whether what I have _before_ this sign is one of the two strings I'm
IH> interested in, and then start to copy, or just go thru the "=" to start
IH> again ...

IH> I hope that helps,

IH> Ingo

This is brilliant! Thank you Ingo.

--
Best regards,
Patrick

[12/16] from: g:santilli:tiscalinet:it at: 8-Oct-2003 21:47

Hi Maxim,

On Wednesday, October 8, 2003, 6:29:03 PM, you wrote:

MOA> can you give a short example of a grammar that would extract the text from
MOA> <tag! tag content <subtag! its content?> paragraph infocontent end>
MOA> and returns a block such as:

[...]

Well, nested tags are not valid HTML so this does not handle them, but maybe it could be of some inspiration. (Sorry for Joel-style indentation. ;-)

    tag-rule: [
        "<" m1: [
            "/" word thru ">" (end-tag to word! word-res)
            |
            "!--" thru "-->" m2: (add-contents to tag! copy/part m1 back m2)
            |
            "!DOCTYPE" thru ">" m2: (add-contents to tag! copy/part m1 back m2)
            |
            "?xml" thru "?>" m2: (add-contents to tag! copy/part m1 back m2)
            |
            word any space (clear attributes) any attribute
            ["/" (content: no) | none (content: yes)] ">"
            (open-tag to word! word-res attributes content)
        ]
    ]
    chars: complement charset {<>"'= ^/^-/}
    value-chars: union chars charset "/"
    word: [copy word-res some chars]
    space: charset { ^/^-}
    attributes: [ ]
    attribute: [
        (wrs: word-res)
        word any space [
            "=" any space [
                {"} copy value any dquoted-chars {"}
                |
                {'} copy value any squoted-chars {'}
                |
                copy value any value-chars
            ] any space
            |
            (value: yes)
        ]
        (insert insert tail attributes to word! word-res any [value copy ""] word-res: wrs)
    ]
    dquoted-chars: complement charset {"}
    squoted-chars: complement charset {'}
    document-rule: [
        some [
            copy contents to "<" (add-contents contents) tag-rule
            |
            copy contents to end (add-contents contents) break
        ]
    ]
    stack: [ ]
    parsed: none
    no-content-tags: [
        basefont br area link img param hr input col frame base meta]
    open-tag: func [tagname attributes content? /local tag] [
        if find no-content-tags tagname [content?: no]
        either content? [
            tag: compose/deep [[(tagname) (attributes)]]
            insert/only tail last stack tag
            insert/only tail stack tag
        ] [
            tag: compose [(tagname) (attributes)]
            insert/only tail last stack tag
        ]
    ]
    end-tag: func [tagname] [
        stack: back tail stack
        if head? stack [exit] ; unmatched close tag
        while [tagname <> tagname-of stack/1] [
            stack: back stack
            if head? stack [exit] ; unmatched close tag
        ]
        stack: head clear stack
    ]
    add-contents: func [contents] [
        if contents [
            insert tail last stack contents
        ]
    ]
    parse-document: func [document] [
        stack: clear head stack
        insert/only stack parsed: make block! 10
        parse/all document document-rule
        parsed
    ]

This is extracted from other code so it is possible that something is missing. Example:

>> parse-document "<html><head><title>Title</title></head><body>This is a<br>test</body></html>"

== [[[html] [[head] [[title] "Title"]] [[body] "This is a" [br] "test"]]]

>> parse-document read http://www.rebol.com

== [[[HTML] "^/" [[HEAD] "^/" [META HTTP-EQUIV "Content-Type" CONTENT "text/html;CHARSET=iso-8859-1"] "^/" [META NAME "KEYWORDS" CO...

Regards,
   Gabriele.
--
Gabriele Santilli <[g--santilli--tiscalinet--it]> -- REBOL Programmer
Amiga Group Italia sez. L'Aquila --- SOON: http://www.rebol.it/

[13/16] from: g:santilli:tiscalinet:it at: 8-Oct-2003 22:04

Hi Petr,

On Wednesday, October 8, 2003, 6:57:38 PM, you wrote:

PK> Well - I am not sure my example will be any slower, except the penalty
PK> of extra function call. First, I pass it string at certain position and

First of all, FIND searches char by char too. It's just way faster because it's native; but, if you end up searching the string n times, you get n*m complexity (where m is the size of the string), and this scales up so badly that in the end it gets slower than using a PARSE loop. Probably FIND is still faster for two or three alternatives. We'd have to test it.

When the alternatives are just strings, you could speed up the PARSE loop using a charset, and I have the feeling that PARSE is as fast as FIND in such a case, so the PARSE solution would be n times faster for n alternatives.

PK> Yes, I can imagine it, really. The problem is (at least for me), that I
PK> am able to understand such grammar once someone creates it, but am not
PK> able to come up with it to solve problem at hand. Will you blame us
PK> little bit underskilled rebol programmers now? :-)

Not at all, but you are underestimating yourself. ;-)

PK> Sounds interesting. I am just curious, if e.g. html only (not trying to
PK> complicate things with java-script for now :-) browser would be possible
PK> with Rebol? IIRC Python has web browser. Just curious.

The problem for a web browser is not HTML parsing, it's rendering. In my dream-future, I will finish the PDF Maker 2 and then write a HTML2PDF translator. Rendering in View would be possible too, but I'd like RT to offer us some kind of native rich text handling first... you see, I'm too lazy to do all of that myself. ;-)

Who needs a REBOL web browser? I'd like an email client much better.

Regards,
   Gabriele.
--
Gabriele Santilli <[g--santilli--tiscalinet--it]> -- REBOL Programmer
Amiga Group Italia sez. L'Aquila --- SOON: http://www.rebol.it/
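One possible reading of the charset speed-up mentioned above is sketched below; it is an illustration, not code from the thread. The names `start-chars`, `non-start` and `results` are assumptions introduced here. The idea is that after a failed match the loop skips one character and then every following character that cannot begin any of the alternatives, instead of retrying all alternatives at each position.

```rebol
REBOL []

myText: {<A HREF="#section1"><IMG SRC="foobar.gif"><A HREF="#section2">}

; Characters that can start one of the alternatives. Both cases are listed
; so the sketch does not depend on how bitsets treat letter case.
start-chars: charset "HhSs"
non-start: complement start-chars

; Hypothetical collector for the matched targets.
results: copy []

parse/all myText [
    any [
        "HREF=" copy target to ">" (append results target) |
        "SRC="  copy target to ">" (append results target) |
        ; No alternative matched here: step over this character and then
        ; over the whole run of characters that cannot start a match.
        skip any non-start
    ]
]

foreach item results [print item]
```

Whether this is actually faster than repeated FIND calls would, as noted in the message, have to be measured.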

[14/16] from: greggirwin:mindspring at: 8-Oct-2003 12:00

Hi Petr,

PK> Yes, I can imagine it, really. The problem is (at least for me), that I
PK> am able to understand such grammar once someone creates it, but am not
PK> able to come up with it to solve problem at hand. Will you blame us
PK> little bit underskilled rebol programmers now? :-)

It's often a challenge for me as well, but I think it's because of what Gabriele said; I don't think in the right terms. Once I do that, it seems to be much easier. The problem, though, isn't with REBOL or PARSE, it has to do with grammar design, which most of us don't have much (or any) experience with.

-- Gregg

[15/16] from: greggirwin:mindspring at: 8-Oct-2003 12:03

Hi Patrick,

pàlp> I'd like to parse a string searching for two things at the same time.
pàlp> it seems to me that this is impossible.
...
pàlp> parse myText [
pàlp>     any [ thru "HREF=" copy target to ">" (print target) |
pàlp>           thru "SRC=" copy target to ">" (print target)
pàlp>     ] ; any
pàlp> ] ; parse

I'm pretty sure this same thing came up not too long ago on the list. See if rebol.net/list has it, or if you've been around for at least a couple months, you should have it too (the solution that is). If you can't find it, let me know and I'll see if I can dig it up.

The issue has to do with wanting the THRU rule to be smarter than it is. PARSE doesn't do backtracking, so it will keep going forward until it finds the next occurrence of the first rule you give it, which isn't what you want, but it isn't wrong either. :)

-- Gregg
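The THRU behaviour described above can be observed in isolation. This is a small sketch under the assumption of REBOL 2 `parse` semantics; `mark` and `first-href` are names chosen here for illustration.

```rebol
REBOL []

myText: {<A HREF="#section1"><IMG SRC="foobar.gif"><A HREF="#section1">}

; Run only the first alternative of Patrick's second example and record
; the input position where THRU leaves the parser.
parse/all myText [thru "SRC=" mark:]

; Position of the first HREF= in the same string, for comparison.
first-href: find myText "HREF="

; MARK lies beyond the first HREF=, so that occurrence can no longer match.
print ["first HREF= at" index? first-href "- position after THRU:" index? mark]
```

Since the position after `thru "SRC="` is already past the first `HREF=`, the later alternative only ever sees the remaining input, which is why the result depends on the order of the alternatives.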

[16/16] from: robert:muench:robertmuench at: 9-Oct-2003 11:07

On Wed, 8 Oct 2003 12:10:42 +0200, patrick à la poste <[patrick--philipot--laposte--net]> wrote:

> myText: {<A HREF="#section1"><IMG SRC="foobar.gif"><A HREF="#section1">}

Hi, one other trick beside doing by-hand backtracking (which is very powerful) is to define more than one rule set and use parse several times. Why try to write one rule set at all? No one tries to solve a programming problem with one function. So, what could be done:

1. We could parse for < and > and copy all we have.
2. The copied string can then be parsed again with another rule set.

    parse myText [
        some [
            to "<" copy sub-parse to ">" (
                parse sub-parse [
                    "HREF=" (print "href") |
                    "SRC=" (print "src")
                ]
            )
        ]
    ]

What needs to be remembered is that a rule which uses | only hits once. The first part that makes it to the end will terminate further evaluation. The logic is clear, the rule did its job, why continue?

While doing make-doc-pro I have used this approach at several places, where parse rules would get very complicated otherwise.

--
Robert M. Münch
Management & IT Freelancer
Mobile: +49 (177) 245 2802
http://www.robertmuench.de
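A minimal sketch of the two-pass idea that also captures the attribute values, assuming REBOL 2 `parse` semantics. The names `tag-text`, `tag-rule` and `results` are hypothetical and introduced only for this illustration. The inner rules use THRU because each copied piece of text begins with the tag name rather than with the attribute itself.

```rebol
REBOL []

myText: {<A HREF="#section1"><IMG SRC="foobar.gif"><A HREF="#section2">}

; Hypothetical collector for the extracted targets.
results: copy []

; Inner rule set: applied to the text of one tag at a time.
tag-rule: [
    thru "HREF=" copy target to end (append results target) |
    thru "SRC="  copy target to end (append results target)
]

; Outer rule set: isolate the text between "<" and ">" and hand it
; to the inner rule set.
parse/all myText [
    some [
        thru "<" copy tag-text to ">" (parse/all tag-text tag-rule)
    ]
]

foreach item results [print item]
```

Because each inner parse sees only a single tag, the order of the two alternatives no longer lets one rule jump over a match belonging to the other, at least as long as a tag carries only one of the two attributes.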

Notes

Quoted lines have been omitted from some messages.