Mailing List Archive: 49091 messages

A little parse help

 [1/19] from: syke:amigaextreme at: 20-Aug-2001 9:31


Hi,

I'm kinda tired today ;-)

If I for example have this text:

    Quick brown fox jumps !image-brown.gif over the fence

and I want to parse out the image file, I'll just do

    parse text [ any [ thru "!" copy wanted-text to " " ]]

But what do I do if the line ends after the .gif? Basically, I want parse to copy the wanted-text until it either finds a " " or a newline.

/Regards Stefan Falk
www.amigaextreme.com

 [2/19] from: brett:codeconscious at: 20-Aug-2001 17:39


Try

    parse text [ any [ thru "!" copy wanted-text [to " " | to end]]]

Brett.
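The suggestion above can be tried on the problem case, where the text ends right after the file name. This is a minimal sketch, assuming REBOL 2 parse semantics; the sample string is made up for illustration, and parse/all is used so that spaces are treated as ordinary characters.

```rebol
; Sample text that ends right after the image name (hypothetical input)
text: "Quick brown fox jumps !image-brown.gif"

; [to " " | to end] tries the space first and falls back to the end of input
parse/all text [any [thru "!" copy wanted-text [to " " | to end]]]

print wanted-text   ; image-brown.gif
```

With the original sentence ("... !image-brown.gif over the fence") the first alternative matches instead, and the same file name is copied.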

 [3/19] from: petr::krenzelok::trz::cz at: 20-Aug-2001 9:42


Stefan Falk wrote:
> Hi,
> I'm kinda tired today ;-)
>
> If I for example have this text:
>
> Quick brown fox jumps !image-brown.gif over the fence
>
> and I want to parse out the image file, I'll just do
> parse text [ any [ thru "!" copy wanted-text to " " ]]

1) I think that even your parse rule above is never met. 'parse, by default, omits spaces, so you would probably be better off with parse/all here.

2) I don't know your application, but wouldn't you be better off with 'find? e.g.

    >> start: find/any str "!*.???"
    == "!image-brown.gif over the fence"
    >> end: find start " "
    == " over the fence"
    >> res: copy/part start end
    == "!image-brown.gif"
    >> remove res
    == "image-brown.gif"

If your string is long, you can reassign its position in a loop, e.g. "str: end", and continue searching for another image ...

Maybe not so elegant, but ...

-pekr-

 [4/19] from: syke:amigaextreme at: 20-Aug-2001 22:27


Hi again,

I get a really strange behaviour from parse when I try to do this (it's also an explanation of what I'm trying to do).

    if find content "http://" [
        parse/all content [
            any [
                to "http://" copy URL to "<br>" (
                    link: rejoin [{<a href="} URL {">} URL {</a>}]
                    replace content URL link
                )
            ]
        ]
    ]

When I try to do this, Rebol crashes, the processor on the web server hits 100% and the only solution is to stop the webserver and then start it again.

However, if I do like this:

    string: "Test"
    if find content "http://" [
        parse/all content [
            any [
                to "http://" copy URL to "<br>" (
                    link: rejoin [{<a href="} URL {">} string {</a>}]
                    replace content URL link
                )
            ]
        ]
    ]

It works. It seems as if using URL two times within Rejoin will cause Rebol to hang. Any idea as to what is causing this?

/Regards Stefan Falk - www.amigaextreme.com

 [5/19] from: jelinem1:nationwide at: 20-Aug-2001 16:25


Allow me to make an educated guess as to what's happening in the absence of data to test my theory. I've done this sort of thing a long time ago.

I doubt that the multiple usage of URL within a 'rejoin is the culprit. When 'parse hangs and grabs the CPU, it is usually an indication that you have an infinite parse loop. The crux here is where the 'parse cursor is in the content string.

>> to "http://"

Places the cursor at the first element of the first match of this string.

>> copy URL to "<br>"

Moves the cursor through the url text to the first element of "<br>". Now we loop:

>> to "http://"

Places the cursor at the first element of the next match of this string. But wait! Where exactly did we find the "next occurrence" of this string? When you changed the 'content string you did NOT affect the 'parse cursor. In other words, the 'parse cursor has the same index? relative to the beginning of the string as it did before you made the 'replace. SO... the cursor is now positioned WITHIN the 'link text and effectively points shortly before the second URL that you replaced in 'content! Clear as mud?

As a solution, after you finish the replacement you will want to move the 'parse cursor: (length? link) - (length? URL). I think the 'parse word 'skip will do this.

- Michael Jelinek

 [6/19] from: g:santilli:tiscalinet:it at: 21-Aug-2001 19:20


Hello Stefan!

On 20-Ago-01, you wrote:

SF> if find content "http://" [
SF>     parse/all content [
SF>         any [
SF>             to "http://" copy URL to "<br>" (
SF>                 link: rejoin [{<a href="} URL {">} URL {</a>}]
SF>                 replace content URL link
SF>             )
SF>         ]
SF>     ]
SF> ]

Maybe this will work better (not tested):

    parse/all content [
        any [
            to "http://" mark1: to "<br>" mark2: (
                link: rejoin [{<a href="} URL: copy/part mark1 mark2 {">} URL {</a>}]
                mark1: change/part mark1 link mark2
            ) :mark1
        ]
    ]

Regards,
Gabriele.
--
Gabriele Santilli <[giesse--writeme--com]> - Amigan - REBOL programmer
Amiga Group Italia sez. L'Aquila -- http://www.amyresource.it/AGI/
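For readers who want to try the mark-and-reset approach on concrete data, here is a minimal sketch, assuming REBOL 2 semantics. The sample content string is invented for illustration, and the URL copy is pulled out of the rejoin block for readability; otherwise it follows the rule posted above.

```rebol
; Hypothetical sample content with two bare URLs, each ending at <br>
content: copy "See http://www.rebol.com<br>and http://www.rebol.org<br>"

parse/all content [
    any [
        to "http://" mark1: to "<br>" mark2: (
            url: copy/part mark1 mark2
            link: rejoin [{<a href="} url {">} url {</a>}]
            ; change/part returns the position just after the inserted link
            mark1: change/part mark1 link mark2
        ) :mark1   ; continue parsing after the link, not inside it
    ]
]

print content
```

Because the parse position is moved past the inserted link each time, the next `to "http://"` can only find the following URL, so the loop terminates.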

 [7/19] from: syke:amigaextreme at: 24-Aug-2001 20:21


Hi, thanks, this worked! just two questions, what's the last :mark1 there for? and how do I change it to parse until <br> or a space " "? ("<br>" | " ") or (to "<br>" | to " ") doesn't seem to work.. /Regards Stefan

 [8/19] from: jelinem1:nationwide at: 24-Aug-2001 13:34


> what's the last :mark1 there for?
Sets the parse cursor location.
> and how do I change it to parse until <br> or a space " "?
Parse rules will not do this, if I understand your intent correctly. Parse rules WILL look until <br> or space " ", but will not stop at whichever comes first. Parse first looks for <br>: if parse never finds a <br> (up to end of data) then it will look for a space " "; otherwise it stops at the next <br> regardless of spaces.
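The behaviour described above can be seen in a small sketch, assuming REBOL 2 semantics; the input string and the word `found` are made up for illustration.

```rebol
; The space comes before the <br>, but the first alternative is tried first
parse/all "a b<br>c" [copy found [to "<br>" | to " "] to end]

print found   ; a b
```

The copied text runs past the space up to the <br>, because `to "<br>"` succeeds and the second alternative is never tried.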

 [9/19] from: petr:krenzelok:trz:cz at: 24-Aug-2001 21:01


----- Original Message -----
From: <[JELINEM1--nationwide--com]>
To: <[rebol-list--rebol--com]>
Sent: Friday, August 24, 2001 8:34 PM
Subject: [REBOL] Re: A little parse help

> > what's the last :mark1 there for?
> Sets the parse cursor location.
<<quoted lines omitted: 4>>
> to end of data) then it will look for a space " ", otherwise stopping at
> the next <br> regardless of spaces.

FIRST - a looooong-time requested feature. Carl once agreed it would be useful, but there are probably other priorities for RT to solve now. However, being able to parse to the first of [a | b | c] is probably the most-missed feature in parsing ..

-pekr-

 [10/19] from: max:ordigraphe at: 24-Aug-2001 15:25


> FIRST - looooong time requested feature. Carl once agreed it would be
> useful, but there are probably other priorities for RT to solve now.
> However - being able to parse first of [a | b | c] is probably the most
> missing feature re parsing ..

Carl even sent me a mail saying it IS in the plans... but he sent me that just about one year ago!

I have been implementing my own document language. The main difference is that it is a natural language and the lack of this parsing feature is making my work extremely complicated. That is because I do not want to impose a strict format structure... so I do not know if the document writer is going to end his line right away or if he wishes to continue on the same line or if he'll put a space or two, or put a space just before the end of the line...

Add to this the fact that the keywords themselves are plain English (or any other language, in fact :-) and ARE allowed within the content itself and it makes the parsing a little bit harder still!

This parsing feature alone would have cut my development time in half at least!

But alas nothing is perfect, life WOULD be dull indeed! ;-)

Note to RT: Just one tag (like any, to, some, etc) called "next" would be easy to include in the parsing engine, no?

-Max

 [11/19] from: syke:amigaextreme at: 24-Aug-2001 23:37


Hi, if it doesn't work, cheat! I just replace all "<br>" with " <br>" (added a space in front of 'em) and voila! Parse til the space and everything works fine and dandy! :-) Thanks for all the help guys! /Regards Stefan

 [12/19] from: g:santilli:tiscalinet:it at: 25-Aug-2001 14:53


Hello Stefan!

On 24-Ago-01, you wrote:

SF> thanks, this worked!

I'm happy it was useful.

SF> just two questions,
SF> what's the last :mark1 there for?

To reset the current position for the parser. It's better to always do that when you modify the string you are parsing. In this case, it is even necessary unless you want to loop forever (as others explained in this thread). mark1 is set by CHANGE just after the change (i.e. after the </a>); this way PARSE will continue its work from there.

SF> and how do I change it to parse until <br> or a space " "?

This is a little more tricky. If you think it is ok to stop at " " or just "<" then you can do it this way:

    url-chars: complement charset " <"
    ...
    to "http://" mark1: some url-chars mark2: (
    ...

If you need to stop just on space and <br> and not on other tags, it gets a bit more complicated... but I think you don't need this, do you?

Regards,
Gabriele.
--
Gabriele Santilli <[giesse--writeme--com]> - Amigan - REBOL programmer
Amiga Group Italia sez. L'Aquila -- http://www.amyresource.it/AGI/
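The charset idea can be sketched as a complete rule that collects URLs ending at either a space or a "<". This is a minimal sketch assuming REBOL 2 semantics; the sample input and the `urls` block are hypothetical and only there to make the fragment above runnable.

```rebol
; Every character except space and "<" counts as part of a URL
url-chars: complement charset " <"
urls: copy []

parse/all "Go to http://www.rebol.com <--- check this<br>or http://www.rebol.org<br>" [
    any [
        to "http://" mark1: some url-chars mark2: (
            append urls copy/part mark1 mark2
        )
    ]
]

probe urls   ; ["http://www.rebol.com" "http://www.rebol.org"]
```

The first URL stops at the space and the second at the "<" of <br>, which is the "whichever comes first" behaviour that `[to "<br>" | to " "]` cannot give.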

 [13/19] from: syke:amigaextreme at: 25-Aug-2001 21:02


Hi, actually, this is exactly what I need. Basically, I want to parse an html file for all URLs in it, and the end of the URL will obviously contain a space or a <br> (the two cases that separate text). But as you might have read, I've done it by putting a space in front of every <br>, and therefore I just have to parse until a space.

/Regards Stefan Falk
www.amigaextreme.com

 [14/19] from: lmecir:mbox:vol:cz at: 26-Aug-2001 1:39


Hi,
> I have been implementing my own document language. The main difference
> is that it is a natural language and the lack of this parsing feature is
<<quoted lines omitted: 8>>
> This parsing feature alone would have cut my development time in half at
> least!
here is my solution:

    cfunc: function [
        {make a closure}
        [catch]
        spec [block!]
        body [block!]
    ] [locals in-new-context spec2 body2 i] [
        locals: copy []
        spec2: copy [[throw]]
        body2: reduce ['do 'func spec2 body]
        i: 1
        repeat item spec [
            if all [any-word? :item not set-word? :item] [
                append locals to word! :item
                append spec2 reduce [to word! :item [any-type!]]
                append body2 reduce ['get/any 'pick 'locals i]
                i: i + 1
            ]
        ]
        in-new-context: func [
            {do body with locals in new context}
            [throw]
            locals
        ] body2
        throw-on-error [
            func spec reduce [:in-new-context locals]
        ]
    ]

    a-b: cfunc [
        {Generate an A-B parse rule}
        a [block!]
        b [block!]
        /local finish
    ] [
        [
            [
                b (finish: [to end skip]) |
                (finish: a)
            ]
            finish
        ]
    ]

    comment {
        Example:
        a: [any "a" "b"]
        b: ["aa"]
        parse "ab" a-b a b
        parse "aab" a-b a b
    }

    not-rule: cfunc [
        {Generate a not A parse rule}
        a [block!]
        /local finish
    ] [
        [
            [
                a (finish: [to end skip]) |
                (finish: [])
            ]
            finish
        ]
    ]

    comment {
        Example:
        a: [any "a" "b"]
        parse "ab" not-rule a
        parse "b" not-rule a
        parse "" not-rule a
    }

    to-rule: cfunc [
        {generate a to A parse rule}
        a [block!]
        /local nxt finish
    ] [
        [
            (
                finish: [to end skip]
                nxt: [skip]
            )
            any [a (nxt: [to end skip] finish: []) nxt | nxt]
            finish
        ]
    ]

    comment {
        Example:
        space-or-br: to-rule [" " | "<br>"]
        result: ""
        parse/all "aa" [space-or-br copy result to end]
        probe result
        parse/all "a a<br>" [space-or-br copy result to end]
        probe result
        parse/all "ab<br> " [space-or-br copy result to end]
        probe result
    }

 [15/19] from: g:santilli:tiscalinet:it at: 26-Aug-2001 17:24


Hello Stefan! On 25-Ago-01, you wrote: SF> actually, this is exactly what I need. Basically, I want to SF> parse a html file for all URLs in it, and the end of the URL As I imagined... so stopping at any tag should not create problems for you... Anyway, you already have your solution. :) Regards, Gabriele. -- Gabriele Santilli <[giesse--writeme--com]> - Amigan - REBOL programmer Amiga Group Italia sez. L'Aquila -- http://www.amyresource.it/AGI/

 [16/19] from: syke:amigaextreme at: 26-Aug-2001 18:24


Yes,

a note though, a space isn't a tag, so if someone writes a URL and some text behind it, the entire URL will be the URL + the text behind it until the new line, e.g.

    http://www.rebol.com <--- Check this link<br>

would create a really weird link:

    <a href="http://www.rebol.com <--- Check this link">http://www.rebol.com <--- Check this link</a>

Try clickin' on that :-)

/Regards Stefan Falk
www.amigaextreme.com

 [17/19] from: g:santilli:tiscalinet:it at: 27-Aug-2001 19:14


Hello Stefan!

On 26-Ago-01, you wrote:

SF> Yes,
SF> a note though, a space isn't a tag, so if someone writes an

Indeed. My version stopped at a space or at any tag. Did you miss it?

SF> Would create a really weird link <a
SF> href="http://www.rebol.com <--- Check this
SF> link">http://www.rebol.com <--- Check this link</a>

This is what happens if you stop at <br> only. And this is the reason why the code I proposed stops at any tag. What does your version do with:

    <b>http://www.rebol.com/</b><br>

? :-)

Regards,
Gabriele.
--
Gabriele Santilli <[giesse--writeme--com]> - Amigan - REBOL programmer
Amiga Group Italia sez. L'Aquila -- http://www.amyresource.it/AGI/

 [18/19] from: syke:amigaextreme at: 27-Aug-2001 21:26


Hi, sorry, actually I did miss a part of your previous post. ;-) And as a sidenote, all tags except <br> and <a> tags are converted to &lt; and &gt;, so stopping at & < would be the best :)

/Regards Stefan Falk
www.amigaextreme.com

 [19/19] from: brian:hawley at: 28-Aug-2001 1:36


A little late, but... At 08:21 PM 8/24/01 +0200, Stefan Falk wrote:
>Hi, >thanks, this worked! > >just two questions, >what's the last :mark1 there for?
At every step of the parse process, there are two implicit parameters: the series that you are processing and the current position within that series. In parse rules you can assign the series (at its current position) to a word (x) by putting the set-word (x:) in the rules at a given point. You can also reset the implicit parse series (and position) to the value assigned to a word by putting the get-word (:x) at a given point in the rules.

If you are changing the series you are working on while you are parsing it, you need to make sure that parse is able to keep track of its implicit position setting. This is not a problem if you are changing the series in front of or at the implicit position, like this:

    [to "foo" x: (remove/part x 3)]

In this case, the implicit position at the point x is set is before the part of the series that is being changed, so parse is not going to get confused. However, if the implicit parse position is after the part of the series that is being changed, like this:

    [to "<foo" x: thru ">" y: (remove/part x y)]

then parse is going to get confused about its implicit position, especially if the length of the series is any different as a result of the change. To deal with this you have to reset the parse position after such changes, like this:

    [to "<foo" x: thru ">" y: (z: remove/part x y) :z]

Does that make sense?
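The reset pattern in the last rule can be exercised on a small string. This is a minimal sketch assuming REBOL 2 semantics; the input and the surrounding `any` loop are added for illustration and are not part of the original message.

```rebol
; Hypothetical input containing two <foo ...> tags to strip
html: copy "a<foo x>b<foo y>c"

parse/all html [
    any [
        ; remove/part returns the position where the removed tag used to start;
        ; :z puts the parse position there before the next iteration
        to "<foo" x: thru ">" y: (z: remove/part x y) :z
    ]
]

print html   ; abc
```

Each iteration removes one tag and resumes exactly where the tag began, so text that slid left after the removal is not skipped.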
>and how do I change it to parse until <br> or a space " "? > >("<br>" | " ") or (to "<br>" | to " ") doesn't seem to work..
This is a common problem. The general workaround to the problem of scanning until the first of a set of alternate values (a follow set) is to refactor the problem into one of scanning through the values that aren't in the set that you are scanning for.

That may sound confusing. In your case, it would be easier if you chose to scan until all html tags, not just <br>. Then you could just scan to the "<" character and use code like this:

    url-chars: complement charset {< "^(tab)^(newline)}
    rule: [to "http://" x: some url-chars y: (do something)]

Note that url-chars is the complement of the set of chars in that string, or all chars _not_ in the follow set.

If you can't distinguish your follow set by looking at one character at a time (say you only want to go to <br> tags but skip other tags) then you have two solutions. You may be able to extend the previous charset solution with more charsets that exclude each of the rest of the letters in the values of the follow set - awkward, but it can be fast for simple follow sets. Or, you can refactor your subrules using tail recursion, like this:

    non-tag-char: complement charset "<"
    url-chars: [
        some non-tag-char [end | "<br>" | "<" url-chars]
    ]
    ; Note the tail recursive reference in the last part
    rule: [to "http://" x: url-chars y: (do something)]

Here's a better example, printing out the first paragraph in html, including nested paragraphs, assuming proper closure:

    non-lt: complement charset "<"
    p-rule: [
        "<p" [">" | " " thru ">"] ; Consume tag
        p-rule-cont ; Continue
    ]
    p-rule-cont: [
        ; Consume non-tag characters
        any non-lt [
            "</p>" ; Close tag
            | p-rule p-rule-cont ; Nested paragraph, continue
            | "<" p-rule-cont ; Something else, continue
        ]
    ]
    rule: [to "<p" copy tmp p-rule (print tmp) to end]

There are a few factors to note in this example:

- You need to make sure that you have a fix-point, a point that the recursion will stop, in this case the end tag.
- You need to make sure that every recursive rule will at least consume something before recursing, or it won't stop until the stack overflows.
- Parse doesn't backtrack through parens (embedded code). This means that you should put off the embedded code until the point that you can be sure that you have recognized the correct alternate - in this case, after the rule.
- Parse does a better job of minimizing recursion overhead than the regular REBOL interpreter does, so this recursion isn't as likely to overflow the stack.

I hope this all helps
Brian Hawley
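To see the paragraph rules run, here is a minimal sketch that applies them to a made-up HTML string, assuming REBOL 2 semantics. The rules are the ones from the message; only the `html` sample and the parse/all call are additions.

```rebol
non-lt: complement charset "<"
p-rule: [
    "<p" [">" | " " thru ">"]   ; consume the opening tag
    p-rule-cont
]
p-rule-cont: [
    any non-lt [
        "</p>"                  ; close tag: the fix-point that ends recursion
        | p-rule p-rule-cont    ; nested paragraph, then continue
        | "<" p-rule-cont       ; some other tag, continue
    ]
]
rule: [to "<p" copy tmp p-rule (print tmp) to end]

; Hypothetical input: the first paragraph contains an inline <b> tag
html: "<p>Hello <b>world</b></p><p>next</p>"
parse/all html rule   ; prints <p>Hello <b>world</b></p>
```

The <b> and </b> tags are stepped over by the third alternative, and the recursion stops at the first </p>, so only the first paragraph is copied into tmp.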

Notes
  • Quoted lines have been omitted from some messages.
    View the message alone to see the lines that have been omitted