A little parse help
[1/19] from: syke:amigaextreme at: 20-Aug-2001 9:31
Hi,
I'm kinda tired today ;-)
If I for example have this text:
Quick brown fox jumps !image-brown.gif over the fence
and I want to parse out the image file, I'll just do
parse text [ any [ thru "!" copy wanted-text to " " ]]
But what do I do if the line ends after the .gif?
Basically, I want parse to copy the wanted-text until it either finds a " "
or a newline.
/Regards
Stefan Falk
www.amigaextreme.com
[2/19] from: brett:codeconscious at: 20-Aug-2001 17:39
Try
parse text [ any [ thru "!" copy wanted-text [to " " | to end]]]
Brett.
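A minimal sketch of how that alternative behaves on both kinds of input, assuming parse/all so that spaces count as ordinary characters. The two sample strings are only illustrations, and the trailing "to end" is added here so each parse consumes the whole input:

```rebol
REBOL []

; Copy the file name after "!" whether a space or the end of input follows.
foreach text [
    "Quick brown fox jumps !image-brown.gif over the fence"
    "Quick brown fox jumps !image-brown.gif"
] [
    wanted-text: none
    parse/all text [
        ; Try "up to the next space" first; fall back to "up to the end".
        thru "!" copy wanted-text [to " " | to end]
        to end
    ]
    print wanted-text
]
```

Both iterations should print the same file name, because the second alternative only applies when no space follows.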
[3/19] from: petr::krenzelok::trz::cz at: 20-Aug-2001 9:42
Stefan Falk wrote:
> Hi,
> I'm kinda tired today ;-)
>
> If I for example have this text:
>
> Quick brown fox jumps !image-brown.gif over the fence
>
> and I want to parse out the image file, I'll just do
> parse text [ any [ thru "!" copy wanted-text to " " ]]
1) I think that even your parse rule above is never met. By default, 'parse
omits spaces, so you would probably be better off with parse/all here.
2) I don't know your application, but wouldn't you be better off with 'find?
e.g.
->> start: find/any str "!*.???"
== "!image-brown.gif over the fence"
->> end: find start " "
== " over the fence"
->> res: copy/part start end
== "!image-brown.gif"
->> remove res
== "image-brown.gif"
->>
If your string is long, you can reassign its position in a loop, e.g. "str:
end" and continue searching for another image ... Maybe not so elegant, but ...
-pekr-
[4/19] from: syke:amigaextreme at: 20-Aug-2001 22:27
Hi again,
I get a really strange behaviour from parse when I try to do this (it's also
an explanation to what I'm trying to do).
if find content "http://" [
    parse/all content [
        any [
            to "http://" copy URL to "<br>" (
                link: rejoin [{<a href="} URL {">} URL {</a>}]
                replace content URL link
            )
        ]
    ]
]
When I try to do this, Rebol crashes, the processor on the web server hits
100% and the only solution
is to stop the webserver and then start it again. However, if I do like
this:
string: "Test"
if find content "http://" [
    parse/all content [
        any [
            to "http://" copy URL to "<br>" (
                link: rejoin [{<a href="} URL {">} string {</a>}]
                replace content URL link
            )
        ]
    ]
]
It works.
It seems as if using URL two times within Rejoin will cause Rebol to hang.
Any idea as to what is causing this?
/Regards Stefan Falk - www.amigaextreme.com
[5/19] from: jelinem1:nationwide at: 20-Aug-2001 16:25
Allow me to make an educated guess as to what's happening in the absence
of data to test my theory. I've done this sort of thing a long time ago.
I doubt that the multiple usage of URL within a 'rejoin is the culprit.
When 'parse hangs and grabs the CPU, it is usually an indication that you
have an infinite parse loop.
The crux here is where the 'parse cursor is in the content string.
>> to "http://"
Places the cursor at the first element of the first match of this string.
>> copy URL to "<br>"
Moves the cursor through the url text to the first element of "<br>". Now
we loop:
>> to "http://"
Places the cursor at the first element of the next match of this string.
But wait! Where exactly did we find the "next occurrence" of this string?
When you changed the 'content string you did NOT affect the 'parse cursor.
In other words, the 'parse cursor has the same index? relative to the
beginning of the string as it did before you made the 'replace. SO...the
cursor is now positioned WITHIN the 'link text and effectively points
shortly before the second URL that you replaced in 'content!
Clear as mud?
As a solution, after you finish the replacement you will want to move the
'parse cursor: (length? link) - (length? URL). I think the 'parse word
'skip will do this.
- Michael Jelinek
[6/19] from: g:santilli:tiscalinet:it at: 21-Aug-2001 19:20
Hello Stefan!
On 20-Ago-01, you wrote:
SF> if find content "http://" [
SF> parse/all content [
SF> any [
SF> to "http://" copy URL to "<br>" (
SF> link: rejoin [{<a href="} URL {">} URL {</a>}]
SF> replace content URL link
SF> )
SF> ]
SF> ]
SF> ]
Maybe this will work better (not tested):
parse/all content [
    any [
        to "http://" mark1: to "<br>" mark2: (
            link: rejoin [{<a href="} URL: copy/part mark1 mark2 {">} URL {</a>}]
            mark1: change/part mark1 link mark2
        )
        :mark1
    ]
]
Regards,
Gabriele.
--
Gabriele Santilli <[giesse--writeme--com]> - Amigan - REBOL programmer
Amiga Group Italia sez. L'Aquila -- http://www.amyresource.it/AGI/
[7/19] from: syke:amigaextreme at: 24-Aug-2001 20:21
Hi,
thanks, this worked!
just two questions,
what's the last :mark1 there for?
and how do I change it to parse until <br> or a space " "?
("<br>" | " ") or (to "<br>" | to " ") doesn't seem to work..
/Regards Stefan
[8/19] from: jelinem1:nationwide at: 24-Aug-2001 13:34
> what's the last :mark1 there for?
Sets the parse cursor location.
> and how do I change it to parse until <br> or a space " "?
Parse rules will not do this, if I understand your intent correctly. Parse
rules WILL look until <br> or space " ", but will not stop at whichever
comes first. Parse first looks for <br>: If parse never finds a <br> (up
to end of data) then it will look for a space " ", otherwise stopping at
the next <br> regardless of spaces.
[9/19] from: petr:krenzelok:trz:cz at: 24-Aug-2001 21:01
> > what's the last :mark1 there for?
> Sets the parse cursor location.
<<quoted lines omitted: 4>>
> to end of data) then it will look for a space " ", otherwise stopping at
> the next <br> regardless of spaces.
FIRST - looooong time requested feature. Carl once agreed it would be
useful, but there are probably other priorities for RT to solve now.
However - being able to parse first of [a | b | c] is probably the most
missing feature re parsing ..
-pekr-
[10/19] from: max:ordigraphe at: 24-Aug-2001 15:25
> FIRST - looooong time requested feature. Carl once agreed it would be
> useful, but there are probably other priorities for RT to solve now.
> However - being able to parse first of [a | b | c] is
> probably the most
> missing feature re parsing ..
Carl even sent me a mail saying it IS in the plans... but he sent me
that just about one year ago!
I have been implementing my own document language. The main difference
is that it is a natural language and the lack of this parsing feature is
making my work Extremely complicated.
That is because I do not want to impose strict format structure... so I
do not know if the document writer is going to end his line right away
or if he wishes to continue on the same line or if he'll put a space or
two, or put a space just before the end of the line...
Add to this the fact that the keywords themselves are plain english (or
any other language, in fact :-) and ARE allowed within the content
itself and it makes the parsing a little bit harder still!
This parsing feature alone would have cut my development time in half at
least!
But alas nothing is perfect, life WOULD be dull indeed! ;-)
Note to RT: Just one tag (like any, to, some, etc) called "next" would
be easy to include in parsing engine no?
-Max
[11/19] from: syke:amigaextreme at: 24-Aug-2001 23:37
Hi,
if it doesn't work, cheat!
I just replace all "<br>" with " <br>" (added a space in front of 'em) and
voila!
Parse til the space and everything works fine and dandy! :-)
Thanks for all the help guys!
/Regards Stefan
[12/19] from: g:santilli:tiscalinet:it at: 25-Aug-2001 14:53
Hello Stefan!
On 24-Ago-01, you wrote:
SF> thanks, this worked!
I'm happy it was useful.
SF> just two questions,
SF> what's the last :mark1 there for?
To reset the current position for the parser. It's better to
always do that when you modify the string you are parsing. In this
case, it is even necessary unless you want to loop forever (as
others explained in this thread). mark1 is set by CHANGE just
after the change (i.e. after the </a>); this way PARSE will
continue its work from there.
SF> and how do I change it to parse until <br> or a space " "?
This is a little more tricky. If you think it is ok to stop at " "
or just "<" then you can do it this way:
url-chars: complement charset " <"
...
to "http://" mark1: some url-chars mark2: (
...
If you need to stop just on space and <br> and not on other tags,
it gets a bit more complicated... but I think you don't need this,
do you?
Regards,
Gabriele.
--
Gabriele Santilli <[giesse--writeme--com]> - Amigan - REBOL programmer
Amiga Group Italia sez. L'Aquila -- http://www.amyresource.it/AGI/
[13/19] from: syke:amigaextreme at: 25-Aug-2001 21:02
Hi,
actually, this is exactly what I need. Basically, I want to parse a html
file for all URLs in it, and the end of the URL will obviously contain a
space or a <br> (the two cases that separate text). But as you might have
read, I've done it by putting a space in front of every <br>, and therefore I
just have to parse until a space.
/Regards Stefan Falk
www.amigaextreme.com
[14/19] from: lmecir:mbox:vol:cz at: 26-Aug-2001 1:39
Hi,
> I have been implementing my own document language. The main difference
> is that it is a natural language and the lack of this parsing feature is
<<quoted lines omitted: 8>>
> This parsing feature alone would have cut my development time in half at
> least!
here is my solution:
cfunc: function [
    {make a closure}
    [catch]
    spec [block!]
    body [block!]
] [locals in-new-context spec2 body2 i] [
    locals: copy []
    spec2: copy [[throw]]
    body2: reduce ['do 'func spec2 body]
    i: 1
    repeat item spec [
        if all [any-word? :item not set-word? :item] [
            append locals to word! :item
            append spec2 reduce [to word! :item [any-type!]]
            append body2 reduce ['get/any 'pick 'locals i]
            i: i + 1
        ]
    ]
    in-new-context: func [
        {do body with locals in new context}
        [throw]
        locals
    ] body2
    throw-on-error [
        func spec reduce [:in-new-context locals]
    ]
]
a-b: cfunc [
    {Generate an A-B parse rule}
    a [block!]
    b [block!]
    /local finish
] [
    [
        [
            b (finish: [to end skip]) |
            (finish: a)
        ]
        finish
    ]
]
comment {
Example:
a: [any "a" "b"]
b: ["aa"]
parse "ab" a-b a b
parse "aab" a-b a b
}
not-rule: cfunc [
    {Generate a not A parse rule}
    a [block!]
    /local finish
] [
    [
        [
            a (finish: [to end skip]) |
            (finish: [])
        ]
        finish
    ]
]
comment {
Example:
a: [any "a" "b"]
parse "ab" not-rule a
parse "b" not-rule a
parse "" not-rule a
}
to-rule: cfunc [
    {generate a to A parse rule}
    a [block!]
    /local nxt finish
] [
    [
        (
            finish: [to end skip]
            nxt: [skip]
        )
        any [a (nxt: [to end skip] finish: []) nxt | nxt] finish
    ]
]
comment {
Example:
space-or-br: to-rule [" " | "<br>"]
result: ""
parse/all "aa" [space-or-br copy result to end]
probe result
parse/all "a a<br>" [space-or-br copy result to end]
probe result
parse/all "ab<br> " [space-or-br copy result to end]
probe result
}
[15/19] from: g:santilli:tiscalinet:it at: 26-Aug-2001 17:24
Hello Stefan!
On 25-Ago-01, you wrote:
SF> actually, this is exactly what I need. Basically, I want to
SF> parse a html file for all URLs in it, and the end of the URL
As I imagined... so stopping at any tag should not create problems
for you... Anyway, you already have your solution. :)
Regards,
Gabriele.
--
Gabriele Santilli <[giesse--writeme--com]> - Amigan - REBOL programmer
Amiga Group Italia sez. L'Aquila -- http://www.amyresource.it/AGI/
[16/19] from: syke:amigaextreme at: 26-Aug-2001 18:24
Yes,
a note though, a space isn't a tag, so if someone writes an url, and some
text behind it, the entire URL will be the URL + the text behind it until
the new line.
eg.
http://www.rebol.com <--- Check this link<br>
Would create a really weird link
<a href="http://www.rebol.com <--- Check this link">http://www.rebol.com
<--- Check this link</a>
Try clickin' on that :-)
/Regards
Stefan Falk
www.amigaextreme.com
[17/19] from: g:santilli:tiscalinet:it at: 27-Aug-2001 19:14
Hello Stefan!
On 26-Ago-01, you wrote:
SF> Yes,
SF> a note though, a space isn't a tag, so if someone writes an
Indeed. My version stopped at a space or at any tag. Did you miss
it?
SF> Would create a really weird link <a
SF> href="http://www.rebol.com <--- Check this
SF> link">http://www.rebol.com <--- Check this link</a>
This is what happens if you stop at <br> only. And this is the
reason why the code I proposed stops at any tag. What does
your version do with:
<b>http://www.rebol.com/</b><br>
?
:-)
Regards,
Gabriele.
--
Gabriele Santilli <[giesse--writeme--com]> - Amigan - REBOL programmer
Amiga Group Italia sez. L'Aquila -- http://www.amyresource.it/AGI/
[18/19] from: syke:amigaextreme at: 27-Aug-2001 21:26
Hi,
sorry, actually I did miss a part of your previous post. ;-)
And as a sidenote, all tags except <br> and <a> tags are converted to &lt;
and &gt; so stopping at
& <
would be the best :)
/Regards Stefan Falk
www.amigaextreme.com
[19/19] from: brian:hawley at: 28-Aug-2001 1:36
A little late, but...
At 08:21 PM 8/24/01 +0200, Stefan Falk wrote:
>Hi,
>thanks, this worked!
>
>just two questions,
>what's the last :mark1 there for?
At every step of the parse process, there are two implicit
parameters: the series that you are processing and the
current position within that series. In parse rules you
can assign the series (at its current position) to a word
(x) by putting the set-word (x:) in the rules at a given
point. You can also reset the implicit parse series (and
position) to the value assigned to a word by putting the
get-word (:x) at a given point in the rules.
If you are changing the series you are working on while
you are parsing it, you need to make sure that parse is
able to keep track of its implicit position setting. This
is not a problem if you are changing the series in front
of or at the implicit position, like this:
[to "foo" x: (remove/part x 3)]
In this case, the implicit position at the point where x is set
is before the part of the series that is being changed,
so parse is not going to get confused. However, if the
implicit parse position is after the part of the series
that is being changed, like this:
[to "<foo" x: thru ">" y: (remove/part x y)]
then parse is going to get confused about its implicit
position, especially if the length of the series is any
different as a result of the change. To deal with this
you have to reset the parse position after such changes,
like this:
[to "<foo" x: thru ">" y: (z: remove/part x y) :z]
Does that make sense?
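As a small worked sketch of the last rule, assuming a made-up input string and wrapping the rule in "any" so that every matching tag is removed:

```rebol
REBOL []

; Hypothetical input: strip every <foo ...> tag while parsing.
text: copy "a<foo x>b<foo y>c"

parse/all text [
    any [
        to "<foo" x: thru ">" y:
        (z: remove/part x y)   ; z is the series at the spot where the tag was
        :z                     ; resume parsing from that spot
    ]
    to end
]

print text
```

With the position reset, the second tag is found from the place where the first one was removed, and the text that remains should be just the letters a, b and c.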
>and how do I change it to parse until <br> or a space " "?
>
>("<br>" | " ") or (to "<br>" | to " ") doesn't seem to work..
This is a common problem. The general workaround to the
problem of scanning until the first of a set of alternate
values (a follow set) is to refactor the problem into one
of scanning through the values that aren't in the set that
you are scanning for. That may sound confusing.
In your case, it would be easier if you chose to scan until
all html tags, not just <br>. Then you could just scan to
the "<" character and use code like this:
url-chars: complement charset {< "^(tab)^(newline)}
rule: [to "http://" x: some url-chars y: (do something)]
Note that url-chars is the complement of the set of chars
in that string, or all chars _not_ in the follow set.
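To make the charset approach concrete, here is a minimal sketch that combines it with the mark and change/part technique from earlier in the thread. The sample content string is made up for illustration, and the charset is written with the short escapes ^- and ^/ for tab and newline:

```rebol
REBOL []

; Hypothetical sample text containing two URLs.
content: copy "See http://www.rebol.com <--- Check this link<br>and <b>http://www.rebol.com/</b><br>"

; All characters that are NOT in the follow set (tag start, space, quote, tab, newline).
url-chars: complement charset {< "^-^/}

parse/all content [
    any [
        to "http://" mark1: some url-chars mark2: (
            url: copy/part mark1 mark2
            link: rejoin [{<a href="} url {">} url {</a>}]
            ; CHANGE returns the position just after the inserted link.
            mark1: change/part mark1 link mark2
        )
        :mark1   ; continue parsing after the new link, not inside it
    ]
    to end
]

print content
```

Each URL should end up wrapped in its own anchor, while the text after the space and the surrounding <b> tag stay outside the link.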
If you can't distinguish your follow set by looking at one
character at a time (say you only want to go to <br> tags
but skip other tags) then you have two solutions. You may
be able to extend the previous charset solution with more
charsets that exclude each of the rest of the letters in
the values of the follow set - awkward, but it can be fast
for simple follow sets. Or, you can refactor your subrules
using tail recursion, like this:
non-tag-char: complement charset "<"
url-chars: [
    some non-tag-char [end | "<br>" | "<" url-chars]
] ; Note the tail recursive reference in the last part
rule: [to "http://" x: url-chars y: (do something)]
Here's a better example, printing out the first paragraph in
html, including nested paragraphs, assuming proper closure:
non-lt: complement charset "<"
p-rule: [
    "<p" [">" | " " thru ">"] ; Consume tag
    p-rule-cont ; Continue
]
p-rule-cont: [
    ; Consume non-tag characters
    any non-lt [
        "</p>" ; Close tag
        | p-rule p-rule-cont ; Nested paragraph, continue
        | "<" p-rule-cont ; Something else, continue
    ]
]
rule: [to "<p" copy tmp p-rule (print tmp) to end]
There are a few factors to note in this example:
- You need to make sure that you have a fix-point, a point
at which the recursion will stop, in this case the end tag.
- You need to make sure that every recursive rule will at
least consume something before recursing, or it won't
stop until the stack overflows.
- Parse doesn't backtrack through parens (embedded code).
This means that you should put off the embedded code until
the point that you can be sure that you have recognized
the correct alternate - in this case, after the rule.
- Parse does a better job of minimizing recursion overhead
than the regular REBOL interpreter does, so this recursion
isn't as likely to overflow the stack.
I hope this all helps
Brian Hawley
Notes
- Quoted lines have been omitted from some messages.