Complex Series Parsing (Part 2)

[1/13] from: depotcity:home at: 8-Mar-2001 12:28

Hello all.

How do you parse this...

    y: [
        <text> This is some text <tag> with a tag added </tag> and then some text </text>
    ]

So that you get this

    n: [
        <text> This is some text <tag> <text> with a tag added </tag> <text> and then some text
    ]

I tried this...

    n: []
    z: parse y none
    foreach val z [
        either find val "<" [append n val] [append n rejoin [{<text> } val]]
    ]

But then I get

    n: [
        <text> <text> This <text> is <text> some <text> text <tag> <text> with <text> a <text> tag <text> added </tag> <text> and <text> then <text> some <text> text </text>
    ]

So how do I "collect" all the text until the next "tag"?

Terry Brownell

[2/13] from: petr:krenzelok:trz:cz at: 8-Mar-2001 23:29

----- Original Message -----
From: Terry Brownell <[depotcity--home--com]>
To: Rebol List <[rebol-list--rebol--com]>
Sent: Thursday, March 08, 2001 9:28 PM
Subject: [REBOL] Complex Series Parsing (Part 2)

> Hello all.
>
> How do you parse this...
> y: [
> <text> This is some text <tag> with a tag added </tag> and then some text </text>
> ]
> So that you get this

<<quoted lines omitted: 8>>

> n: []
> z: parse y none

Have you really tested your code? I get an error for the line above, as 'y has to be a string, not a block ...

> foreach val z [
> either find val "<" [append n val] [append n rejoin [{<text> } val]]
> ]

Hmm, but what if your text contains some "<" too? I think a more robust solution would be better:

    str: "<text> This is some text <tag> with a tag added </tag> and then some text </text>"
    result: copy ""
    alpha: charset [#"A" - #"Z" #"a" - #"z"]
    tag: [start: "<" [some alpha | "/" some alpha] ">" end: (tmp: copy/part start end append result either tmp = "<text>" ["<text>"][join newline [tmp newline]])]
    text: [some [tag | (append result "<text>") start: skip (append result copy/part start 1)]]
    parse/all str [text to end]
    print result

[3/13] from: petr:krenzelok:trz:cz at: 8-Mar-2001 23:34

> Hello all.
>
> How do you parse this...
> y: [
> <text> This is some text <tag> with a tag added </tag> and then some text </text>
> ]
> So that you get this

<<quoted lines omitted: 11>>

> either find val "<" [append n val] [append n rejoin [{<text> } val]]
> ]

    str: "<text> This is some text <tag> with a tag added </tag> and then some text </text>"
    result: copy ""
    alpha: charset [#"A" - #"Z" #"a" - #"z"]
    tag: [start: "<" [some alpha | "/" some alpha] ">" end: (tmp: copy/part start end append result either tmp = "<text>" ["<text>"][join newline [tmp newline "<text>"]])]
    text: [some [tag | start: skip (append result copy/part start 1)]]
    parse/all str [text to end]
    print result

[4/13] from: petr:krenzelok:trz:cz at: 8-Mar-2001 23:44

> Hello all.
>
> How do you parse this...
> y: [
> <text> This is some text <tag> with a tag added </tag> and then some text </text>
> ]
> So that you get this

<<quoted lines omitted: 11>>

> either find val "<" [append n val] [append n rejoin [{<text> } val]]
> ]

1) Sorry for the confusion. Ctrl + S means "send" here in Outlook Express ... damned ...

2) Somewhat kludgy, as I am not sure what you wanted to achieve, but it should work :-)

    str: "<text> This is some text <tag> with a tag added </tag> and then some text </text>"
    result: copy ""
    alpha: charset [#"A" - #"Z" #"a" - #"z"]
    tag: [start: "<" [some alpha | "/" some alpha] ">" end: (tmp: copy/part start end append result either tmp = "<text>" ["<text>"][join newline [tmp newline "<text>"]])] ; what a one-liner, eh :-)
    text: [some [tag | start: skip (append result copy/part start 1)] (remove/part skip tail result -6 6)]
    parse/all str [text to end]
    print result

Cheers,
-pekr-

[5/13] from: sterling::rebol::com at: 8-Mar-2001 16:17

I'm not sure I understand what you are really trying to do. Usually with parse, once you describe the format of what you want to parse and the output you desire, the parse rules just fall out onto the screen.

Correct me if I'm wrong:

Input is a block with the following format: a <text> tag followed by a series of words with any number of non-<text> or </text> tags interspersed, ending with a </text> tag.

The desired output is the same block, except that every place there is a non-<text> tag in the block, a <text> tag should be placed after it and before the next series of words. The ending </text> tag should be removed.

For this you don't need parse at all. Just march through the block and insert the new <text> tag as needed:

    y: [
        <text> This is some text <tag> with a tag added </tag> and then some text </text>
    ]
    forall y [
        all [tag? y/1 y/1 <> <text> y/1 <> </text> insert next y <text>]
        all [y/1 = </text> remove y y: back y]
    ]
    probe y: head y

Perhaps your rules are a bit more complicated, in which case you need to define them and then see what's the best way to do it. Parse may be necessary, but this simple case can be done quickly another way.

Sterling

[6/13] from: depotcity:home at: 9-Mar-2001 1:25

This is on the right track. But more complexity would arise... here is an advanced XML structure...

    y: [<tag0></tag0> <text> this and that <tag1>those </tag1>and these</text><tag2></tag2><text>There and then</text>]

output would be...

    out: [
        <tag0>
        </tag0>
        <text> this and that
        <tag1>
        <text> those
        </tag1>
        <text> and these
        </text>
        <tag2>
        </tag2>
        <text> There and then
        </text>
    ]

There is method to the madness, I've got the "madness" part down pat, now if I could only come up with "the method".

Thanks
Terry Brownell

[7/13] from: andrew:wxc at: 9-Mar-2001 22:55

Terry wrote:

> y: [<tag0></tag0> <text> this and that <tag1>those </tag1>and
> these</text><tag2></tag2><text>There and then</text>]

Have you considered the effects of having "bare" words in your block? Wouldn't it be better if your text words were inside strings? Like:

    y: [<tag0></tag0> <text> {this and that} <tag1> "those " </tag1> {and these} </text> <tag2> </tag2><text> "There and then" </text>]

Then strings with punctuation and invalid REBOL words won't stop your script from running. Then it becomes a simple matter to pick out strings and tags in the block.

Andrew Martin
ICQ: 26227169
http://members.nbci.com/AndrewMartin/
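
Picking out strings and tags, as Andrew describes, could look roughly like the sketch below. It is an illustration rather than code from the thread: the words tags and texts are made-up names, while tag? and string? are the standard type-test functions.

```rebol
REBOL []

; Input in the form Andrew suggests: markup as tag! values, text as string! values.
y: [<tag0> </tag0> <text> {this and that} <tag1> "those " </tag1> {and these} </text>]

tags: copy []    ; hypothetical name: collects every tag! value
texts: copy []   ; hypothetical name: collects every string! value

foreach item y [
    if tag? item [append tags item]
    if string? item [append texts item]
]

probe tags     ; [<tag0> </tag0> <text> <tag1> </tag1> </text>]
probe texts    ; ["this and that" "those " "and these"]
```

Because every piece of text is a single string! value, punctuation inside the text never has to be valid REBOL syntax.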

[8/13] from: petr:krenzelok:trz:cz at: 9-Mar-2001 13:06

----- Original Message -----
From: Terry Brownell <[depotcity--home--com]>
To: <[rebol-list--rebol--com]>
Sent: Friday, March 09, 2001 10:25 AM
Subject: [REBOL] Re: Complex Series Parsing (Part 2)

> This is on the right track. But more complexity would arise... here is an
> advanced XML structure...

<<quoted lines omitted: 15>>

> </text>
> ]

I am sorry, but I can't understand why you want the input in the above state. It's IMO buggy. So you manually insert a <text> tag in front of each text, while the real <text> tag exists too? Just look at your output - you have 3 <text> tags while you have only one closing </text> tag. Is that correct?

-pekr-

[9/13] from: depotcity:home at: 9-Mar-2001 9:48

The entire string represents a series of events over time, starting at the top of the output and making its way down. The tags "fire" as they are processed, and the corresponding ending tags (e.g. </tag1>) represent "stop". The <text> tag can have these other "events" embedded. In my particular case, no non-<text> tags will have other non-<text> tags embedded within, but I imagine one day they would.

    y: [<tag0></tag0> <text> this and that <tag1>those </tag1>and these</text><tag2></tag2><text>There and then</text>]

    out: [
        <tag0> ; fire
        </tag0> ; stop
        <text> this and that ; fire
        <tag1> ; fire
        <text> those ; continue
        </tag1> ; stop
        <text> and these ; continue
        </text> ; stop
        <tag2> ; fire
        </tag2> ; stop
        <text> There and then ; fire
        </text> ; stop
    ]

[10/13] from: depotcity:home at: 9-Mar-2001 10:04

cont...

This method makes the xml coding much cleaner...

This xml...

    some text <text> this is a bit of text with <tag1> some stuff </tag1> and some <tag2>more stuff </tag2> and then some final text </text>

is better than....

    some text <text> this is a bit of text with </text><tag1><text> some stuff </text></tag1><text> and some</text><tag2> <text>more stuff</text></tag2><text>and then some final text</text>

Terry Brownell

[11/13] from: sterling:rebol at: 9-Mar-2001 11:03

Well, before anybody goes further down the "here's something that works for the last input you posted" followed by "but then there's this input that doesn't work" path, let's go back to the definition of input and output.

If you use load/markup and treat the REBOL words you have in your block as strings like Andrew suggests (which is a better way to deal with them), then you have these input elements:

* <text> -- open text tag
* "???" -- some arbitrary string
* <???> -- some other open tag
* </???> -- some close tag
* </text> -- a close text tag

Your input looks like this:

    probe input: load/markup {<tag0></tag0> <text> this and that <tag1>those </tag1>and > these</text><tag2></tag2><text>There and then</text>}
    == [<tag0> </tag0> " " <text> " this and that^/" <tag1> "those " </tag1> "and > these" </text> <tag2> </tag2> <text> "There and^/then" </text>]

If you want to, you can get rid of the whitespace-only strings that are created by whitespace between the tags.

Now write the spec:

* any combination of input elements up to <text>
* open <text>
* any combination of "???", <???>, </???> where <text> would be inserted in front of each "???"
* </text>
* start the whole process over

Done. That's all you've told us so far. Each item above is essentially a parse rule already. Some can be joined together:

* [thru <text>]
* [any [ </text> [thru <text>] | tag! | string! mark: (insert back mark <text>) string! ] ]

Now we just assemble:

    ; skip the immediate string after <text> so we don't add a second one
    start-rule: [thru <text> [string! | none]]
    parse input [
        start-rule
        any [
            </text> start-rule ; start over
            | tag! ; eat any random tags
            | string! mark: (insert back mark <text>) string!
        ]
    ]
    probe input

And presto!

Sterling
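
The same spec can also be followed without inserting into the block while it is being parsed. The sketch below is an alternative illustration, not Sterling's code: retag is a hypothetical helper name, and it assumes the input already has the load/markup shape (tag! and string! values) with whitespace-only strings removed. It copies items into a new block and re-opens <text> in front of every string that continues a text section.

```rebol
REBOL []

; Hypothetical helper (not from the thread): returns a new block and leaves
; its argument untouched.
retag: func [blk [block!] /local out in-text? prev] [
    out: copy []
    in-text?: false
    prev: none
    foreach item blk [
        if all [tag? item item = <text>] [in-text?: true]
        if all [tag? item item = </text>] [in-text?: false]
        ; A string inside <text>...</text> that does not directly follow the
        ; opening <text> tag continues the text, so <text> is re-opened first.
        if all [
            string? item
            in-text?
            not all [tag? prev prev = <text>]
        ] [append out <text>]
        append out item
        prev: item
    ]
    out
]

probe retag [
    <tag0> </tag0> <text> "this and that " <tag1> "those " </tag1>
    "and these" </text> <tag2> </tag2> <text> "There and then" </text>
]
```

For this sample the result should have a <text> tag in front of "those " and "and these" and be otherwise unchanged, which matches the output Terry asked for. Strings outside any <text> section are passed through as they are.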

[12/13] from: petr:krenzelok:trz:cz at: 9-Mar-2001 23:57

----- Original Message -----
From: <[sterling--rebol--com]>
To: <[rebol-list--rebol--com]>
Sent: Friday, March 09, 2001 8:03 PM
Subject: [REBOL] Re: Complex Series Parsing (Part 2)

> Well, before anybody goes further into the "here's something that
> works for the last input you posted" followed by "but then there's

<<quoted lines omitted: 17>>

> You can get rid of the whitespace-only strings if you want to that are
> created due to whitespace between the tags.

OK, is there an easy way to do it without using iteration? If I use e.g. replace/all blk " " none, it will just replace the whitespace with 'none, but we simply want to remove the whitespace :-)

> Now write the spec:
> * any combination of input elements up to <text>

<<quoted lines omitted: 24>>

> ]
> ]

So you prefer to work with XML-like data in block mode rather than in string mode?

Cheers,
-pekr-

[13/13] from: sterling:rebol at: 9-Mar-2001 15:48

Nope. It's just one of those things. You could always:

    replace/all string "> <" "><"

if you knew that was safe to do on the original string so that the load/markup wouldn't make the whitespace values. But that still doesn't solve the string with more than one space (or tabs). It's best to iterate the block and remove items that TRIM to EMPTY?:

    for x length? blk 1 -1 [if empty? trim blk/:x [remove at blk x]]

If you don't want any strings in the block permanently trimmed then make it "trim copy blk/:x" instead.

Sterling
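
Written out in full, the "trim copy" variant Sterling mentions might look like the sketch below. The sample block is made up for illustration and only mimics the shape load/markup returns.

```rebol
REBOL []

; Made-up sample data: tags as tag! values, text between them as string! values.
blk: copy [<tag0> </tag0> " " <text> " this and that " <tag1> "   " </tag1> </text>]

; Count down from the tail so that removing an item never shifts
; an item that has not been visited yet.
for x length? blk 1 -1 [
    ; trim runs on a copy, so the strings that stay keep their original spacing
    if empty? trim copy blk/:x [remove at blk x]
]

probe blk    ; [<tag0> </tag0> <text> " this and that " <tag1> </tag1> </text>]
```

Only the two whitespace-only strings are dropped. The string " this and that " keeps its leading and trailing spaces because trim never touches the original value.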

Notes

Quoted lines have been omitted from some messages.