Mailing List Archive: 49091 messages

parse or

 [1/23] from: rchristiansen::pop::isdfa::sei-it::com at: 20-Sep-2000 15:00


>> paragraphs: {First paragraph.^/Second "paragraph."^/Third paragraph.}
== {First paragraph. Second "paragraph." Third paragraph.}
>> probe parse paragraphs [{.^/} | {."^/}]
false
== false
>> probe parse paragraphs ({.^/} or {."^/})
** Script Error: Cannot use or~ on string! value.
** Where: ".^/" or {."

So how DO I parse by a value OR another value?

-Ryan

 [2/23] from: jeff:rebol at: 20-Sep-2000 12:26


Howdy, Ryan:

paragraphs: {First paragraph.^/Second "paragraph."^/Third paragraph.}
parse paragraphs [some [thru {.^/} | thru {."^/} | thru "." end]]

-jeff

 [3/23] from: rryost:home at: 20-Sep-2000 13:54


Hi Ryan: Here's a one liner that may help:
>> st: "abcdef"
== "abcdef"
>> parse/all st "ed"
== ["abc" "" "f"] ; An inclusive OR, I guess.
>> parse/all st "gh"
== ["abcdef"] ; No splitting as neither "g" nor "h" is present.
>> parse/all st "gb"
== ["a" "cdef"] ; Split at the single char that matched.

Russell [rryost--home--com]

 [4/23] from: rchristiansen:pop:isdfa:sei-it at: 20-Sep-2000 15:54


> paragraphs: {First paragraph.^/Second "paragraph."^/Third paragraph.}
> parse paragraphs [some [thru {.^/} | thru {."^/} | thru "." end]]

This returns "true". But what if I'm trying to parse a report and wish to make each paragraph a separate string within a block?

>> paragraphs: {First paragraph.^/Second "paragraph."^/Third paragraph.}
== {First paragraph. Second "paragraph." Third paragraph.}
>> paragraphs-breakdown: []
== []
>> foreach paragraph parse paragraphs [some [thru {.^/} | thru {."^/} | thru "." end]] [append paragraphs-breakdown paragraph]
** Script Error: foreach expected data argument of type: series.
** Where: foreach paragraph parse paragraphs [some [thru ".^/" | thru {."} | thru "." end]]

Doesn't work. Sorry if I'm being a pain, but when I read the "parse rules" documentation it doesn't make any sense to me. I can't see the usefulness of returning "true" in this situation.

 [5/23] from: jeff:rebol at: 20-Sep-2000 13:41


Howdy, Ryan:
> > paragraphs: {First paragraph.^/Second "paragraph."^/Third paragraph.}
> > parse paragraphs [some [thru {.^/} | thru {."^/} | thru "." end]]
>
> This returns "true"

Yep, that means that PARSE successfully made it through the whole string. To break up your paragraphs, copy out the parts thru each paragraph ending, like this:

pb: copy []
parse paragraphs [
    some [
        copy part [thru {.^/} | thru {."^/} | thru "." end]
        (append pb part)
    ]
]
pb
== ["First paragraph.^/" {Second "paragraph."
} "Third paragraph."]

> Sorry if I'm being a pain, but when I read the "parse
> rules" documentation it doesn't make any sense to me. I
> can't see the usefulness of returning "true" in this
> situation.
No pain whatsoever. I like trying to help. -jeff
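
A minimal sketch of what the true/false result means, assuming REBOL 2 semantics (the sample strings and the words found and ch are made up for illustration). PARSE reports only whether the rule consumed the whole input; anything you want back has to be collected by the rule itself:

```rebol
; true only when the rule matches all the way to the end of the input
print parse/all "a"  ["a" | "b"]            ; true
print parse/all "ab" ["a" | "b"]            ; false, "b" is left over
print parse/all "ab" [some ["a" | "b"]]     ; true, the rule repeats to the end

; collect the matched pieces with copy and a paren action
found: copy []
parse/all "ab" [some [copy ch ["a" | "b"] (append found ch)]]
probe found                                 ; ["a" "b"]
```

This is also why the first attempt in the thread returned false: a single alternative matched (at most) one ending, not the whole string.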

 [6/23] from: rryost:home at: 20-Sep-2000 15:08


Better ignore my previous post on this subject! I assumed you were trying to create a block with each component one of the paragraphs, using the ^/ as a separator. But my stuff doesn't throw any light on that at all! Please excuse my ineptness! Russell [rryost--home--com]

 [7/23] from: rryost:home at: 20-Sep-2000 15:19


Will Jeff's approach work if the paragraphs contain multiple periods? Russell [rryost--home--com]

 [8/23] from: joel:neely:fedex at: 20-Sep-2000 17:31


Hi, Jeff... you beat me to the "Send" button! ;-) Since I'm now obligated to add value, instead of just saying "me too", see additional remarks below.

[jeff--rebol--net] wrote:
> Howdy, Ryan:
>
> > paragraphs: {First paragraph.^/Second "paragraph."^/Third paragraph.}
<<quoted lines omitted: 15>>
> == ["First paragraph.^/" {Second "paragraph."
> } "Third paragraph."]

Two other notes, one very minor, one very non-minor:

MINOR: Notice that the newline terminator is left on the end of each internal paragraph (because thru is inclusive). There are a variety of tricky ways to deal with this while parsing, but it seems to me that it's simpler to post-process them off (if they are undesirable).

NON-MINOR: Let's change the input data slightly, and try it again:
>> pb: []
== []
>> paras: {First "para."^/Second para.^/Third para.}
== {First "para." Second para. Third para.}
>> parse paras [ some [
[        copy part [thru {.^/} | thru {."^/} | thru "." end]
[        (append pb part)
[    ]
[ ]
== true
>> pb
== [{First "para." Second para. } "Third para."]

Notice that now the result block has only TWO elements! Since the first test (the thru {.^/} part) can succeed by grabbing text all the way to the end of the SECOND paragraph, it does so, putting the first two paragraphs into the first output string. I assumed that this is NOT what you wanted, but rather you wanted to copy through either {.^/} or {."^/}, WHICHEVER COMES NEXT. (Natural language text munching is a real pain, speaking from personal experience! ;-)

The strategies I've thought of (I don't have time to code, compare, and recommend right at the moment) are:

1) Write more complicated parse rules, that either

   1a) parse to newline, append the copied chunk to a paragraph string under construction, then look at the tail end of the last chunk to see whether it can be extended or whether a new paragraph should be started (based on whether it looked like the end of a sentence).

   1b) parse to period, grab and append the next character if it is a quotation mark, append to paragraph under construction, and start a new paragraph if the next character is newline.

2) Use simpler parsing (break on newlines), then make a postpass across the block of "lines", gluing back together wherever the boundary isn't the end of a sentence.

Both sound mildly fuzzy (that's not as bad as "really hairy"!)

If you can control the original text, perhaps another convention would be handy, such as breaking paragraphs whenever there's a blank line (two consecutive newlines).

-jn-
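
One way to get "whichever comes next" is to test every ending at each position and otherwise step forward one character. The following is only a sketch, assuming REBOL 2 parse semantics; it is not code from the thread, and the words start and mark are made up for illustration:

```rebol
paras: {First "para."^/Second para.^/Third para.}
pb: copy []
start: paras                ; head of the paragraph being collected

parse/all paras [
    any [
        ; an ending matches right here: cut the paragraph at this point
        [{.^/} | {."^/}] mark:
        (append pb copy/part start mark  start: mark)
        | skip              ; no ending here, move on one character
    ]
]
if not tail? start [append pb copy start]   ; text after the last ending

probe pb
```

Because the alternatives are only tried at the current position, neither one can run ahead past an earlier occurrence of the other, so pb ends up with three strings. As with thru, the newline terminators stay on the first two.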

 [9/23] from: rchristiansen:pop:isdfa:sei-it at: 20-Sep-2000 18:11


> I assumed that
> this is NOT what you wanted, but rather you wanted to copy through
> either {.^/} or {."^/} WHICHEVER COMES NEXT. (Natural language
> text munching is a real pain, speaking from personal experience! ;-)
Yes, this is what I was looking for. As someone who has never parsed anything before using REBOL (there will be more like me!) the parsing rules are confusing to read in the REBOL docs. My inclination is to want to use a simple statement which will parse until a set of characters is reached OR a different set of characters is reached, whichever comes along first and next.
> The strategies I've thought of (I don't have time to code, compare, > and recommend right at the moment) are:
<<quoted lines omitted: 10>>
> across the block of "lines", gluing back together wherever the > boundary isn't the end of a sentence.
You missed another option, which I had been using previously. Here is the function:

breakdown-content: func [
    "breakdown an e-mail content field into its parts"
    msg [object!] "e-mail message"
][
    article-info: msg/content
    end-of-paragraph: rejoin [{.} newline]
    replace/all article-info end-of-paragraph {.~}
    content-parts: copy []
    foreach part parse/all article-info {~} [
        append content-parts trim/lines part
    ]
]

In other words, replace all instances of a set of characters with a new character that can be recognized later. The above example needs to be fixed because it only replaces instances of {.^/} with {.~}, and I've discovered the tilde is a bad choice, anyway. I need to also be able to replace any set of characters you might find at the end of a paragraph, including {."^/} and {!^/} and {?^/} and {:^/} and {...^/} and I'm sure there are more.

I was hoping there would be a quick way to use parse instead of replacing characters first and then parsing.

-Ryan
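
The replace-then-split idea extends to several endings with a loop. This is a rough sketch under assumptions: the sample text and the words text, marker, and parts are invented here, the list of endings is illustrative rather than complete, and the marker character #"^(ff)" is assumed never to occur in the input:

```rebol
text: copy {First "para."^/Is it true?^/Yes!^/Last one.}
marker: to-string #"^(ff)"      ; assumption: never appears in the text

; one replace per ending; {.} plus newline also covers {...} plus newline
foreach ending [{."} {.} {!} {?} {:}] [
    replace/all text join ending newline join ending marker
]

parts: parse/all text marker    ; split on the marker only
probe parts
```

Each ending keeps its punctuation and only the newline is swapped for the marker, so a newline in the middle of a sentence is left alone.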

 [10/23] from: jeff:rebol at: 20-Sep-2000 18:03


Howdy, Joel:
> Notice that now the result block has only TWO elements!
> Since the first test (the thru {.^/} part) can succeed by
<<quoted lines omitted: 5>>
> language text munching is a real pain, speaking from
> personal experience! ;-)

Sure. In the interests of advancing its popularity, I offered up a simplistic example of PARSE. :-)

Paragraphs can end in a variety of punctuation ("!?.-;:), with different quantities (as Russ pointed out), no?

-jeff

 [11/23] from: rryost:home at: 20-Sep-2000 21:30


See my stuff interjected below:

Russell [rryost--home--com]

----- Original Message -----
From: <[RChristiansen--pop--isdfa--sei-it--com]>
To: <[list--rebol--com]>
Sent: Wednesday, September 20, 2000 1:54 PM
Subject: [REBOL] parse or Re:(2)

> > paragraphs: {First paragraph.^/Second "paragraph."^/Third paragraph.}
> > parse paragraphs [some [thru {.^/} | thru {."^/} | thru "." end]]
>
> This returns "true"
>
> But what if I'm trying to parse a report and wish to make each
> paragraph a separate string within a block?

Simple parsing with the /all refinement will do this in one step. The /all refinement disables all the default delimiters and uses only the supplied string of characters to break apart the target string. In this case, we'll use the control character "^/", end of line, commonly used to end a paragraph. A console session follows that illustrates this. (Note I've inserted an extra "test" period within the first paragraph.)
>> paragraphs: {First. paragraph.^/Second "paragraph."^/Third paragraph.}
== {First. paragraph. Second "paragraph." Third paragraph.}
>> ; Now apply the simple parse/all with only the single break character ^/
>> parse/all paragraphs "^/"
== ["First. paragraph." {Second "paragraph."} "Third paragraph."]

end of console session.

This seems to be just what is wanted. {}'s are used for the second item because it included "'s. The period at the end of First is ignored, along with all the other spaces, ", etc. because the /all refinement disabled the usual default break chars.

The previous message now continues:

 [12/23] from: norsepower:uswest at: 21-Sep-2000 0:00


Ahh, but this is not enough, because if the report has more than one newline character following a paragraph, you will end up with empty paragraphs.

 [13/23] from: rryost:home at: 21-Sep-2000 0:24


So what? Seems the application that's going to use the block of paragraphs could easily deal with the "" for an empty paragraph. To me, that's preferable to trying to outguess the final character of every conceivable paragraph!

Russell [rryost--home--com]
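
Dealing with the empty strings afterwards can be as small as the sketch below. The sample report and the words report, kept, and p are made up for illustration:

```rebol
report: {First paragraph.^/^/Second paragraph.^/^/^/Third paragraph.}

kept: copy []
foreach p parse/all report "^/" [
    if not empty? p [append kept p]     ; skip the "" left by extra newlines
]
probe kept
```

Here kept holds only the three non-empty strings, however many newlines separate them.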

 [14/23] from: joel:neely:fedex at: 21-Sep-2000 6:51


[jeff--rebol--net] wrote:
> Sure. In the interests of advancing its popularity, I
> offered up a simplistic example of PARSE. :-)
>
> Paragraphs can end in a variety of punctuation ("!?.-;:),
> with different quantities (as Russ pointed out), no?
>
> -jeff

Sure! Nothing wrong with simple examples!

My only reason for pointing out that the first alternative would be taken (passing up an instance of the second alternative) was that those of us who "regularly" deal with regular expressions <groan /> find it easy to get tripped up by this kind of behavior. With REs, I can say "grab everything up to the first occurrence of any of these complicated patterns" in a fairly obvious way. Re-remembering the subtle differences between BNF and REs was (for me, at least) the hardest part of getting productive with parse.

Just so I don't come across as being TOO negative ;-) please see my next post!

-jn-

 [15/23] from: joel:neely:fedex at: 21-Sep-2000 7:03


Hi, Ryan... [RChristiansen--pop--isdfa--sei-it--com] wrote:
[snip]
> You missed another option, which I had been using previously. Here > is the function: >
[snip]
> In other words, replace all instances of a set of characters with > a new character that can be recognized later... >
You're absolutely right! Thanks for catching my omission. I often use that trick when html-izing text, with the cliche below:

replace/all chunk-a-text {&} {@@@}
replace/all chunk-a-text {<} {&lt;}
replace/all chunk-a-text {>} {&gt;}
replace/all chunk-a-text {"} {&quot;}
replace/all chunk-a-text {@@@} {&amp;}

where the trick, of course, is to hide the ampersands before replacing dangerous characters with entities that are escaped by ampersands.

To make up for my omission (and my lack of time/energy last night), the following is a scheme for using parse to attack the paragraphing problem you posted. I know it doesn't handle every possible case, but I think it can be generalized in a fairly obvious way. Enjoy!

-jn-

(Output first, as an appetizer! I'll leave the code un-indented to facilitate cut-and-paste.)

=====================================================================
>> do %parsetest.r
-----
 First paragraph ends here.
-----
-----
 A sentence. End of second "paragraph."
-----
-----
 I'm not sure. Is this the third paragraph?
-----
-----
 The fourth paragraph contains some
embedded linebreaks along
the way. Will this work?
-----
-----
 Another sentence. A "quotation." The end!
-----
-----
 Well, maybe!
-----
>>
=====================================================================

REBOL []

paragraphs: {First paragraph ends here.
A sentence. End of second "paragraph."
I'm not sure. Is this the third paragraph?
The fourth paragraph contains some
embedded linebreaks along
the way. Will this work?
Another sentence. A "quotation." The end!
Well, maybe!}

parblock: copy []
currpar: copy ""
stopper: charset {.?!}
nonstop: complement stopper
fragment: copy ""

sent: [
copy fragment [any nonstop stopper]
(append currpar fragment)
[{^/} (append parblock currpar currpar: copy "")
|{"^/} (append parblock append currpar {"} currpar: copy "")
| none
]
]

parg: [
(parblock: copy [] currpar: copy "")
any sent
end
(if 0 < length? currpar [append parblock currpar])
]

parse/all paragraphs parg

foreach currpar parblock [
print ["-----^/" currpar "^/-----"]
]

 [16/23] from: joel:neely:fedex at: 21-Sep-2000 7:08


I don't think that just breaking on {^/} solves the problem as posted. The objective, as I read it, was to break on PARAGRAPHS (not lines) where a paragraph is defined as the end of a sentence that coincides with the end of a line. In other words, there shouldn't be a break between lines that contain two (or more) parts of a multiple-line sentence.

-jn-

[rryost--home--com] wrote:

 [17/23] from: norsepower:uswest at: 21-Sep-2000 7:19


Point well-taken. It seems I have forgotten the KISS law. -Ryan Keep It Simple, Stupid.

 [18/23] from: brett:codeconscious at: 22-Sep-2000 0:13


Hey Joel,
> Re-remembering the subtle differences
> between BNF and REs was (for me, at least) the hardest part of
> getting productive with parse.

It would be really handy if you (or others) could list some of the differences you refer to. I've never actually used REs but have read up on them a bit. A list of comparisons would definitely aid understanding of both concepts!

Brett.

 [19/23] from: norsepower:uswest at: 21-Sep-2000 8:54


Ahh, yes, of course, the reason for my dilemma in the first place. Paragraphs are much different animals than "lines."

 [20/23] from: rchristiansen:pop:isdfa:sei-it at: 21-Sep-2000 13:11


Joel-

Thanks for the parsing routine. There are still a few things about it that I need to look at a few more times for me to understand it perfectly, but I was able to make a re-usable function out of your routine. (I also changed some of the 'words because I like my scripts to read like "plain English.") The function follows...

REBOL []

parse-paragraphs: func [
    "Parse a document into a block of paragraphs."
    document [string!] "the document to be parsed"
][
    paragraph-block: copy []
    current-paragraph: copy ""
    stopper: charset {.?!:}
    non-stopper: complement stopper
    fragment: copy ""
    sent: [
        copy fragment [any non-stopper stopper]
        (append current-paragraph fragment)
        [{^/} (append paragraph-block current-paragraph current-paragraph: copy "")
        |{"^/} (append paragraph-block append current-paragraph {"} current-paragraph: copy "")
        | none
        ]
    ]
    paragraph: [
        (paragraph-block: copy [] current-paragraph: copy "")
        any sent
        end
        (if 0 < length? current-paragraph [append paragraph-block current-paragraph])
    ]
    parse/all document paragraph
    paragraph-block
]

 [21/23] from: rryost:home at: 21-Sep-2000 12:05


Word wrapping in word processors and even email editors makes the definition of PARAGRAPH given below by Joel questionable, IMHO. Thinking about it, I recognize the start of a new paragraph in literature by the presence of a blank or empty line. Thus the sequence of two "new line" control characters would signal the start of a new PARAGRAPH. I think this is beyond the scope of the simple-parse approach. The more complex rule based functional approach developed in this thread by others is required.

I guess the person trying to solve this paragraph parse problem is in the best position to define what he means by "paragraph". As I recall the original problem, paragraphs *were* defined "according to Joel".

Russell [rryost--home--com]

 [22/23] from: joel:neely:fedex at: 21-Sep-2000 14:09


[RChristiansen--pop--isdfa--sei-it--com] wrote:
> Joel-
>
> ... I also changed some of the 'words because I like my scripts
> to read like "plain English.") The function follows...

Thanks! It's nice to watch these things evolve.

(As far as the coding style, I freely confess to getting fairly terse at times; I just can't type quickly enough to keep up with my thinking -- which tells you that I must be a REALLLLY slow typist! ;-)

-jn-

 [23/23] from: joel:neely:fedex at: 21-Sep-2000 15:20


Ooooh! Typo on my part. Apologies for any confusion it caused. Where I typed
> > The objective, as I read it, was to break on PARAGRAPHS (not lines)
> > where a paragraph is defined as the end of a sentence that coincides
> > with the end of a line...

I intended to be saying (correction in all caps)

The objective, as I read it, was to break INTO paragraphs (not lines) where THE END OF a paragraph is defined IN THE ORIGINAL MESSAGE as the end of a sentence that coincides with the end of the line...

I certainly agree that in "real text" the situation becomes much fuzzier and more contextual. For example, there's one school of thought in typography that insists that blank lines as paragraph separators are wrong; that one should use indentation only (without vertical whitespace) to indicate the start of a new paragraph. Of course, this requires that one keep up with the "normal" margins being used in the text, as well as using the multi-line context to distinguish indented-first-line-of-paragraph from indented-multiline-block-quote, etc...

[rryost--home--com] wrote:
> Word wrapping in word processors and even email editors makes the definition
> of PARAGRAPH given below by Joel questionable, IMHO. Thinking about it, I
<<quoted lines omitted: 3>>
> of the simple-parse approach. The more complex rule based functional
> approach developed in this thread by others is required.
Not really. Consider
>> text: {this is some^/text that^/flows.^/^/more sentences appear.}
== {this is some text that flows. more sentences appear.}
>> replace/all text "^/^/" #"^(ff)"
== {this is some text that flows.ÿmore sentences appear.}
>> parse/all text to-string #"^(ff)"
== ["this is some^/text that^/flows." "more sentences appear."]

So, using replace to find the empty lines (as consecutive newlines) allows us to crack the text with a simple parse. (I'm ducking the issue that there may be runs of more than two consecutive newlines. Figuring out how to remove those with a minimal number of replace statements is left as an exercise for the reader... ;-)

-jn-
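
For runs of more than two newlines, one straightforward (not minimal) approach is to shrink every run to exactly two before swapping in the marker. This is a sketch only, reusing the #"^(ff)" marker from the session above; the sample text and the words marker and parts are made up for illustration:

```rebol
text: copy {one.^/^/^/^/two.^/^/three.}
marker: to-string #"^(ff)"      ; assumption: never appears in the text

; shrink any run of three or more newlines down to exactly two
while [find text "^/^/^/"] [replace/all text "^/^/^/" "^/^/"]

replace/all text "^/^/" marker
parts: parse/all text marker
probe parts                     ; ["one." "two." "three."]
```

The loop repeats because a single replace/all pass can leave a shorter run of three behind; it stops as soon as no triple newline remains.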

Notes
  • Quoted lines have been omitted from some messages.