Mailing List Archive: 49091 messages

[REBOL] parse or Re:(4)

From: joel:neely:fedex at: 20-Sep-2000 17:31

Hi, Jeff... you beat me to the "Send" button! ;-)

Since I'm now obligated to add value, instead of just saying "me too", see additional remarks below.

[jeff--rebol--net] wrote:
> Howdy, Ryan:
>
> > > paragraphs: {First paragraph.^/Second "paragraph."^/Third paragraph.}
> > > parse paragraphs [some [thru {.^/} | thru {."^/} | thru "." end]]
> >
> > This returns "true"
>
> Yep, that means that PARSE successfully made it through the
> whole string.
>
> To break up your paragraphs, copy out the parts thru each
> paragraph ending, like this:
>
>     pb: copy []
>     parse paragraphs [
>         some [
>             copy part [thru {.^/} | thru {."^/} | thru "." end]
>             (append pb part)
>         ]
>     ]
>
>     pb
>     == ["First paragraph.^/" {Second "paragraph."
>     } "Third paragraph."]
>
Two other notes, one very minor, one very non-minor:

MINOR: Notice that the newline terminator is left on the end of each internal paragraph (because thru is inclusive). There are a variety of tricky ways to deal with this while parsing, but it seems to me that it's simpler to post-process them off (if they are undesirable).

NON-MINOR: Let's change the input data slightly, and try it again:

>> pb: []
== []
>> paras: {First "para."^/Second para.^/Third para.}
== {First "para."
Second para.
Third para.}
>> parse paras [ some [
[        copy part [thru {.^/} | thru {."^/} | thru "." end]
[        (append pb part)
[    ]
[ ]
== true
>> pb
== [{First "para."
Second para.
} "Third para."]

Notice that now the result block has only TWO elements! Since the first test (the thru {.^/} part) can succeed by grabbing text all the way to the end of the SECOND paragraph, it does so, putting the first two paragraphs into the first output string.

I assumed that this is NOT what you wanted, but rather you wanted to copy through either {.^/} or {."^/} WHICHEVER COMES NEXT. (Natural language text munching is a real pain, speaking from personal experience! ;-)

The strategies I've thought of (I don't have time to code, compare, and recommend right at the moment) are:

1) Write more complicated parse rules, that either

   1a) parse to newline, append the copied chunk to a paragraph string under construction, then look at the tail end of the last chunk to see whether it can be extended or whether a new paragraph should be started (based on whether it looked like the end of a sentence).

   1b) parse to period, grab and append the next character if it is a quotation mark, append to paragraph under construction, and start a new paragraph if the next character is newline.

2) Use simpler parsing (break on newlines), then make a postpass across the block of "lines", gluing back together wherever the boundary isn't the end of a sentence.

Both sound mildly fuzzy (that's not as bad as "really hairy"!).

If you can control the original text, perhaps another convention would be handy, such as breaking paragraphs whenever there's a blank line (two consecutive newlines).

-jn-
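
For illustration only, here is a minimal, untested sketch of strategy 2 (break on newlines, then glue lines back together wherever the boundary isn't the end of a sentence). It is not from the original thread. The names lines, current, and sentence-end? are hypothetical, and the test for "end of a sentence" (the text ends in a period, or in a period followed by a quotation mark) is an assumption taken from the sample data, not a general rule.

```rebol
REBOL [Title: "Paragraph splitting sketch (strategy 2)"]

paras: {First "para."^/Second para.^/Third para.}

; Pass 1: break the text on newlines only.
; In REBOL 2, COPY of an empty span sets the word to NONE, hence the guards.
lines: copy []
parse/all paras [
    any [copy line to newline skip (append lines any [line copy ""])]
    copy line to end (if line [append lines line])
]

; Hypothetical helper: does the string look like the end of a sentence?
; Assumption: a sentence ends in "." or in {."} (period, then closing quote).
sentence-end?: func [s [string!]] [
    found? any [
        all [not empty? s  #"." = last s]
        all [1 < length? s  {."} = copy skip tail s -2]
    ]
]

; Pass 2: glue lines together until the accumulated text ends a sentence.
pb: copy []
current: copy ""
foreach line lines [
    if not empty? current [append current newline]
    append current line
    if sentence-end? current [
        append pb current
        current: copy ""
    ]
]
; Keep any trailing text that never reached a sentence end.
if not empty? current [append pb current]

probe pb    ; for the sample input above, pb should hold three strings
```

Because the newline is consumed during the first pass and only re-inserted when two lines are glued, the resulting paragraphs carry no trailing newline, which also covers the MINOR point above. A line that ends in an abbreviation such as "etc." would still be mistaken for a paragraph end, so this remains as fuzzy as the strategy itself.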