[REBOL] parse or Re:(4)
From: joel:neely:fedex at: 20-Sep-2000 17:31
Hi, Jeff... you beat me to the "Send" button! ;-)
Since I'm now obligated to add value, instead of just saying
"me too", see additional remarks below.
[jeff--rebol--net] wrote:
> Howdy, Ryan:
>
> > > paragraphs: {First paragraph.^/Second "paragraph."^/Third paragraph.}
> > > parse paragraphs [some [thru {.^/} | thru {."^/} | thru "." end]]
> >
> > This returns "true"
>
> Yep, that means that PARSE successfully made it through the
> whole string.
>
> To break up your paragraphs, copy out the parts thru each
> paragraph ending, like this:
>
> pb: copy []
> parse paragraphs [
>     some [
>         copy part [thru {.^/} | thru {."^/} | thru "." end]
>         (append pb part)
>     ]
> ]
>
> pb
> == ["First paragraph.^/" {Second "paragraph."
> } "Third paragraph."]
>
Two other notes, one very minor, one very non-minor:
MINOR: Notice that the newline terminator is left on the end of
each internal paragraph (because thru is inclusive).
There are a variety of tricky ways to deal with this while parsing,
but it seems to me that it's simpler to post-process them off (if
they are undesirable).
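
A minimal sketch of that post-processing step, assuming pb holds the block built by the quoted rule (the literal data here just restates that result, and the loop variable name para is only illustrative):

```rebol
; Restate the result of the quoted parse so the snippet stands alone.
pb: copy/deep ["First paragraph.^/" {Second "paragraph."^/} "Third paragraph."]

; Drop one trailing newline from each paragraph, if it has one.
foreach para pb [
    if #"^/" = last para [remove back tail para]
]

probe pb
```

Each string in pb should now end at its period or closing quote rather than at a newline.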
NON-MINOR: Let's change the input data slightly, and try it again:
>> pb: []
== []
>> paras: {First "para."^/Second para.^/Third para.}
== {First "para."
Second para.
Third para.}
>> parse paras [ some [
[ copy part [thru {.^/} | thru {."^/} | thru "." end]
[ (append pb part)
[ ]
[ ]
== true
>> pb
== [{First "para."
Second para.
} "Third para."]
Notice that now the result block has only TWO elements! Since the
first test (the thru {.^/} part) can succeed by grabbing text all
the way to the end of the SECOND paragraph, it does so, putting the
first two paragraphs into the first output string. I assume that
this is NOT what you wanted, but rather that you wanted to copy through
either {.^/} or {."^/}, WHICHEVER COMES NEXT. (Natural language
text munching is a real pain, speaking from personal experience! ;-)
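
One possible way to get "whichever comes next" behaviour, offered as an untested-in-anger sketch rather than a recommendation: walk the input one character at a time and, at each position, check whether a paragraph terminator (a period, optionally followed by a quotation mark, then a newline or the end of input) starts right there. The marker words start and stop are names introduced only for this illustration.

```rebol
paras: {First "para."^/Second para.^/Third para.}
pb: copy []

parse/all paras [
    start:                          ; remember where the current paragraph begins
    any [
        ; a terminator here? period, optional quote, then newline or end
        "." opt {"} stop: [newline | end]
        (append pb copy/part start stop)   ; copy the paragraph, minus the newline
        start:                             ; the next paragraph begins here
        |
        skip                        ; otherwise move ahead one character
    ]
]

probe pb
```

With the sample input this should give three strings, none carrying a trailing newline. Any trailing text that does not end in a period or period-plus-quote is silently dropped by this sketch.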
The strategies I've thought of (I don't have time to code, compare,
and recommend right at the moment) are:
1) Write more complicated parse rules, that either
   1a) parse to newline, append the copied chunk to a paragraph
       string under construction, then look at the tail end of
       the last chunk to see whether it can be extended or whether
       a new paragraph should be started (based on whether it
       looked like the end of a sentence).
   1b) parse to period, grab and append the next character if it
       is a quotation mark, append to paragraph under construction,
       and start a new paragraph if the next character is newline.
2) Use simpler parsing (break on newlines), then make a postpass
   across the block of "lines", gluing back together wherever the
   boundary isn't the end of a sentence.
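
Strategy 2 could look roughly like the following. This is only a sketch under the assumption that a "sentence end" means a line finishing in a period or in a period followed by a quotation mark; sentence-end?, lines, result and current are hypothetical names, not anything from the original thread.

```rebol
paras: {First "para."^/Second para.^/Third para.}

; Pass 1: break the text on newlines.
lines: copy []
parse/all paras [
    any [copy line to newline skip (append lines any [line copy ""])]
    copy line to end (if line [append lines line])
]

; Hypothetical helper: does TEXT end in "." or in "." plus a quotation mark?
sentence-end?: func [text [string!] /local len] [
    len: length? text
    any [
        all [len >= 1  #"." = pick text len]
        all [len >= 2  #"^"" = pick text len  #"." = pick text (len - 1)]
    ]
]

; Pass 2: glue lines together until one ends a sentence.
result: copy []
current: copy ""
foreach line lines [
    if not empty? current [append current newline]
    append current line
    if sentence-end? line [
        append result current
        current: copy ""
    ]
]
if not empty? current [append result current]   ; keep any unfinished tail

probe result
```

For the sample data every line already ends a sentence, so result should hold three strings; a line broken mid-sentence would instead be joined to the line that follows it.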
Both sound mildly fuzzy (that's not as bad as "really hairy"!)
If you can control the original text, perhaps another convention
would be handy, such as breaking paragraphs whenever there's a
blank line (two consecutive newlines).
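
Under that blank-line convention the parse rule gets much simpler. A small sketch, assuming paragraphs are separated by exactly two consecutive newlines (the sample text is made up for illustration):

```rebol
text: {First "para."^/Still the first.^/^/Second para.^/^/Third para.}
pb: copy []

parse/all text [
    ; copy up to each blank line, then step over the two newlines
    any [copy part to "^/^/" 2 skip (if part [append pb part])]
    ; whatever remains is the last paragraph
    copy part to end (if part [append pb part])
]

probe pb
```

Here a single newline inside a paragraph is kept as ordinary text, so the first element should contain both of its lines.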
-jn-