Mailing List Archive: parse or Re:(4)

[REBOL] parse or Re:(4)

From: joel:neely:fedex at: 20-Sep-2000 17:31


Hi, Jeff...  you beat me to the "Send" button! ;-)

Since I'm now obligated to add value, instead of just saying
"me too", see additional remarks below.

[jeff--rebol--net] wrote:
>     Howdy, Ryan:
>
> > > paragraphs: {First paragraph.^/Second "paragraph."^/Third paragraph.}
> > > parse paragraphs [some [thru {.^/}  | thru {."^/} | thru "." end]]
> >
> > This returns "true"
>
>   Yep, that means that PARSE successfully made it through the
>   whole string.
>
>   To break up your paragraphs, copy out the parts thru each
>   paragraph ending, like this:
>
> pb: copy []
> parse  paragraphs [
>     some [
>        copy part [thru  {.^/} | thru {."^/} | thru  "." end]
>        (append pb part)
>     ]
> ]
>
> pb
> == ["First paragraph.^/" {Second "paragraph."
> } "Third paragraph."]
>

Two other notes, one very minor, one very non-minor:

MINOR:  Notice that the newline terminator is left on the end of
each internal paragraph (because thru is inclusive).  There are a
variety of tricky ways to deal with this while parsing, but it
seems to me that it's simpler to strip the newlines off in a
post-processing pass (if they are undesirable).
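
A minimal sketch of such a post-processing pass, assuming the
result block looks like the one quoted above (the data is retyped
here purely for illustration, and the word para is a made-up name):

```rebol
; Illustrative data only: the result block from the quoted example.
pb: copy/deep ["First paragraph.^/" {Second "paragraph."^/} "Third paragraph."]

; Drop one trailing newline from each paragraph, modifying it in place.
foreach para pb [
    if all [not empty? para  equal? newline last para] [
        remove back tail para
    ]
]
```

Only a single trailing newline is removed per string, so nothing
inside a paragraph is touched.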

NON-MINOR:  Let's change the input data slightly, and try it again:

    >> pb: []
    == []
    >> paras: {First "para."^/Second para.^/Third para.}
    == {First "para."
    Second para.
    Third para.}
    >> parse  paras [ some [
    [        copy part [thru  {.^/} | thru {."^/} | thru  "." end]
    [        (append pb part)
    [        ]
    [    ]
    == true
    >> pb
    == [{First "para."
    Second para.
    } "Third para."]

Notice that now the result block has only TWO elements!  The first
paragraph ends in {."^/}, which does not contain {.^/}, so the first
alternative (the  thru {.^/}  part) succeeds by grabbing text all
the way to the end of the SECOND paragraph, putting the first two
paragraphs into the first output string.  I assume that this is NOT
what you wanted, and that you wanted to copy through either {.^/}
or {."^/}, WHICHEVER COMES NEXT.  (Natural language text munching
is a real pain, speaking from personal experience! ;-)

The strategies I've thought of (I don't have time to code, compare,
and recommend right at the moment) are:

1)  Write more complicated parse rules that either
    1a)  parse to newline, append the copied chunk to a paragraph
         string under construction, then look at the tail end of
         the last chunk to see whether it can be extended or whether
         a new paragraph should be started (based on whether it
         looked like the end of a sentence).
    1b)  parse to period, grab and append the next character if it
         is a quotation mark, append to paragraph under construction,
         and start a new paragraph if the next character is newline.
2)  Use simpler parsing (break on newlines), then make a postpass
    across the block of "lines", gluing back together wherever the
    boundary isn't the end of a sentence.

Both sound mildly fuzzy (that's not as bad as "really hairy"!)
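
For what it's worth, strategy 2 might be sketched roughly as
follows.  This is an illustration rather than a recommendation:
the helper sentence-end? and the words lines, line and current are
names made up here, and "end of a sentence" is assumed to mean
nothing more than a line ending in a period or in {."}.

```rebol
paras: {First "para."^/Second para.^/Third para.}

; Step 1: simple parsing, breaking on newlines only.
lines: copy []
parse/all paras [
    any [copy line to newline skip (append lines any [line copy ""])]
    copy line to end (if line [append lines line])
]

; Assumed test for "looks like the end of a sentence":
; the text ends in a period, or in a period followed by a quote.
sentence-end?: func [s [string!]] [
    found? any [
        all [not empty? s  equal? #"." last s]
        all [(length? s) > 1  equal? {."} skip tail s -2]
    ]
]

; Step 2: postpass that glues lines back together until the
; accumulated text looks like it ends a sentence.
pb: copy []
current: copy ""
foreach line lines [
    if not empty? current [append current newline]
    append current line
    if sentence-end? current [
        append pb current
        current: copy ""
    ]
]
; Keep any unterminated text left over at the end.
if not empty? current [append pb current]
```

With the modified input above, each of the three lines ends a
sentence, so the intent is for each to land in its own entry of pb.
A line that stops mid-sentence would instead be joined to the line
after it.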

If you can control the original text, perhaps another convention
would be handy, such as breaking paragraphs whenever there's a
blank line (two consecutive newlines).
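
Under that convention the rule gets much simpler.  A minimal
sketch, assuming paragraphs are separated by exactly two
consecutive newlines (the sample text and the word text are
invented for illustration):

```rebol
; Hypothetical input: paragraphs separated by a blank line.
text: {One.^/still one.^/^/Two.^/^/Three.}

pb: copy []
parse/all text [
    ; Copy up to each blank line, then step over the two newlines.
    any [copy part to "^/^/" 2 skip (if part [append pb part])]
    ; Whatever remains after the last blank line is the final paragraph.
    copy part to end (if part [append pb part])
]
```

Single newlines inside a paragraph are left alone, so sentence
punctuation no longer matters for finding the breaks.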

-jn-