parse or
[1/23] from: rchristiansen::pop::isdfa::sei-it::com at: 20-Sep-2000 15:00
>> paragraphs: {First paragraph.^/Second "paragraph."^/Third paragraph.}
== {First paragraph.
Second "paragraph."
Third paragraph.}
>> probe parse paragraphs [{.^/} | {."^/}]
false
== false
>> probe parse paragraphs ({.^/} or {."^/})
** Script Error: Cannot use or~ on string! value.
** Where: ".^/" or {."
So how DO I parse by a value OR another value?
-Ryan
[2/23] from: jeff:rebol at: 20-Sep-2000 12:26
Howdy, Ryan:
paragraphs: {First paragraph.^/Second "paragraph."^/Third paragraph.}
parse paragraphs [some [thru {.^/} | thru {."^/} | thru "." end]]
-jeff
[3/23] from: rryost:home at: 20-Sep-2000 13:54
Hi Ryan: Here's a one liner that may help:
>> st: "abcdef"
== "abcdef"
>> parse/all st "ed"
== ["abc" "" "f"] ; An inclusive OR, I guess.
>> parse/all st "gh"
== ["abcdef"] ; No splitting as neither "g" nor "h" is present.
>> parse/all st "gb"
== ["a" "cdef"] ; Split at the single char that matched.
Russell [rryost--home--com]
[4/23] from: rchristiansen:pop:isdfa:sei-it at: 20-Sep-2000 15:54
> paragraphs: {First paragraph.^/Second "paragraph."^/Third paragraph.}
> parse paragraphs [some [thru {.^/} | thru {."^/} | thru "." end]]
This returns "true"
But what if I'm trying to parse a report and wish to make each
paragraph a separate string within a block?
>> paragraphs: {First paragraph.^/Second "paragraph."^/Third paragraph.}
== {First paragraph.
Second "paragraph."
Third paragraph.}
>>
>> paragraphs-breakdown: []
== []
>>
>> foreach paragraph parse paragraphs [some [thru {.^/} | thru {."^/} | thru "." end]] [append paragraphs-breakdown paragraph]
** Script Error: foreach expected data argument of type: series.
** Where: foreach paragraph parse paragraphs [some [thru ".^/" | thru
{."} | thru "." end]]
Doesn't work.
Sorry if I'm being a pain, but when I read the "parse rules"
documentation it doesn't make any sense to me. I can't see the
usefulness of returning "true" in this situation.
[5/23] from: jeff:rebol at: 20-Sep-2000 13:41
Howdy, Ryan:
> > paragraphs: {First paragraph.^/Second "paragraph."^/Third paragraph.}
> > parse paragraphs [some [thru {.^/} | thru {."^/} | thru "." end]]
>
> This returns "true"
Yep, that means that PARSE successfully made it through the
whole string.
To break up your paragraphs, copy out the parts thru each
paragraph ending, like this:
pb: copy []
parse paragraphs [
    some [
        copy part [thru {.^/} | thru {."^/} | thru "." end]
        (append pb part)
    ]
]
pb
== ["First paragraph.^/" {Second "paragraph."
} "Third paragraph."]
> Sorry if I'm being a pain, but when I read the "parse
> rules" documentation it doesn't make any sense to me. I
> can't see the usefulness of returning "true" in this
> situation.
No pain whatsoever. I like trying to help.
-jeff
[6/23] from: rryost:home at: 20-Sep-2000 15:08
Better ignore my previous post on this subject! I assumed you were trying
to create a block with each component one of the paragraphs, using the ^/ as
a separator. But my stuff doesn't throw any light on that at all!
Please excuse my ineptness!
Russell [rryost--home--com]
[7/23] from: rryost:home at: 20-Sep-2000 15:19
Will Jeff's approach work if the paragraphs contain multiple periods?
Russell [rryost--home--com]
[8/23] from: joel:neely:fedex at: 20-Sep-2000 17:31
Hi, Jeff... you beat me to the "Send" button! ;-)
Since I'm now obligated to add value, instead of just saying
"me too", see additional remarks below.
[jeff--rebol--net] wrote:
> Howdy, Ryan:
> > > paragraphs: {First paragraph.^/Second "paragraph."^/Third paragraph.}
<<quoted lines omitted: 15>>
> == ["First paragraph.^/" {Second "paragraph."
> } "Third paragraph."]
Two other notes, one very minor, one very non-minor:
MINOR: Notice that the newline terminator is left on the end of
each internal paragraph (because thru is inclusive).
There are a variety of tricky ways to deal with this while parsing,
but it seems to me that it's simpler to post-process them off (if
they are undesirable).
NON-MINOR: Let's change the input data slightly, and try it again
>> pb: []
== []
>> paras: {First "para."^/Second para.^/Third para.}
== {First "para."
Second para.
Third para.}
>> parse paras [ some [
[ copy part [thru {.^/} | thru {."^/} | thru "." end]
[ (append pb part)
[ ]
[ ]
== true
>> pb
== [{First "para."
Second para.
} "Third para."]
Notice that now the result block has only TWO elements! Since the
first test (the thru {.^/} part) can succeed by grabbing text all
the way to the end of the SECOND paragraph, it does so, putting the
first two paragraphs into the first output string. I assumed that
this is NOT what you wanted, but rather you wanted to copy through
either {.^/} or {."^/} WHICHEVER COMES NEXT. (Natural language
text munching is a real pain, speaking from personal experience! ;-)
The strategies I've thought of (I don't have time to code, compare,
and recommend right at the moment) are:
1) Write more complicated parse rules, that either
1a) parse to newline, append the copied chunk to a paragraph
string under construction, then look at the tail end of
the last chunk to see whether it can be extended or whether
a new paragraph should be started (based on whether it
looked like the end of a sentence).
1b) parse to period, grab and append the next character if it
is a quotation mark, append to paragraph under construction,
and start a new paragraph if the next character is newline.
2) Use simpler parsing (break on newlines), then make a postpass
across the block of "lines", gluing back together wherever the
boundary isn't the end of a sentence.
Both sound mildly fuzzy (that's not as bad as "really hairy"!)
If you can control the original text, perhaps another convention
would be handy, such as breaking paragraphs whenever there's a
blank line (two consecutive newlines).
-jn-
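
Strategy 2 above (split on newlines, then glue lines back together unless the boundary looks like the end of a sentence) might look roughly like the sketch below. It is only a sketch, not anything posted in the thread: the helper name sentence-end? and the choice of ".", "?" and "!" as sentence terminators are assumptions made for illustration. It relies on the simple parse/all splitting shown elsewhere in this thread, which may treat lines that begin with a quotation mark differently.

```rebol
REBOL []

paras: {First "para."^/Second para.^/Third para.}

; Hypothetical helper (not from the thread): true when a line looks
; like it ends a sentence, optionally followed by a closing quote.
sentence-end?: func [line [string!] /local last-char] [
    if empty? line [return false]
    last-char: last line
    ; If the line ends in a quotation mark, look at the character before it.
    if all [last-char = #"^"" 1 < length? line] [
        last-char: pick line (length? line) - 1
    ]
    found? find ".?!" last-char
]

result: copy []
current: copy ""

; Split into lines first, then rebuild paragraphs line by line.
foreach line parse/all paras "^/" [
    if not empty? current [append current newline]
    append current line
    if sentence-end? line [
        append result current
        current: copy ""
    ]
]
; Keep any trailing text that never reached a sentence end.
if not empty? current [append result current]

probe result
```

For the input above this should produce three strings, one per paragraph, because each line ends a sentence. A line that stops mid-sentence stays attached to the line that follows it.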
[9/23] from: rchristiansen:pop:isdfa:sei-it at: 20-Sep-2000 18:11
> I assumed that
> this is NOT what you wanted, but rather you wanted to copy through
> either {.^/} or {."^/} WHICHEVER COMES NEXT. (Natural language
> text munching is a real pain, speaking from personal experience! ;-)
Yes, this is what I was looking for. As someone who has never
parsed anything before using REBOL (there will be more like me!) the
parsing rules are confusing to read in the REBOL docs. My
inclination is to want to use a simple statement which will parse until
a set of characters is reached OR a different set of characters is
reached, whichever comes along first and next.
> The strategies I've thought of (I don't have time to code, compare,
> and recommend right at the moment) are:
<<quoted lines omitted: 10>>
> across the block of "lines", gluing back together wherever the
> boundary isn't the end of a sentence.
You missed another option, which I had been using previously. Here
is the function:
breakdown-content: func [
    "breakdown an e-mail content field into its parts"
    msg [object!] "e-mail message"
][
    article-info: msg/content
    end-of-paragraph: rejoin [{.} newline]
    replace/all article-info end-of-paragraph {.~}
    content-parts: copy []
    foreach part parse/all article-info {~} [
        append content-parts trim/lines part
    ]
]
In other words, replace all instances of a set of characters with a new
character that can be recognized later. The above example needs to
be fixed because it only replaces instances of {.^/} with "~" and I've
discovered the tilde is a bad choice, anyway. I need to also be able
to replace any set of characters you might find at the end of a
paragraph, including {."^/} and {!^/} and {?^/} and {:^/} and {...^/} and
I'm sure there are more.
I was hoping there would be a quick way to use parse instead of
replacing characters first and then parsing.
-Ryan
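
One way the replace-then-split idea could be stretched to several paragraph endings is sketched below. The list of endings, the sample text, and the use of character 255 as the marker (instead of the tilde) are all assumptions for illustration; the marker only works if that character never occurs in the real text.

```rebol
REBOL []

article: {Is it done?^/He said "yes."^/Wait: more to come.^/Good!}

; Assumed marker: a character that should never appear in the text.
marker: to-string #"^(ff)"

; Assumed list of paragraph endings; extend as needed.
; {.} also covers {...} because only the final period precedes the newline.
foreach ending [{.} {!} {?} {:} {."} {!"} {?"}] [
    ; Turn "<ending><newline>" into "<ending><marker>".
    replace/all article join ending newline join ending marker
]

; Split on the marker only; every other character is kept as-is.
probe parse/all article marker
```

A newline that does not follow one of the listed endings is left in place, so a paragraph with embedded line breaks stays in one piece.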
[10/23] from: jeff:rebol at: 20-Sep-2000 18:03
Howdy, Joel:
> Notice that now the result block has only TWO elements!
> Since the first test (the thru {.^/} part) can succeed by
<<quoted lines omitted: 5>>
> language text munching is a >real pain, speaking from
> personal experience! ;-)
Sure. In the interests of advancing its popularity, I
offered up a simplistic example of PARSE. :-)
Paragraphs can end in a variety of punctuation ("!?.-;:),
with different quantities (as Russ pointed out), no?
-jeff
[11/23] from: rryost:home at: 20-Sep-2000 21:30
See my stuff interjected below:
Russell [rryost--home--com]
----- Original Message -----
From: <[RChristiansen--pop--isdfa--sei-it--com]>
To: <[list--rebol--com]>
Sent: Wednesday, September 20, 2000 1:54 PM
Subject: [REBOL] parse or Re:(2)
> > paragraphs: {First paragraph.^/Second "paragraph."^/Third paragraph.}
> > parse paragraphs [some [thru {.^/} | thru {."^/} | thru "." end]]
>
> This returns "true"
>
> But what if I'm trying to parse a report and wish to make each
> paragraph a separate string within a block?
Simple parsing with the /all refinement will do this in one step. The /all
refinement disables all the default delimiters and uses only the supplied
string of characters to break apart the target string. In this case, we'll
use the control character "^/", end of line, commonly used to end a
paragraph.
A console session follows that illustrates this. ( Note I've inserted an
extra "test" period within the first paragraph.)
>> paragraphs: {First. paragraph.^/Second "paragraph."^/Third paragraph.}
== {First. paragraph.
Second "paragraph."
Third paragraph.}
>> ; Now apply the simple parse/all with only the single break character ^/
>> parse/all paragraphs "^/"
== ["First. paragraph." {Second "paragraph."} "Third paragraph."]
end of console session.
This seems to be just what is wanted. {}'s are used for the second item
because it included " 's. The period at the end of First is ignored, along
with all the other spaces, ", etc because the /all refinement disabled the
usual default break chars.
.
The previous message now continues:
[12/23] from: norsepower:uswest at: 21-Sep-2000 0:00
Ahh, but this is not enough, because if the report has more than one newline
character following a paragraph, you will end up with empty paragraphs.
[13/23] from: rryost:home at: 21-Sep-2000 0:24
So what? Seems the application that's going to use the block of paragraphs
could easily deal with the "" for an empty paragraph. To me, that's
preferable to trying to outguess the final character of every conceivable
paragraph!
Russell [rryost--home--com]
[14/23] from: joel:neely:fedex at: 21-Sep-2000 6:51
[jeff--rebol--net] wrote:
> Sure. In the interests of advancing its popularity, I
> offered up a simplistic example of PARSE. :-)
>
> Paragraphs can end in a variety of punctuation ("!?.-;:),
> with different quantities (as Russ pointed out), no?
>
> -jeff
Sure! Nothing wrong with simple examples! My only reason for
pointing out that the first alternative would be taken (passing
up an instance of the second alternative) was that those of us
who "regularly" deal with regular expressions <groan /> find it
easy to get tripped up by this kind of behavior. With REs, I
can say "grab everything up to the first occurrence of any of
these complicated patterns" in a fairly obvious way. Re-remembering
the subtle differences between BNF and REs was (for me, at least)
the hardest part of getting productive with parse.
Just so I don't come across as being TOO negative ;-) please
see my next post!
-jn-
[15/23] from: joel:neely:fedex at: 21-Sep-2000 7:03
Hi, Ryan...
[RChristiansen--pop--isdfa--sei-it--com] wrote:
[snip]
> You missed another option, which I had been using previously. Here
> is the function:
>
[snip]
> In other words, replace all instances of a set of characters with
> a new character that can be recognized later...
>
You're absolutely right! Thanks for catching my omission. I often
use that trick when html-izing text, with the cliche below:
replace/all chunk-a-text {&} {@@@}
replace/all chunk-a-text {<} {&lt;}
replace/all chunk-a-text {>} {&gt;}
replace/all chunk-a-text {"} {&quot;}
replace/all chunk-a-text {@@@} {&amp;}
where the trick, of course, is to hide the ampersands before replacing
dangerous characters with entities that are escaped by ampersands.
To make up for my omission (and my lack of time/energy last night),
the following is a scheme for using parse to attack the paragraphing
problem you posted. I know it doesn't handle every possible case, but
I think it can be generalized in a fairly obvious way.
Enjoy!
-jn-
(Output first, as an appetizer! I'll leave the code un-indented to
facilitate cut-and-paste.)
=====================================================================
>> do %parsetest.r
-----
First paragraph ends here.
-----
-----
A sentence. End of second "paragraph."
-----
-----
I'm not sure. Is this the third paragraph?
-----
-----
The fourth paragraph
contains some embedded
linebreaks
along the way. Will
this work?
-----
-----
Another sentence. A "quotation." The end!
-----
-----
Well, maybe!
-----
>>
=====================================================================
REBOL []
paragraphs: {First paragraph ends here.
A sentence. End of second "paragraph."
I'm not sure. Is this the third paragraph?
The fourth paragraph
contains some embedded
linebreaks
along the way. Will
this work?
Another sentence. A "quotation." The end!
Well, maybe!}
parblock: copy []
currpar: copy ""
stopper: charset {.?!}
nonstop: complement stopper
fragment: copy ""
sent: [
copy fragment [any nonstop stopper]
(append currpar fragment)
[{^/} (append parblock currpar currpar: copy "")
|{"^/} (append parblock append currpar {"} currpar: copy "")
| none
]
]
parg: [
(parblock: copy [] currpar: copy "")
any sent end
(if 0 < length? currpar [append parblock currpar])
]
parse/all paragraphs parg
foreach currpar parblock [
print ["-----^/" currpar "^/-----"]
]
[16/23] from: joel:neely:fedex at: 21-Sep-2000 7:08
I don't think that just breaking on {^/} solves the problem as posted.
The objective, as I read it, was to break on PARAGRAPHS (not lines)
where a paragraph is defined as the end of a sentence that coincides
with the end of a line. In other words, there shouldn't be a break
between lines that contain two (or more) parts of a multiple-line
sentence.
-jn-
[rryost--home--com] wrote:
[17/23] from: norsepower:uswest at: 21-Sep-2000 7:19
Point well-taken. It seems I have forgotten the KISS law.
-Ryan
Keep It Simple, Stupid.
[18/23] from: brett:codeconscious at: 22-Sep-2000 0:13
Hey Joel,
> Re-remembering the subtle differences
> between BNF and REs was (for me, at least) the hardest part of
> getting productive with parse .
>
It would be really handy if you (or others) could list some of the
differences you refer to. I've never actually used REs but have read up on
them a bit.
A list of comparisons would definitely aid understanding of both concepts!
Brett.
[19/23] from: norsepower:uswest at: 21-Sep-2000 8:54
Ahh, yes, of course, the reason for my dilemma in the first place.
Paragraphs are much different animals than "lines."
[20/23] from: rchristiansen:pop:isdfa:sei-it at: 21-Sep-2000 13:11
Joel-
Thanks for the parsing routine. There are still a few things about it
that I need to look at a few more times for me to understand it
perfectly, but I was able to make a re-usable function out of your
routine. (I also changed some of the 'words because I like my scripts
to read like "plain English.") The function follows...
REBOL []
parse-paragraphs: func [
    "Parse a document into a block of paragraphs."
    document [string!] "the document to be parsed"
][
    paragraph-block: copy []
    current-paragraph: copy ""
    stopper: charset {.?!:}
    non-stopper: complement stopper
    fragment: copy ""
    sent: [
        copy fragment [any non-stopper stopper]
        (append current-paragraph fragment)
        [{^/} (append paragraph-block current-paragraph
               current-paragraph: copy "")
        |{"^/} (append paragraph-block append current-paragraph {"}
                current-paragraph: copy "")
        | none
        ]
    ]
    paragraph: [
        (paragraph-block: copy [] current-paragraph: copy "")
        any sent end
        (if 0 < length? current-paragraph [
            append paragraph-block current-paragraph
        ])
    ]
    parse/all document paragraph
    paragraph-block
]
[21/23] from: rryost:home at: 21-Sep-2000 12:05
Word wrapping in word processors and even email editors makes the definition
of PARAGRAPH given below by Joel questionable, IMHO. Thinking about it, I
recognize the start of a new paragraph in literature by the presence of a
blank or empty line. Thus the sequence of two "new line" control characters
would signal the start of a new PARAGRAPH. I think this is beyond the scope
of the simple-parse approach. The more complex rule based functional
approach developed in this thread by others is required.
I guess the person trying to solve this paragraph parse problem is in the
best position to define what he means by "paragraph". As I recall the
original problem, paragraphs *were* defined "according to Joel".
.Russell [rryost--home--com]
[22/23] from: joel:neely:fedex at: 21-Sep-2000 14:09
[RChristiansen--pop--isdfa--sei-it--com] wrote:
> Joel-
>
> ... I also changed some of the 'words because I like my scripts
> to read like "plain English.") The function follows...
>
Thanks! It's nice to watch these things evolve. (As far as the
coding style, I freely confess to getting fairly terse at times;
I just can't type quickly enough to keep up with my thinking --
which tells you that I must be a REALLLLY slow typist! ;-)
-jn-
[23/23] from: joel:neely:fedex at: 21-Sep-2000 15:20
Ooooh! Typo on my part. Apologies for any confusion it caused. Where
I typed
> >
> > The objective, as I read it, was to break on PARAGRAPHS (not lines)
> > where a paragraph is defined as the end of a sentence that coincides
> > with the end of a line...
> >
I intended to be saying (correction in all caps)
The objective, as I read it, was to break INTO paragraphs (not lines)
where THE END OF a paragraph is defined IN THE ORIGINAL MESSAGE as
the end of a sentence that coincides with the end of the line...
I certainly agree that in "real text" the situation becomes much fuzzier
and more contextual.
For example, there's one school of thought in typography that insists
that blank lines as paragraph separators are wrong; that one should
use indentation only (without vertical whitespace) to indicate the
start of a new paragraph. Of course, this requires that one keep up
with the "normal" margins being used in the text, as well as using the
multi-line context to distinguish indented-first-line-of-paragraph
from indented-multiline-block-quote, etc...
[rryost--home--com] wrote:
> Word wrapping in word processors and even email editors makes the definition
> of PARAGRAPH given below by Joel questionable, IMHO. Thinking about it, I
<<quoted lines omitted: 3>>
> of the simple-parse approach. The more complex rule based functional
> approach developed in this thread by others is required.
Not really. Consider
>> text: {this is some^/text that^/flows.^/^/more sentences appear.}
== {this is some
text that
flows.
more sentences appear.}
>> replace/all text "^/^/" #"^(ff)"
== {this is some
text that
flows.˙more sentences appear.}
>> parse/all text to-string #"^(ff)"
== ["this is some^/text that^/flows." "more sentences appear."]
So, using replace to find the empty lines (as consecutive newlines)
allows us to crack the text with a simple parse. (I'm ducking the
issue that there may be runs of more than two consecutive newlines.
Figuring out how to remove those with a minimal number of replace
statements is left as an exercise for the reader... ;-)
-jn-
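
For the exercise mentioned above, one possible approach (a sketch only, and not claimed to use the minimal number of replace statements) is to collapse every run of three or more newlines down to two before doing the blank-line split. The sample text is made up for illustration, and character 255 is again assumed not to occur in it.

```rebol
REBOL []

text: {first para.^/^/^/^/second line one^/second line two.^/^/third para.}

; Shrink runs of three or more newlines until only pairs remain.
; Each pass shortens the text, so the loop terminates.
while [find text "^/^/^/"] [
    replace/all text "^/^/^/" "^/^/"
]

; Every remaining pair of newlines is one blank-line paragraph break.
replace/all text "^/^/" #"^(ff)"

probe parse/all text to-string #"^(ff)"
```

Single newlines inside a paragraph survive the split, so wrapped lines stay together within their paragraph.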
Notes
- Quoted lines have been omitted from some messages.