Parsing AEIOU and sometimes Y

[1/16] from: edoconnor::gmail::com at: 1-May-2007 17:26

I've been playing with the Porter stemming algorithm (http://www.tartarus.org/~martin/PorterStemmer/) in REBOL, with the intent to generate synonym lists, word frequencies and tag-clouds.

A stemming algorithm is designed to strip suffixes from English words, leaving behind the word stem. That is, a set of related words might be mapped to a single root, e.g., deriving "program" from:

    programs
    programmed
    programmer
    programming
    programmable
    etc.

The basic Porter algorithm is not very complex, and is a good learning exercise. Having said that, however, there is a fundamental piece relating to 'parse that I could use some help with.

From the original paper:
http://www.tartarus.org/~martin/PorterStemmer/def.txt

"A consonant in a word is a letter other than A, E, I, O or U, and other than Y preceded by a consonant."

In other words, the letter "y" is considered to be a consonant when it is preceded by a vowel. When "y" is preceded by another consonant, it is to be interpreted as a vowel.

How would I approach this type of rule in 'parse? Intuition leads me toward the following code:

    vow: charset "aeiouAEIOU"
    con: charset [#"a" - #"x" #"A" - #"X" "zZ"]
    ycon: [vow "y" (print "y = consonant")]
    yvow: [con "y" (print "y = vowel")]
    rule: [any con some [vow | yvow] some [con | ycon] any vow]

    parse "toy" rule      ; y = consonant
    parse "crazy" rule    ; y = vowel
    parse "crybaby" rule  ; y = vowel; y = vowel
    parse "yay" rule      ; y = consonant ; y = vowel

My track record for intuition is poor, so I'm not surprised to discover these sub-rules for "y" are wrong. Given that there is no back-tracking in 'parse, I ask the gurus here for suggestions (or insight) for this type of problem.

[And yes, I'll add the finished algo to the script library.]

Best,
Ed
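The definition quoted in this message can also be written without 'parse, which may be useful as a reference when checking the rules proposed later in the thread. The following is only a minimal sketch under stated assumptions: `classify` is a hypothetical helper name (not from the thread or the paper), and it assumes the input contains letters only. It marks each letter "v" or "c", treating Y as a vowel only when the previous letter was classified as a consonant.

```rebol
REBOL [Title: "Sketch: Porter vowel/consonant classification"]

vowels: charset "aeiouAEIOU"

; classify: hypothetical helper, returns one mark ("v" or "c") per letter.
; Assumption: WORD contains letters only.
classify: func [word [string!] /local result prev-consonant] [
    result: copy ""
    prev-consonant: false            ; nothing precedes the first letter
    foreach ch word [
        either any [
            find vowels ch                       ; a, e, i, o, u
            all [find "yY" ch  prev-consonant]   ; y right after a consonant
        ][
            append result "v"
            prev-consonant: false
        ][
            append result "c"
            prev-consonant: true
        ]
    ]
    result
]

print classify "toy"      ; cvc      (y = consonant)
print classify "crazy"    ; ccvcv    (y = vowel)
print classify "crybaby"  ; ccvcvcv  (both y = vowel)
print classify "yay"      ; cvc      (both y = consonant)
```

Tracking only whether the previous letter was a consonant is enough here, because the definition never looks further back than one letter.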

[2/16] from: gregg::pointillistic::com at: 1-May-2007 17:00

Re: [REBOL parse] Parsing AEIOU and sometimes Y

Hi Ed,

EOC> "A consonant in a word is a letter other than A, E, I, O or U, and other
EOC> than Y preceded by a consonant."

EOC> In other words, the letter "y" is considered to be a consonant when it
EOC> is preceded by a vowel. When "y" is preceded by another consonant is
EOC> to be interpreted as a vowel.

Having not read the algorithm spec, is this on track?

    vow: charset "aeiouAEIOU"
    y: charset "yY"
    con: exclude charset [#"a" - #"z" #"A" - #"Z"] union vow y
    ycon: [vow y (print "y = consonant")]
    yvow: [con y (print "y = vowel")]
    just-y: [y (print "y = consonant")]
    rule: [some [yvow | ycon | con | vow | just-y]]

    parse "toy" rule      ; y = consonant
    parse "crazy" rule    ; y = vowel
    parse "crybaby" rule  ; y = vowel; y = vowel
    parse "yay" rule      ; y = consonant ; y = consonant
    parse "yoyo" rule     ; y = consonant ; y = consonant

-- Gregg

[3/16] from: edoconnor::gmail::com at: 1-May-2007 20:57

Hi Gregg--

This is shaping up pretty well. This is the path I was headed down before I assumed there must be a better way.

A key piece of the algorithm requires counting the number ("measure") of vowel-consonant pair-sets according to the following pattern:

    [C] VC (m)... [V]

The value of m is then used to apply suffix swapping rules. So... setting aside the "sometimes y" issue, the above can be expressed in 'parse as:

    rule: [any consonant [some vowel some consonant (m: m + 1)] any vowel]

Once we introduce the "sometimes y" requirement, this simple rule becomes messy quickly. At the end of the day, I need to add up the matches of the [some vowel some consonant] rule with that added "sometimes y" complexity.

Thanks to your pointers, I think I'll be able to continue pursuing it in this direction, hopefully keeping the mess under control.

Thanks,
Ed

On 5/1/07, Gregg Irwin wrote:

[4/16] from: anton:wilddsl:au at: 2-May-2007 16:33

Hi Ed,

Here's my go at it:

    ; y can't be a consonant, if it's preceded by a consonant

    vow: charset "aeiouAEIOU"
    y: charset "yY"
    con: exclude charset [#"a" - #"z" #"A" - #"Z"] union vow y
    y-as-vowel: [y (print "y = vowel")]
    y-as-consonant: [y pos: vow :pos (print "y = consonant")] ; to be interpreted as a consonant, y must also be followed by a vowel
    vowel: [[vow | y-as-vowel] (consonant/1: [con | y-as-consonant])] ; enable y as a consonant
    consonant: [[con | y-as-consonant] (consonant/1: [con])] ; disable y as a consonant
    rule: [(consonant/1: [con | y-as-consonant]) any [consonant | vowel]]

    parse "toy" rule      ; y = vowel
    parse "crazy" rule    ; y = vowel
    parse "crybaby" rule  ; y = vowel; y = vowel
    parse "yay" rule      ; y = consonant ; y = vowel
    parse "yoyo" rule     ; y = consonant ; y = consonant

The whole Y issue is complicated, isn't it? Y can be a vowel, consonant or semi-vowel (a type of consonant)...

Regards,
Anton.

[5/16] from: lmecir::mbox::vol::cz at: 2-May-2007 10:15

Hi Ed, ...

> A key piece of the algorithm requires counting the number ("measure")
> of vowel-consonant pair-sets according to the following pattern:

<<quoted lines omitted: 11>>

> Thanks,
> Ed

how about this one:

    simple-vowel: charset "aeiouAEIOU"
    simple-consonant: exclude charset [#"b" - #"x" #"z" #"B" - #"X" #"Z"] simple-vowel
    y: charset "yY"

    pbc: first [
        (
            preceded-by-consonant: none
            not-preceded-by-consonant: [end skip]
        )
    ]

    not-pbc: first [
        (
            preceded-by-consonant: [end skip]
            not-preceded-by-consonant: none
        )
    ]

    vowel: [
        [simple-vowel | preceded-by-consonant y (print "y = vowel")]
        not-pbc
    ]

    consonant: [
        [simple-consonant | not-preceded-by-consonant y (print "y consonant")]
        pbc
    ]

    rule: [not-pbc any [consonant | vowel]]

-L

[6/16] from: santilli:gabriele:gmai:l at: 2-May-2007 12:18

2007/5/2, Anton Rolls <anton-wilddsl.net.au>:

> Hi Ed,
>
> Here's my go at it:

[...]

Looks like a state machine to me.

    vowel: charset "aoeuiAOEUI"
    y: charset "yY"
    consonant: exclude charset [#"b" - #"x" #"z" #"B" - #"X" #"Z"] vowel

    start: [consonant after-cons | vowel after-vow | y after-y | end]
    after-cons: [consonant after-cons | vowel after-vow | y after-cy | end]
    after-vow: [consonant after-cons | vowel after-vow | y after-vy | end]
    after-y: [vowel (print "y = consonant") after-vow | consonant (print "???") after-cons | end (print "???")]
    after-cy: [consonant (print "y = vowel") after-cons | vowel (print "y = semivowel?") after-vow | end (print "y = vowel")]
    after-vy: [consonant (print "y = semivowel?") after-cons | vowel (print "y = consonant") after-vow | end (print "y = vowel")]

Note, since this implementation is recursive, it is not usable to parse long text. It is easy to make it non-recursive though, although it may be a bit less readable. (Another approach for better readability may be to use a FSM interpreter like mine: http://www.colellachiara.com/soft/MD3/fsm.html )

HTH,
Gabriele.

[7/16] from: santilli::gabriele::gmail::com at: 2-May-2007 12:19

2007/5/2, Gabriele Santilli <santilli.gabriele-gmail.com>:

> after-y: [vowel (print "y = consonant") after-vow | consonant (print
> "???") after-cons | end (print "???")]

Sorry for the wrapping, it's Gmail's fault. (Can it be overridden?)

Regards,
Gabriele.

[8/16] from: edoconnor::gmail::com at: 2-May-2007 11:41

Ok, my head is starting to hurt. Thanks for the suggestions everyone.

First, a small correction from my original email:

>> parse "yay" rule ; y = consonant ; y = consonant

I believe all of the solutions provided successfully handle the "sometimes y" case. I admit that I became a bit dizzy trying to work through them.

To recap, the ultimate goal is to count V C pairs/groups, where V equals an arbitrary number of vowels, and C equals an arbitrary number of consonants, i.e.,

    simple-VC-rule: [some vowel some consonant (m: m + 1)]

For "banana" m=2, for "elephant" m=3, for "schmaltz" m=1, for "beauty" m=1.

Given the need to count the VC pattern within words, my updated pseudocode follows:

    Y: charset "yY"
    vowel: charset "aeiouAEIOU"
    consonant: exclude charset [#"a" - #"z" #"A" - #"Z"] union vowel Y
    yvow: [consonant Y]
    ycon: [vowel Y]
    C: [consonant | Y]
    VC: [some [yvow | vowel] some [ycon | consonant] (m: m + 1)]
    V: [yvow | vowel]
    rule: [any C VC any V]

    m: 0 parse "toy" rule print m      ; m=1
    m: 0 parse "crazy" rule print m    ; m=1
    m: 0 parse "crybaby" rule print m  ; m=2
    m: 0 parse "yay" rule print m      ; m=1

I'm not able to trigger (m: m + 1) and record a match for the pattern VC like I'm able to for 'simple-VC-rule above. It would be very helpful for me to understand this better, because it serves as a hidden line of demarcation beyond which 'parse solutions become dramatically more complex.

Thanks again for your co-exploration. This is purely an exercise and I'm not looking for anyone to solve this for me. From time to time I uncover little rabbit holes in my understanding of 'parse and other areas of REBOL. I think it would be really helpful to understand some of these murky areas, and to refine ways of explaining these concepts to others.

Thanks for your time.
Ed

[9/16] from: lmecir:mbox:vol:cz at: 2-May-2007 19:00

Hi Ed,

> Ok, my head is starting to hurt. Thanks for the suggestions everyone.
>
> First, a small correction from my original email:
>
>>> parse "yay" rule ; y = consonant ; y = consonant
>>>

yes, that is what Gregg suggested...

> I believe all of the solutions provided successfully handle the
> "sometimes y" case. I admit that I became a bit dizzy trying to work
> through them.
>

you should try the cut and paste (clipboard://) method

the above rule does not count pairs, rule [(m: 0) some [vowel consonant (m: m + 1)]] does.

[10/16] from: edoconnor::gmail at: 2-May-2007 14:24

On 5/2/07, Ladislav Mecir wrote:

> To recap, the ultimate goal is to count V C pairs/groups, where V
> equals an arbitrary number of vowels, and C equals an arbitrary number
> of consonants, i.e.,
>
> simple-VC-rule: [some vowel some consonant (m: m + 1)]
>
>> the above rule does not count pairs, rule [(m: 0) some [vowel consonant
>> (m: m + 1)]] does.

Thanks, you are right. Prior to the "sometimes y" complication, I had been relying on the following rule to build out the algorithm:

    [any consonant any [some vowel some consonant (m: m + 1)] any vowel]

This simplified rule was successful for generating the m value for words-- based on my test data. I misquoted this rule in my email.

Thanks for the clarification.
Ed
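For readers who want to try the simplified rule from this message on its own, here is a small sketch. It is an illustration under assumptions, not Ed's actual script: `simple-measure` is a hypothetical wrapper name, and the two charsets deliberately ignore the "sometimes y" rule, so Y always counts as a consonant.

```rebol
REBOL [Title: "Sketch: simplified VC counting, Y always a consonant"]

vowel: charset "aeiouAEIOU"
; every other letter, including y, is treated as a consonant here
consonant: exclude charset [#"a" - #"z" #"A" - #"Z"] vowel

; simple-measure: hypothetical wrapper around the rule quoted above
simple-measure: func [word [string!] /local m] [
    m: 0
    parse word [
        any consonant
        any [some vowel some consonant (m: m + 1)]
        any vowel
    ]
    m
]

print simple-measure "banana"    ; 2
print simple-measure "elephant"  ; 3
print simple-measure "schmaltz"  ; 1
print simple-measure "crybaby"   ; 1, whereas the "sometimes y" rule gives 2
```

The last call shows where the simplification stops matching the paper: both Ys in "crybaby" follow a consonant, so the full definition treats them as vowels.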

[11/16] from: lmecir:mbox:vol:cz at: 2-May-2007 20:47

Ed O'Connor wrote:

> On 5/2/07, Ladislav Mecir wrote:
>> To recap, the ultimate goal is to count V C pairs/groups, where V

<<quoted lines omitted: 13>>

> Thanks for the clarification.
> Ed

you can forget about the "sometimes" y when using my suggestion:

    simple-vowel: charset "aeiouAEIOU"
    y: charset "yY"
    simple-consonant: exclude charset [
        #"b" - #"x" #"z" #"B" - #"X" #"Z"
    ] simple-vowel

    pbc: first [
        (
            preceded-by-consonant: none
            not-preceded-by-consonant: [end skip]
        )
    ]

    not-pbc: first [
        (
            preceded-by-consonant: [end skip]
            not-preceded-by-consonant: none
        )
    ]

    vowel: [
        [simple-vowel | preceded-by-consonant y]
        not-pbc
    ]

    consonant: [
        [simple-consonant | not-preceded-by-consonant y]
        pbc
    ]

    rule: [
        (m: 0)
        any consonant
        any [some vowel some consonant (m: m + 1)]
        any vowel
    ]

    parse "banana" rule

the trouble is, that your pair counting rule still doesn't look exact

-L

[12/16] from: edoconnor::gmail::com at: 2-May-2007 15:25

Hi Ladislav--

On 5/2/07, Ladislav Mecir wrote:

> you can forget about the "sometimes" y when using my suggestion:

I'd be delighted to forget about that! I sure am glad there aren't other special rules, such as "i before e, except after c."

> parse "banana" rule
> the trouble is, that your pair counting rule still doesn't look exact

Hmmm... it appears to generate correct results... here is the relevant passage from the research paper:

    [C]VCVC ... [V]

where the square brackets denote arbitrary presence of their contents. Using (VC){m} to denote VC repeated m times, this may again be written as

    [C](VC){m}[V].

m will be called the measure of any word or word part when represented in this form. The case m=0 covers the null word. Here are some examples:

    m=0    TR, EE, TREE, Y, BY.
    m=1    TROUBLE, OATS, TREES, IVY.
    m=2    TROUBLES, PRIVATE, OATEN, ORRERY.

The rule you've written returns the correct m for these words, as well as anything I've thrown at it so far.

Thanks
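The measure described in the quoted passage can also be computed with a plain loop, which gives an independent way to cross-check any 'parse rule against the paper's examples. This is a hedged sketch rather than anyone's posted solution: `measure` is a hypothetical function name, the input is assumed to contain letters only, and m is incremented each time a run of vowels is followed by a consonant.

```rebol
REBOL [Title: "Sketch: Porter measure without parse"]

vowels: charset "aeiouAEIOU"

; measure: hypothetical helper; counts VC groups in [C](VC){m}[V].
; Assumption: WORD contains letters only.
measure: func [word [string!] /local m prev-consonant in-vowel-run is-vowel] [
    m: 0
    prev-consonant: false     ; nothing precedes the first letter
    in-vowel-run: false
    foreach ch word [
        is-vowel: any [
            find vowels ch
            all [find "yY" ch  prev-consonant]   ; y after a consonant
        ]
        either is-vowel [
            in-vowel-run: true
            prev-consonant: false
        ][
            if in-vowel-run [m: m + 1]   ; a vowel run just ended: one VC
            in-vowel-run: false
            prev-consonant: true
        ]
    ]
    m
]

print measure "tree"      ; 0
print measure "by"        ; 0
print measure "trouble"   ; 1
print measure "ivy"       ; 1
print measure "troubles"  ; 2
print measure "orrery"    ; 2
```

The test words are taken from the examples in the passage above, so the printed values can be compared directly with the m=0, m=1 and m=2 lists.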

[13/16] from: lmecir:mbox:vol:cz at: 2-May-2007 23:01

Ed O'Connor wrote:

> Hi Ladislav--
> On 5/2/07, Ladislav Mecir wrote:

<<quoted lines omitted: 20>>

> as anything I've thrown at it so far.
> Thanks

aha, sorry, I thought that "tree" was supposed to yield m = 1. Here is an optimized version:

    vowel-after-consonant: charset "aeiouAEIOU"
    vowel-otherwise: charset "aeiouyAEIOUY"
    consonant-after-consonant: exclude charset [
        #"a" - #"z" #"A" - #"Z"
    ] vowel-otherwise
    consonant-otherwise: union consonant-after-consonant charset "yY"

    after-consonant: first [
        (
            vowel: [vowel-after-consonant otherwise]
            consonant: [consonant-after-consonant after-consonant]
        )
    ]

    otherwise: first [
        (
            vowel: [vowel-otherwise otherwise]
            consonant: [consonant-otherwise after-consonant]
        )
    ]

    rule: [
        (m: 0)
        otherwise
        any consonant
        any [some vowel some consonant (m: m + 1)]
        any vowel
    ]

-L

[14/16] from: edoconnor::gmail::com at: 2-May-2007 20:01

That's really great Ladislav.

The only thing that saddens me somewhat is that I would probably never arrive at a solution like this on my own (the same goes for the other solutions offered, of course).

For those new to 'parse, would you be so kind as to step through your code and explain it? If you prefer, you can send your comments to me directly and I will work with you to produce a short explanation that may be appropriate for newbies.

From here, I'll complete the algorithm, create a tag-cloud indexer and see if I can call an old favor from Oldes to generate a Flash interface.

Best regards and thanks to all,
Ed

On 5/2/07, Ladislav Mecir wrote:

[15/16] from: lmecir:mbox:vol:cz at: 3-May-2007 8:48

OK, Ed, a commented version:

> That's really great Ladislav.
> The only thing that saddens me somewhat is that I would probably never

<<quoted lines omitted: 9>>

> Best regards and thanks to all,
> Ed

    ; vowel variants
    vowel-after-consonant: charset "aeiouAEIOU"
    vowel-otherwise: charset "aeiouyAEIOUY"

    ; consonant variants
    consonant-after-consonant: exclude charset [
        #"a" - #"z" #"A" - #"Z"
    ] vowel-otherwise
    consonant-otherwise: union consonant-after-consonant charset "yY"

    ; adjusting the Vowel and Consonant rules to the Otherwise state
    otherwise: first [
        (
            ; vowel detection does not change state
            vowel: vowel-otherwise
            ; consonant detection changes state to After-consonant
            consonant: [consonant-otherwise after-consonant]
        )
    ]

    ; adjusting the Vowel and Consonant rules to the After-consonant state
    after-consonant: first [
        (
            ; vowel detection provokes transition to the Otherwise state
            vowel: [vowel-after-consonant otherwise]
            ; consonant detection does not change state
            consonant: consonant-after-consonant
        )
    ]

    rule: [
        ; initialization
        (
            ; zeroing the counter
            m: 0
        )
        ; setting the state to Otherwise
        otherwise
        ; initialization end

        any consonant
        any [some vowel some consonant (m: m + 1)]
        any vowel
    ]

-L

[16/16] from: lmecir::mbox::vol::cz at: 3-May-2007 8:58

Ladislav Mecir wrote:

sorry, I managed to swap the charset definitions; here is a correction:

    ; vowel variants
    vowel-after-consonant: charset "aeiouyAEIOUY"
    vowel-otherwise: charset "aeiouAEIOU"

    ; consonant variants
    consonant-after-consonant: exclude charset [
        #"a" - #"z" #"A" - #"Z"
    ] vowel-after-consonant
    consonant-otherwise: union consonant-after-consonant charset "yY"

    ; adjusting the Vowel and Consonant rules to the Otherwise state
    otherwise: first [
        (
            ; vowel detection does not change state
            vowel: vowel-otherwise
            ; consonant detection changes state to After-consonant
            consonant: [consonant-otherwise after-consonant]
        )
    ]

    ; adjusting the Vowel and Consonant rules to the After-consonant state
    after-consonant: first [
        (
            ; vowel detection provokes transition to the Otherwise state
            vowel: [vowel-after-consonant otherwise]
            ; consonant detection does not change state
            consonant: consonant-after-consonant
        )
    ]

    rule: [
        ; initialization
        (
            ; zeroing the counter
            m: 0
        )
        ; setting the state to Otherwise
        otherwise
        ; initialization end

        any consonant
        any [some vowel some consonant (m: m + 1)]
        any vowel
    ]

-L

Notes

Quoted lines have been omitted from some messages.