Parsing AEIOU and sometimes Y
[1/16] from: edoconnor::gmail::com at: 1-May-2007 17:26
I've been playing with the Porter stemming algorithm
(http://www.tartarus.org/~martin/PorterStemmer/) in REBOL, with the
intent to generate synonym lists, word frequencies and tag-clouds. A
stemming algorithm is designed to strip suffixes from English words,
leaving behind the word stem. That is, a set of related words might be
mapped to a single root, e.g., deriving "program" from:
programs
programmed
programmer
programming
programmable
etc.
The basic Porter algorithm is not very complex, and is a good learning
exercise. Having said that, however, there is a fundamental piece
relating to 'parse that I could use some help with.
From the original paper: http://www.tartarus.org/~martin/PorterStemmer/def.txt
A consonant in a word is a letter other than A, E, I, O or U, and other
than Y preceded by a consonant.
In other words, the letter "y" is considered to be a consonant when it
is preceded by a vowel. When "y" is preceded by another consonant, it is
to be interpreted as a vowel.
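Outside of 'parse, the definition can be written down almost word for word. The following is a minimal Python sketch (not REBOL, and not taken from the paper); the names is_consonant and cv_form are hypothetical, and the sketch assumes purely alphabetic input. A "y" at the start of a word has nothing before it, so it falls out as a consonant.

```python
VOWELS = set("aeiou")

def is_consonant(word, i):
    """Porter's definition: a letter other than A, E, I, O, U,
    and other than Y preceded by a consonant."""
    ch = word[i].lower()
    if ch in VOWELS:
        return False
    if ch == "y":
        # A leading y has no consonant before it, so it counts as a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True

def cv_form(word):
    """Label every letter of word as C (consonant) or V (vowel)."""
    return "".join("C" if is_consonant(word, i) else "V"
                   for i in range(len(word)))

for w in ("toy", "crazy", "crybaby", "yay", "yoyo"):
    print(w, cv_form(w))
```

The difficulty in 'parse is the same lookup of the previous letter that this sketch does with is_consonant(word, i - 1).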
How would I approach this type of rule in 'parse? Intuition leads me
toward the following code:
vow: charset "aeiouAEIOU"
con: charset [#"a" - #"x" #"A" - #"X" "zZ"]
ycon: [vow "y" (print "y = consonant")]
yvow: [con "y" (print "y = vowel")]
rule: [any con some [vow | yvow] some [con | ycon] any vow]
parse "toy" rule ; y = consonant
parse "crazy" rule ; y = vowel
parse "crybaby" rule ; y = vowel; y = vowel
parse "yay" rule ; y = consonant ; y = vowel
My track record for intuition is poor, so I'm not surprised to
discover these sub-rules for "y" are wrong. Given that there is no
back-tracking in 'parse, I ask the gurus here for suggestions (or
insight) for this type of problem.
[And yes, I'll add the finished algo to the script library.]
Best,
Ed
[2/16] from: gregg::pointillistic::com at: 1-May-2007 17:00
Re: [REBOL parse] Parsing AEIOU and sometimes Y
Hi Ed,
EOC> "A consonant in a word is a letter other than A, E, I, O or U, and other
EOC> than Y preceded by a consonant."
EOC> In other words, the letter "y" is considered to be a consonant when it
EOC> is preceded by a vowel. When "y" is preceded by another consonant, it is
EOC> to be interpreted as a vowel.
Having not read the algorithm spec, is this on track?
vow: charset "aeiouAEIOU"
y: charset "yY"
con: exclude charset [#"a" - #"z" #"A" - #"Z"] union vow y
ycon: [vow y (print "y = consonant")]
yvow: [con y (print "y = vowel")]
just-y: [y (print "y = consonant")]
rule: [some [yvow | ycon | con | vow | just-y]]
parse "toy" rule ; y = consonant
parse "crazy" rule ; y = vowel
parse "crybaby" rule ; y = vowel; y = vowel
parse "yay" rule ; y = consonant ; y = consonant
parse "yoyo" rule ; y = consonant ; y = consonant
-- Gregg
[3/16] from: edoconnor::gmail::com at: 1-May-2007 20:57
Hi Gregg--
This is shaping up pretty well. This is the path I was headed down
before I assumed there must be a better way.
A key piece of the algorithm requires counting the number ("measure")
of vowel-consonant pair-sets according to the following pattern:
[C] VC (m)... [V]
The value of m is then used to apply suffix swapping rules. So...
setting aside the "sometimes y" issue, the above can be expressed in
'parse as:
rule: [any consonant [some vowel some consonant (m: m + 1)] any vowel]
Once we introduce the "sometimes y" requirement, this simple rule
becomes messy quickly. At the end of the day, I need to count matches of
the [some vowel some consonant] rule with that added "sometimes y"
complexity. Thanks to your pointers, I think I'll be able to continue
pursuing it in this direction, hopefully keeping the mess under
control.
Thanks,
Ed
[4/16] from: anton:wilddsl:au at: 2-May-2007 16:33
Hi Ed,
Here's my go at it:
; y can't be a consonant, if it's preceded by a consonant
vow: charset "aeiouAEIOU"
y: charset "yY"
con: exclude charset [#"a" - #"z" #"A" - #"Z"] union vow y
y-as-vowel: [y (print "y = vowel")]
; to be interpreted as a consonant, y must also be followed by a vowel
y-as-consonant: [y pos: vow :pos (print "y = consonant")]
; enable y as a consonant
vowel: [[vow | y-as-vowel] (consonant/1: [con | y-as-consonant])]
; disable y as a consonant
consonant: [[con | y-as-consonant] (consonant/1: [con])]
rule: [(consonant/1: [con | y-as-consonant]) any [consonant | vowel]]
parse "toy" rule ; y = vowel
parse "crazy" rule ; y = vowel
parse "crybaby" rule ; y = vowel; y = vowel
parse "yay" rule ; y = consonant ; y = vowel
parse "yoyo" rule ; y = consonant ; y = consonant
The whole Y issue is complicated, isn't it ?
Y can be a vowel, consonant or semi-vowel (a type of consonant)...
Regards,
Anton.
[5/16] from: lmecir::mbox::vol::cz at: 2-May-2007 10:15
Hi Ed,
...
> A key piece of the algorithm requires counting the number ("measure")
> of vowel-consonant pair-sets according to the following pattern:
<<quoted lines omitted: 11>>
> Thanks,
> Ed
how about this one:
simple-vowel: charset "aeiouAEIOU"
simple-consonant: exclude charset [#"b" - #"x" #"z" #"B" - #"X" #"Z"]
simple-vowel
y: charset "yY"
pbc: first [
(
preceded-by-consonant: none
not-preceded-by-consonant: [end skip]
)
]
not-pbc: first [
(
preceded-by-consonant: [end skip]
not-preceded-by-consonant: none
)
]
vowel: [
[simple-vowel | preceded-by-consonant y (print "y = vowel")] not-pbc
]
consonant: [
[simple-consonant | not-preceded-by-consonant y (print "y consonant")] pbc
]
rule: [not-pbc any [consonant | vowel]]
-L
[6/16] from: santilli:gabriele:gmai:l at: 2-May-2007 12:18
2007/5/2, Anton Rolls <anton-wilddsl.net.au>:
> Hi Ed,
>
> Here's my go at it:
[...]
Looks like a state machine to me.
vowel: charset "aoeuiAOEUI"
y: charset "yY"
consonant: exclude charset [#"b" - #"x" #"z" #"B" - #"X" #"Z"] vowel
start: [consonant after-cons | vowel after-vow | y after-y | end]
after-cons: [consonant after-cons | vowel after-vow | y after-cy | end]
after-vow: [consonant after-cons | vowel after-vow | y after-vy | end]
after-y: [vowel (print "y = consonant") after-vow | consonant (print "???") after-cons | end (print "???")]
after-cy: [consonant (print "y = vowel") after-cons | vowel (print "y = semivowel?") after-vow | end (print "y = vowel")]
after-vy: [consonant (print "y = semivowel?") after-cons | vowel (print "y = consonant") after-vow | end (print "y = vowel")]
Note, since this implementation is recursive, it is not usable to
parse long text. It is easy to make it non-recursive, although
it may be a bit less readable.
(Another approach for better readability may be to use an FSM
interpreter like mine: http://www.colellachiara.com/soft/MD3/fsm.html)
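To illustrate the non-recursive form, here is a rough Python sketch (not REBOL) that drives the same states from a lookup table inside a plain loop. The state names and labels are copied from the rules above; TABLE, letter_class and label_ys are hypothetical names, and the input is assumed to be alphabetic. A "y" directly after another "y" has no branch in the rules above, so the sketch raises an error for that case rather than guessing.

```python
VOWELS = set("aeiou")

def letter_class(ch):
    """Map a letter to the input class used by the state table."""
    ch = ch.lower()
    if ch in VOWELS:
        return "vowel"
    return "y" if ch == "y" else "consonant"

# (state, input class) -> (next state, label for the pending y, or None)
TABLE = {
    ("start", "consonant"): ("after-cons", None),
    ("start", "vowel"): ("after-vow", None),
    ("start", "y"): ("after-y", None),
    ("after-cons", "consonant"): ("after-cons", None),
    ("after-cons", "vowel"): ("after-vow", None),
    ("after-cons", "y"): ("after-cy", None),
    ("after-vow", "consonant"): ("after-cons", None),
    ("after-vow", "vowel"): ("after-vow", None),
    ("after-vow", "y"): ("after-vy", None),
    ("after-y", "vowel"): ("after-vow", "y = consonant"),
    ("after-y", "consonant"): ("after-cons", "???"),
    ("after-y", "end"): (None, "???"),
    ("after-cy", "consonant"): ("after-cons", "y = vowel"),
    ("after-cy", "vowel"): ("after-vow", "y = semivowel?"),
    ("after-cy", "end"): (None, "y = vowel"),
    ("after-vy", "consonant"): ("after-cons", "y = semivowel?"),
    ("after-vy", "vowel"): ("after-vow", "y = consonant"),
    ("after-vy", "end"): (None, "y = vowel"),
}

def label_ys(word):
    """Return the labels the rules would print, one per y in word."""
    state, labels = "start", []
    for cls in [letter_class(ch) for ch in word] + ["end"]:
        if cls == "end" and state in ("start", "after-cons", "after-vow"):
            break  # these states accept the end of input silently
        if (state, cls) not in TABLE:
            raise ValueError("no transition from %s on %s" % (state, cls))
        state, label = TABLE[(state, cls)]
        if label is not None:
            labels.append(label)
    return labels

print(label_ys("yoyo"))
print(label_ys("crazy"))
```

Because the loop keeps only the current state, the length of the input no longer affects the call depth.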
HTH,
Gabriele.
[7/16] from: santilli::gabriele::gmail::com at: 2-May-2007 12:19
2007/5/2, Gabriele Santilli <santilli.gabriele-gmail.com>:
> after-y: [vowel (print "y = consonant") after-vow | consonant (print
> "???") after-cons | end (print "???")]
Sorry for the wrapping, it's Gmail's fault. (Can it be overridden?)
Regards,
Gabriele.
[8/16] from: edoconnor::gmail::com at: 2-May-2007 11:41
Ok, my head is starting to hurt. Thanks for the suggestions everyone.
First, a small correction from my original email:
>> parse "yay" rule ; y = consonant ; y = consonant
I believe all of the solutions provided successfully handle the
"sometimes y" case. I admit that I became a bit dizzy trying to work
through them.
To recap, the ultimate goal is to count V C pairs/groups, where V
equals an arbitrary number of vowels, and C equals an arbitrary number
of consonants, i.e.,
simple-VC-rule: [some vowel some consonant (m: m + 1)]
For "banana" m=2, for "elephant" m=3, for "schmaltz" m=1, for "beauty" m=1.
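As a sanity check on those four values, here is a small Python sketch (not REBOL) that counts every point where a run of vowels is closed by a consonant. It deliberately sets the "sometimes y" rule aside and treats "y" as a plain consonant, which is an assumption that happens not to change the result for these words; simple_measure is a hypothetical name.

```python
def simple_measure(word, vowels="aeiou"):
    """Count VC groups, treating y as an ordinary consonant."""
    m = 0
    prev_is_vowel = False
    for ch in word.lower():
        is_vowel = ch in vowels
        if prev_is_vowel and not is_vowel:
            m += 1  # a vowel run has just been closed by a consonant
        prev_is_vowel = is_vowel
    return m

for w in ("banana", "elephant", "schmaltz", "beauty"):
    print(w, simple_measure(w))
```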
Given the need to count the VC pattern within words, my updated
pseudocode follows:
Y: charset "yY"
vowel: charset "aeiouAEIOU"
consonant: exclude charset [#"a" - #"z" #"A" - #"Z"] union vowel Y
yvow: [consonant Y]
ycon: [vowel Y]
C: [consonant | Y]
VC: [some [yvow | vowel] some [ycon | consonant] (m: m + 1)]
V: [yvow | vowel]
rule: [any C VC any V]
m: 0 parse "toy" rule print m ; m=1
m: 0 parse "crazy" rule print m ; m=1
m: 0 parse "crybaby" rule print m ; m=2
m: 0 parse "yay" rule print m ; m=1
I'm not able to trigger (m: m + 1) and record a match for the pattern
VC like I'm able to for 'simple-VC-rule above. It would be very
helpful for me to understand this better, because it serves as a
hidden line of demarcation beyond which 'parse solutions become
dramatically more complex.
Thanks again for your co-exploration. This is purely an exercise and
I'm not looking for anyone to solve this for me. From time to time I
uncover little rabbit holes in my understanding of 'parse and other areas
of REBOL. I think it would be really helpful to understand some of
these murky areas, and to refine ways of explaining these concepts to
others.
Thanks for your time.
Ed
[9/16] from: lmecir:mbox:vol:cz at: 2-May-2007 19:00
Hi Ed,
> Ok, my head is starting to hurt. Thanks for the suggestions everyone.
>
> First, a small correction from my original email:
>
>>> parse "yay" rule ; y = consonant ; y = consonant
>>>
yes, that is what Gregg suggested...
> I believe all of the solutions provided successfully handle the
> "sometimes y" case. I admit that I became a bit dizzy trying to work
> through them.
>
you should try the cut and paste (clipboard://) method
> To recap, the ultimate goal is to count V C pairs/groups, where V
> equals an arbitrary number of vowels, and C equals an arbitrary number
> of consonants, i.e.,
>
> simple-VC-rule: [some vowel some consonant (m: m + 1)]
>
the above rule does not count pairs, rule [(m: 0) some [vowel consonant
(m: m + 1)]] does.
[10/16] from: edoconnor::gmail at: 2-May-2007 14:24
On 5/2/07, Ladislav Mecir wrote:
> To recap, the ultimate goal is to count V C pairs/groups, where V
> equals an arbitrary number of vowels, and C equals an arbitrary number
> of consonants, i.e.,
>
> simple-VC-rule: [some vowel some consonant (m: m + 1)]
>
>> the above rule does not count pairs, rule [(m: 0) some [vowel consonant
>> (m: m + 1)]] does.
Thanks, you are right. Prior to the "sometimes y" complication, I had
been relying on the following rule to build out the algorithm:
[any consonant any [some vowel some consonant (m: m + 1)] any vowel]
This simplified rule was successful for generating the m value for
words-- based on my test data. I misquoted this rule in my email.
Thanks for the clarification.
Ed
[11/16] from: lmecir:mbox:vol:cz at: 2-May-2007 20:47
Ed O'Connor wrote:
> On 5/2/07, Ladislav Mecir wrote:
>> To recap, the ultimate goal is to count V C pairs/groups, where V
<<quoted lines omitted: 13>>
> Thanks for the clarification.
> Ed
you can forget about the "sometimes" y when using my suggestion:
simple-vowel: charset "aeiouAEIOU"
y: charset "yY"
simple-consonant: exclude charset [
#"b" - #"x" #"z" #"B" - #"X" #"Z"
] simple-vowel
pbc: first [
(
preceded-by-consonant: none
not-preceded-by-consonant: [end skip]
)
]
not-pbc: first [
(
preceded-by-consonant: [end skip]
not-preceded-by-consonant: none
)
]
vowel: [
[simple-vowel | preceded-by-consonant y] not-pbc
]
consonant: [
[simple-consonant | not-preceded-by-consonant y] pbc
]
rule: [
(m: 0)
any consonant any [some vowel some consonant (m: m + 1)]
any vowel
]
parse "banana" rule
the trouble is, that your pair counting rule still doesn't look exact
-L
[12/16] from: edoconnor::gmail::com at: 2-May-2007 15:25
Hi Ladislav--
On 5/2/07, Ladislav Mecir wrote:
> you can forget about the "sometimes" y when using my suggestion:
I'd be delighted to forget about that! I sure am glad there aren't
other special rules, such as "i before e, except after c."
> parse "banana" rule
> the trouble is, that your pair counting rule still doesn't look exact
Hmmm... it appears to generate correct results... here is the relevant
passage from the research paper:
[C]VCVC ... [V]
where the square brackets denote arbitrary presence of their contents.
Using (VC){m} to denote VC repeated m times, this may again be written
as
[C](VC){m}[V].
m will be called the measure of any word or word part when represented in
this form. The case m=0 covers the null word. Here are some examples:
m=0 TR, EE, TREE, Y, BY.
m=1 TROUBLE, OATS, TREES, IVY.
m=2 TROUBLES, PRIVATE, OATEN, ORRERY.
The rule you've written returns the correct m for these words, as well
as anything I've thrown at it so far.
Thanks
[13/16] from: lmecir:mbox:vol:cz at: 2-May-2007 23:01
Ed O'Connor wrote:
> Hi Ladislav--
> On 5/2/07, Ladislav Mecir wrote:
<<quoted lines omitted: 20>>
> as anything I've thrown at it so far.
> Thanks
aha, sorry, I thought that "tree" was supposed to yield m = 1. Here is
an optimized version:
vowel-after-consonant: charset "aeiouAEIOU"
vowel-otherwise: charset "aeiouyAEIOUY"
consonant-after-consonant: exclude charset [
#"a" - #"z" #"A" - #"Z"
] vowel-otherwise
consonant-otherwise: union consonant-after-consonant charset "yY"
after-consonant: first [
(
vowel: [vowel-after-consonant otherwise]
consonant: [consonant-after-consonant after-consonant]
)
]
otherwise: first [
(
vowel: [vowel-otherwise otherwise]
consonant: [consonant-otherwise after-consonant]
)
]
rule: [
(m: 0)
otherwise
any consonant
any [some vowel some consonant (m: m + 1)]
any vowel
]
-L
[14/16] from: edoconnor::gmail::com at: 2-May-2007 20:01
That's really great Ladislav.
The only thing that saddens me somewhat is that I would probably never
arrive at a solution like this on my own (the same goes for the other
solutions offered, of course).
For those new to 'parse, would you be so kind as to step through your
code and explain it? If you prefer, you can send your comments to me
directly and I will work with you to produce a short explanation that
may be appropriate for newbies.
From here, I'll complete the algorithm, create a tag-cloud indexer and
see if I can call in an old favor from Oldes to generate a Flash
interface.
Best regards and thanks to all,
Ed
[15/16] from: lmecir:mbox:vol:cz at: 3-May-2007 8:48
OK, Ed, a commented version:
> That's really great Ladislav.
> The only thing that saddens me somewhat is that I would probably never
<<quoted lines omitted: 9>>
> Best regards and thanks to all,
> Ed
; vowel variants
vowel-after-consonant: charset "aeiouAEIOU"
vowel-otherwise: charset "aeiouyAEIOUY"
; consonant variants
consonant-after-consonant: exclude charset [
#"a" - #"z" #"A" - #"Z"
] vowel-otherwise
consonant-otherwise: union consonant-after-consonant charset "yY"
; adjusting the Vowel and Consonant rules to the Otherwise state
otherwise: first [
(
; vowel detection does not change state
vowel: vowel-otherwise
; consonant detection changes state to After-consonant
consonant: [consonant-otherwise after-consonant]
)
]
; adjusting the Vowel and Consonant rules to the After-consonant state
after-consonant: first [
(
; vowel detection provokes transition to the Otherwise state
vowel: [vowel-after-consonant otherwise]
; consonant detection does not change state
consonant: consonant-after-consonant
)
]
rule: [
; initialization
(
; zeroing the counter
m: 0
)
; setting the state to Otherwise
otherwise
; initialization end
any consonant
any [some vowel some consonant (m: m + 1)]
any vowel
]
-L
[16/16] from: lmecir::mbox::vol::cz at: 3-May-2007 8:58
Ladislav Mecir wrote:
sorry, I managed to swap the charset definitions; here is a correction:
; vowel variants
vowel-after-consonant: charset "aeiouyAEIOUY"
vowel-otherwise: charset "aeiouAEIOU"
; consonant variants
consonant-after-consonant: exclude charset [
#"a" - #"z" #"A" - #"Z"
] vowel-after-consonant
consonant-otherwise: union consonant-after-consonant charset "yY"
; adjusting the Vowel and Consonant rules to the Otherwise state
otherwise: first [
(
; vowel detection does not change state
vowel: vowel-otherwise
; consonant detection changes state to After-consonant
consonant: [consonant-otherwise after-consonant]
)
]
; adjusting the Vowel and Consonant rules to the After-consonant state
after-consonant: first [
(
; vowel detection provokes transition to the Otherwise state
vowel: [vowel-after-consonant otherwise]
; consonant detection does not change state
consonant: consonant-after-consonant
)
]
rule: [
; initialization
(
; zeroing the counter
m: 0
)
; setting the state to Otherwise
otherwise
; initialization end
any consonant
any [some vowel some consonant (m: m + 1)]
any vowel
]
-L
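For readers who want to cross-check the state idea outside of 'parse, here is a minimal Python sketch (not REBOL). It keeps one flag for the After-consonant state, starts in the Otherwise state, and bumps m whenever a vowel run is closed by a consonant. The name measure and the flag names are hypothetical; the test words are the examples quoted from the paper earlier in the thread, and alphabetic input is assumed.

```python
PLAIN_VOWELS = set("aeiou")

def measure(word):
    """Porter measure m of word, with y a vowel only right after a consonant."""
    m = 0
    after_consonant = False  # False corresponds to the Otherwise state
    in_vowel_run = False
    for ch in word.lower():
        if ch == "y":
            is_vowel = after_consonant
        else:
            is_vowel = ch in PLAIN_VOWELS
        if in_vowel_run and not is_vowel:
            m += 1  # one more VC group completed
        in_vowel_run = is_vowel
        after_consonant = not is_vowel
    return m

EXPECTED = {
    0: ["TR", "EE", "TREE", "Y", "BY"],
    1: ["TROUBLE", "OATS", "TREES", "IVY"],
    2: ["TROUBLES", "PRIVATE", "OATEN", "ORRERY"],
}

for m, words in EXPECTED.items():
    for w in words:
        print(w, measure(w), "expected", m)
```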
Notes
- Quoted lines have been omitted from some messages.