[REBOL] Parsing AEIOU and sometimes Y
From: edoconnor::gmail::com at: 1-May-2007 17:26
I've been playing with the Porter stemming algorithm
(http://www.tartarus.org/~martin/PorterStemmer/) in REBOL, with the
intent to generate synonym lists, word frequencies and tag-clouds. A
stemming algorithm is designed to strip suffixes from English words,
leaving behind the word stem. That is, a set of related words might be
mapped to a single root, e.g., deriving "program" from:
programs
programmed
programmer
programming
programmable
etc.
The basic Porter algorithm is not very complex, and is a good learning
exercise. However, there is a fundamental piece relating to 'parse
that I could use some help with.
From the original paper: http://www.tartarus.org/~martin/PorterStemmer/def.txt
A consonant in a word is a letter other than A, E, I, O or U, and other
than Y preceded by a consonant.
In other words, the letter "y" is considered to be a consonant when it
is preceded by a vowel (or when it starts the word, since nothing
precedes it). When "y" is preceded by another consonant, it is to be
interpreted as a vowel.
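Stated procedurally, outside of 'parse, the definition amounts to a
loop that remembers whether the previous letter was a consonant. The
following is only a minimal sketch of that reading: the names
`vowels` and `classify` are made up for illustration and do not come
from the Porter paper, and the input is assumed to contain letters
only.

```rebol
REBOL [Title: "Porter consonant/vowel classification (sketch)"]

vowels: charset "aeiouAEIOU"

; classify is a hypothetical helper, not part of the Porter paper.
; It returns one mark per letter: "c" for consonant, "v" for vowel.
classify: func [word [string!] /local out prev-con? con?] [
    out: copy ""
    prev-con?: false                ; nothing precedes the first letter
    foreach ch word [
        con?: either find "yY" ch [
            not prev-con?           ; Y is a vowel only after a consonant
        ][
            not find vowels ch      ; any letter other than A, E, I, O, U
        ]
        append out either con? ["c"] ["v"]
        prev-con?: con?             ; remember for the next letter
    ]
    out
]

print classify "toy"      ; cvc
print classify "crazy"    ; ccvcv
print classify "crybaby"  ; ccvcvcv
print classify "syzygy"   ; cvcvcv
```

The single flag `prev-con?` carries all the context the rule needs,
which is exactly the piece that is awkward to express as a pure
'parse rule.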
How would I approach this type of rule in 'parse? Intuition leads me
toward the following code:
vow: charset "aeiouAEIOU"
con: exclude charset [#"a" - #"x" #"A" - #"X" "zZ"] vow ; letters except y, minus the vowels
ycon: [vow "y" (print "y = consonant")]
yvow: [con "y" (print "y = vowel")]
rule: [any con some [vow | yvow] some [con | ycon] any vow]
parse "toy" rule ; y = consonant
parse "crazy" rule ; y = vowel
parse "crybaby" rule ; y = vowel; y = vowel
parse "yay" rule ; y = consonant ; y = consonant
My track record for intuition is poor, so I'm not surprised to
discover these sub-rules for "y" are wrong. Given that there is no
back-tracking in 'parse, I ask the gurus here for suggestions (or
insight) for this type of problem.
[And yes, I'll add the finished algo to the script library.]
Best,
Ed