[REBOL] Parsing AEIOU and sometimes Y

From: edoconnor::gmail::com at: 1-May-2007 17:26

I've been playing with the Porter stemming algorithm (http://www.tartarus.org/~martin/PorterStemmer/) in REBOL, with the intent to generate synonym lists, word frequencies and tag-clouds.

A stemming algorithm is designed to strip suffixes from English words, leaving behind the word stem. That is, a set of related words might be mapped to a single root, e.g., deriving "program" from: programs, programmed, programmer, programming, programmable, etc.

The basic Porter algorithm is not very complex, and is a good learning exercise. Having said that, however, there is a fundamental piece relating to 'parse that I could use some help with.

From the original paper (http://www.tartarus.org/~martin/PorterStemmer/def.txt):

    A consonant in a word is a letter other than A, E, I, O or U, and
    other than Y preceded by a consonant.

In other words, the letter "y" is considered to be a consonant when it is preceded by a vowel. When "y" is preceded by another consonant, it is to be interpreted as a vowel.

How would I approach this type of rule in 'parse? Intuition leads me toward the following code:

    vow: charset "aeiouAEIOU"
    con: charset [#"a" - #"x" #"A" - #"X" "zZ"]
    ycon: [vow "y" (print "y = consonant")]
    yvow: [con "y" (print "y = vowel")]
    rule: [any con some [vow | yvow] some [con | ycon] any vow]

    parse "toy" rule      ; y = consonant
    parse "crazy" rule    ; y = vowel
    parse "crybaby" rule  ; y = vowel; y = vowel
    parse "yay" rule      ; y = consonant ; y = vowel

My track record for intuition is poor, so I'm not surprised to discover these sub-rules for "y" are wrong. Given that there is no back-tracking in 'parse, I ask the gurus here for suggestions (or insight) for this type of problem.

[And yes, I'll add the finished algo to the script library.]

Best,
Ed
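
For reference, the consonant definition quoted above can be sketched as a plain loop outside of 'parse, which at least pins down the behaviour a 'parse rule would have to reproduce. This is only a minimal sketch: the function name classify and the "c"/"v" output string are hypothetical, and it assumes that a "y" at the very start of a word (with no letter before it) counts as a consonant, which is how I read the definition.

```rebol
REBOL []

vowels: charset "aeiouAEIOU"

; classify (hypothetical helper): returns a string with one "c" or "v"
; per letter of the word, following the definition quoted from def.txt
classify: func [word [string!] /local out prev-con ch] [
    out: copy ""
    prev-con: false          ; assumption: nothing precedes the first letter
    foreach ch word [
        either any [
            find vowels ch                  ; a, e, i, o, u
            all [find "yY" ch  prev-con]    ; y preceded by a consonant
        ][
            append out #"v"
            prev-con: false
        ][
            append out #"c"
            prev-con: true
        ]
    ]
    out
]

print classify "toy"      ; cvc
print classify "crazy"    ; ccvcv
print classify "syzygy"   ; cvcvcv
```

The point of the sketch is that the status of "y" depends only on how the previous letter was classified, so a single left-to-right pass with one flag is enough and no back-tracking is needed.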