[REBOL] Re: Frequency of phrases
From: tomc:darkwing:uoregon at: 23-Aug-2002 11:50
On Thu, 22 Aug 2002, Louis A. Turk wrote:
> Hi rebols,
>
> Goal: To find the length and frequency of use of all the unique phrases in
> a text file.
>
> Phrase: A phrase will be defined as a string len characters long and with a
> space at each end. All phrases 100 characters long are to be processed
> first, then all phrases of length len - 1 and so on until len = 5.
>
> The text file: To simplify things, manually place a space at the beginning
> and at the end of the file to be processed. To further simplify things,
> place a space before all punctuation marks.
>
> Achieving this goal is proving to be quite a bit more complicated than I
> thought at first, and will be extremely time consuming if not done properly.
>
> What is the best way to do this?
ask on the list then pick your solution
> Louis
quick and dirty
rebol[]
buf: read %<whatever>
replace/all buf "^/" " "
replace/all buf "." " ."
replace/all buf "!" " !"
replace/all buf "?" " ?"
replace/all buf "  " " "
insert buf " "
append buf " "
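; end: index of the first character of the last word
end: index? next find/reverse find/last buf " " " "
hsh: make hash! (length? buf)
cnt: 0
phr: copy ""
fub: copy ""
while [(index? buf) < end] [
    ; shortest candidate: from buf up to the third space at or after buf
    fub: find next find next find buf " " " " " "
    phr: trim copy/part buf either fub
        [fub] [fub: back tail buf]
    ; keep going while the phrase is at most 100 chars and at least 3 words
    while [all [(length? phr) < 101
                (length? parse phr none) > 2
                not tail? fub]] [
        ; count this phrase
        cnt: select hsh phr
        either cnt
            [change next find hsh phr (cnt + 1)]
            [append hsh reduce [:phr 1]]
        ; grow the phrase by one word; searching from next fub (not fub,
        ; which already sits on a space) avoids counting a phrase twice
        fub: find next fub " "
        either fub
            [phr: trim copy/part buf fub]
            [fub: tail buf]
    ]
    ; move the start of the window to the next word
    buf: next find buf " "
]
; turn the hash into [phrase count] pairs, most frequent first
shsh: copy []
foreach [k v] hsh [append/only shsh reduce [k v]]
sort/compare shsh func [a b] [a/2 > b/2]
foreach sh shsh [print sh]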
--------------------------------------------
not so sure I would want phrases to span over
sentence-ending punctuation but that is what you asked for
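
if you would rather stop at sentence boundaries, one rough sketch
(not tested here, and assuming REBOL/Core 2.x parse with charsets) is to
cut buf into sentences first and then count phrases inside each sentence
instead of over the whole buffer. the words sentences, punct, other and s
are made up for this example, and the punctuation marks themselves are
dropped:

; sketch only: split buf on . ! ? before counting phrases
sentences: copy []
punct: charset ".!?"
other: complement punct
parse/all buf [
    any [
        ; a run of non-punctuation characters is one sentence
        copy s some other (
            s: trim s
            if not empty? s [append sentences s]
        )
        ; skip the punctuation that ends it
        | some punct
    ]
]
foreach sentence sentences [print mold sentence]

each string in sentences would then need its own leading and trailing
space before going through the same counting loop, so that no phrase
runs past the end of a sentence.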