Frequency of phrases
[1/7] from: louisaturk:coxinet at: 22-Aug-2002 13:01
Hi rebols,
Goal: To find the length and frequency of use of all the unique phrases in
a text file.
Phrase: A phrase will be defined as a string len characters long and with a
space at each end. All phrases 100 characters long are to be processed
first, then all phrases of length len - 1 and so on until len = 5.
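As a rough illustration of this definition (in Python rather than REBOL, purely as a sketch), the function below collects every substring of exactly `length` characters that begins and ends with a space. The name `phrases_of_length` and the sample text are hypothetical, and counting the two bounding spaces as part of the length is an assumption about how the definition is meant.

```python
def phrases_of_length(text, length):
    """Return every substring of `length` characters that starts and ends with a space."""
    found = []
    for start in range(len(text) - length + 1):
        window = text[start:start + length]
        # Assumption: the bounding spaces count toward the phrase length.
        if window[0] == " " and window[-1] == " ":
            found.append(window)
    return found


sample = " a b c "  # hypothetical text, already padded with a space at each end
print(phrases_of_length(sample, 5))
```

Calling this once per length, from 100 down to 5, gives the processing order described above.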
The text file: To simplify things, manually place a space at the beginning
and at the end of the file to be processed. To further simplify things,
place a space before all punctuation marks.
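The manual preparation described here could also be done by a small script. A minimal sketch in Python, assuming the punctuation marks of interest are `. , ; : ! ?` (the thread does not list them) and using a hypothetical helper name `prepare`:

```python
import re


def prepare(text):
    """Put a space before each punctuation mark and pad the text with one space at each end."""
    # Only marks that directly follow a non-space character get a space inserted,
    # so text that is already spaced is left unchanged.
    spaced = re.sub(r"(?<=\S)([.,;:!?])", r" \1", text)
    return " " + spaced.strip() + " "


print(repr(prepare("In the beginning. Then what?")))
```

Note that this also separates dots inside tokens such as file names, which matches "before all punctuation marks" only in the most literal reading.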
Achieving this goal is proving to be quite a bit more complicated than I
thought at first, and will be extremely time consuming if not done properly.
What is the best way to do this?
Louis
[2/7] from: reffy:ulrich at: 22-Aug-2002 14:22
Can you send a sample file?
[3/7] from: louisaturk:coxinet at: 22-Aug-2002 16:36
Hi Reffy,
At 02:22 PM 8/22/2002 -0800, you wrote:
>Can you send a sample file?
I'm sending you the file as an attachment off list.
Also, there is one more requirement I forgot to mention. Each phrase must
contain at least three words.
Louis
[4/7] from: chalz:earthlink at: 23-Aug-2002 0:37
> Phrase: A phrase will be defined as a string len characters long and with a
> space at each end. All phrases 100 characters long are to be processed
> first, then all phrases of length len - 1 and so on until len = 5.
My apologies, but what about a phrase such as:
This is a phrase.<cr>
There is no whitespace at beginning or end.
> The text file: To simplify things, manually place a space at the beginning
> and at the end of the file to be processed. To further simplify things,
> place a space before all punctuation marks.
Eek. Or possibly allow for cases to accept punctuation, so long as there is
no other printable non-whitespace character afterwards? For instance, accept:
This is a phrase. This too is a phrase.<cr>
... As two phrases. However, do not end with the dot here as a phrase
ending:
This phrase mentions index.html.<cr>
Yes?
--Charles
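One possible reading of the rule Charles suggests, sketched in Python: a `.`, `!` or `?` ends a phrase only when whitespace or the end of the text follows it, so the dot inside `index.html` does not split anything. The helper name `split_phrases` is hypothetical, and `<cr>` is taken to mean a carriage return.

```python
import re


def split_phrases(text):
    """Split text after '.', '!' or '?' only where whitespace follows the mark."""
    # strip() drops the trailing <cr>, so the final mark simply ends the last phrase.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


print(split_phrases("This is a phrase. This too is a phrase.\r"))
print(split_phrases("This phrase mentions index.html.\r"))
```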
[5/7] from: louisaturk:coxinet at: 23-Aug-2002 1:20
Hi Charles,
At 12:37 AM 8/23/2002 -0400, you wrote:
> > Phrase: A phrase will be defined as a string len characters long and with a
> > space at each end. All phrases 100 characters long are to be processed
> > first, then all phrases of length len - 1 and so on until len = 5.
> My apologies, but what about a phrase such as:
>This is a phrase.<cr>
> There is no whitespace at beginning or end.
This really is what I need, strange as it may sound.
> > The text file: To simplify things, manually place a space at the beginning
> > and at the end of the file to be processed. To further simplify things,
> > place a space before all punctuation marks.
My source file really is like this also.
> Eek. Or possibly allow for cases to accept punctuation, so long as
> there is
<<quoted lines omitted: 4>>
>This phrase mentions index.html.<cr>
> Yes?
Not necessary for my needs right now. I am just going to need this script
for one use, but the resulting data is going to be very helpful. The
script would certainly be more generally useful if it can process more
normal text (and would work for me), but it should be much simpler to make
the script as I need it.
Correction (in all caps): A phrase will be defined as a string len
characters long and with a space at each end AND CONTAINING AT LEAST THREE
WORDS.
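Putting the corrected definition together, a brute-force sketch in Python might look like the following. It assumes the text is already prepared (single spaces, a space at each end), that the bounding spaces count toward `len`, and that a "word" is any run of non-space characters; the function name `count_phrases` and its parameters are hypothetical.

```python
from collections import Counter


def count_phrases(text, max_len=100, min_len=5, min_words=3):
    """Count space-bounded substrings of each length that hold at least `min_words` words."""
    counts = Counter()
    # Longest phrases first, following the stated processing order.
    for length in range(max_len, min_len - 1, -1):
        for start in range(len(text) - length + 1):
            window = text[start:start + length]
            if window[0] != " " or window[-1] != " ":
                continue
            if len(window.split()) >= min_words:
                counts[window.strip()] += 1
    return counts


text = " a b c a b c "  # hypothetical prepared text
for phrase, freq in count_phrases(text).most_common():
    print(freq, phrase)
```

This scans the whole text once per length, which is simple but slow on large files; a hash-based approach keyed on word positions would avoid most of the wasted windows.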
Thanks,
Louis
[6/7] from: tomc:darkwing:uoregon at: 23-Aug-2002 11:50
On Thu, 22 Aug 2002, Louis A. Turk wrote:
> Hi rebols,
> Goal: To find the length and frequency of use of all the unique phrases in
<<quoted lines omitted: 8>>
> thought at first, and will be extremely time consuming if not done properly.
> What is the best way to do this?
ask on the list then pick your solution
> Louis
quick and dirty
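rebol []
; read the file, join it into one line, put a space before . ! ?
buf: read %<whatever>
replace/all buf "^/" " "
replace/all buf "." " ."
replace/all buf "!" " !"
replace/all buf "?" " ?"
; collapse the double spaces created above
replace/all buf "  " " "
insert buf " "
append buf " "
; index of the first character of the last word
end: index? next find/reverse find/last buf " " " "
hsh: make hash! (length? buf)
cnt: 0
phr: copy ""
fub: copy ""
while [(index? buf) < end] [
    ; fub: the third space at or after buf
    fub: find next find next find buf " " " " " "
    phr: trim copy/part buf either fub
        [fub] [fub: back tail buf]
    ; count the phrase, then extend it one word at a time up to 100 characters
    while [all [(length? phr) < 101
                (length? parse phr none) > 2
                not tail? fub]] [
        cnt: select hsh phr
        either cnt
            [change next find hsh phr (cnt + 1)]
            [append hsh reduce [:phr 1]]
        ; search from the character after fub; find would otherwise match the space fub is on
        fub: find next fub " "
        either fub
            [phr: trim copy/part buf fub]
            [fub: tail buf]
    ]
    ; move to the start of the next word
    buf: next find buf " "
]
; sort the phrase/count pairs by count, highest first
shsh: copy []
foreach [k v] hsh [append/only shsh reduce [k v]]
sort/compare shsh func [a b] [a/2 > b/2]
foreach sh shsh [print sh]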
--------------------------------------------
not so sure I would want phrases to span
sentence-ending punctuation but that is what you asked for
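For comparison, a variant that does not let phrases span sentence-ending punctuation could be sketched as below (in Python, not REBOL). It assumes the text already has a space before each `.`, `!` and `?`, so each mark is its own token; the name `count_within_sentences` and its parameters are hypothetical, and the 100-character cap mirrors the limit used in the script above.

```python
from collections import Counter

SENTENCE_END = {".", "!", "?"}


def count_within_sentences(text, min_words=3, max_chars=100):
    """Count word sequences of at least `min_words` words that stay inside one sentence."""
    counts = Counter()
    sentence = []
    # The extra "." flushes the last sentence when the text lacks final punctuation.
    for token in text.split() + ["."]:
        if token not in SENTENCE_END:
            sentence.append(token)
            continue
        for start in range(len(sentence)):
            for stop in range(start + min_words, len(sentence) + 1):
                phrase = " ".join(sentence[start:stop])
                if len(phrase) > max_chars:
                    break  # longer phrases from this start only grow
                counts[phrase] += 1
        sentence = []
    return counts


for phrase, freq in count_within_sentences(" a b c . a b c d ! ").most_common():
    print(freq, phrase)
```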
[7/7] from: louisaturk:coxinet at: 23-Aug-2002 23:42
Hi Tom,
At 11:50 AM 8/23/2002 -0700, you wrote:
>--------------------------------------------
>not so sure I would want phrases to span
>sentence-ending punctuation but that is what you asked for
Many thanks. I'm studying your code, and it has already given me some
ideas. I'll probably end up making major changes to the code I have been
working on all day.
Louis
Notes
- Quoted lines have been omitted from some messages.