Frequency of phrases

[1/7] from: louisaturk:coxinet at: 22-Aug-2002 13:01

Hi rebols,

Goal: To find the length and frequency of use of all the unique phrases in a text file.

Phrase: A phrase will be defined as a string len characters long and with a space at each end. All phrases 100 characters long are to be processed first, then all phrases of length len - 1, and so on until len = 5.

The text file: To simplify things, manually place a space at the beginning and at the end of the file to be processed. To further simplify things, place a space before all punctuation marks.

Achieving this goal is proving to be quite a bit more complicated than I thought at first, and will be extremely time consuming if not done properly. What is the best way to do this?

Louis

[2/7] from: reffy:ulrich at: 22-Aug-2002 14:22

Can you send a sample file?

[3/7] from: louisaturk:coxinet at: 22-Aug-2002 16:36

Hi Reffy,

At 02:22 PM 8/22/2002 -0800, you wrote:

>Can you send a sample file?

I'm sending you the file as an attachment off list. Also, there is one more requirement I forgot to mention: each phrase must contain at least three words.

Louis

[4/7] from: chalz:earthlink at: 23-Aug-2002 0:37

> Phrase: A phrase will be defined as a string len characters long and with a
> space at each end. All phrases 100 characters long are to be processed
> first, then all phrases of length len - 1 and so on until len = 5.

My apologies, but what about a phrase such as:

This is a phrase.<cr>

There is no whitespace at beginning or end.

> The text file: To simplify things, manually place a space at the beginning
> and at the end of the file to be processed. To further simplify things,
> place a space before all punctuation marks.

Eek. Or possibly allow for cases to accept punctuation, so long as there is no other printable non-whitespace character afterwards? For instance, accept:

This is a phrase. This too is a phrase.<cr>

... as two phrases. However, do not treat the dot here as a phrase ending:

This phrase mentions index.html.<cr>

Yes?

--Charles
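Charles's rule (punctuation ends a phrase only when no printable character follows it) can be sketched roughly as below. This is an illustrative Python sketch, not REBOL and not code posted in this thread; the function name `space_sentence_punctuation` is hypothetical, and it assumes that only `.`, `!` and `?` count as phrase-ending punctuation.

```python
import re

# Hypothetical helper, not from the thread.
# Assumption: only ".", "!" and "?" are treated as phrase-ending punctuation.
_END_PUNCT = re.compile(r"(?<=[^\s.!?])([.!?]+)(?=\s|$)")


def space_sentence_punctuation(text):
    """Insert a space before a run of . ! ? only when whitespace or the
    end of the text follows it, so the dot inside "index.html" is left alone.
    Punctuation that already has a space before it is not touched."""
    return _END_PUNCT.sub(r" \1", text)


if __name__ == "__main__":
    print(space_sentence_punctuation("This is a phrase. This too is a phrase.\n"), end="")
    print(space_sentence_punctuation("This phrase mentions index.html.\n"), end="")
```

The lookahead is what separates the two cases in the message above: the final dot is followed by whitespace and gets spaced out, while the dot in `index.html` is followed by a letter and stays attached.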

[5/7] from: louisaturk:coxinet at: 23-Aug-2002 1:20

Hi Charles,

At 12:37 AM 8/23/2002 -0400, you wrote:

> > Phrase: A phrase will be defined as a string len characters long and with a
> > space at each end. All phrases 100 characters long are to be processed
> > first, then all phrases of length len - 1 and so on until len = 5.
> My apologies, but what about a phrase such as:
> This is a phrase.<cr>
> There is no whitespace at beginning or end.

This really is what I need, strange as it may sound.

> > The text file: To simplify things, manually place a space at the beginning
> > and at the end of the file to be processed. To further simplify things,
> > place a space before all punctuation marks.

My source file really is like this also.

> Eek. Or possibly allow for cases to accept punctuation, so long as
> there is

<<quoted lines omitted: 4>>

> This phrase mentions index.html.<cr>
> Yes?

Not necessary for my needs right now. I am just going to need this script for one use, but the resulting data is going to be very helpful. The script would certainly be more generally useful if it can process more normal text (and would work for me), but it should be much simpler to make the script as I need it.

Correction (in all caps): A phrase will be defined as a string len characters long and with a space at each end AND CONTAINING AT LEAST THREE WORDS.

Thanks,
Louis
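The corrected definition can be read as a simple test on a candidate string. Below is a minimal sketch in Python (not REBOL, and not code from this thread). The name `is_phrase` and its parameters are hypothetical, and it assumes that a "word" is any whitespace-separated token, so a punctuation mark that has been spaced out counts as a word.

```python
def is_phrase(s, min_len=5, max_len=100, min_words=3):
    """Return True if s fits the definition given above:
    between min_len and max_len characters long (bounding spaces included),
    a space at each end, and at least min_words words.

    Assumption: a word is any whitespace-separated token.
    """
    return (
        min_len <= len(s) <= max_len
        and s.startswith(" ")
        and s.endswith(" ")
        and len(s.split()) >= min_words
    )


if __name__ == "__main__":
    for candidate in (" a b c ", " a b ", "a b c ", " the cat sat . "):
        print(repr(candidate), is_phrase(candidate))
```

Under this reading the shortest string that can pass is 7 characters long (three one-character words, single spaces between them, plus the two bounding spaces), so lengths 5 and 6 can never produce a match.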

[6/7] from: tomc:darkwing:uoregon at: 23-Aug-2002 11:50

On Thu, 22 Aug 2002, Louis A. Turk wrote:

> Hi rebols,
> Goal: To find the length and frequency of use of all the unique phrases in

<<quoted lines omitted: 8>>

> thought at first, and will be extremely time consuming if not done properly.
> What is the best way to do this?

ask on the list then pick your solution

> Louis

quick and dirty

rebol[]
buf: read %<whatever>
replace/all buf "^/" " "
replace/all buf "." " ."
replace/all buf "!" " !"
replace/all buf "?" " ?"
replace/all buf "  " " "
insert buf " "
append buf " "
end: index? next find/reverse find/last buf " " " "
hsh: make hash! (length? buf)
cnt: 0
phr: copy ""
fub: copy ""
while [(index? buf) < end] [
    fub: find next find next find buf " " " " " "
    phr: trim copy/part buf either fub [fub][fub: back tail buf]
    while[all[(length? phr) < 101 (length? parse phr none) > 2 not tail? fub] ][
        cnt: select hsh phr
        either cnt
            [change next find hsh phr (cnt + 1)]
            [append hsh reduce[:phr 1]]
        fub: next find fub " "
        either fub [phr: trim copy/part buf fub] [fub: tail buf]
    ]
    buf: next find buf " "
]
shsh: copy []
foreach [k v] hsh [append/only shsh reduce[k v] ]
sort/compare shsh func[a b][a/2 > b/2]
foreach sh shsh [print sh]

--------------------------------------------
not so sure I would want phrases to span over
sentence ending punctuation but that is what you asked for
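For readers who want to check results against another implementation, the same general idea (start at each word, extend the phrase one word at a time, and tally every phrase of three or more words that fits in 100 characters) can be sketched in Python. This is not a translation of the REBOL script above and is not from the thread. The names `count_phrases` and `report` are hypothetical, and the sketch assumes that runs of whitespace collapse to single spaces and that spaced-out punctuation marks count as words.

```python
from collections import Counter


def count_phrases(text, min_len=5, max_len=100, min_words=3):
    """Count every run of at least min_words consecutive words whose length,
    including one bounding space at each end, is within min_len..max_len.

    Assumptions: words are whitespace-separated tokens, and whitespace runs
    are normalised to single spaces when the phrase is rebuilt.
    """
    words = text.split()
    counts = Counter()
    for start in range(len(words)):
        for end in range(start + min_words, len(words) + 1):
            phrase = " " + " ".join(words[start:end]) + " "
            if len(phrase) > max_len:
                break  # adding more words can only make it longer
            if len(phrase) >= min_len:
                counts[phrase] += 1
    return counts


def report(counts):
    """Order results longest phrase first, then by descending frequency."""
    return sorted(counts.items(), key=lambda kv: (-len(kv[0]), -kv[1], kv[0]))


if __name__ == "__main__":
    sample = " the cat sat . the cat sat . "
    for phrase, n in report(count_phrases(sample)):
        print(len(phrase), n, repr(phrase))
```

`report` lists the longest phrases first, following the order requested in the original post, whereas the script above sorts by frequency alone. Like that script, this sketch lets phrases span sentence-ending punctuation.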

[7/7] from: louisaturk:coxinet at: 23-Aug-2002 23:42

Hi Tom,

At 11:50 AM 8/23/2002 -0700, you wrote:

>--------------------------------------------
>not so sure I would want phrases to span over
>sentence ending punctuation but that is what you asked for

Many thanks. I'm studying your code, and it has already given me some ideas. I'll probably end up making major changes to the code I have been working on all day.

Louis

Notes

Quoted lines have been omitted from some messages.