Documentation for: make-word-list.r


     script: make-word-list.r
      title: List words in a string
     author: Peter W A Wood
       date: 2-Apr-2007
    Version: 1.0.0
 

purpose

make-word-list.r lists all the unique words in a document.

make-word-list.r is used inside skimp.r, the simple keyword management program that REBOL.org uses for many of its search indexes.

I would like to thank Sunanda without whom this script wouldn't have been started, tested, optimised or documented.

1. usage

 USAGE:
     make-word-list config content /for-search

 ARGUMENTS:
      config -- changes to the default configuration (Type: object or none)
                (See below)

      content -- the string for which words are to be extracted

 REFINEMENTS:
      /for-search -- The character specified as the "not-prefix" is not removed
                     from the front of words
 

1.1. configuration object

The configuration object used by the make-word-list function gives almost complete control over what counts as a word. You only need to supply the entries you want to change, not every entry in the configuration object. For example, the following configuration object recognises only words starting with "a":

   my-config: make object! [word-start: charset [#"a"]]
   probe make-word-list my-config "aword bword cword dad"
  
The result would be ["aword" "ad"].

If you are happy with the default settings, you can supply none instead of a configuration object.

| name | format | use | default | notes |
|------|--------|-----|---------|-------|
| alpha | charset | basic alphabet, both upper and lower case | charset [#"a" - #"z" #"A" - #"Z"] | |
| digit | charset | basic digits | charset [#"0" - #"9"] | |
| word-start | charset | characters that start a word | alpha | |
| word-letter | charset | characters used in words | union alpha union digit charset ["~"] | There is no need to include "-", ".", "/" or "\": words separated by these, such as paths or domain names, are handled specially |
| final-letters | charset | additional characters that may come at the end of a word | charset ["!" "?"] | |
| number | charset | characters that make up the body of a number | union digit charset [".,"] | |
| number-prefix | charset | characters that may be at the start of a number | charset ["+-£$¢"] | |
| number-postfix | charset | characters that may appear at the end of a number | charset ["+-"] | |
| word-length | pair | minimum x maximum string lengths to be included | 1x40 | 3x50 means that 1-letter, 2-letter and 51+ letter strings will not be recognised |
| not-prefix | string or none | single character that will be used as a not-prefix in searches | "~" | set to none if no not-prefix is required |
| stop-list | block of strings | words that will be ignored | ["a" "is" "the"] | |
| ignore-tags | logic | ignore tags if true | false | |
| index-pairs | logic | true to index pairs; false to ignore them | true | |
| hyphen | charset | characters used to hyphenate words | charset ["-_"] | |
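
As a rough illustration of supplying only the changed entries, the sketch below overrides three of the settings listed above and leaves the rest at their defaults. The values chosen, the sample sentence and the file location %make-word-list.r are assumptions made for illustration. No output is shown because it depends on the script itself.

```rebol
REBOL [Title: "make-word-list configuration sketch"]

; Only the entries that differ from the defaults are supplied.
my-config: make object! [
    word-length: 3x50           ; skip strings shorter than 3 or longer than 50 characters
    stop-list: ["and" "the"]    ; hypothetical stop words for this example
    ignore-tags: true           ; ignore tags in the content
]

; Assumption: make-word-list.r sits in the current directory and
; defines make-word-list when it is loaded with do.
if exists? %make-word-list.r [
    do %make-word-list.r
    probe make-word-list my-config "The cat and the <b>dog</b> sat on a mat"
]
```

Keeping the overrides in one small object like this makes it easy to see how a particular index differs from the default behaviour.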

1.2. /for-search refinement

There are many different reasons why you may want to extract the words from a string. Two of the most common are:

  • the string is the source document that you want to index
  • the string is a user-supplied search string that you want to turn into separate words so you can search for them using find-words

In your application the two uses may need to behave slightly differently, especially in how they handle the not-prefix. For example, with the default configuration make-word-list treats tildes differently depending on the refinement. Without /for-search a leading tilde is removed from a word; with /for-search it is preserved:

     >> make-word-list none "I have some ~tildes in t~~~his ~string~"
     == ["have" "i" "in" "some" "string~" "tildes" "t~~~his"]
     >> make-word-list/for-search none "I have some ~tildes in t~~~his ~string~"
     == ["have" "i" "in" "some" "t~~~his" "~string~" "~tildes"]
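
One way an application might use the preserved not-prefix is to split the search terms into words to look for and words to exclude. The sketch below is an assumption about such a use, not part of make-word-list.r or skimp.r. It works on a literal copy of the /for-search result shown above, and the names wanted and unwanted are hypothetical.

```rebol
REBOL [Title: "Split search terms on the not-prefix (sketch)"]

; Literal copy of the /for-search result shown above.
terms: ["have" "i" "in" "some" "t~~~his" "~string~" "~tildes"]

not-prefix: #"~"    ; the default not-prefix character
wanted: copy []     ; hypothetical: terms the results should contain
unwanted: copy []   ; hypothetical: terms the results should not contain

foreach term terms [
    either not-prefix = first term [
        ; drop the leading not-prefix, keep the rest of the word
        append unwanted copy next term
    ][
        append wanted term
    ]
]

probe wanted      ; ["have" "i" "in" "some" "t~~~his"]
probe unwanted    ; ["string~" "tildes"]
```

Only a leading tilde is treated as a not-prefix here; tildes inside or at the end of a word are left alone, which matches how the examples above treat "t~~~his" and "string~".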