Documentation for: make-word-list.r

     script: make-word-list.r
      title: List words in a string
     author: Peter W A Wood
       date: 2-Apr-2007
    version: 1.0.0

purpose

1. usage

1.1. configuration object

1.2. /for-search refinement

purpose

make-word-list.r lists all the unique words in a document.

make-word-list.r is used inside skimp.r, the simple keyword management program that REBOL.org uses for many of its search indexes.

I would like to thank Sunanda without whom this script wouldn't have been started, tested, optimised or documented.

1. usage

 USAGE:
     make-word-list config content /for-search

 ARGUMENTS:
      config -- changes to the default configuration (Type: object or none)
                (See below)

      content -- the string for which words are to be extracted

 REFINEMENTS:
      /for-search -- The character specified as the "not-prefix" is not removed
                     from the front of words

1.1. configuration object

The configuration object used by the make-word-list function provides almost complete control over what makes a word. You only need to supply the entries you want to change, not every entry in the configuration object. For example, the following configuration object will only recognise words starting with "a":

   my-config: make object! [word-start: charset [#"a"]]
   probe make-word-list my-config "aword bword cword dad"

The result would be ["aword" "ad"].

If you are happy with the default settings, you can supply none instead of a configuration object.

name	format	use	default	notes
alpha	charset	basic alphabet, both upper and lower case	charset [#"a" - #"z" #"A" - #"Z"]
digit	charset	basic digits	charset [#"0" - #"9"]
word-start	charset	characters that start a word	alpha
word-letter	charset	characters used in words	union alpha union digit charset ["~"]	There is no need to include "-", ".", "/" or "\", as words separated by these, such as paths or domain names, are handled specially
final-letters	charset	additional characters that may come at the end of a word	charset ["!" "?"]
number	charset	characters that make up the body of a number	union digit charset [".,"]
number-prefix	charset	characters that may be at the start of a number	charset ["+-�$�"]
number-postfix	charset	characters that may appear at the end of a number	charset ["+-"]
word-length	pair	minimum x maximum string lengths to be included	1x40	3x50 means one-letter, two-letter and 51+ letter strings will not be recognised
not-prefix	string or none	single character that will be used as a not-prefix in searches	"~"	set to none if no not-prefix is required
stop-list	block of strings	words that will be ignored	["a" "is" "the"]
ignore-tags	logic	ignore tags if true	false
index-pairs	logic	true to index pairs; false to ignore them	true
hyphen	charset	characters used to hyphenate words	charset ["-_"]
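
As an illustration of the table above, a configuration that overrides several defaults at once might look like the sketch below. The values are arbitrary, and the sketch assumes that make-word-list.r has already been loaded and that the entry names and formats are accepted exactly as listed in the table. No result is shown because it depends on how the script handles each setting.

```rebol
; Hypothetical configuration: only the entries listed here are changed,
; every other entry keeps its default value.
my-config: make object! [
    word-length: 3x50                       ; skip strings under 3 or over 50 characters
    stop-list: ["a" "is" "the" "and" "of"]  ; words that will be ignored
    ignore-tags: true                       ; ignore tags in the content
    not-prefix: none                        ; no not-prefix is required
]

probe make-word-list my-config "The cat and the dog sat on <b>the</b> mat"
```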

1.2. /for-search refinement

There are many different reasons why you may want to extract the words from a string. Two of the most common are:

the string is the source document that you want to index
the string is a user-supplied search string that you want to turn into separate words so you can search for them using find-words

In your application, the two uses may need to behave slightly differently, especially in how they handle the not-prefix. For example, with the default configuration, make-word-list treats a string that contains tildes differently depending on the refinement: without /for-search a leading tilde is removed from a word, and with /for-search it is preserved:

     >> make-word-list none "I have some ~tildes in t~~~his ~string~"
     == ["have" "i" "in" "some" "string~" "tildes" "t~~~his"]
     >> make-word-list/for-search none "I have some ~tildes in t~~~his ~string~"
     == ["have" "i" "in" "some" "t~~~his" "~string~" "~tildes"]
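
The two uses can sit side by side in one application. The following is a minimal sketch, assuming make-word-list.r has already been loaded and the default not-prefix of "~" is in effect. The names doc-words, query-words, wanted and unwanted are hypothetical and are not part of the script.

```rebol
; Words from a source document that is to be indexed.
doc-words: make-word-list none "REBOL scripts are small and portable"

; Words from a user-supplied search string. With /for-search the leading
; not-prefix should be kept, so negated terms can be told apart.
query-words: make-word-list/for-search none "scripts ~small"

; Split the query into terms to match and terms to exclude.
wanted: copy []
unwanted: copy []
foreach word query-words [
    either #"~" = first word [
        append unwanted next word   ; drop the leading "~"
    ][
        append wanted word
    ]
]
```

Splitting the query this way keeps the decision about how to apply negated terms in the calling code rather than in make-word-list itself.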