Documentation for: make-word-list.r
Created by: peterwood
 on: 2-Apr-2007
Last updated by: peterwood on: 4-Apr-2007
Format: text/editable
Downloaded on: 18-Apr-2024

[div/style/max-width:80%

[numbering-on
[asis/style/font-size:small;overflow:auto;background-color:#ddffee
    script: make-word-list.r
     title: List words in a string
    author: Peter W A Wood
      date: 2-Apr-2007
   Version: 1.0.0
asis]

[contents

[div/style/border:thin,blue,solid;margin:1em;padding:1em
[h2/style/break:both purpose

[p **make-word-list.r** lists all the unique **words** in a **document**.

[p make-word-list.r is used inside skimp.r - the simple keyword management 
program that is used in REBOL.org for many of its search indexes.

[p I would like to thank Sunanda without whom this script wouldn't have been
started, tested, optimised or documented. 


div] [div/style/*-1 [h2 usage


[asis/style/*-1
USAGE:
    make-word-list config content /for-search

ARGUMENTS:
     config -- changes to the default configuration (Type: object or none)
               (See below)
     
     content -- the string for which words are to be extracted

REFINEMENTS:
     /for-search -- The character specified as the "not-prefix" is not removed
                    from the front of words
asis]

[h3 configuration object

[p The **configuration object** used by the **make-word-list** function provides
almost complete control over what makes a word.

You only need to supply the entries you want to change, not every entry in the
configuration object. For example, the following configuration object will only
recognise words starting with "a":
[asis/style/*-1
  my-config: make object! [word-start: charset [#"a"]]
  probe make-word-list my-config "aword bword cword dad"
 asis]
  The result would be ["aword" "ad"]

If you are happy with the default settings, you can supply **none** instead of
a **configuration object**.

[cell/class/lsh name
[cell/class/lsh format
[cell/class/lsh use
[cell/class/lsh default
[cell/class/lsh notes                     


[row
[cell/class/lskey1 alpha
[cell/class/lsdata1 charset
[cell/class/lsdata1 basic alphabet, both upper and lower case
[cell/class/lsdata1 charset [#"a" - #"z" #"A" - #"Z"]
[cell/class/lsdata1

[row
[cell/class/lskey2 digit
[cell/class/lsdata2 charset
[cell/class/lsdata2 basic digits
[cell/class/lsdata2
 charset [#"0" - #"9"]
[cell/class/lsdata2

[row
[cell/class/*-4 word-start
[cell/class/*-4 charset
[cell/class/*-4 characters that start a word
[cell/class/*-4 alpha
[cell/class/*-4

[row
[cell/class/*-4 word-letter
[cell/class/*-4 charset
[cell/class/*-4 characters used in words
[cell/class/*-4 union alpha union digit charset ["~"]
[cell/class/*-4 There is no need to include "-", ".", "/" or "\" as words
separated by these, such as paths or domain names, are handled specially

[row
[cell/class/*-4 final-letters
[cell/class/*-4 charset
[cell/class/*-4 additional characters that may come at the end of a word
[cell/class/*-4 charset ["!" "?"]
[cell/class/*-4

[row
[cell/class/*-4 number
[cell/class/*-4 charset
[cell/class/*-4 characters that make up the body of a number
[cell/class/*-4 union digit charset [".,"]
[cell/class/*-4

[row
[cell/class/*-4 number-prefix
[cell/class/*-4 charset
[cell/class/*-4 characters that may be at the start of a number
[cell/class/*-4 charset ["+-£$¢"]
[cell/class/*-4

[row
[cell/class/*-4 number-postfix
[cell/class/*-4 charset
[cell/class/*-4 characters that may appear at the end of a number
[cell/class/*-4 charset ["+-"]
[cell/class/*-4

[row
[cell/class/*-4 word-length
[cell/class/*-4 pair
[cell/class/*-4 minimum x maximum string lengths to be included
[cell/class/*-4 1x40
[cell/class/*-4 for example, 3x50 means one-letter, two-letter and 51+ letter strings will not be recognised

[row
[cell/class/*-4 not-prefix
[cell/class/*-4 string or none
[cell/class/*-4 single character that will be used as a **not** prefix in searches
[cell/class/*-4 "~"
[cell/class/*-4 set to none if no **not** prefix is required

[row
[cell/class/*-4 stop-list
[cell/class/*-4 block of strings
[cell/class/*-4 words that will be ignored
[cell/class/*-4 ["a" "is" "the"]
[cell/class/*-4

[row
[cell/class/*-4 ignore-tags
[cell/class/*-4 logic
[cell/class/*-4 ignore tags if true 
[cell/class/*-4 false
[cell/class/*-4

[row
[cell/class/*-4 index-pairs
[cell/class/*-4 logic
[cell/class/*-4 true to index pairs; false to ignore them
[cell/class/*-4 true
[cell/class/*-4

[row
[cell/class/*-4 hyphen
[cell/class/*-4 charset
[cell/class/*-4 characters used to hyphenate words
[cell/class/*-4 charset ["-_"]
[cell/class/*-4

table]
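
[p As a minimal sketch of overriding several entries at once, the following
configuration uses only entry names from the table above. It assumes that
make-word-list.r has already been loaded (for example with **do**); the values
and the sample string are illustrative only, and the supplied stop-list is
assumed to replace the default rather than extend it.

[asis/style/*-1
  ; illustrative overrides: entries not listed keep their defaults
  my-config: make object! [
      word-length: 3x50                 ; minimum 3, maximum 50 characters
      stop-list: ["and" "the" "with"]   ; assumed to replace the default list
      ignore-tags: true                 ; ignore tags in the content
      not-prefix: none                  ; no "not" prefix required
  ]
  probe make-word-list my-config "The <b>quick</b> fox jumps with the dog"
asis]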


[h3 /for-search refinement

[p There are many different reasons why you may want to extract the words
from a string. Two of the most common are:
[li the string is the source document that you want to index
[li the string is a user-supplied search string that you want to turn into
separate words so you can search for them using **find-words**
list]

[p In your application, the two uses **may** need to behave slightly differently,
especially with regard to handling the **not-prefix**.

As an example, the default make-word-list function acts differently when given
a string that contains tildes -- a leading tilde is preserved
with the **/for-search** refinement:

[asis/style/*
    >> make-word-list none "I have some ~tildes in t~~~his ~string~"
    == ["have" "i" "in" "some" "string~" "tildes" "t~~~his"]
    >> make-word-list/for-search none "I have some ~tildes in t~~~his ~string~"
    == ["have" "i" "in" "some" "t~~~his" "~string~" "~tildes"]

asis]
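
[p When the **/for-search** refinement is used, the application still has to
decide what to do with the words that keep the **not-prefix**. The following is
a minimal sketch of one way to separate them, using only built-in REBOL
functions. The names **wanted** and **unwanted** are hypothetical and are not
part of make-word-list.r.

[asis/style/*
    ; the result of the /for-search example above
    words: ["have" "i" "in" "some" "t~~~his" "~string~" "~tildes"]
    wanted: copy []
    unwanted: copy []
    foreach word words [
        either #"~" = first word [
            ; drop the leading not-prefix before storing the word
            append unwanted copy next word
        ][
            append wanted word
        ]
    ]
    probe wanted     ; prints ["have" "i" "in" "some" "t~~~his"]
    probe unwanted   ; prints ["string~" "tildes"]
asis]

[p Only a leading tilde marks a word as unwanted here, so "t~~~his" stays in
the **wanted** block.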