Documentation for: make-word-list.r
Created by: peterwood on: 2-Apr-2007
Last updated by: peterwood on: 4-Apr-2007
Format: text/editable
Downloaded on: 18-Apr-2024
[div/style/max-width:80%
[numbering-on
[asis/style/font-size:small;overflow:auto;background-color:#ddffee
script: make-word-list.r
title: List words in a string
author: Peter W A Wood
date: 2-Apr-2007
Version: 1.0.0
asis]
[contents
[div/style/border:thin,blue,solid;margin:1em;padding:1em
[h2/style/break:both purpose
[p **make-word-list.r** lists all the unique **words** in a **document**.
[p make-word-list.r is used inside skimp.r, the simple keyword management program that REBOL.org uses for many of its search indexes.
[p I would like to thank Sunanda, without whom this script wouldn't have been started, tested, optimised or documented.
div]
[div/style/*-1
[h2 usage
[asis/style/*-1
USAGE:
    make-word-list config content index-name /for-search

ARGUMENTS:
    config -- changes to the default configuration (Type: object or none) (See below)
    content -- the string from which words are to be extracted

REFINEMENTS:
    /for-search -- The character specified as the "not-prefix" is not removed from the front of words
asis]
[h3 configuration object
[p The **configuration object** used by the **make-word-list** function provides almost complete control over what makes a word. You only need to supply the entries you want to change, not every entry in the configuration object. For example, the following configuration object will only recognise words starting with "a":
[asis/style/*-1
my-config: make object! [
    word-start: charset [#"a"]
]
probe make-word-list my-config "aword bword cword dad"
asis]
[p The result would be ["aword" "ad"]
[p If you are happy with the default settings, you can supply **none** instead of a **configuration object**.
[cell/class/lsh name [cell/class/lsh format [cell/class/lsh use [cell/class/lsh default [cell/class/lsh notes
[row [cell/class/lskey1 alpha [cell/class/lsdata1 charset [cell/class/lsdata1 basic alphabet, both upper and lower case [cell/class/lsdata1 charset [#"a" - #"z" #"A" - #"Z"] [cell/class/lsdata1
[row [cell/class/lskey2 digit [cell/class/lsdata2 charset [cell/class/lsdata2 basic digits [cell/class/lsdata2 charset [#"0" - #"9"] [cell/class/lsdata2
[row [cell/class/*-4 word-start [cell/class/*-4 charset [cell/class/*-4 characters that start a word [cell/class/*-4 alpha [cell/class/*-4
[row [cell/class/*-4 word-letter [cell/class/*-4 charset [cell/class/*-4 characters used in words [cell/class/*-4 union alpha union digit charset ["~"] [cell/class/*-4 There is no need to include "-", ".", "/" or "\", as words separated by these, such as paths or domain names, are handled specially
[row [cell/class/*-4 final-letters [cell/class/*-4 charset [cell/class/*-4 additional characters that may come at the end of a word [cell/class/*-4 charset ["!" "?"] [cell/class/*-4
[row [cell/class/*-4 number [cell/class/*-4 charset [cell/class/*-4 characters that make up the body of a number [cell/class/*-4 union digit charset [".,"] [cell/class/*-4
[row [cell/class/*-4 number-prefix [cell/class/*-4 charset [cell/class/*-4 characters that may be at the start of a number [cell/class/*-4 charset ["+-£$¢"] [cell/class/*-4
[row [cell/class/*-4 number-postfix [cell/class/*-4 charset [cell/class/*-4 characters that may appear at the end of a number [cell/class/*-4 charset ["+-"] [cell/class/*-4
[row [cell/class/*-4 word-length [cell/class/*-4 pair [cell/class/*-4 minimum x maximum string lengths to be included [cell/class/*-4 1x40 [cell/class/*-4 (3x50 means that 1-letter, 2-letter and 51+ letter strings will not be recognised)
[row [cell/class/*-4 not-prefix [cell/class/*-4 string or none [cell/class/*-4 single character that will be used as a **not**
prefix in searches [cell/class/*-4 "~" [cell/class/*-4 set to none if no **not** prefix is required
[row [cell/class/*-4 stop-list [cell/class/*-4 block of strings [cell/class/*-4 words that will be ignored [cell/class/*-4 ["a" "is" "the"] [cell/class/*-4
[row [cell/class/*-4 ignore-tags [cell/class/*-4 logic [cell/class/*-4 ignore tags if true [cell/class/*-4 false [cell/class/*-4
[row [cell/class/*-4 index-pairs [cell/class/*-4 logic [cell/class/*-4 true to index pairs; false to ignore them [cell/class/*-4 true [cell/class/*-4
[row [cell/class/*-4 hyphen [cell/class/*-4 charset [cell/class/*-4 characters used to hyphenate words [cell/class/*-4 charset ["-_"] [cell/class/*-4
table]
[h3 /for-search refinement
[p There are many different reasons why you may want to extract the words from a string. Two of the most common are:
[li the string is the source document that you want to index
[li the string is a user-supplied search string that you want to turn into separate words so you can search for them using **find-words**
list]
[p In your application, the two uses **may** need to behave slightly differently, especially with regard to handling the **not-prefix**. As an example, the default make-word-list function acts differently when given a string that contains tildes. A leading tilde is preserved only when the **/for-search** refinement is used:
[asis/style/*
>> make-word-list none "I have some ~tildes in t~~~his ~string~"
== ["have" "i" "in" "some" "string~" "tildes" "t~~~his"]
>> make-word-list/for-search none "I have some ~tildes in t~~~his ~string~"
== ["have" "i" "in" "some" "t~~~his" "~string~" "~tildes"]
asis]
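[p As a minimal sketch, a search-side call could combine a few entries from the configuration table with the **/for-search** refinement. The names **search-config** and **query-words**, the chosen values and the sample query string are illustrative assumptions, not part of the script. The sketch also assumes that make-word-list.r has already been loaded and that the function is called with two arguments, as in the examples above. No output is shown because the result depends on the remaining default settings.
[asis/style/*-1
; Hypothetical configuration: only the entries to be changed are supplied
search-config: make object! [
    stop-list: ["a" "an" "is" "the"]    ; words that will be ignored
    word-length: 2x40                   ; minimum x maximum string lengths
]

; Turn a user-supplied search string into separate words.
; With /for-search, a leading not-prefix (default "~") is kept on the word.
query-words: make-word-list/for-search search-config "the rebol parser ~slow"
probe query-words
asis]
[p Supplying only the changed entries keeps the sketch short, since the function falls back to the defaults for every entry not listed in the object.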