Documentation for: skimp.r
Created by: sunanda
 on: 5-Apr-2007
Last updated by: sunanda on: 28-Apr-2007
Format: text/editable
Downloaded on: 30-Apr-2025

[div/style/max-width:95%

[numbering-on
[asis/style/font-size:small;overflow:auto;background-color:#ddffee
    script: skimp.r
     title: Simple keyword index management program
    author: Sunanda
      date: 23-apr-2007
   Version: 0.0.2
asis]

[contents

[div/style/border:thin,blue,solid;margin:1em;padding:1em
[h2/style/break:both purpose

[p **skimp.r** lets you index the **words** in a set of **documents**.

You can then retrieve the list of documents that contain (or not) specific words.

[p skimp.r is used extensively inside REBOL.org for many of its search indexes.

div] [div/style/border:thin,green,solid;margin:1em;padding:1em [h2 thanks and credits
[li Christian, Romano and everyone on the Mailing List who contributed to **rse-ids.r**
[li Peter Wood for writing the **make-word-list.r** script, and creating test data cases
for rse-ids, skimp and make-word-list.
list]

div] [div/style/*-1 [h2 quick start highlights

[h3 Version summary

[li 0.0.0 11-aug-2005 Written for internal use only
[li 0.0.1  3-apr-2007 First public release; uses rse-ids.r and make-word-list.r
[li 0.0.2 23-apr-2007 Add flush-cache and flush-cache-all
list]

[h3 sample data

[p Parts of this documentation use the following
 data structure as an example: it is a block of pairs, each a **document-name** followed by
 that document's content.
[asis/style/*
  skimp-test-data: [
   "einstein-1" "You see, the wire telegraph is a kind of a very, very long cat."
     "saying-1" "Managing programmers is like herding cats."
   "einstein-1" "You pull his tail in New York and his head is meowing in Los Angeles."
      "lyric-1" "I thought I saw a puddy cat a-creeping up on me"
   "einstein-1" "Do you understand this?"
    "medical-1" "The cat-scan showed nothing unusual"
   "einstein-1" "And radio operates exactly the same way: you send signals here, they receive them there."
     "saying-2" "Dogs see people as companions; cats see people as staff."
   "einstein-1" "The only difference is that there is no cat."
     "cliche-1" "It's been raining cats and dogs since last night!"
   "doggerel-1" "The cat sat on the mat"
     "saying-3" "While the cat's away, the mice will play"
    ]
asis]

[p Note that the document **einstein-1** appears several times in the set. This is to help illustrate
what happens when skimp's index is updated for the same document.

[h3 build an index

[p Let's build a basic skimp index from that data set:

[asis/style/*
      ;; ensure folder exists for index
  if not exists? %skimp-folder [make-dir %skimp-folder]

  index-name: %skimp-folder/skimp-test-index

      ;; add each document to the index
  foreach [document-name contents] skimp-test-data [
      skimp/add-words index-name document-name contents
      ]
asis]

[p You should now have a set of files in the  **%skimp-folder** folder.
 Their names look like: **skimp-test-index-117.sif**


[h3 search index for words

[p We can now use skimp to find words in documents:

[asis/style/*
   probe skimp/find-word index-name "cat"
   == ["saying-3" "doggerel-1" "medical-1" "lyric-1" "einstein-1"]

   probe skimp/find-word index-name "the"
   == ["saying-3" "doggerel-1" "medical-1" "einstein-1"]

   probe skimp/find-word index-name "cats."
   == []               ;; "cats." is not a word according to the default assumptions

   probe skimp/find-word index-name "cat-scan"
   == ["medical-1"]   ;; "cat-scan" is one word

   probe skimp/find-word index-name "scan"
   == ["medical-1"]   ;; the "scan" part of "cat-scan" is also a findable word

   probe skimp/find-word index-name "difference"
   == ["einstein-1"]

       ;; note: find-WORDS takes a block (find-WORD takes a string)
   probe skimp/find-words index-name ["a" "cat"]
   == ["lyric-1" "einstein-1"]

   probe skimp/find-word index-name "~dog"
   == ["einstein-1" "saying-1" "lyric-1" "medical-1" "saying-2" "cliche-1" "doggerel-1" "saying-3"]

asis]

[p Notes
[li both **find-word** and **find-words** return a **block** of **document-names**
[li all words in the index are **lowercase** -- (skimp is case insensitive)
[li skimp has applied some logic to extract **words** from the document content.
The logic is highly flexible, and can be completely replaced  -- so you can
easily use skimp with your own definition of what a word is -- see **what is a word?** later
[li all ""parts"" of the **einstein-1** document have been added
[li use **find-word** to find all documents containing a given **word**
[li use **find-words** to find all documents containing **one or more words**
 (the results are **ANDed** together -- so all words must occur for there to be a match).
[li prefix a word with a **tilde** (eg **~dog**) to find all documents that
do **not** contain the word (the tilde is configurable if you want to use something else)
list]


[h3 retrieve more from the index

[asis/style/*
     ;; all indexed words beginning with an **a**:
  probe skimp/get-indexed-words index-name "a"
  == ["a" "a-creeping" "and" "angeles" "as" "away"]

     ;; ditto, with the names of documents that contain the words:
  probe skimp/get-indexed-words/document-list index-name "a"
  == ["a" ["einstein-1" "lyric-1"] "a-creeping" ["lyric-1"]
      "and" ["cliche-1" "einstein-1"] "angeles" ["einstein-1"]
      "as" ["saying-2"] "away" ["saying-3"]]

     ;; all words beginning **t** in **einstein-1**:
  probe skimp/get-indexed-words-for-document index-name "einstein-1" "t"
  == ["tail" "telegraph" "that" "the" "them" "there" "they" "this?"]
asis]


[h3 remove a document

[p You can remove a document entirely like this:


[asis/style/*
  probe skimp/find-word index-name "difference"
  == ["einstein-1"]      ;; finds one document
  **probe skimp/remove-document index-name "einstein-1"**
  probe skimp/find-word index-name "difference"
  == []                  ;; finds no document
asis]


[h3 other highlights

[p In summary:


[li you can have many indexes open at the same time
[li you can check if an index exists, and delete an entire index
[li you can get information about an index, and you can save your own
data in an index header
[li **add-bulk-words** gives you a faster way of adding many documents at the same time
[li there are various caching and configuration options to tune an index to your needs
list]

div] [div/style/*-1 [h2 usage in detail

[h3 starting skimp

[p Simply use:

[asis/style/* do %skimp.r asis]

[li You will need **rse-ids.r** (in the REBOL.org library).
 rse-ids.r should be already loaded, or available in the same folder
 as skimp so skimp.r can load it.

[li For best results, you'll need **make-word-list.r** (also in the REBOL
library). **make-word-list** provides a flexible and configurable way of
extracting all the ""words"" from a document.
list]

[h3 closing down

[p No need to do anything, unless you have used updates with the **/defer** refinement.
If so, you need to write back **all** open indexes:

[asis/style/* skimp/write-cache-all/flush asis]

You could then if you wish remove skimp from memory:

[asis/style/*
  unset 'skimp
  unset 'rse-ids
asis]


[h3 configuring skimp

[p skimp itself needs very little configuring for use (you can extensively
configure individual indexes -- see later).

[p There are only two global configuration settings: the
 **prefix** and the **suffix** of the file names used for all indexes:

[asis/style/*
  do %skimp.r
  skimp/index-name-prefix: "live-build-"
  skimp/index-name-suffix: ".index"
asis]
[p With the example above, the files for index **my-index** will be called
**live-build-my-index-nnn.index**

The defaults are:
[li index-name-prefix --  none (ie null string)
[li index-name-suffix -- **.sif** (**S**kimp **I**ndex **F**ile)
list]


div] [div/style/*-1 [h2 creating, configuring and closing an index


[h3 creating a skimp index

[p You do not need to do anything to create an index -- simply
start using it and it will be created automatically.

[p The only pre-requisite is that the **folder to store the index exists**....

[asis/style/*
  skimp/add-words %/c/dev/indexes/ind-1 "einstein-1" "You see"
asis]

[p .... will create an index called **ind-1** in the folder
**/c/dev/indexes** (with files named like **ind-1-nnn.sif**) -- provided that folder already exists.

[h3 configuring a skimp index

[p There are some settings you may wish to apply before adding any documents
to an index; and there are some that you **have** to apply before adding
any documents.

[p It is best to configure the index before performing other operations on it.
You can use **skimp/index-exists?** to see if an index exists or not. If it does not,
you are able to apply the configuration settings:

[asis/style/*
  if not skimp/index-exists? %a-test-index [
      skimp/set-word-definition .... ;; define what a ""word"" is in this index
      ]
asis]

The configuration settings are defined later. In summary, they are:
[li  **skimp/set-config** -- performance settings for the index
[li  **skimp/set-word-definition** -- define what a ""word"" is for the index
list]

[h3 closing a skimp index

[p **without caching**: if you are not using the **skimp cache**, then no
 explicit close is needed: all
data is written back on each update operation, so the data on disk is current.

[p **with caching**: if one or more indexes are cached (see below for details)
you need to write the cache out before ending your program, eg:

[asis/style/*
   skimp/write-cache/flush %test-index-1
   skimp/write-cache/flush %test-index-2
asis]

[p If you've not been keeping a close eye on what indexes you have opened and
cached, use **write-cache-all** to write them **all** back to permanent storage:

[asis/style/*
   skimp/write-cache-all/flush
asis]


div] [div/style/*-1 [h2 caching one or more indexes

[h3 /defer

[p These operations can all take the **/defer** refinement:

[li **add-words**  -- add a document and its words
[li **add-bulk-words** -- add several documents and their words in one go
[li **remove-document** -- remove a document and all its words
[li **remove-documents** -- remove several documents and all their words
[li **set-config**  -- performance settings for an index
[li **set-word-definition** -- define what a ""word"" is
[li **set-owner-data** -- any fields you want to keep in the index header
list]

[p Using **/defer** can speed things up as skimp does not write the
 index files back to permanent storage until you issue either a
 **skimp/write-cache** or a **skimp/write-cache-all**

[p As explained above:

[li **write-cache** takes as an argument the name of an index. It writes
 that index -- provided it was in the cache.

[li **write-cache-all** has no arguments. It writes back all indexes found
in the cache.
list]

[h3 /flush

[p Both **write-cache** and **write-cache-all** can take the **/flush** refinement.

[li with **/flush**: the index files are removed from the cache -- so any future
operations will reread the index files from permanent storage.

[li without **/flush**: the index files are written out **and** retained in the cache.
list]
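
[p As a minimal sketch of the two refinements working together, assuming skimp.r is available in the current folder and the quick-start index in **%skimp-folder** is the one being updated, several deferred updates can be followed by a single write:

[asis/style/*
  do %skimp.r
  index-name: %skimp-folder/skimp-test-index

      ;; each update is held in the cache; the index files are not rewritten yet
  skimp/add-words/defer index-name "doggerel-1" "The cat sat on the mat"
  skimp/add-words/defer index-name "saying-3" "While the cat's away, the mice will play"
  skimp/remove-document/defer index-name "einstein-1"

      ;; write the index back once, and drop it from the cache
  skimp/write-cache/flush index-name
asis]

[p Leaving off **/flush** in the last line would keep the index cached for further operations.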


div] [div/style/*-1 [h2 adding and removing documents from an index

[h3 overview

[p There are four main functions for this:

[li **add-words**  -- adds all the words in a document
[li **add-bulk-words** -- add all the words in multiple documents
[li **remove-document** -- remove a document from the index
[li **remove-documents** -- remove several documents from the index
list]

[p As noted above, each can take the **/defer** refinement. If you have a series of
operations to perform on an index, using **/defer** can speed things up considerably.

[h3 document names

[p By default, a **document's name** must be a **string!** -- previous examples include
**""einstein-1""** and **""saying-1""**.

[p There is a configuration setting that allows document-names to be **integer!**. That is
explained later (See **set-config/integer-document-names**).
 All the document-names in one index must be of the same datatype.


[h3 add-words

[p Adds all the **words** in one **document** to an index.

[p If the document is already in the index, it **adds** all the new words found
in the latest version of the document, but does not remove any words that are no longer
found in it: to **replace** a document in an index, use **remove-document** followed
by **add-words**.

[asis/style/*
USAGE:
    skimp/add-words index-name document-name words /defer

ARGUMENTS:
     index-name -- name of the index (Type: file)
     document-name -- name of the document (Type: string integer)
     words -- words to add (Type: string block)

REFINEMENTS:
     /defer -- Do not write to permanent storage
asis]


[p The **words** can be supplied as either a **string!** or a **block!**

[li if a **string!** -- skimp parses the string to extract the words.
You can override skimp's definition of a word in several ways -- see
**what is a word?**

[li if a **block!** -- skimp expects a **block!** of **string!s** -- each string is
added to the index **as is**. This allows you complete control over what is
considered a ""word"".
list]

[p examples:
[asis/style/*
   skimp/add-words %my-index "parrot" "->: I wish to make a complaint -:)"
   skimp/add-words %my-index "smilies" ["-:)" "->:" "@/."]
asis]
[p In the **smilies** document, the various keyboard symbols are treated as
indexable words, while in **parrot** they are not.
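
[p The replace sequence described at the start of this section (remove, then add) can be wrapped in a small helper. This is only a sketch: **replace-document** is a hypothetical name, not a skimp function, and it assumes skimp.r is already loaded:

[asis/style/*
      ;; replace-document is a hypothetical helper, not part of skimp
  replace-document: func [
      index-name [file!]
      document-name [string! integer!]
      words [string! block!]
  ][
          ;; drop every word currently indexed for the document ....
      skimp/remove-document index-name document-name
          ;; .... then index the new content from scratch
      skimp/add-words index-name document-name words
      ]

  replace-document %my-index "parrot" "This parrot is no more"
asis]

[p After the call, only the words of the new text should be indexed for **parrot**.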


[h3 add-bulk-words

[p Adds all the **words** in zero or more **documents** to an index.

[p Very similar to **add-words** except **add-bulk-words** accepts more than one
document.

[asis/style/*
USAGE:
    skimp/add-bulk-words index-name data-set /defer

ARGUMENTS:
     index-name -- name of the index (Type: file)
     data-set -- set of details to add (Type: block)

REFINEMENTS:
     /defer -- Do not write to permanent storage
asis]


[p The **data-set** is a **block** with pairs of entries: **document-name** followed
by **words** for that document.

[p As with **add-words** the **words** can be **string!** or **block!**


[p example:
[asis/style/*
   skimp/add-bulk-words %my-index [
           "parrot" "I wish to make a complaint -:)"
          "smilies" ["-:)" "->:" "@/."]
      ]
asis]
[p The example adds the same two documents as the **add-words** example.

[h3 choosing between add-words and add-bulk-words

[li If you have multiple documents to add, **add-bulk-words** can be up to
five times faster than **add-words**
[li multiple **add-words** gives an interactive application more chances
to report progress to the waiting user
[li too many documents at once (say over 30, though you will need to test for
your application as document size is a big factor) and **add-bulk-words** can be slower
than **add-words**.
list]
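
[p One way to balance these trade-offs is to feed **add-bulk-words** in modest batches. The helper below is a sketch only: **split-into-batches** is a hypothetical name, and the batch size of 30 is just the rough figure mentioned above, so benchmark with your own data:

[asis/style/*
      ;; split a block of document-name / words pairs into batches
      ;; holding at most docs-per-batch documents each
  split-into-batches: func [
      data-set [block!]
      docs-per-batch [integer!]
      /local batches
  ][
      batches: copy []
      while [not tail? data-set] [
          append/only batches copy/part data-set 2 * docs-per-batch
          data-set: skip data-set 2 * docs-per-batch
          ]
      batches
      ]
asis]

[p Usage, assuming skimp.r is loaded and **skimp-test-data** and **index-name** are set as in the quick start:

[asis/style/*
  foreach batch split-into-batches skimp-test-data 30 [
      skimp/add-bulk-words/defer index-name batch
      print [(length? batch) / 2 "documents added"]   ;; progress report per batch
      ]
  skimp/write-cache/flush index-name
asis]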


[h3 remove-document

[p Removes all the **words** in one **document** from an index.

[p If the document is not in the index, then no action is taken.

[asis/style/*
USAGE:
    skimp/remove-document index-name document-name  /defer

ARGUMENTS:
     index-name -- name of the index (Type: file)
     document-name -- name of the document (Type: string integer)

REFINEMENTS:
     /defer -- Do not write to permanent storage
asis]


[p examples:
[asis/style/*
   skimp/remove-document %my-index "parrot"
   skimp/remove-document %my-index "einstein-1"
asis]


[p **question:** Can I remove just some of the words of a document
from an index?
[p **response:** No, not in this version of skimp.

[p **question:** How do I delete an entire index?
[p **response:** See **remove-index** below.


[h3 remove-documents

[p Removes all the **words** in zero or more **documents** from an index.

[p Document names which are not in the index are ignored.

[asis/style/*
USAGE:
    skimp/remove-documents index-name document-list  /defer

ARGUMENTS:
     index-name -- name of the index (Type: file)
     document-list -- names of the documents (Type: block)

REFINEMENTS:
     /defer -- Do not write to permanent storage
asis]


[p example:
[asis/style/*
   skimp/remove-documents %my-index ["parrot" "einstein-1"]
asis]


[h3 remove-document vs remove-documents

[p **remove-document** is a courtesy wrapper for **remove-documents**,
so it does not matter if you use **remove-documents** with just one document
name; the effect is the same.

[p If you have multiple documents to remove it is usually **much faster**
to use **remove-documents** once rather than **remove-document** multiple times.

[p That is because skimp needs to scan the **entire index** for each remove operation.
If you are removing 50 documents, that is 50 scans using **remove-document** rather than
one single scan using **remove-documents**.


div] [div/style/*-1 [h2 searching for documents

[p To find the documents that contain specific words, use either:
[li **find-word** -- searches for a specific word
[li **find-words** -- searches for one or more words
list]

In both cases, a search can be negated (ie find all the documents that
do **not** contain a word) by using the **negate prefix**.

[p You can also search more widely, using:

[li  **get-indexed-words** -- find all the words that are contained in the index
[li  **get-indexed-words-for-document** -- find all the words indexed for a specific document
list]


[h3 find-word

[p Returns a block of document-names that contain that word

[asis/style/*
USAGE:
    skimp/find-word index-name word

ARGUMENTS:
     index-name -- name of the index (Type: file)
     word -- word to search for (Type: string)
asis]

[p examples:
[asis/style/*
   skimp/find-word %my-index "cat"
   skimp/find-word %my-index "~cat"
asis]

[li The first example returns the names of all documents containing the word **""cat""**.
[li The second returns all those that do not contain **""cat""**.
list]


[p **question**: How can I know if the word I am looking for is well-formed? ie that
it is formed to the same parsing rules as **add-words**?

[p **response**: Good question! One way is to use **extract-words-from-string** which is
explained later on. It allows you access to the same processing as **add-words**. (If
 you do use **extract-words-from-string**, it returns a **block** so you need to use it
 in conjunction with **find-words** below).


[h3 find-words

[p Returns a block of document-names that contain **all** the words.

[asis/style/*
USAGE:
    skimp/find-words index-name words

ARGUMENTS:
     index-name -- name of the index (Type: file)
     words -- words to search for (Type: block)
asis]

[p examples:
[asis/style/*
   skimp/find-words %my-index ["cat" "sat" "mat"]
   skimp/find-words %my-index ["~cat" "sat" "mat"]
   skimp/find-words %my-index
       skimp/extract-words-from-string/for-search %my-index "~Cat, sat mat."
asis]

[li The first example returns the names of all documents containing the three words
**""cat""**, **""sat""** and **""mat""**
[li The second returns all those that contain **""sat""**
  and **""mat""** but do **not** contain **""cat""**.
[li The third example is the same as the second (assuming the definition of a word
in force excludes punctuation) except that it uses **extract-words-from-string** to
turn a user-supplied search string into a block of words.

list]


[h3 get-indexed-words

[p Returns a block of all the words in the index that begin with a specific character.
Optionally, also return a block of the matching document names.

[asis/style/*
USAGE:
    skimp/get-indexed-words index-name character /document-list

ARGUMENTS:
     index-name -- name of the index (Type: file)
     character -- first character of the words being sought (Type: char string)

REFINEMENTS:
     /document-list -- Also return the matching document names
asis]

[p examples:
[asis/style/*
   skimp/get-indexed-words %my-index "n"
   skimp/get-indexed-words/document-list %my-index "n"
asis]

[h4 what first letters exist in the index?

[p **get-index-information** (see later) provides a list of all first characters
that exist in the index. This code will display all the words in the index:

[asis/style/*
   index-info: skimp/get-index-information %my-index
   foreach first-char index-info/top-index [
       probe skimp/get-indexed-words %my-index first-char
       ]
asis]


[h3 get-indexed-words-for-document

[p Returns a block of all the words in a specific document that match a specific first character.

[asis/style/*
USAGE:
    skimp/get-indexed-words-for-document index-name document-name character

ARGUMENTS:
     index-name -- name of the index (Type: file)
     document-name -- name of document (Type: integer string)
     character -- first character of the words being sought (Type: char string)

asis]

[p examples:
[asis/style/*
   skimp/get-indexed-words-for-document %my-index "einstein-1" "b"
   skimp/get-indexed-words-for-document %my-index "saying-1" "m"
asis]

[h4 what documents are in the index?

[p Skipping ahead, **get-index-information** is one way to get a list of all
document names.


div] [div/style/*-1 [h2 other index management functions

[p These functions allow you to manage your skimp indexes
efficiently and effectively:


[li  **index-exists?** -- checks if an index exists or not
[li  **get-index-information** -- returns header information about an index
[li  **remove-index** -- delete an index entirely
[li  **set-owner-data** -- add your own index-specific data about an index
[li  **write-cache** -- write out one specific index
[li  **write-cache-all** -- write out all open indexes
[li  **flush-cache** -- empty one specific index
[li  **flush-cache-all** -- empty all open indexes
list]


[h3 index-exists?

[p Returns true or false depending on whether an index exists or not

[asis/style/*
USAGE:
    skimp/index-exists? index-name

ARGUMENTS:
     index-name -- name of the index (Type: file)

asis]

[p example:
[asis/style/*
  if not skimp/index-exists? %my-indexes/index-1 [
      print "sorry -- not yet created"
      ]
asis]


[h3 get-index-information

[p Returns an **object!** containing information about the index, or **none**
if the index does not exist.

[asis/style/*
USAGE:
    skimp/get-index-information index-name

ARGUMENTS:
     index-name -- name of the index (Type: file)

asis]

[p example:
[asis/style/*
  if object? ind-header: skimp/get-index-information %my-indexes/index-1 [
      probe ind-header
      ]
asis]

[h4 index header object

[p Has these fields:


[cell/class/lskey1 **index-file:**
[cell/class/lsdata1 **file**: the name of the index file
[row

[cell/class/lskey2 **top-index:**
[cell/class/lsdata2 **block**: of first characters indexed.
[row

[cell/class/*-3 **owner-data:**
[cell/class/*-3 **object**: any owner data you have set with **set-owner-data** (see later)
[row

[cell/class/*-3 **config:**
[cell/class/*-3 **object**: create-time configuration settings. See later
[row

[cell/class/*-3 **word-parameters:**
[cell/class/*-3 **object**: parameters defining what a word is. See later
[row

[cell/class/*-3 **make-word-list:**
[cell/class/*-3 **function**: the actual function used in this index to define what a word is.
table]


[h3 remove-index

[p Deletes all files related to an index.

[asis/style/*
USAGE:
    skimp/remove-index index-name

ARGUMENTS:
     index-name -- name of the index (Type: file)

asis]

[p example:
[asis/style/*
    skimp/remove-index %my-old-index
asis]


[h3 set-owner-data

[p Allows you to store an **object!** in the index header. You could use this to
(say) record an extended name for the index, or to add some application-specific
notes about the index

[asis/style/*
USAGE:
    skimp/set-owner-data index-name owner-data /defer

ARGUMENTS:
     index-name -- name of the index (Type: file)
     owner-data -- your data (Type: object)

REFINEMENTS:
     /defer -- Do not write to permanent storage
asis]

[p example:
[asis/style/*
    skimp/set-owner-data %my-index make object! [
        last-updated: now
        index-name: "All my photo captions"
        support: "call 555-1234"
        ]
asis]

[h4 updating existing owner-data

[p Simply specify the fields you want to change....
[asis/style/*
    skimp/set-owner-data %my-index make object! [
        support: "call 555-9876"
        ]
asis]

[p ....Other fields in the owner-data are left unchanged.

[p To get the full owner-data record, use **get-index-information**.
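
[p A short sketch of reading the owner-data back, assuming skimp.r is loaded and the fields were set as in the examples above (if no owner-data has been set, the field access would need guarding):

[asis/style/*
  info: skimp/get-index-information %my-index
  if object? info [
      probe info/owner-data            ;; the full owner-data object
      print info/owner-data/support    ;; one field from it
      ]
asis]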


[h3 write-cache

[p Writes the data for a specific index to permanent storage. Use if you
have been using the **/defer** refinement on update operations.

[asis/style/*
USAGE:
    skimp/write-cache index-name /flush

ARGUMENTS:
     index-name -- name of the index (Type: file)

REFINEMENTS:
     /flush -- Purge the index from the cache after writing
               (further operations will cause it to be re-read)
asis]

[p examples:
[asis/style/*
    skimp/write-cache %my-index-1
    skimp/write-cache/flush %my-index-2
asis]


[h3 write-cache-all

[p Writes the data for all indexes to permanent storage. Use if you
have been using the **/defer** refinement on update operations, and
your clean-up routines are unsure of which indexes your software has
been updating.

[asis/style/*
USAGE:
    skimp/write-cache-all /flush

ARGUMENTS:
     none

REFINEMENTS:
     /flush -- Purge the indexes from the cache after writing
               (further operations will cause them to be re-read)
asis]

[p example:
[asis/style/*
    skimp/write-cache-all
    skimp/write-cache-all/flush
asis]


[h3 flush-cache

[p Clears an index from the cache without first writing it to
permanent storage.

[p If you have been updating an index using the **/defer** refinement,
this is a way of purging all the changes since the last **write-cache**.

[p If you have an index open purely for reading (ie not one you
are updating in the current task) this is a way of forcing the
index to be reread from permanent storage. That may be of use
if either:
[li the index is being updated by another task, and you want
to read the updated version; or
[li you have a large index and wish to release the memory
it is occupying
list]

[asis/style/*
USAGE:
    skimp/flush-cache index-name

ARGUMENTS:
     index-name -- name of the index (Type: file)

asis]

[p example:
[asis/style/*
    skimp/flush-cache %my-index-1
asis]


[h3 flush-cache-all

[p Clears all indexes from the cache

[asis/style/*
USAGE:
    skimp/flush-cache-all

ARGUMENTS:
     none

asis]

[p example:
[asis/style/*
    skimp/flush-cache-all
asis]


div] [div/style/*-1 [h2 what is a word?

[p skimp indexes the **words** in **documents**. So an obvious question is:
how does it know what a word is?

[p The simple answer is:
[li 1. skimp uses **make-word-list.r**
[li 2. **make-word-list** is highly configurable, so you can tweak it to fit
your needs
[li 3. if you need more tweaks than are possible with the configuration
parameters, you can provide your own function to extract the words
in a document.
[li 4. In addition, the **block** format of **add-words** and **add-bulk-words**
lets you directly state what the indexable words are.
list]

[p In principle, a **word** can be **any** string of **one or more** characters. You are
not limited to letters, numerals or even printable characters.

[p Your definition of what a word is (including any function you have supplied
to extract the words in a document) is **held in the index itself**. So the
definition of a word remains stable and constant throughout the life of the index.

[p There are two functions for managing the definition of a word for an index:

[li  **set-word-definition**  -- add or update the definition of a word
[li  **extract-words-from-string** -- test how an index defines a word
list]

In addition, **get-index-information** (see above) returns you the current definition
for any given index.


[h3 set-word-definition

[p Defines what a word is for a given index

[asis/style/*
USAGE:
    skimp/set-word-definition index-name /parameters parm-obj /make-word-list mwlf /defer

ARGUMENTS:
     index-name -- name of the index (Type: file)

REFINEMENTS:
     /parameters -- extend/change definition for make-word-list function
         parm-obj (Type: object)

     /make-word-list -- replacement function
         mwlf (Type: function none)

     /defer -- Do not write to permanent storage
asis]

[p examples:
[asis/style/*
    skimp/set-word-definition/parameters %my-index
         make object! [
             initial-letters: charset compose [#"a" - #"z" #"A" - #"Z" (to-char 199) (to-char 231)]
             ]

    skimp/set-word-definition/make-word-list %my-index
          func [parms string] [return unique parse/all string " "]
asis]

[li the first example changes the default parameters for
**make-word-list** to add chars 199 and 231 (C-cedilla) to the letters
that make up a word
[li the second example replaces the default **make-word-list** function
with a custom (and very basic) function to find the words in a document.
[li these options are explained in more detail below
list]

[h4 changing the word definition for an existing index

[p You can do this, but it may not be wise. It will not affect any of the
words already indexed, but it may affect your ability to search for them.

[p For example, you build an index over 1000 documents in which decimal numbers
are not treated as words. Later you change the definition so they are treated
as words, and add another 1000 documents. **Only the latter 1000 documents will
have their numbers indexed**.

[h4 parameter object

[p The **parameter object** is passed to the **make-word-list** function. By default
it is similar to the default object used by make-word-list.

[p You can find skimp's settings for an index like this:
[asis/style/*
  probe get in skimp/get-index-information %my-index 'word-parameters
asis]


[h3 make-word-list function

[p By default, skimp looks for **%make-word-list.r** in the current folder:
[li if found: skimp uses it as the default make-word-list function
[li if not, it uses a fairly useless minimal function:
[asis
     func [ parms  string [string!]
    ][
      return unique sort parse/all trim/lines copy string " "
    ]
asis]

[p If you want skimp to use another function that you have written, you can specify
 it with **set-word-definition/make-word-list**.

[li You must supply a **function** that takes two arguments and one refinement:
[list
[li **an object** -- you will be passed the **parameter object** as described above.
You are, of course, free to ignore it.
[li **a string** -- the string that you are required to break into indexable words
[li **/for-search** -- a refinement. Described below.
list]
[li Your function will be saved in the index header, and later executed from there.
**So it must be self-contained, making no references outside of itself**.
[li Your function must return a **block** of zero or more **unique** strings. These are the
words you have extracted from the input string.
[li Your returned block may contain the empty string (""""). If so, it will be ignored.
list]

[p example:

[asis/style/*
  skimp/set-word-definition/make-word-list %my-index
      func [
         obj [object!]
         str [string!]
         /for-search
      ][
      return unique parse/all str "."
      ]
asis]

[p The example is hardly the most useful word-extracting function in the world :-)

[p To test your function (or the built-in function) use **extract-words-from-string**
 (See later).


[h4 /for-search refinement

[p There are two subtly different reasons why you may want to extract the words
from a string:
[li the string is the source document that you want to index
[li the string is a user-supplied search string that you want to turn into
separate words so you can search for them using **find-words**
list]

[p The two uses **may** in your application need to behave slightly differently,
especially with regard to handling the **negate prefix**.

As an example, the built-in make-word-list function acts differently when given
a string that contains tildes -- a leading tilde is preserved with the **/for-search**
refinement:

[asis/style/*
    skimp/extract-words-from-string %my-index "I have some ~tildes in t~~~his ~string~"
    == ["have" "I" "in" "some" **"string~"** "tildes" "t~~~his"]
    skimp/extract-words-from-string/for-search %my-index "I have some ~tildes in t~~~his ~string~"
    == ["have" "I" "in" "some" "t~~~his" **"~string~"** "~tildes"]
asis]
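
[p As a rough sketch of how a replacement function might honour **/for-search**, the
following keeps tildes only when extracting search words. The name **my-make-word-list**
and the choice of word characters are assumptions made for illustration. This is not the
built-in function, and unlike the built-in it also keeps tildes inside and at the end of a
search word.

[asis/style/*
  my-make-word-list: func [
       obj [object!]     ; parameter object: ignored in this sketch
       str [string!]     ; text to break into words
       /for-search       ; set when str is a user's search string
       /local word-chars words word
  ][
       ; assumption: a word is a run of letters and digits,
       ; plus tildes when extracting search words
       word-chars: charset either for-search [
            "abcdefghijklmnopqrstuvwxyz0123456789~"
       ][
            "abcdefghijklmnopqrstuvwxyz0123456789"
       ]
       words: copy []
       ; work on a lowercased copy so the caller's string is untouched
       parse/all lowercase copy str [
            any [
                 copy word some word-chars (append words word)
                 | skip
            ]
       ]
       return unique words
  ]

  my-make-word-list make object! [] "Cats ~dogs"
  ;; expected: ["cats" "dogs"]
  my-make-word-list/for-search make object! [] "Cats ~dogs"
  ;; expected: ["cats" "~dogs"]
asis]

[p The function refers only to its own arguments and locals, so it should meet the
self-contained rule given above.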


[h3 extract-words-from-string

[p We've just about covered this above. Provides access to the **make-word-list** function
in an index. Allows you to:
[li analyse the built-in behavior
[li see the effects of changing the **parameters** with **set-word-definition**
[li test any replacement make-word-list function you have written
list]

[asis/style/*
USAGE:
    skimp/extract-words-from-string index-name string /for-search

ARGUMENTS:
     index-name -- name of the index (Type: file)
     string -- string from which to extract the words (Type: string)

REFINEMENTS:
     /for-search -- Says if words need to be extracted for
                    searching or indexing
asis]

[p example:
[asis/style/*
    skimp/extract-words-from-string %my-index {[here_are (some words) ""to"" $index!}
    == ["are" "here" "index!" "some" "to" "words"]
asis]


div] [div/style/*-1 [h2 set-config


 [h3 basic index configuration

[p There are three magic settings you can apply when creating an
index that may affect its performance.

[p Once these are set, they cannot be changed for the lifetime of the
index (See **skimp-tools** later for possible exceptions).  The settings
do not matter much while you are evaluating or playing with skimp. But you
may want to run benchmarks on large data sets before building any
skimp indexes for critical work.

[p The three settings are:

[li **index-levels** -- how deep an index to build
[li **integer-document-names** -- whether document names are **string!s** or **integer!s**
[li **one-file** -- whether the index consists of one file or a series of files
list]

[p The meaning of these settings is discussed below.

[p To find the settings for an index, use **get-index-information**. The
values are returned in the **config** object:

[asis/style/*
     index-info: skimp/get-index-information %my-index
     probe index-info/config
asis]

[p To set or change a value, use **set-config** (see below).

[p As you can only set these values on a new index (one with no words in it), you
may need to surround any code that sets them with a test for whether the index already exists:

[asis/style/*
    if not skimp/index-exists? %my-index [
         ; .... code to set config values
    ]
asis]


[h3 set-config

[p Sets or changes the value of a basic configuration variable.

[asis/style/*
USAGE:
    skimp/set-config index-name config /defer

ARGUMENTS:
     index-name -- name of the index (Type: file)
     config -- object containing your configuration settings (Type: object)

REFINEMENTS:
     /defer -- Do not write to permanent storage
asis]

[p examples:
[asis/style/*
    probe skimp/set-config %my-index make object! [index-levels: 6]
    probe skimp/set-config %my-index make object! [index-levels: 3 one-file: true]
asis]

[p The response is the newly updated configuration object.

[p Note that you do not need to specify all the configuration values.
The ones you omit will not be changed.
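
[p Putting the two together, here is a minimal sketch of guarding **set-config** with
**index-exists?**, so that the settings are applied only before the index is first built.
The setting values are arbitrary examples, not recommendations:

[asis/style/*
    if not skimp/index-exists? %my-index [
         ; only reached for a brand new index
         probe skimp/set-config %my-index make object! [
              index-levels: 3
              one-file: true
         ]
    ]
asis]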


[h3 set-config/one-file

[p The **one-file** config setting sets whether the index is created as a single
file or a series of files.

[li **one-file: true** -- the index will be created as one file
[li **one-file: false** -- there will be one file for the index header, plus one
file for each first character indexed. (If your index indexes words beginning a, b and c,
there will be four files in the index: header, a-file, b-file, c-file.) The index file names
are made up as **index-name-N.sif** where N is **to-integer to-char word/1**
list]

[p example:
[asis/style/*
    probe skimp/set-config %my-index make object! [one-file: true]
asis]
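
[p To make the file-naming rule concrete, here is a small sketch that works out which file
of a multiple-file index would hold a given word. The helper name **index-file-for** is
hypothetical (it is not part of skimp), and lowercasing the word first is an assumption
based on indexes being case insensitive:

[asis/style/*
    index-file-for: func [
         index-name [file!]
         word [string!]
    ][
         ; N is the character code of the word's first letter
         to-file rejoin [
              index-name "-" to-integer first lowercase copy word ".sif"
         ]
    ]

    index-file-for %my-index "cat"    ;; expected: %my-index-99.sif
asis]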

[h4 what's the point?

[p The main point is speed of retrieval. skimp was written to perform well when running
queries on a webserver. A typical search on a web server consists of just a few words.
With a multiple-file index, we need load and initialise only the files for the first letters
of those words, thus ensuring less i/o and cpu time.

[p The same advantage may apply in a desktop application: rather than having the whole
index loaded at application start-up or on the first query, just the relevant parts need
be loaded.

[p There are similar advantages when updating an index -- if a new document has words that
begin with just [a b c] then only those three files will be written back.

[p **skimp-tools** (See later) will provide a way of flipping this setting
after an index has been created.


[h3 set-config/index-levels

[p Changes the default number of levels in the index.

[p The **skimp-tools** documentation will contain some more detailed information
on the internals of the skimp index. A brief explanation for now:

[p The index is a bit like (not identical to) a **b-tree**. If you have a
three-level index, then the words REBOL and REBEL are indexed like this:

[asis/style/* "r"  ===> "e"  ===> "b"  ===> ["el" [1 2 3] "ol" [1 4 58x100]] asis]

[p That is: there are three levels of index (for the characters R, E and B).
They point you to a list that contains the other letters of words that begin
REB, plus an index of which documents contain them (REBEL is in docs 1, 2,
 and 3; REBOL is in docs 1, 4 and 58 through 157).
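
[p A sketch of the two ideas in that example, assuming REBOL 2 (where pair coordinates are
integers). Both helper names are hypothetical, and skimp's real internals may differ.
**split-word** separates a word into its index-level characters and the remaining letters.
**expand-range** turns a pair such as 58x100 into the run of document ids it stands for,
reading it as 100 ids starting at 58, as the description above implies:

[asis/style/*
    split-word: func [word [string!] levels [integer!]][
         ; first part: the characters used as index levels
         ; second part: the letters stored in the final list
         reduce [copy/part word levels  copy skip word levels]
    ]

    split-word "rebol" 3     ;; expected: ["reb" "ol"]
    split-word "rebel" 2     ;; expected: ["re" "bel"]

    expand-range: func [range [pair!] /local ids][
         ids: copy []
         ; range/x is the first id, range/y is how many ids follow on
         repeat n range/y [append ids range/x + n - 1]
         ids
    ]

    ids: expand-range 58x100
    print [first ids last ids length? ids]    ;; expected: 58 157 100
asis]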


[p **index-levels** lets you set the number of levels of index:
[li the default is 3 (like the example above)
[li this value cannot be changed once any words are indexed (at least, not
without recreating the whole index)
[li the minimum is 1
[li there is no maximum, but anything above 4 is likely to be adding more layers
than you could ever need.
list]

[p example:
[asis/style/*
    probe skimp/set-config %my-index make object! [index-levels: 2]
asis]


[h4 what's the best setting?

[p You'll need to experiment with your live data. In practice, the best **index-levels**
setting makes between 10% and 20% difference to file sizes and retrieval time, so the
setting is unlikely to make a crucial difference. (Your experience may vary with indexes that are
 many megabytes in size).


[h3 set-config/integer-document-names

[p In all the examples so far, **document names** have been **string!s**,
though there have been some hints that they can also be integers.

[p Internally, skimp uses **document ids**, which are always integers.

[p skimp uses a **document name** to **document id** table to translate between
them.

[p So, if it happens that your document names are integers that meet the
requirements below, you can eliminate that conversion table. The requirements are:
[li must be an integer
[li must be 1 or higher (ie not zero or negative)
[li it will help if they are more-or-less consecutive, though they do not need
to start from 1. If your document ids are not more-or-less consecutive, you may
be better off treating them as strings.
list]

[p examples:
[asis/style/*
    skimp/set-config %my-index make object! [integer-document-names: true]
    skimp/add-words %my-index 55 "this document name is an integer"
asis]


div] [div/style/*-1 [h2 limitations

[p Just so you know:
[li All indexes are **case insensitive** (words are folded to **lowercase**)
[li skimp indexes the presence or absence of a word in a document. It does not
note whether words are near to each other.
[li The empty string is **never** indexed:
    [asis/style/*
        add-words %my-index ["" "hello"]
     asis]
     indexes only **hello**
[li **remove-document** de-indexes an entire document. There is no current way
to remove just some of the words.
[li The definition of what a ""word"" is is crucial to the operation of an index;
you may need to spend time working on this as part of any indexing
project you are planning.
[li **find-words** ANDs the results together. If you want a search that uses OR,
you need to make multiple calls to find-words, and merge the results, eg:
  [asis/style/*
      ; union merges two result blocks and drops duplicates
      sort union
                  find-words %my-index ["cat"]
                  union
                      find-words %my-index ["dog" "~horse"]
                      find-words %my-index ["pony" "wombat"]
  asis]
  Finds all documents that contain cat **OR** (dog **AND NOT** horse) **OR** (pony
**AND** wombat)
list]

div] [div/style/*-1 [h2 Coming soon

[p We're hoping to release two extras in the next few weeks. Look out for
them on REBOL.org:

[li **skimp-my-altme.r** -- demo application
[li **skimp-tools.r** -- extra skimp facilities
list]

[h3 skimp-my-altme.r

[p Is a demonstration application. It will consist of an API to index the
posts in **Altme worlds**. Depending on time, there may be a cheapo GUI in front
of it. **If you would like to help write a decent front end to the Altme world
 indexer, please contact me today**.


[h3 skimp-tools.r

[p This will extend the **skimp** function to provide features a **developer**
may need when writing applications that use skimp; or if extending the basics of
skimp itself. They will include:

[li **integrity-check** -- a scan of an index to find any build problems;
[li **flip-one-file** -- an option to change an index from one large file
to a set of smaller ones;
[li **rename-document** -- a way to change **document-names** without having to
delete and reindex a document.

list]

[p The accompanying document will include some developer's notes on
the internals of a skimp index.


div]
div]