
Documentation for: skimp.r


     script: skimp.r
      title: Simple keyword index management program
     author: Sunanda
       date: 23-apr-2007
    Version: 0.0.2
 

purpose

skimp.r lets you index the words in a set of documents. You can then retrieve the list of documents that contain (or do not contain) specific words.

skimp.r is used extensively inside REBOL.org for many of its search indexes.

1. thanks and credits

  • Christian, Romano and everyone on the Mailing List who contributed to rse-ids.r
  • Peter Wood for writing the make-word-list.r script, and creating test data cases for rse-ids, skimp and make-word-list.

2. quick start highlights

2.1. Version summary

  • 0.0.0 11-aug-2005 Written for internal use only
  • 0.0.1 3-apr-2007 First public release; uses rse-ids.r and make-word-list.r
  • 0.0.2 23-apr-2007 Add flush-cache and flush-cache-all

2.2. sample data

Throughout parts of this documentation we use the following data structure as an example: it is a block with document-names followed by the document's content.

   skimp-test-data: [
    "einstein-1" "You see, the wire telegraph is a kind of a very, very long cat."
      "saying-1" "Managing programmers is like herding cats."
    "einstein-1" "You pull his tail in New York and his head is meowing in Los Angeles."
       "lyric-1" "I thought I saw a puddy cat a-creeping up on me"
    "einstein-1" "Do you understand this?"
     "medical-1" "The cat-scan showed nothing unusual"
    "einstein-1" "And radio operates exactly the same way: you send signals here, they receive them there."
      "saying-2" "Dogs see people as companions; cats see people as staff."
    "einstein-1" "The only difference is that there is no cat."
      "cliche-1" "It's been raining cats and dogs since last night!"
    "doggerel-1" "The cat sat on the mat"
      "saying-3" "While the cat's away, the mice will play"
     ]
 

Note that the document einstein-1 appears several times in the set. This is to help illustrate what happens when skimp's index is updated for the same document.

2.3. build an index

Let's build a basic skimp index from that data set:

       ;; ensure folder exists for index
   if not exists? %skimp-folder [make-dir %skimp-folder]

   index-name: %skimp-folder/skimp-test-index

       ;; add each document to the index
   foreach [document-name contents] skimp-test-data [
       skimp/add-words index-name document-name contents
       ]
 

You should now have a set of files in the %skimp-folder folder. Their names look like: skimp-test-index-117.sif

2.4. search index for words

We can now use skimp to find words in documents:

    probe skimp/find-word index-name "cat"
    == ["saying-3" "doggerel-1" "medical-1" "lyric-1" "einstein-1"]

    probe skimp/find-word index-name "the"
    == ["saying-3" "doggerel-1" "medical-1" "einstein-1"]

    probe skimp/find-word index-name "cats."
    == []               ;; "cats." is not a word according to the default assumptions

    probe skimp/find-word index-name "cat-scan"
    == ["medical-1"]   ;; "cat-scan" is one word

    probe skimp/find-word index-name "scan"
    == ["medical-1"]   ;; the "scan" part of "cat-scan" is also a findable word

    probe skimp/find-word index-name "difference"
    == ["einstein-1"]

        ;; note: find-WORDS takes a block (find-WORD takes a string)
    probe skimp/find-words index-name ["a" "cat"]
    == ["lyric-1" "einstein-1"]

    probe skimp/find-word index-name "~dog"
    == ["einstein-1" "saying-1" "lyric-1" "medical-1" "saying-2" "cliche-1" "doggerel-1" "saying-3"]

 

Notes

  • both find-word and find-words return a block of document-names
  • all words in the index are lowercase -- (skimp is case insensitive)
  • skimp has applied some logic to extract words from the document content. The logic is highly flexible, and can be completely replaced -- so you can easily use skimp with your own definition of what a word is -- see what is a word? later
  • all "parts" of the einstein-1 document have been added
  • use find-word to find all documents containing a given word
  • use find-words to find all documents containing one or more words (the results are ANDed together -- so all words must occur for there to be a match).
  • prefix a word with a tilde (eg ~dog) to find all documents that do not contain the word (the tilde is configurable if you want to use something else)

2.5. retrieve more from the index

      ;; all indexed words beginning with an a:
   probe skimp/get-indexed-words index-name "a"
   == ["a" "a-creeping" "and" "angeles" "as" "away"]

      ;; ditto, with the names of documents that contain the words:
   probe skimp/get-indexed-words/document-list index-name "a"
   == ["a" ["einstein-1" "lyric-1"] "a-creeping" ["lyric-1"]
       "and" ["cliche-1" "einstein-1"] "angeles" ["einstein-1"]
       "as" ["saying-2"] "away" ["saying-3"]]

      ;; all words beginning t in einstein-1:
   probe skimp/get-indexed-words-for-document index-name "einstein-1" "t"
   == ["tail" "telegraph" "that" "the" "them" "there" "they" "this?"]
 

2.6. remove a document

You can remove a document entirely like this:

   probe skimp/find-word index-name "difference"
   == ["einstein-1"]      ;; finds one document
   probe skimp/remove-document index-name "einstein-1"
   probe skimp/find-word index-name "difference"
   == []                  ;; finds no document
 

2.7. other highlights

In summary:

  • you can have many indexes open at the same time
  • you can check if an index exists, and delete an entire index
  • you can get information about an index, and you can save your own data in an index header
  • add-bulk-words gives you a faster way of adding many documents at the same time
  • there are various caching and configuration options to tune an index to your needs

3. usage in detail

3.1. starting skimp

Simply use:

   do %skimp.r

  • You will need rse-ids.r (in the REBOL.org library). rse-ids.r should be already loaded, or available in the same folder as skimp so skimp.r can load it.
  • For best results, you'll need make-word-list.r (also in the REBOL.org library). make-word-list provides a flexible and configurable way of extracting all the "words" from a document.

3.2. closing down

No need to do anything, unless you have used updates with the /defer refinement. If so, you need to write back all open indexes:

   skimp/write-cache-all/flush

You could then, if you wish, remove skimp from memory:

   unset 'skimp
   unset 'rse-ids
 

3.3. configuring skimp

skimp itself needs very little configuring for use (you can extensively configure individual indexes -- see later).

There are only two global configuration settings: the prefix and the suffix used in the file names of all indexes:

   do %skimp.r
   skimp/index-name-prefix: "live-build-"
   skimp/index-name-suffix: ".index"
 

With the example above, the files for index my-index will be called live-build-my-index-nnn.index. The defaults are:

  • index-name-prefix -- none (ie null string)
  • index-name-suffix -- .sif (Skimp Index File)

4. creating, configuring and closing an index

4.1. creating a skimp index

You do not need to do anything to create an index -- simply start using it and it will be created automatically.

The only pre-requisite is that the folder to store the index exists....

   skimp/add-words %/c/dev/indexes/ind-1 "einstein-1" "You see"
 

.... will create an index called ind-1 in the folder /c/dev/indexes -- provided that folder already exists.

4.2. configuring a skimp index

There are some settings you may wish to apply before adding any documents to an index; and there are some that you have to apply before adding any documents.

It is best to configure the index before performing other operations on it. You can use skimp/index-exists? to see if an index exists or not. If it does not, you are able to apply the configuration settings:

   if not skimp/index-exists? %a-test-index [
       skimp/set-word-definition .... ;; define what a "word" is in this index
       ]
 
The configuration settings are defined later. In summary, they are:
  • skimp/set-config -- performance settings for the index
  • skimp/set-word-definition -- define what a "word" is for the index

4.3. closing a skimp index

    without caching: if you are not using the skimp cache, then no explicit close is needed: all data is written back on each update operation, so the data on disk is current.

    with caching: if one or more indexes are cached (see below for details) you need to write the cache out before ending your program, eg:

        skimp/write-cache/flush %test-index-1
        skimp/write-cache/flush %test-index-2
     

    If you've not been keeping a close eye on what indexes you have opened and cached, use write-cache-all to write them all back to permanent storage:

        skimp/write-cache-all/flush
     

    5. caching one or more indexes

    5.1. /defer

    These operations can all take the /defer refinement:

    • add-words -- add a document and its words
    • add-bulk-words -- add several documents and their words in one go
    • remove-document -- remove a document and all its words
    • remove-documents -- remove several documents and all their words
    • set-config -- performance settings for an index
    • set-word-definition -- define what a "word" is
    • set-owner-data -- any fields you want to keep in the index header

    Using /defer can speed things up, as skimp does not write the index files back to permanent storage until you issue either a skimp/write-cache or a skimp/write-cache-all.

    As explained above:

    • write-cache takes as an argument the name of an index. It writes that index -- provided it was in the cache.
    • write-cache-all has no arguments. It writes back all indexes found in the cache.
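The deferred-update pattern described in this section might look like the sketch below. The folder, index and document names are made up for illustration, and the sketch assumes skimp.r (with rse-ids.r) can be loaded from the current folder.

```rebol
do %skimp.r

;; hypothetical folder and index name, used only for this sketch
if not exists? %skimp-folder [make-dir %skimp-folder]
index-name: %skimp-folder/deferred-demo

;; each update is held in the cache; nothing is written to disk yet
skimp/add-words/defer index-name "doc-1" "The cat sat on the mat"
skimp/add-words/defer index-name "doc-2" "Dogs see people as companions"
skimp/remove-document/defer index-name "doc-1"

;; one write for all three updates; /flush also drops the index from the cache
skimp/write-cache/flush index-name
```

If the index will be used again straight away, leaving off /flush keeps it in the cache and avoids rereading it.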

    5.2. /flush

    Both write-cache and write-cache-all can take the /flush refinement.

    • with /flush: the index files are removed from the cache -- so any future operations will reread the index files from permanent storage.
    • without /flush: the index files are written out and retained in the cache.

    6. adding and removing documents from an index

    6.1. overview

    There are four main functions for this:

    • add-words -- adds all the words in a document
    • add-bulk-words -- add all the words in multiple documents
    • remove-document -- remove a document from the index
    • remove-documents -- remove several documents from the index

    As noted above, each can take the /defer refinement. If you have a series of operations to perform on an index, using /defer can speed things up considerably.

    6.2. document names

    By default, a document's name must be a string! -- previous examples include "einstein-1" and "saying-1".

    There is a configuration setting that allows document-names to be integer!. That is explained later (See set-config/integer-document-names). All the document-names in one index must be of the same datatype.

    6.3. add-words

    Adds all the words in one document to an index.

    If the document is already in the index, it adds all the new words found in the latest version of the document, but does not remove any words that are no longer found in it: to replace a document in an index, use remove-document followed by add-words.

     USAGE:
         skimp/add-words index-name document-name words /defer
    
     ARGUMENTS:
          index-name -- name of the index (Type: file)
          document-name -- name of the document (Type: string integer)
          words -- words to add (Type: string block)
    
     REFINEMENTS:
          /defer -- Do not write to permanent storage
     

    The words can be supplied as either a string! or a block!

    • if a string! -- skimp parses the string to extract the words. You can override skimp's definition of a word in several ways -- see what is a word?
    • if a block! -- skimp expects a block! of string!s -- each string is added to the index as is. This allows you complete control over what is considered a "word".

    examples:

        skimp/add-words %my-index "parrot" "->: I wish to make a complaint -:)"
        skimp/add-words %my-index "smilies" ["-:)" "->:" "@/."]
     

    In the smilies document, the various keyboard symbols are treated as indexable words, while in parrot they are not.
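Because add-words never removes words that have dropped out of a document, replacing a document takes two calls. A minimal sketch of a helper that does both follows; the name replace-document is hypothetical and is not part of skimp.

```rebol
;; hypothetical helper, not part of skimp.r
replace-document: func [
    index-name    [file!]
    document-name [string! integer!]
    words         [string! block!]
][
    ;; drop every word currently indexed for the document
    ;; (remove-document takes no action if the document is not in the index)
    skimp/remove-document index-name document-name
    ;; then index the latest version from scratch
    skimp/add-words index-name document-name words
]

replace-document %my-index "parrot" "This parrot is no more"
```

When many documents are replaced in one run, adding /defer to both calls and finishing with a single skimp/write-cache should reduce the number of writes.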

    6.4. add-bulk-words

    Adds all the words in zero or more documents to an index.

    Very similar to add-words except add-bulk-words accepts more than one document.

     USAGE:
         skimp/add-bulk-words index-name data-set /defer
    
     ARGUMENTS:
          index-name -- name of the index (Type: file)
          data-set -- set of details to add (Type: block)
    
     REFINEMENTS:
          /defer -- Do not write to permanent storage
     

    The data-set is a block with pairs of entries: document-name followed by words for that document.

    As with add-words the words can be string! or block!

    example:

        skimp/add-bulk-words %my-index [
                "parrot" "I wish to make a complaint -:)"
               "smilies" ["-:)" "->:" "@/."]
           ]
     

    The example adds the same two documents as the add-words example.

    6.5. choosing between add-words and add-bulk-words

    • If you have multiple documents to add, add-bulk-words can be up to five times faster than add-words
    • multiple add-words gives an interactive application more chances to report progress to the waiting user
    • with too many documents at once (say over 30, though you will need to test for your application, as document size is a big factor), add-bulk-words can be slower than add-words.
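One way to balance these trade-offs is to feed add-bulk-words in fixed-size batches. The sketch below is an illustration only: the helper name add-in-batches and the batch size of 30 are assumptions taken from the rough figure above, and the right size needs to be found by testing.

```rebol
batch-size: 30   ;; documents per call; an assumed starting point, tune by testing

;; hypothetical helper, not part of skimp.r
add-in-batches: func [
    index-name [file!]
    data-set   [block!]   ;; pairs of entries: document-name, words
    /local batch
][
    while [not tail? data-set] [
        ;; two entries per document, so a batch is batch-size * 2 values
        batch: copy/part data-set batch-size * 2
        skimp/add-bulk-words/defer index-name batch
        print ["added" (length? batch) / 2 "documents"]   ;; progress report
        data-set: skip data-set batch-size * 2
    ]
    ;; write the index once, after the last batch
    skimp/write-cache index-name
]

add-in-batches %my-index skimp-test-data
```

The print between batches gives an interactive application a chance to report progress, which a single large add-bulk-words call does not.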

    6.6. remove-document

    Removes all the words in one document from an index.

    If the document is not in the index, then no action is taken.

     USAGE:
         skimp/remove-document index-name document-name  /defer
    
     ARGUMENTS:
          index-name -- name of the index (Type: file)
          document-name -- name of the document (Type: string integer)
    
     REFINEMENTS:
          /defer -- Do not write to permanent storage
     

    examples:

        skimp/remove-document %my-index "parrot"
        skimp/remove-document %my-index "einstein-1"
     

    question: Can I remove just some of the words of a document from an index?

    response: No, not in this version of skimp.

    question: How do I delete an entire index?

    response: See remove-index below.

    6.7. remove-documents

    Removes all the words in zero or more documents from an index.

    Document names which are not in the index are ignored.

     USAGE:
         skimp/remove-documents index-name document-list  /defer
    
     ARGUMENTS:
          index-name -- name of the index (Type: file)
          document-list -- names of the documents (Type: block)
    
     REFINEMENTS:
          /defer -- Do not write to permanent storage
     

    example:

        skimp/remove-documents %my-index ["parrot" "einstein-1"]
     

    6.8. remove-document vs remove-documents

    remove-document is a courtesy wrapper for remove-documents, so it does not matter if you use remove-documents with just one document name; the effect is the same.

    If you have multiple documents to remove it is usually much faster to use remove-documents once rather than remove-document multiple times.

    That is because skimp needs to scan the entire index for each remove operation. If you are removing 50 documents, that is 50 scans using remove-document rather than one single scan using remove-documents.

    7. searching for documents

    To find the documents that contain specific words, use either:

    • find-word -- searches for a specific word
    • find-words -- searches for one or more words

    In both cases, a search can be negated (ie find all the documents that do not contain a word) by using the negate prefix.

    You can also search more widely, using:

    • get-indexed-words -- find all the words that are contained in the index
    • get-indexed-words-for-document -- find all the words indexed for a specific document

    7.1. find-word

    Returns a block of document-names that contain the given word.

     USAGE:
         skimp/find-word index-name word
    
     ARGUMENTS:
          index-name -- name of the index (Type: file)
          word -- word to search for (Type: string)
     

    examples:

        skimp/find-word %my-index "cat"
        skimp/find-word %my-index "~cat"
     
    • The first example returns the names of all documents containing the word "cat".
    • The second returns all those that do not contain "cat".

    question: How can I know if the word I am looking for is well-formed? ie that it is formed to the same parsing rules as add-words?

    response: Good question! One way is to use extract-words-from-string which is explained later on. It allows you access to the same processing as add-words. (If you do use extract-words-from-string, it returns a block so you need to use it in conjunction with find-words below).

    7.2. find-words

    Returns a block of document-names that contain all the words.

     USAGE:
         skimp/find-words index-name words
    
     ARGUMENTS:
          index-name -- name of the index (Type: file)
          words -- words to search for (Type: block)
     

    examples:

        skimp/find-words %my-index ["cat" "sat" "mat"]
        skimp/find-words %my-index ["~cat" "sat" "mat"]
        skimp/find-words %my-index
            skimp/extract-words-from-string/for-search %my-index "~Cat, sat mat."
     
    • The first example returns the names of all documents containing the three words "cat", "sat" and "mat"
    • The second returns all those that contain "sat" and "mat" but do not contain "cat".
    • The third example is the same as the second (assuming the definition of a word in force excludes punctuation) except that it uses extract-words-from-string to turn a user-supplied search string into a block of words.
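The third example can be wrapped into a small search helper. This is a sketch under stated assumptions: the name search-index is hypothetical, and returning an empty block for a missing index or an empty query is a design choice made here, not documented skimp behavior.

```rebol
;; hypothetical helper, not part of skimp.r
search-index: func [
    index-name [file!]
    query      [string!]   ;; raw text typed by the user
    /local words
][
    if not skimp/index-exists? index-name [return copy []]
    ;; split the query with the same rules the index used when it was built;
    ;; /for-search keeps a leading negate prefix on each word
    words: skimp/extract-words-from-string/for-search index-name query
    if empty? words [return copy []]
    skimp/find-words index-name words
]

probe search-index %my-index "~Cat, sat mat."
```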

    7.3. get-indexed-words

    Returns a block of all the words in the index that begin with a specific character. Optionally, also return a block of the matching document names.

     USAGE:
         skimp/get-indexed-words index-name character /document-list
    
     ARGUMENTS:
          index-name -- name of the index (Type: file)
          character -- first character of the words being sought (Type: char string)
    
     REFINEMENTS:
          /document-list -- Also return the matching document names
     

    examples:

        skimp/get-indexed-words %my-index "n"
        skimp/get-indexed-words/document-list %my-index "n"
     

    7.3.1. what first letters exist in the index?

    get-index-information (see later) provides a list of all first characters that exist in the index. This code will display all the words in the index:

        index-info: skimp/get-index-information %my-index
        foreach first-char index-info/top-index [
            probe skimp/get-indexed-words %my-index first-char
            ]
     

    7.4. get-indexed-words-for-document

    Returns a block of all the words in a specific document that match a specific first character.

     USAGE:
         skimp/get-indexed-words-for-document index-name document-name character
    
     ARGUMENTS:
          index-name -- name of the index (Type: file)
          document-name -- name of document (Type: integer string)
          character -- first character of the words being sought (Type: char string)
    
     

    examples:

        skimp/get-indexed-words-for-document %my-index "einstein-1" "b"
        skimp/get-indexed-words-for-document %my-index "saying-1" "m"
     

    7.4.1. what documents are in the index?

    Skipping ahead, get-index-information is one way to get a list of all document names.
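An alternative that relies only on calls documented in this section is to walk every first character in top-index and collect the document names returned by get-indexed-words/document-list. The helper name list-documents is hypothetical, and the approach reads the whole index, so it may be slow on a large one.

```rebol
;; hypothetical helper, not part of skimp.r
list-documents: func [
    index-name [file!]
    /local info names
][
    names: copy []
    ;; get-index-information returns none if the index does not exist
    if not object? info: skimp/get-index-information index-name [return names]
    foreach first-char info/top-index [
        ;; /document-list returns pairs: a word, then a block of document names
        foreach [word doc-names] skimp/get-indexed-words/document-list index-name first-char [
            append names doc-names
        ]
    ]
    unique names   ;; each document name once
]

probe list-documents %my-index
```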

    8. other index management functions

    These functions allow you to manage your skimp indexes efficiently and effectively:

    • index-exists? -- checks if an index exists or not
    • get-index-information -- returns header information about an index
    • remove-index -- delete an index entirely
    • set-owner-data -- add your own index-specific data about an index
    • write-cache -- write out one specific index
    • write-cache-all -- write out all open indexes
    • flush-cache -- empty one specific index
    • flush-cache-all -- empty all open indexes

    8.1. index-exists?

    Returns true or false depending on whether an index exists or not

     USAGE:
         skimp/index-exists? index-name
    
     ARGUMENTS:
          index-name -- name of the index (Type: file)
    
     

    example:

       if not skimp/index-exists? %my-indexes/index-1 [
           print "sorry -- not yet created"
           ]
     

    8.2. get-index-information

    Returns an object! containing information about the index, or none if the index does not exist.

     USAGE:
         skimp/get-index-information index-name
    
     ARGUMENTS:
          index-name -- name of the index (Type: file)
    
     

    example:

       if object? ind-header: skimp/get-index-information %my-indexes/index-1 [
           probe ind-header
           ]
     

    8.2.1. index header object

    Has these fields:

    • index-file (file) -- the name of the index file
    • top-index (block) -- the first characters indexed
    • owner-data (object) -- any owner data you have set with set-owner-data (see later)
    • config (object) -- create-time configuration settings. See later
    • word-parameters (object) -- parameters defining what a word is. See later
    • make-word-list (function) -- actual function used in this index to define what a word is

    8.3. remove-index

    Deletes all files related to an index.

     USAGE:
         skimp/remove-index index-name
    
     ARGUMENTS:
          index-name -- name of the index (Type: file)
    
     

    example:

         skimp/remove-index %my-old-index
     

    8.4. set-owner-data

    Allows you to store an object! in the index header. You could use this to (say) record an extended name for the index, or to add some application-specific notes about the index

     USAGE:
         skimp/set-owner-data index-name owner-data /defer
    
     ARGUMENTS:
          index-name -- name of the index (Type: file)
          owner-data -- your data (Type: object)
    
     REFINEMENTS:
          /defer -- Do not write to permanent storage
     

    example:

         skimp/set-owner-data %my-index make object! [
             last-updated: now
             index-name: "All my photo captions"
             support: "call 555-1234"
             ]
     

    8.4.1. updating existing owner-data

    Simply specify the fields you want to change....

         skimp/set-owner-data %my-index make object! [
             support: "call 555-9876"
             ]
     

    ....Other fields in the owner-data are left unchanged.

    To get the full owner-data record, use get-index-information.
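Reading the owner data back might look like the sketch below. It assumes only the owner-data header field described for get-index-information, and guards against an index that does not exist or that has no owner data stored.

```rebol
info: skimp/get-index-information %my-index

;; info is none when the index does not exist
either all [object? info  object? info/owner-data] [
    probe info/owner-data
][
    print "no owner data stored for this index"
]
```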

    8.5. write-cache

    Writes the data for a specific index to permanent storage. Use if you have been using the /defer refinement on update operations.

     USAGE:
         skimp/write-cache index-name /flush
    
     ARGUMENTS:
          index-name -- name of the index (Type: file)
    
     REFINEMENTS:
          /flush -- Purge the index from the cache after writing
                    (further operations will cause it to be re-read)
     

    examples:

         skimp/write-cache %my-index-1
         skimp/write-cache/flush %my-index-2
     

    8.6. write-cache-all

    Writes the data for all indexes to permanent storage. Use if you have been using the /defer refinement on update operations, and your clean-up routines are unsure of which indexes your software has been updating.

     USAGE:
         skimp/write-cache-all /flush
    
     ARGUMENTS:
          none
    
     REFINEMENTS:
          /flush -- Purge the indexes from the cache after writing
                    (further operations will cause them to be re-read)
     

    example:

         skimp/write-cache-all
         skimp/write-cache-all/flush
     

    8.7. flush-cache

    Clears an index from the cache without first writing it to permanent storage.

    If you have been updating an index using the /defer refinement, this is a way of purging all the changes since the last write-cache.

    If you have an index open purely for reading (ie not one you are updating in the current task) this is a way of forcing the index to be reread from permanent storage. That may be of use if either:

    • the index is being updated by another task, and you want to read the updated version; or
    • you have a large index and wish to release the memory it is occupying

     USAGE:
         skimp/flush-cache index-name
    
     ARGUMENTS:
          index-name -- name of the index (Type: file)
    
     

    example:

         skimp/flush-cache %my-index-1
     

    8.8. flush-cache-all

    Clears all indexes from the cache

     USAGE:
         skimp/flush-cache-all
    
     ARGUMENTS:
          none
    
     

    examples:

         skimp/flush-cache-all
     

    9. what is a word?

    skimp indexes the words in documents. So an obvious question is: how does it know what a word is?

    The simple answer is:

    • 1. skimp uses make-word-list.r
    • 2. make-word-list is highly configurable, so you can tweak it to fit your needs
    • 3. if you need more tweaks than are possible with the configuration parameters, you can provide your own function to extract the words in a document.
    • 4. In addition, the block format of add-words and add-bulk-words lets you directly state what the indexable words are.

    In principle, a word can be any string of one or more characters. You are not limited to letters, numerals or even printable characters.

    Your definition of what a word is (including any function you have supplied to extract the words in a document) is held in the index itself. So the definition of a word remains stable and constant throughout the life of the index.

    There are two functions for managing the definition of a word for an index:

    • set-word-definition -- add or update the definition of a word
    • extract-words-from-string -- test how an index defines a word

    In addition, get-index-information (see above) returns you the current definition for any given index.

    9.1. set-word-definition

    Defines what a word is for a given index

     USAGE:
         skimp/set-word-definition index-name /parameters parm-obj /make-word-list mwlf /defer
    
     ARGUMENTS:
          index-name -- name of the index (Type: file)
    
     REFINEMENTS:
          /parameters -- extend/change definition for make-word-list function
              parm-obj (Type: object)
    
          /make-word-list -- replacement function
              mwlf (Type: function none)
    
          /defer -- Do not write to permanent storage
     

    examples:

         skimp/set-word-definition/parameters %my-index
              make object! [
                  initial-letters: charset compose [#"a" - #"z" #"A" - #"Z" (to-char 199) (to-char 231)]
              ]
    
         skimp/set-word-definition/make-word-list %my-index
               func [parms string] [return unique parse/all string " "]
     
    • the first example changes the default parameters for make-word-list to add chars 199 and 231 (C-cedilla) to the letters that make up a word
    • the second example replaces the default make-word-list function with a custom (and very basic) function to find the words in a document.
    • these options are explained in more detail below

    9.1.1. changing the word definition for an existing index

    You can do this, but it may not be wise. It will not affect any of the words already indexed, but it may affect your ability to search for them.

    For example, you build an index over 1000 documents in which decimal numbers are not treated as words. Later you change the definition so they are treated as words, and add another 1000 documents. Only the latter 1000 documents will have their numbers indexed.

    9.1.2. parameter object

    The parameter object is passed to the make-word-list function. By default it is similar to the default object used by make-word-list.

    You can find skimp's settings for an index like this:

       probe get in skimp/get-index-information %my-index 'word-parameters
    
    make-word-list function

    By default, skimp looks for %make-word-list.r in the current folder:

    • if found: skimp uses it as the default make-word-list function
    • if not, it uses a fairly useless minimal function:

         func [parms string [string!]] [
             return unique sort parse/all trim/lines copy string " "
         ]
     

    If you want skimp to use another function that you have written, you can specify it with set-word-definition/make-word-list.

    • You must supply a function that takes two arguments and one refinement:
      • an object -- you will be passed the parameter object as described above. You are, of course, free to ignore it.
      • a string -- the string that you are required to break into indexable words
      • /for-search -- a refinement. Described below.
    • Your function will be saved in the index header, and later executed from there. So it must be self-contained, making no references outside of itself.
    • Your function must return a block of zero or more unique strings. These are the words you have extracted from the input string.
    • Your returned block may contain the empty string (""). If so, it will be ignored.

    example:

       skimp/set-word-definition/make-word-list %my-index
           func [
              obj [object!]
              str [string!]
              /for-search
           ][
           return unique parse/all str "."
           ]
     

    The example is hardly the most useful word-extracting function in the world :-)

    To test your function (or the built-in function) use extract-words-from-string (See later).

    9.1.3. /for-search refinement

    There are two subtly different reasons why you may want to extract the words from a string:

    • the string is the source document that you want to index
    • the string is a user-supplied search string that you want to turn into separate words so you can search for them using find-words

    The two uses may in your application need to behave slightly differently, especially with regard to handling the not-prefix. As an example, the built-in make-word-list function acts differently when given a string that contains tildes -- a leading tilde is preserved with the /for-search refinement:

         skimp/extract-words-from-string %my-index "I have some ~tildes in t~~~his ~string~"
         == ["have" "I" "in" "some" "string~" "tildes" "t~~~his"]
         skimp/extract-words-from-string/for-search %my-index "I have some ~tildes in t~~~his ~string~"
         == ["have" "I" "in" "some" "t~~~his" "~string~" "~tildes"]
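A replacement make-word-list function can honor /for-search in a similar way. The sketch below is a deliberately simple illustration, not the built-in logic: it splits on spaces only, the name simple-word-list is hypothetical, and stripping exactly one leading tilde when indexing is an assumption chosen for the example.

```rebol
;; hypothetical replacement function; self-contained, as skimp requires
simple-word-list: func [
    parms [object!]    ;; parameter object, ignored in this sketch
    str   [string!]    ;; text to break into words
    /for-search        ;; set when str is a user-supplied search string
    /local words
][
    words: copy []
    foreach word parse/all str " " [
        ;; when indexing, skip one leading tilde;
        ;; when searching, keep it so find-words sees the negate prefix
        if all [not for-search  find/match word "~"] [word: next word]
        append words copy word
    ]
    unique words   ;; empty strings may remain; skimp ignores them
]

skimp/set-word-definition/make-word-list %my-index :simple-word-list
probe skimp/extract-words-from-string/for-search %my-index "~cat sat"
```

Testing the function through extract-words-from-string, both with and without /for-search, is a quick way to check that indexing and searching agree on what a word is.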
     

    9.2. extract-words-from-string

    We've just about covered this above. Provides access to the make-word-list function in an index. Allows you to:

    • analyse the built-in behavior
    • see the effects of changing the parameters with set-word-definition
    • test any replacement make-word-list function you have written

     USAGE:
         skimp/extract-words-from-string index-name string /for-search
    
     ARGUMENTS:
          index-name -- name of the index (Type: file)
          string -- string from which to extract the words (Type: string)
    
     REFINEMENTS:
          /for-search -- Says if words need to be extracted for
                         searching or indexing
     

    example:

         skimp/extract-words-from-string %my-index {[here_are (some words) "to" $index!}
         == ["are" "here" "index!" "some" "to" "words"]
     

    10. set-config

    10.1. basic index configuration

    There are three magic settings you can apply when creating an index that may affect its performance.

    Once these are set, they cannot be changed for the lifetime of the index (See skimp-tools later for possible exceptions). The settings do not matter much while you are evaluating or playing with skimp. But you may want to run benchmarks on large data sets before building any skimp indexes for critical work.

    The three settings are:

    • index-levels -- how deep an index to build
    • integer-document-names -- whether document names are string!s or integer!s
    • one-file -- whether the index consists of one file or a series of files

    The meaning of these settings is discussed below.

    To find the settings for an index, use get-index-information. The values are returned in the config object:

          index-info: skimp/get-index-information %my-index
          probe index-info/config
     

    To set or change a value, use set-config (see below).

    As you can only set these values on a new index (one with no words in it), you may need to surround any code that sets them with a test of whether the index already exists:

         if not skimp/index-exists? %my-index [
              .... code to set config values
         ]
     

    10.2. set-config

    Sets or changes the value of a basic configuration variable.

     USAGE:
         skimp/set-config index-name config /defer
    
     ARGUMENTS:
          index-name -- name of the index (Type: file)
          config -- object containing your configuration settings (Type: object)
    
     REFINEMENTS:
          /defer -- Do not write to permanent storage
     

    examples:

         probe skimp/set-config %my-index make object! [index-levels: 6]
         probe skimp/set-config %my-index make object! [index-levels: 3 one-file: true]
     

    The response is the newly updated configuration object.

    Note that you do not need to specify all the configuration values. The ones you omit will not be changed.

    10.3. set-config/one-file

    The one-file config setting sets whether the index is created as a single file or a series of files.

    • one-file: true -- the index will be created as one file
    • one-file: false -- there will be one file for the index header, plus one file for each first character indexed. (If your index indexes words beginning a, b and c, there will be four files in the index: header, a-file, b-file, c-file.) The index file names are made up as index-name-N.sif where N is to-integer to-char word/1

    example:

         probe skimp/set-config %my-index make object! [one-file: true]
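
    The file-naming rule for a multiple-file index can be sketched as below. The helper index-file-for is hypothetical (it is not part of skimp), and lowercasing the word before taking its first character is an assumption based on indexes being case insensitive.

```rebol
REBOL [Title: "Sketch: file name for a word in a multiple-file index"]

; Hypothetical helper, not part of skimp: builds index-name-N.sif,
; where N is the character code of the word's first letter.
index-file-for: func [index-name [file!] word [string!]] [
    ; assumption: words are lowercased before the first character is taken
    rejoin [index-name "-" to-integer first lowercase copy word ".sif"]
]

probe index-file-for %my-index "apple"
; == %my-index-97.sif
probe index-file-for %my-index "Cat"
; == %my-index-99.sif
```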
     

    10.3.1. what's the point?

    The main point is speed of retrieval. skimp was written to perform well when running queries on a web server. A typical search on a web server consists of just a few words. With a multiple-file index, we need to load and initialise only the parts of the index that cover the first letters of those words, which means less i/o and cpu time.

    The same advantage may apply in a desktop application: rather than having the whole index loaded at application start-up or on the first query, just the relevant parts need be loaded.

    There are similar advantages when updating an index -- if a new document has words that begin with just [a b c] then only those three files will be written back.

    skimp-tools (See later) will provide a way of flipping this setting after an index has been created.

    10.4. set-config/index-levels

    Changes the default number of levels in the index.

    The skimp-tools documentation will contain some more detailed information on the internals of the skimp index. A brief explanation for now:

    The index is a bit like (not identical to) a b-tree. If you have a three-level index, then the words REBOL and REBEL are indexed like this:

    "r"  ===> "e"  ===> "b"  ===> ["el" [1 2 3] "ol" [1 4 58x100]]

    That is: there are three levels of index (for the characters R, E and B). They point you to a list that contains the other letters of words that begin REB, plus an index of which documents contain them (REBEL is in docs 1, 2, and 3; REBOL is in docs 1, 4 and 58 through 157)
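
    The lookup just described can be sketched with nested blocks. This is an illustration only: the nested layout, the names demo-index, expand-ids and lookup, and the reading of a pair as start x count (which matches "58 through 157" above) are assumptions for the example, not skimp's actual storage format.

```rebol
REBOL [Title: "Sketch: looking up a word in a three-level index"]

; Illustrative three-level index holding REBEL and REBOL.
demo-index: [
    "r" [
        "e" [
            "b" ["el" [1 2 3] "ol" [1 4 58x100]]
        ]
    ]
]

; Expand a document-id list, treating a pair as start x count.
expand-ids: func [ids [block!] /local out start count] [
    out: copy []
    foreach id ids [
        either pair? id [
            start: to-integer id/1
            count: to-integer id/2
            repeat n count [append out start + n - 1]
        ][
            append out id
        ]
    ]
    out
]

; Follow one level per leading character, then match the remaining letters.
lookup: func [word [string!] levels [integer!] /local node] [
    node: demo-index
    loop levels [
        if tail? word [return copy []]
        if none? node: select node to-string first word [return copy []]
        word: next word
    ]
    either node: select node copy word [expand-ids node] [copy []]
]

probe lookup "rebel" 3
; == [1 2 3]
probe length? lookup "rebol" 3
; == 102
```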

    index-levels lets you set the number of levels of index:

    • the default is 3 (like the example above)
    • this value cannot be changed once any words are indexed (at least, not without recreating the whole index)
    • the minimum is 1
    • there is no maximum, but anything above 4 is likely to be adding more layers than you could ever need.

    example:

         probe skimp/set-config %my-index make object! [index-levels: 2]
     

    10.4.1. what's the best setting?

    You'll need to experiment with your live data. In practice, the best index-levels setting makes between 10% and 20% difference to file sizes and retrieval time, so the setting is unlikely to make a crucial difference. (Your experience may vary with indexes that are many megabytes in size).

    10.5. set-config/integer-document-names

    In all the examples so far, document names have been string!s, though there have been some hints that they can also be integers.

    Internally, skimp uses document ids, which are always integers.

    skimp uses a document-name-to-document-id table to translate between them.

    So, if it happens that your document names are integers that meet the requirements below, you can eliminate that conversion table. The requirements are:

    • must be an integer
    • must be 1 or higher (ie not zero or negative)
    • it will help if they are more-or-less consecutive, though they do not need to start from 1. If your document ids are not more-or-less consecutive, you may be better off treating them as strings.

    examples:

         skimp/set-config %my-index make object! [integer-document-names: true]
         skimp/add-words %my-index 55 "this document name is an integer"
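
    Before switching an index to integer document names, you could check a set of names against the requirements above. The sketch below is one possible check; the function name suitable-integer-names? and the 50% density threshold used for "more-or-less consecutive" are arbitrary assumptions, not anything skimp itself enforces.

```rebol
REBOL [Title: "Sketch: are these names usable as integer document names?"]

; Hypothetical check, not part of skimp.
suitable-integer-names?: func [names [block!] /local lo hi] [
    if empty? names [return false]
    lo: hi: none
    foreach name names [
        ; every name must be an integer of 1 or higher
        if any [not integer? name  name < 1] [return false]
        lo: either lo [min lo name] [name]
        hi: either hi [max hi name] [name]
    ]
    ; assumed rule of thumb: at least half the range lo..hi is in use
    (length? unique names) * 2 >= (hi - lo + 1)
]

probe suitable-integer-names? [55 56 58 60]
; == true
probe suitable-integer-names? [1 1000000]
; == false
probe suitable-integer-names? ["einstein-1" 2]
; == false
```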
     

    11. limitations

    Just so you know:

    • All indexes are case insensitive (words are folded to lowercase)
    • skimp indexes the presence or absence of a word in a document. It does not note whether words are near to each other.
    • The empty string is never indexed:
               add-words %my-index ["" "hello"]
            
      indexes only hello
    • remove-document de-indexes an entire document. There is no current way to remove just some of the words.
    • The definition of a "word" is crucial to the operation of an index; you may need to spend time working on this as part of any indexing project you are planning.
    • find-words ANDs the results together. If you want a search that uses OR, you need to make multiple calls to find-words, and merge the results, eg:
             sort union union
                         skimp/find-words %my-index ["cat"]
                         skimp/find-words %my-index ["dog" "~horse"]
                         skimp/find-words %my-index ["pony" "wombat"]
         
      Finds all documents that contain cat OR (dog AND NOT horse) OR (pony AND wombat)

    12. Coming soon

    We're hoping to release two extras in the next few weeks. Look out for them on REBOL.org:

    • skimp-my-altme.r -- demo application
    • skimp-tools.r -- extra skimp facilities

    12.1. skimp-my-altme.r

    skimp-my-altme.r is a demonstration application. It will consist of an API to index the posts in Altme worlds. Depending on time, there may be a cheapo GUI in front of it. If you would like to help write a decent front end to the Altme world indexer, please contact me today.

    12.2. skimp-tools.r

    This will extend skimp with features a developer may need when writing applications that use skimp, or when extending the basics of skimp itself. They will include:

    • integrity-check -- a scan of an index to find any build problems;
    • flip-one-file -- an option to change an index from one large file to a set of smaller ones;
    • rename-document -- a way to change document-names without having to delete and reindex a document.

    The accompanying document will include some developer's notes on the internals of a skimp index.