Documentation for: skimp.r
Created by: sunanda
on: 5-Apr-2007
Last updated by: sunanda on: 28-Apr-2007
Format: html
Downloaded on: 26-Jul-2024

     script: skimp.r
      title: Simple keyword index management program
     author: Sunanda
       date: 23-apr-2007
    Version: 0.0.2
 

purpose

skimp.r lets you index the words in a set of documents. You can then retrieve the list of documents that contain (or do not contain) specific words.

skimp.r is used extensively inside REBOL.org for many of its search indexes.

1. thanks and credits

2. quick start highlights

2.1. Version summary

2.2. sample data

Throughout parts of this documentation we use the following data structure as an example: it is a block with document-names followed by the document's content.

   skimp-test-data: [
    "einstein-1" "You see, the wire telegraph is a kind of a very, very long cat."
      "saying-1" "Managing programmers is like herding cats."
    "einstein-1" "You pull his tail in New York and his head is meowing in Los Angeles."
       "lyric-1" "I thought I saw a puddy cat a-creeping up on me"
    "einstein-1" "Do you understand this?"
     "medical-1" "The cat-scan showed nothing unusual"
    "einstein-1" "And radio operates exactly the same way: you send signals here, they receive them there."
      "saying-2" "Dogs see people as companions; cats see people as staff."
    "einstein-1" "The only difference is that there is no cat."
      "cliche-1" "It's been raining cats and dogs since last night!"
    "doggerel-1" "The cat sat on the mat"
      "saying-3" "While the cat's away, the mice will play"
     ]
 

Note that the document einstein-1 appears several times in the set. This is to help illustrate what happens when skimp's index is updated for the same document.

2.3. build an index

Let's build a basic skimp index from that data set:

       ;; ensure folder exists for index
   if not exists? %skimp-folder [make-dir %skimp-folder]

   index-name: %skimp-folder/skimp-test-index

       ;; add each document to the index
   foreach [document-name contents] skimp-test-data [
       skimp/add-words index-name document-name contents
       ]
 

You should now have a set of files in the %skimp-folder folder. Their names look like: skimp-test-index-117.sif

2.4. search index for words

We can now use skimp to find words in documents:

    probe skimp/find-word index-name "cat"
    == ["saying-3" "doggerel-1" "medical-1" "lyric-1" "einstein-1"]

    probe skimp/find-word index-name "the"
    == ["saying-3" "doggerel-1" "medical-1" "einstein-1"]

    probe skimp/find-word index-name "cats."
    == []               ;; "cats." is not a word according to the default assumptions

    probe skimp/find-word index-name "cat-scan"
    == ["medical-1"]   ;; "cat-scan" is one word

    probe skimp/find-word index-name "scan"
    == ["medical-1"]   ;; the "scan" part of "cat-scan" is also a findable word

    probe skimp/find-word index-name "difference"
    == ["einstein-1"]

        ;; note: find-WORDS takes a block (find-WORD takes a string)
    probe skimp/find-words index-name ["a" "cat"]
    == ["lyric-1" "einstein-1"]

        ;; a leading ~ negates the search: documents that do not contain "dog"
    probe skimp/find-word index-name "~dog"
    == ["einstein-1" "saying-1" "lyric-1" "medical-1" "saying-2" "cliche-1" "doggerel-1" "saying-3"]

 

Notes

2.5. retrieve more from the index

      ;; all indexed words beginning with an a:
   probe skimp/get-indexed-words index-name "a"
   == ["a" "a-creeping" "and" "angeles" "as" "away"]

      ;; ditto, with the names of documents that contain the words:
   probe skimp/get-indexed-words/document-list index-name "a"
   == ["a" ["einstein-1" "lyric-1"] "a-creeping" ["lyric-1"]
       "and" ["cliche-1" "einstein-1"] "angeles" ["einstein-1"]
       "as" ["saying-2"] "away" ["saying-3"]]

      ;; all words beginning t in einstein-1:
   probe skimp/get-indexed-words-for-document index-name "einstein-1" "t"
   == ["tail" "telegraph" "that" "the" "them" "there" "they" "this?"]
 

2.6. remove a document

You can remove a document entirely like this:

   probe skimp/find-word index-name "difference"
   == ["einstein-1"]      ;; finds one document
   probe skimp/remove-document index-name "einstein-1"
   probe skimp/find-word index-name "difference"
   == []                  ;; finds no document
 

2.7. other highlights

In summary:

3. usage in detail

3.1. starting skimp

Simply use:

do %skimp.r 

3.2. closing down

No need to do anything, unless you have used updates with the /defer refinement. If so, you need to write back all open indexes:

skimp/write-cache-all/flush 

You could then, if you wish, remove skimp from memory:

   unset 'skimp
   unset 'rse-ids
 

3.3. configuring skimp

skimp itself needs very little configuring for use (you can extensively configure individual indexes -- see later).

There are only two global configuration settings: the prefix and the suffix used in the file names of all indexes:

   do %skimp.r
   skimp/index-name-prefix: "live-build-"
   skimp/index-name-suffix: ".index"
 

With the example above, the files for index my-index will be called live-build-my-index-nnn.index. The defaults are:

4. creating, configuring and closing an index

4.1. creating a skimp index

You do not need to do anything to create an index -- simply start using it and it will be created automatically.

The only pre-requisite is that the folder to store the index exists....

   skimp/add-words %/c/dev/indexes/ind-1 "einstein-1" "You see"
 

.... will create an index called ind-1 in the folder /c/dev/indexes (with files named like ind-1-nnn.sif) -- provided that folder already exists.

4.2. configuring a skimp index

There are some settings you may wish to apply before adding any documents to an index; and there are some that you have to apply before adding any documents.

It is best to configure the index before performing other operations on it. You can use skimp/index-exists? to see if an index exists or not. If it does not, you are able to apply the configuration settings:

   if not skimp/index-exists? %a-test-index [
       skimp/set-word-definition .... ;; define what a "word" is in this index
       ]
 
The configuration settings are defined later. In summary, they are:
  • skimp/set-config -- performance settings for the index
  • skimp/set-word-definition -- define what a "word" is for the index

4.3. closing a skimp index

    without caching: if you are not using the skimp cache, then no explicit close is needed: all data is written back on each update operation, so the data on disk is current.

    with caching: if one or more indexes are cached (see below for details) you need to write the cache out before ending your program, eg:

        skimp/write-cache/flush %test-index-1
        skimp/write-cache/flush %test-index-2
     

    If you've not been keeping a close eye on what indexes you have opened and cached, use write-cache-all to write them all back to permanent storage:

        skimp/write-cache-all/flush
     

    5. caching one or more indexes

    5.1. /defer

    These operations can all take the /defer refinement:

    Using /defer can speed things up, as skimp does not write the index files back to permanent storage until you issue either a skimp/write-cache or a skimp/write-cache-all.

    As explained above:

    5.2. /flush

    Both write-cache and write-cache-all can take the /flush refinement.

    6. adding and removing documents from an index

    6.1. overview

    There are four main functions for this: add-words, add-bulk-words, remove-document and remove-documents.

    As noted above, each can take the /defer refinement. If you have a series of operations to perform on an index, using /defer can speed things up considerably.
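A minimal sketch of a deferred batch update, using only calls described in this document. It assumes skimp.r is in the current folder and that the folder for the index already exists; the index name and the two sample documents are placeholders.

```rebol
REBOL []

do %skimp.r

index-name: %skimp-folder/skimp-test-index   ;; placeholder index name

sample-data: [
    "saying-1" "Managing programmers is like herding cats."
    "doggerel-1" "The cat sat on the mat"
    ]

    ;; queue every update in the cache: nothing is written back yet
foreach [document-name contents] sample-data [
    skimp/add-words/defer index-name document-name contents
    ]

    ;; write the index back once, then drop it from the cache
skimp/write-cache/flush index-name
```

The point of the pattern is that the index files are written once at the end rather than once per document.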

    6.2. document names

    By default, a document's name must be a string! -- previous examples include "einstein-1" and "saying-1".

    There is a configuration setting that allows document-names to be integer!. That is explained later (See set-config/integer-document-names). All the document-names in one index must be of the same datatype.

    6.3. add-words

    Adds all the words in one document to an index.

    If the document is already in the index, it adds all the new words found in the latest version of the document, but does not remove any words that are no longer found in it: to replace a document in an index, use remove-document followed by add-words.

     USAGE:
         skimp/add-words index-name document-name words /defer
    
     ARGUMENTS:
          index-name -- name of the index (Type: file)
          document-name -- name of the document (Type: string integer)
          words -- words to add (Type: string block)
    
     REFINEMENTS:
          /defer -- Do not write to permanent storage
     

    The words can be supplied as either a string! or a block!

    examples:

        skimp/add-words %my-index "parrot" "->: I wish to make a complaint -:)"
        skimp/add-words %my-index "smilies" ["-:)" "->:" "@/."]
     

    In the smilies document, the various keyboard symbols are treated as indexable words, while in parrot they are not.
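Because add-words never removes words that have dropped out of a document, replacing a document takes two steps. The helper below is a sketch of that; replace-document is a hypothetical name and is not part of skimp. It assumes skimp.r is in the current folder and that %my-index is usable.

```rebol
REBOL []

do %skimp.r

    ;; hypothetical helper, not part of skimp:
    ;; remove the old version of a document, then index the new one
replace-document: func [
    index-name [file!]
    document-name [string! integer!]
    words [string! block!]
][
    skimp/remove-document/defer index-name document-name
    skimp/add-words/defer index-name document-name words
        ;; both steps were deferred, so write the index back once
    skimp/write-cache index-name
]

replace-document %my-index "parrot" "This parrot is no more"
```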

    6.4. add-bulk-words

    Adds all the words in zero or more documents to an index.

    Very similar to add-words except add-bulk-words accepts more than one document.

     USAGE:
         skimp/add-bulk-words index-name data-set /defer
    
     ARGUMENTS:
          index-name -- name of the index (Type: file)
          data-set -- set of details to add (Type: block)
    
     REFINEMENTS:
          /defer -- Do not write to permanent storage
     

    The data-set is a block with pairs of entries: document-name followed by words for that document.

    As with add-words the words can be string! or block!

    example:

        skimp/add-bulk-words %my-index [
                "parrot" "I wish to make a complaint -:)"
               "smilies" ["-:)" "->:" "@/."]
           ]
     

    The example adds the same two documents as the add-words example.

    6.5. choosing between add-words and add-bulk-words

    6.6. remove-document

    Removes all the words in one document from an index.

    If the document is not in the index, then no action is taken.

     USAGE:
         skimp/remove-document index-name document-name  /defer
    
     ARGUMENTS:
          index-name -- name of the index (Type: file)
          document-name -- name of the document (Type: string integer)
    
     REFINEMENTS:
          /defer -- Do not write to permanent storage
     

    examples:

        skimp/remove-document %my-index "parrot"
        skimp/remove-document %my-index "einstein-1"
     

    question: Can I remove just some of the words of a document from an index?

    response: No, not in this version of skimp.

    question: How do I delete an entire index?

    response: See remove-index below.

    6.7. remove-documents

    Removes all the words in zero or more documents from an index.

    Document names which are not in the index are ignored.

     USAGE:
         skimp/remove-documents index-name document-list  /defer
    
     ARGUMENTS:
          index-name -- name of the index (Type: file)
          document-list -- names of the documents (Type: block)
    
     REFINEMENTS:
          /defer -- Do not write to permanent storage
     

    example:

        skimp/remove-documents %my-index ["parrot" "einstein-1"]
     

    6.8. remove-document vs remove-documents

    remove-document is a courtesy wrapper for remove-documents, so it does not matter if you use remove-documents with just one document name; the effect is the same.

    If you have multiple documents to remove it is usually much faster to use remove-documents once rather than remove-document multiple times.

    That is because skimp needs to scan the entire index for each remove operation. If you are removing 50 documents, that is 50 scans using remove-document rather than one single scan using remove-documents.
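The contrast between the two approaches looks roughly like this. It is a sketch, assuming skimp.r is in the current folder; the document names are placeholders.

```rebol
REBOL []

do %skimp.r

stale-documents: ["parrot" "einstein-1" "saying-1"]   ;; placeholder names

    ;; slower: one scan of the index per document
    ;; foreach name stale-documents [skimp/remove-document %my-index name]

    ;; faster: a single scan for the whole list
skimp/remove-documents %my-index stale-documents
```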

    7. searching for documents

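    To find the documents that contain specific words, use either find-word (for a single word) or find-words (for several words).

    In both cases, a search can be negated (ie find all the documents that do not contain a word) by using the negate prefix: a leading tilde, as in "~cat".

    You can also search more widely, using get-indexed-words or get-indexed-words-for-document.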

    7.1. find-word

    Returns a block of document-names that contain the word.

     USAGE:
         skimp/find-word index-name word
    
     ARGUMENTS:
          index-name -- name of the index (Type: file)
          word -- word to search for (Type: string)
     

    examples:

        skimp/find-word %my-index "cat"
        skimp/find-word %my-index "~cat"
     

    question: How can I know if the word I am looking for is well-formed? ie that it is formed to the same parsing rules as add-words?

    response: Good question! One way is to use extract-words-from-string which is explained later on. It allows you access to the same processing as add-words. (If you do use extract-words-from-string, it returns a block so you need to use it in conjunction with find-words below).

    7.2. find-words

    Returns a block of document-names that contain all the words.

     USAGE:
         skimp/find-words index-name words
    
     ARGUMENTS:
          index-name -- name of the index (Type: file)
          words -- words to search for (Type: block)
     

    examples:

        skimp/find-words %my-index ["cat" "sat" "mat"]
        skimp/find-words %my-index ["~cat" "sat" "mat"]
        skimp/find-words %my-index
            skimp/extract-words-from-string/for-search %my-index "~Cat, sat mat."
     

    7.3. get-indexed-words

    Returns a block of all the words in the index that begin with a specific character. Optionally, also return a block of the matching document names.

     USAGE:
         skimp/get-indexed-words index-name character /document-list
    
     ARGUMENTS:
          index-name -- name of the index (Type: file)
          character -- first character of the words being sought (Type: char string)
    
     REFINEMENTS:
          /document-list -- Also return the matching document names
     

    examples:

        skimp/get-indexed-words %my-index "n"
        skimp/get-indexed-words/document-list %my-index "n"
     

    7.3.1. what first letters exist in the index?

    get-index-information (see later) provides a list of all first characters that exist in the index. This code will display all the words in the index:

        index-info: skimp/get-index-information %my-index
        foreach first-char index-info/top-index [
            probe skimp/get-indexed-words %my-index first-char
            ]
     

    7.4. get-indexed-words-for-document

    Returns a block of all the words in a specific document that match a specific first character.

     USAGE:
         skimp/get-indexed-words-for-document index-name document-name character
    
     ARGUMENTS:
          index-name -- name of the index (Type: file)
          document-name -- name of the document (Type: integer string)
          character -- first character of the words being sought (Type: char string)
    
     

    examples:

        skimp/get-indexed-words-for-document %my-index "einstein-1" "b"
        skimp/get-indexed-words-for-document %my-index "saying-1" "m"
     

    7.4.1. what documents are in the index?

    Skipping ahead, get-index-information is one way to get a list of all document names.

    8. other index management functions

    These functions allow you to manage your skimp indexes efficiently and effectively:

    8.1. index-exists?

    Returns true or false depending on whether an index exists or not.

     USAGE:
         skimp/index-exists? index-name
    
     ARGUMENTS:
          index-name -- name of the index (Type: file)
    
     

    example:

       if not skimp/index-exists? %my-indexes/index-1 [
           print "sorry -- not yet created"
           ]
     

    8.2. get-index-information

    Returns an object! containing information about the index, or none if the index does not exist.

     USAGE:
         skimp/get-index-information index-name
    
     ARGUMENTS:
          index-name -- name of the index (Type: file)
    
     

    example:

       if object? ind-header: skimp/get-index-information %my-indexes/index-1 [
           probe ind-header
           ]
     

    8.2.1. index header object

    Has these fields:

    index-file: file: the name of the index file
    top-index: block: of first characters indexed.
    owner-data: object: any owner data you have set with set-owner-data (see later)
    config: object: create-time configuration settings. See later
    word-parameters: object: parameters defining what a word is. See later
    make-word-list: function: the actual function used in this index to define what a word is.

    8.3. remove-index

    Deletes all files related to an index.

     USAGE:
         skimp/remove-index index-name
    
     ARGUMENTS:
          index-name -- name of the index (Type: file)
    
     

    example:

         skimp/remove-index %my-old-index
     

    8.4. set-owner-data

    Allows you to store an object! in the index header. You could use this to (say) record an extended name for the index, or to add some application-specific notes about the index.

     USAGE:
         skimp/set-owner-data index-name owner-data /defer
    
     ARGUMENTS:
          index-name -- name of the index (Type: file)
          owner-data -- your data (Type: object)
    
     REFINEMENTS:
          /defer -- Do not write to permanent storage
     

    example:

         skimp/set-owner-data %my-index make object! [
             last-updated: now
             index-name: "All my photo captions"
             support: "call 555-1234"
             ]
     

    8.4.1. updating existing owner-data

    Simply specify the fields you want to change....

         skimp/set-owner-data %my-index make object! [
             support: "call 555-9876"
             ]
     

    ....Other fields in the owner-data are left unchanged.

    To get the full owner-data record, use get-index-information.
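A short sketch of reading the owner-data back, assuming skimp.r is in the current folder. It relies only on the owner-data field of the index header object described in section 8.2.1.

```rebol
REBOL []

do %skimp.r

info: skimp/get-index-information %my-index

    ;; get-index-information returns none when the index does not exist
if object? info [
    probe info/owner-data
    ]
```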

    8.5. write-cache

    Writes the data for a specific index to permanent storage. Use if you have been using the /defer refinement on update operations.

     USAGE:
         skimp/write-cache index-name /flush
    
     ARGUMENTS:
          index-name -- name of the index (Type: file)
    
     REFINEMENTS:
          /flush -- Purge the index from the cache after writing
                    (further operations will cause it to be re-read)
     

    examples:

         skimp/write-cache %my-index-1
         skimp/write-cache/flush %my-index-2
     

    8.6. write-cache-all

    Writes the data for all indexes to permanent storage. Use if you have been using the /defer refinement on update operations, and your clean-up routines are unsure which indexes your software has been updating.

     USAGE:
         skimp/write-cache-all /flush
    
     ARGUMENTS:
          none
    
     REFINEMENTS:
          /flush -- Purge the indexes from the cache after writing
                    (further operations will cause them to be re-read)
     

    examples:

         skimp/write-cache-all
         skimp/write-cache-all/flush
     

    8.7. flush-cache

    Clears an index from the cache without first writing it to permanent storage.

    If you have been updating an index using the /defer refinement, this is a way of purging all the changes since the last write-cache.

    If you have an index open purely for reading (ie not one you are updating in the current task) this is a way of forcing the index to be reread from permanent storage. That may be of use if either:

     USAGE:
         skimp/flush-cache index-name
    
     ARGUMENTS:
          index-name -- name of the index (Type: file)
    
     

    example:

         skimp/flush-cache %my-index-1
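As an illustration of discarding deferred changes, the sketch below follows the behaviour described in this section. It assumes skimp.r is in the current folder and that %my-index-1 already exists on permanent storage; the document name "draft-1" is a placeholder.

```rebol
REBOL []

do %skimp.r

    ;; deferred update: held in the cache only
skimp/add-words/defer %my-index-1 "draft-1" "words I may not want to keep"

    ;; discard everything since the last write-cache
skimp/flush-cache %my-index-1

    ;; the index is re-read from permanent storage on next use,
    ;; so "draft-1" should not be listed here
probe skimp/find-word %my-index-1 "keep"
```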
     

    8.8. flush-cache-all

    Clears all indexes from the cache.

     USAGE:
         skimp/flush-cache-all
    
     ARGUMENTS:
          none
    
     

    example:

         skimp/flush-cache-all
     

    9. what is a word?

    skimp indexes the words in documents. So an obvious question is: how does it know what a word is?

    The simple answer is:

    In principle, a word can be any string of one or more characters. You are not limited to letters, numerals or even printable characters.

    Your definition of what a word is (including any function you have supplied to extract the words in a document) is held in the index itself. So the definition of a word remains stable and constant throughout the life of the index.

    There are two functions for managing the definition of a word for an index: set-word-definition and extract-words-from-string.

    In addition, get-index-information (see above) returns you the current definition for any given index.

    9.1. set-word-definition

    Defines what a word is for a given index.

     USAGE:
         skimp/set-word-definition index-name /parameters parm-obj /make-word-list mwlf /defer
    
     ARGUMENTS:
          index-name -- name of the index (Type: file)
    
     REFINEMENTS:
          /parameters -- extend/change definition for make-word-list function
              parm-obj (Type: object)
    
          /make-word-list -- replacement function
              mwlf (Type: function none)
    
          /defer -- Do not write to permanent storage
     

    examples:

         skimp/set-word-definition/parameters %my-index
              make object! [
                  initial-letters: charset compose [#"a" - #"z" #"A" - #"Z" (to-char 199) (to-char 231)]
              ]
    
         skimp/set-word-definition/make-word-list %my-index
               func [parms string] [return unique parse/all string " "]
     

    9.1.1. changing the word definition for an existing index

    You can do this, but it may not be wise. It will not affect any of the words already indexed, but it may affect your ability to search for them.

    For example, you build an index over 1000 documents in which decimal numbers are not treated as words. Later you change the definition so they are treated as words, and add another 1000 documents. Only the latter 1000 documents will have their numbers indexed.

    9.1.2. parameter object

    The parameter object is passed to the make-word-list function. By default it is similar to the default object used by make-word-list.

    You can find skimp's settings for an index like this:

       probe get in skimp/get-index-information %my-index 'word-parameters
    

    make-word-list function

    By default, skimp looks for %make-word-list.r in the current folder:
      • if found: skimp uses it as the default make-word-list function
      • if not, it uses a fairly useless minimal function:

         func [parms string [string!]] [
             return unique sort parse/all trim/lines copy string " "
         ]
     

    If you want skimp to use another function that you have written, you can specify it with set-word-definition/make-word-list.

    example:

       skimp/set-word-definition/make-word-list %my-index
           func [
              obj [object!]
              str [string!]
              /for-search
           ][
           return unique parse/all str "."
           ]
     

    The example is hardly the most useful word-extracting function in the world :-)

    To test your function (or the built-in function) use extract-words-from-string (See later).
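The minimal fallback function quoted earlier in this section can also be tried on its own, without skimp loaded. The sketch below copies its body into a stand-alone function; the name minimal-make-word-list is made up for this illustration.

```rebol
REBOL []

    ;; stand-alone copy of the minimal fallback function;
    ;; the name is hypothetical, chosen for this illustration only
minimal-make-word-list: func [parms string [string!]] [
    return unique sort parse/all trim/lines copy string " "
]

    ;; the parameter object is unused by this function, so none will do
probe minimal-make-word-list none "the cat sat on the mat."
    ;; == ["cat" "mat." "on" "sat" "the"]
```

Note that the full stop stays attached to "mat.", which is one reason the fallback is of limited use for real text.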

    9.1.3. /for-search refinement

    There are two subtly different reasons why you may want to extract the words from a string: to add them to an index, or to search an index for them.

    The two uses may in your application need to behave slightly differently, especially with regard to handling the not-prefix. As an example, the built-in make-word-list function acts differently when given a string that contains tildes -- a leading tilde is preserved with the /for-search refinement:

         skimp/extract-words-from-string %my-index "I have some ~tildes in t~~~his ~string~"
         == ["have" "I" "in" "some" "string~" "tildes" "t~~~his"]
         skimp/extract-words-from-string/for-search %my-index "I have some ~tildes in t~~~his ~string~"
         == ["have" "I" "in" "some" "t~~~his" "~string~" "~tildes"]
     

    9.2. extract-words-from-string

    We've just about covered this above. Provides access to the make-word-list function in an index. Allows you to:

     USAGE:
         skimp/extract-words-from-string index-name string /for-search
    
     ARGUMENTS:
          index-name -- name of the index (Type: file)
          string -- string from which to extract the words (Type: string)
    
     REFINEMENTS:
          /for-search -- Says if words need to be extracted for
                         searching or indexing
     

    example:

         skimp/extract-words-from-string %my-index {[here_are (some words) "to" $index!}
         == ["are" "here" "index!" "some" "to" "words"]
     

    10. set-config

    10.1. basic index configuration

    There are three magic settings you can apply when creating an index that may affect its performance.

    Once these are set, they cannot be changed for the lifetime of the index (See skimp-tools later for possible exceptions). The settings do not matter much while you are evaluating or playing with skimp. But you may want to run benchmarks on large data sets before building any skimp indexes for critical work.

    The three settings are one-file, index-levels and integer-document-names.

    The meaning of these settings is discussed below.

    To find the settings for an index, use get-index-information. The values are returned in the config object:

          index-info: skimp/get-index-information %my-index
          probe index-info/config
     

    To set or change a value, use set-config (see below).

    As you can only set these values on a new index (one with no words in it), you may need to surround any code that sets them with a test that the index exists or not:

         if not skimp/index-exists? %my-index [
              .... code to set config values
              ]
     

    10.2. set-config

    Sets or changes the value of a basic configuration variable.

     USAGE:
         skimp/set-config index-name config /defer
    
     ARGUMENTS:
          index-name -- name of the index (Type: file)
          config -- object containing your configuration settings (Type: object)
    
     REFINEMENTS:
          /defer -- Do not write to permanent storage
     

    examples:

         probe skimp/set-config %my-index make object! [index-levels: 6]
         probe skimp/set-config %my-index make object! [index-levels: 3 one-file: true]
     

    The response is the newly updated configuration object.

    Note that you do not need to specify all the configuration values. The ones you omit will not be changed.

    10.3. set-config/one-file

    The one-file config setting sets whether the index is created as a single file or a series of files.

    example:

         probe skimp/set-config %my-index make object! [one-file: true]
     

    10.3.1. what's the point?

    The main point is speed of retrieval. skimp was written to perform well when running queries on a webserver. A typical search on a web server consists of just a few words. With a multiple-file index, we need to load and initialise only those parts of the index that the search words touch, which means less i/o and cpu time.

    The same advantage may apply in a desktop application: rather than having the whole index loaded at application start-up or on the first query, just the relevant parts need be loaded.

    There are similar advantages when updating an index -- if a new document has words that begin with just [a b c] then only those three files will be written back.

    skimp-tools (See later) will provide a way of flipping this setting after an index has been created.

    10.4. set-config/index-levels

    Changes the default number of levels in the index.

    The skimp-tools documentation will contain some more detailed information on the internals of the skimp index. A brief explanation for now:

    The index is a bit like (not identical to) a b-tree. If you have a three-level index, then the words REBOL and REBEL are indexed like this:

    "r"  ===> "e"  ===> "b"  ===> ["el" [1 2 3] "ol" [1 4 58x100]]

    That is: there are three levels of index (for the characters R, E and B). They point you to a list that contains the other letters of words that begin REB, plus an index of which documents contain them (REBEL is in docs 1, 2, and 3; REBOL is in docs 1, 4 and 58 through 157).

    index-levels lets you set the number of levels of index:

    example:

         probe skimp/set-config %my-index make object! [index-levels: 2]
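To make the three-level layout described above more concrete, here is a toy in-memory model of it. This is only an illustration of the idea: toy-index and toy-lookup are hypothetical names, and the nested blocks are not skimp's actual file format.

```rebol
REBOL []

    ;; hypothetical model of a three-level index holding REBEL and REBOL;
    ;; 58x100 stands for the run of documents 58 through 157, as in the text
toy-index: [
    "r" [
        "e" [
            "b" ["el" [1 2 3] "ol" [1 4 58x100]]
        ]
    ]
]

    ;; walk one level per leading letter, then look up the rest of the word
toy-lookup: func [word [string!] /local node] [
    node: toy-index
    foreach ch copy/part word 3 [
        node: select node to-string ch
        if none? node [return none]
    ]
    select node skip word 3
]

probe toy-lookup "rebel"    ;; == [1 2 3]
probe toy-lookup "rabbit"   ;; == none
```

With more levels, each final list is shorter but there are more intermediate nodes to walk, which is the trade-off the index-levels setting controls.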
     

    10.4.1. what's the best setting?

    You'll need to experiment with your live data. In practice, the best index-levels setting makes between 10% and 20% difference to file sizes and retrieval time, so the setting is unlikely to make a crucial difference. (Your experience may vary with indexes that are many megabytes in size).

    10.5. set-config/integer-document-names

    In all the examples so far, document names have been string!s, though there have been some hints that they can also be integers.

    Internally, skimp uses document ids, which are always integers.

    skimp uses a table that maps document names to document ids to translate between them.

    So, if it happens that your document names are integers that meet the requirements below, you can eliminate that conversion table. The requirements are:

    examples:

         skimp/set-config %my-index make object! [integer-document-names: true]
         skimp/add-words %my-index 55 "this document name is an integer"
     

    11. limitations

    Just so you know:

    12. Coming soon

    We're hoping to release two extras in the next few weeks. Look out for them on REBOL.org:

    12.1. skimp-my-altme.r

    Is a demonstration application. It will consist of an API to index the posts in Altme worlds. Depending on time, there may be a cheapo GUI in front of it. If you would like to help write a decent front end to the Altme world indexer, please contact me today.

    12.2. skimp-tools.r

    This will extend the skimp function to provide features a developer may need when writing applications that use skimp; or if extending the basics of skimp itself. They will include:

    The accompanying document will include some developer's notes on the internals of a skimp index.