Documentation for: skimp.r Created by: sunanda on: 5-Apr-2007 Last updated by: sunanda on: 28-Apr-2007 Format: text/editable Downloaded on: 30-Apr-2025
[div/style/max-width:95%
[numbering-on
[asis/style/font-size:small;overflow:auto;background-color:#ddffee
script: skimp.r
title: Simple keyword index management program
author: Sunanda
date: 23-apr-2007
Version: 0.0.2
asis]
[contents
[div/style/border:thin,blue,solid;margin:1em;padding:1em
[h2/style/break:both purpose
[p **skimp.r** lets you index the **words** in a set of **documents**. You can then retrieve the list of documents that contain (or do not contain) specific words.
[p skimp.r is used extensively inside REBOL.org for many of its search indexes.
div]
[div/style/border:thin,green,solid;margin:1em;padding:1em
[h2 thanks and credits
[li Christian, Romano and everyone on the Mailing List who contributed to **rse-ids.r**
[li Peter Wood for writing the **make-word-list.r** script, and creating test data cases for rse-ids, skimp and make-word-list.
list]
div]
[div/style/*-1
[h2 quick start highlights
[h3 Version summary
[li 0.0.0 11-aug-2005 Written for internal use only
[li 0.0.1 3-apr-2007 First public release; uses rse-ids.r and make-word-list.r
[li 0.0.2 23-apr-2007 Add flush-cache and flush-cache-all
list]
[h3 sample data
[p Many examples in this documentation use the following data structure: it is a block of **document-names**, each followed by that document's content.
[asis/style/*
skimp-test-data: [
    "einstein-1" "You see, the wire telegraph is a kind of a very, very long cat."
    "saying-1" "Managing programmers is like herding cats."
    "einstein-1" "You pull his tail in New York and his head is meowing in Los Angeles."
    "lyric-1" "I thought I saw a puddy cat a-creeping up on me"
    "einstein-1" "Do you understand this?"
    "medical-1" "The cat-scan showed nothing unusual"
    "einstein-1" "And radio operates exactly the same way: you send signals here, they receive them there."
    "saying-2" "Dogs see people as companions; cats see people as staff."
    "einstein-1" "The only difference is that there is no cat."
    "cliche-1" "It's been raining cats and dogs since last night!"
    "doggerel-1" "The cat sat on the mat"
    "saying-3" "While the cat's away, the mice will play"
]
asis]
[p Note that the document **einstein-1** appears several times in the set. This is to help illustrate what happens when skimp's index is updated for the same document.
[h3 build an index
[p Let's build a basic skimp index from that data set:
[asis/style/*
;; ensure folder exists for index
if not exists? %skimp-folder [make-dir %skimp-folder]
index-name: %skimp-folder/skimp-test-index

;; add each document to the index
foreach [document-name contents] skimp-test-data [
    skimp/add-words index-name document-name contents
]
asis]
[p You should now have a set of files in the **%skimp-folder** folder. Their names look like: **skimp-test-index-117.sif**
[h3 search index for words
[p We can now use skimp to find words in documents:
[asis/style/*
probe skimp/find-word index-name "cat"
== ["saying-3" "doggerel-1" "medical-1" "lyric-1" "einstein-1"]

probe skimp/find-word index-name "the"
== ["saying-3" "doggerel-1" "medical-1" "einstein-1"]

probe skimp/find-word index-name "cats."
== []    ;; "cats." is not a word according to the default assumptions

probe skimp/find-word index-name "cat-scan"
== ["medical-1"]    ;; "cat-scan" is one word

probe skimp/find-word index-name "scan"
== ["medical-1"]    ;; the "scan" part of "cat-scan" is also a findable word

probe skimp/find-word index-name "difference"
== ["einstein-1"]

;; note: find-WORDS takes a block (find-WORD takes a string)
probe skimp/find-words index-name ["a" "cat"]
== ["lyric-1" "einstein-1"]

probe skimp/find-word index-name "~dog"
== ["einstein-1" "saying-1" "lyric-1" "medical-1" "saying-2" "cliche-1" "doggerel-1" "saying-3"]
asis]
[p Notes
[li both **find-word** and **find-words** return a **block** of **document-names**
[li all words in the index are **lowercase** -- (skimp is case insensitive)
[li skimp has applied some logic to extract **words** from the document content. The logic is highly flexible, and can be completely replaced -- so you can easily use skimp with your own definition of what a word is -- see **what is a word?** later
[li all "parts" of the **einstein-1** document have been added
[li use **find-word** to find all documents containing a given **word**
[li use **find-words** to search on **one or more words** (the results are **ANDed** together -- so all the words must occur in a document for it to match).
[li prefix a word with a **tilde** (eg **~dog**) to find all documents that do **not** contain the word (the tilde is configurable if you want to use something else)
list]
[h3 retrieve more from the index
[asis/style/*
;; all indexed words beginning with an **a**:
probe skimp/get-indexed-words index-name "a"
== ["a" "a-creeping" "and" "angeles" "as" "away"]

;; ditto, with the names of documents that contain the words:
probe skimp/get-indexed-words/document-list index-name "a"
== ["a" ["einstein-1" "lyric-1"] "a-creeping" ["lyric-1"] "and" ["cliche-1" "einstein-1"] "angeles" ["einstein-1"] "as" ["saying-2"] "away" ["saying-3"]]

;; all words beginning **t** in **einstein-1**:
probe skimp/get-indexed-words-for-document index-name "einstein-1" "t"
== ["tail" "telegraph" "that" "the" "them" "there" "they" "this?"]
asis]
[h3 remove a document
[p You can remove a document entirely like this:
[asis/style/*
probe skimp/find-word index-name "difference"
== ["einstein-1"]    ;; finds one document

**probe skimp/remove-document index-name "einstein-1"**

probe skimp/find-word index-name "difference"
== []    ;; finds no document
asis]
[h3 other highlights
[p In summary:
[li you can have many indexes open at the same time
[li you can check if an index exists, and delete an entire index
[li you can get information about an index, and you can save your own data in an index header
[li **add-bulk-words** gives you a faster way of adding many documents at the same time
[li there are various caching and configuration options to tune an index to your needs
list]
div]
[div/style/*-1
[h2 usage in detail
[h3 starting skimp
[p Simply use:
[asis/style/*
do %skimp.r
asis]
[li You will need **rse-ids.r** (in the REBOL.org library). rse-ids.r should be already loaded, or available in the same folder as skimp.r so that skimp.r can load it.
[li For best results, you'll need **make-word-list.r** (also in the REBOL.org library).
**make-word-list** provides a flexible and configurable way of extracting all the "words" from a document.
list]
[h3 closing down
[p No need to do anything, unless you have used updates with the **/defer** refinement. If so, you need to write back **all** open indexes:
[asis/style/*
skimp/write-cache-all/flush
asis]
You could then, if you wish, remove skimp from memory:
[asis/style/*
unset 'skimp
unset 'rse-ids
asis]
[h3 configuring skimp
[p skimp itself needs very little configuring for use (you can extensively configure individual indexes -- see later).
[p There are only two global configuration settings: the **prefix** and the **suffix** of the file name for all indexes:
[asis/style/*
do %skimp.r
skimp/index-name-prefix: "live-build-"
skimp/index-name-suffix: ".index"
asis]
[p With the example above, the files for index **my-index** will be called **live-build-my-index-nnn.index**. The defaults are:
[li index-name-prefix -- none (ie null string)
[li index-name-suffix -- **.sif** (**S**kimp **I**ndex **F**ile)
list]
div]
[div/style/*-1
[h2 creating, configuring and closing an index
[h3 creating a skimp index
[p You do not need to do anything to create an index -- simply start using it and it will be created automatically.
[p The only prerequisite is that the **folder to store the index exists**....
[asis/style/*
skimp/add-words %/c/dev/indexes/ind-1 "einstein-1" "You see"
asis]
[p .... will create an index called **ind-1** (with files named **ind-1-nnn.sif**) in the folder **/c/dev/indexes** -- provided that folder already exists.
[h3 configuring a skimp index
[p There are some settings you may wish to apply before adding any documents to an index; and there are some that you **have** to apply before adding any documents.
[p It is best to configure the index before performing other operations on it. You can use **skimp/index-exists?** to see if an index exists or not. If it does not, you are able to apply the configuration settings:
[asis/style/*
if not skimp/index-exists?
    %a-test-index [
    skimp/set-word-definition ....    ;; define what a "word" is in this index
]
asis]
The configuration settings are defined later. In summary, they are:
[li **skimp/set-config** -- performance settings for the index
[li **skimp/set-word-definition** -- define what a "word" is for the index
list]
[h3 closing a skimp index
[p **without caching**: if you are not using the **skimp cache**, then no explicit close is needed: all data is written back on each update operation, so the data on disk is current.
[p **with caching**: if one or more indexes are cached (see below for details) you need to write the cache out before ending your program, eg:
[asis/style/*
skimp/write-cache/flush %test-index-1
skimp/write-cache/flush %test-index-2
asis]
[p If you've not been keeping a close eye on which indexes you have opened and cached, use **write-cache-all** to write them **all** back to permanent storage:
[asis/style/*
skimp/write-cache-all/flush
asis]
div]
[div/style/*-1
[h2 caching one or more indexes
[h3 /defer
[p These operations can all take the **/defer** refinement:
[li **add-words** -- add a document and its words
[li **add-bulk-words** -- add several documents and their words in one go
[li **remove-document** -- remove a document and all its words
[li **remove-documents** -- remove several documents and all their words
[li **set-config** -- performance settings for an index
[li **set-word-definition** -- define what a "word" is
[li **set-owner-data** -- any fields you want to keep in the index header
list]
[p Using **/defer** can speed things up as skimp does not write the index files back to permanent storage until you issue either a **skimp/write-cache** or a **skimp/write-cache-all**.
[p As explained above:
[li **write-cache** takes as an argument the name of an index. It writes that index -- provided it was in the cache.
[li **write-cache-all** has no arguments. It writes back all indexes found in the cache.
list]
[h3 /flush
[p Both **write-cache** and **write-cache-all** can take the **/flush** refinement.
[li with **/flush**: the index files are written out and then removed from the cache -- so any future operations will reread the index files from permanent storage.
[li without **/flush**: the index files are written out **and** retained in the cache.
list]
div]
[div/style/*-1
[h2 adding and removing documents from an index
[h3 overview
[p There are four main functions for this:
[li **add-words** -- adds all the words in a document
[li **add-bulk-words** -- adds all the words in multiple documents
[li **remove-document** -- removes a document from the index
[li **remove-documents** -- removes several documents from the index
list]
[p As noted above, each can take the **/defer** refinement. If you have a series of operations to perform on an index, using **/defer** can speed things up considerably.
[h3 document names
[p By default, a **document's name** must be a **string!** -- previous examples include **"einstein-1"** and **"saying-1"**.
[p There is a configuration setting that allows document-names to be **integer!**. That is explained later (See **set-config/integer-document-names**). All the document-names in one index must be of the same datatype.
[h3 add-words
[p Adds all the **words** in one **document** to an index.
[p If the document is already in the index, it **adds** all the new words found in the latest version of the document, but does not remove any words that are no longer found in it: to **replace** a document in an index, use **remove-document** followed by **add-words**.
[asis/style/*
USAGE:
    skimp/add-words index-name document-name words /defer

ARGUMENTS:
    index-name -- name of the index (Type: file)
    document-name -- name of the document (Type: string integer)
    words -- words to add (Type: string block)

REFINEMENTS:
    /defer -- Do not write to permanent storage
asis]
[p The **words** can be supplied as either a **string!** or a **block!**
[li if a **string!** -- skimp parses the string to extract the words. You can override skimp's definition of a word in several ways -- see **what is a word?**
[li if a **block!** -- skimp expects a **block!** of **string!s** -- each string is added to the index **as is**. This allows you complete control over what is considered a "word".
list]
[p examples:
[asis/style/*
skimp/add-words %my-index "parrot" "->: I wish to make a complaint -:)"
skimp/add-words %my-index "smilies" ["-:)" "->:" "@/."]
asis]
[p In the **smilies** document, the various keyboard symbols are treated as indexable words, while in **parrot** they are not.
[h3 add-bulk-words
[p Adds all the **words** in zero or more **documents** to an index.
[p Very similar to **add-words** except **add-bulk-words** accepts more than one document.
[asis/style/*
USAGE:
    skimp/add-bulk-words index-name data-set /defer

ARGUMENTS:
    index-name -- name of the index (Type: file)
    data-set -- set of details to add (Type: block)

REFINEMENTS:
    /defer -- Do not write to permanent storage
asis]
[p The **data-set** is a **block** with pairs of entries: **document-name** followed by **words** for that document.
[p As with **add-words** the **words** can be **string!** or **block!**
[p example:
[asis/style/*
skimp/add-bulk-words %my-index [
    "parrot" "I wish to make a complaint -:)"
    "smilies" ["-:)" "->:" "@/."]
]
asis]
[p The example adds the same two documents as the **add-words** example.
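[p As noted under **add-words**, adding a document that is already indexed never removes words that have since dropped out of it. The following is a minimal sketch of the replace step, using only calls documented here; the helper name **replace-document** is hypothetical and is not part of skimp:
[asis/style/*
;; hypothetical helper: swap a document's indexed words for new ones
replace-document: func [
    index-name [file!]
    document-name [string! integer!]
    words [string! block!]
][
    ;; drop everything currently indexed for the document
    skimp/remove-document/defer index-name document-name
    ;; index the new version
    skimp/add-words/defer index-name document-name words
    ;; both updates were deferred, so write the index back once
    skimp/write-cache index-name
]

replace-document %my-index "parrot" "This parrot is no more"
asis]
[p Deferring both updates and writing once is a design choice rather than a requirement; calling the two functions without **/defer** should give the same result with one extra write.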
[h3 choosing between add-words and add-bulk-words
[li If you have multiple documents to add, **add-bulk-words** can be up to five times faster than **add-words**
[li multiple **add-words** gives an interactive application more chances to report progress to the waiting user
[li with too many documents at once (say over 30, though you will need to test for your application as document size is a big factor), **add-bulk-words** can be slower than **add-words**.
list]
[h3 remove-document
[p Removes all the **words** in one **document** from an index.
[p If the document is not in the index, then no action is taken.
[asis/style/*
USAGE:
    skimp/remove-document index-name document-name /defer

ARGUMENTS:
    index-name -- name of the index (Type: file)
    document-name -- name of the document (Type: string integer)

REFINEMENTS:
    /defer -- Do not write to permanent storage
asis]
[p examples:
[asis/style/*
skimp/remove-document %my-index "parrot"
skimp/remove-document %my-index "einstein-1"
asis]
[p **question:** Can I remove just some of the words of a document from an index?
[p **response:** No, not in this version of skimp.
[p **question:** How do I delete an entire index?
[p **response:** See **remove-index** below.
[h3 remove-documents
[p Removes all the **words** in zero or more **documents** from an index.
[p Document names which are not in the index are ignored.
[asis/style/*
USAGE:
    skimp/remove-documents index-name document-list /defer

ARGUMENTS:
    index-name -- name of the index (Type: file)
    document-list -- names of the documents (Type: block)

REFINEMENTS:
    /defer -- Do not write to permanent storage
asis]
[p example:
[asis/style/*
skimp/remove-documents %my-index ["parrot" "einstein-1"]
asis]
[h3 remove-document vs remove-documents
[p **remove-document** is a courtesy wrapper for **remove-documents**, so it does not matter if you use **remove-documents** with just one document name; the effect is the same.
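[p Because the single-document call is a wrapper, the two forms below should leave the index in the same state. This is a sketch only; the block **stale-documents** is a hypothetical list built by your own application:
[asis/style/*
;; hypothetical list of documents the application no longer wants indexed
stale-documents: ["parrot" "einstein-1" "saying-1"]

;; one call for the whole batch
skimp/remove-documents %my-index stale-documents

;; equivalent, one call per document:
;; foreach name stale-documents [skimp/remove-document %my-index name]
asis]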
[p If you have multiple documents to remove it is usually **much faster** to use **remove-documents** once rather than **remove-document** multiple times.
[p That is because skimp needs to scan the **entire index** for each remove operation. If you are removing 50 documents, that is 50 scans using **remove-document** rather than one single scan using **remove-documents**.
div]
[div/style/*-1
[h2 searching for documents
[p To find the documents that contain specific words, use either:
[li **find-word** -- searches for a specific word
[li **find-words** -- searches for one or more words
list]
In both cases, a search can be negated (ie find all the documents that do **not** contain a word) by using the **negate prefix**.
[p You can also search more widely, using:
[li **get-indexed-words** -- find all the words that are contained in the index
[li **get-indexed-words-for-document** -- find all the words indexed for a specific document
list]
[h3 find-word
[p Returns a block of the document-names that contain the given word.
[asis/style/*
USAGE:
    skimp/find-word index-name word

ARGUMENTS:
    index-name -- name of the index (Type: file)
    word -- word to search for (Type: string)
asis]
[p examples:
[asis/style/*
skimp/find-word %my-index "cat"
skimp/find-word %my-index "~cat"
asis]
[li The first example returns the names of all documents containing the word **"cat"**.
[li The second returns all those that do not contain **"cat"**.
list]
[p **question**: How can I know if the word I am looking for is well-formed? ie that it is formed to the same parsing rules as **add-words**?
[p **response**: Good question! One way is to use **extract-words-from-string** which is explained later on. It gives you access to the same processing as **add-words**. (If you do use **extract-words-from-string**, it returns a **block** so you need to use it in conjunction with **find-words** below).
[h3 find-words
[p Returns a block of document-names that contain **all** the words.
[asis/style/*
USAGE:
    skimp/find-words index-name words

ARGUMENTS:
    index-name -- name of the index (Type: file)
    words -- words to search for (Type: block)
asis]
[p examples:
[asis/style/*
skimp/find-words %my-index ["cat" "sat" "mat"]
skimp/find-words %my-index ["~cat" "sat" "mat"]
skimp/find-words %my-index skimp/extract-words-from-string/for-search %my-index "~Cat, sat mat."
asis]
[li The first example returns the names of all documents containing the three words **"cat"**, **"sat"** and **"mat"**
[li The second returns all those that contain **"sat"** and **"mat"** but do **not** contain **"cat"**.
[li The third example is the same as the second (assuming the definition of a word in force excludes punctuation) except that it uses **extract-words-from-string** to turn a user-supplied search string into a block of words.
list]
[h3 get-indexed-words
[p Returns a block of all the words in the index that begin with a specific character. Optionally, also returns a block of the matching document names.
[asis/style/*
USAGE:
    skimp/get-indexed-words index-name character /document-list

ARGUMENTS:
    index-name -- name of the index (Type: file)
    character -- first character of the words being sought (Type: char string)

REFINEMENTS:
    /document-list -- Also return the matching document names
asis]
[p examples:
[asis/style/*
skimp/get-indexed-words %my-index "n"
skimp/get-indexed-words/document-list %my-index "n"
asis]
[h4 what first letters exist in the index?
[p **get-index-information** (see later) provides a list of all first characters that exist in the index. This code will display all the words in the index:
[asis/style/*
index-info: skimp/get-index-information %my-index
foreach first-char index-info/top-index [
    probe skimp/get-indexed-words %my-index first-char
]
asis]
[h3 get-indexed-words-for-document
[p Returns a block of all the words in a specific document that match a specific first character.
[asis/style/*
USAGE:
    skimp/get-indexed-words-for-document index-name document-name character

ARGUMENTS:
    index-name -- name of the index (Type: file)
    document-name -- name of document (Type: integer string)
    character -- first character of the words being sought (Type: char string)
asis]
[p examples:
[asis/style/*
skimp/get-indexed-words-for-document %my-index "einstein-1" "b"
skimp/get-indexed-words-for-document %my-index "saying-1" "m"
asis]
[h4 what documents are in the index?
[p Skipping ahead, **get-index-information** is one way to get a list of all document names.
div]
[div/style/*-1
[h2 other index management functions
[p These functions allow you to manage your skimp indexes efficiently and effectively:
[li **index-exists?** -- checks if an index exists or not
[li **get-index-information** -- returns header information about an index
[li **remove-index** -- delete an index entirely
[li **set-owner-data** -- add your own index-specific data about an index
[li **write-cache** -- write out one specific index
[li **write-cache-all** -- write out all open indexes
[li **flush-cache** -- empty one specific index from the cache
[li **flush-cache-all** -- empty all open indexes from the cache
list]
[h3 index-exists?
[p Returns true or false depending on whether an index exists or not.
[asis/style/*
USAGE:
    skimp/index-exists? index-name

ARGUMENTS:
    index-name -- name of the index (Type: file)
asis]
[p example:
[asis/style/*
if not skimp/index-exists? %my-indexes/index-1 [
    print "sorry -- not yet created"
]
asis]
[h3 get-index-information
[p Returns an **object!** containing information about the index, or **none** if the index does not exist.
[asis/style/*
USAGE:
    skimp/get-index-information index-name

ARGUMENTS:
    index-name -- name of the index (Type: file)
asis]
[p example:
[asis/style/*
if object?
    ind-header: skimp/get-index-information %my-indexes/index-1 [
    probe ind-header
]
asis]
[h4 index header object
[p Has these fields:
[cell/class/lskey1 **index-file:**
[cell/class/lsdata1 **file**: the name of the index file
[row
[cell/class/lskey2 **top-index:**
[cell/class/lsdata2 **block**: of first characters indexed.
[row
[cell/class/*-3 **owner-data:**
[cell/class/*-3 **object**: any owner data you have set with **set-owner-data** (see later)
[row
[cell/class/*-3 **config:**
[cell/class/*-3 **object**: create-time configuration settings. See later
[row
[cell/class/*-3 **word-parameters:**
[cell/class/*-3 **object**: parameters defining what a word is. See later
[row
[cell/class/*-3 **make-word-list:**
[cell/class/*-3 **function**: the actual function used in this index to define what a word is.
table]
[h3 remove-index
[p Deletes all files related to an index.
[asis/style/*
USAGE:
    skimp/remove-index index-name

ARGUMENTS:
    index-name -- name of the index (Type: file)
asis]
[p example:
[asis/style/*
skimp/remove-index %my-old-index
asis]
[h3 set-owner-data
[p Allows you to store an **object!** in the index header. You could use this to (say) record an extended name for the index, or to add some application-specific notes about the index.
[asis/style/*
USAGE:
    skimp/set-owner-data index-name owner-data /defer

ARGUMENTS:
    index-name -- name of the index (Type: file)
    owner-data -- your data (Type: object)

REFINEMENTS:
    /defer -- Do not write to permanent storage
asis]
[p example:
[asis/style/*
skimp/set-owner-data %my-index make object! [
    last-updated: now
    index-name: "All my photo captions"
    support: "call 555-1234"
]
asis]
[h4 updating existing owner-data
[p Simply specify the fields you want to change....
[asis/style/*
skimp/set-owner-data %my-index make object! [
    support: "call 555-9876"
]
asis]
[p ....Other fields in the owner-data are left unchanged.
[p To get the full owner-data record, use **get-index-information**.
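[p Reading the owner-data back might look like the sketch below. It uses only **get-index-information** and the **owner-data** header field described above, and assumes the **support** field from the earlier example has been set:
[asis/style/*
index-info: skimp/get-index-information %my-index

;; get-index-information returns none if the index does not exist
if object? index-info [
    ;; the whole owner-data object
    probe index-info/owner-data
    ;; a single field, reached by path
    print index-info/owner-data/support
]
asis]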
[h3 write-cache
[p Writes the data for a specific index to permanent storage. Use if you have been using the **/defer** refinement on update operations.
[asis/style/*
USAGE:
    skimp/write-cache index-name /flush

ARGUMENTS:
    index-name -- name of the index (Type: file)

REFINEMENTS:
    /flush -- Purge the index from the cache after writing (further operations will cause it to be re-read)
asis]
[p examples:
[asis/style/*
skimp/write-cache %my-index-1
skimp/write-cache/flush %my-index-2
asis]
[h3 write-cache-all
[p Writes the data for all indexes to permanent storage. Use if you have been using the **/defer** refinement on update operations, and your clean-up routines are unsure of which indexes your software has been updating.
[asis/style/*
USAGE:
    skimp/write-cache-all /flush

ARGUMENTS:
    none

REFINEMENTS:
    /flush -- Purge the indexes from the cache after writing (further operations will cause them to be re-read)
asis]
[p examples:
[asis/style/*
skimp/write-cache-all
skimp/write-cache-all/flush
asis]
[h3 flush-cache
[p Clears an index from the cache without first writing it to permanent storage.
[p If you have been updating an index using the **/defer** refinement, this is a way of purging all the changes since the last **write-cache**.
[p If you have an index open purely for reading (ie not one you are updating in the current task) this is a way of forcing the index to be reread from permanent storage. That may be of use if either:
[li the index is being updated by another task, and you want to read the updated version; or
[li you have a large index and wish to release the memory it is occupying
list]
[asis/style/*
USAGE:
    skimp/flush-cache index-name

ARGUMENTS:
    index-name -- name of the index (Type: file)
asis]
[p example:
[asis/style/*
skimp/flush-cache %my-index-1
asis]
[h3 flush-cache-all
[p Clears all indexes from the cache, without writing them to permanent storage.
[asis/style/*
USAGE:
    skimp/flush-cache-all

ARGUMENTS:
    none
asis]
[p example:
[asis/style/*
skimp/flush-cache-all
asis]
div]
[div/style/*-1
[h2 what is a word?
[p skimp indexes the **words** in **documents**. So an obvious question is: how does it know what a word is?
[p The simple answer is:
[li 1. skimp uses **make-word-list.r**
[li 2. **make-word-list** is highly configurable, so you can tweak it to fit your needs
[li 3. if you need more tweaks than are possible with the configuration parameters, you can provide your own function to extract the words in a document.
[li 4. In addition, the **block** format of **add-words** and **add-bulk-words** lets you directly state what the indexable words are.
list]
[p In principle, a **word** can be **any** string of **one or more** characters. You are not limited to letters, numerals or even printable characters.
[p Your definition of what a word is (including any function you have supplied to extract the words in a document) is **held in the index itself**. So the definition of a word remains stable and constant throughout the life of the index.
[p There are two functions for managing the definition of a word for an index:
[li **set-word-definition** -- add or update the definition of a word
[li **extract-words-from-string** -- test how an index defines a word
list]
In addition, **get-index-information** (see above) returns the current definition for any given index.
[h3 set-word-definition
[p Defines what a word is for a given index.
[asis/style/*
USAGE:
    skimp/set-word-definition index-name /parameters parm-obj /make-word-list mwlf /defer

ARGUMENTS:
    index-name -- name of the index (Type: file)

REFINEMENTS:
    /parameters -- extend/change definition for make-word-list function
        parm-obj (Type: object)
    /make-word-list -- replacement function
        mwlf (Type: function none)
    /defer -- Do not write to permanent storage
asis]
[p examples:
[asis/style/*
skimp/set-word-definition/parameters %my-index make object!
[
    initial-letters: charset compose [#"a" - #"z" #"A" - #"Z" (to-char 199) (to-char 231)]
]

skimp/set-word-definition/make-word-list %my-index func [parms string] [return unique parse/all string " "]
asis]
[li the first example changes the default parameters for **make-word-list** to add chars 199 and 231 (C-cedilla) to the letters that make up a word
[li the second example replaces the default **make-word-list** function with a custom (and very basic) function to find the words in a document.
[li these options are explained in more detail below
list]
[h4 changing the word definition for an existing index
[p You can do this, but it may not be wise. It will not affect any of the words already indexed, but it may affect your ability to search for them.
[p For example, you build an index over 1000 documents in which decimal numbers are not treated as words. Later you change the definition so they are treated as words, and add another 1000 documents. **Only the latter 1000 documents will have their numbers indexed**.
[h4 parameter object
[p The **parameter object** is passed to the **make-word-list** function. By default it is similar to the default object used by make-word-list.
[p You can find skimp's settings for an index like this:
[asis/style/*
probe get in skimp/get-index-information %my-index 'word-parameters
asis]
[h3 make-word-list function
[p By default, skimp looks for **%make-word-list.r** in the current folder:
[li if found: skimp uses it as the default make-word-list function
[li if not, it uses a fairly useless minimal function:
[asis
func [
    parms
    string [string!]
][
    return unique sort parse/all trim/lines copy string " "
]
asis]
list]
[p If you want skimp to use another function that you have written, you can specify it with **set-word-definition/make-word-list**.
[li You must supply a **function** that takes two arguments and one refinement:
[list
[li **an object** -- you will be passed the **parameter object** as described above. You are, of course, free to ignore it.
[li **a string** -- the string that you are required to break into indexable words
[li **/for-search** -- a refinement. Described below.
list]
[li Your function will be saved in the index header, and later executed from there. **So it must be self-contained, making no references outside of itself**.
[li Your function must return a **block** of zero or more **unique** strings. These are the words you have extracted from the input string.
[li Your returned block may contain the empty string (""). If so, it will be ignored.
list]
[p example:
[asis/style/*
skimp/set-word-definition/make-word-list %my-index func [
    obj [object!]
    str [string!]
    /for-search
][
    return unique parse/all str "."
]
asis]
[p The example is hardly the most useful word-extracting function in the world :-)
[p To test your function (or the built-in function) use **extract-words-from-string** (See later).
[h4 /for-search refinement
[p There are two subtly different reasons why you may want to extract the words from a string:
[li the string is the source document that you want to index
[li the string is a user-supplied search string that you want to turn into separate words so you can search for them using **find-words**
list]
[p In your application the two uses **may** need to behave slightly differently, especially with regard to handling the **not-prefix**. As an example, the built-in make-word-list function acts differently when given a string that contains tildes -- a leading tilde is preserved with the **/for-search** refinement:
[asis/style/*
skimp/extract-words-from-string %my-index "I have some ~tildes in t~~~his ~string~"
== ["have" "I" "in" "some" **"string~"** "tildes" "t~~~his"]

skimp/extract-words-from-string/for-search %my-index "I have some ~tildes in t~~~his ~string~"
== ["have" "I" "in" "some" "t~~~his" **"~string~"** "~tildes"]
asis]
[h3 extract-words-from-string
[p We've just about covered this above. Provides access to the **make-word-list** function in an index.
Allows you to:
[li analyse the built-in behavior
[li see the effects of changing the **parameters** with **set-word-definition**
[li test any replacement make-word-list function you have written
list]
[asis/style/*
USAGE:
    skimp/extract-words-from-string index-name string /for-search

ARGUMENTS:
    index-name -- name of the index (Type: file)
    string -- string from which to extract the words (Type: string)

REFINEMENTS:
    /for-search -- Says if words need to be extracted for searching or indexing
asis]
[p example:
[asis/style/*
skimp/extract-words-from-string %my-index {[here_are (some words) "to" $index!}
== ["are" "here" "index!" "some" "to" "words"]
asis]
div]
[div/style/*-1
[h2 set-config
[h3 basic index configuration
[p There are three magic settings you can apply when creating an index that may affect its performance.
[p Once these are set, they cannot be changed for the lifetime of the index (See **skimp-tools** later for possible exceptions). The settings do not matter much while you are evaluating or playing with skimp. But you may want to run benchmarks on large data sets before building any skimp indexes for critical work.
[p The three settings are:
[li **index-levels** -- how deep an index to build
[li **integer-document-names** -- whether document names are **string!s** or **integer!s**
[li **one-file** -- whether the index consists of one file or a series of files
list]
[p The meaning of these settings is discussed below.
[p To find the settings for an index, use **get-index-information**. The values are returned in the **config** object:
[asis/style/*
index-info: skimp/get-index-information %my-index
probe index-info/config
asis]
[p To set or change a value, use **set-config** (see below).
[p As you can only set these values on a new index (one with no words in it), you may need to surround any code that sets them with a test of whether the index exists:
[asis/style/*
if not skimp/index-exists? %my-index [
    ....
code to set config values
]
asis]
[h3 set-config
[p Sets or changes the value of a basic configuration variable.
[asis/style/*
USAGE:
    skimp/set-config index-name config /defer

ARGUMENTS:
    index-name -- name of the index (Type: file)
    config -- object containing your configuration settings (Type: object)

REFINEMENTS:
    /defer -- Do not write to permanent storage
asis]
[p examples:
[asis/style/*
probe skimp/set-config %my-index make object! [index-levels: 6]
probe skimp/set-config %my-index make object! [index-levels: 3 one-file: true]
asis]
[p The response is the newly updated configuration object.
[p Note that you do not need to specify all the configuration values. The ones you omit will not be changed.
[h3 set-config/one-file
[p The **one-file** config setting sets whether the index is created as a single file or a series of files.
[li **one-file: true** -- the index will be created as one file
[li **one-file: false** -- there will be one file for the index header, plus one file for each first character indexed. (If your index indexes words beginning a, b and c, there will be four files in the index: header, a-file, b-file, c-file.) The index file names are made up as **index-name-N.sif** where N is **to-integer to-char word/1**
list]
[p example:
[asis/style/*
probe skimp/set-config %my-index make object! [one-file: true]
asis]
[h4 what's the point?
[p The main point is speed of retrieval. skimp was written to perform well when running queries on a webserver. A typical search on a web server consists of just a few words. With a multiple-file index, we need load and initialise only the parts of the index that hold those words, which means less i/o and cpu time.
[p The same advantage may apply in a desktop application: rather than having the whole index loaded at application start-up or on the first query, just the relevant parts need be loaded.
[p There are similar advantages when updating an index -- if a new document has words that begin with just [a b c] then only those three files will be written back.
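[p As a rough illustration of the file-naming rule given above (N is **to-integer to-char word/1**), here is a small sketch. It is not taken from skimp.r: the helper name **sif-name-for** is hypothetical, and the sketch assumes the word has already been folded to lowercase.
[asis/style/*
;; hypothetical helper for illustration only -- not part of skimp
sif-name-for: func [index-name [file!] word [string!]][
    ;; N is the character code of the word's first character
    rejoin [index-name "-" to-integer to-char word/1 ".sif"]
]

probe sif-name-for %skimp-folder/skimp-test-index "cat"
== %skimp-folder/skimp-test-index-99.sif
asis]
[p By the same rule, a document whose new words begin only with a, b and c would touch just the files ending in **-97.sif**, **-98.sif** and **-99.sif**.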
[p **skimp-tools** (See later) will provide a way of flipping this setting after an index has been created.
[h3 set-config/index-levels
[p Changes the default number of levels in the index.
[p The **skimp-tools** documentation will contain some more detailed information on the internals of the skimp index. A brief explanation for now:
[p The index is a bit like (not identical to) a **b-tree**. If you have a three-level index, then the words REBOL and REBEL are indexed like this:
[asis/style/*
"r" ===> "e" ===> "b" ===> ["el" [1 2 3] "ol" [1 4 58x100]]
asis]
[p That is: there are three levels of index (for the characters R, E and B). They point you to a list that contains the remaining letters of the words that begin REB, plus an index of which documents contain them (REBEL is in docs 1, 2 and 3; REBOL is in docs 1, 4 and 58 through 157, the pair 58x100 standing for a run of 100 ids starting at 58).
[p **index-levels** lets you set the number of levels of index:
[li the default is 3 (like the example above)
[li this value cannot be changed once any words are indexed (at least, not without recreating the whole index)
[li the minimum is 1
[li there is no maximum, but anything above 4 is likely to be adding more layers than you could ever need.
list]
[p example:
[asis/style/*
probe skimp/set-config %my-index make object! [index-levels: 2]
asis]
[h4 what's the best setting?
[p You'll need to experiment with your live data. In practice, choosing the best **index-levels** setting makes between 10% and 20% difference to file sizes and retrieval time, so the setting is unlikely to make a crucial difference. (Your experience may vary with indexes that are many megabytes in size).
[h3 set-config/integer-document-names
[p In all the examples so far, **document names** have been **string!s**, though there have been some hints that they can also be integers.
[p Internally, skimp uses **document ids**, which are always integers.
[p skimp uses a **document name** to **document id** table to translate between them.
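[p To picture what such a table does, here is a toy sketch. It is **not** skimp's internal format (which is not described in this document); the names **doc-names** and **name-to-id** are made up for the illustration. Each new document name is simply given the next integer id:
[asis/style/*
;; toy illustration only -- NOT how skimp stores its table
doc-names: copy []

name-to-id: func [name [string!] /local pos][
    either pos: find doc-names name [
        index? pos              ;; name seen before: reuse its id
    ][
        append doc-names name   ;; new name: give it the next id
        length? doc-names
    ]
]

probe name-to-id "einstein-1"
== 1
probe name-to-id "saying-1"
== 2
probe name-to-id "einstein-1"
== 1
probe pick doc-names 2
== "saying-1"
asis]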
[p So, if it happens that your document names are integers that meet the requirements below, you can eliminate that conversion table. The requirements are:
[li must be an integer
[li must be 1 or higher (ie not zero or negative)
[li it will help if they are more-or-less consecutive, though they do not need to start from 1. If your document ids are not more-or-less consecutive, you may be better off treating them as strings.
list]
[p examples:
[asis/style/*
skimp/set-config %my-index make object! [integer-document-names: true]
skimp/add-words %my-index 55 "this document name is an integer"
asis]
div]
[div/style/*-1
[h2 limitations
[p Just so you know:
[li All indexes are **case insensitive** (words are folded to **lowercase**)
[li skimp indexes the presence or absence of a word in a document. It does not note whether words are near to each other.
[li The empty string is **never** indexed:
[asis/style/*
add-words %my-index ["" "hello"]
asis]
indexes only **hello**
[li **remove-document** de-indexes an entire document. There is currently no way to remove just some of the words.
[li The definition of what a ""word"" is is crucial to the operation of an index; you may need to spend time working on this as part of any indexing project you are planning.
[li **find-words** ANDS the results together. If you want a search that uses OR, you need to make multiple calls to find-words, and merge the results, eg:
[asis/style/*
unique sort rejoin [
    find-words %my-index ["cat"]
    find-words %my-index ["dog" "~horse"]
    find-words %my-index ["pony" "wombat"]
]
asis]
Finds all documents that contain cat **OR** (dog **AND NOT** horse) **OR** (pony **AND** wombat)
list]
div]
[div/style/*-1
[h2 Coming soon
[p We're hoping to release two extras in the next few weeks. Look out for them on REBOL.org:
[li **skimp-my-altme.r** -- demo applications
[li **skimp-tools.r** -- extra skimp facilities
list]
[h3 skimp-my-altme.r
[p This is a demonstration application.
It will consist of an API to index the posts in **Altme worlds**. Depending on time, there may be a cheapo GUI in front of it. **If you would like to help write a decent front end to the Altme world indexer, please contact me today**.
[h3 skimp-tools.r
[p This will extend the **skimp** function to provide features a **developer** may need when writing applications that use skimp, or when extending the basics of skimp itself. They will include:
[li **integrity-check** -- a scan of an index to find any build problems;
[li **flip-one-file** -- an option to change an index from one large file to a set of smaller ones;
[li **rename-document** -- a way to change **document-names** without having to delete and reindex a document.
list]
[p The accompanying document will include some developer's notes on the internals of a skimp index.
div]
div]