Documentation for: skimp.r

script: skimp.r
title: Simple keyword index management program
author: Sunanda
date: 23-apr-2007
Version: 0.0.2

purpose

skimp.r lets you index the words in a set of documents. You can then retrieve the list of documents that contain (or do not contain) specific words.

skimp.r is used extensively inside REBOL.org for many of its search indexes.

1. thanks and credits
2. quick start highlights

2.1. Version summary
2.2. sample data

Throughout parts of this documentation we use the following data structure as an example: it is a block with document-names followed by the document's content.

    skimp-test-data: [
        "einstein-1" "You see, the wire telegraph is a kind of a very, very long cat."
        "saying-1"   "Managing programmers is like herding cats."
        "einstein-1" "You pull his tail in New York and his head is meowing in Los Angeles."
        "lyric-1"    "I thought I saw a puddy cat a-creeping up on me"
        "einstein-1" "Do you understand this?"
        "medical-1"  "The cat-scan showed nothing unusual"
        "einstein-1" "And radio operates exactly the same way: you send signals here, they receive them there."
        "saying-2"   "Dogs see people as companions; cats see people as staff."
        "einstein-1" "The only difference is that there is no cat."
        "cliche-1"   "It's been raining cats and dogs since last night!"
        "doggerel-1" "The cat sat on the mat"
        "saying-3"   "While the cat's away, the mice will play"
    ]

Note that the document einstein-1 appears several times in the set. This is to help illustrate what happens when skimp's index is updated for the same document.

2.3. build an index

Let's build a basic skimp index from that data set:

    ;; ensure folder exists for index
    if not exists? %skimp-folder [make-dir %skimp-folder]
    index-name: %skimp-folder/skimp-test-index

    ;; add each document to the index
    foreach [document-name contents] skimp-test-data [
        skimp/add-words index-name document-name contents
    ]

You should now have a set of files in the %skimp-folder folder. Their names look like:

    skimp-test-index-117.sif

2.4. search index for words

We can now use skimp to find words in documents:

    probe skimp/find-word index-name "cat"
    == ["saying-3" "doggerel-1" "medical-1" "lyric-1" "einstein-1"]

    probe skimp/find-word index-name "the"
    == ["saying-3" "doggerel-1" "medical-1" "einstein-1"]

    probe skimp/find-word index-name "cats."
    == []    ;; "cats." is not a word according to the default assumptions

    probe skimp/find-word index-name "cat-scan"
    == ["medical-1"]    ;; "cat-scan" is one word

    probe skimp/find-word index-name "scan"
    == ["medical-1"]    ;; the "scan" part of "cat-scan" is also a findable word

    probe skimp/find-word index-name "difference"
    == ["einstein-1"]

    ;; note: find-WORDS takes a block (find-WORD takes a string)
    probe skimp/find-words index-name ["a" "cat"]
    == ["lyric-1" "einstein-1"]

    probe skimp/find-word index-name "~dog"
    == ["einstein-1" "saying-1" "lyric-1" "medical-1" "saying-2" "cliche-1" "doggerel-1" "saying-3"]

Notes
2.5. retrieve more from the index

    ;; all indexed words beginning with an a:
    probe skimp/get-indexed-words index-name "a"
    == ["a" "a-creeping" "and" "angeles" "as" "away"]

    ;; ditto, with the names of documents that contain the words:
    probe skimp/get-indexed-words/document-list index-name "a"
    == ["a" ["einstein-1" "lyric-1"]
        "a-creeping" ["lyric-1"]
        "and" ["cliche-1" "einstein-1"]
        "angeles" ["einstein-1"]
        "as" ["saying-2"]
        "away" ["saying-3"]]

    ;; all words beginning t in einstein-1:
    probe skimp/get-indexed-words-for-document index-name "einstein-1" "t"
    == ["tail" "telegraph" "that" "the" "them" "there" "they" "this?"]

2.6. remove a document

You can remove a document entirely like this:

    probe skimp/find-word index-name "difference"
    == ["einstein-1"]    ;; finds one document

    probe skimp/remove-document index-name "einstein-1"

    probe skimp/find-word index-name "difference"
    == []    ;; finds no document

2.7. other highlights

In summary:
3. usage in detail

3.1. starting skimp

Simply use:

    do %skimp.r
3.2. closing down

No need to do anything, unless you have used updates with the /defer refinement. If so, you need to write back all open indexes:

    skimp/write-cache-all/flush

You could then, if you wish, remove skimp from memory:

    unset 'skimp
    unset 'rse-ids

3.3. configuring skimp

skimp itself needs very little configuring for use (you can extensively configure individual indexes -- see later).

There are only two global configuration settings. They set the prefix and suffix of the file names for all indexes:

    do %skimp.r
    skimp/index-name-prefix: "live-build-"
    skimp/index-name-suffix: ".index"

With the example above, the files for index my-index will be called live-build-my-index-nnn.index

The defaults are:
4. creating, configuring and closing an index

4.1. creating a skimp index

You do not need to do anything to create an index -- simply start using it and it will be created automatically. The only prerequisite is that the folder to store the index exists....

    skimp/add-words %/c/dev/indexes/ind-1 "einstein-1" "You see"

....will create an index called ind-1 (stored as .sif files) in the folder /c/dev/indexes -- provided that folder already exists.

4.2. configuring a skimp index

There are some settings you may wish to apply before adding any documents to an index; and there are some that you have to apply before adding any documents. It is best to configure the index before performing other operations on it.

You can use skimp/index-exists? to see if an index exists or not. If it does not, you are able to apply the configuration settings:

    if not skimp/index-exists? %a-test-index [
        skimp/set-word-definition ....    ;; define what a "word" is in this index
    ]

The configuration settings are defined later. In summary, they are:

4.3. closing a skimp index

without caching: if you are not using the skimp cache, then no explicit close is needed: all data is written back on each update operation, so the data on disk is current.

with caching: if one or more indexes are cached (see below for details) you need to write the cache out before ending your program, eg:

    skimp/write-cache/flush %test-index-1
    skimp/write-cache/flush %test-index-2

If you've not been keeping a close eye on what indexes you have opened and cached, use write-cache-all to write them all back to permanent storage:

    skimp/write-cache-all/flush

5. caching one or more indexes

5.1. /defer

These operations can all take the /defer refinement:
Using /defer can speed things up, as skimp does not write the index files back to permanent storage until you issue either a skimp/write-cache or a skimp/write-cache-all.

As explained above:
5.2. /flush

Both write-cache and write-cache-all can take the /flush refinement.
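As a minimal sketch of how /defer and /flush might be combined for a batch update: the example below defers every add-words call and then writes the index back once. It assumes skimp.r is in the current folder; the docs block is sample data reused from the quick start, and the folder and index names are illustrative only.

```rebol
REBOL [Title: "Sketch: batching updates with /defer"]

do %skimp.r    ;; assumption: skimp.r is in the current folder

;; the folder that holds the index must exist before the first update
if not exists? %skimp-folder [make-dir %skimp-folder]
index-name: %skimp-folder/skimp-test-index

;; sample data: document-name followed by document contents
docs: [
    "doggerel-1" "The cat sat on the mat"
    "saying-3"   "While the cat's away, the mice will play"
]

;; each update stays in the cache; nothing is written to disk yet
foreach [document-name contents] docs [
    skimp/add-words/defer index-name document-name contents
]

;; one write for the whole batch; /flush also purges the index from the cache
skimp/write-cache/flush index-name
```

If the program ends before the write-cache call, the deferred updates are not on disk, so a clean-up routine that calls skimp/write-cache-all/flush is a sensible safety net.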
6. adding and removing documents from an index

6.1. overview

There are four main functions for this:
As noted above, each can take the /defer refinement. If you have a series of operations to perform on an index, using /defer can speed things up considerably.

6.2. document names

By default, a document's name must be a string! -- previous examples include "einstein-1" and "saying-1".

There is a configuration setting that allows document-names to be integer!. That is explained later (See set-config/integer-document-names).

All the document-names in one index must be of the same datatype.

6.3. add-words

Adds all the words in one document to an index.

If the document is already in the index, it adds all the new words found in the latest version of the document, but does not remove any words that are no longer found in it: to replace a document in an index, use remove-document followed by add-words.

    USAGE:
        skimp/add-words index-name document-name words /defer

    ARGUMENTS:
        index-name -- name of the index (Type: file)
        document-name -- name of the document (Type: string integer)
        words -- words to add (Type: string block)

    REFINEMENTS:
        /defer -- Do not write to permanent storage

The words can be supplied as either a string! or a block!
examples:

    skimp/add-words %my-index "parrot" "->: I wish to make a complaint -:)"
    skimp/add-words %my-index "smilies" ["-:)" "->:" "@/."]

In the smilies document, the various keyboard symbols are treated as indexable words, while in parrot they are not.

6.4. add-bulk-words

Adds all the words in zero or more documents to an index. Very similar to add-words except add-bulk-words accepts more than one document.

    USAGE:
        skimp/add-bulk-words index-name data-set /defer

    ARGUMENTS:
        index-name -- name of the index (Type: file)
        data-set -- set of details to add (Type: block)

    REFINEMENTS:
        /defer -- Do not write to permanent storage

The data-set is a block with pairs of entries: document-name followed by words for that document. As with add-words the words can be string! or block!

example:

    skimp/add-bulk-words %my-index [
        "parrot" "I wish to make a complaint -:)"
        "smilies" ["-:)" "->:" "@/."]
    ]

The example adds the same two documents as the add-words example.

6.5. choosing between add-words and add-bulk-words
6.6. remove-document

Removes all the words in one document from an index. If the document is not in the index, then no action is taken.

    USAGE:
        skimp/remove-document index-name document-name /defer

    ARGUMENTS:
        index-name -- name of the index (Type: file)
        document-name -- name of the document (Type: string integer)

    REFINEMENTS:
        /defer -- Do not write to permanent storage

examples:

    skimp/remove-document %my-index "parrot"
    skimp/remove-document %my-index "einstein-1"

question: Can I remove just some of the words of a document from an index?
response: No, not in this version of skimp.

question: How do I delete an entire index?
response: See remove-index below.

6.7. remove-documents

Removes all the words in zero or more documents from an index. Document names which are not in the index are ignored.

    USAGE:
        skimp/remove-documents index-name document-list /defer

    ARGUMENTS:
        index-name -- name of the index (Type: file)
        document-list -- names of the documents (Type: block)

    REFINEMENTS:
        /defer -- Do not write to permanent storage

example:

    skimp/remove-documents %my-index ["parrot" "einstein-1"]

6.8. remove-document vs remove-documents

remove-document is a courtesy wrapper for remove-documents, so it does not matter if you use remove-documents with just one document name; the effect is the same.

If you have multiple documents to remove it is usually much faster to use remove-documents once rather than remove-document multiple times. That is because skimp needs to scan the entire index for each remove operation. If you are removing 50 documents, that is 50 scans using remove-document rather than one single scan using remove-documents.

7. searching for documents

To find the documents that contain specific words, use either:
You can also search more widely, using:
7.1. find-word

Returns a block of document-names that contain that word.

    USAGE:
        skimp/find-word index-name word

    ARGUMENTS:
        index-name -- name of the index (Type: file)
        word -- word to search for (Type: string)

examples:

    skimp/find-word %my-index "cat"
    skimp/find-word %my-index "~cat"
question: How can I know if the word I am looking for is well-formed? ie that it is formed to the same parsing rules as add-words?
response: Good question! One way is to use extract-words-from-string which is explained later on. It allows you access to the same processing as add-words. (If you do use extract-words-from-string, it returns a block so you need to use it in conjunction with find-words below).

7.2. find-words

Returns a block of document-names that contain all the words.

    USAGE:
        skimp/find-words index-name words

    ARGUMENTS:
        index-name -- name of the index (Type: file)
        words -- words to search for (Type: block)

examples:

    skimp/find-words %my-index ["cat" "sat" "mat"]
    skimp/find-words %my-index ["~cat" "sat" "mat"]
    skimp/find-words %my-index skimp/extract-words-from-string/for-search %my-index "~Cat, sat mat."
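One way to picture what find-words returns is as set operations on per-word document lists: a plain word keeps only documents that contain it, and a word with a leading tilde drops the documents that contain it. The sketch below is an illustration in plain REBOL, not skimp's actual implementation; match-all, all-docs and docs-with are hypothetical names, and the document lists are copied from the quick-start output.

```rebol
REBOL [Title: "Sketch: find-words semantics as set operations"]

;; every document name in the sample index
all-docs: [
    "einstein-1" "saying-1" "lyric-1" "medical-1"
    "saying-2" "cliche-1" "doggerel-1" "saying-3"
]

;; word followed by the documents that contain it (from the quick start)
docs-with: [
    "a"   ["einstein-1" "lyric-1"]
    "cat" ["saying-3" "doggerel-1" "medical-1" "lyric-1" "einstein-1"]
]

;; hypothetical helper: documents that satisfy every word in the block
match-all: func [words [block!] /local result docs] [
    result: copy all-docs
    foreach word words [
        either #"~" = first word [
            ;; not-prefix: remove documents containing the word
            docs: any [select docs-with copy next word copy []]
            result: exclude result docs
        ][
            ;; plain word: keep only documents containing it
            docs: any [select docs-with word copy []]
            result: intersect result docs
        ]
    ]
    result
]

probe sort match-all ["a" "cat"]     ;; == ["einstein-1" "lyric-1"]
probe sort match-all ["~a" "cat"]    ;; == ["doggerel-1" "medical-1" "saying-3"]
```

The first result holds the same two documents that the quick start shows for skimp/find-words index-name ["a" "cat"]; the sort is only there to make the printed order predictable.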
7.3. get-indexed-words

Returns a block of all the words in the index that begin with a specific character. Optionally, also returns a block of the matching document names.

    USAGE:
        skimp/get-indexed-words index-name character /document-list

    ARGUMENTS:
        index-name -- name of the index (Type: file)
        character -- first character of the words being sought (Type: char string)

    REFINEMENTS:
        /document-list -- Also return the matching document names

examples:

    skimp/get-indexed-words %my-index "n"
    skimp/get-indexed-words/document-list %my-index "n"

7.3.1. what first letters exist in the index?

get-index-information (see later) provides a list of all first characters that exist in the index. This code will display all the words in the index:

    index-info: skimp/get-index-information %my-index
    foreach first-char index-info/top-index [
        probe skimp/get-indexed-words %my-index first-char
    ]

7.4. get-indexed-words-for-document

Returns a block of all the words in a specific document that match a specific first character.

    USAGE:
        skimp/get-indexed-words-for-document index-name document-name character

    ARGUMENTS:
        index-name -- name of the index (Type: file)
        document-name -- name of document (Type: integer string)
        character -- first character of the words being sought (Type: char string)

examples:

    skimp/get-indexed-words-for-document %my-index "einstein-1" "b"
    skimp/get-indexed-words-for-document %my-index "saying-1" "m"

7.4.1. what documents are in the index?

Skipping ahead, get-index-information is one way to get a list of all document names.

8. other index management functions

These functions allow you to manage your skimp indexes efficiently and effectively:
8.1. index-exists?

Returns true or false depending on whether an index exists or not.

    USAGE:
        skimp/index-exists? index-name

    ARGUMENTS:
        index-name -- name of the index (Type: file)

example:

    if not skimp/index-exists? %my-indexes/index-1 [
        print "sorry -- not yet created"
    ]

8.2. get-index-information

Returns an object! containing information about the index, or none if the index does not exist.

    USAGE:
        skimp/get-index-information index-name

    ARGUMENTS:
        index-name -- name of the index (Type: file)

example:

    if object? ind-header: skimp/get-index-information %my-indexes/index-1 [
        probe ind-header
    ]

8.2.1. index header object

Has these fields:
8.3. remove-index

Deletes all files related to an index.

    USAGE:
        skimp/remove-index index-name

    ARGUMENTS:
        index-name -- name of the index (Type: file)

example:

    skimp/remove-index %my-old-index

8.4. set-owner-data

Allows you to store an object! in the index header. You could use this to (say) record an extended name for the index, or to add some application-specific notes about the index.

    USAGE:
        skimp/set-owner-data index-name owner-data /defer

    ARGUMENTS:
        index-name -- name of the index (Type: file)
        owner-data -- your data (Type: object)

    REFINEMENTS:
        /defer -- Do not write to permanent storage

example:

    skimp/set-owner-data %my-index make object! [
        last-updated: now
        index-name: "All my photo captions"
        support: "call 555-1234"
    ]

8.4.1. updating existing owner-data

Simply specify the fields you want to change....

    skimp/set-owner-data %my-index make object! [
        support: "call 555-9876"
    ]

....Other fields in the owner-data are left unchanged.

To get the full owner-data record, use get-index-information.

8.5. write-cache

Writes the data for a specific index to permanent storage. Use if you have been using the /defer refinement on update operations.

    USAGE:
        skimp/write-cache index-name /flush

    ARGUMENTS:
        index-name -- name of the index (Type: file)

    REFINEMENTS:
        /flush -- Purge the index from the cache after writing (further operations will cause it to be re-read)

examples:

    skimp/write-cache %my-index-1
    skimp/write-cache/flush %my-index-2

8.6. write-cache-all

Writes the data for all indexes to permanent storage. Use if you have been using the /defer refinement on update operations, and your clean-up routines are unsure of what indexes your software has been updating.

    USAGE:
        skimp/write-cache-all /flush

    ARGUMENTS:
        none

    REFINEMENTS:
        /flush -- Purge the indexes from the cache after writing (further operations will cause them to be re-read)

examples:

    skimp/write-cache-all
    skimp/write-cache-all/flush

8.7. flush-cache

Clears an index from the cache without first writing it to permanent storage.
If you have been updating an index using the /defer refinement, this is a way of purging all the changes since the last write-cache.

If you have an index open purely for reading (ie not one you are updating in the current task) this is a way of forcing the index to be reread from permanent storage. That may be of use if either:
    USAGE:
        skimp/flush-cache index-name

    ARGUMENTS:
        index-name -- name of the index (Type: file)

example:

    skimp/flush-cache %my-index-1

8.8. flush-cache-all

Clears all indexes from the cache.

    USAGE:
        skimp/flush-cache-all

    ARGUMENTS:
        none

example:

    skimp/flush-cache-all

9. what is a word?

skimp indexes the words in documents. So an obvious question is: how does it know what a word is? The simple answer is:
In principle, a word can be any string of one or more characters. You are not limited to letters, numerals or even printable characters.

Your definition of what a word is (including any function you have supplied to extract the words in a document) is held in the index itself. So the definition of a word remains stable and constant throughout the life of the index.

There are two functions for managing the definition of a word for an index:
9.1. set-word-definition

Defines what a word is for a given index.

    USAGE:
        skimp/set-word-definition index-name /parameters parm-obj /make-word-list mwlf /defer

    ARGUMENTS:
        index-name -- name of the index (Type: file)

    REFINEMENTS:
        /parameters -- extend/change definition for make-word-list function
            parm-obj (Type: object)
        /make-word-list -- replacement function
            mwlf (Type: function none)
        /defer -- Do not write to permanent storage

examples:

    skimp/set-word-definition/parameters %my-index make object! [
        initial-letters: charset compose [#"a" - #"z" #"A" - #"Z" (to-char 199) (to-char 231)]
    ]

    skimp/set-word-definition/make-word-list %my-index func [parms string] [return unique parse/all string " "]
9.1.1. changing the word definition for an existing index

You can do this, but it may not be wise. It will not affect any of the words already indexed, but it may affect your ability to search for them.

For example, you build an index over 1000 documents in which decimal numbers are not treated as words. Later you change the definition so they are treated as words, and add another 1000 documents. Only the latter 1000 documents will have their numbers indexed.

9.1.2. parameter object

The parameter object is passed to the make-word-list function. By default it is similar to the default object used by make-word-list. You can find skimp's settings for an index like this:

    probe get in skimp/get-index-information %my-index 'word-parameters

make-word-list function

By default, skimp looks for %make-word-list.r in the current folder:

- if found: skimp uses it as the default make-word-list function
- if not, it uses a fairly useless minimal function:

    func [
        parms
        string [string!]
    ][
        return unique sort parse/all trim/lines copy string " "
    ]

If you want skimp to use another function that you have written, you can specify it with set-word-definition/make-word-list.
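To see why the minimal fallback function is described as fairly useless, it may help to run it on its own. The sketch below copies the function body quoted above and gives it a name so it can be called outside skimp; the name minimal-make-word-list is ours, not part of skimp.

```rebol
REBOL [Title: "Sketch: the minimal fallback make-word-list function"]

;; same body as the minimal function quoted in the text;
;; the name minimal-make-word-list is hypothetical
minimal-make-word-list: func [
    parms              ;; parameter object (ignored by this minimal version)
    string [string!]
][
    ;; collapse whitespace, split on spaces, sort, drop duplicates
    return unique sort parse/all trim/lines copy string " "
]

probe minimal-make-word-list none "the cat   sat on the mat."
;; == ["cat" "mat." "on" "sat" "the"]
```

Note that the trailing full stop stays attached to "mat.", so a search for "mat" would not match a document indexed this way. Stripping punctuation is the kind of work a fuller make-word-list function has to do.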
example:

    skimp/set-word-definition/make-word-list %my-index func [
        obj [object!]
        str [string!]
        /for-search
    ][
        return unique parse/all str "."
    ]

The example is hardly the most useful word-extracting function in the world :-)

To test your function (or the built-in function) use extract-words-from-string (See later).

9.1.3. /for-search refinement

There are two subtly different reasons why you may want to extract the words from a string:
The two uses may, in your application, need to behave slightly differently, especially with regard to handling the not-prefix.

As an example, the built-in make-word-list function acts differently when given a string that contains tildes -- a leading tilde is preserved with the /for-search refinement:

    skimp/extract-words-from-string %my-index "I have some ~tildes in t~~~his ~string~"
    == ["have" "I" "in" "some" "string~" "tildes" "t~~~his"]

    skimp/extract-words-from-string/for-search %my-index "I have some ~tildes in t~~~his ~string~"
    == ["have" "I" "in" "some" "t~~~his" "~string~" "~tildes"]

9.2. extract-words-from-string

We've just about covered this above. Provides access to the make-word-list function in an index. Allows you to:
    USAGE:
        skimp/extract-words-from-string index-name string /for-search

    ARGUMENTS:
        index-name -- name of the index (Type: file)
        string -- string from which to extract the words (Type: string)

    REFINEMENTS:
        /for-search -- Says if words need to be extracted for searching or indexing

example:

    skimp/extract-words-from-string %my-index {[here_are (some words) "to" $index!}
    == ["are" "here" "index!" "some" "to" "words"]

10. set-config

10.1. basic index configuration

There are three magic settings you can apply when creating an index that may affect its performance. Once these are set, they cannot be changed for the lifetime of the index (See skimp-tools later for possible exceptions).

The settings do not matter much while you are evaluating or playing with skimp. But you may want to run benchmarks on large data sets before building any skimp indexes for critical work.

The three settings are:
The meaning of these settings is discussed below. To find the settings for an index, use get-index-information. The values are returned in the config object:

    index-info: skimp/get-index-information %my-index
    probe index-info/config

To set or change a value, use set-config (see below).

As you can only set these values on a new index (one with no words in it), you may need to surround any code that sets them with a test of whether the index exists:

    if not skimp/index-exists? %my-index [
        .... code to set config values
    ]

10.2. set-config

Sets or changes the value of a basic configuration variable.

    USAGE:
        skimp/set-config index-name config /defer

    ARGUMENTS:
        index-name -- name of the index (Type: file)
        config -- object containing your configuration settings (Type: object)

    REFINEMENTS:
        /defer -- Do not write to permanent storage

examples:

    probe skimp/set-config %my-index make object! [index-levels: 6]
    probe skimp/set-config %my-index make object! [index-levels: 3 one-file: true]

The response is the newly updated configuration object.

Note that you do not need to specify all the configuration values. The ones you omit will not be changed.

10.3. set-config/one-file

The one-file config setting sets whether the index is created as a single file or a series of files.
example:

    probe skimp/set-config %my-index make object! [one-file: true]

10.3.1. what's the point?

The main point is speed of retrieval. skimp was written to perform well when running queries on a webserver. A typical search on a web server consists of just a few words. With a multiple-file index, we need to load and initialise only the parts of the index that those words use, thus ensuring less i/o and cpu time.

The same advantage may apply in a desktop application: rather than having the whole index loaded at application start-up or on the first query, just the relevant parts need be loaded.

There are similar advantages when updating an index -- if a new document has words that begin with just [a b c] then only those three files will be written back.

skimp-tools (See later) will provide a way of flipping this setting after an index has been created.

10.4. set-config/index-levels

Changes the default number of levels in the index.

The skimp-tools documentation will contain some more detailed information on the internals of the skimp index. A brief explanation for now:

The index is a bit like (not identical to) a b-tree. If you have a three-level index, then the words REBOL and REBEL are indexed like this:

    "r" ===> "e" ===> "b" ===> ["el" [1 2 3] "ol" [1 4 58x100]]

That is: there are three levels of index (for the characters R, E and B). They point you to a list that contains the other letters of words that begin REB, plus an index of which documents contain them (REBEL is in docs 1, 2, and 3; REBOL is in docs 1, 4 and 58 through 157).

index-levels lets you set the number of levels of index:
example:

    probe skimp/set-config %my-index make object! [index-levels: 2]

10.4.1. what's the best setting?

You'll need to experiment with your live data. In practice, the best index-levels setting makes between 10% and 20% difference to file sizes and retrieval time, so the setting is unlikely to make a crucial difference. (Your experience may vary with indexes that are many megabytes in size).

10.5. set-config/integer-document-names

In all the examples so far, document names have been string!s, though there have been some hints that they can also be integers.

Internally, skimp uses document ids, which are always integers. skimp uses a document name to document id table to translate between them.

So, if it happens that your document names are integers that meet the requirements below, you can eliminate that conversion table. The requirements are:
examples:

    skimp/set-config %my-index make object! [integer-document-names: true]
    skimp/add-words %my-index 55 "this document name is an integer"

11. limitations

Just so you know:
12. Coming soon

We're hoping to release two extras in the next few weeks. Look out for them on REBOL.org:
12.1. skimp-my-altme.r

Is a demonstration application. It will consist of an API to index the posts in Altme worlds. Depending on time, there may be a cheapo GUI in front of it.

If you would like to help write a decent front end to the Altme world indexer, please contact me today.

12.2. skimp-tools.r

This will extend the skimp function to provide features a developer may need when writing applications that use skimp; or if extending the basics of skimp itself. They will include:
The accompanying document will include some developer's notes on the internals of a skimp index.