Essay on Associative Data Stores

[1/8] from: joel::neely::fedex::com at: 16-Sep-2000 15:47

Short version: 1. A general-purpose Associative Data Store (ADS) is not directly implemented by append and select for a variety of reasons. 1.1. Keys and values are not distinguised. 1.2. Storage management/economy require extra code to handle replacement of values for existing keys. 1.3. The block! type doesn't appear to be supported as a key in a manner consistent with other data types. 2. The path notation does not support a general purpose ADS implementation for a variety of reasons. 2.1. It restricts the range of possible key types. 2.2. It does not support the addition of values, requiring distinct syntax for addition of new associations versus updating or retrieving existing associations. 2.2. The overloading of "/" in REBOL provides a potentially rich source of bugs (or at least confusion). Conclusion: Applications that need a general-purpose ADS will require additional code to provide that functionality. This note is already too long; I'll provide a (further-modified) possible implementation in a separate email later. Note: I'm not saying that blocks and the native operations on blocks are "bad", just that they aren't sufficient for the general case. -jn- Long-winded version: Greetings, fellow Rebolutionaries! There has been a good bit of discussion regarding the concept and possible REBOL implementation, of an "Associative Data Store" (which I'll abbreviate hereafter as ADS, aka "associative array", "associative list", "a-list", "lookup table", dictionary , etc.) From my perspective, there are some issues in this discussion that have become confusingly entangled; as I am a bear of small brain, only able to think of one thing at a time, I'd like to disentangle them. Let me begin with a definition that lets me address the different features I believe a completely general-purpose ADS would possess. An ADS is a mechanism for storing and retrieving associations between a collection of (distinct!) "keys" and their corresponding "value"s. It can be visualized as a table of two columns: each row holds an association between the key in the left-hand column and the value in the right-hand column. An association is added/updated by ensuring that the desired key is in a left-hand entry (inserting it if it is not already present) and that the corresponding right-hand entry contains the desired value (inserting or replacing, depending on whether the key was already present). An association may be retrieved for a given "target" (if the target occurs among the keys) by locating the target in the left-hand column and returning the value in the corresponding position of the right-hand column. If the target does not occur in the left-hand column, we might simply declare an error or we might return a default "not found" value -- such as none in REBOL. Note that I've said nothing about what types of data may serve as keys or values. An obvious consequence of this description is that ANY data type whatsoever can be used as a value, and any data type that supports simple (in)equality testing can be used as a key. [At this point, let me apologize if I have insulted anybody's intelligence. I certainly didn't intend to; I just wanted to make my assumptions about an ADS completely explicit.] Now let's look at some of the categories of suggestions that have been discussed on the list in response to the question, How can I implement an ADS in REBOL? 1. "Why not just use block/append/find/select?" We can see several reasons why block/append/find/select do not provide an implementation of the general-purpose ADS described above. (I'm not saying here that they can't be USED IN an implementation of ADS, just that they don't support all of the stated requirements without additional coding.) 1.1. Keys and values are not distinguished by select , which allows "false positive" results to be obtained. E.g.:

>> append blk ["a" "b"]

== ["a" "b"]

>> append blk ["e" "f"]

== ["a" "b" "e" "f"]

>> append blk ["i" "j"]

== ["a" "b" "e" "f" "i" "j"]

>> append blk ["o" "p"]

== ["a" "b" "e" "f" "i" "j" "o" "p"]

>> append blk ["u" "v"]

== ["a" "b" "e" "f" "i" "j" "o" "p" "u" "v"] In other words, blk is intended as an ADS associating each vowel with the following consonant. This sometimes works:

>> select blk "i"

== "j" and sometimes doesn't:

>> select blk "b"

== "e" because select doesn't understand which entries were intended as keys and which as values. In addition to the above false positive result, this problem can also mask the existance of a legitimate association, as in:

>> blk2: [14 79 12 86 53 92 79 104]

== [14 79 12 86 53 92 79 104]

>> select blk2 79

== 12 1.2. Simply appending a new pair on the end of a block is not sufficient for defining an association, as it doesn't deal with the possibility that the key already exists (either as a key or as a value...) In such cases, the newly-added pair will never be found, unless additional code is written to deal with this situation. Of course, this could be sidestepped by using insert to place the new pair at the head of the block. This would provide functionally correct behavior (if we ignore 1.1 for a moment) but would have the undesirable side effect of consuming progressively more memory as new entries were added to replace associations on previous keys. 1.3. It appears that find and select don't treat block! data consistenly with other data types when used as keys and search targets, even though equality is defined for block! data. Consider the function:

>> test-select: func [b [block!] k [any-type!] v [any-type!]] [

[ append b reduce [k v] [ select b k [ ]

>> blk: []

== [] A naive analysis would say, "We're going to put k and v into the block and then look up the value associated with k. Therefore, (except for the problem discussed under 1.1) we should always get back v." Let's try it:

>> test-select blk "a" "b"

== "b"

>> test-select blk 2 14

== 14

>> test-select blk true 1.2.3.4

== 1.2.3.4

>> test-select blk 1.2.3.5 "www.gorp.org"

== "www.gorp.org" So far, so good. Now...

>> test-select blk [2 3] 6

== none Hmmm. That didn't work. However it gets more puzzling (to me, at least) as we continue to poke around!

>> blk

== ["a" "b" 2 14 true 1.2.3.4 1.2.3.5 "www.gorp.org" [2 3] 6] The "[2 3]" entry seems to be present, which we can confirm by the aliasing bug:

>> select blk "www.gorp.org"

== [2 3] but it doesn't seem to be recognized by select as a key

>> select blk [2 3]

== none

>> select blk reduce [2 3]

== none

>> select blk reduce [[2 3]]

== 6 Now THAT was a surprise! I'll be glad to hear any explanations anyone can offer for that last case! Whatever the explantion may be, it appears that select is NOT using the normal meaning of equality, as demonstrated by:

>> pick blk 9

== [2 3]

>> [2 3] = pick blk 9

== true

>> select blk [2 3]

== none 2. "Why not use the built-in path notation?" The path notation is even less suitable for a general purpose ADS implementation for a variety of reasons. 2.1. It restricts the range of possible key types. Let's start small, with our vowels example:

>> blk: ["a" "b" "e" "f" "i" "j" "o" "p" "u" "v"]

== ["a" "b" "e" "f" "i" "j" "o" "p" "u" "v"] To use pathing as an access notation, we are FORCED to create additional words, instead of being able to access data via literal keys.

>> blk/a

** Script Error: Invalid path value: a. ** Where: blk/a

>> blk/:("a")

** Syntax Error: Invalid word-get -- :. ** Where: (line 1) blk/:("a") That didn't work because there isn't a word! value of a in the dictionary block, and because the colon prefix isn't an operator. Since our keys here are string! data, we have to create a word (variable) to hold the key, and use the associated get-word! as a path element, as in:

>> key: "a"

== "a"

>> blk/:key

== "b" However, if our dictionary starts to get more interesting, we get some more surprises!

>> blk: [

[ "a" "b" "e" "f" "i" "j" "o" "p" "u" "v" [ 2 1 3 2 5 3 7 8 11 55 [ true "yep" false "nope" [ 1.2.3.4 "castor.romefounders.org" [ 1.2.3.5 "pollux.romefounders.org" [ ] == [ "a" "b" "e" "f" "i" "j" "o" "p" "u" "v" 2 1 3 2 5 3 7 8 11 55 true "yep" false "nope" 1.2.3.4 "castor.romef... Now we have a more interesting variety of types for keys and associated values. Does the create-a-word-and-use-its-get-word trick still work?

>> key: 1.2.3.4

== 1.2.3.4

>> blk/:key

== "castor.romefounders.org" It would appear so, until we try

>> key: true

== true

>> blk/:key

== "a" Whoa! Did I set key to what I thought I did?

>> key

== true Yep. It appears there's another weirdism here! And it also bites the other logic! value when used as a key.

>> key: false

== false

>> blk/:key

== "b" Purely speculation on my part, but I'd guess this relates to what we'll cover in point 2.3 below. To be fair, I should point out that this example is really due to TWO issues. The logic! keys are giving false positives, and the dictionary doesn't actually contain any logic! keys. Consider this:

>> pick blk 21

== true

>> type? pick blk 21

== word!

>> true = pick blk 21

== false

>> 'true = pick blk 21

== true The twenty-first entry in blk (intended as the eleventh key) is actually a word! value, and not a logic! value, as blk was initialized with a literal (and therefore unevaluated) block. However, we can show that this is not what caused the retrieval failure very simply:

>> blk: reduce blk

== [ "a" "b" "e" "f" "i" "j" "o" "p" "u" "v" 2 1 3 2 5 3 7 8 11 55 true "yep" false "nope" 1.2.3.4 "castor.romefounde...

>> pick blk 21

== true

>> type? pick blk 21

== logic!

>> true = pick blk 21

== true

>> 'true = pick blk 21

== false

>> key: true

== true

>> blk/:key

== "a" So, even after the effort of using a get-word! path element, we have some problems. 2.2. It does not support the addition of values, requiring distinct syntax for addition of new associations versus updating or retrieving existing associations.

>> blk: [a "first" b "second" c "third"]

== [a "first" b "second" c "third"]

>> blk/a

== "first"

>> blk/a: "primo"

== [a "primo" b "second" c "third"] So far, so good. However, we can't add new values this way:

>> blk/d: "fourth"

** Script Error: Invalid path value: d. ** Where: blk/d: "fourth" and we can't use "calculated" keys for updating:

>> key: 'b

== b

>> blk/:key

== "second"

>> blk/:key: "alternate"

** Syntax Error: Invalid word -- :key:. ** Where: (line 1) blk/:key: "alternate" All of these are leading up to the last point on paths. 2.3. The overloading of "/" in REBOL provides a potentially rich source of bugs (or, at the very least, confusion). The "/" is used for several distinct concepts in REBOL: - Refinements - a combination boolean argument and switch that MAY signal the presence of other arguments. - Paths - SYMBOLIC selection from within a block, similar to (but more fragile than) using select with a word! key. - Indexing - POSITIONAL selection from within a block. The "/" notation breaks referential transparency (which is very odd in a language billed as "Expression Based".) For those not jargon-saturated, referential transparency is just a complicated way of saying that equal-valued expressions should be freely interchangeable. For example, all expressions evaluating to 2 can be substituted for each other, as in:

>> tangent 2

== 3.49207694917477E-2

>> a: 2

== 2

>> tangent a

== 3.49207694917477E-2

>> (1 + 1)

== 2

>> tangent (1 + 1)

== 3.49207694917477E-2

>> b: 1

== 1

>> tangent (3 * b - 1)

== 3.49207694917477E-2 This doesn't work with "/", as demonstrated by:

>> now

== 16-Sep-2000/15:13:33-5:00

>> now/date

== 16-Sep-2000

>> jetzt: now

== 16-Sep-2000/15:13:44-5:00

>> jetzt/date

== 16-Sep-2000 So far, so good. A word referring to a date! and a function that returns a date! start off looking similar. But:

>> jetzt/date/year

== 2000

>> jetzt/date/month

== 9

>> now/date/year

** Script Error: Too many refinements. ** Where: now/date/year

>> now/date/month

** Script Error: Too many refinements. ** Where: now/date/month The slashes after jetzt are all interpreted as structure accessors, but the slashes after now are all interpreted as refinements. There isn't a notation to say, "Take the value from this function call and access its components." without storing it somewhere first.

>> (now/date)/year

== /year

>> yuck: now/date

== 16-Sep-2000

>> yuck/year

== 2000 With respect to the last two uses of "/", it's not always obvious, even from reading your own source code, which one you'll get.

>> blk: [a "first" b "second" c "third" 1 "one" 2 "two" 3 "three"]

== [a "first" b "second" c "third" 1 "one" 2 "two" 3 "three"]

>> key: 'a

== a

>> blk/:key

== "first"

>> key: 1

== 1

>> blk/:key

== a

>> key: 2

== 2

>> blk/:key

== "first" If the value of key is a word, the slash retrieves a value by symbolic lookup, but if the value of key is numeric, the slash retrieves a value by position. Therefore, it cannot be used as a notational substitute for

>> select blk 1

== "one" Bear in mind that we might have something similar to key: some-complicated-function with-some-arguments which could require significantly more effort to determine the meaning of val: blk/:key in the general case. That's (more than) enough for now! -jn-

[2/8] from: lmecir:geocities at: 18-Sep-2000 8:57

Hi Joel, two notes: 1) I think that some posts on ADS missed one very important point - the speed. The problem is, that Find (sequential search) may be a bottleneck for some apps. 2) Select may be less general than Select/only sometimes. Regards Ladislav

[3/8] from: ole_f:post3:tele:dk at: 18-Sep-2000 12:45

If you read this, you are using an old mail reader. The solution if you have difficulty reading the rest of this mail is getting hold of newer, MIME-compliant software. --X=Boundary=X Content-Type: text/plain; charset=iso-8859-1 Content-Transfer-Encoding: quoted-printable X-Wordwrap: 78 Hi Joel, 16-Sep-2000 you wrote:

>1.1. Keys and values are not distinguised. >1.2. Storage management/economy require extra code to > handle replacement of values for existing keys. >1.3. The block! type doesn't appear to be supported as a key > in a manner consistent with other data types.

I've attached some functions for manipulating with an associative data structure. I personally think they suffice for my normal uses, but I'm looking forward to your comments (though, please keep it short then, I don't have the time to read through all that ASCII ;-) ). BTW, using a hash! for the "keys" array would probably be a better idea than just using a block!, as I currently do. _If_ find/only works faster on hash! lists, that is (it should be, but I don't know). Kind regards, -- Ole Friis <[ole_f--post3--tele--dk]> Amiga is a trademark of Amiga Inc. --X=Boundary=X Content-Type: text/plain; charset=US-ASCII; name="ADS.r" Content-Transfer-Encoding: 7bit REBOL [] create-ads: func [ "Creates a new associative data structure" ][ make object! [ keys: copy [] values: copy [] ] ] lookup-ads: func [ "Returns the value associated with the key" ads [object!] "The associative data structure" key [any-type!] "Key" /local t-list ][ if found? t-list: find/only ads/keys key [ pick ads/values index? t-list ] ] associate-ads: func [ "Appends the new key/value pair in the data structure" ads [object!] "The associative data structure" key [any-type!] "Key" value [any-type!] "Value" /local t-list ][ either found? t-list: find/only ads/keys key [ change/only at ads/values index? t-list value ][ append/only ads/keys key append/only ads/values value ] ] --X=Boundary=X--

[4/8] from: g:santilli:tiscalinet:it at: 18-Sep-2000 19:21

Hello [joel--neely--fedex--com]! On 16-Set-00, you wrote: j>>> test-select: func [b [block!] k [any-type!] v j> [any-type!]] [ j> [ append b reduce [k v] j> [ select b k j> [ ] j>>> blk: [] j> == [] Try this:

>> test-select: func [b k v] [

[ append b reduce [k v] [ select/only b k [ ]

>> blk: []

== []

>> test-select blk "a" "b"

== "b"

>> test-select blk [2 3] 6

== 6 j>>> select blk reduce [[2 3]] j> == 6 j> Now THAT was a surprise! I'll be glad to hear any explanations j> anyone can offer for that last case! Whatever the explantion REBOL functions tend to treat series as a sequence of values, not as a single value. They have the /ONLY refinement to treat series as a single value. You SELECT was serching for the value 2 followed by the value 3, not for the value [2 3]. Also notice that:

>> blk: [1 2 "one-two" 1 3 "one-three" 2 3 "two-three"]

== [1 2 "one-two" 1 3 "one-three" 2 3 "two-three"]

>> select blk [1 2]

== "one-two"

>> select blk [1 3]

== "one-three"

>> select blk [2 3]

== "two-three" Regards, Gabriele. -- Gabriele Santilli <[giesse--writeme--com]> - Amigan - REBOL programmer Amiga Group Italia sez. L'Aquila -- http://www.amyresource.it/AGI/

[5/8] from: joel:neely:fedex at: 19-Sep-2000 7:44

Hi, Ole! [ole_f--post3--tele--dk] wrote:

[snip]

> I've attached some functions for manipulating with an associative > data structure. I personally think they suffice for my normal uses, > but I'm looking forward to your comments (though, please keep it > short then, I don't have the time to read through all that > ASCII ;-) ). >

(Not only do you not have time to read it, I really don't have time to be typing it! ;-) I'll wait to respond with specifics until I've had a chance to look them over adequately. I'm curious about the motivations (and the performance) of your compromise solution -- part object, part function library. Would I be correct in guessing that you chose to make associate-ads and lookup-ads standalone functions instead of object methods to avoid the duplication which REBOL (currently -- hint, hint, hint) imposes on objects?

> BTW, using a hash! for the "keys" array would probably be a better > idea than just using a block!, as I currently do. _If_ find/only > works faster on hash! lists, that is (it should be, but I don't > know). >

Disclaimer: I've not done the empirical testing of this hypothesis for the REBOL implementation of hash! , so PLEASE take this as speculation. My general experience is that hashing vs. array/list lookup is subject to tradeoffs. There is overhead (both space and time) with hashing, which makes it perform more poorly than simpler but straightforward lookup schemes for small cases. As the size of the data collection grows, however, that overhead amortizes nicely. Ultimately, for "large enough" collections of data, the hash table beats array/list algorithms -- the best way to determine how large is large enough, of course, is to run some benchmarks, which I'd like to do some time when my higher priorities have been handled (and there are LOTS of them! ;-) -jn-

[6/8] from: larry:ecotope at: 19-Sep-2000 9:51

Hi Joel, Ole What Joel writes about tradeoffs is true, but the most important thing to know is that the REBOL hash! datatype only hashes strings (a little missing documentation). For other datatypes, you just get a more expensive block with no speed benefits. Jim has said that he will extend it to hash numbers as well, but no schedule on that. Cheers -Larry PS Joel, I have really enjoyed your recent essays and benchmarks. ---------excerpt from original message

[7/8] from: ole_f:post3:tele:dk at: 20-Sep-2000 14:30

Hi Larry, 19-Sep-2000 you wrote:

>What Joel writes about tradeoffs is true, but the most important thing to >know is that the REBOL hash! datatype only hashes strings (a little missing >documentation). For other datatypes, you just get a more expensive block >with no speed benefits. Jim has said that he will extend it to hash numbers >as well, but no schedule on that.

OK, thanks for the info. That scheme sounds reasonable to me, given that the only thing I've used hash tables for in the past has been strings. I don't immediately see any reason for having fast lookup on more advanced types. (But probably others immediately see the reason, no doubt ;-) ) Kind regards, -- Ole Friis <[ole_f--post3--tele--dk]> Amiga is a trademark of Amiga Inc.

[8/8] from: ole_f:post3:tele:dk at: 20-Sep-2000 14:32

Hi Joel, 19-Sep-2000 you wrote: [...]

>I'll wait to respond with specifics until I've had a chance to look >them over adequately. I'm curious about the motivations (and

<<quoted lines omitted: 4>>

>avoid the duplication which REBOL (currently -- hint, hint, hint) >imposes on objects?

You can get rid of the duplication by doing a super-object containing the functions, then cloning this object for all your "real objects". So no, that wasn't the case. The reason was laziness ;-) 1: I needed to encapsulate the two different lists into something, and I chose an object. Could've used a list with the two lists as the only elements, but I just didn't. 2: When I tested my code, I just wanted to be able to 'do the source file and then directly use the functions on my existing object. If I had put the functions inside the object, then i would have to create a new object in order to use the new functions. This is the laziness part. So, there wasn't any clever reason. And anyway, I just wrote it to show how to do the thing relatively elegant by separating the keys and the values into two separate lists, instead of the approach everybody else uses by cramming everything into a single list. Anyway, it can be done a lot nicer than I have done. Kind regards, -- Ole Friis <[ole_f--post3--tele--dk]> Amiga is a trademark of Amiga Inc.

Notes

Quoted lines have been omitted from some messages.
View the message alone to see the lines that have been omitted