Mailing List Archive: 49091 messages
  • Home
  • Script library
  • AltME Archive
  • Mailing list
  • Articles Index
  • Site search
 

[REBOL] Essay on Associative Data Stores

From: joel::neely::fedex::com at: 16-Sep-2000 15:47

Short version: 1. A general-purpose Associative Data Store (ADS) is not directly implemented by append and select for a variety of reasons. 1.1. Keys and values are not distinguised. 1.2. Storage management/economy require extra code to handle replacement of values for existing keys. 1.3. The block! type doesn't appear to be supported as a key in a manner consistent with other data types. 2. The path notation does not support a general purpose ADS implementation for a variety of reasons. 2.1. It restricts the range of possible key types. 2.2. It does not support the addition of values, requiring distinct syntax for addition of new associations versus updating or retrieving existing associations. 2.2. The overloading of "/" in REBOL provides a potentially rich source of bugs (or at least confusion). Conclusion: Applications that need a general-purpose ADS will require additional code to provide that functionality. This note is already too long; I'll provide a (further-modified) possible implementation in a separate email later. Note: I'm not saying that blocks and the native operations on blocks are "bad", just that they aren't sufficient for the general case. -jn- Long-winded version: Greetings, fellow Rebolutionaries! There has been a good bit of discussion regarding the concept and possible REBOL implementation, of an "Associative Data Store" (which I'll abbreviate hereafter as ADS, aka "associative array", "associative list", "a-list", "lookup table", dictionary , etc.) From my perspective, there are some issues in this discussion that have become confusingly entangled; as I am a bear of small brain, only able to think of one thing at a time, I'd like to disentangle them. Let me begin with a definition that lets me address the different features I believe a completely general-purpose ADS would possess. An ADS is a mechanism for storing and retrieving associations between a collection of (distinct!) "keys" and their corresponding "value"s. It can be visualized as a table of two columns: each row holds an association between the key in the left-hand column and the value in the right-hand column. An association is added/updated by ensuring that the desired key is in a left-hand entry (inserting it if it is not already present) and that the corresponding right-hand entry contains the desired value (inserting or replacing, depending on whether the key was already present). An association may be retrieved for a given "target" (if the target occurs among the keys) by locating the target in the left-hand column and returning the value in the corresponding position of the right-hand column. If the target does not occur in the left-hand column, we might simply declare an error or we might return a default "not found" value -- such as none in REBOL. Note that I've said nothing about what types of data may serve as keys or values. An obvious consequence of this description is that ANY data type whatsoever can be used as a value, and any data type that supports simple (in)equality testing can be used as a key. [At this point, let me apologize if I have insulted anybody's intelligence. I certainly didn't intend to; I just wanted to make my assumptions about an ADS completely explicit.] Now let's look at some of the categories of suggestions that have been discussed on the list in response to the question, How can I implement an ADS in REBOL? 1. "Why not just use block/append/find/select?" We can see several reasons why block/append/find/select do not provide an implementation of the general-purpose ADS described above. (I'm not saying here that they can't be USED IN an implementation of ADS, just that they don't support all of the stated requirements without additional coding.) 1.1. Keys and values are not distinguished by select , which allows "false positive" results to be obtained. E.g.:
>> append blk ["a" "b"]
== ["a" "b"]
>> append blk ["e" "f"]
== ["a" "b" "e" "f"]
>> append blk ["i" "j"]
== ["a" "b" "e" "f" "i" "j"]
>> append blk ["o" "p"]
== ["a" "b" "e" "f" "i" "j" "o" "p"]
>> append blk ["u" "v"]
== ["a" "b" "e" "f" "i" "j" "o" "p" "u" "v"] In other words, blk is intended as an ADS associating each vowel with the following consonant. This sometimes works:
>> select blk "i"
== "j" and sometimes doesn't:
>> select blk "b"
== "e" because select doesn't understand which entries were intended as keys and which as values. In addition to the above false positive result, this problem can also mask the existance of a legitimate association, as in:
>> blk2: [14 79 12 86 53 92 79 104]
== [14 79 12 86 53 92 79 104]
>> select blk2 79
== 12 1.2. Simply appending a new pair on the end of a block is not sufficient for defining an association, as it doesn't deal with the possibility that the key already exists (either as a key or as a value...) In such cases, the newly-added pair will never be found, unless additional code is written to deal with this situation. Of course, this could be sidestepped by using insert to place the new pair at the head of the block. This would provide functionally correct behavior (if we ignore 1.1 for a moment) but would have the undesirable side effect of consuming progressively more memory as new entries were added to replace associations on previous keys. 1.3. It appears that find and select don't treat block! data consistenly with other data types when used as keys and search targets, even though equality is defined for block! data. Consider the function:
>> test-select: func [b [block!] k [any-type!] v [any-type!]] [
[ append b reduce [k v] [ select b k [ ]
>> blk: []
== [] A naive analysis would say, "We're going to put k and v into the block and then look up the value associated with k. Therefore, (except for the problem discussed under 1.1) we should always get back v." Let's try it:
>> test-select blk "a" "b"
== "b"
>> test-select blk 2 14
== 14
>> test-select blk true 1.2.3.4
== 1.2.3.4
>> test-select blk 1.2.3.5 "www.gorp.org"
== "www.gorp.org" So far, so good. Now...
>> test-select blk [2 3] 6
== none Hmmm. That didn't work. However it gets more puzzling (to me, at least) as we continue to poke around!
>> blk
== ["a" "b" 2 14 true 1.2.3.4 1.2.3.5 "www.gorp.org" [2 3] 6] The "[2 3]" entry seems to be present, which we can confirm by the aliasing bug:
>> select blk "www.gorp.org"
== [2 3] but it doesn't seem to be recognized by select as a key
>> select blk [2 3]
== none
>> select blk reduce [2 3]
== none
>> select blk reduce [[2 3]]
== 6 Now THAT was a surprise! I'll be glad to hear any explanations anyone can offer for that last case! Whatever the explantion may be, it appears that select is NOT using the normal meaning of equality, as demonstrated by:
>> pick blk 9
== [2 3]
>> [2 3] = pick blk 9
== true
>> select blk [2 3]
== none 2. "Why not use the built-in path notation?" The path notation is even less suitable for a general purpose ADS implementation for a variety of reasons. 2.1. It restricts the range of possible key types. Let's start small, with our vowels example:
>> blk: ["a" "b" "e" "f" "i" "j" "o" "p" "u" "v"]
== ["a" "b" "e" "f" "i" "j" "o" "p" "u" "v"] To use pathing as an access notation, we are FORCED to create additional words, instead of being able to access data via literal keys.
>> blk/a
** Script Error: Invalid path value: a. ** Where: blk/a
>> blk/:("a")
** Syntax Error: Invalid word-get -- :. ** Where: (line 1) blk/:("a") That didn't work because there isn't a word! value of a in the dictionary block, and because the colon prefix isn't an operator. Since our keys here are string! data, we have to create a word (variable) to hold the key, and use the associated get-word! as a path element, as in:
>> key: "a"
== "a"
>> blk/:key
== "b" However, if our dictionary starts to get more interesting, we get some more surprises!
>> blk: [
[ "a" "b" "e" "f" "i" "j" "o" "p" "u" "v" [ 2 1 3 2 5 3 7 8 11 55 [ true "yep" false "nope" [ 1.2.3.4 "castor.romefounders.org" [ 1.2.3.5 "pollux.romefounders.org" [ ] == [ "a" "b" "e" "f" "i" "j" "o" "p" "u" "v" 2 1 3 2 5 3 7 8 11 55 true "yep" false "nope" 1.2.3.4 "castor.romef... Now we have a more interesting variety of types for keys and associated values. Does the create-a-word-and-use-its-get-word trick still work?
>> key: 1.2.3.4
== 1.2.3.4
>> blk/:key
== "castor.romefounders.org" It would appear so, until we try
>> key: true
== true
>> blk/:key
== "a" Whoa! Did I set key to what I thought I did?
>> key
== true Yep. It appears there's another weirdism here! And it also bites the other logic! value when used as a key.
>> key: false
== false
>> blk/:key
== "b" Purely speculation on my part, but I'd guess this relates to what we'll cover in point 2.3 below. To be fair, I should point out that this example is really due to TWO issues. The logic! keys are giving false positives, and the dictionary doesn't actually contain any logic! keys. Consider this:
>> pick blk 21
== true
>> type? pick blk 21
== word!
>> true = pick blk 21
== false
>> 'true = pick blk 21
== true The twenty-first entry in blk (intended as the eleventh key) is actually a word! value, and not a logic! value, as blk was initialized with a literal (and therefore unevaluated) block. However, we can show that this is not what caused the retrieval failure very simply:
>> blk: reduce blk
== [ "a" "b" "e" "f" "i" "j" "o" "p" "u" "v" 2 1 3 2 5 3 7 8 11 55 true "yep" false "nope" 1.2.3.4 "castor.romefounde...
>> pick blk 21
== true
>> type? pick blk 21
== logic!
>> true = pick blk 21
== true
>> 'true = pick blk 21
== false
>> key: true
== true
>> blk/:key
== "a" So, even after the effort of using a get-word! path element, we have some problems. 2.2. It does not support the addition of values, requiring distinct syntax for addition of new associations versus updating or retrieving existing associations.
>> blk: [a "first" b "second" c "third"]
== [a "first" b "second" c "third"]
>> blk/a
== "first"
>> blk/a: "primo"
== [a "primo" b "second" c "third"] So far, so good. However, we can't add new values this way:
>> blk/d: "fourth"
** Script Error: Invalid path value: d. ** Where: blk/d: "fourth" and we can't use "calculated" keys for updating:
>> key: 'b
== b
>> blk/:key
== "second"
>> blk/:key: "alternate"
** Syntax Error: Invalid word -- :key:. ** Where: (line 1) blk/:key: "alternate" All of these are leading up to the last point on paths. 2.3. The overloading of "/" in REBOL provides a potentially rich source of bugs (or, at the very least, confusion). The "/" is used for several distinct concepts in REBOL: - Refinements - a combination boolean argument and switch that MAY signal the presence of other arguments. - Paths - SYMBOLIC selection from within a block, similar to (but more fragile than) using select with a word! key. - Indexing - POSITIONAL selection from within a block. The "/" notation breaks referential transparency (which is very odd in a language billed as "Expression Based".) For those not jargon-saturated, referential transparency is just a complicated way of saying that equal-valued expressions should be freely interchangeable. For example, all expressions evaluating to 2 can be substituted for each other, as in:
>> tangent 2
== 3.49207694917477E-2
>> a: 2
== 2
>> tangent a
== 3.49207694917477E-2
>> (1 + 1)
== 2
>> tangent (1 + 1)
== 3.49207694917477E-2
>> b: 1
== 1
>> tangent (3 * b - 1)
== 3.49207694917477E-2 This doesn't work with "/", as demonstrated by:
>> now
== 16-Sep-2000/15:13:33-5:00
>> now/date
== 16-Sep-2000
>> jetzt: now
== 16-Sep-2000/15:13:44-5:00
>> jetzt/date
== 16-Sep-2000 So far, so good. A word referring to a date! and a function that returns a date! start off looking similar. But:
>> jetzt/date/year
== 2000
>> jetzt/date/month
== 9
>> now/date/year
** Script Error: Too many refinements. ** Where: now/date/year
>> now/date/month
** Script Error: Too many refinements. ** Where: now/date/month The slashes after jetzt are all interpreted as structure accessors, but the slashes after now are all interpreted as refinements. There isn't a notation to say, "Take the value from this function call and access its components." without storing it somewhere first.
>> (now/date)/year
== /year
>> yuck: now/date
== 16-Sep-2000
>> yuck/year
== 2000 With respect to the last two uses of "/", it's not always obvious, even from reading your own source code, which one you'll get.
>> blk: [a "first" b "second" c "third" 1 "one" 2 "two" 3 "three"]
== [a "first" b "second" c "third" 1 "one" 2 "two" 3 "three"]
>> key: 'a
== a
>> blk/:key
== "first"
>> key: 1
== 1
>> blk/:key
== a
>> key: 2
== 2
>> blk/:key
== "first" If the value of key is a word, the slash retrieves a value by symbolic lookup, but if the value of key is numeric, the slash retrieves a value by position. Therefore, it cannot be used as a notational substitute for
>> select blk 1
== "one" Bear in mind that we might have something similar to key: some-complicated-function with-some-arguments which could require significantly more effort to determine the meaning of val: blk/:key in the general case. That's (more than) enough for now! -jn-