Essay on Associative Data Stores
[1/8] from: joel::neely::fedex::com at: 16-Sep-2000 15:47
Short version:
1. A general-purpose Associative Data Store (ADS) is not
directly implemented by append and select for a
variety of reasons.
1.1. Keys and values are not distinguised.
1.2. Storage management/economy require extra code to
handle replacement of values for existing keys.
1.3. The block! type doesn't appear to be supported as a key
in a manner consistent with other data types.
2. The path notation does not support a general purpose ADS
implementation for a variety of reasons.
2.1. It restricts the range of possible key types.
2.2. It does not support the addition of values, requiring
distinct syntax for addition of new associations versus
updating or retrieving existing associations.
2.2. The overloading of "/" in REBOL provides a potentially
rich source of bugs (or at least confusion).
Conclusion:
Applications that need a general-purpose ADS will require
additional code to provide that functionality. This note is
already too long; I'll provide a (further-modified) possible
implementation in a separate email later.
Note:
I'm not saying that blocks and the native operations on blocks
are "bad", just that they aren't sufficient for the general
case.
-jn-
Long-winded version:
Greetings, fellow Rebolutionaries!
There has been a good bit of discussion regarding the concept
and possible REBOL implementation, of an "Associative Data
Store" (which I'll abbreviate hereafter as ADS, aka "associative
array", "associative list", "a-list", "lookup table",
dictionary
, etc.) From my perspective, there are some issues
in this discussion that have become confusingly entangled; as
I am a bear of small brain, only able to think of one thing at
a time, I'd like to disentangle them.
Let me begin with a definition that lets me address the
different features I believe a completely general-purpose ADS
would possess.
An ADS is a mechanism for storing and retrieving associations
between a collection of (distinct!) "keys" and their
corresponding "value"s. It can be visualized as a table of
two columns: each row holds an association between the key in
the left-hand column and the value in the right-hand column.
An association is added/updated by ensuring that the desired
key is in a left-hand entry (inserting it if it is not already
present) and that the corresponding right-hand entry contains
the desired value (inserting or replacing, depending on whether
the key was already present).
An association may be retrieved for a given "target" (if the
target occurs among the keys) by locating the target in the
left-hand column and returning the value in the corresponding
position of the right-hand column. If the target does not
occur in the left-hand column, we might simply declare an error
or we might return a default "not found" value -- such as none
in REBOL.
Note that I've said nothing about what types of data may serve
as keys or values. An obvious consequence of this description
is that ANY data type whatsoever can be used as a value, and any
data type that supports simple (in)equality testing can be used
as a key.
[At this point, let me apologize if I have insulted anybody's
intelligence. I certainly didn't intend to; I just wanted to
make my assumptions about an ADS completely explicit.]
Now let's look at some of the categories of suggestions that
have been discussed on the list in response to the question,
How can I implement an ADS in REBOL?
1. "Why not just use block/append/find/select?"
We can see several reasons why block/append/find/select do not
provide an implementation of the general-purpose ADS described
above. (I'm not saying here that they can't be USED IN an
implementation of ADS, just that they don't support all of the
stated requirements without additional coding.)
1.1. Keys and values are not distinguished by select , which
allows "false positive" results to be obtained. E.g.:
>> append blk ["a" "b"]
== ["a" "b"]
>> append blk ["e" "f"]
== ["a" "b" "e" "f"]
>> append blk ["i" "j"]
== ["a" "b" "e" "f" "i" "j"]
>> append blk ["o" "p"]
== ["a" "b" "e" "f" "i" "j" "o" "p"]
>> append blk ["u" "v"]
== ["a" "b" "e" "f" "i" "j" "o" "p" "u" "v"]
In other words, blk is intended as an ADS associating each
vowel with the following consonant. This sometimes works:
>> select blk "i"
== "j"
and sometimes doesn't:
>> select blk "b"
== "e"
because select doesn't understand which entries were intended
as keys and which as values. In addition to the above false
positive result, this problem can also mask the existance of
a legitimate association, as in:
>> blk2: [14 79 12 86 53 92 79 104]
== [14 79 12 86 53 92 79 104]
>> select blk2 79
== 12
1.2. Simply appending a new pair on the end of a block is not
sufficient for defining an association, as it doesn't
deal with the possibility that the key already exists (either
as a key or as a value...) In such cases, the newly-added
pair will never be found, unless additional code is written to
deal with this situation. Of course, this could be sidestepped
by using insert to place the new pair at the head of the
block. This would provide functionally correct behavior (if we
ignore 1.1 for a moment) but would have the undesirable side
effect of consuming progressively more memory as new entries
were added to replace associations on previous keys.
1.3. It appears that find and select don't treat block!
data consistenly with other data types when used as keys
and search targets, even though equality is defined for block!
data. Consider the function:
>> test-select: func [b [block!] k [any-type!] v [any-type!]] [
[ append b reduce [k v]
[ select b k
[ ]
>> blk: []
== []
A naive analysis would say, "We're going to put k and v into the
block and then look up the value associated with k. Therefore,
(except for the problem discussed under 1.1) we should always
get back v." Let's try it:
>> test-select blk "a" "b"
== "b"
>> test-select blk 2 14
== 14
>> test-select blk true 1.2.3.4
== 1.2.3.4
>> test-select blk 1.2.3.5 "www.gorp.org"
== "www.gorp.org"
So far, so good. Now...
>> test-select blk [2 3] 6
== none
Hmmm. That didn't work. However it gets more puzzling (to me,
at least) as we continue to poke around!
>> blk
== ["a" "b" 2 14 true 1.2.3.4 1.2.3.5 "www.gorp.org" [2 3] 6]
The "[2 3]" entry seems to be present, which we can confirm by
the aliasing bug:
>> select blk "www.gorp.org"
== [2 3]
but it doesn't seem to be recognized by select as a key
>> select blk [2 3]
== none
>> select blk reduce [2 3]
== none
>> select blk reduce [[2 3]]
== 6
Now THAT was a surprise! I'll be glad to hear any explanations
anyone can offer for that last case! Whatever the explantion
may be, it appears that select is NOT using the normal
meaning of equality, as demonstrated by:
>> pick blk 9
== [2 3]
>> [2 3] = pick blk 9
== true
>> select blk [2 3]
== none
2. "Why not use the built-in path notation?"
The path notation is even less suitable for a general purpose
ADS implementation for a variety of reasons.
2.1. It restricts the range of possible key types.
Let's start small, with our vowels example:
>> blk: ["a" "b" "e" "f" "i" "j" "o" "p" "u" "v"]
== ["a" "b" "e" "f" "i" "j" "o" "p" "u" "v"]
To use pathing as an access notation, we are FORCED to create
additional words, instead of being able to access data via
literal keys.
>> blk/a
** Script Error: Invalid path value: a.
** Where: blk/a
>> blk/:("a")
** Syntax Error: Invalid word-get -- :.
** Where: (line 1) blk/:("a")
That didn't work because there isn't a word! value of a
in the dictionary block, and because the colon prefix isn't an
operator. Since our keys here are string! data, we have to
create a word (variable) to hold the key, and use the associated
get-word! as a path element, as in:
>> key: "a"
== "a"
>> blk/:key
== "b"
However, if our dictionary starts to get more interesting, we
get some more surprises!
>> blk: [
[ "a" "b" "e" "f" "i" "j" "o" "p" "u" "v"
[ 2 1 3 2 5 3 7 8 11 55
[ true "yep" false "nope"
[ 1.2.3.4 "castor.romefounders.org"
[ 1.2.3.5 "pollux.romefounders.org"
[ ]
== [
"a" "b" "e" "f" "i" "j" "o" "p" "u" "v"
2 1 3 2 5 3 7 8 11 55
true "yep" false "nope"
1.2.3.4 "castor.romef...
Now we have a more interesting variety of types for keys and
associated values. Does the create-a-word-and-use-its-get-word
trick still work?
>> key: 1.2.3.4
== 1.2.3.4
>> blk/:key
== "castor.romefounders.org"
It would appear so, until we try
>> key: true
== true
>> blk/:key
== "a"
Whoa! Did I set key to what I thought I did?
>> key
== true
Yep. It appears there's another weirdism here! And it also
bites the other logic! value when used as a key.
>> key: false
== false
>> blk/:key
== "b"
Purely speculation on my part, but I'd guess this relates to
what we'll cover in point 2.3 below.
To be fair, I should point out that this example is really due
to TWO issues. The logic! keys are giving false positives,
and the dictionary doesn't actually contain any logic! keys.
Consider this:
>> pick blk 21
== true
>> type? pick blk 21
== word!
>> true = pick blk 21
== false
>> 'true = pick blk 21
== true
The twenty-first entry in blk (intended as the eleventh key) is
actually a word! value, and not a logic! value, as blk was
initialized with a literal (and therefore unevaluated) block.
However, we can show that this is not what caused the retrieval
failure very simply:
>> blk: reduce blk
== [
"a" "b" "e" "f" "i" "j" "o" "p" "u" "v"
2 1 3 2 5 3 7 8 11 55 true "yep" false "nope"
1.2.3.4 "castor.romefounde...
>> pick blk 21
== true
>> type? pick blk 21
== logic!
>> true = pick blk 21
== true
>> 'true = pick blk 21
== false
>> key: true
== true
>> blk/:key
== "a"
So, even after the effort of using a get-word! path element,
we have some problems.
2.2. It does not support the addition of values, requiring
distinct syntax for addition of new associations versus
updating or retrieving existing associations.
>> blk: [a "first" b "second" c "third"]
== [a "first" b "second" c "third"]
>> blk/a
== "first"
>> blk/a: "primo"
== [a "primo" b "second" c "third"]
So far, so good. However, we can't add new values this way:
>> blk/d: "fourth"
** Script Error: Invalid path value: d.
** Where: blk/d: "fourth"
and we can't use "calculated" keys for updating:
>> key: 'b
== b
>> blk/:key
== "second"
>> blk/:key: "alternate"
** Syntax Error: Invalid word -- :key:.
** Where: (line 1) blk/:key: "alternate"
All of these are leading up to the last point on paths.
2.3. The overloading of "/" in REBOL provides a potentially
rich source of bugs (or, at the very least, confusion).
The "/" is used for several distinct concepts in REBOL:
- Refinements - a combination boolean argument and switch
that MAY signal the presence of other arguments.
- Paths - SYMBOLIC selection from within a block, similar to
(but more fragile than) using select with a word! key.
- Indexing - POSITIONAL selection from within a block.
The "/" notation breaks referential transparency (which is
very odd in a language billed as "Expression Based".) For
those not jargon-saturated, referential transparency is just
a complicated way of saying that equal-valued expressions
should be freely interchangeable. For example, all expressions
evaluating to 2 can be substituted for each other, as in:
>> tangent 2
== 3.49207694917477E-2
>> a: 2
== 2
>> tangent a
== 3.49207694917477E-2
>> (1 + 1)
== 2
>> tangent (1 + 1)
== 3.49207694917477E-2
>> b: 1
== 1
>> tangent (3 * b - 1)
== 3.49207694917477E-2
This doesn't work with "/", as demonstrated by:
>> now
== 16-Sep-2000/15:13:33-5:00
>> now/date
== 16-Sep-2000
>> jetzt: now
== 16-Sep-2000/15:13:44-5:00
>> jetzt/date
== 16-Sep-2000
So far, so good. A word referring to a date! and a function
that returns a date! start off looking similar. But:
>> jetzt/date/year
== 2000
>> jetzt/date/month
== 9
>> now/date/year
** Script Error: Too many refinements.
** Where: now/date/year
>> now/date/month
** Script Error: Too many refinements.
** Where: now/date/month
The slashes after jetzt are all interpreted as structure
accessors, but the slashes after now are all interpreted as
refinements. There isn't a notation to say, "Take the value
from this function call and access its components." without
storing it somewhere first.
>> (now/date)/year
== /year
>> yuck: now/date
== 16-Sep-2000
>> yuck/year
== 2000
With respect to the last two uses of "/", it's not always
obvious, even from reading your own source code, which one
you'll get.
>> blk: [a "first" b "second" c "third" 1 "one" 2 "two" 3 "three"]
== [a "first" b "second" c "third" 1 "one" 2 "two" 3 "three"]
>> key: 'a
== a
>> blk/:key
== "first"
>> key: 1
== 1
>> blk/:key
== a
>> key: 2
== 2
>> blk/:key
== "first"
If the value of key is a word, the slash retrieves a value
by symbolic lookup, but if the value of key is numeric, the
slash retrieves a value by position. Therefore, it cannot be
used as a notational substitute for
>> select blk 1
== "one"
Bear in mind that we might have something similar to
key: some-complicated-function with-some-arguments
which could require significantly more effort to determine
the meaning of
val: blk/:key
in the general case.
That's (more than) enough for now!
-jn-
[2/8] from: lmecir:geocities at: 18-Sep-2000 8:57
Hi Joel,
two notes:
1) I think that some posts on ADS missed one very important point - the
speed. The problem is, that Find (sequential search) may be a bottleneck for
some apps.
2) Select may be less general than Select/only sometimes.
Regards
Ladislav
[3/8] from: ole_f:post3:tele:dk at: 18-Sep-2000 12:45
If you read this, you are using an old mail reader. The solution
if you have difficulty reading the rest of this mail is getting
hold of newer, MIME-compliant software.
--X=Boundary=X
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: quoted-printable
X-Wordwrap: 78
Hi Joel, 16-Sep-2000 you wrote:
>1.1. Keys and values are not distinguised.
>1.2. Storage management/economy require extra code to
> handle replacement of values for existing keys.
>1.3. The block! type doesn't appear to be supported as a key
> in a manner consistent with other data types.
I've attached some functions for manipulating with an associative data
structure. I personally think they suffice for my normal uses, but I'm
looking forward to your comments (though, please keep it short then, I don't
have the time to read through all that ASCII ;-) ).
BTW, using a hash! for the "keys" array would probably be a better idea than
just using a block!, as I currently do. _If_ find/only works faster on hash!
lists, that is (it should be, but I don't know).
Kind regards,
--
Ole Friis <[ole_f--post3--tele--dk]>
Amiga is a trademark of Amiga Inc.
--X=Boundary=X
Content-Type: text/plain; charset=US-ASCII; name="ADS.r"
Content-Transfer-Encoding: 7bit
REBOL []
create-ads: func [
"Creates a new associative data structure"
][
make object! [
keys: copy []
values: copy []
]
]
lookup-ads: func [
"Returns the value associated with the key"
ads [object!] "The associative data structure"
key [any-type!] "Key"
/local t-list
][
if found? t-list: find/only ads/keys key [
pick ads/values index? t-list
]
]
associate-ads: func [
"Appends the new key/value pair in the data structure"
ads [object!] "The associative data structure"
key [any-type!] "Key"
value [any-type!] "Value"
/local t-list
][
either found? t-list: find/only ads/keys key [
change/only at ads/values index? t-list value
][
append/only ads/keys key
append/only ads/values value
]
]
--X=Boundary=X--
[4/8] from: g:santilli:tiscalinet:it at: 18-Sep-2000 19:21
Hello [joel--neely--fedex--com]!
On 16-Set-00, you wrote:
j>>> test-select: func [b [block!] k [any-type!] v
j> [any-type!]] [
j> [ append b reduce [k v]
j> [ select b k
j> [ ]
j>>> blk: []
j> == []
Try this:
>> test-select: func [b k v] [
[ append b reduce [k v]
[ select/only b k
[ ]
>> blk: []
== []
>> test-select blk "a" "b"
== "b"
>> test-select blk [2 3] 6
== 6
j>>> select blk reduce [[2 3]]
j> == 6
j> Now THAT was a surprise! I'll be glad to hear any explanations
j> anyone can offer for that last case! Whatever the explantion
REBOL functions tend to treat series as a sequence of values, not
as a single value. They have the /ONLY refinement to treat series
as a single value. You SELECT was serching for the value 2
followed by the value 3, not for the value [2 3].
Also notice that:
>> blk: [1 2 "one-two" 1 3 "one-three" 2 3 "two-three"]
== [1 2 "one-two" 1 3 "one-three" 2 3 "two-three"]
>> select blk [1 2]
== "one-two"
>> select blk [1 3]
== "one-three"
>> select blk [2 3]
== "two-three"
Regards,
Gabriele.
--
Gabriele Santilli <[giesse--writeme--com]> - Amigan - REBOL programmer
Amiga Group Italia sez. L'Aquila -- http://www.amyresource.it/AGI/
[5/8] from: joel:neely:fedex at: 19-Sep-2000 7:44
Hi, Ole!
[ole_f--post3--tele--dk] wrote:
[snip]
> I've attached some functions for manipulating with an associative
> data structure. I personally think they suffice for my normal uses,
> but I'm looking forward to your comments (though, please keep it
> short then, I don't have the time to read through all that
> ASCII ;-) ).
>
(Not only do you not have time to read it, I really don't have time
to be typing it! ;-)
I'll wait to respond with specifics until I've had a chance to look
them over adequately. I'm curious about the motivations (and
the performance) of your compromise solution -- part object, part
function library.
Would I be correct in guessing that you chose to make associate-ads
and lookup-ads standalone functions instead of object methods to
avoid the duplication which REBOL (currently -- hint, hint, hint)
imposes on objects?
> BTW, using a hash! for the "keys" array would probably be a better
> idea than just using a block!, as I currently do. _If_ find/only
> works faster on hash! lists, that is (it should be, but I don't
> know).
>
Disclaimer: I've not done the empirical testing of this hypothesis
for the REBOL implementation of hash! , so PLEASE take this as
speculation.
My general experience is that hashing vs. array/list lookup is
subject to tradeoffs. There is overhead (both space and time)
with hashing, which makes it perform more poorly than simpler
but straightforward lookup schemes for small cases. As the size
of the data collection grows, however, that overhead amortizes
nicely. Ultimately, for "large enough" collections of data,
the hash table beats array/list algorithms -- the best way to
determine how large is large enough, of course, is to run some
benchmarks, which I'd like to do some time when my higher
priorities have been handled (and there are LOTS of them! ;-)
-jn-
[6/8] from: larry:ecotope at: 19-Sep-2000 9:51
Hi Joel, Ole
What Joel writes about tradeoffs is true, but the most important thing to
know is that the REBOL hash! datatype only hashes strings (a little missing
documentation). For other datatypes, you just get a more expensive block
with no speed benefits. Jim has said that he will extend it to hash numbers
as well, but no schedule on that.
Cheers
-Larry
PS Joel, I have really enjoyed your recent essays and benchmarks.
---------excerpt from original message
[7/8] from: ole_f:post3:tele:dk at: 20-Sep-2000 14:30
Hi Larry, 19-Sep-2000 you wrote:
>What Joel writes about tradeoffs is true, but the most important thing to
>know is that the REBOL hash! datatype only hashes strings (a little missing
>documentation). For other datatypes, you just get a more expensive block
>with no speed benefits. Jim has said that he will extend it to hash numbers
>as well, but no schedule on that.
OK, thanks for the info. That scheme sounds reasonable to me, given that the
only thing I've used hash tables for in the past has been strings. I don't
immediately see any reason for having fast lookup on more advanced types.
(But probably others immediately see the reason, no doubt ;-) )
Kind regards,
--
Ole Friis <[ole_f--post3--tele--dk]>
Amiga is a trademark of Amiga Inc.
[8/8] from: ole_f:post3:tele:dk at: 20-Sep-2000 14:32
Hi Joel, 19-Sep-2000 you wrote:
[...]
>I'll wait to respond with specifics until I've had a chance to look
>them over adequately. I'm curious about the motivations (and
<<quoted lines omitted: 4>>
>avoid the duplication which REBOL (currently -- hint, hint, hint)
>imposes on objects?
You can get rid of the duplication by doing a super-object containing the
functions, then cloning this object for all your "real objects". So no, that
wasn't the case. The reason was laziness ;-)
1: I needed to encapsulate the two different lists into something, and I chose
an object. Could've used a list with the two lists as the only elements, but
I just didn't.
2: When I tested my code, I just wanted to be able to 'do the source file and
then directly use the functions on my existing object. If I had put the
functions inside the object, then i would have to create a new object in
order to use the new functions. This is the laziness part.
So, there wasn't any clever reason. And anyway, I just wrote it to show how to
do the thing relatively elegant by separating the keys and the values into
two separate lists, instead of the approach everybody else uses by cramming
everything into a single list.
Anyway, it can be done a lot nicer than I have done.
Kind regards,
--
Ole Friis <[ole_f--post3--tele--dk]>
Amiga is a trademark of Amiga Inc.
Notes
- Quoted lines have been omitted from some messages.
View the message alone to see the lines that have been omitted