Accumulo, mail # dev - Re: [jira] [Commented] (ACCUMULO-227) Improve in memory map counts to provide cell level uniqueness for repeated columns in mutation


Aaron Cordova 2011-12-22, 21:02
Why is it that none of you seem to consider two keys that differ by timestamp to be two different keys?
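
To make the two notions of key identity in this thread concrete, here is a minimal Python sketch. It is not Accumulo's API: the Key tuple, cell() and sort_key() are hypothetical names chosen for illustration. The only behavior assumed from Accumulo itself is that, for otherwise equal fields, newer timestamps sort first.

```python
from typing import NamedTuple


class Key(NamedTuple):
    """Hypothetical stand-in for a key: five fields, timestamp last."""
    row: str
    family: str
    qualifier: str
    visibility: str
    timestamp: int

    def cell(self):
        # Everything except the timestamp: the notion of identity that
        # the quoted message attributes to most users.
        return self[:4]


def sort_key(k):
    # Ascending on the cell fields, descending on timestamp (newest first).
    return (k.row, k.family, k.qualifier, k.visibility, -k.timestamp)


a = Key("r1", "cf", "cq", "", 100)
b = Key("r1", "cf", "cq", "", 200)

same_full_key = (a == b)                 # compares all five fields
same_cell = (a.cell() == b.cell())       # ignores the timestamp
ordered = sorted([a, b], key=sort_key)   # newer entry sorts ahead of older
```

Under the five-field definition, a and b are two different keys; under the four-field definition, they are two versions of one cell.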

On Dec 22, 2011, at 4:00 PM, Adam Fuchs wrote:

> Aaron,
>
> I think it would be more accurate to describe Accumulo as an underlying
> multi-map with support for aggregation overlays. A map can be thought of as
> a multi-map with an overlay that takes the first of the multiple entries.
> This is in fact the default configuration of Accumulo tables, where the
> VersioningIterator defines this overlay. Other Iterator configurations
> provide different overlays.
>
> There are two challenges that make it difficult to cast the underlying
> representation as a map. The first is that the definition of uniqueness of
> a Key is a bit muddy. I think that many users consider the uniqueness to
> include row, column family, and column qualifier. Those that use cell-level
> security also include the column visibility. Timestamp doesn't usually make
> it into the uniqueness concept, from a user's perspective, even though that
> affects the sort order of Keys. In fact, most users let Accumulo set the
> timestamp for them. I think your definition of uniqueness takes timestamp
> into account, and from that perspective what we're doing is sort of like
> providing a finer grained timestamp instead of using one timestamp for an
> entire Mutation (or for all Mutations that show up within a millisecond).
>
> The second challenge is that the overlay is persisted and is not
> reversible. Aggregators don't keep the Keys that they aggregate, so if a
> user wants to replace a Key in the underlying map and have that replacement
> operation be reflected in the overlay, we can't really do that. However, we
> can do that if the underlying store is a multi-map (which is what we do
> now).
>
> Adam
>
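
The "multi-map with overlays" model described in the message above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Accumulo code: the entry layout and the scan(), first_overlay() and sum_overlay() helpers are hypothetical, and the order of entries whose keys are fully identical is simply insertion order here, an artifact of the sketch.

```python
from itertools import groupby

# Each entry is (row, family, qualifier, visibility, timestamp, value).
entries = [
    ("r1", "cf", "count", "", 100, 1),
    ("r1", "cf", "count", "", 200, 5),
    ("r1", "cf", "count", "", 200, 7),  # same key, timestamp included
    ("r2", "cf", "count", "", 150, 2),
]


def cell(entry):
    # Identity without the timestamp.
    return entry[:4]


def scan(entries):
    """Underlying multi-map in sort order: cell ascending, newest first."""
    return sorted(entries, key=lambda e: (cell(e), -e[4]))


def first_overlay(entries):
    """Map view: keep only the first entry of each cell."""
    return {c: next(group)[5] for c, group in groupby(scan(entries), key=cell)}


def sum_overlay(entries):
    """Aggregate view: combine every entry of each cell."""
    return {c: sum(e[5] for e in group)
            for c, group in groupby(scan(entries), key=cell)}


versioned = first_overlay(entries)
summed = sum_overlay(entries)
```

Both views are derived from the same four stored entries. If only the summed result were persisted, the individual entries could not be recovered from it, which is the irreversibility the message refers to.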
> On Thu, Dec 22, 2011 at 3:41 PM, Aaron Cordova <[EMAIL PROTECTED]> wrote:
>
>> Rather than aggregation functionality being defined as some operation
>> performed across a set of the values of different keys, you're advocating
>> allowing inserting identical keys and aggregating their values as well?
>> This just seems semantically sloppy to me.
>>
>> These types of changes just incur a cost in terms of understanding for the
>> user. Rather than being able to describe Accumulo as a map, a well defined
>> and understood concept, that also supports aggregations over a set of keys
>> that share a subkey, we would then have to describe Accumulo as a map, most
>> of the time, except when it functions more like a multi-map, in the case of
>> aggregation in the presence of multiple values for the same key ... it's
>> just confusing.
>>
>> Even with aggregators configured over a table, it still functions as a map
>> - in fact like two maps, one 'underlying' map, in which each key has one
>> value, and an 'aggregate' map, in which keys also have one value, defined as
>> an aggregation over the 'underlying' map. Perhaps one could argue that what
>> I just described could be termed a multi-map, but from the user's point of
>> view, thinking of it as an 'underlying' map, which is how the user sees the
>> table when writing, and an 'aggregate' map, which is how the user sees the
>> table when reading, is cleaner. Users are used to this situation if
>> they've ever used views in a relational database.
>>
>> For you and John, who are steeped in this field, this distinction, and
>> this change, probably doesn't seem like a big deal. But when telling a new
>> user about Accumulo, being able to explain to them that Accumulo is a map,
>> is very useful. It makes predicting the behavior of Accumulo possible. If
>> users can put identical key-value pairs into a mutation, and if Accumulo
>> treats them as distinct, users' predictions will be wrong.
>>
>> Feel free to make this change, but just consider the collective cognitive
>> cost incurred by altering the semantics. Earlier you argued that
>> extending the times aggregations are executed to include the client would