Re: [jira] [Commented] (ACCUMULO-227) Improve in memory map counts to provide cell level uniqueness for repeated columns in mutation
Why is it that none of you seem to consider two keys that differ by timestamp to be two different keys?

On Dec 22, 2011, at 4:00 PM, Adam Fuchs wrote:

> Aaron,
>
> I think it would be more accurate to describe Accumulo as an underlying
> multi-map with support for aggregation overlays. A map can be thought of as
> a multi-map with an overlay that takes the first of the multiple entries.
> This is in fact the default configuration of Accumulo tables, where the
> VersioningIterator defines this overlay. Other Iterator configurations
> provide different overlays.
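
The multi-map-plus-overlay view described above can be sketched in a few lines. This is only an illustration, not part of the original message and not Accumulo code: `Key`, `sort_order`, `cell`, and `versioning_overlay` are hypothetical names, and the model assumes that entries for the same cell are ordered newest timestamp first, so that "the first of the multiple entries" is the most recent one.

```python
from collections import namedtuple
from itertools import groupby

# Hypothetical stand-in for an Accumulo Key; not the real Java class.
Key = namedtuple("Key", "row family qualifier visibility timestamp")


def sort_order(entry):
    key, _value = entry
    # Assumption: within one cell, newer timestamps sort first.
    return (key.row, key.family, key.qualifier, key.visibility, -key.timestamp)


def cell(entry):
    key, _value = entry
    # Uniqueness as many users see it: every Key field except the timestamp.
    return key[:4]


def versioning_overlay(entries):
    """Yield only the first (newest) entry for each cell."""
    for _cell, group in groupby(sorted(entries, key=sort_order), key=cell):
        yield next(group)


# The underlying store keeps every entry, like a multi-map.
entries = [
    (Key("r1", "cf", "cq", "", 10), "old"),
    (Key("r1", "cf", "cq", "", 20), "new"),
    (Key("r2", "cf", "cq", "", 5), "only"),
]

# The overlay makes the table look like a plain map when read.
latest = list(versioning_overlay(entries))
```

In this model the two `r1` entries are distinct keys in the underlying store (they differ by timestamp), but a reader going through the overlay sees a single value per cell.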
>
> There are two challenges that make it difficult to cast the underlying
> representation as a map. The first is that the definition of uniqueness of
> a Key is a bit muddy. I think that many users consider the uniqueness to
> include row, column family, and column qualifier. Those that use cell-level
> security also include the column visibility. Timestamp doesn't usually make
> it into the uniqueness concept, from a user's perspective, even though that
> affects the sort order of Keys. In fact, most users let Accumulo set the
> timestamp for them. I think your definition of uniqueness takes timestamp
> into account, and from that perspective what we're doing is sort of like
> providing a finer grained timestamp instead of using one timestamp for an
> entire Mutation (or for all Mutations that show up within a millisecond).
>
> The second challenge is that the overlay is persisted and is not
> reversible. Aggregators don't keep the Keys that they aggregate, so if a
> user wants to replace a Key in the underlying map and have that replacement
> operation be reflected in the overlay, we can't really do that. However, we
> can do that if the underlying store is a multi-map (which is what we do
> now).
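
A rough sketch of why a persisted aggregate cannot be reversed, again as an illustration outside the original message: `summing_overlay` is a hypothetical aggregator that simply adds integer values, and the numbers are made up for the example.

```python
def summing_overlay(values):
    """Fold all values written for one cell into a single total."""
    return sum(values)


# Three writes to the same cell, kept as separate entries (multi-map).
writes = [5, 7, 3]

# Once the overlay is persisted, only the total survives.
persisted = summing_overlay(writes)

# A different write history yields the same total, so the total alone
# cannot tell us which entry to "replace".
other_history = [10, 5]
ambiguous = summing_overlay(other_history) == persisted

# With a multi-map underneath, a later write is just one more entry
# that the overlay folds in alongside the persisted total.
after_new_write = summing_overlay([persisted, 4])
```

Because the aggregate discards its inputs, replace-by-key has nothing left to act on; appending another entry is the only operation that still composes with the persisted result.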
>
> Adam
>
> On Thu, Dec 22, 2011 at 3:41 PM, Aaron Cordova <[EMAIL PROTECTED]> wrote:
>
>> Rather than aggregation functionality being defined as some operation
>> performed across a set of the values of different keys, you're advocating
>> allowing inserting identical keys and aggregating their values as well?
>> This just seems semantically sloppy to me.
>>
>> These types of changes just incur a cost in terms of understanding for the
>> user. Rather than being able to describe Accumulo as a map, a well defined
>> and understood concept, that also supports aggregations over a set of keys
>> that share a subkey, we would then have to describe Accumulo as a map, most
>> of the time, except when it functions more like a multi-map, in the case of
>> aggregation in the presence of multiple values for the same key ... it's
>> just confusing.
>>
>> Even with aggregators configured over a table, it still functions as a map
>> - in fact like two maps, one 'underlying' map, in which each key has one
>> value, and an 'aggregate' map, in which keys also have one value, defined as
>> an aggregation over the 'underlying' map. Perhaps one could argue that what
>> I just described could be termed a multi-map, but from the user's point of
>> view, thinking of it as an 'underlying' map, which is how the user sees the
>> table when writing, and an 'aggregate' map, which is how the user sees the
>> table when reading, is cleaner. Users are used to this situation if
>> they've ever used views in a relational database.
>>
>> For you and John, who are steeped in this field, this distinction, and
>> this change, probably doesn't seem like a big deal. But when telling a new
>> user about Accumulo, being able to explain to them that Accumulo is a map,
>> is very useful. It makes predicting the behavior of Accumulo possible. If
>> users can put identical key-value pairs into a mutation, and if Accumulo
>> treats them as distinct, users' predictions will be wrong.
>>
>> Feel free to make this change, but just consider the collective cognitive
>> cost incurred by altering the semantics. Earlier you argued that
>> extending the times aggregations are executed to include the client would