Accumulo >> mail # dev >> Re: [jira] [Commented] (ACCUMULO-227) Improve in memory map counts to provide cell level uniqueness for repeated columns in mutation

Rather than aggregation being defined as an operation performed across the values of a set of different keys, you're advocating that identical keys can be inserted and that their values be aggregated as well? This just seems semantically sloppy to me.

These types of changes just incur a cost in terms of understanding for the user. Rather than being able to describe Accumulo as a map, a well defined and understood concept, that also supports aggregations over a set of keys that share a subkey, we would then have to describe Accumulo as a map, most of the time, except when it functions more like a multi-map, in the case of aggregation in the presence of multiple values for the same key ... it's just confusing.

Even with aggregators configured over a table, it still functions as a map - in fact like two maps: an 'underlying' map, in which each key has one value, and an 'aggregate' map, in which keys also have one value, defined as an aggregation over the 'underlying' map. Perhaps one could argue that what I just described could be termed a multi-map, but from the user's point of view, thinking of it as an 'underlying' map, which is how the user sees the table when writing, and an 'aggregate' map, which is how the user sees the table when reading, is cleaner. Users are used to this situation if they've ever used views in a relational database.
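To make the two-map picture concrete, here is a minimal sketch in plain Python. It is a toy model, not Accumulo's API: the names `underlying`, `write`, and `aggregate_view` are hypothetical, and it assumes that the aggregate is a sum over all keys sharing a row and column, with the timestamp as the part of the key that differs.

```python
from collections import defaultdict

# Toy model only: keys are (row, family, qualifier, timestamp) tuples.
# None of these names come from the Accumulo API.
underlying = {}

def write(row, family, qualifier, timestamp, value):
    """Write path: a plain map, so there is one value per full key."""
    underlying[(row, family, qualifier, timestamp)] = value

def aggregate_view(table):
    """Read path: sum the values of all keys sharing (row, family, qualifier)."""
    view = defaultdict(int)
    for (row, family, qualifier, _timestamp), value in table.items():
        view[(row, family, qualifier)] += value
    return dict(view)

write("a", "foo", "bar", 1, 1)
write("a", "foo", "bar", 2, 2)
write("a", "foo", "bar", 2, 5)  # same full key: replaces the value 2

print(aggregate_view(underlying))  # {('a', 'foo', 'bar'): 6}
```

Both structures are maps: the writer sees one value per full key, and the reader sees one value per row and column, much like a view over a base table.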

For you and John, who are steeped in this field, this distinction and this change probably don't seem like a big deal. But when telling new users about Accumulo, being able to explain to them that Accumulo is a map is very useful. It makes predicting the behavior of Accumulo possible. If users can put identical key-value pairs into a mutation, and if Accumulo treats them as distinct, users' predictions will be wrong.
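As an illustration of how the two readings diverge, the following sketch sums the entries of one hypothetical mutation under each. It is plain Python rather than Accumulo code, and it assumes both entries end up with exactly the same key, timestamp included; the function names are made up for this example.

```python
# Hypothetical mutation: a list of (key, value) updates with a repeated key.
mutation = [(("a", "foo", "bar"), 1), (("a", "foo", "bar"), 1)]

def sum_with_map_semantics(updates):
    """One value per unique key: a repeated key overwrites the earlier entry."""
    table = {}
    for key, value in updates:
        table[key] = value
    return sum(table.values())

def sum_with_multimap_semantics(updates):
    """Every entry counts, even when its key repeats."""
    return sum(value for _key, value in updates)

print(sum_with_map_semantics(mutation))       # 1
print(sum_with_multimap_semantics(mutation))  # 2
```

A user who thinks of the table as a map predicts the first result; the change under discussion would produce the second.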

Feel free to make this change, but just consider the collective cognitive cost incurred by altering the semantics. Earlier you argued that the cost of extending the points at which aggregations are executed to include the client would be too great. Yet making it possible for Accumulo to cease acting like a map sometimes doesn't give you pause?

On Dec 22, 2011, at 2:52 PM, Adam Fuchs wrote:

> Aaron,
> I have to disagree with you. By default, Accumulo tables are distributed
> maps. However, as soon as you configure an aggregator or some other
> interesting iterator on a table the semantics for that table change and it
> is no longer a "proper" distributed map. Therefore I claim that the basic
> tenet to which you refer does not exist as such.
> Users generally don't set the timestamps in a mutation, and aggregators
> certainly don't preserve the keys that they aggregate. Are you suggesting
> that modifying the value associated with a key that has already contributed
> to a persisted aggregate should have an effect that is dependent on the
> original value? So, if I sum a:foo:bar->1 and then a:foo:bar->2 I should
> get 2?
> The fix that is suggested in this ticket just makes the behavior consistent
> between the cases of putting two identical entries in one mutation versus
> putting the two entries in two mutations. However we account for the
> semantics of aggregation, we should be for this change.
> Adam
> On Thu, Dec 22, 2011 at 12:31 PM, Aaron Cordova (Commented) (JIRA) <
>>   [
>> https://issues.apache.org/jira/browse/ACCUMULO-227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174913#comment-13174913]
>> Aaron Cordova commented on ACCUMULO-227:
>> ----------------------------------------
>> What the client should expect is that Accumulo will only store/process one
>> value per unique key: Accumulo is a distributed map. Even if it's only for
>> aggregation's sake, allowing Mutations to submit multiple values per unique
>> key and processing all those values, rather than arbitrarily choosing one,
>> violates the concept of a map, which will cause more confusion on the part
>> of users.