|
Adam Fuchs
2011-12-22, 19:52
Aaron Cordova
2011-12-22, 20:41
Adam Fuchs
2011-12-22, 21:00
Aaron Cordova
2011-12-22, 21:02
Aaron Cordova
2011-12-22, 21:02
Aaron Cordova
2011-12-22, 21:04
Aaron Cordova
2011-12-22, 21:07
Adam Fuchs
2011-12-22, 21:09
Aaron Cordova
2011-12-22, 21:20
Aaron Cordova
2011-12-22, 21:22
Aaron Cordova
2011-12-22, 21:25
Keith Turner
2011-12-22, 21:35
Aaron Cordova
2011-12-22, 21:49
Keith Turner
2011-12-22, 21:55
Keith Turner
2011-12-22, 22:09
Aaron Cordova
2011-12-22, 22:15
Aaron Cordova
2011-12-22, 22:17
Aaron Cordova
2011-12-22, 22:23
Keith Turner
2011-12-22, 22:23
Aaron Cordova
2011-12-22, 22:24
Aaron Cordova
2011-12-22, 22:26
Keith Turner
2011-12-22, 22:32
Keith Turner
2011-12-22, 22:39
Aaron Cordova
2011-12-22, 22:56
Aaron Cordova
2011-12-22, 23:12
|
-
Re: [jira] [Commented] (ACCUMULO-227) Improve in memory map counts to provide cell level uniqueness for repeated columns in mutation
Adam Fuchs 2011-12-22, 19:52
Aaron,

I have to disagree with you. By default, Accumulo tables are distributed maps. However, as soon as you configure an aggregator or some other interesting iterator on a table, the semantics for that table change and it is no longer a "proper" distributed map. Therefore I claim that the basic tenet to which you refer does not exist as such.

Users generally don't set the timestamps in a mutation, and aggregators certainly don't preserve the keys that they aggregate. Are you suggesting that modifying the value associated with a key that has already contributed to a persisted aggregate should have an effect that is dependent on the original value? So, if I sum a:foo:bar->1 and then a:foo:bar->2 I should get 2?

The fix that is suggested in this ticket just makes the behavior consistent between the cases of putting two identical entries in one mutation versus putting the two entries in two mutations. However we account for the semantics of aggregation, we should be for this change.

Adam

On Thu, Dec 22, 2011 at 12:31 PM, Aaron Cordova (Commented) (JIRA) <[EMAIL PROTECTED]> wrote:

> [https://issues.apache.org/jira/browse/ACCUMULO-227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174913#comment-13174913]
>
> Aaron Cordova commented on ACCUMULO-227:
> ----------------------------------------
>
> What the client should expect is that Accumulo will only store/process one value per unique key: Accumulo is a distributed map. Even if it's only for aggregation's sake, allowing Mutations to submit multiple values per unique key and processing all those values, rather than arbitrarily choosing one, violates the concept of a map, which will cause more confusion on the part of users.
>
> The right thing to do for users who want to submit lots of values to aggregate under a sub key is to insist that they make their cells differ by at least one element in the key. Again, aggregating multiple values under the same key violates the basic tenet that Accumulo is a map. Aggregation is performed across different keys sharing a sub key.
>
> If having the users generate unique timestamps is a problem, there are several strategies for dealing with that. One is to generate random timestamps. If aggregation is being done over timestamps, the actual timestamp shouldn't matter / ever be interpreted. If there are worries about Accumulo doing something undesired with random timestamps, one could generate random column qualifiers, etc. and aggregate over those.
>
> To address what Adam said about versioning - aggregating tables should probably turn off the iterator that only keeps the latest version. But that has nothing to do with the policy for handling multiple identical cells.
>
> Finally, I'm not advocating we do anything to support aggregation on the client side, but rather leave it up to the application developer to exploit any opportunities for aggregation in their application.
>
> > Improve in memory map counts to provide cell level uniqueness for repeated columns in mutation
> > -----------------------------------------------------------------------------------------------
> >
> > Key: ACCUMULO-227
> > URL: https://issues.apache.org/jira/browse/ACCUMULO-227
> > Project: Accumulo
> > Issue Type: Improvement
> > Components: tserver
> > Reporter: John Vines
> > Assignee: John Vines
> > Fix For: 1.5.0
> >
> > Currently for isolation we only isolate mutations. This doesn't allow mutations with identical cells within it. We should increase the mutation counts to account for each individual cell instead of each mutation.
>
> --
> This message is automatically generated by JIRA.
> If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
> For more information on JIRA, see: http://www.atlassian.com/software/jira
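Editorial sketch of the a:foo:bar question above: the two readings of "sum a:foo:bar->1 and then a:foo:bar->2" can be modelled in a few lines of Python. This is only a toy model under stated assumptions; the function names are hypothetical and none of it is Accumulo's API. Keys are plain strings and timestamps are ignored.

```python
from collections import defaultdict


def apply_as_map(writes):
    """Strict map semantics: a later write to the same key replaces the earlier one."""
    table = {}
    for key, value in writes:
        table[key] = value
    return table


def apply_as_summing_multimap(writes):
    """Multi-map semantics with a summing aggregator: every write contributes."""
    totals = defaultdict(int)
    for key, value in writes:
        totals[key] += value
    return dict(totals)


# The two writes from the example, in arrival order (hypothetical data).
writes = [("a:foo:bar", 1), ("a:foo:bar", 2)]

map_result = apply_as_map(writes)
sum_result = apply_as_summing_multimap(writes)

print(map_result)  # the second write replaced the first
print(sum_result)  # both writes were aggregated
```

Under the strict map reading the second write wins, which is the "I should get 2?" outcome the message questions; under the summing reading both values contribute.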
-
Re: [jira] [Commented] (ACCUMULO-227) Improve in memory map counts to provide cell level uniqueness for repeated columns in mutation
Aaron Cordova 2011-12-22, 20:41
Rather than aggregation functionality being defined as some operation performed across a set of the values of different keys, you're advocating allowing inserting identical keys and aggregating their values as well? This just seems semantically sloppy to me.

These types of changes just incur a cost in terms of understanding for the user. Rather than being able to describe Accumulo as a map, a well defined and understood concept, that also supports aggregations over a set of keys that share a subkey, we would then have to describe Accumulo as a map, most of the time, except when it functions more like a multi-map, in the case of aggregation in the presence of multiple values for the same key ... it's just confusing.

Even with aggregators configured over a table, it still functions as a map - in fact like two maps: one 'underlying' map, in which each key has one value, and an 'aggregate' map, in which keys also have one value, defined as an aggregation over the 'underlying' map. Perhaps one could argue that what I just described could be termed a multi-map, but from the user's point of view, thinking of it as an 'underlying' map, which is how the user sees the table when writing, and an 'aggregate' map, which is how the user sees the table when reading, is cleaner. Users are used to this situation if they've ever used views in a relational database.

For you and John, who are steeped in this field, this distinction, and this change, probably doesn't seem like a big deal. But when telling a new user about Accumulo, being able to explain to them that Accumulo is a map is very useful. It makes predicting the behavior of Accumulo possible. If users can put identical key-value pairs into a mutation, and if Accumulo treats them as distinct, users' predictions will be wrong.

Feel free to make this change, but just consider the collective cognitive cost it incurs by altering the semantics. Earlier you argued that extending the times aggregations are executed to include the client would be too great. Yet making it possible for Accumulo to cease acting like a map sometimes doesn't give you pause?
-
Re: [jira] [Commented] (ACCUMULO-227) Improve in memory map counts to provide cell level uniqueness for repeated columns in mutation
Adam Fuchs 2011-12-22, 21:00
Aaron,

I think it would be more accurate to describe Accumulo as an underlying multi-map with support for aggregation overlays. A map can be thought of as a multi-map with an overlay that takes the first of the multiple entries. This is in fact the default configuration of Accumulo tables, where the VersioningIterator defines this overlay. Other Iterator configurations provide different overlays.

There are two challenges that make it difficult to cast the underlying representation as a map. The first is that the definition of uniqueness of a Key is a bit muddy. I think that many users consider the uniqueness to include row, column family, and column qualifier. Those that use cell-level security also include the column visibility. Timestamp doesn't usually make it into the uniqueness concept, from a user's perspective, even though that affects the sort order of Keys. In fact, most users let Accumulo set the timestamp for them. I think your definition of uniqueness takes timestamp into account, and from that perspective what we're doing is sort of like providing a finer grained timestamp instead of using one timestamp for an entire Mutation (or for all Mutations that show up within a millisecond).

The second challenge is that the overlay is persisted and is not reversible. Aggregators don't keep the Keys that they aggregate, so if a user wants to replace a Key in the underlying map and have that replacement operation be reflected in the overlay, we can't really do that. However, we can do that if the underlying store is a multi-map (which is what we do now).

Adam
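Editorial sketch of the "multi-map with overlays" description: the Python below models the underlying store as a list of entries and two read-time overlays over it. It is a toy model, not Accumulo's iterator API; the function names and data are hypothetical, and "first entry" is assumed to mean the entry with the newest timestamp.

```python
from collections import defaultdict

# Underlying multi-map: ((row, colfam, colqual), timestamp, value) entries.
entries = [
    (("a", "foo", "bar"), 10, 1),
    (("a", "foo", "bar"), 20, 2),
    (("b", "foo", "bar"), 10, 5),
]


def versioning_overlay(entries):
    """Map-like view: keep only the newest value per (row, colfam, colqual)."""
    newest = {}
    for cell, ts, value in entries:
        if cell not in newest or ts > newest[cell][0]:
            newest[cell] = (ts, value)
    return {cell: value for cell, (ts, value) in newest.items()}


def summing_overlay(entries):
    """Aggregate view: sum every value stored under (row, colfam, colqual)."""
    totals = defaultdict(int)
    for cell, ts, value in entries:
        totals[cell] += value
    return dict(totals)


latest = versioning_overlay(entries)
totals = summing_overlay(entries)
print(latest)
print(totals)
```

Both views are computed from the same stored entries; only the overlay differs, which is the sense in which the default table configuration looks like a map.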
-
Re: [jira] [Commented] (ACCUMULO-227) Improve in memory map counts to provide cell level uniqueness for repeated columns in mutation
Aaron Cordova 2011-12-22, 21:02
Why is it that none of you seem to consider two keys that differ by timestamp to be two different keys?
-
Re: [jira] [Commented] (ACCUMULO-227) Improve in memory map counts to provide cell level uniqueness for repeated columns in mutation
Aaron Cordova 2011-12-22, 21:02
The timestamp is part of the key. If two keys differ by timestamp, they are different keys. The versioning iterator filters out certain _keys_ and their values.
If Accumulo allows two identical keys to be inserted, that behavior should change. In my opinion, it should arbitrarily throw away all but one key-value pair, so as to behave like a proper map.

On Dec 22, 2011, at 3:55 PM, Keith Turner (Commented) (JIRA) wrote:

> [ https://issues.apache.org/jira/browse/ACCUMULO-227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13175052#comment-13175052 ]
>
> Keith Turner commented on ACCUMULO-227:
> ---------------------------------------
>
> Aaron,
>
> By default Accumulo is a map (when configured w/ the versioning iterator). To get the map behavior you mentioned w/ aggregation, I think you could put the versioning iterator below the aggregating iterator. Then aggregation would never see two identical keys.
>
> Without the versioning iterator, if two identical key values exist in two map files then the user will see both. This has nothing to do w/ the in memory map. This change just makes the behavior when the Versioning iterator is removed consistent.
>
> There is one oddity when there are two identical keys: nondeterministic behavior. If two files have the same key value and you have the versioning iterator configured, then you may see different values for the same key at different times. Eric suggested sorting on the value to make this deterministic.
-
Re: [jira] [Commented] (ACCUMULO-227) Improve in memory map counts to provide cell level uniqueness for repeated columns in mutation
Aaron Cordova 2011-12-22, 21:04
On Dec 22, 2011, at 4:00 PM, Adam Fuchs wrote:

> Timestamp doesn't usually make it into the uniqueness concept, from a user's perspective, even though that affects the sort order of Keys. In fact, most users let Accumulo set the timestamp for them. I think your definition of uniqueness takes timestamp into account, and from that perspective what we're doing is sort of like providing a finer grained timestamp instead of using one timestamp for an entire Mutation (or for all Mutations that show up within a millisecond).

Timestamps do define separate keys. This is not just my definition - this is in the BigTable design as well as HBase's, and likely every other BigTable clone.
-
Re: [jira] [Commented] (ACCUMULO-227) Improve in memory map counts to provide cell level uniqueness for repeated columns in mutation
Aaron Cordova 2011-12-22, 21:07
Saying that Accumulo is a multi-map, because you can think of it that way by not considering timestamps part of the key (which they are) and because the versioning iterator is turned on by default, will only confuse users, especially those who read the BigTable paper, or hear about the BigTable data model, or who are familiar with the concept of a key-value store, and decide to try to use Accumulo.

> The second challenge is that the overlay is persisted and is not reversible. Aggregators don't keep the Keys that they aggregate, so if a user wants to replace a Key in the underlying map and have that replacement operation be reflected in the overlay, we can't really do that. However, we can do that if the underlying store is a multi-map (which is what we do now).
>
> Adam
-
Re: [jira] [Commented] (ACCUMULO-227) Improve in memory map counts to provide cell level uniqueness for repeated columns in mutation
Adam Fuchs 2011-12-22, 21:09
Sorry, I thought we were talking about users' perceptions of semantics.
Bigtable also supports holding multiple versions of key/value pairs, so it can be thought of as having an underlying multi-map as well.

Adam
-
Re: [jira] [Commented] (ACCUMULO-227) Improve in memory map counts to provide cell level uniqueness for repeated columns in mutation
Aaron Cordova 2011-12-22, 21:20
_You_ can think of it that way, cause you're Adam Fucsh, distributed database expert extraordinaire, but that's not how the BigTable data model was described by the original authors - "BigTable is a sparse, sorted, distributed, multidimensional map", and most users do understand Accumulo to be a map of keys to values where the keys are made up of a row, colfam, colqual, colvis, and timestamp and the values are arbitrary byte pairs.
To start explaining to people that Accumulo is a multi-map, or to actually make it into a multi-map (i.e. allowing identical keys, where a key includes the timestamp), would be a mistake, in my opinion.
-
Re: [jira] [Commented] (ACCUMULO-227) Improve in memory map counts to provide cell level uniqueness for repeated columns in mutation
Aaron Cordova 2011-12-22, 21:22
by "byte pairs" I mean byte arrays .. of course ...
-
Re: [jira] [Commented] (ACCUMULO-227) Improve in memory map counts to provide cell level uniqueness for repeated columns in mutation
Aaron Cordova 2011-12-22, 21:25
and by Fucsh I mean Fuchs of course ..
-
Re: [jira] [Commented] (ACCUMULO-227) Improve in memory map counts to provide cell level uniqueness for repeated columns in mutation
Keith Turner 2011-12-22, 21:35
BigTable has versions. Does the BigTable paper actually describe the behavior of inserting two identical keys at different times when the table is set to show two versions? If these keys were in two separate map files/sstables then something would have to make a decision to suppress one of them. I am not sure the BigTable paper got that specific. You could suppress one of the keys, or just consider them to be two versions. We have been considering them to be versions.
-
Re: [jira] [Commented] (ACCUMULO-227) Improve in memory map counts to provide cell level uniqueness for repeated columns in mutation
Aaron Cordova 2011-12-22, 21:49
I think it's fine to consider different versions of 'identical keys', meaning row,colfam,colqual, because in that case the implementation still treats two keys that only differ by timestamp as two unique keys. But I don't think we should allow multiple identical _versions_ of identical keys, to use your terminology. I think we should throw all but one away if the user does happen to try to insert them and if the user wants to aggregate across values, he or she must use different version numbers or timestamps or whatever.
If generating unique timestamps within mutations that want to perform several updates to the same row,colfam,colqual is a problem, why don't we allow the user to 'put()' multiple updates into a mutation, and on the server then assign slightly different timestamps to the identical row,colfam,colqual triples that are found in a mutation. Would that make everyone happy? On Dec 22, 2011, at 4:35 PM, Keith Turner wrote: > Big table has versions. Does the big table paper actually describe > the behavior of inserting two identical keys at different times when > the table is set to show two versions? If these keys were in two > separate map files/sstables then something would have to make a > decision to suppress one of them. I am not sure the big table paper > got that specific. You could suppress one of the keys, or just > consider them to be two versions. We have been considering them to be > versions.
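A minimal sketch of the server-side idea proposed above, written in Python for brevity rather than against any Accumulo API. The names `assign_timestamps` and `base_ts` are hypothetical, and the choice to bump the timestamp by one per repeat is an assumption; the message only says "slightly different timestamps".

```python
from collections import defaultdict

def assign_timestamps(updates, base_ts):
    """Give repeated (row, colfam, colqual) triples in one mutation distinct timestamps.

    updates: list of ((row, colfam, colqual), value) pairs in put() order.
    base_ts: hypothetical timestamp the server would assign to the whole mutation.
    """
    seen = defaultdict(int)  # how many times each triple has appeared so far
    stamped = []
    for triple, value in updates:
        ts = base_ts + seen[triple]  # a later put of the same triple gets a later timestamp
        seen[triple] += 1
        stamped.append((triple + (ts,), value))
    return stamped

mutation = [
    (("a", "foo", "bar"), 1),
    (("a", "foo", "bar"), 2),  # same triple again within the same mutation
    (("a", "foo", "baz"), 7),
]
stamped = assign_timestamps(mutation, base_ts=1000)
```

Under these assumptions every entry in `stamped` has a distinct full key, so an aggregator would see both values for `a:foo:bar`. The sketch only covers repeats inside a single mutation whose timestamps the server assigns.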
-
Re: [jira] [Commented] (ACCUMULO-227) Improve in memory map counts to provide cell level uniqueness for repeated columns in mutation
Keith Turner 2011-12-22, 21:55
I am not sure that BigTable can really be thought of as a map, in the sense of a Java TreeMap. Inserting the exact same key as an existing key will not overwrite the value in a deterministic way like it would in a TreeMap. To truly overwrite a value you must insert with a key that has a greater timestamp.

To support making updates to a key using the exact same timestamp, a BigTable implementation would need to keep another hidden timestamp (or something that indicates order of arrival). Otherwise the system has no way to know which value came second and which to suppress. I do not remember any mention of a secondary hidden timestamp in the BigTable paper. Without this extra info I am not sure how the BigTable model would make deterministic decisions that mimic the order-of-arrival behavior of a TreeMap. Therefore I suspect BigTable treats keys that are exactly the same in the same way as it treats multiple versions.

Keith

On Thu, Dec 22, 2011 at 4:20 PM, Aaron Cordova <[EMAIL PROTECTED]> wrote:
> _You_ can think of it that way, 'cause you're Adam Fuchs, distributed database expert extraordinaire, but that's not how the BigTable data model was described by the original authors - "BigTable is a sparse, sorted, distributed, multidimensional map", and most users do understand Accumulo to be a map of keys to values where the keys are made up of a row, colfam, colqual, colvis, and timestamp and the values are arbitrary byte pairs.
>
> To start explaining to people that Accumulo is a multi-map, or to actually make it into a multi-map (i.e. allowing identical keys, where a key includes the timestamp), would be a mistake, in my opinion.
>
> On Dec 22, 2011, at 4:09 PM, Adam Fuchs wrote:
>
>> Sorry, I thought we were talking about users' perceptions of semantics. BigTable also supports holding multiple versions of key/value pairs, so it can be thought of as having an underlying multi-map as well.
>>
>> Adam
>>
>> On Thu, Dec 22, 2011 at 4:04 PM, Aaron Cordova <[EMAIL PROTECTED]> wrote:
>>
>>> On Dec 22, 2011, at 4:00 PM, Adam Fuchs wrote:
>>>
>>>> Timestamp doesn't usually make it into the uniqueness concept, from a user's perspective, even though that affects the sort order of Keys. In fact, most users let Accumulo set the timestamp for them. I think your definition of uniqueness takes timestamp into account, and from that perspective what we're doing is sort of like providing a finer grained timestamp instead of using one timestamp for an entire Mutation (or for all Mutations that show up within a millisecond).
>>>
>>> Timestamps do define separate keys. This is not just my definition - this is in the BigTable design as well as HBase's, and likely every other BigTable clone.
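Keith's order-of-arrival argument can be illustrated with a small Python sketch, where a `dict` stands in for a Java TreeMap. The helper names and the hidden sequence number are assumptions made for illustration; nothing here is claimed to be a BigTable or Accumulo feature.

```python
key = ("row1", "colfam1", "colqual1", 4)  # the timestamp is part of the key

# A plain map: the second insert overwrites the first, so arrival order decides.
tree_map = {}
tree_map[key] = "valueA"
tree_map[key] = "valueB"

def last_by_key_only(entries):
    """Pick a winner when entries are ordered by key alone."""
    ordered = sorted(entries, key=lambda e: e[0])  # stable sort: ties keep storage order
    return ordered[-1][1]

def latest_wins(entries_with_seq):
    """Pick a winner using a hypothetical hidden arrival counter (seq)."""
    winners = {}
    for k, seq, value in sorted(entries_with_seq, key=lambda e: (e[0], e[1])):
        winners[k] = value  # higher seq is visited later and overwrites
    return winners

# The same two entries, read back in two different storage orders.
order_1 = [(key, "valueA"), (key, "valueB")]
order_2 = [(key, "valueB"), (key, "valueA")]

# With the hidden counter, storage order no longer matters.
with_seq = [(key, 2, "valueB"), (key, 1, "valueA")]
```

With only the key to sort on, `last_by_key_only` returns a different value for `order_1` and `order_2`, which is the non-determinism described above. `latest_wins` recovers TreeMap-like behavior, but only because of the extra arrival information.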
-
Re: [jira] [Commented] (ACCUMULO-227) Improve in memory map counts to provide cell level uniqueness for repeated columns in mutation
Keith Turner 2011-12-22, 22:09
On Thu, Dec 22, 2011 at 4:49 PM, Aaron Cordova <[EMAIL PROTECTED]> wrote:
> I think it's fine to consider different versions of 'identical keys', meaning row,colfam,colqual, because in that case the implementation still treats two keys that differ only by timestamp as two unique keys. But I don't think we should allow multiple identical _versions_ of identical keys, to use your terminology. I think we should throw all but one away if the user does happen to try to insert them, and if the user wants to aggregate across values, he or she must use different version numbers or timestamps or whatever.
>
> If generating unique timestamps within mutations that want to perform several updates to the same row,colfam,colqual is a problem, why don't we allow the user to 'put()' multiple updates into a mutation, and then on the server assign slightly different timestamps to the identical row,colfam,colqual triples that are found in a mutation? Would that make everyone happy?

This still does not address the issue of separate mutations inserting the exact same key. Also, timestamps are only set on the keys in a mutation if the user does not set them.

So if a table comes to have multiple keys that are exactly the same, what do you propose? That we drop them? Which one will you drop? One nice thing about Accumulo is that if you wish to have this behavior, you can very easily write an iterator to do it. I think you are proposing that we configure an iterator to do this by default?

I think if the user is inserting things with the exact same key and expecting it to behave like a TreeMap (honor order of arrival), then it never will. Even if we drop duplicate keys, we will not achieve the map behavior you described.
-
Re: [jira] [Commented] (ACCUMULO-227) Improve in memory map counts to provide cell level uniqueness for repeated columns in mutation
Aaron Cordova 2011-12-22, 22:15
I propose Accumulo drops all but one arbitrarily.

On Dec 22, 2011, at 5:09 PM, Keith Turner wrote:
> This still does not address the issue of separate mutations inserting the exact same key. Also, timestamps are only set on the keys in a mutation if the user does not set them.
>
> So if a table comes to have multiple keys that are exactly the same, what do you propose? That we drop them? Which one will you drop? One nice thing about Accumulo is that if you wish to have this behavior, you can very easily write an iterator to do it. I think you are proposing that we configure an iterator to do this by default?
>
> I think if the user is inserting things with the exact same key and expecting it to behave like a TreeMap (honor order of arrival), then it never will. Even if we drop duplicate keys, we will not achieve the map behavior you described.
-
Re: [jira] [Commented] (ACCUMULO-227) Improve in memory map counts to provide cell level uniqueness for repeated columns in mutation
Aaron Cordova 2011-12-22, 22:17
The reason for 'all but one' is to adhere to the concept of a map, and by 'arbitrarily' I mean whatever way incurs the least processing cost.

On Dec 22, 2011, at 5:09 PM, Keith Turner wrote:
> This still does not address the issue of separate mutations inserting the exact same key. Also, timestamps are only set on the keys in a mutation if the user does not set them.
>
> So if a table comes to have multiple keys that are exactly the same, what do you propose? That we drop them? Which one will you drop? One nice thing about Accumulo is that if you wish to have this behavior, you can very easily write an iterator to do it. I think you are proposing that we configure an iterator to do this by default?
>
> I think if the user is inserting things with the exact same key and expecting it to behave like a TreeMap (honor order of arrival), then it never will. Even if we drop duplicate keys, we will not achieve the map behavior you described.
-
Re: [jira] [Commented] (ACCUMULO-227) Improve in memory map counts to provide cell level uniqueness for repeated columns in mutation
Aaron Cordova 2011-12-22, 22:23
And just to be clear, since there are several definitions of key flying around - in the following case:

row1,colfam1,colqual1,4 -> valueA
row1,colfam1,colqual1,5 -> valueB

These can coexist peacefully - although the versioning iterator might suppress all but k versions.

In this case:

row1,colfam1,colqual1,4 -> valueA
row1,colfam1,colqual1,4 -> valueB

Accumulo should throw one away arbitrarily. I think what you mentioned, a system iterator that performs this logic, would be a good implementation.

On Dec 22, 2011, at 5:09 PM, Keith Turner wrote:
> On Thu, Dec 22, 2011 at 4:49 PM, Aaron Cordova <[EMAIL PROTECTED]> wrote:
>> I think it's fine to consider different versions of 'identical keys', meaning row,colfam,colqual, because in that case the implementation still treats two keys that differ only by timestamp as two unique keys. But I don't think we should allow multiple identical _versions_ of identical keys, to use your terminology. I think we should throw all but one away if the user does happen to try to insert them, and if the user wants to aggregate across values, he or she must use different version numbers or timestamps or whatever.
>>
>> If generating unique timestamps within mutations that want to perform several updates to the same row,colfam,colqual is a problem, why don't we allow the user to 'put()' multiple updates into a mutation, and then on the server assign slightly different timestamps to the identical row,colfam,colqual triples that are found in a mutation? Would that make everyone happy?
>
> This still does not address the issue of separate mutations inserting the exact same key. Also, timestamps are only set on the keys in a mutation if the user does not set them.
>
> So if a table comes to have multiple keys that are exactly the same, what do you propose? That we drop them? Which one will you drop? One nice thing about Accumulo is that if you wish to have this behavior, you can very easily write an iterator to do it. I think you are proposing that we configure an iterator to do this by default?
>
> I think if the user is inserting things with the exact same key and expecting it to behave like a TreeMap (honor order of arrival), then it never will. Even if we drop duplicate keys, we will not achieve the map behavior you described.
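The two cases in Aaron's example can be sketched as a small filter over an already sorted stream of entries. This is a Python illustration only, not Accumulo's iterator interface; `drop_duplicate_keys` is a hypothetical name, and keeping the first entry of each group is just one possible reading of "arbitrarily".

```python
from itertools import groupby

def drop_duplicate_keys(sorted_entries):
    """Yield one entry per exact key (row, colfam, colqual, timestamp).

    Assumes entries with equal keys are adjacent, as they would be in sorted data.
    """
    for _key, group in groupby(sorted_entries, key=lambda e: e[0]):
        yield next(group)  # keep the first entry seen; the rest of the group is dropped

# Keys differ by timestamp: both entries survive.
coexisting = [
    (("row1", "colfam1", "colqual1", 4), "valueA"),
    (("row1", "colfam1", "colqual1", 5), "valueB"),
]

# Keys are identical, timestamp included: only one entry survives.
colliding = [
    (("row1", "colfam1", "colqual1", 4), "valueA"),
    (("row1", "colfam1", "colqual1", 4), "valueB"),
]

kept_coexisting = list(drop_duplicate_keys(coexisting))
kept_colliding = list(drop_duplicate_keys(colliding))
```

Which value survives in the colliding case depends entirely on the order in which the two entries reach the filter.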
-
Re: [jira] [Commented] (ACCUMULO-227) Improve in memory map counts to provide cell level uniqueness for repeated columns in mutation
Keith Turner 2011-12-22, 22:23
On Thu, Dec 22, 2011 at 5:15 PM, Aaron Cordova <[EMAIL PROTECTED]> wrote:
> I propose Accumulo drops all but one arbitrarily
>

Ok, the default configuration currently does this. Like I said in another comment, Eric suggested sorting on the value in this case so that it's not arbitrary, and scans behave deterministically.

In the case where the user starts modifying the iterator stack, I suppose you want an iterator that users cannot see or override/remove that does this? That is not something I would advocate for.

Keith
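Eric's suggestion, as relayed above, is to break ties on the value so the surviving entry does not depend on chance. A minimal Python sketch follows; the name `pick_deterministically` is hypothetical, and letting the smallest value win is an assumption, since the thread does not say which end of the value order should be kept.

```python
def pick_deterministically(entries):
    """Keep one value per exact key, choosing by value order instead of arrival or storage order."""
    chosen = {}
    for key, value in sorted(entries):  # entries with equal keys are ordered by value
        chosen.setdefault(key, value)   # the first entry in (key, value) order wins
    return chosen

key = ("row1", "colfam1", "colqual1", 4)
one_order = [(key, "valueA"), (key, "valueB")]
other_order = [(key, "valueB"), (key, "valueA")]
```

Both input orders produce the same result, so repeated scans would agree with each other at the cost of comparing values whenever keys collide.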
-
Re: [jira] [Commented] (ACCUMULO-227) Improve in memory map counts to provide cell level uniqueness for repeated columns in mutation
Aaron Cordova 2011-12-22, 22:24
If you want to make it possible for the user to turn this functionality off (let's call it the "GO MULTIMAP!!" option), that's fine with me, as long as it's turned on by default.

On Dec 22, 2011, at 5:23 PM, Keith Turner wrote:
> On Thu, Dec 22, 2011 at 5:15 PM, Aaron Cordova <[EMAIL PROTECTED]> wrote:
>> I propose Accumulo drops all but one arbitrarily
>>
> Ok, the default configuration currently does this. Like I said in another comment, Eric suggested sorting on the value in this case so that it's not arbitrary, and scans behave deterministically.
>
> In the case where the user starts modifying the iterator stack, I suppose you want an iterator that users cannot see or override/remove that does this? That is not something I would advocate for.
>
> Keith
-
Re: [jira] [Commented] (ACCUMULO-227) Improve in memory map counts to provide cell level uniqueness for repeated columns in mutation
Aaron Cordova 2011-12-22, 22:26
> Ok, the default configuration currently does this. Like I said in another comment, Eric suggested sorting on the value in this case so that it's not arbitrary, and scans behave deterministically.

That's fine too, if the value of dropping deterministically outweighs the cost of sorting in these cases - it should be so rare that it doesn't really matter much how one does it.
-
Re: [jira] [Commented] (ACCUMULO-227) Improve in memory map counts to provide cell level uniqueness for repeated columns in mutation
Keith Turner 2011-12-22, 22:32
On Thu, Dec 22, 2011 at 5:23 PM, Aaron Cordova <[EMAIL PROTECTED]> wrote:
> And just to be clear, since there are several definitions of key flying around - in the following case:
>
> row1,colfam1,colqual1,4 -> valueA
> row1,colfam1,colqual1,5 -> valueB
>
> These can coexist peacefully - although the versioning iterator might suppress all but k versions.
>
> In this case:
>
> row1,colfam1,colqual1,4 -> valueA
> row1,colfam1,colqual1,4 -> valueB
>
> Accumulo should throw one away arbitrarily. I think what you mentioned, a system iterator that performs this logic, would be a good implementation.

I am opposed to making this a system iterator. I like iterators seeing the data in sorted form, not "sort -u" :)
-
Re: [jira] [Commented] (ACCUMULO-227) Improve in memory map counts to provide cell level uniqueness for repeated columns in mutation
Keith Turner 2011-12-22, 22:39
BigTable did not define LSM iterators. In the context of LSM iterators, I think of the system as an online sort of the data that continually pulls sorted subsets of the data through the iterator stack. Sorting data does not imply making it unique. Giving iterators access to all data gives them the greatest level of flexibility.

On Thu, Dec 22, 2011 at 5:24 PM, Aaron Cordova <[EMAIL PROTECTED]> wrote:
> If you want to make it possible for the user to turn this functionality off - (let's call it the "GO MULTIMAP!!" option) that's fine with me, as long as by default it's turned on.
>
> On Dec 22, 2011, at 5:23 PM, Keith Turner wrote:
>
>> On Thu, Dec 22, 2011 at 5:15 PM, Aaron Cordova <[EMAIL PROTECTED]> wrote:
>>> I propose Accumulo drops all but one arbitrarily
>>>
>> Ok, the default configuration currently does this. Like I said in another comment, Eric suggested sorting on the value in this case so that it's not arbitrary, and scans behave deterministically.
>>
>> In the case where the user starts modifying the iterator stack, I suppose you want an iterator that users cannot see or override/remove that does this? That is not something I would advocate for.
>>
>> Keith
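The distinction Keith draws between sorting and uniquing can be shown with a short Python sketch. Two sorted runs stand in for separate map files, `heapq.merge` plays the role of the online sort, and `summing_iterator` and `sort_u` are hypothetical stand-ins for an aggregating iterator and a forced de-duplication step; none of this is Accumulo's actual iterator API.

```python
import heapq

# Two sorted runs that happen to contain the exact same key (timestamp included).
run_a = [(("a", "foo", "bar", 7), 1), (("b", "foo", "bar", 7), 5)]
run_b = [(("a", "foo", "bar", 7), 2)]

# Merging sorted runs yields sorted output and keeps every entry.
merged = list(heapq.merge(run_a, run_b))

def summing_iterator(entries):
    """Sum the values of all entries that share an exact key."""
    totals = {}
    for key, value in entries:
        totals[key] = totals.get(key, 0) + value
    return totals

def sort_u(entries):
    """Mimic `sort -u` on the key: keep only the first entry for each exact key."""
    out = []
    for key, value in entries:
        if not out or out[-1][0] != key:
            out.append((key, value))
    return out

sum_over_all = summing_iterator(merged)             # the aggregator sees both values
sum_after_uniq = summing_iterator(sort_u(merged))   # one value was dropped before aggregation
```

If de-duplication runs ahead of the aggregator, the sum for `a:foo:bar` reflects only one of the two inserted values, which is the loss of flexibility the message refers to.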
-
Re: [jira] [Commented] (ACCUMULO-227) Improve in memory map counts to provide cell level uniqueness for repeated columns in mutation
Aaron Cordova 2011-12-22, 22:56
I acknowledge everything you say as true, but disagree that the greatest level of flexibility results in the best experience for the user. BigTable is defined as a map. I think most users expect a map, and that making Accumulo behave like a multi-map will most likely result in poorer adoption.

On Dec 22, 2011, at 5:39 PM, Keith Turner wrote:
> BigTable did not define LSM iterators. In the context of LSM iterators, I think of the system as an online sort of the data that continually pulls sorted subsets of the data through the iterator stack. Sorting data does not imply making it unique. Giving iterators access to all data gives them the greatest level of flexibility.
-
Re: [jira] [Commented] (ACCUMULO-227) Improve in memory map counts to provide cell level uniqueness for repeated columns in mutation
Aaron Cordova 2011-12-22, 23:12
It doesn't have to be an iterator that can't be turned off, just one that's enabled by default.

On Dec 22, 2011, at 5:32 PM, Keith Turner wrote:
> On Thu, Dec 22, 2011 at 5:23 PM, Aaron Cordova <[EMAIL PROTECTED]> wrote:
>> And just to be clear, since there are several definitions of key flying around - in the following case:
>>
>> row1,colfam1,colqual1,4 -> valueA
>> row1,colfam1,colqual1,5 -> valueB
>>
>> These can coexist peacefully - although the versioning iterator might suppress all but k versions.
>>
>> In this case:
>>
>> row1,colfam1,colqual1,4 -> valueA
>> row1,colfam1,colqual1,4 -> valueB
>>
>> Accumulo should throw one away arbitrarily. I think what you mentioned, a system iterator that performs this logic, would be a good implementation.
>
> I am opposed to making this a system iterator. I like iterators seeing the data in sorted form, not "sort -u" :)