-
org.apache.accumulo.core.iterators.Combiner: key scope?
Jason Trost 2012-03-15, 11:33
I found myself needing a combiner that will sum the values of a row where each key has the same row and column family (but col qual differs).
I was looking through the Combiner class and I was wondering if there would be any issues with making this line of the code configurable. (line 70 in org.apache.accumulo.core.iterators.Combiner)
  private boolean _hasNext() {
    return source.hasTop() && !source.getTopKey().isDeleted()
        && topKey.equals(source.getTopKey(), PartialKey.ROW_COLFAM_COLQUAL_COLVIS);
  }
Specifically I was thinking it would be useful to be able to configure the partial key field using one of the following values (from org.apache.accumulo.core.data.PartialKey):
  ROW
  ROW_COLFAM
  ROW_COLFAM_COLQUAL
  ROW_COLFAM_COLQUAL_COLVIS
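A minimal sketch of what a configurable scope could look like, kept free of Accumulo classes so it runs on its own. Everything here is a stand-in and an assumption: Scope plays the role of PartialKey, ScopedEntry holds a key as {row, colFam, colQual, colVis} strings with a long value, and ScopedSum.sum mimics a summing combiner whose grouping test takes the scope as a parameter instead of hard-coding ROW_COLFAM_COLQUAL_COLVIS.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for PartialKey: how many leading key fields must match.
enum Scope {
    ROW(1), ROW_COLFAM(2), ROW_COLFAM_COLQUAL(3), ROW_COLFAM_COLQUAL_COLVIS(4);

    final int depth;

    Scope(int depth) { this.depth = depth; }
}

// Hypothetical simplified entry: fields = {row, colFam, colQual, colVis}.
final class ScopedEntry {
    final String[] fields;
    final long value;

    ScopedEntry(long value, String... fields) {
        this.fields = fields;
        this.value = value;
    }
}

final class ScopedSum {
    // True when the first scope.depth fields of both keys are equal.
    static boolean samePrefix(String[] a, String[] b, Scope scope) {
        return Arrays.equals(Arrays.copyOf(a, scope.depth), Arrays.copyOf(b, scope.depth));
    }

    // Sums values over runs of consecutive entries that share the configured prefix.
    // The input is assumed to be sorted already, as it would be inside an iterator stack.
    static List<Long> sum(List<ScopedEntry> sorted, Scope scope) {
        List<Long> sums = new ArrayList<>();
        int i = 0;
        while (i < sorted.size()) {
            String[] top = sorted.get(i).fields;
            long total = 0;
            while (i < sorted.size() && samePrefix(top, sorted.get(i).fields, scope)) {
                total += sorted.get(i).value;
                i++;
            }
            sums.add(total);
        }
        return sums;
    }
}
```

With Scope.ROW_COLFAM, two entries that differ only in column qualifier fall into one sum; with Scope.ROW_COLFAM_COLQUAL they stay separate. The sketch deliberately ignores deletes and the question of which key to return for each group.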
I see the main value here in using Combiners at scan time to perform various rollups and counts.
I am guessing there may be some security implications of doing this? Maybe the labels of aggregations based on any partial key not including colvis would need to be combined.
Thoughts on this?
Thanks,
--Jason
-
Re: org.apache.accumulo.core.iterators.Combiner: key scope?
Keith Turner 2012-03-15, 12:44
Yes, security is a concern. A user wrote something like what you mentioned. They combined col vis by taking the unique set of col vis expressions and joining them with & and putting parens around them. This was only intended to be used at scan time and not compaction time.
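The visibility-combining approach described above might be sketched as follows. This is not that user's code; it works on plain strings rather than Accumulo's visibility class, VisibilityJoin is a hypothetical name, and skipping empty expressions (on the assumption that an empty visibility adds no restriction) is a choice made for this sketch.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

final class VisibilityJoin {
    // ANDs together the distinct, non-empty visibility expressions,
    // wrapping each one in parentheses so its own operators stay grouped.
    static String combine(List<String> expressions) {
        Set<String> unique = new LinkedHashSet<>(); // keeps first-seen order, drops repeats
        for (String expression : expressions) {
            if (!expression.isEmpty()) {
                unique.add(expression);
            }
        }
        return unique.stream()
                .map(expression -> "(" + expression + ")")
                .collect(Collectors.joining("&"));
    }
}
```

For the inputs "A|B", "C", "A|B" this yields "(A|B)&(C)", so a reader of the combined entry needs authorizations satisfying every contributing expression.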
Keith
-
Re: org.apache.accumulo.core.iterators.Combiner: key scope?
Billie J Rinaldi 2012-03-19, 19:50
Another thing to consider is what to do with the differing column qualifiers. Throw them away, returning a blank column qualifier on the single Key returned? What if we want to combine column qualifiers and ignore Values instead? Should we try to pass the qualifiers into a reduce method with the Values? That would be a more general approach, but I'm not sure how to create an API that wouldn't be messy.
Billie
-
Re: org.apache.accumulo.core.iterators.Combiner: key scope?
Keith Turner 2012-03-19, 20:02
Billie
The following API might address the issues you raised:
public abstract Pair<Key, Value> reduce(Iterator<Pair<Key,Value>> iter)
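A rough sketch of how a subclass might use a method of that shape. The types are stand-ins chosen so the example compiles alone: Pair is a hypothetical minimal class (the thread does not say which Pair was meant), keys are String[] {row, colFam, colQual}, and values are Long rather than Accumulo's Key and Value. SummingReducer picks one of the options raised earlier, returning a blank column qualifier on the single key.

```java
import java.util.Iterator;

// Hypothetical minimal pair type.
final class Pair<A, B> {
    final A first;
    final B second;

    Pair(A first, B second) {
        this.first = first;
        this.second = second;
    }
}

// Stand-in for the proposed abstract method, using simplified key and value types.
abstract class KeyValueReducer {
    public abstract Pair<String[], Long> reduce(Iterator<Pair<String[], Long>> iter);
}

// Sums the values of one group and returns a key whose column qualifier is blank.
final class SummingReducer extends KeyValueReducer {
    @Override
    public Pair<String[], Long> reduce(Iterator<Pair<String[], Long>> iter) {
        String[] firstKey = null;
        long total = 0;
        while (iter.hasNext()) {
            Pair<String[], Long> kv = iter.next();
            if (firstKey == null) {
                firstKey = kv.first; // remember row and family from the first key
            }
            total += kv.second;
        }
        if (firstKey == null) {
            return null; // nothing was passed in, so nothing to return
        }
        return new Pair<>(new String[] {firstKey[0], firstKey[1], ""}, total);
    }
}
```

Because the reducer sees the keys, it could equally concatenate qualifiers and ignore the values; the cost is that it now controls the returned key.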
Keith
-
Re: org.apache.accumulo.core.iterators.Combiner: key scope?
Aaron Cordova 2012-03-19, 20:09
I suppose this would be a bad time to bring up the idea of returning more than one Pair ..
The original semantics of reduce() from Lisp is to compact everything down into one object, but the original MapReduce semantics allow reduce and map functions to emit() as many new KV pairs as one desires. Bringing Accumulo's reduce() function closer to the usage of MapReduce's reduce() might not introduce a huge amount of cognitive load on users, especially if they are coming from the MapReduce world.
However, I am strongly in favor of avoiding over-generalized and complicated APIs, and am certainly willing to deal with the constraint of only returning one Pair if everyone feels this will keep adoption and usage easy and simple.
-
Re: org.apache.accumulo.core.iterators.Combiner: key scope?
Keith Turner 2012-03-19, 20:28
I think reducing to multiple is OK. The important part is getting the API right. What API were you thinking of? Even if we do not do it, it's nice to explore it and know what our options are.
One thing that I realized about returning a key or keys is that it gives the user a chance to return something out of sorted order. This is a difference from the MapReduce model: the output of a MapReduce reducer need not be sorted. If the user generates keys out of order, this will not be caught until runtime. The API on the current Combiner does not give control over the key, which prevents this bug.
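To illustrate the "not caught until runtime" point, here is a sketch of the kind of check a framework could wrap around user-produced output. OrderCheck is a hypothetical helper, not something in Accumulo; it accepts any element type and comparator so it can be run without Accumulo's Key class.

```java
import java.util.Comparator;
import java.util.Iterator;

final class OrderCheck {
    // Wraps an iterator and throws as soon as an element sorts before its predecessor.
    static <T> Iterator<T> checked(Iterator<T> source, Comparator<? super T> order) {
        return new Iterator<T>() {
            private T previous;
            private boolean hasPrevious;

            @Override
            public boolean hasNext() {
                return source.hasNext();
            }

            @Override
            public T next() {
                T current = source.next();
                if (hasPrevious && order.compare(previous, current) > 0) {
                    throw new IllegalStateException(
                            "out of order: " + current + " after " + previous);
                }
                previous = current;
                hasPrevious = true;
                return current;
            }
        };
    }
}
```

The check only fires while the data is being read, which is exactly the drawback described above: nothing at compile time stops a reducer from emitting keys in the wrong order.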
Keith
-
Re: org.apache.accumulo.core.iterators.Combiner: key scope?
Billie J Rinaldi 2012-03-19, 20:31
The iterator will have to decide which key/value pairs to pass to the reduce method, presumably using a PartialKey. PartialKey.ROW would pass an entire row to reduce, PartialKey.ROW_COLFAM would pass a column family of a row, etc. So the prefix of every key passed to the reduce would be the same, and the prefix of the Key(s) returned would have to be the same as well. Would we just ignore the prefix of the returned Key and fill in the expected prefix? Or would we throw an error if the method produced a Key with a different prefix?
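The two policies in question, filling in the expected prefix versus rejecting a mismatch, could look roughly like this. PrefixPolicy and its methods are hypothetical names, keys are simplified to String[] fields in row, colFam, colQual, colVis order, and depth is the number of leading fields fixed by the chosen PartialKey (1 for ROW, 2 for ROW_COLFAM, and so on).

```java
import java.util.Arrays;

final class PrefixPolicy {
    // Policy 1: ignore whatever prefix the reducer produced and fill in the expected one.
    static String[] overwrite(String[] returned, String[] expected, int depth) {
        String[] fixed = Arrays.copyOf(returned, returned.length);
        System.arraycopy(expected, 0, fixed, 0, depth);
        return fixed;
    }

    // Policy 2: reject a returned key whose prefix differs from the input prefix.
    static String[] require(String[] returned, String[] expected, int depth) {
        for (int i = 0; i < depth; i++) {
            if (!returned[i].equals(expected[i])) {
                throw new IllegalArgumentException(
                        "key field " + i + " differs from the input prefix");
            }
        }
        return returned;
    }
}
```

Overwriting never fails but can silently hide a reducer bug; requiring a match surfaces the bug, though only at runtime.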
If we allow multiple Keys to be returned, we'll have to make sure they're sorted. We could have the reduce method return a SortedMap<Key,Value>, but it would have to fit in memory.
Billie
-
Re: org.apache.accumulo.core.iterators.Combiner: key scope?
Keith Turner 2012-03-19, 20:35
On Mon, Mar 19, 2012 at 4:09 PM, Aaron Cordova <[EMAIL PROTECTED]> wrote:
> The original semantics of reduce() from lisp is to compact everything down into one object .. but the original MapReduce semantics allow reduce and map functions to emit() as many new KV pairs as one desires. To bring Accumulo's reduce() function closer to the usage of MapReduce's reduce() might not introduce a huge amount of cognitive load on users, especially if they are coming from the MapReduce world.
Another thing that MapReduce allows is for a reducer to emit zero KVs. Users have asked before whether this was possible in a combiner/aggregator, i.e. the ability to filter. Allowing a combiner to do this can be more efficient than a Combiner+Filter, because the Filter may need to redo computation that the Combiner just did in order to make a decision.
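As an illustration of a reduce step that may emit nothing, the sketch below sums a group and drops it when the total is under a threshold. ThresholdSum and the use of Optional as the "zero or one result" signal are assumptions for this example, not an existing or proposed Accumulo signature.

```java
import java.util.Iterator;
import java.util.Optional;

final class ThresholdSum {
    // Sums the values and emits the total only if it reaches the threshold.
    // Optional.empty() means "emit nothing" for this group.
    static Optional<Long> reduce(Iterator<Long> values, long threshold) {
        long total = 0;
        while (values.hasNext()) {
            total += values.next();
        }
        return total >= threshold ? Optional.of(total) : Optional.empty();
    }
}
```

A separate filter placed after a plain summing combiner would reach the same result here by inspecting the combined value; the saving comes in cases where the decision depends on intermediate state that only the combiner has.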
Keith
-
Re: org.apache.accumulo.core.iterators.Combiner: key scope?
Keith Turner 2012-03-19, 20:44
On Mon, Mar 19, 2012 at 4:09 PM, Aaron Cordova <[EMAIL PROTECTED]> wrote:
> However, I am strongly in favor of avoiding over-generalized and complicated APIs, and am certainly willing to deal with the constraint of only returning one Pair if everyone feels this will keep adoption and usage easy and simple.
Thinking about this, Combiners make it easier to write a certain type of Iterator. Combiner++ would need to maintain this property. If using Combiner++ is harder than writing an iterator that accomplishes the same thing, then Combiner++ has no point.
Keith
-
Re: org.apache.accumulo.core.iterators.Combiner: key scope?
Keith Turner 2012-03-19, 21:38
On Mon, Mar 19, 2012 at 4:31 PM, Billie J Rinaldi <[EMAIL PROTECTED]> wrote:
> If we allow multiple Keys to be returned, we'll have to make sure they're sorted. We could have the reduce method return a SortedMap<Key,Value>, but it would have to fit in memory.

Billie,

We have discussed this issue before and you found one cool way to avoid buffering data in memory: generators. Unfortunately Java does not support this without creating extra threads.

http://en.wikipedia.org/wiki/Generator_(computer_programming)

Keith
-
Re: org.apache.accumulo.core.iterators.Combiner: key scope?
Aaron Cordova 2012-03-20, 12:46
Returning 0 to 1 KV pair, or just a value, would be nice, and less of a change than 0 to N KV pairs.
-
Re: org.apache.accumulo.core.iterators.Combiner: key scope?
Aaron Cordova 2012-03-20, 12:49
On Mar 19, 2012, at 4:28 PM, Keith Turner wrote:
> One thing that I realized about returning a key or keys, is that it gives the user a chance to return something out of sorted order. This is a difference w/ the map reduce model, the output of a map reduce reducer need not be sorted.
Right, but that's true of the output of Map() and the framework just sorts the KV pairs for you.
However, I don't see a good way for Accumulo to maintain global sort order of a list of KV pairs from reduce() now, so maybe that's reason enough not to do it.
> If the user generates keys out of order, this will not be caught until runtime. The API on the current combiner does not give control over the key. So that prevents this bug.
>
> Keith