|
Matt Corgan
2012-12-09, 07:27
lars hofhansl
2012-12-09, 20:52
Matt Corgan
2012-12-10, 01:03
Stack
2012-12-10, 20:24
Stack
2012-12-10, 20:26
lars hofhansl
2012-12-11, 06:56
Matt Corgan
2012-12-11, 07:16
lars hofhansl
2012-12-11, 07:28
Matt Corgan
2012-12-11, 08:08
Dave Latham
2012-12-11, 18:54
|
-
InternalScanner next(..) methodsMatt Corgan 2012-12-09, 07:27
I'm looking at the KeyValueHeap trying to see how we can make it work with
Cells. I'm curious, in this method @Override public boolean next(List<KeyValue> result, int limit, String metric) throws IOException { if (this.current == null) { return false; } InternalScanner currentAsInternal = (InternalScanner)this.current; boolean mayContainMoreRows = currentAsInternal.next(result, limit, metric); how is it getting multiple results from a single scanner without putting the scanner back on the heap? Couldn't that skip KeyValues? Is it that it's only used at the Region level where the family-per-file semantics guarantee that all KeyValues in a single family will sort together? My bigger question is regarding the next(List<KeyValue> result, int limit) methods from the InternalScanner interface. What's the reasoning for getting multiple results in one call as opposed to calling the next() method a bunch of times? Buffering the KeyValues in a List like that means the Cells would have to be expanded into full KeyValues which would be nice to avoid. Is there some logic that depends on getting a whole row of values, even though you may only get a partial row due to the limit param? Similarly, I see there is Filter.filterRow(List<KeyValue>) which looks like it's barely used. Is that an important method? Doesn't look like it's used much, but maybe people have custom Filters that need it. Thanks, Matt
-
Re: InternalScanner next(..) methodslars hofhansl 2012-12-09, 20:52
This method specifically only works when this is a heap of StoreScanners (i.e. on the RegionScanner level), which is very confusing (to me anyway).
Maybe we should have two separate KeyValueHeap implementation to make it less confusing. The list here comprises KVs for the same row key. These KVs need to be collected together so that Filters can operate on entire rows. I just looked at that code this week. We need to fix this stuff. :) -- Lars ________________________________ From: Matt Corgan <[EMAIL PROTECTED]> To: dev <[EMAIL PROTECTED]> Sent: Saturday, December 8, 2012 11:27 PM Subject: InternalScanner next(..) methods I'm looking at the KeyValueHeap trying to see how we can make it work with Cells. I'm curious, in this method @Override public boolean next(List<KeyValue> result, int limit, String metric) throws IOException { if (this.current == null) { return false; } InternalScanner currentAsInternal = (InternalScanner)this.current; boolean mayContainMoreRows = currentAsInternal.next(result, limit, metric); how is it getting multiple results from a single scanner without putting the scanner back on the heap? Couldn't that skip KeyValues? Is it that it's only used at the Region level where the family-per-file semantics guarantee that all KeyValues in a single family will sort together? My bigger question is regarding the next(List<KeyValue> result, int limit) methods from the InternalScanner interface. What's the reasoning for getting multiple results in one call as opposed to calling the next() method a bunch of times? Buffering the KeyValues in a List like that means the Cells would have to be expanded into full KeyValues which would be nice to avoid. Is there some logic that depends on getting a whole row of values, even though you may only get a partial row due to the limit param? Similarly, I see there is Filter.filterRow(List<KeyValue>) which looks like it's barely used. Is that an important method? Doesn't look like it's used much, but maybe people have custom Filters that need it. Thanks, Matt
-
Re: InternalScanner next(..) methodsMatt Corgan 2012-12-10, 01:03
Seems like the two uses of the KeyValueHeap hive diverged enough that they
deserve separate implementations, like you say Lars. RegionHeap needs the row buffering, but StoreHeap apparently does not. Does the RegionHeap need lazy-seek methods? If we separate them we might be able to get Cells through the StoreHeap which will buy us some big speed increases, especially on compactions where we don't have to use the RegionHeap. During a compaction, if we can pass each Cell immediately from the heap to the CellOutputStream, then we never need to inflate it into a KeyValue. Looking at the usage of "List<KeyValue> kvs" at the bottom of Compactor.java, i don't see anything preventing this. Looks like the List<KeyValue> is just used to move 10 cells at a time instead of 1, which i don't think buys much. On Sun, Dec 9, 2012 at 12:52 PM, lars hofhansl <[EMAIL PROTECTED]> wrote: > This method specifically only works when this is a heap of StoreScanners > (i.e. on the RegionScanner level), which is very confusing (to me anyway). > Maybe we should have two separate KeyValueHeap implementation to make it > less confusing. > > The list here comprises KVs for the same row key. These KVs need to be > collected together so that Filters can operate on entire rows. > > I just looked at that code this week. We need to fix this stuff. :) > > > -- Lars > > > > ________________________________ > From: Matt Corgan <[EMAIL PROTECTED]> > To: dev <[EMAIL PROTECTED]> > Sent: Saturday, December 8, 2012 11:27 PM > Subject: InternalScanner next(..) methods > > I'm looking at the KeyValueHeap trying to see how we can make it work with > Cells. I'm curious, in this method > > @Override > public boolean next(List<KeyValue> result, int limit, String metric) > throws IOException { > if (this.current == null) { > return false; > } > InternalScanner currentAsInternal = (InternalScanner)this.current; > boolean mayContainMoreRows = currentAsInternal.next(result, limit, > metric); > > how is it getting multiple results from a single scanner without putting > the scanner back on the heap? Couldn't that skip KeyValues? Is it that > it's only used at the Region level where the family-per-file semantics > guarantee that all KeyValues in a single family will sort together? > > My bigger question is regarding the next(List<KeyValue> result, int limit) > methods from the InternalScanner interface. What's the reasoning for > getting multiple results in one call as opposed to calling the next() > method a bunch of times? Buffering the KeyValues in a List like that means > the Cells would have to be expanded into full KeyValues which would be nice > to avoid. Is there some logic that depends on getting a whole row of > values, even though you may only get a partial row due to the limit param? > > Similarly, I see there is Filter.filterRow(List<KeyValue>) which looks like > it's barely used. Is that an important method? Doesn't look like it's > used much, but maybe people have custom Filters that need it. > > Thanks, > Matt >
-
Re: InternalScanner next(..) methodsStack 2012-12-10, 20:24
On Sat, Dec 8, 2012 at 11:27 PM, Matt Corgan <[EMAIL PROTECTED]> wrote:
> I'm looking at the KeyValueHeap trying to see how we can make it work with > Cells. I'm curious, in this method > > @Override > public boolean next(List<KeyValue> result, int limit, String metric) > throws IOException { > if (this.current == null) { > return false; > } > InternalScanner currentAsInternal = (InternalScanner)this.current; > boolean mayContainMoreRows = currentAsInternal.next(result, limit, > metric); > > how is it getting multiple results from a single scanner without putting > the scanner back on the heap? Couldn't that skip KeyValues? Is it that > it's only used at the Region level where the family-per-file semantics > guarantee that all KeyValues in a single family will sort together? > > Sort of like Lars, if it strikes the likes of you as voodoo, then it needs fixing. > My bigger question is regarding the next(List<KeyValue> result, int limit) > methods from the InternalScanner interface. What's the reasoning for > getting multiple results in one call as opposed to calling the next() > method a bunch of times? Buffering the KeyValues in a List like that means > the Cells would have to be expanded into full KeyValues which would be nice > to avoid. Is there some logic that depends on getting a whole row of > values, even though you may only get a partial row due to the limit param? > > +1 on no buffering as we go up through the server layers > Similarly, I see there is Filter.filterRow(List<KeyValue>) which looks like > it's barely used. Is that an important method? Doesn't look like it's > used much, but maybe people have custom Filters that need it. > +1 on removing this method if messes us up; "custom" filters that "may" be using it is not reason to keep it in 0.96 -- the singularity. St.Ack
-
Re: InternalScanner next(..) methodsStack 2012-12-10, 20:26
On Sun, Dec 9, 2012 at 12:52 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:
> This method specifically only works when this is a heap of StoreScanners > (i.e. on the RegionScanner level), which is very confusing (to me anyway). > Maybe we should have two separate KeyValueHeap implementation to make it > less confusing. > > The list here comprises KVs for the same row key. These KVs need to be > collected together so that Filters can operate on entire rows. > > We should change Filter Interface instead giving it a stream rather than "whole row"? > I just looked at that code this week. We need to fix this stuff. :) > > +1 St.Ack
-
Re: InternalScanner next(..) methodslars hofhansl 2012-12-11, 06:56
The offending method here is:
public void filterRow(List<KeyValue> kvs); on Filter.java But even if we changed that to accept a stream of KVs, how are we going to make sure the filter code does not hold on to the KVs it received? As long as we allow custom code in filter we cannot reuse the memory backing the KVs, unless we force the filter code to make a copy of any KV it wants to hold on (which would make the above method extremely expensive). Of our pre-canned filters only DependentColumnFilter implements this method. -- Lars ________________________________ From: Stack <[EMAIL PROTECTED]> To: HBase Dev List <[EMAIL PROTECTED]>; lars hofhansl <[EMAIL PROTECTED]> Sent: Monday, December 10, 2012 12:26 PM Subject: Re: InternalScanner next(..) methods On Sun, Dec 9, 2012 at 12:52 PM, lars hofhansl <[EMAIL PROTECTED]> wrote: > This method specifically only works when this is a heap of StoreScanners > (i.e. on the RegionScanner level), which is very confusing (to me anyway). > Maybe we should have two separate KeyValueHeap implementation to make it > less confusing. > > The list here comprises KVs for the same row key. These KVs need to be > collected together so that Filters can operate on entire rows. > > We should change Filter Interface instead giving it a stream rather than "whole row"? > I just looked at that code this week. We need to fix this stuff. :) > > +1 St.Ack
-
Re: InternalScanner next(..) methodsMatt Corgan 2012-12-11, 07:16
>
> unless we force the filter code to make a copy of any KV it wants to hold > on (which would make the above method extremely expensive). I was thinking we force the copy for the few filters that do need it. Then it's no more expensive than the current situation where all filters are forcing the copy? Right now it's just happening lower down the stack. We could add a CellTool.enforceIndependent(cell) method that only makes a copy if it's not already a KeyValue or other non-transient implementation. Slightly more robust than doing "instanceof KeyValue" check. Trying to understand too, how does this filter method work if limit=1 is passed to the next(results, limit) method)? Or any time the limit is less than the row width. On Mon, Dec 10, 2012 at 10:56 PM, lars hofhansl <[EMAIL PROTECTED]> wrote: > The offending method here is: > > public void filterRow(List<KeyValue> kvs); > on Filter.java > > But even if we changed that to accept a stream of KVs, how are we going to > make sure the filter code does not hold on to the KVs it received? > As long as we allow custom code in filter we cannot reuse the memory > backing the KVs, unless we force the filter code to make a copy of any KV > it wants to hold on (which would make the above method extremely expensive). > > > Of our pre-canned filters only DependentColumnFilter implements this > method. > > > -- Lars > > > > ________________________________ > From: Stack <[EMAIL PROTECTED]> > To: HBase Dev List <[EMAIL PROTECTED]>; lars hofhansl < > [EMAIL PROTECTED]> > Sent: Monday, December 10, 2012 12:26 PM > Subject: Re: InternalScanner next(..) methods > > On Sun, Dec 9, 2012 at 12:52 PM, lars hofhansl <[EMAIL PROTECTED]> > wrote: > > > This method specifically only works when this is a heap of StoreScanners > > (i.e. on the RegionScanner level), which is very confusing (to me > anyway). > > Maybe we should have two separate KeyValueHeap implementation to make it > > less confusing. > > > > The list here comprises KVs for the same row key. These KVs need to be > > collected together so that Filters can operate on entire rows. > > > > > We should change Filter Interface instead giving it a stream rather than > "whole row"? > > > > > I just looked at that code this week. We need to fix this stuff. :) > > > > > +1 > > St.Ack >
-
Re: InternalScanner next(..) methodslars hofhansl 2012-12-11, 07:28
Well... To be fair, right now KVs are only copied when some encoding was used, otherwise they are not copied at all.
enforceIndependent sounds like a reasonable approach, or maybe copyIfNeeded, or whatever. Some filters (those that use filterRow(List)) are not compatible with batching. Also note that your needs for PrefixTries might be very specific. The other encodings always store - as far as I can see at least - whole row keys and whole column families and qualifiers that we principally could reuse for the lifetime of the row without breaking existing behavior if the KeyValue interface was changed accordingly. -- Lars ________________________________ From: Matt Corgan <[EMAIL PROTECTED]> To: dev <[EMAIL PROTECTED]>; lars hofhansl <[EMAIL PROTECTED]> Sent: Monday, December 10, 2012 11:16 PM Subject: Re: InternalScanner next(..) methods > > unless we force the filter code to make a copy of any KV it wants to hold > on (which would make the above method extremely expensive). I was thinking we force the copy for the few filters that do need it. Then it's no more expensive than the current situation where all filters are forcing the copy? Right now it's just happening lower down the stack. We could add a CellTool.enforceIndependent(cell) method that only makes a copy if it's not already a KeyValue or other non-transient implementation. Slightly more robust than doing "instanceof KeyValue" check. Trying to understand too, how does this filter method work if limit=1 is passed to the next(results, limit) method)? Or any time the limit is less than the row width. On Mon, Dec 10, 2012 at 10:56 PM, lars hofhansl <[EMAIL PROTECTED]> wrote: > The offending method here is: > > public void filterRow(List<KeyValue> kvs); > on Filter.java > > But even if we changed that to accept a stream of KVs, how are we going to > make sure the filter code does not hold on to the KVs it received? > As long as we allow custom code in filter we cannot reuse the memory > backing the KVs, unless we force the filter code to make a copy of any KV > it wants to hold on (which would make the above method extremely expensive). > > > Of our pre-canned filters only DependentColumnFilter implements this > method. > > > -- Lars > > > > ________________________________ > From: Stack <[EMAIL PROTECTED]> > To: HBase Dev List <[EMAIL PROTECTED]>; lars hofhansl < > [EMAIL PROTECTED]> > Sent: Monday, December 10, 2012 12:26 PM > Subject: Re: InternalScanner next(..) methods > > On Sun, Dec 9, 2012 at 12:52 PM, lars hofhansl <[EMAIL PROTECTED]> > wrote: > > > This method specifically only works when this is a heap of StoreScanners > > (i.e. on the RegionScanner level), which is very confusing (to me > anyway). > > Maybe we should have two separate KeyValueHeap implementation to make it > > less confusing. > > > > The list here comprises KVs for the same row key. These KVs need to be > > collected together so that Filters can operate on entire rows. > > > > > We should change Filter Interface instead giving it a stream rather than > "whole row"? > > > > > I just looked at that code this week. We need to fix this stuff. :) > > > > > +1 > > St.Ack >
-
Re: InternalScanner next(..) methodsMatt Corgan 2012-12-11, 08:08
>
> The other encodings always store whole row keys and whole column families > and qualifiers.. I think you are right but are forgetting about the value part of the cell. The other encoders are somewhere between the prefix-tree and the original KeyValue format. They concatenate row/family/qualifier/timestamp/type together into a single array, but value lives separately in a disconnected part of the data block. So they also require the Cell interface to get through the read path unless you want to copy the value over into the "buffer" before spawning the KeyValue wrapper. You can see the current code copying key+value together in BufferedEncodedSeeker.getKeyValueBuffer(). On Mon, Dec 10, 2012 at 11:28 PM, lars hofhansl <[EMAIL PROTECTED]> wrote: > Well... To be fair, right now KVs are only copied when some encoding was > used, otherwise they are not copied at all. > enforceIndependent sounds like a reasonable approach, or maybe > copyIfNeeded, or whatever. > > > Some filters (those that use filterRow(List)) are not compatible with > batching. > > > Also note that your needs for PrefixTries might be very specific. The > other encodings always store - as far as I can see at least - whole row > keys and whole column families and qualifiers that we principally could > reuse for the lifetime of the row without breaking existing behavior if the > KeyValue interface was changed accordingly. > > > -- Lars > > > > ________________________________ > From: Matt Corgan <[EMAIL PROTECTED]> > To: dev <[EMAIL PROTECTED]>; lars hofhansl <[EMAIL PROTECTED]> > Sent: Monday, December 10, 2012 11:16 PM > Subject: Re: InternalScanner next(..) methods > > > > > unless we force the filter code to make a copy of any KV it wants to hold > > on (which would make the above method extremely expensive). > > I was thinking we force the copy for the few filters that do need it. Then > it's no more expensive than the current situation where all filters are > forcing the copy? Right now it's just happening lower down the stack. > > We could add a CellTool.enforceIndependent(cell) method that only makes a > copy if it's not already a KeyValue or other non-transient implementation. > Slightly more robust than doing "instanceof KeyValue" check. > > Trying to understand too, how does this filter method work if limit=1 is > passed to the next(results, limit) method)? Or any time the limit is less > than the row width. > > On Mon, Dec 10, 2012 at 10:56 PM, lars hofhansl <[EMAIL PROTECTED]> > wrote: > > > The offending method here is: > > > > public void filterRow(List<KeyValue> kvs); > > on Filter.java > > > > But even if we changed that to accept a stream of KVs, how are we going > to > > make sure the filter code does not hold on to the KVs it received? > > As long as we allow custom code in filter we cannot reuse the memory > > backing the KVs, unless we force the filter code to make a copy of any KV > > it wants to hold on (which would make the above method extremely > expensive). > > > > > > Of our pre-canned filters only DependentColumnFilter implements this > > method. > > > > > > -- Lars > > > > > > > > ________________________________ > > From: Stack <[EMAIL PROTECTED]> > > To: HBase Dev List <[EMAIL PROTECTED]>; lars hofhansl < > > [EMAIL PROTECTED]> > > Sent: Monday, December 10, 2012 12:26 PM > > Subject: Re: InternalScanner next(..) methods > > > > On Sun, Dec 9, 2012 at 12:52 PM, lars hofhansl <[EMAIL PROTECTED]> > > wrote: > > > > > This method specifically only works when this is a heap of > StoreScanners > > > (i.e. on the RegionScanner level), which is very confusing (to me > > anyway). > > > Maybe we should have two separate KeyValueHeap implementation to make > it > > > less confusing. > > > > > > The list here comprises KVs for the same row key. These KVs need to be > > > collected together so that Filters can operate on entire rows. > > > > > > > > We should change Filter Interface instead giving it a stream rather than
-
Re: InternalScanner next(..) methodsDave Latham 2012-12-11, 18:54
On Mon, Dec 10, 2012 at 12:26 PM, Stack <[EMAIL PROTECTED]> wrote:
> We should change Filter Interface instead giving it a stream rather than > "whole row"? > We use many custom filter, but none of them rely on this method. The filter already receives a stream of KVs, so I don't think it needs to be changed. +1 for removing it. |