Re: Essential column family performance
I tried using reseek() as suggested, along with my patch from HBASE-8306 (30%
selection rate, random distribution and FAST_DIFF encoding on both column
families).
I got uneven results:

2013-04-09 16:59:01,324 INFO  [main] regionserver.TestJoinedScanners(167):
Slow scanner finished in 7.529083 seconds, got 1546 rows

2013-04-09 16:59:06,760 INFO  [main] regionserver.TestJoinedScanners(167):
Joined scanner finished in 5.43579 seconds, got 1546 rows
...
2013-04-09 16:59:12,711 INFO  [main] regionserver.TestJoinedScanners(167):
Slow scanner finished in 5.95016 seconds, got 1546 rows

2013-04-09 16:59:20,240 INFO  [main] regionserver.TestJoinedScanners(167):
Joined scanner finished in 7.529044 seconds, got 1546 rows

FYI

On Tue, Apr 9, 2013 at 4:47 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:

> We did some tests here.
> I ran this through the profiler against a local RegionServer and found the
> part that causes the slowdown is a seek called here:
>     boolean mayHaveData =
>         (nextJoinedKv != null
>             && nextJoinedKv.matchingRow(currentRow, offset, length))
>         ||
>         (this.joinedHeap.seek(KeyValue.createFirstOnRow(currentRow, offset, length))
>             && joinedHeap.peek() != null
>             && joinedHeap.peek().matchingRow(currentRow, offset, length));
>
> Looking at the code, this is needed because the joinedHeap can fall
> behind, and hence we have to catch it up.
> The key observation, though, is that the joined heap can only ever be
> behind, and hence we do not need a seek, but only a reseek.
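
To illustrate the seek versus reseek distinction described above, here is a toy model in plain Java. It is only a sketch and not HBase's KeyValueHeap: the class ToyCursor, its step counter, and the assumption that a general seek restarts from the beginning are all hypothetical simplifications. The point it shows is that when the cursor is known to be at or before the target, continuing forward from the current position does less work than repositioning from scratch.

```java
import java.util.Arrays;

// Toy model only (hypothetical names); this is NOT how HBase implements its scanners.
class ToyCursor {
    private final long[] sortedKeys;
    private int pos = 0;
    long steps = 0; // number of keys skipped so far

    ToyCursor(long[] keys) {
        this.sortedKeys = keys.clone();
        Arrays.sort(this.sortedKeys);
    }

    // General seek: assumes nothing about the current position,
    // so in this toy model it restarts from the first key.
    boolean seek(long target) {
        pos = 0;
        return advanceTo(target);
    }

    // Reseek: only valid when the target is at or after the current position,
    // which lets it continue forward from where the cursor already is.
    boolean reseek(long target) {
        return advanceTo(target);
    }

    private boolean advanceTo(long target) {
        while (pos < sortedKeys.length && sortedKeys[pos] < target) {
            pos++;
            steps++;
        }
        return pos < sortedKeys.length;
    }

    // Returns the key under the cursor, or null if the cursor is exhausted.
    Long peek() {
        return pos < sortedKeys.length ? sortedKeys[pos] : null;
    }

    // Runs a series of ascending lookups and returns the total number of steps taken.
    static long costOfAscendingLookups(long[] keys, long[] targets, boolean useReseek) {
        ToyCursor cursor = new ToyCursor(keys);
        for (long t : targets) {
            if (useReseek) {
                cursor.reseek(t);
            } else {
                cursor.seek(t);
            }
        }
        return cursor.steps;
    }

    public static void main(String[] args) {
        long[] keys = new long[1000];
        for (int i = 0; i < keys.length; i++) {
            keys[i] = i;
        }
        long[] targets = {100, 400, 900};
        System.out.println("seek steps:   " + costOfAscendingLookups(keys, targets, false));
        System.out.println("reseek steps: " + costOfAscendingLookups(keys, targets, true));
    }
}
```

Both variants land on the same key for ascending targets; only the amount of work differs. The real cost difference in a RegionServer depends on block indexes, encoding, and caching, none of which this sketch models.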
>
> Deploying a RegionServer with the seek replaced with reseek we see an
> improvement in *all* cases.
>
> I'll file a jira with a fix later.
>
> -- Lars
>
>
>
> ________________________________
>  From: James Taylor <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Sent: Monday, April 8, 2013 6:53 PM
> Subject: Re: Essential column family performance
>
> Good idea, Sergey. We'll rerun with larger non essential column family
> values and see if there's a crossover point. One other difference for us
> is that we're using FAST_DIFF encoding. We'll try with no encoding too.
> Our table has 20 million rows across four region servers.
>
> Regarding the parallelization we do, we run multiple scans in parallel
> instead of one single scan over the table. We use the region boundaries
> of the table to divide up the work evenly, adding a start/stop key for
> each scan that corresponds to the region boundaries. Our client then
> does a final merge/aggregation step (i.e. adding up the count it gets
> back from the scan for each region).
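
The parallelization described above (one scan per region range, then a client-side sum) can be sketched in plain Java as follows. This is a minimal illustration and not the actual client code: RangeCounter is a hypothetical stand-in for "run one scan over [startKey, stopKey) and return its row count", and the in-memory sorted set in main stands in for the table.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelRegionCount {
    // Hypothetical stand-in for a single scan bounded by a start and stop key.
    interface RangeCounter {
        long count(String startKey, String stopKey);
    }

    // boundaries holds N+1 sorted keys delimiting N consecutive ranges (one per region).
    static long countAll(List<String> boundaries, RangeCounter counter, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> parts = new ArrayList<>();
            for (int i = 0; i + 1 < boundaries.size(); i++) {
                final String start = boundaries.get(i);
                final String stop = boundaries.get(i + 1);
                Callable<Long> task = () -> counter.count(start, stop);
                parts.add(pool.submit(task));
            }
            // Final merge/aggregation step: add up the per-range counts.
            long total = 0;
            for (Future<Long> part : parts) {
                total += part.get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        } catch (ExecutionException e) {
            throw new RuntimeException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // In-memory stand-in for a table: 1000 sorted hex row keys.
        NavigableSet<String> table = new TreeSet<>();
        for (int i = 0; i < 1000; i++) {
            table.add(String.format("%04x", i));
        }
        // Four assumed "regions" delimited by five boundary keys.
        List<String> boundaries = Arrays.asList("0000", "0100", "0200", "0300", "0400");
        RangeCounter counter = (start, stop) -> table.subSet(start, true, stop, false).size();
        System.out.println("total rows: " + countAll(boundaries, counter, 4));
    }
}
```

Because each range is half-open, no row is counted twice at a region boundary, so the per-range counts can simply be summed.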
>
> On 04/08/2013 01:34 PM, Sergey Shelukhin wrote:
> > IntegrationTestLazyCfLoading uses randomly distributed keys with the
> > following condition for filtering:
> > 1 == (Long.parseLong(Bytes.toString(rowKey, 0, 4), 16) & 1); where rowKey
> > is hex string of MD5 key.
> > Then, there are 2 "lazy" CFs, each of which has a value of 4-64k.
> > This test also showed significant improvement IIRC, so random distribution
> > and a high percentage of values selected should not be a problem as such.
> >
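
For reference, the filtering condition quoted above can be reproduced outside HBase. The sketch below is an assumption-laden illustration: it replaces Bytes.toString(rowKey, 0, 4) with a plain substring of the hex string, and the "row-" + i source keys are hypothetical, since the thread does not say what is hashed. It shows that the condition keeps rows whose first four hex digits form an odd number, which for MD5-derived keys is roughly half of them.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class LazyCfFilterCondition {
    // Hex string of the MD5 digest of the input, used here as the row key.
    static String md5Hex(String input) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Same condition as in the message: parse the first 4 hex characters
    // and keep the row if the lowest bit is set.
    static boolean selected(String hexRowKey) {
        return 1 == (Long.parseLong(hexRowKey.substring(0, 4), 16) & 1);
    }

    // Fraction of n hypothetical rows ("row-0", "row-1", ...) that pass the condition.
    static double selectionRate(int n) {
        int hits = 0;
        for (int i = 0; i < n; i++) {
            if (selected(md5Hex("row-" + i))) {
                hits++;
            }
        }
        return (double) hits / n;
    }

    public static void main(String[] args) {
        System.out.println("selection rate: " + selectionRate(10000));
    }
}
```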
> > My hunch would be that the additional cost of seeks/merging the results
> > from two CFs outweighs the benefit of lazy loading on such small values
> > for the "lazy" CF with lots of data selected. This feature definitely makes
> > no sense if you are selecting all values, because then extra work is being
> > done for no benefit (everything is read anyway).
> > So the use cases would be larger "lazy" CFs and/or a low percentage of
> > values selected.
> >
> > Can you try to increase the 2nd CF values' size and rerun the test?
> >
> >
> > On Mon, Apr 8, 2013 at 10:38 AM, James Taylor <[EMAIL PROTECTED]> wrote:
> >
> >> In the TestJoinedScanners.java, is the 40% randomly distributed or
> >> sequential?
> >>
> >> In our test, the % is randomly distributed. Also, our custom filter does
> >> the same thing that SingleColumnValueFilter does.  On the client-side,