HBase, mail # user - Essential column family performance


Re: Essential column family performance
Lars H 2013-04-10, 01:05
Your slow scanner performance seems to vary as well. How come? Slow is with the feature off.

I don't see how reseek can be slower than seek in any scenario.

-- Lars
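
To illustrate the seek-versus-reseek point discussed in this thread, here is a minimal toy sketch in plain Java. It is not HBase's KeyValueHeap: the ToyScanner class, its step counter, and the linear restart inside seek() are assumptions made purely for illustration (a real seek consults indexes rather than walking from the first key). The only property taken from the thread is that the scanner being caught up is never ahead of the target row, so a forward-only reseek is sufficient.

```java
public class ReseekSketch {
    /** Toy forward-only scanner over sorted keys; a stand-in, NOT HBase code. */
    static final class ToyScanner {
        private final String[] sortedKeys;
        private int pos = 0;
        long steps = 0; // number of keys skipped over, as a rough cost measure

        ToyScanner(String[] sortedKeys) {
            this.sortedKeys = sortedKeys;
        }

        String peek() {
            return pos < sortedKeys.length ? sortedKeys[pos] : null;
        }

        /** Forgets the current position and searches again from the start. */
        boolean seek(String key) {
            pos = 0;
            return advanceTo(key);
        }

        /** Only valid when key is at or after the current position. */
        boolean reseek(String key) {
            return advanceTo(key);
        }

        private boolean advanceTo(String key) {
            while (pos < sortedKeys.length && sortedKeys[pos].compareTo(key) < 0) {
                pos++;
                steps++;
            }
            return pos < sortedKeys.length;
        }
    }

    static String rowKey(int i) {
        return String.format("row%06d", i);
    }

    /** Catches the scanner up to every stride-th row and returns the total cost. */
    static long catchUp(boolean useReseek, int rowCount, int stride) {
        String[] keys = new String[rowCount];
        for (int i = 0; i < rowCount; i++) {
            keys[i] = rowKey(i);
        }
        ToyScanner scanner = new ToyScanner(keys);
        for (int i = 0; i < rowCount; i += stride) {
            String target = rowKey(i);
            boolean found = useReseek ? scanner.reseek(target) : scanner.seek(target);
            if (!found || !target.equals(scanner.peek())) {
                throw new IllegalStateException("scanner did not land on " + target);
            }
        }
        return scanner.steps;
    }

    public static void main(String[] args) {
        System.out.println("seek steps:   " + catchUp(false, 1000, 10));
        System.out.println("reseek steps: " + catchUp(true, 1000, 10));
    }
}
```

In this model both calls land on the same row every time; reseek simply never repeats work already done, which is the intuition behind replacing the seek when the scanner can only be behind.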

Ted Yu <[EMAIL PROTECTED]> wrote:

>I tried using reseek() as suggested, along with my patch from HBASE-8306 (30%
>selection rate, random distribution and FAST_DIFF encoding on both column
>families).
>I got uneven results:
>
>2013-04-09 16:59:01,324 INFO  [main] regionserver.TestJoinedScanners(167):
>Slow scanner finished in 7.529083 seconds, got 1546 rows
>
>2013-04-09 16:59:06,760 INFO  [main] regionserver.TestJoinedScanners(167):
>Joined scanner finished in 5.43579 seconds, got 1546 rows
>...
>2013-04-09 16:59:12,711 INFO  [main] regionserver.TestJoinedScanners(167):
>Slow scanner finished in 5.95016 seconds, got 1546 rows
>
>2013-04-09 16:59:20,240 INFO  [main] regionserver.TestJoinedScanners(167):
>Joined scanner finished in 7.529044 seconds, got 1546 rows
>
>FYI
>
>On Tue, Apr 9, 2013 at 4:47 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:
>
>> We did some tests here.
>> I ran this through the profiler against a local RegionServer and found the
>> part that causes the slowdown is a seek called here:
>>     boolean mayHaveData =
>>         (nextJoinedKv != null
>>             && nextJoinedKv.matchingRow(currentRow, offset, length))
>>         || (this.joinedHeap.seek(KeyValue.createFirstOnRow(currentRow, offset, length))
>>             && joinedHeap.peek() != null
>>             && joinedHeap.peek().matchingRow(currentRow, offset, length));
>>
>> Looking at the code, this is needed because the joinedHeap can fall
>> behind, and hence we have to catch it up.
>> The key observation, though, is that the joined heap can only ever be
>> behind, and hence we do not need a seek, but only a reseek.
>>
>> Deploying a RegionServer with the seek replaced with reseek we see an
>> improvement in *all* cases.
>>
>> I'll file a jira with a fix later.
>>
>> -- Lars
>>
>>
>>
>> ________________________________
>>  From: James Taylor <[EMAIL PROTECTED]>
>> To: [EMAIL PROTECTED]
>> Sent: Monday, April 8, 2013 6:53 PM
>> Subject: Re: Essential column family performance
>>
>> Good idea, Sergey. We'll rerun with larger non essential column family
>> values and see if there's a crossover point. One other difference for us
>> is that we're using FAST_DIFF encoding. We'll try with no encoding too.
>> Our table has 20 million rows across four region servers.
>>
>> Regarding the parallelization we do, we run multiple scans in parallel
>> instead of one single scan over the table. We use the region boundaries
>> of the table to divide up the work evenly, adding a start/stop key for
>> each scan that corresponds to the region boundaries. Our client then
>> does a final merge/aggregation step (i.e. adding up the count it gets
>> back from the scan for each region).
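
The per-region parallelization described above can be sketched roughly as follows. This is a self-contained model in plain Java, not HBase client code: the table is stood in for by a sorted in-memory set, and KeyRange, toRanges, countRange, and parallelCount are hypothetical names invented for this example. It assumes region start keys are given in sorted order, with the empty string marking the start of the table.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class RegionParallelCount {
    /** Half-open key range [start, stop); a null stop means "to end of table". */
    static final class KeyRange {
        final String start;
        final String stop;

        KeyRange(String start, String stop) {
            this.start = start;
            this.stop = stop;
        }
    }

    /** Turns sorted region start keys into one start/stop range per region. */
    static List<KeyRange> toRanges(List<String> regionStartKeys) {
        List<KeyRange> ranges = new ArrayList<>();
        for (int i = 0; i < regionStartKeys.size(); i++) {
            String stop = i + 1 < regionStartKeys.size() ? regionStartKeys.get(i + 1) : null;
            ranges.add(new KeyRange(regionStartKeys.get(i), stop));
        }
        return ranges;
    }

    /** Stand-in for a single scan bounded by a region's start/stop keys. */
    static long countRange(NavigableSet<String> rows, KeyRange range) {
        NavigableSet<String> slice = range.stop == null
                ? rows.tailSet(range.start, true)
                : rows.subSet(range.start, true, range.stop, false);
        return slice.size();
    }

    /** Runs one bounded count per region in parallel, then merges by summing. */
    static long parallelCount(NavigableSet<String> rows, List<String> regionStartKeys) {
        List<KeyRange> ranges = toRanges(regionStartKeys);
        ExecutorService pool = Executors.newFixedThreadPool(ranges.size());
        try {
            List<Future<Long>> partials = new ArrayList<>();
            for (KeyRange range : ranges) {
                Callable<Long> task = () -> countRange(rows, range);
                partials.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> partial : partials) {
                total += partial.get(); // final client-side merge step
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("parallel count failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        NavigableSet<String> rows = new TreeSet<>();
        for (int i = 0; i < 100; i++) {
            rows.add(String.format("row%03d", i));
        }
        List<String> startKeys = Arrays.asList("", "row025", "row050", "row075");
        System.out.println("total rows: " + parallelCount(rows, startKeys));
    }
}
```

Because the ranges are half-open and contiguous, every row falls into exactly one range, so the summed partial counts equal a single full-table count.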
>>
>> On 04/08/2013 01:34 PM, Sergey Shelukhin wrote:
>> > IntegrationTestLazyCfLoading uses randomly distributed keys with the
>> > following condition for filtering:
>> > 1 == (Long.parseLong(Bytes.toString(rowKey, 0, 4), 16) & 1); where rowKey
>> > is hex string of MD5 key.
>> > Then, there are 2 "lazy" CFs, each of which has a value of 4-64k.
>> > This test also showed significant improvement IIRC, so random distribution
>> > and a high percentage of values selected should not be a problem as such.
>> >
>> > My hunch would be that the additional cost of seeks/merging the results
>> > from two CFs outweighs the benefit of lazy loading on such small values
>> > for the "lazy" CF with lots of data selected. This feature definitely makes
>> > no sense if you are selecting all values, because then extra work is being
>> > done for no benefit (everything is read anyway).
>> > So the use cases would be larger "lazy" CFs and/or a low percentage of values
>> > selected.
>> >
>> > Can you try to increase the 2nd CF values' size and rerun the test?
>> >
>> >
>> > On Mon, Apr 8, 2013 at 10:38 AM, James Taylor <[EMAIL PROTECTED]