HBase, mail # user - Essential column family performance


Re: Essential column family performance
Lars Hofhansl 2013-04-10, 01:17
Also, the unit test uses only 10,000 rows, which would all fit in the memstore. Seek vs. reseek should make little difference for the memstore.

We tested with 1m and 10m rows, and flushed the memstore and compacted the store.
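A minimal sketch of how that setup can be reproduced from a client, assuming the 0.94-era HBaseAdmin API (the table name is hypothetical, and both operations are asynchronous on the server side):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class FlushAndCompact {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        try {
          // Push the memstore contents into HFiles so scans exercise the
          // store file seek/reseek path rather than the memstore.
          admin.flush("TestTable");        // hypothetical table name
          // Merge the flushed files into a single store file.
          admin.majorCompact("TestTable");
        } finally {
          admin.close();
        }
      }
    }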

Will do some more verification later tonight.

-- Lars
Lars H <[EMAIL PROTECTED]> wrote:

>Your slow scanner performance seems to vary as well. How come? ("Slow" is the run with the feature off.)
>
>I don't see how reseek can be slower than seek in any scenario.
>
>-- Lars
>
>Ted Yu <[EMAIL PROTECTED]> schrieb:
>
>>I tried using reseek() as suggested, along with my patch from HBASE-8306 (30%
>>selection rate, random distribution and FAST_DIFF encoding on both column
>>families).
>>I got uneven results:
>>
>>2013-04-09 16:59:01,324 INFO  [main] regionserver.TestJoinedScanners(167):
>>Slow scanner finished in 7.529083 seconds, got 1546 rows
>>
>>2013-04-09 16:59:06,760 INFO  [main] regionserver.TestJoinedScanners(167):
>>Joined scanner finished in 5.43579 seconds, got 1546 rows
>>...
>>2013-04-09 16:59:12,711 INFO  [main] regionserver.TestJoinedScanners(167):
>>Slow scanner finished in 5.95016 seconds, got 1546 rows
>>
>>2013-04-09 16:59:20,240 INFO  [main] regionserver.TestJoinedScanners(167):
>>Joined scanner finished in 7.529044 seconds, got 1546 rows
>>
>>FYI
>>
>>On Tue, Apr 9, 2013 at 4:47 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:
>>
>>> We did some tests here.
>>> I ran this through the profiler against a local RegionServer and found the
>>> part that causes the slowdown is a seek called here:
>>>               boolean mayHaveData =
>>>                   (nextJoinedKv != null &&
>>>                    nextJoinedKv.matchingRow(currentRow, offset, length))
>>>                   || (this.joinedHeap.seek(KeyValue.createFirstOnRow(currentRow, offset, length))
>>>                       && joinedHeap.peek() != null
>>>                       && joinedHeap.peek().matchingRow(currentRow, offset, length));
>>>
>>> Looking at the code, this is needed because the joinedHeap can fall
>>> behind, and hence we have to catch it up.
>>> The key observation, though, is that the joined heap can only ever be
>>> behind, and hence we do not need a seek, but only a reseek.
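As an illustration only (a sketch of the change described above, not the committed patch), the condition would become a forward-only reseek on the joined KeyValueHeap:

    // Sketch: because the joined heap can only lag behind the main heap,
    // a forward-only reseek is sufficient and avoids an absolute seek.
    boolean mayHaveData =
        (nextJoinedKv != null && nextJoinedKv.matchingRow(currentRow, offset, length))
        || (this.joinedHeap.reseek(KeyValue.createFirstOnRow(currentRow, offset, length))
            && joinedHeap.peek() != null
            && joinedHeap.peek().matchingRow(currentRow, offset, length));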
>>>
>>> Deploying a RegionServer with the seek replaced by a reseek, we see an
>>> improvement in *all* cases.
>>>
>>> I'll file a jira with a fix later.
>>>
>>> -- Lars
>>>
>>>
>>>
>>> ________________________________
>>>  From: James Taylor <[EMAIL PROTECTED]>
>>> To: [EMAIL PROTECTED]
>>> Sent: Monday, April 8, 2013 6:53 PM
>>> Subject: Re: Essential column family performance
>>>
>>> Good idea, Sergey. We'll rerun with larger non-essential column family
>>> values and see if there's a crossover point. One other difference for us
>>> is that we're using FAST_DIFF encoding. We'll try with no encoding too.
>>> Our table has 20 million rows across four region servers.
>>>
>>> Regarding the parallelization we do, we run multiple scans in parallel
>>> instead of one single scan over the table. We use the region boundaries
>>> of the table to divide up the work evenly, adding a start/stop key for
>>> each scan that corresponds to the region boundaries. Our client then
>>> does a final merge/aggregation step (i.e. adding up the count it gets
>>> back from the scan for each region).
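A rough sketch of that pattern against the plain HBase client API (the table name and thread handling below are assumptions; Phoenix's actual implementation differs):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Pair;

    public class ParallelRegionCount {
      public static void main(String[] args) throws Exception {
        final Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "TestTable"); // hypothetical table name
        Pair<byte[][], byte[][]> keys = table.getStartEndKeys();
        table.close();

        int regions = keys.getFirst().length;
        ExecutorService pool = Executors.newFixedThreadPool(regions);
        List<Future<Long>> futures = new ArrayList<Future<Long>>();

        // One scan per region, bounded by that region's start/stop keys.
        for (int i = 0; i < regions; i++) {
          final byte[] start = keys.getFirst()[i];
          final byte[] stop = keys.getSecond()[i];
          futures.add(pool.submit(new Callable<Long>() {
            public Long call() throws Exception {
              // HTable is not thread-safe, so each worker opens its own.
              HTable t = new HTable(conf, "TestTable");
              try {
                ResultScanner scanner = t.getScanner(new Scan(start, stop));
                try {
                  long count = 0;
                  for (Result r : scanner) {
                    count++;
                  }
                  return count;
                } finally {
                  scanner.close();
                }
              } finally {
                t.close();
              }
            }
          }));
        }

        // Final client-side merge: sum the per-region counts.
        long total = 0;
        for (Future<Long> f : futures) {
          total += f.get();
        }
        pool.shutdown();
        System.out.println("total rows: " + total);
      }
    }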
>>>
>>> On 04/08/2013 01:34 PM, Sergey Shelukhin wrote:
>>> > IntegrationTestLazyCfLoading uses randomly distributed keys with the
>>> > following condition for filtering:
>>> > 1 == (Long.parseLong(Bytes.toString(rowKey, 0, 4), 16) & 1); where rowKey
>>> > is the hex string of an MD5 key.
>>> > Then, there are 2 "lazy" CFs, each of which has a value of 4-64k.
>>> > This test also showed significant improvement IIRC, so random
>>> > distribution and a high percentage of values selected should not be a
>>> > problem as such.
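For reference, the feature under discussion is enabled per scan; a minimal sketch (family and qualifier names are made up) in which a filter on the small essential family gates loading of the wide lazy family:

    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
    import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    // Filter on the small "essential" family; rows that fail it never need
    // the wide "lazy" family to be read at all.
    SingleColumnValueFilter filter = new SingleColumnValueFilter(
        Bytes.toBytes("essential"),   // hypothetical family name
        Bytes.toBytes("flag"),        // hypothetical qualifier
        CompareOp.EQUAL,
        Bytes.toBytes("1"));
    filter.setFilterIfMissing(true);

    Scan scan = new Scan();
    scan.setFilter(filter);
    // Turn on lazy (on-demand) column family loading, i.e. the joined
    // scanners discussed in this thread.
    scan.setLoadColumnFamiliesOnDemand(true);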
>>> >
>>> > My hunch would be that the additional cost of seeks/merging the results
>>> > from two CFs outweighs the benefit of lazy loading on such small values
>>> > for the "lazy" CF with lots of data selected. This feature definitely