HBase >> mail # user >> Essential column family performance


Re: Essential column family performance
That part did not show up in the profiling session.
It was just the unnecessary seek that slowed it all down.

-- Lars
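
The seek-versus-reseek distinction discussed in this thread can be illustrated with a toy model. This is only a sketch and not HBase code: `ToyScanner`, its sorted `String[]` of unique keys, and the row names are all hypothetical. Here `seek` searches the whole key space and may move in either direction, while `reseek` only advances from the current position, which is sufficient when the scanner is known to be at or behind the target.

```java
import java.util.Arrays;

public class ToyScanner {
    private final String[] sortedKeys; // assumed sorted and unique
    private int pos = 0;               // current position in sortedKeys

    ToyScanner(String[] sortedKeys) {
        this.sortedKeys = sortedKeys;
    }

    /** Returns the current key, or null when the scanner is exhausted. */
    String peek() {
        return pos < sortedKeys.length ? sortedKeys[pos] : null;
    }

    /** Positions at the first key >= target by searching all keys; may move backward. */
    boolean seek(String target) {
        int idx = Arrays.binarySearch(sortedKeys, target);
        pos = idx >= 0 ? idx : -(idx + 1); // insertion point when not found
        return pos < sortedKeys.length;
    }

    /** Forward only: correct only if target is not before the current position. */
    boolean reseek(String target) {
        while (pos < sortedKeys.length && sortedKeys[pos].compareTo(target) < 0) {
            pos++;
        }
        return pos < sortedKeys.length;
    }

    public static void main(String[] args) {
        ToyScanner joined = new ToyScanner(new String[] {"row1", "row3", "row5", "row7"});
        // The joined scanner lags behind the current row, so catching up
        // only ever requires moving forward.
        joined.reseek("row5");
        System.out.println(joined.peek()); // prints row5
    }
}
```

In this toy the forward scan is linear, so it only pays off when the target is close to the current position; what a real scanner does on each call is outside the scope of this sketch.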

________________________________
 From: Ted Yu <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Tuesday, April 9, 2013 9:03 PM
Subject: Re: Essential column family performance
 
Looking at populateFromJoinedHeap():

      KeyValue kv = populateResult(results, this.joinedHeap, limit,
          joinedContinuationRow.getBuffer(), joinedContinuationRow.getRowOffset(),
          joinedContinuationRow.getRowLength(), metric);

...

      Collections.sort(results, comparator);

Arrays.mergeSort() is used in the Collections.sort() call.

There seems to be some optimization we can do above: we can record the size
of results before calling populateResult(). Upon return, we can merge the
two segments without resorting to Arrays.mergeSort() which is recursive.
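
A minimal sketch of that idea follows. It is not taken from the HBase source: `SegmentMerge` and `mergeSortedSegments` are hypothetical names, and the sketch assumes that both the entries already in the list and the entries appended afterwards are each sorted by the comparator. The size recorded before the append marks the boundary between the two segments, which are then combined in a single linear pass.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SegmentMerge {
    /**
     * Merges list[0, mid) and list[mid, size), each assumed to be sorted by
     * cmp, into one sorted list in place. On ties, elements from the first
     * segment are placed first.
     */
    static <T> void mergeSortedSegments(List<T> list, int mid, Comparator<? super T> cmp) {
        int n = list.size();
        if (mid <= 0 || mid >= n) {
            return; // one segment is empty, nothing to merge
        }
        if (cmp.compare(list.get(mid - 1), list.get(mid)) <= 0) {
            return; // segments are already in order across the boundary
        }
        List<T> merged = new ArrayList<>(n);
        int i = 0;
        int j = mid;
        while (i < mid && j < n) {
            if (cmp.compare(list.get(i), list.get(j)) <= 0) {
                merged.add(list.get(i++));
            } else {
                merged.add(list.get(j++));
            }
        }
        while (i < mid) {
            merged.add(list.get(i++));
        }
        while (j < n) {
            merged.add(list.get(j++));
        }
        for (int k = 0; k < n; k++) {
            list.set(k, merged.get(k));
        }
    }

    public static void main(String[] args) {
        List<Integer> results = new ArrayList<>(Arrays.asList(1, 4, 7));
        int mid = results.size();               // size recorded before appending
        results.addAll(Arrays.asList(2, 3, 9)); // stands in for the appended entries
        mergeSortedSegments(results, mid, Comparator.naturalOrder());
        System.out.println(results);            // prints [1, 2, 3, 4, 7, 9]
    }
}
```

The merge is linear in the list size and uses no recursion, at the cost of one temporary list.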
On Tue, Apr 9, 2013 at 6:21 PM, Ted Yu <[EMAIL PROTECTED]> wrote:

> bq. with only 10000 rows that would all fit in the memstore.
>
> This aspect should be enhanced in the test.
>
> Cheers
>
> On Tue, Apr 9, 2013 at 6:17 PM, Lars Hofhansl <[EMAIL PROTECTED]> wrote:
>
>> Also, the unit test runs with only 10000 rows, which would all fit in the
>> memstore. Seek vs. reseek should make little difference for the memstore.
>>
>> We tested with 1m and 10m rows, and flushed the memstore and compacted
>> the store.
>>
>> Will do some more verification later tonight.
>>
>> -- Lars
>>
>>
>> Lars H <[EMAIL PROTECTED]> wrote:
>>
>> >Your slow scanner performance seems to vary as well. How come? Slow is
>> >with the feature off.
>> >
>> >I don't see how reseek can be slower than seek in any scenario.
>> >
>> >-- Lars
>> >
>> >Ted Yu <[EMAIL PROTECTED]> schrieb:
>> >
>> >>I tried using reseek() as suggested, along with my patch from HBASE-8306
>> >>(30% selection rate, random distribution and FAST_DIFF encoding on both
>> >>column families).
>> >>I got uneven results:
>> >>
>> >>2013-04-09 16:59:01,324 INFO  [main] regionserver.TestJoinedScanners(167): Slow scanner finished in 7.529083 seconds, got 1546 rows
>> >>
>> >>2013-04-09 16:59:06,760 INFO  [main] regionserver.TestJoinedScanners(167): Joined scanner finished in 5.43579 seconds, got 1546 rows
>> >>...
>> >>2013-04-09 16:59:12,711 INFO  [main] regionserver.TestJoinedScanners(167): Slow scanner finished in 5.95016 seconds, got 1546 rows
>> >>
>> >>2013-04-09 16:59:20,240 INFO  [main] regionserver.TestJoinedScanners(167): Joined scanner finished in 7.529044 seconds, got 1546 rows
>> >>
>> >>FYI
>> >>
>> >>On Tue, Apr 9, 2013 at 4:47 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:
>> >>
>> >>> We did some tests here.
>> >>> I ran this through the profiler against a local RegionServer and found
>> >>> that the part that causes the slowdown is a seek called here:
>> >>>              boolean mayHaveData =
>> >>>               (nextJoinedKv != null
>> >>>                   && nextJoinedKv.matchingRow(currentRow, offset, length))
>> >>>               ||
>> >>>               (this.joinedHeap.seek(KeyValue.createFirstOnRow(currentRow, offset, length))
>> >>>                   && joinedHeap.peek() != null
>> >>>                   && joinedHeap.peek().matchingRow(currentRow, offset, length));
>> >>>
>> >>> Looking at the code, this is needed because the joinedHeap can fall
>> >>> behind, and hence we have to catch it up.
>> >>> The key observation, though, is that the joined heap can only ever be
>> >>> behind, and hence we do not need a seek, but only a reseek.
>> >>>
>> >>> Deploying a RegionServer with the seek replaced with reseek we see an
>> >>> improvement in *all* cases.
>> >>>
>> >>> I'll file a jira with a fix later.
>> >>>
>> >>> -- Lars
>> >>>
>> >>>
>> >>>
>> >>> ________________________________
>> >>>  From: James Taylor <[EMAIL PROTECTED]>
>> >>> To: [EMAIL PROTECTED]
>> >>> Sent: Monday, April 8, 2013 6:53 PM
>> >>> Subject: Re: Essential column family performance
>> >>>
>> >>> Good idea, Sergey. We'll rerun with larger non essential column family