HBase >> mail # dev >> Scanner with explicit columns list is very slow


Re: Scanner with explicit columns list is very slow
One quick optimization:

There is no need to call reseek on INCLUDE_NEXT_COL: the next column is in
the same row, in the same KeyValueScanner (the one currently on top of the
KeyValueHeap).
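
To illustrate the idea, here is a minimal, self-contained sketch. It is not HBase code: `SeekSketch`, `Cell`, `buildStore`, `reseek` and `stepTo` are hypothetical names, and a sorted in-memory list stands in for a single KeyValueScanner. It only shows that, when the target column is in the current row, stepping forward from the current position reaches the same cell that a fresh seek would find.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SeekSketch {
    /** Hypothetical stand-in for a KeyValue key: row plus column qualifier. */
    public static final class Cell implements Comparable<Cell> {
        final int row;
        final int qualifier;

        public Cell(int row, int qualifier) {
            this.row = row;
            this.qualifier = qualifier;
        }

        @Override
        public int compareTo(Cell other) {
            int byRow = Integer.compare(row, other.row);
            return byRow != 0 ? byRow : Integer.compare(qualifier, other.qualifier);
        }
    }

    /** Builds a sorted store with qualifiers 1..columns in each of `rows` rows. */
    public static List<Cell> buildStore(int rows, int columns) {
        List<Cell> store = new ArrayList<>();
        for (int r = 0; r < rows; r++) {
            for (int q = 1; q <= columns; q++) {
                store.add(new Cell(r, q));
            }
        }
        return store;
    }

    /** "Reseek": search the whole store for the first cell >= target. */
    public static int reseek(List<Cell> store, Cell target) {
        int i = Collections.binarySearch(store, target);
        return i >= 0 ? i : -i - 1; // a negative result encodes the insertion point
    }

    /** "Step": advance from the current position until a cell >= target is reached. */
    public static int stepTo(List<Cell> store, int from, Cell target) {
        int i = from;
        while (i < store.size() && store.get(i).compareTo(target) < 0) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        List<Cell> store = buildStore(3, 5); // 3 rows, CQ1..CQ5
        int current = 0;                     // positioned at row 0, CQ1
        Cell target = new Cell(0, 3);        // next requested column: CQ3, same row
        System.out.println(reseek(store, target) + " " + stepTo(store, current, target));
        // prints "2 2": both approaches land on the same cell
    }
}
```

In the real scanner the saving would come from avoiding the reseek through the KeyValueHeap; this sketch only demonstrates that the two approaches are equivalent in result and makes no claim about the actual speedup.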
On Mon, Oct 14, 2013 at 2:46 PM, Vladimir Rodionov
<[EMAIL PROTECTED]> wrote:

> I profiled the last test case (5 columns total and 2 in a scan).
>
> About 80% of StoreScanner.next() execution time is spent in:
>
> StoreScanner.reseek() - 71%
> ScanQueryMatcher.getKeyForNextColumn() - 6%
> ScanQueryMatcher.getKeyForNextRow() - 2%
>
> Should I open a JIRA?
>
>
> On Mon, Oct 14, 2013 at 2:03 PM, Vladimir Rodionov <[EMAIL PROTECTED]
> > wrote:
>
>> I modified the tests:
>>
>> I have now created a table with one CF and 5 columns: CQ1, ..., CQ5
>>
>> 1. Scan.addColumn(CF, CQ1);
>>     Scan.addColumn(CF, CQ3);
>>
>> 2. Scan.addFamily(CF);
>>
>> Scan performance from block cache:
>>
>> 1.  400K rows per sec
>> 2.  1.6M rows per sec
>>
>> The explicit-columns scan performs even worse in this case (4x slower).
>> It is much faster to scan the WHOLE rows and filter columns later in a
>> Filter than to specify the columns directly in a Scan.
>>
>> This definitely needs to be explained/investigated.
>>
>>
>> On Mon, Oct 14, 2013 at 11:18 AM, Vladimir Rodionov <
>> [EMAIL PROTECTED]> wrote:
>>
>>> It's 0.94.6, and there is a chance that the issue has already been fixed.
>>>
>>> Simple table: one column family + one qualifier
>>>
>>> Two types of scans:
>>>
>>> 1. Scan.addFamily(CF)
>>>
>>> 2. Scan.addColumn(CF, CQ)
>>>
>>> Both run on block cache (all data in memory)
>>>
>>> Tested on StoreScanner directly.
>>>
>>> 1. 4.2M KVs per second per thread
>>> 2. 1.5M KVs per second per thread
>>>
>>> The difference? The first scanner's ScanQueryMatcher returns INCLUDE, DONE;
>>> the second returns INCLUDE_NEXT_ROW, DONE.
>>> The cost of the per-row reseek is huge.
>>>
>>> Best regards,
>>> Vladimir Rodionov
>>> Principal Platform Engineer
>>> Carrier IQ, www.carrieriq.com
>>> e-mail: [EMAIL PROTECTED]
>>>
>>>
>>>
>>
>>
>