HBase >> mail # user >> Essential column family performance


Re: Essential column family performance
Yes, on 0.94.6. We have our own custom filter derived from FilterBase,
so filterIfMissing isn't the issue - the results of the scan are correct.

I can see that if the essential column family had more data than the
non-essential column family, the results would eventually even out.
I was hoping to always be able to enable the essential column
family feature. Is there an inherent reason why performance would
degrade like this? Does it boil down to a single sequential scan versus
many seeks?
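
The sequential-scan-versus-seeks question can be made concrete with a toy cost model. The sketch below is not HBase code and is not fitted to the numbers in this thread. The class name, the unit costs, and the assumption that every matching row pays one extra seek into the non-essential family are all hypothetical, chosen only to show how a break-even selectivity could arise.

```java
/**
 * Toy cost model (hypothetical, not HBase code): compares reading both
 * column families sequentially against reading only the essential family
 * and seeking into the non-essential family for each matching row.
 * All costs are arbitrary units and are assumptions, not measurements.
 */
final class EssentialFamilyCostModel {

    /** Feature off: every row reads both families sequentially. */
    static double fullScanCost(long rows, double essentialCost, double nonEssentialCost) {
        return rows * (essentialCost + nonEssentialCost);
    }

    /**
     * Feature on: every row reads the essential family; only rows that pass
     * the filter pay an assumed seek plus a read of the non-essential family.
     */
    static double onDemandCost(long rows, double selectivity, double essentialCost,
                               double nonEssentialCost, double seekCost) {
        if (selectivity < 0.0 || selectivity > 1.0) {
            throw new IllegalArgumentException("selectivity must be in [0, 1]");
        }
        double matchedRows = rows * selectivity;
        return rows * essentialCost + matchedRows * (seekCost + nonEssentialCost);
    }

    /** Cost with the feature on divided by cost with it off (below 1.0 means a gain). */
    static double relativeCost(long rows, double selectivity, double essentialCost,
                               double nonEssentialCost, double seekCost) {
        return onDemandCost(rows, selectivity, essentialCost, nonEssentialCost, seekCost)
                / fullScanCost(rows, essentialCost, nonEssentialCost);
    }

    /** Selectivity at which both strategies cost the same under this model. */
    static double breakEvenSelectivity(double nonEssentialCost, double seekCost) {
        return nonEssentialCost / (seekCost + nonEssentialCost);
    }

    public static void main(String[] args) {
        long rows = 1_000_000L;
        // Assumed unit costs: two equally narrow values, a seek costing 9 reads.
        double essentialCost = 1.0;
        double nonEssentialCost = 1.0;
        double seekCost = 9.0;

        for (int pct = 0; pct <= 100; pct += 20) {
            double rel = relativeCost(rows, pct / 100.0, essentialCost, nonEssentialCost, seekCost);
            System.out.printf("%3d%% selected -> %.2fx%n", pct, rel);
        }
        System.out.printf("break-even selectivity: %.2f%n",
                breakEvenSelectivity(nonEssentialCost, seekCost));
    }
}
```

Under these assumed costs the model breaks even when 10% of rows are selected, and the break-even point moves higher as the non-essential family grows relative to the seek cost. The model is monotonic in selectivity, so it cannot reproduce the measured curve in the quoted table and is at best a first approximation.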

Thanks,

James

On 04/07/2013 07:44 AM, Ted Yu wrote:
> James:
> Your test was based on 0.94.6.1, right?
>
> What Filter were you using?
>
> If you used SingleColumnValueFilter, have you seen my comment here?
> https://issues.apache.org/jira/browse/HBASE-5416?focusedCommentId=13541229&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13541229
>
> BTW the use case Max Lapan tried to address has the non-essential column
> family carrying considerably more data than the essential column family.
>
> Cheers
>
>
>
> On Sat, Apr 6, 2013 at 11:05 PM, James Taylor <[EMAIL PROTECTED]> wrote:
>
>> Hello,
>> We're doing some performance testing of the essential column family
>> feature, and we're seeing some performance degradation when comparing with
>> and without the feature enabled:
>>
>>                            Performance of scan relative
>> % of rows selected        to not enabling the feature
>> ---------------------    --------------------------------
>> 100%                            1.0x
>>   80%                            2.0x
>>   60%                            2.3x
>>   40%                            2.2x
>>   20%                            1.5x
>>   10%                            1.0x
>>    5%                            0.67x
>>    0%                            0.30x
>>
>> In our scenario, we have two column families. The key value from the
>> essential column family is used in the filter, while the key value from the
>> other, non essential column family is returned by the scan. Each row
>> contains both key values, with the values being relatively
>> narrow (less than 50 bytes). In this scenario, the only time we're seeing a
>> performance gain is when fewer than 10% of the rows are selected.
>>
>> Is this a reasonable test? Has anyone else measured this?
>>
>> Thanks,
>>
>> James