Re: Essential column family performance
Yes, on 0.94.6. We have our own custom filter derived from FilterBase,
so filterIfMissing isn't the issue - the results of the scan are correct.

I can see that if the essential column family has more data than the
non-essential column family, the results would eventually even
out. I was hoping to always be able to enable the essential column
family feature. Is there an inherent reason why performance would
degrade like this? Does it boil down to a single sequential scan versus
many seeks?
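
One way to reason about that last question is a toy cost model. The sketch below is not HBase's scanner logic: the class name, the method names, and the per-row read and seek costs are all hypothetical, chosen only for illustration. It assumes a full scan reads both families sequentially for every row, while on-demand loading reads the essential family for every row and pays one seek plus one read in the other family for each row that passes the filter.

```java
public class EssentialFamilyCostModel {

    // Cost of a plain scan: every row reads both column families sequentially.
    static double fullScanCost(long rows, double readEssential, double readOther) {
        return rows * (readEssential + readOther);
    }

    // Cost with on-demand loading (assumed model): every row reads the
    // essential family; only rows passing the filter pay a seek plus a read
    // in the non-essential family.
    static double onDemandCost(long rows, double selectivity,
                               double readEssential, double readOther, double seek) {
        return rows * readEssential + rows * selectivity * (seek + readOther);
    }

    // Ratio of on-demand cost to full-scan cost; values above 1.0 mean slower.
    static double relativeCost(long rows, double selectivity,
                               double readEssential, double readOther, double seek) {
        return onDemandCost(rows, selectivity, readEssential, readOther, seek)
                / fullScanCost(rows, readEssential, readOther);
    }

    // Selectivity at which both strategies cost the same in this model:
    // p * (seek + readOther) = readOther.
    static double breakEvenSelectivity(double readOther, double seek) {
        return readOther / (seek + readOther);
    }

    public static void main(String[] args) {
        long rows = 1_000_000L;      // hypothetical table size
        double readEssential = 1.0;  // hypothetical cost units per row
        double readOther = 1.0;      // hypothetical cost units per row
        double seek = 9.0;           // hypothetical cost of one seek

        for (double p : new double[] {1.0, 0.8, 0.6, 0.4, 0.2, 0.1, 0.05, 0.0}) {
            System.out.printf("selected=%.2f relativeCost=%.2f%n",
                    p, relativeCost(rows, p, readEssential, readOther, seek));
        }
        System.out.printf("breakEven=%.2f%n", breakEvenSelectivity(readOther, seek));
    }
}
```

Because this model is linear in the fraction of rows selected, it cannot reproduce a curve that peaks around 60% and returns to 1.0x at 100%, so it is only an aid to intuition about when seeks might outweigh the data skipped.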

Thanks,

James

On 04/07/2013 07:44 AM, Ted Yu wrote:
> James:
> Your test was based on 0.94.6.1, right?
>
> What Filter were you using?
>
> If you used SingleColumnValueFilter, have you seen my comment here?
> https://issues.apache.org/jira/browse/HBASE-5416?focusedCommentId=13541229&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13541229
>
> BTW the use case Max Lapan tried to address has the non-essential column
> family carrying considerably more data than the essential column family.
>
> Cheers
>
>
>
> On Sat, Apr 6, 2013 at 11:05 PM, James Taylor <[EMAIL PROTECTED]> wrote:
>
>> Hello,
>> We're doing some performance testing of the essential column family
>> feature, and we're seeing some performance degradation when comparing with
>> and without the feature enabled:
>>
>> % of rows selected    Performance of scan relative
>>                       to not enabling the feature
>> ------------------    ----------------------------
>> 100%                  1.0x
>>  80%                  2.0x
>>  60%                  2.3x
>>  40%                  2.2x
>>  20%                  1.5x
>>  10%                  1.0x
>>   5%                  0.67x
>>   0%                  0.30x
>>
>> In our scenario, we have two column families. The key value from the
>> essential column family is used in the filter, while the key value from the
>> other, non-essential column family is returned by the scan. Each row
>> contains both key values, and the values are relatively narrow (less than
>> 50 bytes). In this scenario, the only time we see a performance gain is
>> when fewer than 10% of the rows are selected.
>>
>> Is this a reasonable test? Has anyone else measured this?
>>
>> Thanks,
>>
>> James