HBase >> mail # user >> Essential column family performance


Re: Essential column family performance
One of James' motivations was to be able to always enable essential column families for scanners (and thus avoid depending on the HBase API version; essential column families were only added in 0.94.5+).
Sounds like the general answer to this is: "No, you shouldn't. It should still be a per-query option, or at least a per-table option."
-- Lars

________________________________
 From: Sergey Shelukhin <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Monday, April 8, 2013 1:34 PM
Subject: Re: Essential column family performance
 
IntegrationTestLazyCfLoading uses randomly distributed keys with the
following condition for filtering:
1 == (Long.parseLong(Bytes.toString(rowKey, 0, 4), 16) & 1); where rowKey
is hex string of MD5 key.
Then, there are 2 "lazy" CFs, each of which has a value of 4-64k.
This test also showed significant improvement IIRC, so random distribution
and a high percentage of values selected should not be a problem as such.
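
To make the filtering condition above concrete, here is a minimal, self-contained sketch. It is not the code of IntegrationTestLazyCfLoading: the class name, the helper names, and the "row-N" source keys are hypothetical, and the HBase call Bytes.toString(rowKey, 0, 4) is replaced with a plain JDK String constructor so the example runs without HBase on the classpath. The predicate keeps a row when the number formed by the first four hex characters of its key is odd, so with MD5-derived keys roughly half of the rows should be selected.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical stand-alone sketch; not the actual HBase integration test.
final class LazyCfFilterSketch {

    // Hex string of the MD5 digest of the given source key.
    static String md5Hex(String key) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(key.getBytes(StandardCharsets.UTF_8));
        StringBuilder sb = new StringBuilder(digest.length * 2);
        for (byte b : digest) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    // Same condition as quoted in the message: parse the first four
    // characters of the row key as hex and test the lowest bit.
    static boolean isSelected(byte[] rowKey) {
        String prefix = new String(rowKey, 0, 4, StandardCharsets.US_ASCII);
        return 1 == (Long.parseLong(prefix, 16) & 1);
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        int total = 100_000;
        int selected = 0;
        for (int i = 0; i < total; i++) {
            // "row-" + i is an arbitrary, made-up source key.
            byte[] rowKey = md5Hex("row-" + i).getBytes(StandardCharsets.US_ASCII);
            if (isSelected(rowKey)) {
                selected++;
            }
        }
        System.out.printf("selected %d of %d rows (%.1f%%)%n",
                selected, total, 100.0 * selected / total);
    }
}
```

Because only the parity of the fourth hex character matters, the selected rows are scattered across the key space rather than clustered.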

My hunch would be that the additional cost of seeks/merging the results
from two CFs outweighs the benefit of lazy loading on such small values
for the "lazy" CF with lots of data selected. This feature definitely makes
no sense if you are selecting all values, because then extra work is being
done for no benefit (everything is read anyway).
So the use cases would be larger "lazy" CFs and/or a low percentage of values
selected.
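
The hunch above can be illustrated with a toy cost model. This is only a sketch under stated assumptions, not a measurement and not how HBase accounts for I/O: every row pays for its essential CF, a lazily loaded row additionally pays for its "lazy" CF value plus a fixed per-row seek/merge overhead, and that overhead is expressed in hypothetical "byte-equivalents". All names and numbers are made up for illustration.

```java
// Toy cost model; all parameters are hypothetical byte-equivalents.
final class LazyCfCostSketch {

    // Eager scan: both CFs are read for every row.
    static double eagerCost(long rows, double essentialBytes, double lazyBytes) {
        return rows * (essentialBytes + lazyBytes);
    }

    // Lazy scan: the essential CF is read for every row; the lazy CF
    // (plus an assumed seek/merge overhead) only for selected rows.
    static double lazyCost(long rows, double selectivity, double essentialBytes,
                           double lazyBytes, double seekOverheadBytes) {
        return rows * essentialBytes
                + rows * selectivity * (lazyBytes + seekOverheadBytes);
    }

    // Selectivity at which both costs are equal in this model:
    // lazyBytes = p * (lazyBytes + seekOverheadBytes).
    static double breakEvenSelectivity(double lazyBytes, double seekOverheadBytes) {
        return lazyBytes / (lazyBytes + seekOverheadBytes);
    }

    public static void main(String[] args) {
        double seekOverhead = 8_192; // assumed, not measured
        for (double lazyBytes : new double[] {64, 4_096, 65_536}) {
            System.out.printf("lazy value %.0f bytes -> break-even selectivity %.3f%n",
                    lazyBytes, breakEvenSelectivity(lazyBytes, seekOverhead));
        }
    }
}
```

In this model lazy loading wins only while the fraction of selected rows stays below the break-even value, which grows with the size of the "lazy" CF values. At 100% selectivity the lazy scan is always at least as expensive as the eager one, matching the remark that the feature makes no sense when all values are selected.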

Can you try to increase the 2nd CF values' size and rerun the test?

On Mon, Apr 8, 2013 at 10:38 AM, James Taylor <[EMAIL PROTECTED]> wrote:

> In the TestJoinedScanners.java, is the 40% randomly distributed or
> sequential?
>
> In our test, the % is randomly distributed. Also, our custom filter does
> the same thing that SingleColumnValueFilter does.  On the client-side, we'd
> execute the query in parallel, through multiple scans along the region
> boundaries. Would that have a negative impact on performance for this
> "essential column family" feature?
>
> Thanks,
>
>     James
>
>
> On 04/08/2013 10:10 AM, Anoop John wrote:
>
>> Agree here. The effectiveness depends on what % of the data satisfies the
>> condition and how it is distributed across HFile blocks. We will get a
>> performance gain when we are able to skip some HFile blocks (from
>> non-essential CFs). Can you test with a different (lower) HFile block
>> size?
>>
>> -Anoop-
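
The point about distribution across HFile blocks can be made concrete with a small probability sketch. It assumes, purely for illustration, that matching rows are spread independently and uniformly at random and that each block of a non-essential CF holds a fixed number of rows; under those assumptions a block can be skipped only if none of its rows match. The class and method names are hypothetical and nothing here calls HBase.

```java
// Hypothetical sketch: chance that a non-essential CF block holds no matching row.
final class BlockSkipSketch {

    // With match fraction p and k rows per block, assuming independent
    // random matches, the block is skippable with probability (1 - p)^k.
    static double skipProbability(double matchFraction, int rowsPerBlock) {
        return Math.pow(1.0 - matchFraction, rowsPerBlock);
    }

    public static void main(String[] args) {
        double[] fractions = {0.01, 0.40};
        int[] rowsPerBlock = {4, 16, 64}; // made-up block occupancies
        for (double p : fractions) {
            for (int k : rowsPerBlock) {
                System.out.printf("match=%.2f rowsPerBlock=%d -> skip probability %.4f%n",
                        p, k, skipProbability(p, k));
            }
        }
    }
}
```

Fewer rows per block (a lower block size, or larger values) raises the chance that a block contains no matching row, and a randomly distributed 40% match rate leaves very few skippable blocks compared with a 1% match rate.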
>>
>>
>> On Mon, Apr 8, 2013 at 8:19 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
>>
>>> I made the following change in TestJoinedScanners.java:
>>>
>>> -      int flag_percent = 1;
>>> +      int flag_percent = 40;
>>>
>>> The test took longer but still favors joined scanner.
>>> I got some new results:
>>>
>>> 2013-04-08 07:46:06,959 INFO  [main] regionserver.TestJoinedScanners(157):
>>> Slow scanner finished in 7.424388 seconds, got 2050 rows
>>> ...
>>> 2013-04-08 07:46:12,010 INFO  [main] regionserver.TestJoinedScanners(157):
>>> Joined scanner finished in 5.05063 seconds, got 2050 rows
>>>
>>> 2013-04-08 07:46:18,358 INFO  [main] regionserver.TestJoinedScanners(157):
>>> Slow scanner finished in 6.348517 seconds, got 2050 rows
>>> ...
>>> 2013-04-08 07:46:22,946 INFO  [main] regionserver.TestJoinedScanners(157):
>>> Joined scanner finished in 4.587545 seconds, got 2050 rows
>>>
>>> Looks like effectiveness of joined scanner is affected by distribution of
>>> data.
>>>
>>> Cheers
>>>
>>> On Sun, Apr 7, 2013 at 8:52 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:
>>>
>>>> Looking at the joined scanner test code, it sets it up such that 1% of the
>>>> rows match, which would somewhat be in line with James' results.
>>>>
>>>> In my own testing a while ago I found a 100% improvement with 0% match.
>>>>
>>>>
>>>> -- Lars
>>>>
>>>>
>>>>
>>>> ________________________________
>>>>   From: Ted Yu <[EMAIL PROTECTED]>
>>>> To: [EMAIL PROTECTED]
>>>> Sent: Sunday, April 7, 2013 4:13 PM
>>>> Subject: Re: Essential column family performance
>>>>
>>>> I have attached 5416-TestJoinedScanners-0.94.txt to HBASE-5416 for