Re: Essential column family performance
bq. is the 40% randomly distributed or sequential?
Looks like the distribution is striped:

        if (i % 100 <= flag_percent) {
          put.add(cf_essential, col_name, flag_yes);

In each stripe, it is sequential.

Let me try simulating random distribution.
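
A minimal sketch of why the distribution could matter, written without any HBase dependency. It is not part of TestJoinedScanners. The class name, the helper names, the seed, and the idea of a "block" as a fixed number of consecutive rows are all assumptions made for illustration (real HFile blocks are sized in bytes). It counts how many such blocks contain no flagged row at all, as a rough stand-in for non-essential-CF blocks that might be skipped.

```java
import java.util.Random;

public class FlagDistributionSketch {

    // Striped layout: same condition as the quoted test code.
    // Note that "<=" flags flagPercent + 1 rows out of every 100.
    static boolean[] striped(int rows, int flagPercent) {
        boolean[] flagged = new boolean[rows];
        for (int i = 0; i < rows; i++) {
            flagged[i] = (i % 100 <= flagPercent);
        }
        return flagged;
    }

    // Random layout (assumption): each row is flagged independently
    // with probability flagPercent / 100.
    static boolean[] random(int rows, int flagPercent, long seed) {
        Random rnd = new Random(seed);
        boolean[] flagged = new boolean[rows];
        for (int i = 0; i < rows; i++) {
            flagged[i] = rnd.nextInt(100) < flagPercent;
        }
        return flagged;
    }

    // Counts hypothetical blocks of rowsPerBlock consecutive rows
    // that contain no flagged row.
    static int skippableBlocks(boolean[] flagged, int rowsPerBlock) {
        int skippable = 0;
        for (int start = 0; start < flagged.length; start += rowsPerBlock) {
            int end = Math.min(start + rowsPerBlock, flagged.length);
            boolean anyFlagged = false;
            for (int i = start; i < end; i++) {
                if (flagged[i]) {
                    anyFlagged = true;
                    break;
                }
            }
            if (!anyFlagged) {
                skippable++;
            }
        }
        return skippable;
    }

    public static void main(String[] args) {
        int rows = 10000;        // hypothetical table size
        int flagPercent = 40;    // value used in the modified test
        int rowsPerBlock = 10;   // hypothetical block granularity

        int stripedSkips = skippableBlocks(striped(rows, flagPercent), rowsPerBlock);
        int randomSkips = skippableBlocks(random(rows, flagPercent, 42L), rowsPerBlock);

        System.out.println("striped: " + stripedSkips + " skippable blocks");
        System.out.println("random:  " + randomSkips + " skippable blocks");
    }
}
```

With these hypothetical parameters, the striped layout leaves 500 of the 1000 blocks free of flagged rows, while the random layout should leave far fewer, since a 10-row block with no flagged row has probability of roughly 0.6^10. This only models the skipping opportunity; it says nothing about actual scan times.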

On Mon, Apr 8, 2013 at 10:38 AM, James Taylor <[EMAIL PROTECTED]> wrote:

> In TestJoinedScanners.java, is the 40% randomly distributed or
> sequential?
>
> In our test, the % is randomly distributed. Also, our custom filter does
> the same thing that SingleColumnValueFilter does.  On the client-side, we'd
> execute the query in parallel, through multiple scans along the region
> boundaries. Would that have a negative impact on performance for this
> "essential column family" feature?
>
> Thanks,
>
>     James
>
>
> On 04/08/2013 10:10 AM, Anoop John wrote:
>
>> Agree here. The effectiveness depends on what % of the data satisfies the
>> condition and how it is distributed across HFile blocks. We will get a
>> performance gain when we are able to skip some HFile blocks (from
>> non-essential CFs). Can you test with a different (lower) HFile block
>> size?
>>
>> -Anoop-
>>
>>
>> On Mon, Apr 8, 2013 at 8:19 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
>>
>>> I made the following change in TestJoinedScanners.java:
>>>
>>> -      int flag_percent = 1;
>>> +      int flag_percent = 40;
>>>
>>> The test took longer but still favors joined scanner.
>>> I got some new results:
>>>
>>> 2013-04-08 07:46:06,959 INFO  [main] regionserver.TestJoinedScanners(157):
>>> Slow scanner finished in 7.424388 seconds, got 2050 rows
>>> ...
>>> 2013-04-08 07:46:12,010 INFO  [main] regionserver.TestJoinedScanners(157):
>>> Joined scanner finished in 5.05063 seconds, got 2050 rows
>>>
>>> 2013-04-08 07:46:18,358 INFO  [main] regionserver.TestJoinedScanners(157):
>>> Slow scanner finished in 6.348517 seconds, got 2050 rows
>>> ...
>>> 2013-04-08 07:46:22,946 INFO  [main] regionserver.TestJoinedScanners(157):
>>> Joined scanner finished in 4.587545 seconds, got 2050 rows
>>>
>>> Looks like effectiveness of joined scanner is affected by distribution of
>>> data.
>>>
>>> Cheers
>>>
>>> On Sun, Apr 7, 2013 at 8:52 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:
>>>
>>>> Looking at the joined scanner test code, it sets it up such that 1% of the
>>>> rows match, which would somewhat be in line with James' results.
>>>>
>>>> In my own testing a while ago I found a 100% improvement with 0% match.
>>>>
>>>>
>>>> -- Lars
>>>>
>>>>
>>>>
>>>> ________________________________
>>>>   From: Ted Yu <[EMAIL PROTECTED]>
>>>> To: [EMAIL PROTECTED]
>>>> Sent: Sunday, April 7, 2013 4:13 PM
>>>> Subject: Re: Essential column family performance
>>>>
>>>> I have attached 5416-TestJoinedScanners-0.94.txt to HBASE-5416 for your
>>>> reference.
>>>>
>>>> On my MacBook, I got the following results from the test:
>>>>
>>>> 2013-04-07 16:08:17,474 INFO  [main] regionserver.TestJoinedScanners(157):
>>>> Slow scanner finished in 7.973822 seconds, got 100 rows
>>>> ...
>>>> 2013-04-07 16:08:17,946 INFO  [main] regionserver.TestJoinedScanners(157):
>>>> Joined scanner finished in 0.47235 seconds, got 100 rows
>>>>
>>>> Cheers
>>>>
>>>> On Sun, Apr 7, 2013 at 4:03 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> Looking at
>>>>> https://issues.apache.org/jira/secure/attachment/12564340/5416-0.94-v3.txt,
>>>>> I found that it didn't contain TestJoinedScanners, which shows the
>>>>> difference in scanner performance:
>>>>>
>>>>>     LOG.info((slow ? "Slow" : "Joined") + " scanner finished in "
>>>>>         + Double.toString(timeSec)
>>>>>         + " seconds, got " + Long.toString(rows_count/2) + " rows");
>>>>>
>>>>> The test uses SingleColumnValueFilter:
>>>>>
>>>>>      SingleColumnValueFilter filter = new SingleColumnValueFilter(
>>>>>
>>>>>          cf_essential, col_name, CompareFilter.CompareOp.EQUAL,