HBase, mail # user - Essential column family performance


Re: Essential column family performance
Ted Yu 2013-04-09, 02:51
Using a 30% selection rate, random distribution, and FAST_DIFF encoding on
both column families, I got:

2013-04-08 19:46:21,802 INFO  [main] regionserver.TestJoinedScanners(166):
Slow scanner finished in 5.251182 seconds, got 1547 rows
...
2013-04-08 19:46:26,661 INFO  [main] regionserver.TestJoinedScanners(166):
Joined scanner finished in 4.858834 seconds, got 1547 rows

2013-04-08 19:46:31,891 INFO  [main] regionserver.TestJoinedScanners(166):
Slow scanner finished in 5.22988 seconds, got 1547 rows
...
2013-04-08 19:46:36,566 INFO  [main] regionserver.TestJoinedScanners(166):
Joined scanner finished in 4.674822 seconds, got 1547 rows
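
To put the two runs above side by side, the relative gain can be computed directly from the logged timings. This is only a small arithmetic sketch; the class and method names are illustrative and are not part of TestJoinedScanners.

```java
// Illustrative helper, not part of TestJoinedScanners.
class ScanSpeedup {

    /** Fraction of wall-clock time saved by the joined scanner relative to the slow one. */
    static double relativeGain(double slowSeconds, double joinedSeconds) {
        return (slowSeconds - joinedSeconds) / slowSeconds;
    }

    public static void main(String[] args) {
        // Timings copied from the two runs logged above: {slow, joined}.
        double[][] runs = { {5.251182, 4.858834}, {5.22988, 4.674822} };
        for (double[] run : runs) {
            System.out.printf("slow=%.3fs joined=%.3fs gain=%.1f%%%n",
                    run[0], run[1], 100 * relativeGain(run[0], run[1]));
        }
    }
}
```

For these two runs the gain works out to roughly 7.5% and 10.6%, noticeably smaller than in the earlier 40% run without FAST_DIFF.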

Cheers

On Mon, Apr 8, 2013 at 6:53 PM, James Taylor <[EMAIL PROTECTED]> wrote:

> Good idea, Sergey. We'll rerun with larger non-essential column family
> values and see if there's a crossover point. One other difference for us is
> that we're using FAST_DIFF encoding. We'll try with no encoding too. Our
> table has 20 million rows across four region servers.
>
> Regarding the parallelization we do, we run multiple scans in parallel
> instead of one single scan over the table. We use the region boundaries of
> the table to divide up the work evenly, adding a start/stop key for each
> scan that corresponds to the region boundaries. Our client then does a
> final merge/aggregation step (i.e. adding up the count it gets back from
> the scan for each region).
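
The scheme described above (one scan per region, bounded by start/stop keys, followed by a client-side merge) can be sketched roughly as follows. This is a plain-Java stand-in under stated assumptions, not the actual client code: a TreeMap plays the role of the table, the boundary list plays the role of the region boundaries, and the `value == 1` check is a hypothetical filter predicate.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Stand-in for the client-side logic: the map is the "table", and each
// [start, stop) key range corresponds to one "region".
class ParallelRegionCount {

    /** Counts matching rows in [startKey, stopKey), like one per-region scan. */
    static long countRange(NavigableMap<String, Integer> table,
                           String startKey, String stopKey) {
        long count = 0;
        for (int value : table.subMap(startKey, true, stopKey, false).values()) {
            if (value == 1) { // hypothetical filter predicate
                count++;
            }
        }
        return count;
    }

    /** Runs one scan per region in parallel and sums the per-region counts. */
    static long parallelCount(NavigableMap<String, Integer> table,
                              List<String> boundaries) {
        int regions = Math.max(1, boundaries.size() - 1);
        ExecutorService pool = Executors.newFixedThreadPool(regions);
        try {
            List<Future<Long>> partials = new ArrayList<>();
            for (int i = 0; i + 1 < boundaries.size(); i++) {
                final String start = boundaries.get(i);
                final String stop = boundaries.get(i + 1);
                Callable<Long> scan = () -> countRange(table, start, stop);
                partials.add(pool.submit(scan));
            }
            long total = 0;
            for (Future<Long> partial : partials) {
                total += partial.get(); // final merge/aggregation step on the client
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("parallel scan failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

The map is only read while the tasks run, so sharing it between threads is safe in this sketch. With a real table, each task would instead open its own scanner with the corresponding start and stop row.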
>
>
> On 04/08/2013 01:34 PM, Sergey Shelukhin wrote:
>
>> IntegrationTestLazyCfLoading uses randomly distributed keys with the
>> following condition for filtering:
>> 1 == (Long.parseLong(Bytes.toString(rowKey, 0, 4), 16) & 1); where rowKey
>> is the hex string of the MD5 key.
>> Then, there are 2 "lazy" CFs, each of which has a value of 4-64k.
>> This test also showed significant improvement IIRC, so random distribution
>> and a high percentage of values selected should not be a problem as such.
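
The filtering condition quoted above can be tried out in isolation with a few lines of plain Java. This is a sketch only: the class and method names are made up for illustration, and `new String(rowKey, 0, 4, UTF_8)` is used as a stand-in for HBase's `Bytes.toString(rowKey, 0, 4)` so the snippet has no HBase dependency.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only; not the code from IntegrationTestLazyCfLoading.
class LazyCfKeyPredicate {

    /**
     * Returns true when the first four hex digits of the row key form an odd
     * number, i.e. when the lowest bit of the fourth hex digit is set.
     */
    static boolean isSelected(byte[] rowKey) {
        // Stand-in for Bytes.toString(rowKey, 0, 4).
        String prefix = new String(rowKey, 0, 4, StandardCharsets.UTF_8);
        return 1 == (Long.parseLong(prefix, 16) & 1);
    }

    public static void main(String[] args) {
        String[] sampleKeys = { "00a1d3f0", "00a2d3f0", "ffff0000" };
        for (String key : sampleKeys) {
            byte[] rowKey = key.getBytes(StandardCharsets.UTF_8);
            System.out.println(key + " selected=" + isSelected(rowKey));
        }
    }
}
```

Because the condition looks at a single bit of an MD5-derived prefix, it should select about half of the rows, spread randomly across the key space.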
>>
>> My hunch would be that the additional cost of seeks/merging the results
>> from two CFs outweighs the benefit of lazy loading on such small values
>> for the "lazy" CF with lots of data selected. This feature definitely makes
>> no sense if you are selecting all values, because then extra work is being
>> done for no benefit (everything is read anyway).
>> So the use cases would be larger "lazy" CFs and/or a low percentage of
>> values selected.
>>
>> Can you try to increase the 2nd CF values' size and rerun the test?
>>
>>
>> On Mon, Apr 8, 2013 at 10:38 AM, James Taylor <[EMAIL PROTECTED]> wrote:
>>
>>> In TestJoinedScanners.java, is the 40% randomly distributed or
>>> sequential?
>>>
>>> In our test, the % is randomly distributed. Also, our custom filter does
>>> the same thing that SingleColumnValueFilter does. On the client side, we'd
>>> execute the query in parallel, through multiple scans along the region
>>> boundaries. Would that have a negative impact on performance for this
>>> "essential column family" feature?
>>>
>>> Thanks,
>>>
>>>      James
>>>
>>>
>>> On 04/08/2013 10:10 AM, Anoop John wrote:
>>>
>>>> Agree here. The effectiveness depends on what % of the data satisfies the
>>>> condition and how it is distributed across HFile blocks. We will get a
>>>> performance gain when we are able to skip some HFile blocks (from
>>>> non-essential CFs). Can you test with a different (lower) HFile block
>>>> size?
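
As a rough way to reason about the block-skipping argument above: if rows are selected independently at random with rate p, a block holding n rows contains no selected row with probability (1 - p)^n. The sketch below just evaluates that formula. It is a simplified model, not a description of how HBase actually seeks; the class name and the rows-per-block values are illustrative assumptions.

```java
// Simplified model: a block of the non-essential CF can be skipped only if
// none of its rows matched the filter, and rows match independently.
class BlockSkipEstimate {

    /**
     * Probability that a block holding rowsPerBlock rows contains no selected
     * row, given that each row is selected with probability selectionRate.
     */
    static double skipProbability(double selectionRate, int rowsPerBlock) {
        return Math.pow(1.0 - selectionRate, rowsPerBlock);
    }

    public static void main(String[] args) {
        double[] rates = { 0.01, 0.30, 0.40 };
        int[] rowsPerBlock = { 1, 4, 16, 64 }; // illustrative values only
        for (double rate : rates) {
            for (int rows : rowsPerBlock) {
                System.out.printf("rate=%.2f rows/block=%d skip=%.4f%n",
                        rate, rows, skipProbability(rate, rows));
            }
        }
    }
}
```

Under this model the skip probability drops quickly as more rows fit in a block, which is consistent with the suggestion to try a smaller block size or larger values in the non-essential CF.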
>>>>
>>>> -Anoop-
>>>>
>>>>
>>>> On Mon, Apr 8, 2013 at 8:19 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> I made the following change in TestJoinedScanners.java:
>>>>
>>>>> -      int flag_percent = 1;
>>>>> +      int flag_percent = 40;
>>>>>
>>>>> The test took longer but still favors joined scanner.
>>>>> I got some new results:
>>>>>
>>>>> 2013-04-08 07:46:06,959 INFO  [main] regionserver.TestJoinedScanners(157):
>>>>> Slow scanner finished in 7.424388 seconds, got 2050 rows
>>>>> ...
>>>>> 2013-04-08 07:46:12,010 INFO  [main] regionserver.TestJoinedScanners(157):
>>>>> Joined scanner finished in 5.05063 seconds, got 2050 rows