HBase, mail # user - Poor HBase map-reduce scan performance


Re: Poor HBase map-reduce scan performance
Bryan Keller 2013-07-01, 21:35
I attached my patch to the JIRA issue, in case anyone is interested. It can pretty easily be used on its own without patching HBase. I am currently doing this.
On Jul 1, 2013, at 2:23 PM, Enis Söztutar <[EMAIL PROTECTED]> wrote:

> Bryan,
>
> 3.6x improvement seems exciting. The ballpark difference between HBase scan
> and hdfs scan is in that order, so it is expected I guess.
>
> I plan to get back to the trunk patch and add more tests next week. In the
> meantime, if you have any changes to the patch, please attach them.
>
> Enis
>
>
> On Mon, Jul 1, 2013 at 3:59 AM, lars hofhansl <[EMAIL PROTECTED]> wrote:
>
>> Absolutely.
>>
>>
>>
>> ----- Original Message -----
>> From: Ted Yu <[EMAIL PROTECTED]>
>> To: [EMAIL PROTECTED]
>> Cc:
>> Sent: Sunday, June 30, 2013 9:32 PM
>> Subject: Re: Poor HBase map-reduce scan performance
>>
>> Looking at the tail of HBASE-8369, there were some comments which are yet
>> to be addressed.
>>
>> I think trunk patch should be finalized before backporting.
>>
>> Cheers
>>
>> On Mon, Jul 1, 2013 at 12:23 PM, Bryan Keller <[EMAIL PROTECTED]> wrote:
>>
>>> I'll attach my patch to HBASE-8369 tomorrow.
>>>
>>> On Jun 28, 2013, at 10:56 AM, lars hofhansl <[EMAIL PROTECTED]> wrote:
>>>
>>>> If we can make a clean patch with minimal impact to existing code I
>>>> would be supportive of a backport to 0.94.
>>>>
>>>> -- Lars
>>>>
>>>>
>>>>
>>>> ----- Original Message -----
>>>> From: Bryan Keller <[EMAIL PROTECTED]>
>>>> To: [EMAIL PROTECTED]; lars hofhansl <[EMAIL PROTECTED]>
>>>> Cc:
>>>> Sent: Tuesday, June 25, 2013 1:56 AM
>>>> Subject: Re: Poor HBase map-reduce scan performance
>>>>
>>>> I tweaked Enis's snapshot input format and backported it to 0.94.6 and
>>>> have snapshot scanning functional on my system. Performance is
>>>> dramatically better, as expected I suppose. I'm seeing about 3.6x faster
>>>> performance vs TableInputFormat. Also, HBase doesn't get bogged down
>>>> during a scan as the regionserver is being bypassed. I'm very excited by
>>>> this. There are some issues with file permissions and library
>>>> dependencies but nothing that can't be worked out.
>>>>
>>>> On Jun 5, 2013, at 6:03 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> That's exactly the kind of pre-fetching I was investigating a bit ago
>>>>> (made a patch, but ran out of time).
>>>>> This pre-fetching is strictly client only, where the client keeps the
>>>>> server busy while it is processing the previous batch, by filling up a
>>>>> 2nd buffer.
>>>>>
>>>>>
>>>>> -- Lars
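The client-side pre-fetching Lars describes can be illustrated with a small sketch. This is not his patch and not HBase code: `fetch_batch` is a hypothetical callable standing in for one scanner RPC (it returns the next batch, or `None` when the scan is exhausted), and `depth` is an assumed knob for how many fetched batches may wait in the second buffer. A background thread keeps requesting batches while the caller is still processing the previous one.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the scan


def prefetching_scan(fetch_batch, depth=1):
    """Yield batches while a background thread fetches the next ones.

    fetch_batch: hypothetical callable standing in for one scanner RPC;
                 returns a batch, or None when there is nothing left.
    depth:       how many fetched batches may sit in the buffer (assumed knob).
    """
    buf = queue.Queue(maxsize=depth)

    def producer():
        try:
            while True:
                batch = fetch_batch()
                if batch is None:
                    break
                buf.put(batch)  # blocks when the buffer is full
            buf.put(_DONE)
        except Exception as exc:  # hand fetch errors to the consumer
            buf.put(exc)

    threading.Thread(target=producer, daemon=True).start()

    while True:
        item = buf.get()
        if item is _DONE:
            return
        if isinstance(item, Exception):
            raise item
        yield item


if __name__ == "__main__":
    fake_batches = iter([["row1", "row2"], ["row3"], None])
    for batch in prefetching_scan(lambda: next(fake_batches)):
        print(batch)
```

The bounded queue is what keeps this "strictly client only": the server sees ordinary batch requests, just issued earlier than a purely sequential client would issue them.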
>>>>>
>>>>>
>>>>>
>>>>> ________________________________
>>>>> From: Sandy Pratt <[EMAIL PROTECTED]>
>>>>> To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
>>>>> Sent: Wednesday, June 5, 2013 10:58 AM
>>>>> Subject: Re: Poor HBase map-reduce scan performance
>>>>>
>>>>>
>>>>> Yong,
>>>>>
>>>>> As a thought experiment, imagine how it impacts the throughput of TCP
>>>>> to keep the window size at 1.  That means there's only one packet in
>>>>> flight at a time, and total throughput is a fraction of what it could
>>>>> be.
>>>>>
>>>>> That's effectively what happens with RPC.  The server sends a batch,
>>>>> then does nothing while it waits for the client to ask for more.
>>>>> During that time, the pipe between them is empty.  Increasing the
>>>>> batch size can help a bit, in essence creating a really huge packet,
>>>>> but the problem remains.  There will always be stalls in the pipe.
>>>>>
>>>>> What you want is for the window size to be large enough that the pipe
>>>>> is saturated.  A streaming API accomplishes that by stuffing data down
>>>>> the network pipe as quickly as possible.
>>>>>
>>>>> Sandy
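Sandy's window-size argument can be put into rough numbers with a toy model. Everything here is an assumption made for illustration, not a measurement: each batch request is taken to cost one fixed round trip during which no data flows, the link moves rows at a constant rate, and the function and parameter names are invented for this sketch.

```python
def rpc_scan_time(total_rows, batch_rows, round_trip_s, rows_per_s):
    """Seconds for a request/response scan with one batch in flight.

    Every batch pays one idle round trip (the stall in the pipe) on top of
    the time needed to actually move the rows.
    """
    batches = -(-total_rows // batch_rows)  # ceiling division
    return batches * round_trip_s + total_rows / rows_per_s


def streaming_scan_time(total_rows, round_trip_s, rows_per_s):
    """Seconds for a streamed scan: one initial round trip, then a full pipe."""
    return round_trip_s + total_rows / rows_per_s


if __name__ == "__main__":
    # Hypothetical figures: 1M rows, 1 ms round trip, 1M rows/s link rate.
    rows, rtt, rate = 1_000_000, 0.001, 1_000_000.0
    for batch in (1_000, 10_000):
        print(batch, rpc_scan_time(rows, batch, rtt, rate))
    print("streaming", streaming_scan_time(rows, rtt, rate))
```

Under these assumed figures, a tenfold larger batch removes most of the per-batch stalls but never all of them, which matches the point that a bigger batch helps a bit while the stalls remain.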
>>>>>
>>>>> On 6/5/13 7:55 AM, "yonghu" <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> Can anyone explain why client + rpc + server will decrease the
>>>>>> performance of scanning? I mean the Regionserver and Tasktracker are
>>>>>> the same node when you use MapReduce to scan the HBase table. So, in
>>>>>> my understanding,