HBase, mail # user - Poor HBase map-reduce scan performance


Re: Poor HBase map-reduce scan performance
Bryan Keller 2013-07-01, 04:23
I'll attach my patch to HBASE-8369 tomorrow.

On Jun 28, 2013, at 10:56 AM, lars hofhansl <[EMAIL PROTECTED]> wrote:

> If we can make a clean patch with minimal impact to existing code I would be supportive of a backport to 0.94.
>
> -- Lars
>
>
>
> ----- Original Message -----
> From: Bryan Keller <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]; lars hofhansl <[EMAIL PROTECTED]>
> Cc:
> Sent: Tuesday, June 25, 2013 1:56 AM
> Subject: Re: Poor HBase map-reduce scan performance
>
> I tweaked Enis's snapshot input format, backported it to 0.94.6, and have snapshot scanning functional on my system. Performance is dramatically better, as expected, I suppose: I'm seeing about 3.6x faster performance vs. TableInputFormat. Also, HBase doesn't get bogged down during a scan, since the regionserver is being bypassed. I'm very excited by this. There are some issues with file permissions and library dependencies, but nothing that can't be worked out.
>
> On Jun 5, 2013, at 6:03 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:
>
>> That's exactly the kind of pre-fetching I was investigating a bit ago (made a patch, but ran out of time).
>> This pre-fetching is strictly client-side: the client keeps the server busy by filling up a 2nd buffer while it is processing the previous batch.
>>
>>
>> -- Lars
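
The client-side pre-fetching described above can be sketched in a few lines. This is only an illustration of the general double-buffering idea, not Lars's patch and not the HBase client API: `prefetching` and `fetch_batches` are hypothetical names, and `fetch_batches` stands in for whatever call returns the next batch of rows from the server.

```python
import queue
import threading


def prefetching(fetch_batches, depth=1):
    """Yield batches from fetch_batches() while a background thread fetches ahead.

    fetch_batches is a hypothetical zero-argument callable returning an
    iterable of batches (a stand-in for the scanner's "next batch" RPC).
    With depth=1, one batch waits in the queue while the consumer is busy
    with the previous one, i.e. a second buffer is being filled.
    """
    buf = queue.Queue(maxsize=depth)

    def producer():
        try:
            for batch in fetch_batches():
                buf.put(("batch", batch))  # blocks when the buffer is full
            buf.put(("done", None))
        except Exception as exc:
            # Hand the failure to the consumer instead of losing it.
            buf.put(("error", exc))

    # Daemon thread: if the consumer stops early, the blocked producer
    # does not keep the process alive.
    threading.Thread(target=producer, daemon=True).start()

    while True:
        kind, value = buf.get()
        if kind == "done":
            return
        if kind == "error":
            raise value
        yield value


def fake_scan():
    """Hypothetical stand-in for a scanner: three batches of two rows."""
    for i in range(3):
        yield [2 * i, 2 * i + 1]


if __name__ == "__main__":
    for rows in prefetching(fake_scan):
        print(rows)
```

The bounded queue is the design choice that matters here: it lets the fetch for batch N+1 overlap with processing of batch N, while capping how much unprocessed data the client holds in memory.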
>>
>>
>>
>> ________________________________
>> From: Sandy Pratt <[EMAIL PROTECTED]>
>> To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
>> Sent: Wednesday, June 5, 2013 10:58 AM
>> Subject: Re: Poor HBase map-reduce scan performance
>>
>>
>> Yong,
>>
>> As a thought experiment, imagine how it impacts the throughput of TCP to
>> keep the window size at 1.  That means there's only one packet in flight
>> at a time, and total throughput is a fraction of what it could be.
>>
>> That's effectively what happens with RPC.  The server sends a batch, then
>> does nothing while it waits for the client to ask for more.  During that
>> time, the pipe between them is empty.  Increasing the batch size can help
>> a bit, in essence creating a really huge packet, but the problem remains.
>> There will always be stalls in the pipe.
>>
>> What you want is for the window size to be large enough that the pipe is
>> saturated.  A streaming API accomplishes that by stuffing data down the
>> network pipe as quickly as possible.
>>
>> Sandy
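
The window-size argument above can be made concrete with a rough back-of-the-envelope model. This is a sketch under simplifying assumptions that are not from the thread: each batch is followed by a fixed idle gap (`stall_s`, covering the round trip plus the client's processing time), the link otherwise runs at full bandwidth, and all function and parameter names are made up for the illustration.

```python
def request_response_throughput(batch_bytes, bandwidth_bytes_per_s, stall_s):
    """Bytes per second when every batch is followed by an idle gap.

    Assumed model: sending one batch takes batch_bytes / bandwidth seconds,
    then the pipe sits empty for stall_s seconds until the next request
    arrives at the server.
    """
    transfer_s = batch_bytes / bandwidth_bytes_per_s
    return batch_bytes / (transfer_s + stall_s)


def pipe_utilization(batch_bytes, bandwidth_bytes_per_s, stall_s):
    """Fraction of the link's bandwidth actually used (1.0 = saturated)."""
    achieved = request_response_throughput(
        batch_bytes, bandwidth_bytes_per_s, stall_s
    )
    return achieved / bandwidth_bytes_per_s


if __name__ == "__main__":
    bandwidth = 100e6  # assumed 100 MB/s link
    stall = 0.001      # assumed 1 ms idle gap per batch
    for batch in (10_000, 100_000, 1_000_000, 10_000_000):
        util = pipe_utilization(batch, bandwidth, stall)
        print(f"batch={batch:>10} bytes  utilization={util:.3f}")
```

In this model, utilization works out to transfer time divided by transfer time plus stall. Larger batches push it toward 1 but never reach it while the stall is nonzero, which matches the point that bigger batches help a bit but the stalls remain. A streaming API corresponds to driving the stall toward zero.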
>>
>> On 6/5/13 7:55 AM, "yonghu" <[EMAIL PROTECTED]> wrote:
>>
>>> Can anyone explain why client + RPC + server will decrease the performance
>>> of scanning? I mean the RegionServer and TaskTracker are on the same node
>>> when you use MapReduce to scan the HBase table. So, in my understanding,
>>> there will be no RPC cost.
>>>
>>> Thanks!
>>>
>>> Yong
>>>
>>>
>>> On Wed, Jun 5, 2013 at 10:09 AM, Sandy Pratt <[EMAIL PROTECTED]> wrote:
>>>
>>>> https://issues.apache.org/jira/browse/HBASE-8691
>>>>
>>>>
>>>> On 6/4/13 6:11 PM, "Sandy Pratt" <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> Haven't had a chance to write a JIRA yet, but I thought I'd pop in here
>>>>> with an update in the meantime.
>>>>>
>>>>> I tried a number of different approaches to eliminate latency and
>>>>> "bubbles" in the scan pipeline, and eventually arrived at adding a
>>>>> streaming scan API to the region server, along with refactoring the scan
>>>>> interface into an event-driven message receiver interface.  In so doing, I
>>>>> was able to take scan speed on my cluster from 59,537 records/sec with the
>>>>> classic scanner to 222,703 records/sec with my new scan API.
>>>>> Needless to say, I'm pleased ;)
>>>>>
>>>>> More details forthcoming when I get a chance.
>>>>>
>>>>> Thanks,
>>>>> Sandy
>>>>>
>>>>> On 5/23/13 3:47 PM, "Ted Yu" <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> Thanks for the update, Sandy.
>>>>>>
>>>>>> If you can open a JIRA and attach your producer / consumer scanner there,
>>>>>> that would be great.
>>>>>>
>>>>>> On Thu, May 23, 2013 at 3:42 PM, Sandy Pratt <[EMAIL PROTECTED]