HBase, mail # user - Poor HBase map-reduce scan performance


Re: Poor HBase map-reduce scan performance
lars hofhansl 2013-06-06, 01:03
That's exactly the kind of pre-fetching I was investigating a bit ago (made a patch, but ran out of time).
This pre-fetching is strictly client-only: the client keeps the server busy while it is processing the previous batch, by filling up a 2nd buffer.
-- Lars
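
For illustration only, here is a minimal sketch of that kind of client-side pre-fetching. It is not the patch mentioned above and does not use the HBase client API: `fetch_batches` is a hypothetical stand-in for a scanner RPC, and `prefetching` and `depth` are names made up for this example. A background thread fills a bounded buffer while the caller works on the previous batch.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the scan


def fetch_batches(total_rows, batch_size):
    """Hypothetical stand-in for a scanner RPC: yields lists of row ids."""
    for start in range(0, total_rows, batch_size):
        yield list(range(start, min(start + batch_size, total_rows)))


def prefetching(batches, depth=1):
    """Yield batches while a background thread fetches up to `depth` ahead."""
    buf = queue.Queue(maxsize=depth)

    def producer():
        try:
            for batch in batches:
                buf.put(batch)  # blocks while the buffer is full
            buf.put(_DONE)
        except Exception as exc:  # hand fetch errors over to the consumer
            buf.put(exc)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buf.get()
        if item is _DONE:
            return
        if isinstance(item, Exception):
            raise item
        yield item


# The consumer sees the same rows in the same order as without pre-fetching.
rows = [r for batch in prefetching(fetch_batches(10, 4)) for r in batch]
```

With `depth=1` this is the "2nd buffer" arrangement: one batch being processed, one being filled. The sketch does not handle a consumer that stops early; in that case the producer thread simply stays blocked until the process exits.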

________________________________
 From: Sandy Pratt <[EMAIL PROTECTED]>
To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
Sent: Wednesday, June 5, 2013 10:58 AM
Subject: Re: Poor HBase map-reduce scan performance
 

Yong,

As a thought experiment, imagine how it impacts the throughput of TCP to
keep the window size at 1.  That means there's only one packet in flight
at a time, and total throughput is a fraction of what it could be.

That's effectively what happens with RPC.  The server sends a batch, then
does nothing while it waits for the client to ask for more.  During that
time, the pipe between them is empty.  Increasing the batch size can help
a bit, in essence creating a really huge packet, but the problem remains.
There will always be stalls in the pipe.

What you want is for the window size to be large enough that the pipe is
saturated.  A streaming API accomplishes that by stuffing data down the
network pipe as quickly as possible.
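
A rough back-of-the-envelope model of the difference, as a sketch only. The function names and the per-batch timings below are invented for illustration and are not measurements from this thread. It assumes each batch costs a fixed server time, network time and client time, and that a streaming API lets the three stages overlap fully.

```python
def request_response_time(batches, server_s, network_s, client_s):
    """Window of 1: every stage waits for the others, so costs add up."""
    return batches * (server_s + network_s + client_s)


def pipelined_time(batches, server_s, network_s, client_s):
    """Stages overlap: once the pipe is full, one batch completes per
    interval of the slowest stage."""
    fill = server_s + network_s + client_s
    slowest = max(server_s, network_s, client_s)
    return fill + (batches - 1) * slowest


# Hypothetical numbers: 100 batches, 10 ms server, 2 ms network, 10 ms client.
serial = request_response_time(100, 0.010, 0.002, 0.010)
streamed = pipelined_time(100, 0.010, 0.002, 0.010)
speedup = serial / streamed
```

Under these assumed timings the serial scan spends more than half its time with one side idle. A larger batch shrinks the per-row share of the network time but leaves the server and client taking turns, which is the stall described above.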

Sandy

On 6/5/13 7:55 AM, "yonghu" <[EMAIL PROTECTED]> wrote:

>Can anyone explain why client + rpc + server will decrease the performance
>of scanning? I mean the Regionserver and Tasktracker are the same node when
>you use MapReduce to scan the HBase table. So, in my understanding, there
>will be no rpc cost.
>
>Thanks!
>
>Yong
>
>
>On Wed, Jun 5, 2013 at 10:09 AM, Sandy Pratt <[EMAIL PROTECTED]> wrote:
>
>> https://issues.apache.org/jira/browse/HBASE-8691
>>
>>
>> On 6/4/13 6:11 PM, "Sandy Pratt" <[EMAIL PROTECTED]> wrote:
>>
>> >Haven't had a chance to write a JIRA yet, but I thought I'd pop in here
>> >with an update in the meantime.
>> >
>> >I tried a number of different approaches to eliminate latency and
>> >"bubbles" in the scan pipeline, and eventually arrived at adding a
>> >streaming scan API to the region server, along with refactoring the scan
>> >interface into an event-driven message receiver interface.  In so doing, I
>> >was able to take scan speed on my cluster from 59,537 records/sec with the
>> >classic scanner to 222,703 records/sec with my new scan API.
>> >Needless to say, I'm pleased ;)
>> >
>> >More details forthcoming when I get a chance.
>> >
>> >Thanks,
>> >Sandy
>> >
>> >On 5/23/13 3:47 PM, "Ted Yu" <[EMAIL PROTECTED]> wrote:
>> >
>> >>Thanks for the update, Sandy.
>> >>
>> >>If you can open a JIRA and attach your producer / consumer scanner there,
>> >>that would be great.
>> >>
>> >>On Thu, May 23, 2013 at 3:42 PM, Sandy Pratt <[EMAIL PROTECTED]> wrote:
>> >>
>> >>> I wrote myself a Scanner wrapper that uses a producer/consumer queue to
>> >>> keep the client fed with a full buffer as much as possible.  When scanning
>> >>> my table with scanner caching at 100 records, I see about a 24% uplift in
>> >>> performance (~35k records/sec with the ClientScanner and ~44k records/sec
>> >>> with my P/C scanner).  However, when I set scanner caching to 5000, it's
>> >>> more of a wash compared to the standard ClientScanner: ~53k records/sec
>> >>> with the ClientScanner and ~60k records/sec with the P/C scanner.
>> >>>
>> >>> I'm not sure what to make of those results.  I think next I'll shut down
>> >>> HBase and read the HFiles directly, to see if there's a drop off in
>> >>> performance between reading them directly vs. via the RegionServer.
>> >>>
>> >>> I still think that to really solve this there needs to be a sliding window
>> >>> of records in flight between disk and RS, and between RS and client.  I'm
>> >>> thinking there's probably a single batch of records in flight between RS
>> >>> and client at the moment.
>> >>>
>> >>> Sandy
>> >>>
>> >>> On 5/23/13 8:45 AM, "Bryan Keller" <[EMAIL PROTECTED]> wrote:
>>