Re: Poor HBase map-reduce scan performance
Absolutely.

----- Original Message -----
From: Ted Yu <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Cc:
Sent: Sunday, June 30, 2013 9:32 PM
Subject: Re: Poor HBase map-reduce scan performance

Looking at the tail of HBASE-8369, there were some comments which are yet
to be addressed.

I think trunk patch should be finalized before backporting.

Cheers

On Mon, Jul 1, 2013 at 12:23 PM, Bryan Keller <[EMAIL PROTECTED]> wrote:

> I'll attach my patch to HBASE-8369 tomorrow.
>
> On Jun 28, 2013, at 10:56 AM, lars hofhansl <[EMAIL PROTECTED]> wrote:
>
> > If we can make a clean patch with minimal impact to existing code I
> > would be supportive of a backport to 0.94.
> >
> > -- Lars
> >
> >
> >
> > ----- Original Message -----
> > From: Bryan Keller <[EMAIL PROTECTED]>
> > To: [EMAIL PROTECTED]; lars hofhansl <[EMAIL PROTECTED]>
> > Cc:
> > Sent: Tuesday, June 25, 2013 1:56 AM
> > Subject: Re: Poor HBase map-reduce scan performance
> >
> > I tweaked Enis's snapshot input format and backported it to 0.94.6 and
> > have snapshot scanning functional on my system. Performance is dramatically
> > better, as expected I suppose. I'm seeing about 3.6x faster performance vs
> > TableInputFormat. Also, HBase doesn't get bogged down during a scan as the
> > regionserver is being bypassed. I'm very excited by this. There are some
> > issues with file permissions and library dependencies but nothing that
> > can't be worked out.
> >
> > On Jun 5, 2013, at 6:03 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:
> >
> >> That's exactly the kind of pre-fetching I was investigating a bit ago
> >> (made a patch, but ran out of time).
> >> This pre-fetching is strictly client only, where the client keeps the
> >> server busy while it is processing the previous batch, by filling up a 2nd
> >> buffer.
> >>
> >>
> >> -- Lars
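The client-side pre-fetching described above can be sketched roughly as follows. This is a minimal illustration in Python, not the HBase client code and not the patch mentioned in the message. `fetch_batch` is a hypothetical stand-in for one scanner round trip: it is assumed to return the next list of rows, or an empty list once the scan is exhausted.

```python
from concurrent.futures import ThreadPoolExecutor


def prefetching_scan(fetch_batch):
    """Yield rows one at a time while the next batch loads in the background.

    fetch_batch is a hypothetical callable (an assumption for this sketch):
    it returns the next list of rows, or an empty list when the scan is done.
    """
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Request the first batch before the caller asks for any rows.
        pending = pool.submit(fetch_batch)
        while True:
            batch = pending.result()  # block until the in-flight batch arrives
            if not batch:
                return
            # Start filling the second buffer right away, so the server is
            # working while the caller processes the batch just received.
            pending = pool.submit(fetch_batch)
            for row in batch:
                yield row


if __name__ == "__main__":
    # Fake scanner for demonstration: three batches, then exhaustion.
    fake_batches = iter([["r1", "r2"], ["r3"], []])
    for row in prefetching_scan(lambda: next(fake_batches, [])):
        print(row)
```

Only one extra batch is ever in flight, so memory use is bounded at roughly two batches. The server still idles whenever the client takes longer to process a batch than the server takes to produce the next one.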
> >>
> >>
> >>
> >> ________________________________
> >> From: Sandy Pratt <[EMAIL PROTECTED]>
> >> To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
> >> Sent: Wednesday, June 5, 2013 10:58 AM
> >> Subject: Re: Poor HBase map-reduce scan performance
> >>
> >>
> >> Yong,
> >>
> >> As a thought experiment, imagine how it impacts the throughput of TCP to
> >> keep the window size at 1.  That means there's only one packet in flight
> >> at a time, and total throughput is a fraction of what it could be.
> >>
> >> That's effectively what happens with RPC.  The server sends a batch, then
> >> does nothing while it waits for the client to ask for more.  During that
> >> time, the pipe between them is empty.  Increasing the batch size can help
> >> a bit, in essence creating a really huge packet, but the problem remains.
> >> There will always be stalls in the pipe.
> >>
> >> What you want is for the window size to be large enough that the pipe is
> >> saturated.  A streaming API accomplishes that by stuffing data down the
> >> network pipe as quickly as possible.
> >>
> >> Sandy
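The window-size argument above can be put into rough numbers. The sketch below is a back-of-the-envelope model in Python, not a measurement of HBase. It assumes each batch costs a fixed stall (the time the pipe sits empty while the client asks for more) plus the time to transfer the batch. The bandwidth and stall values are illustrative assumptions only.

```python
def request_response_throughput(batch_bytes, stall_s, bandwidth_bps):
    """Bytes per second when the server idles for stall_s after every batch."""
    transfer_s = batch_bytes / bandwidth_bps
    return batch_bytes / (stall_s + transfer_s)


def pipe_utilization(batch_bytes, stall_s, bandwidth_bps):
    """Fraction of the pipe's capacity used (1.0 would be a saturated pipe)."""
    return request_response_throughput(batch_bytes, stall_s, bandwidth_bps) / bandwidth_bps


if __name__ == "__main__":
    bandwidth = 125_000_000  # assumed: 1 Gbit/s expressed in bytes per second
    stall = 0.001            # assumed: 1 ms of empty pipe per batch
    for batch in (64 * 1024, 1024 * 1024, 16 * 1024 * 1024):
        used = pipe_utilization(batch, stall, bandwidth)
        print(f"batch={batch:>10} bytes  utilization={used:.3f}")
```

In this model utilization rises with batch size but never reaches 1.0, which matches the point that bigger batches help while the stalls remain. A streaming sender drives the stall term toward zero instead.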
> >>
> >> On 6/5/13 7:55 AM, "yonghu" <[EMAIL PROTECTED]> wrote:
> >>
> >>> Can anyone explain why client + rpc + server will decrease the performance
> >>> of scanning? I mean the Regionserver and Tasktracker are the same node when
> >>> you use MapReduce to scan the HBase table. So, in my understanding, there
> >>> will be no rpc cost.
> >>>
> >>> Thanks!
> >>>
> >>> Yong
> >>>
> >>>
> >>> On Wed, Jun 5, 2013 at 10:09 AM, Sandy Pratt <[EMAIL PROTECTED]> wrote:
> >>>
> >>>> https://issues.apache.org/jira/browse/HBASE-8691
> >>>>
> >>>>
> >>>> On 6/4/13 6:11 PM, "Sandy Pratt" <[EMAIL PROTECTED]> wrote:
> >>>>
> >>>>> Haven't had a chance to write a JIRA yet, but I thought I'd pop in here
> >>>>> with an update in the meantime.
> >>>>>
> >>>>> I tried a number of different approaches to eliminate latency and
> >>>>> "bubbles" in the scan pipeline, and eventually arrived at adding a
> >>>>> streaming scan API to the region server, along with refactoring the scan
> >>>>> interface into an event-driven message receiver interface.  In so doing, I
> >>