HBase, mail # user - Poor HBase map-reduce scan performance


Re: Poor HBase map-reduce scan performance
yonghu 2013-06-05, 18:14
Dear Sandy,

Thanks for your explanation.

However, what I don't understand is your term "client": does "client" mean
the MapReduce job? If I understand you correctly, the Map function processes
the tuples, and during this processing time the regionserver does nothing?

regards!

Yong
On Wed, Jun 5, 2013 at 6:12 PM, Ted Yu <[EMAIL PROTECTED]> wrote:

> bq. the Regionserver and Tasktracker are the same node when you use
> MapReduce to scan the HBase table.
>
> The scan performed by the Tasktracker on that node would very likely access
> data hosted by region server on other node(s). So there would be RPC
> involved.
>
> There is some discussion on providing shadow reads - writes to specific
> region are solely served by one region server but the reads can be served
> by more than one region server. Of course consistency is one aspect that
> must be tackled.
>
> Cheers
>
> On Wed, Jun 5, 2013 at 7:55 AM, yonghu <[EMAIL PROTECTED]> wrote:
>
> > Can anyone explain why client + rpc + server will decrease the
> > performance of scanning? I mean the Regionserver and Tasktracker are
> > the same node when you use MapReduce to scan the HBase table. So, in my
> > understanding, there will be no rpc cost.
> >
> > Thanks!
> >
> > Yong
> >
> >
> > On Wed, Jun 5, 2013 at 10:09 AM, Sandy Pratt <[EMAIL PROTECTED]> wrote:
> >
> > > https://issues.apache.org/jira/browse/HBASE-8691
> > >
> > >
> > > On 6/4/13 6:11 PM, "Sandy Pratt" <[EMAIL PROTECTED]> wrote:
> > >
> > > > >Haven't had a chance to write a JIRA yet, but I thought I'd pop in here
> > > > >with an update in the meantime.
> > > >
> > > > >I tried a number of different approaches to eliminate latency and
> > > > >"bubbles" in the scan pipeline, and eventually arrived at adding a
> > > > >streaming scan API to the region server, along with refactoring the scan
> > > > >interface into an event-driven message receiver interface.  In so doing, I
> > > > >was able to take scan speed on my cluster from 59,537 records/sec with the
> > > > >classic scanner to 222,703 records per second with my new scan API.
> > > > >Needless to say, I'm pleased ;)
> > > >
> > > >More details forthcoming when I get a chance.
> > > >
> > > >Thanks,
> > > >Sandy
> > > >
> > > >On 5/23/13 3:47 PM, "Ted Yu" <[EMAIL PROTECTED]> wrote:
> > > >
> > > >>Thanks for the update, Sandy.
> > > >>
> > > > >>If you can open a JIRA and attach your producer / consumer scanner there,
> > > > >>that would be great.
> > > >>
> > > > >>On Thu, May 23, 2013 at 3:42 PM, Sandy Pratt <[EMAIL PROTECTED]> wrote:
> > > >>
> > > > >>> I wrote myself a Scanner wrapper that uses a producer/consumer queue to
> > > > >>> keep the client fed with a full buffer as much as possible.  When scanning
> > > > >>> my table with scanner caching at 100 records, I see about a 24% uplift in
> > > > >>> performance (~35k records/sec with the ClientScanner and ~44k records/sec
> > > > >>> with my P/C scanner).  However, when I set scanner caching to 5000, it's
> > > > >>> more of a wash compared to the standard ClientScanner: ~53k records/sec
> > > > >>> with the ClientScanner and ~60k records/sec with the P/C scanner.
> > > >>>
> > > > >>> I'm not sure what to make of those results.  I think next I'll shut down
> > > > >>> HBase and read the HFiles directly, to see if there's a drop off in
> > > > >>> performance between reading them directly vs. via the RegionServer.
> > > >>>
> > > > >>> I still think that to really solve this there needs to be a sliding window
> > > > >>> of records in flight between disk and RS, and between RS and client.  I'm
> > > > >>> thinking there's probably a single batch of records in flight between RS
> > > > >>> and client at the moment.
> > > >>>
> > > >>> Sandy
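
The producer/consumer wrapper described in the message above was not posted to the thread, so the following is only a minimal sketch of the general idea, not Sandy's implementation. A background thread drains a slow source iterator into a bounded queue so the consumer rarely has to wait on a round trip. The class name PrefetchingIterator and the capacity parameter are hypothetical, and the code is written against a plain java.util.Iterator rather than any HBase class.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/**
 * Hypothetical helper (not part of HBase): prefetches items from a slow
 * source on a background thread so the consumer finds a filled buffer.
 * The source must not yield null, since the queue rejects null elements.
 */
class PrefetchingIterator<T> implements Iterator<T> {
    // Sentinel enqueued once the source is exhausted or has failed.
    private static final Object END = new Object();

    private final BlockingQueue<Object> queue;
    private volatile Throwable failure; // set by the producer on error
    private Object next;                // lookahead slot; null = not fetched

    PrefetchingIterator(Iterator<T> source, int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        Thread producer = new Thread(() -> {
            try {
                try {
                    while (source.hasNext()) {
                        queue.put(source.next()); // blocks while the buffer is full
                    }
                } catch (RuntimeException e) {
                    failure = e; // surfaced to the consumer when it reaches END
                }
                queue.put(END);
            } catch (InterruptedException e) {
                // Nothing in this sketch interrupts the producer; restore the flag.
                Thread.currentThread().interrupt();
            }
        }, "prefetch-producer");
        producer.setDaemon(true);
        producer.start();
    }

    @Override
    public boolean hasNext() {
        if (next == null) {
            try {
                next = queue.take(); // waits only if the producer has fallen behind
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        if (next == END) {
            if (failure != null) {
                throw new IllegalStateException("source failed", failure);
            }
            return false;
        }
        return true;
    }

    @Override
    @SuppressWarnings("unchecked")
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        T item = (T) next;
        next = null;
        return item;
    }
}
```

In the HBase case the source would presumably be the iterator of a ResultScanner, with the capacity playing a role similar to the scanner caching value. A wrapper like this only overlaps client-side processing with fetching; it does not put more than one batch in flight between the region server and the client, which may be why the gain shrinks at a caching value of 5000.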
> > > >>>
> > > >>> On 5/23/13 8:45 AM, "Bryan Keller" <[EMAIL PROTECTED]> wrote:
> > > >>>
> > > > >>> >I am considering scanning a snapshot instead of the table. I believe this
> > > > >>> >is what the ExportSnapshot class does. If I could use the scanning