HBase >> mail # user >> Poor HBase map-reduce scan performance


Re: Poor HBase map-reduce scan performance
Dear Sandy,

Thanks for your explanation.

However, what I don't get is your term "client": does "client" mean the
MapReduce job? If I understand you correctly, this means the map function
processes the tuples and, during this processing time, the regionserver does
nothing?

regards!

Yong
On Wed, Jun 5, 2013 at 6:12 PM, Ted Yu <[EMAIL PROTECTED]> wrote:

> bq. the Regionserver and Tasktracker are the same node when you use
> MapReduce to scan the HBase table.
>
> The scan performed by the Tasktracker on that node would very likely access
> data hosted by region server on other node(s). So there would be RPC
> involved.
>
> There is some discussion on providing shadow reads - writes to specific
> region are solely served by one region server but the reads can be served
> by more than one region server. Of course consistency is one aspect that
> must be tackled.
>
> Cheers
>
> On Wed, Jun 5, 2013 at 7:55 AM, yonghu <[EMAIL PROTECTED]> wrote:
>
> > Can anyone explain why client + rpc + server will decrease the performance
> > of scanning? I mean the Regionserver and Tasktracker are the same node when
> > you use MapReduce to scan the HBase table. So, in my understanding, there
> > will be no rpc cost.
> >
> > Thanks!
> >
> > Yong
> >
> >
> > On Wed, Jun 5, 2013 at 10:09 AM, Sandy Pratt <[EMAIL PROTECTED]> wrote:
> >
> > > https://issues.apache.org/jira/browse/HBASE-8691
> > >
> > >
> > > On 6/4/13 6:11 PM, "Sandy Pratt" <[EMAIL PROTECTED]> wrote:
> > >
> > > > >Haven't had a chance to write a JIRA yet, but I thought I'd pop in here
> > > > >with an update in the meantime.
> > > > >
> > > > >I tried a number of different approaches to eliminate latency and
> > > > >"bubbles" in the scan pipeline, and eventually arrived at adding a
> > > > >streaming scan API to the region server, along with refactoring the scan
> > > > >interface into an event-driven message receiver interface.  In so doing, I
> > > > >was able to take scan speed on my cluster from 59,537 records/sec with the
> > > > >classic scanner to 222,703 records per second with my new scan API.
> > > > >Needless to say, I'm pleased ;)
> > > >
> > > >More details forthcoming when I get a chance.
> > > >
> > > >Thanks,
> > > >Sandy
> > > >
> > > >On 5/23/13 3:47 PM, "Ted Yu" <[EMAIL PROTECTED]> wrote:
> > > >
> > > >>Thanks for the update, Sandy.
> > > >>
> > > > >>If you can open a JIRA and attach your producer / consumer scanner there,
> > > > >>that would be great.
> > > > >>
> > > > >>On Thu, May 23, 2013 at 3:42 PM, Sandy Pratt <[EMAIL PROTECTED]> wrote:
> > > >>
> > > > >>> I wrote myself a Scanner wrapper that uses a producer/consumer queue to
> > > > >>> keep the client fed with a full buffer as much as possible.  When scanning
> > > > >>> my table with scanner caching at 100 records, I see about a 24% uplift in
> > > > >>> performance (~35k records/sec with the ClientScanner and ~44k records/sec
> > > > >>> with my P/C scanner).  However, when I set scanner caching to 5000, it's
> > > > >>> more of a wash compared to the standard ClientScanner: ~53k records/sec
> > > > >>> with the ClientScanner and ~60k records/sec with the P/C scanner.
> > > > >>>
> > > > >>> I'm not sure what to make of those results.  I think next I'll shut down
> > > > >>> HBase and read the HFiles directly, to see if there's a drop off in
> > > > >>> performance between reading them directly vs. via the RegionServer.
> > > > >>>
> > > > >>> I still think that to really solve this there needs to be a sliding window
> > > > >>> of records in flight between disk and RS, and between RS and client.  I'm
> > > > >>> thinking there's probably a single batch of records in flight between RS
> > > > >>> and client at the moment.
> > > >>>
> > > >>> Sandy
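
[Editor's sketch] The producer/consumer scanner wrapper Sandy describes above is not included in the thread, so the following is only a minimal illustration of the general idea, not Sandy's code and not an HBase API. It assumes a plain java.util.Iterator stands in for the real scanner, that records are non-null, and that the class name PrefetchingIterator and its bufferSize parameter are hypothetical. A background thread keeps a bounded queue full while the caller consumes from it, so the consumer rarely waits on the source.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/**
 * Hypothetical wrapper (not part of HBase): a background producer thread
 * reads ahead from the source into a bounded queue so the consumer is
 * kept fed. Assumes the source never yields null.
 */
final class PrefetchingIterator<T> implements Iterator<T> {
    // Sentinel placed on the queue once the source is exhausted or has failed.
    private static final Object END = new Object();

    private final BlockingQueue<Object> queue;
    private volatile RuntimeException failure; // set by the producer on error
    private Object lookahead;                  // next element, END, or null if not fetched yet

    PrefetchingIterator(Iterator<T> source, int bufferSize) {
        this.queue = new ArrayBlockingQueue<>(bufferSize);
        Thread producer = new Thread(() -> produce(source), "prefetch-producer");
        producer.setDaemon(true);
        producer.start();
    }

    // Producer side: put() blocks whenever the buffer is full.
    private void produce(Iterator<T> source) {
        try {
            try {
                while (source.hasNext()) {
                    queue.put(source.next());
                }
            } catch (RuntimeException e) {
                failure = e; // surfaced to the consumer when it reaches END
            }
            queue.put(END);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // sketch only: no cancellation protocol
        }
    }

    // Consumer side: take() blocks only when the buffer is empty.
    @Override
    public boolean hasNext() {
        if (lookahead == null) {
            try {
                lookahead = queue.take();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for records", e);
            }
        }
        if (lookahead == END) {
            if (failure != null) {
                throw new IllegalStateException("source iterator failed", failure);
            }
            return false;
        }
        return true;
    }

    @Override
    @SuppressWarnings("unchecked")
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        T result = (T) lookahead;
        lookahead = null;
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for scan results; a real scanner would go here instead.
        List<String> rows = Arrays.asList("row-1", "row-2", "row-3");
        Iterator<String> it = new PrefetchingIterator<>(rows.iterator(), 2);
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}
```

With a larger scanner caching value the source already returns big batches per call, which may be one reason the gain Sandy reports shrinks at caching 5000; the sketch does not model that.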
> > > >>>
> > > >>> On 5/23/13 8:45 AM, "Bryan Keller" <[EMAIL PROTECTED]> wrote:
> > > >>>
> > > > >>> >I am considering scanning a snapshot instead of the table. I believe this
> > > > >>> >is what the ExportSnapshot class does. If I could use the scanning