Re: Poor HBase map-reduce scan performance
@Lars, how have you calculated the 35K/row size? I'm not able to arrive at
the same number.

@Bryan, Matt's idea below is good. With the Hadoop test you always had data
locality; with your HBase test, maybe not. Can you take a look at the JMX
console and tell us your locality %? Also, over those 45 minutes, have you
monitored the CPU iowait, GC activity, etc. to see if any of those might
have impacted performance?
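
A minimal sketch of checking locality programmatically, assuming the region
server exposes Hadoop's JMX JSON servlet on its info port (60030 by default
in 0.94); hdfsBlocksLocalityIndex is the 0.94-era metric name and may differ
in other versions:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    // Fetch the region server's JMX dump and print locality-related lines.
    public class LocalityCheck {
        public static void main(String[] args) throws Exception {
            // Assumes the default 0.94 info port; adjust host/port as needed.
            URL jmx = new URL("http://regionserver-host:60030/jmx");
            BufferedReader in = new BufferedReader(
                new InputStreamReader(jmx.openStream()));
            String line;
            while ((line = in.readLine()) != null) {
                // hdfsBlocksLocalityIndex: % of HFile blocks local to this RS
                if (line.contains("hdfsBlocksLocalityIndex")) {
                    System.out.println(line.trim());
                }
            }
            in.close();
        }
    }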

JM

2013/5/1 Matt Corgan <[EMAIL PROTECTED]>

> Not that it's a long-term solution, but try major-compacting before running
> the benchmark.  If the LSM tree is CPU bound in merging HFiles/KeyValues
> through the PriorityQueue, then reducing to a single file per region should
> help.  The merging of HFiles during a scan is not heavily optimized yet.
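
A minimal sketch of triggering that from the 0.94 client API; the table name
"mytable" is a stand-in, and note majorCompact is asynchronous, so give the
cluster time to finish before starting the benchmark:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class CompactBeforeBenchmark {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);
            // Request a major compaction; the call returns immediately,
            // so wait for compaction to complete before benchmarking.
            admin.majorCompact("mytable");
            admin.close();
        }
    }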
>
>
> On Tue, Apr 30, 2013 at 11:21 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:
>
> > If you can, try 0.94.4+; it should significantly reduce the amount of
> > bytes copied around in RAM during scanning, especially if you have wide
> > rows and/or large key portions. That in turn makes scans scale better
> > across cores, since RAM is a shared resource between cores (much like
> > disk).
> >
> >
> > It's not hard to build the latest HBase against Cloudera's version of
> > Hadoop. I can send along a simple patch to pom.xml to do that.
> >
> > -- Lars
> >
> >
> >
> > ________________________________
> >  From: Bryan Keller <[EMAIL PROTECTED]>
> > To: [EMAIL PROTECTED]
> > Sent: Tuesday, April 30, 2013 11:02 PM
> > Subject: Re: Poor HBase map-reduce scan performance
> >
> >
> > The table has hashed keys so rows are evenly distributed amongst the
> > regionservers, and load on each regionserver is pretty much the same. I
> > also have per-table balancing turned on. I get mostly data-local mappers
> > with only a few rack-local (maybe 10 of the 250 mappers).
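
For reference, one common way to get that even distribution (a minimal
sketch, not necessarily the scheme used here) is to prefix each key with a
couple of bytes of its hash:

    import java.security.MessageDigest;

    public class SaltedKey {
        // Prefix the original key with the first 2 bytes of its MD5 hash
        // so rows spread evenly across regions.
        public static byte[] salt(byte[] key) throws Exception {
            byte[] md5 = MessageDigest.getInstance("MD5").digest(key);
            byte[] salted = new byte[2 + key.length];
            System.arraycopy(md5, 0, salted, 0, 2);
            System.arraycopy(key, 0, salted, 2, key.length);
            return salted;
        }
    }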
> >
> > Currently the table is a wide table schema, with lists of data structures
> > stored as columns with column prefixes grouping the data structures (e.g.
> > 1_name, 1_address, 1_city, 2_name, 2_address, 2_city). I was thinking of
> > moving those data structures to protobuf, which would cut down on the
> > number of columns. The downside is I can't filter on one value with that,
> > but it is a tradeoff I would make for performance. I was also considering
> > restructuring the table into a tall table.
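
To make the tradeoff concrete, a minimal sketch of the two layouts against
the 0.94 Put API; the column family "d" and the serialized byte[] are
stand-ins, and any compact serialization (protobuf included) would do:

    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RowLayouts {
        static final byte[] FAM = Bytes.toBytes("d");

        // Wide layout: one column per field, a prefix groups each structure.
        static Put wide(byte[] rowKey, int i, String name, String city) {
            Put put = new Put(rowKey);
            put.add(FAM, Bytes.toBytes(i + "_name"), Bytes.toBytes(name));
            put.add(FAM, Bytes.toBytes(i + "_city"), Bytes.toBytes(city));
            return put;
        }

        // Packed layout: the whole structure in one cell. Fewer KeyValues
        // means less per-column overhead (each KeyValue repeats the rowkey),
        // but server-side filtering on individual fields is lost.
        static Put packed(byte[] rowKey, int i, byte[] serializedStruct) {
            Put put = new Put(rowKey);
            put.add(FAM, Bytes.toBytes(Integer.toString(i)), serializedStruct);
            return put;
        }
    }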
> >
> > Something interesting is that my old regionserver machines had five 15k
> > SCSI drives instead of two SSDs, and performance was about the same. Also,
> > my old network was 1gbit, now it is 10gbit. So neither network nor disk
> > I/O appears to be the bottleneck. CPU usage is rather high on the
> > regionservers, so that seems like the best candidate to investigate. I
> > will try profiling it tomorrow and will report back. I may also revisit
> > compression on vs. off, since that is adding load to the CPU.
> >
> > I'll also come up with a sample program that generates data similar to my
> > table.
> >
> >
> > On Apr 30, 2013, at 10:01 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:
> >
> > > Your average row is 35k so scanner caching would not make a huge
> > > difference, although I would have expected some improvements by setting
> > > it to 10 or 50 since you have a wide 10ge pipe.
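
For anyone following along, scanner caching is set per scan; a minimal
sketch, with block caching off as is usual for full scans from MapReduce:

    import org.apache.hadoop.hbase.client.Scan;

    public class ScanSettings {
        static Scan configure() {
            Scan scan = new Scan();
            // Fetch 50 rows per RPC instead of the 0.94 default of 1.
            scan.setCaching(50);
            // Full scans churn the block cache; leave it to other workloads.
            scan.setCacheBlocks(false);
            return scan;
        }
    }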
> > >
> > > I assume your table is split sufficiently to touch all RegionServers...
> > > Do you see the same load/IO on all region servers?
> > >
> > > A bunch of scan improvements have gone into HBase since 0.94.2.
> > > I blogged about some of these changes here:
> > > http://hadoop-hbase.blogspot.com/2012/12/hbase-profiling.html
> > >
> > > In your case - since you have many columns, each of which carries the
> > > rowkey - you might benefit a lot from HBASE-7279.
> > >
> > > In the end HBase *is* slower than straight HDFS for full scans. How
> > > could it not be?
> > > So I would start by looking at HDFS first. Make sure Nagle's is disabled
> > > in both HBase and HDFS.
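
These are normally set in hbase-site.xml and hdfs-site.xml; a minimal sketch
of the equivalent programmatic settings, assuming the 0.94-era property
names, which should be verified against your version:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class DisableNagles {
        static Configuration configure() {
            Configuration conf = HBaseConfiguration.create();
            // Property names as of the 0.94 era; check your release's docs.
            conf.setBoolean("hbase.ipc.client.tcpnodelay", true); // client RPC
            conf.setBoolean("ipc.server.tcpnodelay", true);       // server RPC
            return conf;
        }
    }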
> > >
> > > And lastly SSDs are somewhat new territory for HBase. Maybe Andy
> > > Purtell