Re: Poor HBase map-reduce scan performance
Nicolas Liochon 2013-05-02, 18:00
You can try YourKit; they have evaluation licenses. There is one gotcha:
some classes are excluded from profiling by default, and this includes
org.apache.*. So you need to change the default config when using it with HBase.
On Thu, May 2, 2013 at 7:54 PM, Bryan Keller <[EMAIL PROTECTED]> wrote:

> I ran one of my regionservers through VisualVM. It looks like the top hot
> spots are HFileReaderV2$ScannerV2.getKeyValue() and ByteBuffer.allocate().
> At first glance it appears that memory allocation may be an issue.
> Decompression was next below that, but it seems to be less of an issue.
>
> Would changing the block size, either HDFS or HBase, help here?
>
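
For what it's worth, the HBase block size is set per column family. Below is
a minimal sketch of changing it with the 0.94-era client API; the table name
"mytable", family "d", and the 16KB value are illustrative assumptions, not
values from this thread. Smaller data blocks mean less data is read and
decompressed per random lookup, while for sequential scans larger blocks are
generally preferable.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class BlockSizeExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);
            // Describe family "d" with a 16KB HFile block size (the
            // default is 64KB). Note that modifyColumn replaces the whole
            // family descriptor, so any other non-default settings on the
            // family would need to be re-applied here as well.
            HColumnDescriptor cf = new HColumnDescriptor("d");
            cf.setBlocksize(16 * 1024);
            admin.disableTable("mytable");
            admin.modifyColumn("mytable", cf);
            admin.enableTable("mytable");
        }
    }
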
> Also, if anyone has tips on how else to profile, that would be
> appreciated. VisualVM can produce a lot of noise that is hard to sift
> through.
>
>
> On May 1, 2013, at 9:49 PM, Bryan Keller <[EMAIL PROTECTED]> wrote:
>
> > I used exactly 0.94.4, pulled from the tag in Subversion.
> >
> > On May 1, 2013, at 9:41 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:
> >
> >> Hmm... Did you actually use exactly version 0.94.4, or the latest
> >> 0.94.7?
> >> I would be very curious to see profiling data.
> >>
> >> -- Lars
> >>
> >>
> >>
> >> ----- Original Message -----
> >> From: Bryan Keller <[EMAIL PROTECTED]>
> >> To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
> >> Cc:
> >> Sent: Wednesday, May 1, 2013 6:01 PM
> >> Subject: Re: Poor HBase map-reduce scan performance
> >>
> >> I tried running my test with 0.94.4; unfortunately, performance was
> >> about the same. I'm planning on profiling the regionserver and trying
> >> some other things tonight and tomorrow, and will report back.
> >>
> >> On May 1, 2013, at 8:00 AM, Bryan Keller <[EMAIL PROTECTED]> wrote:
> >>
> >>> Yes, I would like to try this; if you can point me to the pom.xml
> >>> patch, that would save me some time.
> >>>
> >>> On Tuesday, April 30, 2013, lars hofhansl wrote:
> >>> If you can, try 0.94.4+; it should significantly reduce the number of
> >>> bytes copied around in RAM during scanning, especially if you have wide
> >>> rows and/or large key portions. That in turn makes scans scale better
> >>> across cores, since RAM is a shared resource between cores (much like disk).
> >>>
> >>>
> >>> It's not hard to build the latest HBase against Cloudera's version of
> >>> Hadoop. I can send along a simple patch to pom.xml to do that.
> >>>
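
The patch itself isn't included in the thread. As a rough sketch of the
usual approach, the build is pointed at Cloudera's Hadoop artifacts by
overriding the Hadoop version property and adding Cloudera's Maven
repository; the property name and version string below are assumptions,
not Lars's actual patch.

    <!-- hypothetical pom.xml fragment -->
    <properties>
      <hadoop.version>2.0.0-cdh4.2.1</hadoop.version>
    </properties>
    <repositories>
      <repository>
        <id>cloudera</id>
        <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
      </repository>
    </repositories>
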
> >>> -- Lars
> >>>
> >>>
> >>>
> >>> ________________________________
> >>>  From: Bryan Keller <[EMAIL PROTECTED]>
> >>> To: [EMAIL PROTECTED]
> >>> Sent: Tuesday, April 30, 2013 11:02 PM
> >>> Subject: Re: Poor HBase map-reduce scan performance
> >>>
> >>>
> >>> The table has hashed keys, so rows are evenly distributed among the
> >>> regionservers, and the load on each regionserver is pretty much the
> >>> same. I also have per-table balancing turned on. I get mostly data-local
> >>> mappers, with only a few rack-local (maybe 10 of the 250 mappers).
> >>>
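
A minimal sketch of the hash-prefixed ("salted") key scheme described
above, assuming a fixed-width prefix; the 4-byte MD5 prefix is an
illustrative choice, not necessarily what Bryan used.

    import java.security.MessageDigest;

    public class SaltedRowKey {
        // Prepend a few bytes of a hash of the logical key so that
        // lexicographically adjacent logical keys are spread evenly
        // across regions (and thus regionservers).
        public static byte[] rowKey(byte[] logicalKey) throws Exception {
            byte[] hash = MessageDigest.getInstance("MD5").digest(logicalKey);
            byte[] row = new byte[4 + logicalKey.length];
            System.arraycopy(hash, 0, row, 0, 4);  // 4-byte hash prefix
            System.arraycopy(logicalKey, 0, row, 4, logicalKey.length);
            return row;
        }
    }
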
> >>> Currently the table uses a wide-table schema, with lists of data
> >>> structures stored as columns, using column prefixes to group the data
> >>> structures (e.g. 1_name, 1_address, 1_city, 2_name, 2_address, 2_city).
> >>> I was thinking of moving those data structures to protobuf, which would
> >>> cut down on the number of columns. The downside is that I can't filter
> >>> on a single value with that, but it is a tradeoff I would make for
> >>> performance. I was also considering restructuring the table into a tall
> >>> table.
> >>>
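
To make the wide-row layout concrete, here is a hypothetical write using
the 0.94 client API; the family name "d", the Struct fields, and the
helper itself are assumptions for illustration, not Bryan's schema code.

    import java.util.List;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class WideRowWriter {
        static class Struct { String name, address, city; }

        // One row holds a list of structures; the numeric qualifier
        // prefix ("1_", "2_", ...) groups the fields of each structure.
        static void write(HTable table, byte[] rowKey, List<Struct> structs)
                throws Exception {
            Put put = new Put(rowKey);
            byte[] fam = Bytes.toBytes("d");
            for (int i = 0; i < structs.size(); i++) {
                Struct s = structs.get(i);
                String p = (i + 1) + "_";
                put.add(fam, Bytes.toBytes(p + "name"), Bytes.toBytes(s.name));
                put.add(fam, Bytes.toBytes(p + "address"), Bytes.toBytes(s.address));
                put.add(fam, Bytes.toBytes(p + "city"), Bytes.toBytes(s.city));
            }
            table.put(put);
        }
    }
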
> >>> Something interesting is that my old regionserver machines had five
> >>> 15k SCSI drives instead of two SSDs, and performance was about the same.
> >>> Also, my old network was 1Gbit; now it is 10Gbit. So neither network nor
> >>> disk I/O appears to be the bottleneck. CPU usage on the regionserver is
> >>> rather high, so it seems like the best candidate to investigate. I will
> >>> try profiling it tomorrow and will report back. I may also revisit
> >>> compression on vs. off, since that is adding load to the CPU.
> >>>
> >>> I'll also come up with a sample program that generates data similar to
> >>> my table.
> >>>
>