Re: Poor HBase map-reduce scan performance

Regarding running raw scans on top of HFiles, you can try a version of the
patch attached at https://issues.apache.org/jira/browse/HBASE-8369, which
enables exactly this. Note, however, that the patch is against trunk.

In that patch, we open one region from the snapshot files in each record
reader and run the scan using an internal region scanner. Since this
bypasses the client + RPC + server daemon layers, it should give optimum
scan performance.
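
For reference, here is a minimal sketch of what such a raw HFile scan looks
like, assuming the 0.94-era reader API (HFile.createReader / HFileScanner);
the names may differ on trunk, and the store file path and class name are
placeholders, not code from the patch:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.hfile.CacheConfig;
import org.apache.hadoop.hbase.io.hfile.HFile;
import org.apache.hadoop.hbase.io.hfile.HFileScanner;

public class RawHFileScan {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    FileSystem fs = FileSystem.get(conf);
    // Placeholder: a store file such as /hbase/<table>/<region>/<family>/<hfile>
    Path storeFile = new Path(args[0]);

    HFile.Reader reader = HFile.createReader(fs, storeFile, new CacheConfig(conf));
    reader.loadFileInfo();
    // No block cache, no positional reads -- a plain sequential scan
    HFileScanner scanner = reader.getScanner(false, false);
    long count = 0;
    if (scanner.seekTo()) {            // position at the first KeyValue
      do {
        KeyValue kv = scanner.getKeyValue();
        count++;                       // replace with real per-KV work
      } while (scanner.next());
    }
    reader.close();
    System.out.println("Scanned " + count + " KeyValues");
  }
}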

There is also a tool called HFilePerformanceBenchmark that intends to
measure raw scan performance for HFiles. I've had to make a lot of changes
to get it workable, but it might be worth taking a look to see whether
there is any perf difference between scanning a sequence file from HDFS
vs scanning an HFile.
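
For the HDFS side of that comparison, a raw sequence file scan can be timed
with something as simple as the sketch below (assuming the Hadoop 1.x
SequenceFile.Reader constructor; the class name and path argument are
placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SeqFileScanTimer {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path(args[0]);   // placeholder: the exported sequence file

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    // Instantiate key/value holders of whatever types the file declares
    Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
    Writable val = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);

    long start = System.currentTimeMillis();
    long records = 0;
    while (reader.next(key, val)) {
      records++;                     // replace with real per-record work
    }
    reader.close();
    System.out.println(records + " records in "
        + (System.currentTimeMillis() - start) + " ms");
  }
}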

On Fri, May 24, 2013 at 10:50 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:

> Sorry. Haven't gotten to this, yet.
> Scanning in HBase being about 3x slower than straight HDFS is in the right
> ballpark, though. It has to do a bit more work.
> Generally, HBase is great at homing in on a subset (some 10-100m rows) of
> the data. Raw scan performance is not (yet) a strength of HBase.
> So with HDFS you get to 75% of the theoretical maximum read throughput;
> hence with HBase you get to 25% of the theoretical cluster-wide maximum
> disk throughput?
> -- Lars
> ----- Original Message -----
> From: Bryan Keller <[EMAIL PROTECTED]>
> Cc:
> Sent: Friday, May 10, 2013 8:46 AM
> Subject: Re: Poor HBase map-reduce scan performance
> FYI, I ran tests with compression on and off.
> With a plain HDFS sequence file and compression off, I am getting very
> good I/O numbers, roughly 75% of the theoretical max for reads. With Snappy
> compression on a sequence file, I/O speed is about 3x slower. However,
> the file size is 3x smaller, so it takes about the same time to scan.
> With HBase, the results are equivalent (just much slower than a sequence
> file). Scanning a compressed table is about 3x slower I/O than an
> uncompressed table, but the table is 3x smaller, so the time to scan is
> about the same. Scanning an HBase table takes about 3x as long as scanning
> the sequence file export of the table, either compressed or uncompressed.
> The sequence file export file size ends up being just barely larger than
> the table, either compressed or uncompressed.
> So in sum, compression slows down I/O 3x, but the file is 3x smaller so
> the time to scan is about the same. Adding in HBase slows things down
> another 3x. So I'm seeing 9x faster I/O scanning an uncompressed sequence
> file vs scanning a compressed table.
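> To put illustrative numbers on that (not my actual measurements): if the
> uncompressed sequence file scans at, say, 300 MB/s, the compressed one
> reads ~100 MB/s but is 1/3 the size, so the wall time matches. The
> compressed table also reads 1/3 the bytes but takes 3x as long, i.e.
> ~33 MB/s, and 300 / 33 is the 9x difference.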
> On May 8, 2013, at 10:15 AM, Bryan Keller <[EMAIL PROTECTED]> wrote:
> > Thanks for the offer, Lars! I haven't made much progress speeding things
> > up.
> >
> > I finally put together a test program that populates a table that is
> > similar to my production dataset. I have a readme that should describe
> > things, hopefully enough to make it usable. There is code to populate a
> > test table, code to scan the table, and code to scan sequence files from an
> > export (to compare HBase w/ raw HDFS). I use a Gradle build script.
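> >
> > For context, the kind of client-side scan at issue boils down to a loop
> > like this (a minimal sketch using the 0.94 client API; the table name
> > and caching value are placeholders, not necessarily what the test
> > program uses):
> >
> > import org.apache.hadoop.conf.Configuration;
> > import org.apache.hadoop.hbase.HBaseConfiguration;
> > import org.apache.hadoop.hbase.client.HTable;
> > import org.apache.hadoop.hbase.client.Result;
> > import org.apache.hadoop.hbase.client.ResultScanner;
> > import org.apache.hadoop.hbase.client.Scan;
> >
> > public class TableScanTimer {
> >   public static void main(String[] args) throws Exception {
> >     Configuration conf = HBaseConfiguration.create();
> >     HTable table = new HTable(conf, args[0]);  // placeholder table name
> >     Scan scan = new Scan();
> >     scan.setCaching(1000);       // rows fetched per RPC round trip
> >     scan.setCacheBlocks(false);  // don't churn the block cache on full scans
> >     long start = System.currentTimeMillis();
> >     long rows = 0;
> >     ResultScanner rs = table.getScanner(scan);
> >     for (Result r : rs) {
> >       rows++;                    // replace with real per-row work
> >     }
> >     rs.close();
> >     table.close();
> >     System.out.println(rows + " rows in "
> >         + (System.currentTimeMillis() - start) + " ms");
> >   }
> > }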
> >
> > You can find the code here:
> >
> > https://dl.dropboxusercontent.com/u/6880177/hbasetest.zip
> >
> >
> > On May 4, 2013, at 6:33 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:
> >
> >> The block buffers are not reused, but that by itself should not be a
> >> problem, as they are all the same size (at least I have never identified
> >> that as one in my profiling sessions).
> >>
> >> My offer still stands to do some profiling myself if there is an easy
> >> way to generate data of similar shape.
> >>
> >> -- Lars
> >>
> >>
> >>
> >> ________________________________
> >> From: Bryan Keller <[EMAIL PROTECTED]>
> >> Sent: Friday, May 3, 2013 3:44 AM
> >> Subject: Re: Poor HBase map-reduce scan performance
> >>
> >>
> >> Actually, I'm not too confident in my results re block size; they may
> >> have been related to major compaction. I'm going to rerun before drawing
> >> conclusions.