HBase >> mail # user >> Poor HBase map-reduce scan performance

Re: Poor HBase map-reduce scan performance
Thanks Enis, I'll see if I can backport this patch - it is exactly what I was going to try. This should solve my scan performance problems if I can get it to work.

On May 29, 2013, at 1:29 PM, Enis Söztutar <[EMAIL PROTECTED]> wrote:

> Hi,
> Regarding running raw scans on top of Hfiles, you can try a version of the
> patch attached at https://issues.apache.org/jira/browse/HBASE-8369, which
> enables exactly this. However, the patch is for trunk.
> In that, we open one region from snapshot files in each record reader, and
> run a scan through using an internal region scanner. Since this bypasses
> the client + rpc + server daemon layers, it should be able to give optimum
> scan performance.
> There is also a tool called HFilePerformanceBenchmark that intends to
> measure raw performance for HFiles. I've had to make a lot of changes to
> get it workable, but it might be worth taking a look to see whether there is
> any perf difference between scanning a sequence file from hdfs vs scanning
> an hfile.
> Enis
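
For readers who want the shape of this approach: the record reader restores the snapshot's store files to a scratch directory and opens the region locally, so the scan never touches a RegionServer. Below is a rough, non-runnable Java sketch of the job setup; the API names (TableMapReduceUtil.initTableSnapshotMapperJob and friends), the snapshot name, and the restore path are assumptions based on what this work shipped as in later HBase releases, not the exact trunk patch:

```java
// Sketch only: needs a running cluster plus HBase/Hadoop on the classpath.
Job job = Job.getInstance(conf, "scan-over-snapshot");
Scan scan = new Scan();
scan.setCaching(1000);        // large batches for a full-table scan
scan.setCacheBlocks(false);   // don't churn the block cache from MR

// Restores the snapshot's region/HFile references under the restore dir;
// each record reader then opens its region locally and drives an internal
// region scanner -- no client/RPC/RegionServer daemon in the read path.
TableMapReduceUtil.initTableSnapshotMapperJob(
    "mySnapshot", scan, MyScanMapper.class,
    Text.class, LongWritable.class, job,
    /* addDependencyJars = */ true,
    new Path("/tmp/snapshot-restore"));
```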
> On Fri, May 24, 2013 at 10:50 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:
>> Sorry. Haven't gotten to this, yet.
>> Scanning in HBase being about 3x slower than straight HDFS is in the right
>> ballpark, though. It has to do a bit more work.
>> Generally, HBase is great at honing in on a subset (some 10-100m rows) of
>> the data. Raw scan performance is not (yet) a strength of HBase.
>> So with HDFS you get to 75% of the theoretical maximum read throughput;
>> hence with HBase you get to 25% of the theoretical cluster-wide maximum disk
>> throughput?
>> -- Lars
>> ----- Original Message -----
>> From: Bryan Keller <[EMAIL PROTECTED]>
>> Cc:
>> Sent: Friday, May 10, 2013 8:46 AM
>> Subject: Re: Poor HBase map-reduce scan performance
>> FYI, I ran tests with compression on and off.
>> With a plain HDFS sequence file and compression off, I am getting very
>> good I/O numbers, roughly 75% of theoretical max for reads. With snappy
>> compression on with a sequence file, I/O speed is about 3x slower. However
>> the file size is 3x smaller so it takes about the same time to scan.
>> With HBase, the results are equivalent (just much slower than a sequence
>> file). Scanning a compressed table is about 3x slower I/O than an
>> uncompressed table, but the table is 3x smaller, so the time to scan is
>> about the same. Scanning an HBase table takes about 3x as long as scanning
>> the sequence file export of the table, either compressed or uncompressed.
>> The sequence file export file size ends up being just barely larger than
>> the table, either compressed or uncompressed.
>> So in sum, compression slows down I/O 3x, but the file is 3x smaller so
>> the time to scan is about the same. Adding in HBase slows things down
>> another 3x. So I'm seeing 9x faster I/O scanning an uncompressed sequence
>> file vs scanning a compressed table.
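
That arithmetic can be sanity-checked with a toy model (the 300 GB size and 900 MB/s rate below are made-up numbers, not measurements from this thread): if scan time is size divided by effective I/O rate, a 3x smaller file read at a 3x lower rate takes the same wall-clock time, while stacking HBase's ~3x slowdown on compression's ~3x yields the 9x I/O gap reported here.

```java
// Toy model of the thread's arithmetic: scan time = data size / I/O rate.
public class ScanTimeModel {
    // Seconds to scan sizeBytes at throughputBytesPerSec.
    static double scanSeconds(double sizeBytes, double throughputBytesPerSec) {
        return sizeBytes / throughputBytesPerSec;
    }

    public static void main(String[] args) {
        double size = 300e9;   // hypothetical table: 300 GB uncompressed
        double rawIo = 900e6;  // hypothetical aggregate read rate: 900 MB/s

        // Snappy: ~3x smaller file, but ~3x lower effective I/O rate.
        double tPlain = scanSeconds(size, rawIo);
        double tSnappy = scanSeconds(size / 3, rawIo / 3);
        System.out.println(tPlain == tSnappy);   // the two effects cancel

        // HBase adds another ~3x on top of compression's ~3x, so an
        // uncompressed sequence file vs a compressed HBase table is ~9x
        // apart in effective I/O rate.
        double hbaseIo = rawIo / 3 / 3;
        System.out.println(rawIo / hbaseIo);
    }
}
```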
>> On May 8, 2013, at 10:15 AM, Bryan Keller <[EMAIL PROTECTED]> wrote:
>>> Thanks for the offer Lars! I haven't made much progress speeding things
>> up.
>>> I finally put together a test program that populates a table that is
>> similar to my production dataset. I have a readme that should describe
>> things, hopefully enough to make it usable. There is code to populate a
>> test table, code to scan the table, and code to scan sequence files from an
>> export (to compare HBase w/ raw HDFS). I use a gradle build script.
>>> You can find the code here:
>>> https://dl.dropboxusercontent.com/u/6880177/hbasetest.zip
>>> On May 4, 2013, at 6:33 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:
>>>> The blockbuffers are not reused, but that by itself should not be a
>> problem as they are all the same size (at least I have never identified
>> that as one in my profiling sessions).
>>>> My offer still stands to do some profiling myself if there is an easy
>> way to generate data of similar shape.