HBase >> mail # user >> Poor HBase map-reduce scan performance


lars hofhansl 2013-05-25, 05:50
Re: Poor HBase map-reduce scan performance
Hi,

Regarding running raw scans on top of HFiles, you can try a version of the
patch attached at https://issues.apache.org/jira/browse/HBASE-8369, which
enables exactly this. However, the patch is for trunk.

In that patch, we open one region from snapshot files in each record reader,
and run a scan through it using an internal region scanner. Since this
bypasses the client + rpc + server daemon layers, it should be able to give
optimal scan performance.

There is also a tool called HFilePerformanceBenchmark that is intended to
measure raw performance for HFiles. I've had to make a lot of changes to make
it workable, but it might be worth taking a look to see whether there is
any perf difference between scanning a sequence file from HDFS vs. scanning
an HFile.

Enis
On Fri, May 24, 2013 at 10:50 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:

> Sorry. Haven't gotten to this, yet.
>
> Scanning in HBase being about 3x slower than straight HDFS is in the right
> ballpark, though. It has to do a bit more work.
>
> Generally, HBase is great at homing in on a subset (some 10-100m rows) of
> the data. Raw scan performance is not (yet) a strength of HBase.
>
> So with HDFS you get to 75% of the theoretical maximum read throughput;
> hence with HBase you get to 25% of the theoretical cluster-wide maximum disk
> throughput?
>
>
> -- Lars
>
>
>
> ----- Original Message -----
> From: Bryan Keller <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Cc:
> Sent: Friday, May 10, 2013 8:46 AM
> Subject: Re: Poor HBase map-reduce scan performance
>
> FYI, I ran tests with compression on and off.
>
> With a plain HDFS sequence file and compression off, I am getting very
> good I/O numbers, roughly 75% of theoretical max for reads. With snappy
> compression on with a sequence file, I/O speed is about 3x slower. However
> the file size is 3x smaller so it takes about the same time to scan.
>
> With HBase, the results are equivalent (just much slower than a sequence
> file). Scanning a compressed table is about 3x slower I/O than an
> uncompressed table, but the table is 3x smaller, so the time to scan is
> about the same. Scanning an HBase table takes about 3x as long as scanning
> the sequence file export of the table, either compressed or uncompressed.
> The sequence file export file size ends up being just barely larger than
> the table, either compressed or uncompressed.
>
> So in sum, compression slows down I/O 3x, but the file is 3x smaller so
> the time to scan is about the same. Adding in HBase slows things down
> another 3x. So I'm seeing 9x faster I/O scanning an uncompressed sequence
> file vs scanning a compressed table.
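
As a rough sanity check, the ratios reported above can be combined in a few lines. This is only a sketch: the baseline of 1.0 (an uncompressed sequence file), the function names, and the constant names are assumptions made for illustration, and the 3x factors are the approximate ratios from this message rather than new measurements.

```python
# Relative I/O rates and scan times built only from the approximate
# ratios reported in the message. Baseline 1.0 = uncompressed sequence file.
# All names here are illustrative assumptions.

COMPRESSION_IO_SLOWDOWN = 3.0  # snappy: on-disk I/O rate about 3x slower
COMPRESSION_SIZE_RATIO = 3.0   # compressed data is about 3x smaller
HBASE_SLOWDOWN = 3.0           # HBase scan about 3x slower than a sequence file


def relative_io_rate(compressed: bool, hbase: bool) -> float:
    """On-disk I/O rate relative to an uncompressed sequence file."""
    rate = 1.0
    if compressed:
        rate /= COMPRESSION_IO_SLOWDOWN
    if hbase:
        rate /= HBASE_SLOWDOWN
    return rate


def relative_scan_time(compressed: bool, hbase: bool) -> float:
    """Time to scan the full dataset relative to an uncompressed sequence file."""
    size = 1.0 / COMPRESSION_SIZE_RATIO if compressed else 1.0
    return size / relative_io_rate(compressed, hbase)


if __name__ == "__main__":
    for hbase in (False, True):
        for compressed in (False, True):
            label = ("HBase table" if hbase else "sequence file") + (
                ", snappy" if compressed else ", uncompressed"
            )
            print(
                f"{label}: I/O rate {relative_io_rate(compressed, hbase):.2f}, "
                f"scan time {relative_scan_time(compressed, hbase):.2f}"
            )
```

Under these assumptions the I/O rate of a compressed table is one ninth of the baseline, while its scan time is only three times the baseline because a third as many bytes are read from disk. Multiplying the one-third HBase factor by the 75% figure for plain HDFS also gives the 25% of theoretical maximum asked about earlier in the thread.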
>
>
> On May 8, 2013, at 10:15 AM, Bryan Keller <[EMAIL PROTECTED]> wrote:
>
> > Thanks for the offer Lars! I haven't made much progress speeding things
> up.
> >
> > I finally put together a test program that populates a table that is
> similar to my production dataset. I have a readme that should describe
> things, hopefully enough to make it useable. There is code to populate a
> test table, code to scan the table, and code to scan sequence files from an
> export (to compare HBase w/ raw HDFS). I use a gradle build script.
> >
> > You can find the code here:
> >
> > https://dl.dropboxusercontent.com/u/6880177/hbasetest.zip
> >
> >
> > On May 4, 2013, at 6:33 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:
> >
> >> The blockbuffers are not reused, but that by itself should not be a
> problem as they are all the same size (at least I have never identified
> that as one in my profiling sessions).
> >>
> >> My offer still stands to do some profiling myself if there is an easy
> way to generate data of similar shape.
> >>
> >> -- Lars
> >>
> >>
> >>
> >> ________________________________
> >> From: Bryan Keller <[EMAIL PROTECTED]>
> >> To: [EMAIL PROTECTED]
> >> Sent: Friday, May 3, 2013 3:44 AM
> >> Subject: Re: Poor HBase map-reduce scan performance
> >>
> >>
> >> Actually I'm not too confident in my results re block size, they may
> have been related to major compaction. I'm going to rerun before drawing