HBase >> mail # user >> Poor HBase map-reduce scan performance


Bryan Keller 2013-05-01, 04:01
Ted Yu 2013-05-01, 04:17
Bryan Keller 2013-05-01, 04:31
Ted Yu 2013-05-01, 04:56
Bryan Keller 2013-05-01, 05:01
lars hofhansl 2013-05-01, 05:01
Bryan Keller 2013-05-01, 06:02
Michael Segel 2013-05-01, 14:24
lars hofhansl 2013-05-01, 06:21
Bryan Keller 2013-05-01, 15:00
Bryan Keller 2013-05-02, 01:01
lars hofhansl 2013-05-02, 04:41
Bryan Keller 2013-05-02, 04:49
Bryan Keller 2013-05-02, 17:54
Nicolas Liochon 2013-05-02, 18:00
lars hofhansl 2013-05-03, 00:46
Bryan Keller 2013-05-03, 07:17
Bryan Keller 2013-05-03, 10:44
lars hofhansl 2013-05-05, 01:33
Bryan Keller 2013-05-08, 17:15
Bryan Keller 2013-05-10, 15:46
Sandy Pratt 2013-05-22, 20:29
Ted Yu 2013-05-22, 20:39
Sandy Pratt 2013-05-22, 22:33
Ted Yu 2013-05-22, 22:57
Bryan Keller 2013-05-23, 15:45
Sandy Pratt 2013-05-23, 22:42
Ted Yu 2013-05-23, 22:47
Sandy Pratt 2013-06-05, 01:11
Sandy Pratt 2013-06-05, 08:09
yonghu 2013-06-05, 14:55
Ted Yu 2013-06-05, 16:12
yonghu 2013-06-05, 18:14
Sandy Pratt 2013-06-05, 18:57
Sandy Pratt 2013-06-05, 17:58
lars hofhansl 2013-06-06, 01:03
Bryan Keller 2013-06-25, 08:56
lars hofhansl 2013-06-28, 17:56
Bryan Keller 2013-07-01, 04:23
Ted Yu 2013-07-01, 04:32
lars hofhansl 2013-07-01, 10:59
Enis Söztutar 2013-07-01, 21:23
Bryan Keller 2013-07-01, 21:35
lars hofhansl 2013-05-25, 05:50
Re: Poor HBase map-reduce scan performance
Hi,

Regarding running raw scans on top of HFiles, you can try a version of the
patch attached at https://issues.apache.org/jira/browse/HBASE-8369, which
enables exactly this. However, the patch is for trunk.

In that patch, we open one region from the snapshot files in each record
reader and run the scan using an internal region scanner. Since this
bypasses the client + RPC + server daemon layers, it should be able to give
optimum scan performance.
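As a sketch of what a job using this approach could look like: HBASE-8369 introduced a TableSnapshotInputFormat, which TableMapReduceUtil can wire into a map-reduce job. The snapshot name, restore directory, and mapper below are hypothetical placeholders, and the exact API may differ from the attached trunk patch.

```java
// Sketch: map-reduce scan over a table snapshot, reading HFiles directly.
// Each record reader opens one region from the snapshot and scans it with an
// internal region scanner, so no client/RPC/region-server hop is involved.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class SnapshotScanJob {

    // Hypothetical mapper that just counts rows as they are scanned.
    static class CountMapper extends TableMapper<NullWritable, NullWritable> {
        @Override
        protected void map(ImmutableBytesWritable row, Result result, Context ctx) {
            ctx.getCounter("scan", "rows").increment(1);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "snapshot-scan");
        job.setJarByClass(SnapshotScanJob.class);

        Scan scan = new Scan(); // full-table scan over the snapshot

        TableMapReduceUtil.initTableSnapshotMapperJob(
                "mytable-snapshot",                 // hypothetical snapshot name
                scan,
                CountMapper.class,
                NullWritable.class, NullWritable.class,
                job,
                true,                               // ship dependency jars
                new Path("/tmp/snapshot-restore")); // scratch dir for the restored regions

        job.setOutputFormatClass(NullOutputFormat.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Since the mappers read the snapshot's HFiles from HDFS directly, this needs HDFS-level read access to the HBase root directory, which is a different security model than going through the region servers.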

There is also a tool called HFilePerformanceBenchmark that intends to
measure raw performance for HFiles. I've had to make a lot of changes to get
it workable, but it might be worth taking a look to see whether there is
any perf difference between scanning a sequence file from HDFS vs scanning
an HFile.

Enis
On Fri, May 24, 2013 at 10:50 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:

> Sorry. Haven't gotten to this, yet.
>
> Scanning in HBase being about 3x slower than straight HDFS is in the right
> ballpark, though. It has to do a bit more work.
>
> Generally, HBase is great at honing in on a subset (some 10-100m rows) of
> the data. Raw scan performance is not (yet) a strength of HBase.
>
> So with HDFS you get to 75% of the theoretical maximum read throughput;
> hence with HBase you get to 25% of the theoretical cluster-wide maximum
> disk throughput?
>
>
> -- Lars
>
>
>
> ----- Original Message -----
> From: Bryan Keller <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Cc:
> Sent: Friday, May 10, 2013 8:46 AM
> Subject: Re: Poor HBase map-reduce scan performance
>
> FYI, I ran tests with compression on and off.
>
> With a plain HDFS sequence file and compression off, I am getting very
> good I/O numbers, roughly 75% of theoretical max for reads. With Snappy
> compression on with a sequence file, I/O speed is about 3x slower; however,
> the file size is 3x smaller, so it takes about the same time to scan.
>
> With HBase, the results are equivalent (just much slower than a sequence
> file). Scanning a compressed table is about 3x slower I/O than an
> uncompressed table, but the table is 3x smaller, so the time to scan is
> about the same. Scanning an HBase table takes about 3x as long as scanning
> the sequence file export of the table, either compressed or uncompressed.
> The sequence file export file size ends up being just barely larger than
> the table, either compressed or uncompressed.
>
> So in sum, compression slows down I/O 3x, but the file is 3x smaller so
> the time to scan is about the same. Adding in HBase slows things down
> another 3x. So I'm seeing 9x faster I/O scanning an uncompressed sequence
> file vs scanning a compressed table.
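The arithmetic in Bryan's summary can be sanity-checked with a small model. The two 3x factors are the measured figures from this thread; the absolute read rate and dataset size below are made-up illustration values.

```java
// Back-of-envelope model of the measured ratios: compression trades 3x
// slower I/O for a 3x smaller file (net wash on wall time), and HBase adds
// another ~3x on top, giving a ~9x effective I/O gap.
public class ScanTimeModel {
    public static void main(String[] args) {
        double rawReadMBps = 100.0; // hypothetical raw HDFS read rate, uncompressed
        double datasetMB = 9000.0;  // hypothetical uncompressed dataset size

        // Uncompressed sequence file: full size at full speed.
        double seqUncompressed = datasetMB / rawReadMBps;                 // 90 s

        // Snappy: ~3x slower I/O, but ~3x less data -> same wall time.
        double seqCompressed = (datasetMB / 3.0) / (rawReadMBps / 3.0);   // ~90 s

        // HBase adds another ~3x regardless of compression.
        double hbaseScan = seqUncompressed * 3.0;                         // 270 s

        // Effective I/O rate, uncompressed sequence file vs compressed HBase
        // table: (1/3 for compression) * (1/3 for HBase) -> ~9x apart.
        double ioGap = rawReadMBps / ((rawReadMBps / 3.0) / 3.0);

        System.out.printf("seq uncompressed: %.0f s%n", seqUncompressed);
        System.out.printf("seq compressed:   %.0f s%n", seqCompressed);
        System.out.printf("hbase scan:       %.0f s%n", hbaseScan);
        System.out.printf("I/O gap:          %.1fx%n", ioGap);
    }
}
```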
>
>
> On May 8, 2013, at 10:15 AM, Bryan Keller <[EMAIL PROTECTED]> wrote:
>
> > Thanks for the offer Lars! I haven't made much progress speeding things
> up.
> >
> > I finally put together a test program that populates a table that is
> similar to my production dataset. I have a readme that should describe
> things, hopefully enough to make it useable. There is code to populate a
> test table, code to scan the table, and code to scan sequence files from an
> export (to compare HBase w/ raw HDFS). I use a gradle build script.
> >
> > You can find the code here:
> >
> > https://dl.dropboxusercontent.com/u/6880177/hbasetest.zip
> >
> >
> > On May 4, 2013, at 6:33 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:
> >
> >> The blockbuffers are not reused, but that by itself should not be a
> problem as they are all the same size (at least I have never identified
> that as one in my profiling sessions).
> >>
> >> My offer still stands to do some profiling myself if there is an easy
> way to generate data of similar shape.
> >>
> >> -- Lars
> >>
> >>
> >>
> >> ________________________________
> >> From: Bryan Keller <[EMAIL PROTECTED]>
> >> To: [EMAIL PROTECTED]
> >> Sent: Friday, May 3, 2013 3:44 AM
> >> Subject: Re: Poor HBase map-reduce scan performance
> >>
> >>
> >> Actually I'm not too confident in my results re block size, they may
> have been related to major compaction. I'm going to rerun before drawing
Bryan Keller 2013-06-04, 17:01
Michael Segel 2013-05-06, 03:09
Matt Corgan 2013-05-01, 06:52
Jean-Marc Spaggiari 2013-05-01, 10:56
Bryan Keller 2013-05-01, 16:39
Naidu MS 2013-05-01, 07:25
ramkrishna vasudevan 2013-05-01, 07:27
ramkrishna vasudevan 2013-05-01, 07:29