HBase >> mail # user >> Poor HBase map-reduce scan performance


Bryan Keller 2013-05-01, 05:01
+
lars hofhansl 2013-05-01, 05:01
Re: Poor HBase map-reduce scan performance
The table has hashed keys, so rows are evenly distributed among the regionservers, and the load on each regionserver is pretty much the same. I also have per-table balancing turned on. I get mostly data-local mappers, with only a few rack-local ones (maybe 10 of the 250 mappers).
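
For readers unfamiliar with hashed keys, here is a minimal sketch of one common way to build them. The message does not say which hash or key layout is actually used, so the MD5 choice, the class name, and the sample IDs below are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: derives a fixed-width row key from a natural key.
final class HashedRowKeySketch {

    // Returns the 16-byte MD5 digest of the natural key (assumed scheme).
    static byte[] hashedRowKey(String naturalKey) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            return md5.digest(naturalKey.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // MD5 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    // Lower-case hex rendering, only used to print the keys.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up sequential IDs; their hashes land far apart in key space.
        for (String id : new String[] {"customer-0001", "customer-0002", "customer-0003"}) {
            System.out.println(id + " -> " + toHex(hashedRowKey(id)));
        }
    }
}
```

Keys built this way spread sequential IDs across regions, at the cost of losing range scans in natural-key order.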

Currently the table is a wide table schema, with lists of data structures stored as columns with column prefixes grouping the data structures (e.g. 1_name, 1_address, 1_city, 2_name, 2_address, 2_city). I was thinking of moving those data structures to protobuf which would cut down on the number of columns. The downside is I can't filter on one value with that, but it is a tradeoff I would make for performance. I was also considering restructuring the table into a tall table.
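
To put rough numbers on the column-count tradeoff, here is a back-of-the-envelope sketch. It assumes the classic KeyValue cell layout (no tags), in which every cell repeats the row key, family, qualifier, timestamp, and type, and it ignores block encoding, compression, and protobuf framing. The class name and all lengths in main are invented for illustration; they are not measurements from this table.

```java
// Hypothetical estimator for the raw (unencoded, uncompressed) size of a row.
final class KeyValueSizeSketch {

    // Fixed bytes per cell: key length (4) + value length (4) + row length (2)
    // + family length (1) + timestamp (8) + key type (1).
    static final int FIXED_OVERHEAD = 4 + 4 + 2 + 1 + 8 + 1;

    // Raw size of a single cell.
    static long cellSize(int rowLen, int familyLen, int qualifierLen, int valueLen) {
        return (long) FIXED_OVERHEAD + rowLen + familyLen + qualifierLen + valueLen;
    }

    // Wide layout: one cell per column, each repeating the row key.
    static long wideRowSize(int rowLen, int familyLen, int qualifierLen, int valueLen, int columns) {
        return columns * cellSize(rowLen, familyLen, qualifierLen, valueLen);
    }

    public static void main(String[] args) {
        int rowLen = 16, familyLen = 1, qualifierLen = 8, valueLen = 20, columns = 1000;

        long wide = wideRowSize(rowLen, familyLen, qualifierLen, valueLen, columns);
        // Packed layout: one cell whose value holds all column values
        // (serialization framing is ignored in this estimate).
        long packed = cellSize(rowLen, familyLen, qualifierLen, valueLen * columns);

        System.out.println("wide row bytes:   " + wide);
        System.out.println("packed row bytes: " + packed);
    }
}
```

Under these assumptions the per-cell key overhead dominates when values are small, which is why packing many small columns into one value can cut the bytes the regionserver has to process.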

Something interesting is that my old regionserver machines had five 15k SCSI drives instead of 2 SSDs, and performance was about the same. Also, my old network was 1 Gbit; now it is 10 Gbit. So neither network nor disk I/O appears to be the bottleneck. CPU usage is rather high on the regionservers, so that seems like the best candidate to investigate. I will try profiling it tomorrow and will report back. I may revisit compression on vs. off, since it adds load to the CPU.

I'll also come up with a sample program that generates data similar to my table.
On Apr 30, 2013, at 10:01 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:

> Your average row is 35k, so scanner caching would not make a huge difference, although I would have expected some improvement from setting it to 10 or 50 since you have a wide 10GbE pipe.
>
> I assume your table is split sufficiently to touch all RegionServers... Do you see the same load/IO on all region servers?
>
> A bunch of scan improvements went into HBase since 0.94.2.
> I blogged about some of these changes here: http://hadoop-hbase.blogspot.com/2012/12/hbase-profiling.html
>
> In your case - since you have many columns, each of which carries the rowkey - you might benefit a lot from HBASE-7279.
>
> In the end HBase *is* slower than straight HDFS for full scans. How could it not be?
> So I would start by looking at HDFS first. Make sure Nagle's algorithm is disabled in both HBase and HDFS.
>
> And lastly SSDs are somewhat new territory for HBase. Maybe Andy Purtell is listening, I think he did some tests with HBase on SSDs.
> With rotating media you typically see an improvement with compression. With SSDs the added CPU needed for decompression might outweigh the benefits.
>
> At the risk of starting a larger discussion here, I would posit that HBase's LSM-based design, which trades random IO for sequential IO, might be a bit more questionable on SSDs.
>
> If you can, it would be nice to run a profiler against one of the RegionServers (or maybe do it with the single RS setup) and see where it is bottlenecked.
> (And if you send me a sample program to generate some data - not 700g, though :) - I'll try to do a bit of profiling over the next few days as my day job permits, but I do not have any machines with SSDs.)
>
> -- Lars
>
>
>
>
> ________________________________
> From: Bryan Keller <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Sent: Tuesday, April 30, 2013 9:31 PM
> Subject: Re: Poor HBase map-reduce scan performance
>
>
> Yes, I have tried various settings for setCaching(), and I have set setCacheBlocks(false).
>
> On Apr 30, 2013, at 9:17 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
>
>> From http://hbase.apache.org/book.html#mapreduce.example :
>>
>> scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
>> scan.setCacheBlocks(false);  // don't set to true for MR jobs
>>
>> I guess you have used the above setting.
>>
>> 0.94.x releases are compatible. Have you considered upgrading to, say,
>> 0.94.7, which was recently released?
>>
>> Cheers
>>
>> On Tue, Apr 30, 2013 at 9:01 PM, Bryan Keller <[EMAIL PROTECTED]> wrote:
>>
>>> I have been attempting to speed up my HBase map-reduce scans for a while
>>> now. I have tried just about everything without much luck. I'm running out