Re: Poor HBase map-reduce scan performance
From http://hbase.apache.org/book.html#mapreduce.example:

scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false);  // don't set to true for MR jobs

I assume you have already used the settings above.

0.94.x releases are compatible with one another. Have you considered
upgrading to, say, 0.94.7, which was recently released?

Cheers

On Tue, Apr 30, 2013 at 9:01 PM, Bryan Keller <[EMAIL PROTECTED]> wrote:

> I have been attempting to speed up my HBase map-reduce scans for a while
> now. I have tried just about everything without much luck. I'm running out
> of ideas and was hoping for some suggestions. This is HBase 0.94.2 and
> Hadoop 2.0.0 (CDH4.2.1).
>
> The table I'm scanning:
> 20 mil rows
> Hundreds of columns/row
> Column keys can be 30-40 bytes
> Column values are generally not large, 1k would be on the large side
> 250 regions
> Snappy compression
> 8gb region size
> 512mb memstore flush
> 128k block size
> 700gb of data on HDFS
>
> My cluster has 8 datanodes which are also regionservers. Each has 8 cores
> (16 HT), 64gb RAM, and 2 SSDs. The network is 10gbit. I have a separate
> machine acting as namenode, HMaster, and zookeeper (single instance). I
> have disk local reads turned on.
>
> I'm seeing around 5 gbit/sec on average network IO. Each disk is getting
> 400mb/sec read IO. Theoretically I could get 400mb/sec * 16 = 6.4gb/sec.
>
> Using Hadoop's TestDFSIO tool, I'm seeing around 1.4gb/sec read speed. Not
> really that great compared to the theoretical I/O. However this is far
> better than I am seeing with HBase map-reduce scans of my table.
>
> I have a simple no-op map-only job (using TableInputFormat) that scans the
> table and does nothing with data. This takes 45 minutes. That's about
> 260mb/sec read speed. This is over 5x slower than straight HDFS.
> Basically, with HBase my 16-SSD cluster is reading nearly 35% slower
> than a single SSD.
>
> Here are some things I have changed to no avail:
> Scan caching values
> HDFS block sizes
> HBase block sizes
> Region file sizes
> Memory settings
> GC settings
> Number of mappers/node
> Compressed vs not compressed
>
> One thing I notice is that the regionserver is using quite a bit of CPU
> during the map-reduce job. In jstack dumps of the process, it is usually
> in some type of memory allocation or decompression routine, which doesn't
> seem abnormal.
>
> I can't seem to pinpoint the bottleneck. CPU use by the regionserver is
> high but not maxed out. Disk I/O and network I/O are low, IO wait is low.
> I'm on the verge of just writing the dataset out to sequence files once a
> day for scan purposes. Is that what others are doing?
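
The throughput figures in the quoted message can be sanity-checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes decimal units (1 GB = 1000 MB) and that the no-op job reads the full 700 GB on HDFS exactly once; the class and method names are made up for this example and are not part of HBase or Hadoop.

```java
// Sanity check of the throughput figures quoted in this thread.
// Assumptions: decimal units (1 GB = 1000 MB) and one full pass over
// the 700 GB on HDFS. Class and method names are illustrative only.
final class ScanThroughput {

    /** Average read rate in MB/s for reading dataMb megabytes in the given seconds. */
    static double mbPerSec(double dataMb, double seconds) {
        return dataMb / seconds;
    }

    public static void main(String[] args) {
        double scanRate = mbPerSec(700_000.0, 45 * 60); // 700 GB in 45 minutes
        double dfsioRate = 1_400.0;                      // TestDFSIO read rate, MB/s
        double singleSsd = 400.0;                        // per-disk read rate, MB/s
        double aggregate = singleSsd * 16;               // 8 nodes x 2 SSDs each

        System.out.printf("HBase scan:       %.0f MB/s%n", scanRate);
        System.out.printf("TestDFSIO / scan: %.1fx%n", dfsioRate / scanRate);
        System.out.printf("Scan vs one SSD:  %.0f%% slower%n",
                (1 - scanRate / singleSsd) * 100);
        System.out.printf("Aggregate disks:  %.1f GB/s%n", aggregate / 1000);
    }
}
```

Under those assumptions the numbers agree with the message: roughly 259 MB/s for the scan, 5.4x slower than TestDFSIO, and about 35% below a single SSD's 400 MB/s, against a theoretical 6.4 GB/s across all 16 disks.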