HBase >> mail # user >> Poor HBase map-reduce scan performance


Re: Poor HBase map-reduce scan performance
From http://hbase.apache.org/book.html#mapreduce.example:

scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false);  // don't set to true for MR jobs

I guess you have already applied the settings above.
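For anyone following along, the reason setCaching matters so much here: with the default caching of 1, the scanner makes one RPC round trip per row. A back-of-the-envelope sketch of that math (the 20-million-row count is from Bryan's post; the rest is illustrative):

```java
// Rough RPC math for scanner caching. Row count is from the original post;
// this is an illustration, not a measurement.
public class ScanRpcMath {
    // With caching N, each scanner RPC fetches up to N rows.
    static long rpcCount(long rows, int caching) {
        return (rows + caching - 1) / caching; // ceiling division
    }

    public static void main(String[] args) {
        long rows = 20_000_000L; // ~20 million rows in the table
        System.out.println("caching=1:   " + rpcCount(rows, 1) + " RPCs");
        System.out.println("caching=500: " + rpcCount(rows, 500) + " RPCs");
        // 20,000,000 round trips vs 40,000 -- a 500x reduction.
    }
}
```

That reduction in round trips is usually the single biggest win for full-table MR scans, which is why the book calls the default "bad for MapReduce jobs".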

0.94.x releases are compatible. Have you considered upgrading to, say,
0.94.7, which was recently released?

Cheers

On Tue, Apr 30, 2013 at 9:01 PM, Bryan Keller <[EMAIL PROTECTED]> wrote:

> I have been attempting to speed up my HBase map-reduce scans for a while
> now. I have tried just about everything without much luck. I'm running out
> of ideas and was hoping for some suggestions. This is HBase 0.94.2 and
> Hadoop 2.0.0 (CDH4.2.1).
>
> The table I'm scanning:
> 20 million rows
> Hundreds of columns/row
> Column keys can be 30-40 bytes
> Column values are generally not large, 1k would be on the large side
> 250 regions
> Snappy compression
> 8gb region size
> 512mb memstore flush
> 128k block size
> 700gb of data on HDFS
>
> My cluster has 8 datanodes which are also regionservers. Each has 8 cores
> (16 HT), 64gb RAM, and 2 SSDs. The network is 10gbit. I have a separate
> machine acting as namenode, HMaster, and zookeeper (single instance). I
> have disk local reads turned on.
>
> I'm seeing around 5 Gbit/sec on average network IO. Each disk is getting
> 400MB/sec read IO. Theoretically I could get 400MB/sec * 16 = 6.4GB/sec.
>
> Using Hadoop's TestDFSIO tool, I'm seeing around 1.4GB/sec read speed. Not
> really that great compared to the theoretical I/O. However, this is far
> better than I am seeing with HBase map-reduce scans of my table.
>
> I have a simple no-op map-only job (using TableInputFormat) that scans the
> table and does nothing with the data. This takes 45 minutes. That's about
> 260MB/sec read speed. This is over 5x slower than straight HDFS.
>  Basically, with HBase the read performance of my 16-SSD cluster is
> nearly 35% slower than a single SSD.
>
> Here are some things I have changed to no avail:
> Scan caching values
> HDFS block sizes
> HBase block sizes
> Region file sizes
> Memory settings
> GC settings
> Number of mappers/node
> Compressed vs not compressed
>
> One thing I notice is that the regionserver is using quite a bit of CPU
> during the map-reduce job. When I dump the jstack of the process, it is
> usually in some type of memory allocation or decompression routine, which
> doesn't seem abnormal.
>
> I can't seem to pinpoint the bottleneck. CPU use by the regionserver is
> high but not maxed out. Disk I/O and network I/O are low, IO wait is low.
> I'm on the verge of just writing the dataset out to sequence files once a
> day for scan purposes. Is that what others are doing?
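Editor's note: the figures in the post check out arithmetically. A quick sanity check (all inputs are Bryan's numbers; this just recomputes the ceiling and the observed scan rate):

```java
// Recomputes the throughput figures quoted in the thread, using decimal
// units (1 GB = 1000 MB) to match the post's rounding.
public class ScanThroughputMath {
    public static void main(String[] args) {
        double diskMBps = 400.0;  // per-SSD read speed, from the post
        int disks = 8 * 2;        // 8 regionserver nodes x 2 SSDs each
        double ceilingGBps = diskMBps * disks / 1000.0;
        System.out.printf("aggregate ceiling: %.1f GB/sec%n", ceilingGBps); // 6.4

        double dataGB = 700.0;    // table size on HDFS
        double minutes = 45.0;    // wall time of the no-op scan job
        double scanMBps = dataGB * 1000.0 / (minutes * 60.0); // ~259 MB/sec

        double hdfsMBps = 1.4 * 1000.0; // TestDFSIO read speed
        System.out.printf("scan: ~%.0f MB/sec, ~%.1fx slower than raw HDFS%n",
                scanMBps, hdfsMBps / scanMBps); // ~5.4x, matching "over 5x"
    }
}
```

So the no-op scan really is leaving roughly 95% of the hardware's theoretical read bandwidth on the table, which points at per-row overhead (RPC, deserialization, allocation) rather than disk or network saturation.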