Re: Poor HBase map-reduce scan performance
Yes, I have tried various settings for setCaching(), and I am already calling setCacheBlocks(false).
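
For reference, the scan setup looks roughly like this (a sketch; the caching value shown is just one of several I tried):

import org.apache.hadoop.hbase.client.Scan;

Scan scan = new Scan();
scan.setCaching(1000);       // tried a range of values here
scan.setCacheBlocks(false);  // a full scan shouldn't churn the block cache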

On Apr 30, 2013, at 9:17 PM, Ted Yu <[EMAIL PROTECTED]> wrote:

> From http://hbase.apache.org/book.html#mapreduce.example :
>
> scan.setCaching(500);        // 1 is the default in Scan,
>                              // which will be bad for MapReduce jobs
> scan.setCacheBlocks(false);  // don't set to true for MR jobs
>
> I guess you have used the above setting.
>
> 0.94.x releases are compatible. Have you considered upgrading to, say,
> 0.94.7, which was recently released?
>
> Cheers
>
> On Tue, Apr 30, 2013 at 9:01 PM, Bryan Keller <[EMAIL PROTECTED]> wrote:
>
>> I have been attempting to speed up my HBase map-reduce scans for a while
>> now. I have tried just about everything without much luck. I'm running out
>> of ideas and was hoping for some suggestions. This is HBase 0.94.2 and
>> Hadoop 2.0.0 (CDH4.2.1).
>>
>> The table I'm scanning:
>> 20 million rows
>> Hundreds of columns per row
>> Column keys of 30-40 bytes
>> Column values generally not large; 1k would be on the large side
>> 250 regions
>> Snappy compression
>> 8gb region size
>> 512mb memstore flush
>> 128k block size
>> 700gb of data on HDFS
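>>
>> As a sketch, those settings correspond to a table descriptor along these
>> lines (0.94 API; the table and family names are illustrative):
>>
>> import org.apache.hadoop.hbase.HColumnDescriptor;
>> import org.apache.hadoop.hbase.HTableDescriptor;
>> import org.apache.hadoop.hbase.io.hfile.Compression;
>>
>> HColumnDescriptor fam = new HColumnDescriptor("f");    // illustrative family
>> fam.setCompressionType(Compression.Algorithm.SNAPPY);  // Snappy compression
>> fam.setBlocksize(128 * 1024);                          // 128k block size
>>
>> HTableDescriptor table = new HTableDescriptor("mytable"); // illustrative
>> table.setMaxFileSize(8L * 1024 * 1024 * 1024);          // 8gb region size
>> table.setMemStoreFlushSize(512L * 1024 * 1024);         // 512mb memstore flush
>> table.addFamily(fam);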
>>
>> My cluster has 8 datanodes which are also regionservers. Each has 8 cores
>> (16 HT), 64gb RAM, and 2 SSDs. The network is 10gbit. I have a separate
>> machine acting as namenode, HMaster, and zookeeper (single instance). I
>> have HDFS short-circuit local reads turned on.
>>
>> I'm seeing around 5 gbit/sec of network I/O on average. Each disk is getting
>> 400mb/sec of read I/O. Theoretically I could get 400mb/sec * 16 disks =
>> 6.4gb/sec.
>>
>> Using Hadoop's TestDFSIO tool, I'm seeing around 1.4gb/sec read speed. Not
>> really that great compared to the theoretical I/O, but far better than what
>> I am seeing with HBase map-reduce scans of my table.
>>
>> I have a simple no-op map-only job (using TableInputFormat) that scans the
>> table and does nothing with the data. This takes 45 minutes, which works out
>> to about 260mb/sec read speed, over 5x slower than straight HDFS.
>> Basically, with HBase, my 16-SSD cluster is reading nearly 35% slower than
>> a single SSD.
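>>
>> A minimal sketch of that no-op job (0.94-era API; the table name is
>> illustrative):
>>
>> import java.io.IOException;
>>
>> import org.apache.hadoop.conf.Configuration;
>> import org.apache.hadoop.hbase.HBaseConfiguration;
>> import org.apache.hadoop.hbase.client.Result;
>> import org.apache.hadoop.hbase.client.Scan;
>> import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
>> import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
>> import org.apache.hadoop.hbase.mapreduce.TableMapper;
>> import org.apache.hadoop.io.NullWritable;
>> import org.apache.hadoop.mapreduce.Job;
>> import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;
>>
>> public class NoOpScan {
>>
>>   // Touches every row the scan delivers but emits nothing.
>>   static class NoOpMapper extends TableMapper<NullWritable, NullWritable> {
>>     @Override
>>     protected void map(ImmutableBytesWritable row, Result value, Context ctx)
>>         throws IOException, InterruptedException {
>>       // intentionally empty: the job only measures scan throughput
>>     }
>>   }
>>
>>   public static void main(String[] args) throws Exception {
>>     Configuration conf = HBaseConfiguration.create();
>>     Job job = new Job(conf, "no-op full table scan");
>>     job.setJarByClass(NoOpScan.class);
>>
>>     Scan scan = new Scan();
>>     scan.setCaching(500);        // rows per RPC; one of the values tried
>>     scan.setCacheBlocks(false);  // recommended off for MR scans
>>
>>     TableMapReduceUtil.initTableMapperJob("mytable", scan,
>>         NoOpMapper.class, NullWritable.class, NullWritable.class, job);
>>     job.setNumReduceTasks(0);
>>     job.setOutputFormatClass(NullOutputFormat.class);
>>     System.exit(job.waitForCompletion(true) ? 0 : 1);
>>   }
>> }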
>>
>> Here are some things I have changed to no avail:
>> Scan caching values
>> HDFS block sizes
>> HBase block sizes
>> Region file sizes
>> Memory settings
>> GC settings
>> Number of mappers/node
>> Compressed vs not compressed
>>
>> One thing I notice is that the regionserver uses quite a bit of CPU during
>> the map-reduce job. When I dump the jstack of the process, it is usually in
>> some sort of memory allocation or decompression routine, which doesn't seem
>> abnormal.
>>
>> I can't seem to pinpoint the bottleneck. CPU use by the regionserver is
>> high but not maxed out. Disk I/O and network I/O are low, and I/O wait is low.
>> I'm on the verge of just writing the dataset out to sequence files once a
>> day for scan purposes. Is that what others are doing?
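>>
>> If I do go that route, a sketch of the dump job, built on the bundled
>> IdentityTableMapper (output path and table name are illustrative):
>>
>> import org.apache.hadoop.conf.Configuration;
>> import org.apache.hadoop.fs.Path;
>> import org.apache.hadoop.hbase.HBaseConfiguration;
>> import org.apache.hadoop.hbase.client.Result;
>> import org.apache.hadoop.hbase.client.Scan;
>> import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
>> import org.apache.hadoop.hbase.mapreduce.IdentityTableMapper;
>> import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
>> import org.apache.hadoop.mapreduce.Job;
>> import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
>>
>> public class TableDump {
>>   public static void main(String[] args) throws Exception {
>>     Configuration conf = HBaseConfiguration.create();
>>     Job job = new Job(conf, "nightly table dump");
>>     job.setJarByClass(TableDump.class);
>>
>>     Scan scan = new Scan();
>>     scan.setCaching(500);
>>     scan.setCacheBlocks(false);
>>
>>     // Write each row out as (rowkey, Result) pairs; downstream scan jobs
>>     // can then read the SequenceFiles straight off HDFS instead of going
>>     // through the regionservers.
>>     TableMapReduceUtil.initTableMapperJob("mytable", scan,
>>         IdentityTableMapper.class, ImmutableBytesWritable.class,
>>         Result.class, job);
>>     job.setNumReduceTasks(0);
>>     job.setOutputFormatClass(SequenceFileOutputFormat.class);
>>     SequenceFileOutputFormat.setOutputPath(job, new Path("/dumps/mytable"));
>>     System.exit(job.waitForCompletion(true) ? 0 : 1);
>>   }
>> }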