HBase user mailing list - Poor HBase map-reduce scan performance


Re: Poor HBase map-reduce scan performance
Bryan Keller 2013-05-01, 05:01
Yes, I have it enabled (forgot to mention that).
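
For reference, "it" here is HDFS short-circuit (local) read. A minimal sketch of the relevant settings, assuming CDH4.2-era property names; in practice these live in hdfs-site.xml / hbase-site.xml on every datanode/regionserver host rather than in client code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class ShortCircuitCheck {
  public static void main(String[] args) {
    // Illustration only: these keys are normally set in hdfs-site.xml /
    // hbase-site.xml on each datanode/regionserver, not programmatically.
    Configuration conf = HBaseConfiguration.create();
    conf.setBoolean("dfs.client.read.shortcircuit", true);
    // The HDFS-347 style implementation also needs a domain socket on the
    // datanodes; the path below is an example value, not a required one.
    conf.set("dfs.domain.socket.path", "/var/run/hadoop-hdfs/dn._PORT");
    System.out.println("short-circuit enabled in this conf: "
        + conf.getBoolean("dfs.client.read.shortcircuit", false));
  }
}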

On Apr 30, 2013, at 9:56 PM, Ted Yu <[EMAIL PROTECTED]> wrote:

> Have you tried enabling short circuit read ?
>
> Thanks
>
> On Apr 30, 2013, at 9:31 PM, Bryan Keller <[EMAIL PROTECTED]> wrote:
>
>> Yes, I have tried various settings for setCaching() and I have setCacheBlocks(false)
>>
>> On Apr 30, 2013, at 9:17 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
>>
>>> From http://hbase.apache.org/book.html#mapreduce.example :
>>>
>>> scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
>>> scan.setCacheBlocks(false);  // don't set to true for MR jobs
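
A rough sketch of how that Scan is typically handed to TableInputFormat via TableMapReduceUtil; the table name and mapper class below are placeholders, not anything from the job being discussed:

// 'job' is an org.apache.hadoop.mapreduce.Job created earlier in the driver.
Scan scan = new Scan();
scan.setCaching(500);         // rows fetched per RPC instead of the default 1
scan.setCacheBlocks(false);   // a full scan would only churn the regionserver block cache

TableMapReduceUtil.initTableMapperJob(
    "mytable",                          // placeholder table name
    scan,
    MyScanMapper.class,                 // placeholder TableMapper subclass
    ImmutableBytesWritable.class,
    Result.class,
    job);
job.setNumReduceTasks(0);               // map-only scan job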
>>>
>>> I guess you have used the above setting.
>>>
>>> 0.94.x releases are compatible. Have you considered upgrading to, say,
>>> 0.94.7, which was recently released?
>>>
>>> Cheers
>>>
>>> On Tue, Apr 30, 2013 at 9:01 PM, Bryan Keller <[EMAIL PROTECTED]> wrote:
>>>
>>>> I have been attempting to speed up my HBase map-reduce scans for a while
>>>> now. I have tried just about everything without much luck. I'm running out
>>>> of ideas and was hoping for some suggestions. This is HBase 0.94.2 and
>>>> Hadoop 2.0.0 (CDH4.2.1).
>>>>
>>>> The table I'm scanning (a rough descriptor sketch follows this list):
>>>> 20 mil rows
>>>> Hundreds of columns/row
>>>> Column keys can be 30-40 bytes
>>>> Column values are generally not large, 1k would be on the large side
>>>> 250 regions
>>>> Snappy compression
>>>> 8gb region size
>>>> 512mb memstore flush
>>>> 128k block size
>>>> 700gb of data on HDFS
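
As referenced above, a sketch of roughly equivalent settings in the HBase 0.94 client API; the table and column family names are invented, and the real schema may of course differ:

import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.io.hfile.Compression;

// Approximation of the table described above; "mytable" and family "d" are made-up names.
HTableDescriptor table = new HTableDescriptor("mytable");
table.setMaxFileSize(8L * 1024 * 1024 * 1024);            // 8 GB max region/store file size
table.setMemStoreFlushSize(512L * 1024 * 1024);           // 512 MB memstore flush

HColumnDescriptor family = new HColumnDescriptor("d");
family.setCompressionType(Compression.Algorithm.SNAPPY);  // Snappy compression
family.setBlocksize(128 * 1024);                          // 128 KB HFile block size
table.addFamily(family);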
>>>>
>>>> My cluster has 8 datanodes which are also regionservers. Each has 8 cores
>>>> (16 HT), 64gb RAM, and 2 SSDs. The network is 10gbit. I have a separate
>>>> machine acting as namenode, HMaster, and zookeeper (single instance). I
>>>> have disk local reads turned on.
>>>>
>>>> I'm seeing around 5 gbit/sec on average network IO. Each disk is getting
>>>> 400mb/sec read IO. Theoretically I could get 400mb/sec * 16 = 6.4gb/sec.
>>>>
>>>> Using Hadoop's TestDFSIO tool, I'm seeing around 1.4gb/sec read speed. Not
>>>> really that great compared to the theoretical I/O. However this is far
>>>> better than I am seeing with HBase map-reduce scans of my table.
>>>>
>>>> I have a simple no-op map-only job (using TableInputFormat) that scans the
>>>> table and does nothing with the data. This takes 45 minutes, which works out
>>>> to about 260mb/sec read speed, over 5x slower than straight HDFS. Basically,
>>>> with HBase the read performance of my 16 SSD cluster is nearly 35% slower
>>>> than a single SSD.
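
A minimal sketch of what such a no-op TableMapper looks like, assuming it is wired in with TableMapReduceUtil as in the earlier snippet; the class name is illustrative:

import java.io.IOException;

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;

// No-op mapper: TableInputFormat hands every row to map(), which ignores it,
// so the job's runtime is dominated by the scan itself.
public class NoOpScanMapper extends TableMapper<ImmutableBytesWritable, Result> {
  @Override
  protected void map(ImmutableBytesWritable rowKey, Result columns, Context context)
      throws IOException, InterruptedException {
    // intentionally empty: nothing is emitted, nothing is computed
  }
}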
>>>>
>>>> Here are some things I have changed to no avail:
>>>> Scan caching values
>>>> HDFS block sizes
>>>> HBase block sizes
>>>> Region file sizes
>>>> Memory settings
>>>> GC settings
>>>> Number of mappers/node
>>>> Compressed vs not compressed
>>>>
>>>> One thing I notice is that the regionserver uses quite a bit of CPU during
>>>> the map-reduce job. When I dump the jstack of the process, it is usually in
>>>> some kind of memory allocation or decompression routine, which didn't seem
>>>> abnormal.
>>>>
>>>> I can't seem to pinpoint the bottleneck. CPU use by the regionserver is
>>>> high but not maxed out. Disk I/O and network I/O are low, IO wait is low.
>>>> I'm on the verge of just writing the dataset out to sequence files once a
>>>> day for scan purposes. Is that what others are doing?
>>
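
For what it's worth, the "write the dataset out to sequence files once a day" idea in that last paragraph is close in spirit to HBase's bundled Export job. A rough, illustrative version, assuming HBase 0.94 (where Result is still Writable) and with the table name and output path invented here:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.IdentityTableMapper;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// Daily dump of an HBase table to SequenceFiles of (rowkey, Result) pairs so that
// later full scans can read flat HDFS files instead of going through regionservers.
public class DailyTableDump {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "daily-table-dump");
    job.setJarByClass(DailyTableDump.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // same full-scan settings as discussed above
    scan.setCacheBlocks(false);

    // IdentityTableMapper passes each (rowkey, Result) pair through unchanged.
    TableMapReduceUtil.initTableMapperJob(
        "mytable", scan, IdentityTableMapper.class,
        ImmutableBytesWritable.class, Result.class, job);

    job.setNumReduceTasks(0);    // map-only dump
    job.setOutputKeyClass(ImmutableBytesWritable.class);
    job.setOutputValueClass(Result.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    SequenceFileOutputFormat.setOutputPath(job, new Path("/dumps/mytable/2013-05-01"));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Downstream scan jobs would then read the SequenceFiles with a plain MapReduce job instead of going through the regionservers.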