Re: Poor HBase map-reduce scan performance
Thanks for the offer, Lars! I haven't made much progress speeding things up.

I finally put together a test program that populates a table similar to my production dataset. There is a readme that should describe things well enough to make it usable. There is code to populate a test table, code to scan the table, and code to scan sequence files from an export (to compare HBase with raw HDFS). I use a Gradle build script.

You can find the code here:

https://dl.dropboxusercontent.com/u/6880177/hbasetest.zip
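
For reference, the scan portion of a test like this would look roughly like the sketch below against the 0.94-era client API. This is only an illustration; the actual code is in the linked zip, and the class and table names here are made up.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;

    public class ScanTest {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "testtable"); // hypothetical table name
            Scan scan = new Scan();
            scan.setCaching(1000);      // rows fetched per RPC; the default is much lower
            scan.setCacheBlocks(false); // avoid churning the block cache on a full scan
            ResultScanner scanner = table.getScanner(scan);
            long rows = 0;
            try {
                for (Result r : scanner) {
                    rows++; // a real test would also touch the cell data
                }
            } finally {
                scanner.close();
                table.close();
            }
            System.out.println("scanned " + rows + " rows");
        }
    }

Scanner caching is one of the usual knobs for full-scan throughput, since a low caching value means many more round trips per region.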
On May 4, 2013, at 6:33 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:

> The block buffers are not reused, but that by itself should not be a problem, as they are all the same size (at least, I have never identified that as a problem in my profiling sessions).
>
> My offer still stands to do some profiling myself if there is an easy way to generate data of similar shape.
>
> -- Lars
>
>
>
> ________________________________
> From: Bryan Keller <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Sent: Friday, May 3, 2013 3:44 AM
> Subject: Re: Poor HBase map-reduce scan performance
>
>
> Actually, I'm not too confident in my block size results; they may have been related to major compaction. I'm going to rerun before drawing any conclusions.
>
> On May 3, 2013, at 12:17 AM, Bryan Keller <[EMAIL PROTECTED]> wrote:
>
>> I finally made some progress. I tried a very large HBase block size (16 MB), and it significantly improved scan performance: I went from 45-50 min to 24 min. Not great, but much better. Before, I had it set to 128 KB. Scanning an equivalent sequence file takes 10 min. My random read performance will probably suffer with such a large block size (theoretically), so I probably can't keep it this big, and I care about random read performance too. I've read that having a block size this big is not recommended; is that correct?
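
The block size Bryan is changing here is the per-column-family HFile block size (default 64 KB). A minimal sketch of setting it through the 0.94-era admin API follows; the table and family names are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SetBlockSize {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);
            byte[] table = Bytes.toBytes("testtable");          // hypothetical table name
            HColumnDescriptor cf = new HColumnDescriptor("cf"); // hypothetical family name
            cf.setBlocksize(16 * 1024 * 1024); // 16 MB, vs. the 64 KB default
            admin.disableTable(table);         // table must be offline to modify the family
            admin.modifyColumn(table, cf);
            admin.enableTable(table);
            admin.close();
        }
    }

The new block size only takes effect as store files are rewritten, e.g. by a major compaction, which may be why compaction muddied the earlier measurements.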
>>
>> I haven't dug too deeply into the code: are the block buffers reused, or does each new block read trigger a new allocation? Perhaps a buffer pool could help here, if there isn't one already. When doing a scan, HBase could reuse previously allocated block buffers instead of allocating a new one for each block. Then block size shouldn't affect scan performance much.
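
As a rough illustration of the kind of pooling being proposed here (this is not HBase code, just a sketch): since all blocks are the same size, fixed-size buffers could be recycled instead of reallocated per block read.

    import java.nio.ByteBuffer;
    import java.util.concurrent.ConcurrentLinkedQueue;

    public class BlockBufferPool {
        private final int blockSize;
        private final ConcurrentLinkedQueue<ByteBuffer> free =
                new ConcurrentLinkedQueue<ByteBuffer>();

        public BlockBufferPool(int blockSize) {
            this.blockSize = blockSize;
        }

        // Hand out a cleared buffer, allocating only when the pool is empty.
        public ByteBuffer acquire() {
            ByteBuffer buf = free.poll();
            if (buf == null) {
                buf = ByteBuffer.allocate(blockSize);
            }
            buf.clear();
            return buf;
        }

        // Return a buffer once its block has been consumed so the next read can reuse it.
        public void release(ByteBuffer buf) {
            if (buf.capacity() == blockSize) {
                free.offer(buf);
            }
        }
    }

With something like this, scan-time allocation (and the GC pressure that goes with it) would be roughly constant regardless of block size.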
>>
>> I'm not using a block encoder. Also, I'm still sifting through the profiler results; I'll see if I can make more sense of them and run some more experiments.
>>
>> On May 2, 2013, at 5:46 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:
>>
>>> Interesting. If you can, try 0.94.7 (though it'll probably not have changed that much from 0.94.4).
>>>
>>>
>>> Have you enabled one of the block encoders (FAST_DIFF, etc.)? If so, try without. They currently need to reallocate a ByteBuffer for each single KV.
>>> (Since you see ScannerV2 rather than EncodedScannerV2, you probably have not enabled encoding; just checking.)
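
For reference, checking and disabling the data block encoding on a column family looks roughly like this with the 0.94-era API (the family name is made up):

    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;

    public class CheckEncoding {
        public static void main(String[] args) {
            HColumnDescriptor cf = new HColumnDescriptor("cf"); // hypothetical family name
            System.out.println("current encoding: " + cf.getDataBlockEncoding());
            cf.setDataBlockEncoding(DataBlockEncoding.NONE);    // no FAST_DIFF etc.
        }
    }

As with the block size change above, the modified descriptor would have to be applied via disableTable/modifyColumn/enableTable to take effect.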
>>>
>>>
>>> And do you have a stack trace for the ByteBuffer.allocate()? That one is strange, since it never came up in my profiling (unless you enabled block encoding).
>>> (You can get traces from VisualVM by creating a snapshot, but you'd have to drill in to find the allocate().)
>>>
>>>
>>> During normal scanning (again, without encoding) there should be no allocation happening except for blocks read from disk (and they should all be the same size, thus allocation should be cheap).
>>>
>>> -- Lars
>>>
>>>
>>>
>>> ________________________________
>>> From: Bryan Keller <[EMAIL PROTECTED]>
>>> To: [EMAIL PROTECTED]
>>> Sent: Thursday, May 2, 2013 10:54 AM
>>> Subject: Re: Poor HBase map-reduce scan performance
>>>
>>>
>>> I ran one of my regionservers through VisualVM. It looks like the top hot spots are HFileReaderV2$ScannerV2.getKeyValue() and ByteBuffer.allocate(). At first glance, it appears that memory allocation may be an issue. Decompression was next below that, but it seems to be less of an issue.
>>>
>>> Would changing the block size, either HDFS or HBase, help here?
>>>
>>> Also, if anyone has tips on how else to profile, that would be appreciated. VisualVM can produce a lot of noise that is hard to sift through.