On Mon, Jan 28, 2013 at 12:14 PM, Jim Abramson <[EMAIL PROTECTED]> wrote:
> We are testing HBase for some read-heavy batch operations, and
> encountering frequent, silent RegionServer crashes.
'Silent' is interesting. Which files did you check? .log and the .out?
Nothing in the latter?
Thanks very much for your reply.
In fact, the .out log did indicate OOMEs. I looked right past them initially, as their formatting makes the "kill -9 N" output look like ignorable comments to a Python developer :-)
Knowing that this was the problem all along, I've been doing some experimenting and trying to figure out how to prevent this from happening.
My use case calls for lots of long RowKey scans arriving in bursts, alongside a steady stream of puts and random-access reads. Based on what I've read at http://hbase.apache.org/book/regionserver.arch.html, I assumed that disabling caching on my scans would help. And to an extent it did, as the RegionServers' heap usage stays much more level now while I'm scanning. Unfortunately, now my Thrift servers are going OOM during the scans, in exactly the same way - which is arguably better, but still far from acceptable for production.
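For illustration, here is a rough Python sketch of what a bounded scan through Thrift might look like on the client side: rows are pulled in small fixed-size batches and the scanner is always closed afterwards. It assumes a client object exposing the Thrift gateway's scannerGetList(id, nbRows) and scannerClose(id) calls. The helper name scan_in_batches, the batch_size default, and the FakeClient stand-in are hypothetical and exist only so the sketch runs on its own. Whether this actually relieves the Thrift server's heap is not something I've confirmed.

```python
def scan_in_batches(client, scanner_id, batch_size=100):
    """Yield rows one at a time, fetching at most batch_size rows per call.

    `client` is assumed to behave like a Thrift Hbase.Client: scannerGetList
    returns an empty list once the scanner is exhausted.
    """
    try:
        while True:
            rows = client.scannerGetList(scanner_id, batch_size)
            if not rows:
                break
            for row in rows:
                yield row
    finally:
        # Release the server-side scanner even if the caller stops early.
        client.scannerClose(scanner_id)


class FakeClient:
    """Hypothetical in-memory stand-in for a Thrift client (illustration only)."""

    def __init__(self, rows):
        self._rows = list(rows)
        self.largest_batch = 0
        self.closed = False

    def scannerGetList(self, scanner_id, nb_rows):
        # Hand back the next nb_rows rows and remember the biggest batch served.
        batch, self._rows = self._rows[:nb_rows], self._rows[nb_rows:]
        self.largest_batch = max(self.largest_batch, len(batch))
        return batch

    def scannerClose(self, scanner_id):
        self.closed = True
```

With a real client, scanner_id would come from one of the scannerOpen calls. The point of the generator is that no more than batch_size rows are requested per round trip, and the try/finally avoids leaking scanners when a burst of scans is abandoned partway through.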
Any advice on how to harden the system overall against these kinds of memory issues? I'm disinclined to simply jack up memory allocations and hope for the best, since that seems like postponement of the problem, not prevention.
As far as Thrift is concerned, I did notice HBASE-4863 and it seems promising, but I may be locked into 0.92 initially.