Thanks for the detailed answer.
We are running 0.90.2, and the problem was resolved after running a major
compaction. It seems that the problem was client requests waiting in the
queue (so I don't understand why a major compaction solved it...).
Anyhow, I will try to apply the configurations you sent and see if it can
eliminate the problem.
On Mon, Mar 26, 2012 at 7:43 PM, Jean-Daniel Cryans <[EMAIL PROTECTED]> wrote:
> On Sun, Mar 25, 2012 at 1:23 AM, Lior Schachter <[EMAIL PROTECTED]> wrote:
> > Hi all,
> > We use HBase 0.9.2. We recently started to experience region servers
> You mean 0.90.2? Or 0.92.0?
> > crashing under heavy load (2-3 different servers crash during each load).
> > It seems like a missing block in HDFS causes a full GC and regions are
> > being closed.
> Not at all.
> So first, we can see that your region server was doing full GCs back to
> back because it wasn't able to collect anything (look how the numbers
> aren't decreasing). This eventually led to a session timeout in
> ZooKeeper, and at some point your region server woke up and saw that it
> had lost control of its HDFS files because of IO fencing (I know those
> exceptions look bad, but it's "normal").
> Now to fix this, there are multiple avenues to explore. The non-stop
> GCing means that the memory is completely full, but it grew slowly
> enough that it got a concurrent mode failure before an
> OutOfMemoryError. Here's what's using memory in HBase:
> - MemStores
> - Block Cache
> - Client requests
> - Background tasks like flushing and compacting
> For the first two, you can control the amount of memory they use; have
> you tweaked those settings?
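A quick way to see why the last two items matter: once the MemStores and the block cache reach their ceilings, whatever heap is left is all that client requests and flushes/compactions get to share. This is a back-of-the-envelope sketch; the default fractions are assumptions modeled on hbase.regionserver.global.memstore.upperLimit and hfile.block.cache.size in 0.90-era hbase-default.xml, so check the values your version actually ships with.

```python
def leftover_heap_mb(heap_mb, memstore_upper_limit=0.4, block_cache_size=0.2):
    """Heap left for client requests and background tasks once the
    MemStores and the block cache are at their configured ceilings.

    Default fractions are assumptions (0.90-era defaults for
    hbase.regionserver.global.memstore.upperLimit and hfile.block.cache.size).
    """
    if (memstore_upper_limit < 0 or block_cache_size < 0
            or memstore_upper_limit + block_cache_size > 1):
        raise ValueError("fractions must be non-negative and sum to at most 1")
    return heap_mb * (1 - memstore_upper_limit - block_cache_size)


# An 8 GB heap with the assumed defaults leaves roughly 3.2 GB for everything else.
print(round(leftover_heap_mb(8000)))
```

Shrinking either fraction buys headroom for request payloads, at the cost of more frequent flushes or a colder block cache.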
> And you say this is happening under heavy load (I'm guessing
> inserts?), so it might be that the client requests carry payloads that
> the region server can't possibly hold all at the same time. In 0.90
> and 0.92 the amount of memory dedicated to this is unbounded;
> starting in 0.94 this will come in handy:
> At the same time, I'm pretty sure you have some blocking and/or
> splitting going on, and the client requests are just sitting there in
> the region server memory while this happens (grep -i your logs for
> "block" to confirm).
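For reference, "blocking" here roughly means one of two conditions. The sketch below is a simplified model, not HBase's code: the defaults (64 MB flush size, block multiplier of 2, 7 blocking store files) are assumed from 0.90-era hbase.hregion.memstore.flush.size, hbase.hregion.memstore.block.multiplier and hbase.hstore.blockingStoreFiles, and the exact comparisons may differ between versions. The function name region_pressure is hypothetical.

```python
MB = 1024 * 1024


def region_pressure(memstore_bytes, store_file_counts,
                    flush_size=64 * MB, block_multiplier=2,
                    blocking_store_files=7):
    """Simplified model of the two conditions that stall writes to a region.

    Defaults are assumptions based on 0.90-era configuration defaults.
    """
    return {
        # A store with too many files makes the flusher wait for compactions,
        # so the memstore keeps growing instead of being flushed.
        "flush_delayed": any(n > blocking_store_files for n in store_file_counts),
        # Past multiplier x flush size, writes to the region block until a
        # flush brings the memstore back down.
        "updates_blocked": memstore_bytes > block_multiplier * flush_size,
    }


print(region_pressure(200 * MB, [3]))  # big memstore: updates block
print(region_pressure(10 * MB, [8]))   # too many store files: flush waits
```

Raising blocking_store_files and flush_size in this model is the "block as little as possible" tuning suggested below; while either flag is set, incoming calls pile up in the region server's memory.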
> At this point there are 3 things you can do:
> - Use bulk loading instead of brute forcing it into HBase.
> - Tune HBase in order to block as little as possible if you still like
> brute forcing. This means setting bigger regions, bigger memstores,
> and configuring compactions to block only at more than 7 files.
> - Tune the IPC queue capacity in order to have fewer calls sitting in
> the RS memory. If you're on 0.90.2, the "easy" way to do it is to have
> fewer handlers by setting hbase.regionserver.handler.count to less than
> 10. On 0.92 you can tweak it more directly with
> ipc.server.max.queue.size, where the default is 100 (or 1000 in 0.90.2
> FWIW). This was all discussed in
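Some rough arithmetic on why fewer handlers helps on 0.90.2. The assumption here is that the call queue holds handler_count x 100 calls, which matches the 1000 quoted above for the default of 10 handlers; the 2 MB average call size is purely hypothetical, and both function names are made up for illustration.

```python
def queue_capacity_0_90(handler_count, calls_per_handler=100):
    """Assumed 0.90.2 behavior: call queue capacity scales with the
    handler count (10 handlers -> 1000 queued calls)."""
    return handler_count * calls_per_handler


def worst_case_buffered_mb(queue_capacity, handler_count, avg_call_mb):
    """Heap held by request payloads when the queue is full and every
    handler is busy: queued calls plus the calls being processed."""
    return (queue_capacity + handler_count) * avg_call_mb


# Hypothetical 2 MB batched puts.
print(worst_case_buffered_mb(queue_capacity_0_90(10), 10, 2))  # 2020
print(worst_case_buffered_mb(queue_capacity_0_90(5), 5, 2))    # 1010
print(worst_case_buffered_mb(100, 10, 2))                      # 220
```

The last line models 0.92 with ipc.server.max.queue.size at 100, where the queue no longer grows with the handler count, so the worst case drops by an order of magnitude without touching the handlers.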
> Hope this helps in some way,