Thank you for your replies.
Not that it matters, but cache is 1 and batch is -1 on the scan, i.e. each RPC
call returns one row. The jobs don't write any data back to HBase;
compaction is deactivated and done manually. At the time of the crash all
datanodes were fine, and hbck showed no inconsistencies. Table size is about
10k regions/3 billion records on the largest tables, and we do a lot of
server-side filtering to limit what's sent across the network.
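For what it's worth, here is a back-of-the-envelope sketch (illustrative arithmetic only, not from the thread) of why caching=1 makes a scan of this size so RPC-heavy: with N rows and Scan.setCaching(c), the client issues roughly ceil(N/c) scanner next() RPCs.

```java
public class ScanRpcEstimate {
    // Rough count of scanner next() RPCs needed to pull `rows` rows when
    // Scan.setCaching(caching) is in effect: each RPC returns up to
    // `caching` rows. Ignores wide-row batching, filters and retries.
    static long rpcCalls(long rows, int caching) {
        return (rows + caching - 1) / caching; // ceiling division
    }

    public static void main(String[] args) {
        long rows = 3_000_000_000L; // ~3 billion records, as above
        System.out.println("caching=1   -> " + rpcCalls(rows, 1) + " RPCs");
        System.out.println("caching=500 -> " + rpcCalls(rows, 500) + " RPCs");
    }
}
```

At caching=1 that is one round trip per row; even a modest caching value cuts the RPC count by orders of magnitude, at the cost of holding more rows in memory per open scanner on the region server.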
Our machines may not be the most powerful: 32GB RAM, 8 cores, 1 disk. It's
also true that when we took a closer look in the past, most of the issues we
had turned out to be rooted in overloaded CPUs or insufficient memory -
hardware stuff.
What I don't get is why HBase always crashes. If it's slow, fine - the
hardware is a bottleneck - but you'd at least expect it to pull through
eventually. Some days all jobs work fine, some days they don't, and there is
no telling why. HBase's erratic behavior has been causing us a lot of
headaches, and we have spent far too much time fiddling with HBase
configuration settings over the past 18 months.
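For anyone tuning the retry behavior discussed in this thread (the "params for retries and interval" mentioned below), a hedged example of the client-side settings involved - the property names are the standard 0.94-era ones, but the values here are illustrative and defaults differ by version:

```xml
<!-- hbase-site.xml (client side) - illustrative values, not a recommendation -->
<property>
  <name>hbase.client.retries.number</name>
  <value>10</value> <!-- how many times the client retries an operation -->
</property>
<property>
  <name>hbase.client.pause</name>
  <value>1000</value> <!-- base pause between retries, in milliseconds -->
</property>
```

Together these bound how long a client waits out a region server failure before the job fails; too-small values can turn a single transient crash into a failed MR job.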
On Fri, Nov 22, 2013 at 7:05 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
> Thanks Dhaval for the analysis.
> bq. The HBase version is 0.94.6
> Please upgrade to 0.94.13, if possible. There have been several JIRAs
> backporting patches from trunk where jdk 1.7 is supported.
> Please also check your DataNode log to see whether there was problem there
> (likely there was).
> On Sat, Nov 23, 2013 at 2:00 AM, Dhaval Shah <[EMAIL PROTECTED]> wrote:
> > Your logs suggest that you are overloading resources
> > (servers/network/memory). How much data are you scanning with your MR jobs
> > and how much are you writing back to HBase? What values are you setting for
> > setBatch, setCaching, setCacheBlocks? How much memory do you have on your
> > region servers? 1 server crashing should not cause a job to fail because the
> > client will move on to the next one (given the right params for retries and
> > retry interval are set). Your region server logs suggest that it's way more
> > complicated than that.
> > 2013-11-17 09:58:37,513 WARN
> > org.apache.hadoop.hbase.regionserver.HRegionServer: Received close for
> > region we are already opening or closing;
> > looks like some state inconsistency issue
> > I also see that you are using Java 7. Though some people have had success
> > using it, I am not sure if Java 7 is currently the recommended version
> > (most people use Java 6!)
> > 2013-11-18 18:01:47,959 INFO org.apache.zookeeper.ClientCnxn: Unable to
> > read additional data from server sessionid 0x342654dfdd30017, likely server
> > has closed socket, closing socket connection and attempting reconnect
> > This line suggests a problem with your zookeeper. If zookeeper goes
> > down, HBase will go down, and hence your MR job over HBase will fail.
> > 2013-11-21 06:54:01,105 WARN org.apache.hadoop.hdfs.DFSClient: Failed to
> > connect to /XXX.XXX.XXX.XXX:50010 for block, add to deadNodes and continue.
> > java.net.ConnectException: Connection refused
> > And this suggests a datanode crashed. So many processes (I don't know if
> > they belong to the same server or not) crashing at the same time looks like
> > a load issue or a network issue to me.
> > Regards,
> > Dhaval
> > ________________________________
> > From: David Koch <[EMAIL PROTECTED]>
> > To: [EMAIL PROTECTED]
> > Sent: Friday, 22 November 2013 12:35 PM
> > Subject: Re: One Region Server fails - all M/R jobs crash.
> > Here you go:
> > Task log: http://pastebin.com/VePTLHEk
> > Region Server log: http://pastebin.com/iu8y0VYL
> > On Fri, Nov 22, 2013 at 6:27 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
> > > Attachment didn't go through.
> > >
> > > Can you pastebin their contents ?
> > >
> > > Thanks