HBase >> mail # user >> One Region Server fails - all M/R jobs crash.


Re: One Region Server fails - all M/R jobs crash.
Your logs suggest that you are overloading resources (servers/network/memory). How much data are you scanning with your MR job, and how much are you writing back to HBase? What values are you setting for setBatch, setCaching, and setCacheBlocks? How much memory do you have on your region servers? One server crashing should not cause a job to fail, because it will move on to the next one (provided the right params for retries and retry interval are set). Your region server logs suggest that it's way more complicated than that.
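
To make the point about retries and retry interval concrete, here is a rough sketch of how the two settings translate into a total window during which a client keeps retrying before it gives up. The relevant client properties are `hbase.client.retries.number` and `hbase.client.pause`. The backoff multipliers below are an assumption for illustration (check `HConstants.RETRY_BACKOFF` in your HBase version), and the function name and example values are hypothetical.

```python
# Rough estimate of how long an HBase client keeps retrying a failed call.
# ASSUMPTION: the wait before attempt i is pause_ms * backoff[i], and the last
# multiplier is reused once the table is exhausted. The table is illustrative;
# verify it against HConstants.RETRY_BACKOFF for the version you run.
ASSUMED_BACKOFF = (1, 1, 1, 2, 2, 4, 4, 8, 16, 32)


def total_retry_wait_ms(retries, pause_ms, backoff=ASSUMED_BACKOFF):
    """Return the sum of the pauses across `retries` attempts, in milliseconds."""
    total = 0
    for attempt in range(retries):
        multiplier = backoff[min(attempt, len(backoff) - 1)]
        total += pause_ms * multiplier
    return total


if __name__ == "__main__":
    # Example values only; substitute what your cluster is configured with.
    for retries, pause_ms in [(10, 1000), (3, 1000)]:
        wait_s = total_retry_wait_ms(retries, pause_ms) / 1000
        print(f"retries={retries} pause={pause_ms}ms -> about {wait_s:.0f}s of retrying")
```

If this window is shorter than the time the cluster needs to reassign the regions of a dead server, tasks are likely to exhaust their retries and fail even though the data is still available.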

2013-11-17 09:58:37,513 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Received close for region we are already opening or closing; e54b8e16ffbe2187b9017fef596c62aa

This looks like a region state inconsistency issue.

I also see that you are using Java 7. Though some people have had success using it, I am not sure if Java 7 is currently the recommended version (most people use Java 6!)

2013-11-18 18:01:47,959 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x342654dfdd30017, likely server has closed socket, closing socket connection and attempting reconnect

This line suggests a problem with your ZooKeeper. If ZooKeeper fails, HBase will too, and so will your MR job over HBase.

2013-11-21 06:54:01,105 WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /XXX.XXX.XXX.XXX:50010 for block, add to deadNodes and continue. java.net.ConnectException: Connection refused

And this suggests a datanode crashed. So many processes crashing at the same time (I don't know whether they belong to the same server or not) looks like a load issue or a network issue to me.
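
One way to check whether such warnings really cluster in time is to bucket them by hour and by component. This is a minimal sketch, assuming the log4j line layout seen in the excerpts above; the function name, the hourly bucket size, and the set of levels are illustrative choices, and the sample lines are the three quoted in this message.

```python
import re
from collections import defaultdict

# ASSUMPTION: log4j layout "YYYY-MM-DD HH:MM:SS,mmm LEVEL logger.Class: message".
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2}) (\d{2}):\d{2}:\d{2},\d{3} (\w+) ([\w.$]+): (.*)$"
)


def bucket_by_hour(lines, levels=("WARN", "ERROR", "FATAL")):
    """Map (date, hour) -> {short class name: count} for the given levels."""
    buckets = defaultdict(lambda: defaultdict(int))
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip stack traces and other continuation lines
        date, hour, level, logger, _message = match.groups()
        if level in levels:
            short_name = logger.rsplit(".", 1)[-1]
            buckets[(date, hour)][short_name] += 1
    return {key: dict(value) for key, value in buckets.items()}


SAMPLE = [
    "2013-11-17 09:58:37,513 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Received close for region we are already opening or closing; e54b8e16ffbe2187b9017fef596c62aa",
    "2013-11-18 18:01:47,959 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x342654dfdd30017, likely server has closed socket, closing socket connection and attempting reconnect",
    "2013-11-21 06:54:01,105 WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /XXX.XXX.XXX.XXX:50010 for block, add to deadNodes and continue. java.net.ConnectException: Connection refused",
]

if __name__ == "__main__":
    for (date, hour), counts in sorted(bucket_by_hour(SAMPLE).items()):
        print(date, f"{hour}:00", counts)
```

Running this over the full region server, datanode, and ZooKeeper logs of each machine would show whether several components complain within the same hour, which is what a load or network problem would be expected to produce.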
 
Regards,
Dhaval
________________________________
 From: David Koch <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Friday, 22 November 2013 12:35 PM
Subject: Re: One Region Server fails - all M/R jobs crash.
 

Here you go:

Task log: http://pastebin.com/VePTLHEk
Region Server log: http://pastebin.com/iu8y0VYL

On Fri, Nov 22, 2013 at 6:27 PM, Ted Yu <[EMAIL PROTECTED]> wrote:

> Attachment didn't go through.
>
> Can you pastebin their contents ?
>
> Thanks
>
> On Nov 23, 2013, at 12:55 AM, David Koch <[EMAIL PROTECTED]> wrote:
>
> > Sorry for the previous message, I attach the required log files.
> >
> > Regards,
> >
> > David
> >
> >
> > On Fri, Nov 22, 2013 at 5:53 PM, David Koch <[EMAIL PROTECTED]>
> wrote:
> >>
> >>
> >>
> >> On Fri, Nov 22, 2013 at 4:17 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
> >>> Can you pastebin snippet of:
> >>> 1. task logs which show failure
> >>> 2. region server log shortly before the crash
> >>>
> >>> Thanks
> >>>
> >>>
> >>> On Fri, Nov 22, 2013 at 7:14 AM, David Koch <[EMAIL PROTECTED]>
> wrote:
> >>>
> >>> > Hello,
> >>> >
> >>> > We experience reliability problems when running M/R jobs over HBase
> tables.
> >>> > Specifically, it suffices for one Region Server to crash in order to
> fail
> >>> > all M/R jobs.
> >>> >
> >>> > My guess is that this is not normal with a replication factor of 3.
> >>> >
> >>> > The HBase version is 0.94.6 installed as part of Cloudera 4.4.
> HBase
> >>> > settings are pre-sets. Cluster size is 30 machines.
> >>> >
> >>> > What steps can I follow to improve the situation?
> >>> >
> >>> > Thank you,
> >>> >
> >>> > /David
> >>> >
> >
>