Re: Bulk loading job failed when one region server went down in the cluster
Anil,

  Do you have a root cause for the RS failure?  I have never heard of one RS
failure causing a whole job to fail.

On Tue, Aug 7, 2012 at 1:59 PM, anil gupta <[EMAIL PROTECTED]> wrote:

> Hi HBase Folks,
>
> I ran the bulk loader last night to load data into a table. During the
> bulk loading job, one of the region servers crashed and the entire job
> failed. The job takes around 2.5 hours to finish, and it failed at around
> 50% complete. After the failure, that table was also left corrupted in
> HBase. My cluster has 8 region servers.
>
> Is bulk loading not fault-tolerant to region server failures?
>
> I am reviving this old email chain because at the time my question went
> unanswered. Please share your views.
>
> Thanks,
> Anil Gupta
>
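For context, a bulk load job of this shape runs in two phases: a MapReduce
job writes HFiles partitioned to match the table's region boundaries, and a
second step then moves those files into the live regions. Below is a
minimal sketch against HBase 0.92-era APIs; the table name, paths, and the
CSV mapper are hypothetical, and this is not Anil's actual job.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
    import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class BulkLoadDriver {

      // Hypothetical mapper: one input line "rowkey,value" becomes one Put.
      static class CsvMapper
          extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        @Override
        protected void map(LongWritable key, Text line, Context ctx)
            throws IOException, InterruptedException {
          String[] parts = line.toString().split(",", 2);
          byte[] row = Bytes.toBytes(parts[0]);
          Put put = new Put(row);
          put.add(Bytes.toBytes("f"), Bytes.toBytes("v"),
              Bytes.toBytes(parts[1]));
          ctx.write(new ImmutableBytesWritable(row), put);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "hfile-generation");
        job.setJarByClass(BulkLoadDriver.class);
        job.setMapperClass(CsvMapper.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);

        HTable table = new HTable(conf, "my_table");   // hypothetical table
        FileInputFormat.addInputPath(job, new Path("/input/csv"));
        Path hfiles = new Path("/tmp/hfiles");         // hypothetical path
        FileOutputFormat.setOutputPath(job, hfiles);

        // Phase 1: emit HFiles partitioned by the table's region
        // boundaries as they exist at job-submit time.
        HFileOutputFormat.configureIncrementalLoad(job, table);
        if (!job.waitForCompletion(true)) {
          System.exit(1);  // HFile generation failed
        }

        // Phase 2: hand the finished HFiles to whichever regions now host
        // each key range; this step retries if regions moved or split.
        new LoadIncrementalHFiles(conf).doBulkLoad(hfiles, table);
      }
    }

A 2.5-hour job that dies at 50% is presumably still in phase 1, where task
failures are normally retried by MapReduce rather than failing the job
outright, which is what makes the behavior Anil describes surprising.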
> On Tue, Apr 3, 2012 at 9:12 AM, anil gupta <[EMAIL PROTECTED]> wrote:
>
> > Hi Kevin,
> >
> > I am not really concerned about the RegionServer going down, as the same
> > thing can happen when deployed in production. In production we won't be
> > on a VM environment, and I am aware that my current dev environment is
> > not suited for heavy processing. What I am concerned about is the failure
> > of the bulk loading job when the Region Server failed. Does this mean
> > that the bulk loading job is not fault-tolerant to the failure of a
> > Region Server? I was expecting the job to succeed even though the
> > RegionServer failed, because there were 6 more RSes running in the
> > cluster. Fault tolerance is one of the biggest selling points of the
> > Hadoop platform. Let me know your views.
> > Thanks for your time.
> >
> > Thanks,
> > Anil Gupta
> >
> >
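On the fault-tolerance question: the HBase client does ride over transient
region server failures by retrying operations while regions are reassigned,
and how long it keeps trying is tunable. A hedged sketch of the relevant
client-side knobs follows; the property names are real HBase client keys,
but the values are illustrative assumptions, not recommendations.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class ClientRetryTuning {
      public static Configuration tunedConf() {
        Configuration conf = HBaseConfiguration.create();
        // More retries give the Master time to reassign regions after an
        // RS dies; the default in this era was around 10.
        conf.setInt("hbase.client.retries.number", 20);
        // Base pause in ms between retries; HBase backs off from here.
        conf.setLong("hbase.client.pause", 2000);
        return conf;
      }
    }

Whether that saves a whole MapReduce job also depends on the job's own
task-retry settings (e.g. mapred.map.max.attempts), since a failed task
attempt can be rescheduled once the regions are back online.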
> > On Tue, Apr 3, 2012 at 7:34 AM, Kevin O'dell <[EMAIL PROTECTED]> wrote:
> >
> >> Anil,
> >>
> >>  I am sorry for the delayed response.  Reviewing the logs, it appears:
> >>
> >> 12/03/30 15:38:31 INFO zookeeper.ClientCnxn: Client session timed out,
> >> have not heard from server in 59311ms for sessionid 0x136557f99c90065,
> >> closing socket connection and attempting reconnect
> >>
> >> 12/03/30 15:38:32 FATAL regionserver.HRegionServer: ABORTING region
> >> server serverName=ihub-dn-b1,60020,1332955859363, load=(requests=0,
> >> regions=44, usedHeap=446, maxHeap=1197): Unhandled exception:
> >> org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
> >> currently processing ihub-dn-b1,60020,1332955859363 as dead server
> >>
> >>   It appears to be a classic case of an overworked RS.  The RS was
> >> doing too much and did not respond in time, so the Master marked it as
> >> dead; when the RS finally responded, the Master replied that it was
> >> already considered dead and aborted the server.  This is why you see
> >> the YouAreDeadException.  It is probably due to the shared resources of
> >> the VM infrastructure you are running on.  You will either need to
> >> devote more resources or add more nodes (most likely physical) to the
> >> cluster if you would like to keep running these jobs.
> >>
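Beyond adding resources, the usual mitigations for this failure mode are to
reduce garbage-collection pauses on the RS and to give the ZooKeeper
session more slack. A sketch follows: the key is a real HBase property, but
the value is an illustrative assumption, and in practice it is set in
hbase-site.xml on each region server rather than in code.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class SessionTimeoutTuning {
      public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();
        // The log shows the RS silent for ~59s before its session expired,
        // consistent with a roughly 60s session timeout. Raising it buys
        // an overloaded or GC-pausing RS more time before the Master
        // declares it dead, at the cost of slower detection of real
        // failures.
        conf.setInt("zookeeper.session.timeout", 120000);  // ms; assumption
        System.out.println(conf.getInt("zookeeper.session.timeout", -1));
      }
    }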
> >> On Fri, Mar 30, 2012 at 9:24 PM, anil gupta <[EMAIL PROTECTED]> wrote:
> >> > Hi Kevin,
> >> >
> >> > Here is a Dropbox link to the log file of the region server that
> >> > failed:
> >> > http://dl.dropbox.com/u/64149128/hbase-hbase-regionserver-ihub-dn-b1.out
> >> > IMHO, the problem starts at line #3009, which says: 12/03/30 15:38:32
> >> > FATAL regionserver.HRegionServer: ABORTING region server
> >> > serverName=ihub-dn-b1,60020,1332955859363, load=(requests=0, regions=44,
> >> > usedHeap=446, maxHeap=1197): Unhandled exception:
> >> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
> >> > currently processing ihub-dn-b1,60020,1332955859363 as dead server
> >> >
> >> > I have already tested the fault tolerance of HBase by manually bringing
> >> > down an RS while querying a table, and it worked fine. I was expecting
> >> > the same today (even though the RS went down by itself this time) while
> >> > I was loading the data, but it didn't work out well.
> >> > Thanks for your time. Let me know if you need more details.

Kevin O'Dell
Customer Operations Engineer, Cloudera