HBase user mailing list: Bulk loading job failed when one region server went down in the cluster
Re: Bulk loading job failed when one region server went down in the cluster
Hi Guys,

Sorry for not mentioning the version I am currently running. My current
version is HBase 0.92.1 (CDH4), and I am running Hadoop 2.0.0-alpha with YARN
for MR. My original post was for HBase 0.92. Here are some more details of my
current setup:
I am running an 8-slave, 4-admin-node cluster of CentOS 6.0 VMs installed on
VMware Hypervisor 5.0. Each of my VMs has 3.2 GB of memory and 500
HDFS space.
I use this cluster for POC (proof of concept) work; I am not looking for any
performance benchmarking from this setup. Due to some major bugs in YARN, I
am unable to make it work properly with less than 4 GB of memory. I am
already discussing these issues on the Hadoop mailing list.
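To be concrete, this is the kind of memory capping I have been experimenting with. Property names are per the Hadoop 2.x YARN docs; the values are only illustrative guesses for these 3.2 GB VMs, not a tested recommendation:

```xml
<!-- yarn-site.xml (illustrative values only) -->
<property>
  <!-- Total memory the NodeManager may hand out to containers on this node;
       kept well under 3.2 GB to leave headroom for the RegionServer and OS -->
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>2048</value>
</property>
<property>
  <!-- Smallest container the scheduler will allocate -->
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>512</value>
</property>
```

With a 2048 MB node budget and, say, 1024 MB per map task (`mapreduce.map.memory.mb`), YARN would schedule at most 2 concurrent maps per slave instead of 8.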

Here is the log of failed mapper: http://pastebin.com/f83xE2wv

The problem is that when I start a bulk loading job in YARN, 8 map
processes start on each slave, and all of my slaves get hammered badly as a
result. Because the slaves are overloaded, the RegionServer's ZooKeeper
lease expires or it gets a YouAreDeadException. Here is the log of the RS
which caused the job to fail: http://pastebin.com/9ZQx0DtD
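One knob I am aware of (though it only masks the overload rather than fixing it) is the ZooKeeper session timeout that governs when a RegionServer is declared dead. A sketch, with an assumed illustrative value:

```xml
<!-- hbase-site.xml (sketch; a longer timeout only buys time under load) -->
<property>
  <!-- Milliseconds before an unresponsive RegionServer's ZK session expires
       and the master marks it dead (YouAreDeadException on recovery) -->
  <name>zookeeper.session.timeout</name>
  <value>120000</value>
</property>
```

Note the effective timeout is also bounded by the ZooKeeper server's own tick/session limits, so raising the HBase side alone may not take effect.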

I am aware that this is happening due to underperforming hardware (two
slaves share one 7200 rpm hard drive in my setup) and some major bugs when
running YARN with less than 4 GB of memory. My only concern is the failure
of the entire MR job and its fault tolerance to RS failures. I am not
really concerned about the RS failure itself, since HBase is fault tolerant.
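Since MR normally retries failed tasks, my expectation was that a map task whose target RS died would simply be rescheduled and reconnect to the region's new location. One thing I could try is raising the retry budget; property names are the Hadoop 2.x ones, and the values are illustrative assumptions:

```xml
<!-- mapred-site.xml (sketch; defaults are typically 4 attempts) -->
<property>
  <!-- Times a failed map task is re-attempted before the whole job fails -->
  <name>mapreduce.map.maxattempts</name>
  <value>8</value>
</property>
<property>
  <name>mapreduce.reduce.maxattempts</name>
  <value>8</value>
</property>
```

This would only help if the retried tasks can actually succeed once the regions are reassigned, i.e. if the RS failure is transient from the job's point of view.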

Please let me know if you need anything else.


On Mon, Aug 13, 2012 at 6:58 AM, Michael Segel <[EMAIL PROTECTED]> wrote:

> Yes, it can.
> You can see one RS failure causing a cascading RS failure. Of course YMMV,
> and it depends on which version you are running.
> OP is on CDH3u2, which still had some issues. CDH3u4 is the latest, and he
> should upgrade.
> (Or go to CDH4...)
> -Mike
> On Aug 13, 2012, at 8:51 AM, Kevin O'dell <[EMAIL PROTECTED]>
> wrote:
> > Anil,
> >
> > Do you have a root cause for the RS failure? I have never heard of one
> > RS failure causing a whole job to fail.
> >
> > On Tue, Aug 7, 2012 at 1:59 PM, anil gupta <[EMAIL PROTECTED]> wrote:
> >
> >> Hi HBase Folks,
> >>
> >> I ran the bulk loader yesterday night to load data into a table. During
> >> the bulk loading job, one of the region servers crashed and the entire
> >> job failed. It takes around 2.5 hours for this job to finish, and the
> >> job failed when it was around 50% complete. After the failure, the
> >> table was also corrupted in HBase. My cluster has 8 region servers.
> >>
> >> Is bulk loading not fault tolerant to failure of region servers?
> >>
> >> I am replying on this old email chain because my question went
> >> unanswered at the time. Please share your views.
> >>
> >> Thanks,
> >> Anil Gupta
> >>
> >> On Tue, Apr 3, 2012 at 9:12 AM, anil gupta <[EMAIL PROTECTED]> wrote:
> >>
> >>> Hi Kevin,
> >>>
> >>> I am not really concerned about the RegionServer going down, as the
> >>> same thing can happen in production. Although in production we won't
> >>> have a VM environment, and I am aware that my current dev environment
> >>> is not suited to heavy processing, what I am concerned about is the
> >>> failure of the bulk loading job when the RegionServer failed. Does
> >>> this mean that the bulk loading job is not fault tolerant to the
> >>> failure of a RegionServer? I was expecting the job to succeed even
> >>> though the RegionServer failed, because there were 6 more RS running
> >>> in the cluster. Fault tolerance is one of the biggest selling points
> >>> of the Hadoop platform. Let me know your views.
> >>> Thanks for your time.
> >>> Thanks for your time.
> >>>
> >>> Thanks,
> >>> Anil Gupta
> >>>
> >>>
> >>> On Tue, Apr 3, 2012 at 7:34 AM, Kevin O'dell <[EMAIL PROTECTED]> wrote:
> >>>
> >>>> Anil,
> >>>>
> >>>> I am sorry for the delayed response.  Reviewing the logs it appears:
> >>>>
> >>>> 2/03/30 15:38:31 INFO zookeeper.ClientCnxn: Client session timed out,
Thanks & Regards,
Anil Gupta