HBase >> mail # user >> Bulk loading job failed when one region server went down in the cluster


Re: Bulk loading job failed when one region server went down in the cluster
Yes, it can.
You can see one RS failure causing cascading RS failures. Of course, YMMV, and it depends on which version you are running.

OP is on CDH3u2, which still had some issues. CDH3u4 is the latest, and he should upgrade.

(Or go to CDH4...)

HTH

-Mike

On Aug 13, 2012, at 8:51 AM, Kevin O'dell <[EMAIL PROTECTED]> wrote:

> Anil,
>
>  Do you have root cause on the RS failure?  I have never heard of one RS
> failure causing a whole job to fail.
>
> On Tue, Aug 7, 2012 at 1:59 PM, anil gupta <[EMAIL PROTECTED]> wrote:
>
>> Hi HBase Folks,
>>
>> I ran the bulk loader last night to load data into a table. During the
>> bulk loading job one of the region servers crashed and the entire job
>> failed. It takes around 2.5 hours for this job to finish, and the job failed
>> when it was around 50% complete. After the failure, that table was also
>> corrupted in HBase. My cluster has 8 region servers.
>>
>> Is bulk loading not fault tolerant to failure of region servers?
>>
>> I am using this old email chain because at that time my question went
>> unanswered. Please share your views.
>>
>> Thanks,
>> Anil Gupta
>>
>> On Tue, Apr 3, 2012 at 9:12 AM, anil gupta <[EMAIL PROTECTED]> wrote:
>>
>>> Hi Kevin,
>>>
>>> I am not really concerned about the RegionServer going down, as the same
>>> thing can happen when deployed in production. In production we won't
>>> have a VM environment, though, and I am aware that my current dev
>>> environment is not good for heavy processing. What I am concerned about is
>>> the failure of the bulk loading job when the RegionServer failed. Does this
>>> mean that the bulk loading job is not fault tolerant to the failure of a
>>> RegionServer? I was expecting the job to succeed even though the
>>> RegionServer failed, because there were 6 more RSs running in the cluster.
>>> Fault tolerance is one of the biggest selling points of the Hadoop
>>> platform. Let me know your views.
>>> Thanks for your time.
>>>
>>> Thanks,
>>> Anil Gupta
>>>
>>>
>>> On Tue, Apr 3, 2012 at 7:34 AM, Kevin O'dell <[EMAIL PROTECTED]>
>>> wrote:
>>>
>>>> Anil,
>>>>
>>>> I am sorry for the delayed response.  Reviewing the logs it appears:
>>>>
>>>> 12/03/30 15:38:31 INFO zookeeper.ClientCnxn: Client session timed out,
>>>> have not heard from server in 59311ms for sessionid 0x136557f99c90065,
>>>> closing socket connection and attempting reconnect
>>>>
>>>> 12/03/30 15:38:32 FATAL regionserver.HRegionServer: ABORTING region
>>>> server serverName=ihub-dn-b1,60020,1332955859363, load=(requests=0,
>>>> regions=44, usedHeap=446, maxHeap=1197): Unhandled exception:
>>>> org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
>>>> currently processing ihub-dn-b1,60020,1332955859363 as dead server
>>>>
>>>> It appears to be a classic overworked RS. You were asking too much of
>>>> the RS and it did not respond in time, so the Master marked it as
>>>> dead. When the RS finally responded, the Master replied that it was
>>>> already dead, and the server aborted. This is why you see the
>>>> YouAreDeadException. This is probably due to the shared resources of
>>>> the VM infrastructure you are running on. You will either need to
>>>> devote more resources or add more nodes (most likely physical) to the
>>>> cluster if you would like to keep running these jobs.
>>>>
>>>> On Fri, Mar 30, 2012 at 9:24 PM, anil gupta <[EMAIL PROTECTED]> wrote:
>>>>> Hi Kevin,
>>>>>
>>>>> Here is dropbox link to the log file of region server which failed:
>>>>> http://dl.dropbox.com/u/64149128/hbase-hbase-regionserver-ihub-dn-b1.out
>>>>> IMHO, the problem starts from the line #3009 which says: 12/03/30 15:38:32
>>>>> FATAL regionserver.HRegionServer: ABORTING region server
>>>>> serverName=ihub-dn-b1,60020,1332955859363, load=(requests=0, regions=44,
>>>>> usedHeap=446, maxHeap=1197): Unhandled exception:
>>>>> org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
>>>>> currently processing ihub-dn-b1,60020,1332955859363 as dead server
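
Editorial note: the sequence described in the quoted April 3 reply (the RS goes silent past the session timeout, the Master marks it dead, and the RS's next report is rejected) can be sketched as a toy model. This is a simplified illustration, not HBase code: `ToyMaster`, `YouAreDeadError`, and `region_server_aborts` are hypothetical names, and the 40,000 ms timeout is an assumed value chosen only for the example. The 59,311 ms silence is taken from the quoted ZooKeeper log line.

```python
from dataclasses import dataclass, field


class YouAreDeadError(Exception):
    """Toy stand-in for org.apache.hadoop.hbase.YouAreDeadException."""


@dataclass
class ToyMaster:
    """Hypothetical, heavily simplified model of the Master's liveness tracking."""

    session_timeout_ms: int
    last_seen_ms: dict = field(default_factory=dict)
    dead_servers: set = field(default_factory=set)

    def expire_silent_servers(self, now_ms: int) -> None:
        # Any server silent for longer than the timeout is marked dead.
        for server, seen_ms in self.last_seen_ms.items():
            if now_ms - seen_ms > self.session_timeout_ms:
                self.dead_servers.add(server)

    def report(self, server: str, now_ms: int) -> None:
        # A report from a server already marked dead is rejected.
        if server in self.dead_servers:
            raise YouAreDeadError(
                f"Server REPORT rejected; currently processing {server} as dead server"
            )
        self.last_seen_ms[server] = now_ms


def region_server_aborts(silence_ms: int, session_timeout_ms: int) -> bool:
    """Return True if a server that goes silent for silence_ms ends up aborting."""
    master = ToyMaster(session_timeout_ms)
    server = "ihub-dn-b1,60020,1332955859363"  # server name from the quoted log
    master.report(server, now_ms=0)                   # last successful report
    master.expire_silent_servers(now_ms=silence_ms)   # Master checks liveness
    try:
        master.report(server, now_ms=silence_ms)      # RS finally responds
    except YouAreDeadError:
        return True                                   # RS aborts on rejection
    return False


if __name__ == "__main__":
    # 59,311 ms of silence is from the log; 40,000 ms is an assumed timeout.
    print(region_server_aborts(59_311, 40_000))
    print(region_server_aborts(10_000, 40_000))
```

In this model a pause longer than the timeout is enough to get an otherwise healthy server rejected, which matches the "overworked RS" diagnosis in the quoted reply. The real mechanism involves ZooKeeper session expiry and differs by HBase version.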