anil gupta
2012-03-30, 23:05
Kevin O'dell
2012-03-31, 01:05
anil gupta
2012-03-31, 01:24
Kevin O'dell
2012-04-03, 14:34
anil gupta
2012-04-03, 16:12
anil gupta
2012-08-07, 17:59
Kevin O'dell
2012-08-13, 13:51
Michael Segel
2012-08-13, 13:58
anil gupta
2012-08-13, 19:11
Michael Segel
2012-08-13, 19:39
anil gupta
2012-08-13, 20:14
anil gupta
2012-08-13, 20:24
Michael Segel
2012-08-14, 00:17
anil gupta
2012-08-14, 01:05
Michael Segel
2012-08-14, 01:59
Stack
2012-08-15, 21:52
anil gupta
2012-08-15, 22:13
-
Bulk loading job failed when one region server went down in the cluster
anil gupta 2012-03-30, 23:05
Hi All,
I am using cdh3u2 and I have 7 worker nodes (VMs spread across two machines), each running a DataNode, TaskTracker, and RegionServer (1200 MB heap size). I was loading data into HBase using the bulk loader with a custom mapper. I was loading around 34 million records, and I have loaded the same set of data in the same environment many times before without any problem. This time, while loading the data, one of the region servers failed (the DN and TT kept running on that node), and then, after numerous map-task failures, the loading job failed.

Is there any setting/configuration which can make bulk loading fault tolerant to the failure of region servers?

--
Thanks & Regards,
Anil Gupta
-
Re: Bulk loading job failed when one region server went down in the cluster
Kevin O'dell 2012-03-31, 01:05
Anil,
Can you please attach the RS logs from the failure?

--
Kevin O'Dell
Customer Operations Engineer, Cloudera
-
Re: Bulk loading job failed when one region server went down in the cluster
anil gupta 2012-03-31, 01:24
Hi Kevin,
Here is a Dropbox link to the log file of the region server that failed:
http://dl.dropbox.com/u/64149128/hbase-hbase-regionserver-ihub-dn-b1.out

IMHO, the problem starts at line #3009, which says:

12/03/30 15:38:32 FATAL regionserver.HRegionServer: ABORTING region server serverName=ihub-dn-b1,60020,1332955859363, load=(requests=0, regions=44, usedHeap=446, maxHeap=1197): Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ihub-dn-b1,60020,1332955859363 as dead server

I have already tested the fault tolerance of HBase by manually bringing down an RS while querying a table, and it worked fine. I was expecting the same today while loading the data (even though this time the RS went down by itself), but it did not work out well.

Thanks for your time. Let me know if you need more details.

~Anil
--
Thanks & Regards,
Anil Gupta
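For anyone who wants to locate such abort lines without paging through the whole region server log, here is a minimal Python sketch. The regular expression is modeled only on the single FATAL line quoted above, so it is an assumption that other lines and other HBase versions follow the same layout; `find_aborts` and `SAMPLE` are hypothetical names used for illustration.

```python
import re

# Pattern modeled on the one FATAL line quoted in this thread.
# Assumption: other HBase versions may format this message differently.
ABORT_RE = re.compile(
    r"^(?P<ts>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) FATAL "
    r"regionserver\.HRegionServer: ABORTING region server "
    r"serverName=(?P<server>[^ ]+), "
    r"load=\(requests=(?P<requests>\d+), regions=(?P<regions>\d+), "
    r"usedHeap=(?P<used>\d+), maxHeap=(?P<max>\d+)\): "
    r"(?P<reason>.*)$"
)


def find_aborts(lines):
    """Return one dict per ABORTING line found in an iterable of log lines."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        match = ABORT_RE.match(line.strip())
        if match:
            hit = match.groupdict()
            hit["lineno"] = lineno  # 1-based position in the input
            hits.append(hit)
    return hits


# The log line from the message above, used as sample input.
SAMPLE = [
    "12/03/30 15:38:32 FATAL regionserver.HRegionServer: ABORTING region "
    "server serverName=ihub-dn-b1,60020,1332955859363, load=(requests=0, "
    "regions=44, usedHeap=446, maxHeap=1197): Unhandled exception: "
    "org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; "
    "currently processing ihub-dn-b1,60020,1332955859363 as dead server",
]

if __name__ == "__main__":
    for hit in find_aborts(SAMPLE):
        # Heap figures are assumed to be in MB (they match the 1200 MB heap).
        print(hit["ts"], hit["server"], f"heap {hit['used']}/{hit['max']}")
```

To run it against a real log, pass an open file object instead of `SAMPLE`, for example `find_aborts(open(path))`, where `path` is wherever the log was saved.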
-
Re: Bulk loading job failed when one region server went down in the cluster
Kevin O'dell 2012-04-03, 14:34
Anil,
I am sorry for the delayed response. Reviewing the logs, it appears:

12/03/30 15:38:31 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 59311ms for sessionid 0x136557f99c90065, closing socket connection and attempting reconnect

12/03/30 15:38:32 FATAL regionserver.HRegionServer: ABORTING region server serverName=ihub-dn-b1,60020,1332955859363, load=(requests=0, regions=44, usedHeap=446, maxHeap=1197): Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ihub-dn-b1,60020,1332955859363 as dead server

It appears to be a classic overworked RS. You were asking too much of the RS and it did not respond in time, so the Master marked it as dead. When the RS finally responded, the Master said "no, you are already dead" and the server was aborted. This is why you see the YouAreDeadException. This is probably due to the shared resources of the VM infrastructure you are running on. You will either need to devote more resources or add more nodes (most likely physical) to the cluster if you would like to keep running these jobs.

--
Kevin O'Dell
Customer Operations Engineer, Cloudera
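The sequence described above (server goes silent, gets marked dead, then has its late report rejected) can be illustrated with a toy Python model. This is only a sketch of the idea, not HBase's implementation: `ToyMaster` and `YouAreDead` are hypothetical stand-ins, and the 40-second timeout is an assumed value, since the cluster's real session timeout is not stated in this thread. Only the 59311 ms silence comes from the log.

```python
class YouAreDead(Exception):
    """Stand-in for the exception named in the log; not the HBase class."""


class ToyMaster:
    """Toy model: expire servers whose last report is older than a timeout."""

    def __init__(self, timeout_ms):
        self.timeout_ms = timeout_ms
        self.last_seen = {}  # server name -> time of last accepted report
        self.dead = set()    # servers already declared dead

    def expire(self, now_ms):
        """Declare dead every live server that has been silent too long."""
        for server, seen in self.last_seen.items():
            if server not in self.dead and now_ms - seen > self.timeout_ms:
                self.dead.add(server)

    def report(self, server, now_ms):
        """Accept a report, or reject it if the server was already expired."""
        self.expire(now_ms)
        if server in self.dead:
            raise YouAreDead(f"{server} is being processed as a dead server")
        self.last_seen[server] = now_ms


if __name__ == "__main__":
    master = ToyMaster(timeout_ms=40_000)  # assumed timeout, not from the log
    master.report("ihub-dn-b1", 0)         # healthy report
    try:
        # The server stalls for 59311 ms (the gap shown in the log line).
        master.report("ihub-dn-b1", 59_311)
    except YouAreDead as exc:
        print("report rejected:", exc)
```

The point of the model is that the rejection is a consequence of the earlier expiry: once the silence exceeds the timeout, a late report cannot bring the server back, and it has to restart and register again.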
-
Re: Bulk loading job failed when one region server went down in the cluster
anil gupta 2012-04-03, 16:12
Hi Kevin,
I am not really concerned about the RegionServer going down, as the same thing can happen when deployed in production. In production we won't have a VM environment, and I am aware that my current dev environment is not suited to heavy processing. What I am concerned about is the failure of the bulk loading job when the RegionServer failed. Does this mean that the bulk loading job is not fault tolerant to the failure of a RegionServer? I was expecting the job to succeed even though the RegionServer failed, because there were 6 more RSs running in the cluster. Fault tolerance is one of the biggest selling points of the Hadoop platform. Let me know your views.

Thanks for your time.

Thanks,
Anil Gupta
-
Re: Bulk loading job failed when one region server went down in the cluster
anil gupta 2012-08-07, 17:59
Hi HBase Folks,
I ran the bulk loader last night to load data into a table. During the bulk loading job, one of the region servers crashed and the entire job failed. The job takes around 2.5 hours to finish, and it failed when it was around 50% complete. After the failure, the table was also corrupted in HBase. My cluster has 8 region servers.

Is bulk loading not fault tolerant to the failure of region servers?

I am reusing this old email chain because my question went unanswered at the time. Please share your views.

Thanks,
Anil Gupta
--
Thanks & Regards,
Anil Gupta
-
Re: Bulk loading job failed when one region server went down in the cluster
Kevin O'dell 2012-08-13, 13:51
Anil,
Do you have a root cause for the RS failure? I have never heard of one RS failure causing a whole job to fail.

--
Kevin O'Dell
Customer Operations Engineer, Cloudera
-
Re: Bulk loading job failed when one region server went down in the cluster
Michael Segel 2012-08-13, 13:58
Yes, it can.
You can see one RS failure causing a cascading RS failure. Of course YMMV, and it depends on which version you are running.

OP is on CDH3u2, which still had some issues. CDH3u4 is the latest and he should upgrade.

(Or go to CDH4...)

HTH

-Mike
-
Re: Bulk loading job failed when one region server went down in the cluster
anil gupta 2012-08-13, 19:11
Hi Guys,
Sorry for not mentioning the version I am currently running. My current version is HBase 0.92.1 (cdh4), running on Hadoop 2.0.0-alpha with YARN for MR. My original post was for HBase 0.92. Here are some more details of my current setup:

I am running an 8-slave, 4-admin-node cluster on CentOS 6.0 VMs installed on VMware Hypervisor 5.0. Each VM has 3.2 GB of memory and 500 HDFS space. I use this cluster for POCs (proofs of concept). I am not looking for any performance benchmarking from this setup. Due to some major bugs in YARN, I am unable to make it work properly with less than 4 GB of memory. I am already discussing them on the Hadoop mailing list.

Here is the log of the failed mapper: http://pastebin.com/f83xE2wv

The problem is that when I start a bulk loading job in YARN, 8 map processes start on each slave, and all of my slaves get hammered badly as a result. Because of that, the RegionServer gets a lease expiry or a YouAreDeadException. Here is the log of the RS which caused the job to fail: http://pastebin.com/9ZQx0DtD

I am aware that this is happening due to underperforming hardware (two slaves share one 7200 rpm hard drive in my setup) and some major bugs in running YARN with less than 4 GB of memory. My only concern is the failure of the entire MR job and its fault tolerance to RS failures. I am not really concerned about the RS failure itself, since HBase is fault tolerant.

Please let me know if you need anything else.

Thanks,
Anil
--
Thanks & Regards,
Anil Gupta
-
Re: Bulk loading job failed when one region server went down in the cluster
Michael Segel 2012-08-13, 19:39
Anil,
Do you know what happens when an airplane carrying too heavy a cargo tries to take off? You run out of runway and you crash and burn.

Looking at your post, why are you starting 8 map processes on each slave? That's tunable, and you clearly do not have enough memory in each VM to support 8 slots on a node. Here you swap, and when you swap you cause HBase to crash and burn.

3.2 GB of memory means no more than 1 slot per slave, and even then you're going to be very tight. Not to mention that you will need to loosen up your timings, since it's all virtual and you have way too much I/O per drive going on.

My suggestion is that you go back and tune your system before thinking about running anything.

HTH

-Mike
-
Re: Bulk loading job failed when one region server went down in the cluster
anil gupta 2012-08-13, 20:14
Hi Mike,
I tried doing that by setting properties in mapred-site.xml, but YARN doesn't seem to work with the "mapreduce.tasktracker.map.tasks.maximum" property. Here is a reference to a discussion of the same problem: https://groups.google.com/a/cloudera.org/forum/?fromgroups#!searchin/cdh-user/yarn$20anil/cdh-user/J564g9A8tPE/ZpslzOkIGZYJ[1-25]
I have also posted about the same problem on the Hadoop mailing list.

I already admitted in my previous email that YARN has major issues when we want to control it in a low-memory environment. I was just trying to get the views of HBase experts on bulk load failures, since we will be relying heavily on fault tolerance. If the HBase Bulk Loader is fault tolerant to RS failures in a viable environment, then I don't have any issue. I hope this clears up my purpose in posting on this topic.

Thanks,
Anil

Thanks & Regards,
Anil Gupta
-
Re: Bulk loading job failed when one region server went down in the cluster
anil gupta 2012-08-13, 20:24
Hi Mike,
Here is the link to my email on the Hadoop list regarding the YARN problem: http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201208.mbox/%3CCAF1+Vs8oF4VsHbg14B7SGzBB_8Ty7GC9Lw3nm1bM0v+[EMAIL PROTECTED]%3E

Somehow the link to the Cloudera mailing list in my last email does not seem to work. Here is the new link: https://groups.google.com/a/cloudera.org/forum/?fromgroups#!searchin/cdh-user/yarn$20anil/cdh-user/J564g9A8tPE/ZpslzOkIGZYJ%5B1-25%5D

Thanks for your help,
Anil Gupta

Thanks & Regards,
Anil Gupta
-
Re: Bulk loading job failed when one region server went down in the cluster
Michael Segel 2012-08-14, 00:17
Not sure why you're having an issue in getting an answer.
Even if you're not a YARN expert, Google is your friend.

See: http://books.google.com/books?id=Wu_xeGdU4G8C&pg=PA323&lpg=PA323&dq=Hadoop+YARN+setting+number+of+slots&source=bl&ots=i7xQYwQf-u&sig=ceuDmiOkbqTqok_HfIr3udvm6C0&hl=en&sa=X&ei=8JYpUNeZJMnxygGzqIGwCw&ved=0CEQQ6AEwAQ#v=onepage&q=Hadoop%20YARN%20setting%20number%20of%20slots&f=false

This is a page from Tom White's "Hadoop: The Definitive Guide", 3rd Edition.

The bottom line...
-=-
The considerations for how much memory to dedicate to a node manager for running containers are similar to those discussed in “Memory” on page 307. Each Hadoop daemon uses 1,000 MB, so for a datanode and a node manager, the total is 2,000 MB. Set aside enough for other processes that are running on the machine, and the remainder can be dedicated to the node manager’s containers by setting the configuration property yarn.nodemanager.resource.memory-mb to the total allocation in MB. (The default is 8,192 MB.)
-=-
Taken per fair use. Page 323

As you can see, you need to drop this down to something like 1GB, if you even have enough memory for that. Again, set yarn.nodemanager.resource.memory-mb to a more realistic value.

8GB on a 3 GB node? Yeah, that would really hose you, especially if you're trying to run HBase too.

Even here... you really don't have enough memory to do it all. (Maybe enough to do a small test.)

Good luck.
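The arithmetic behind this advice can be sketched in a few lines of Python. The 1,000 MB per Hadoop daemon figure comes from the quoted passage and the 3.2 GB VM size from earlier in the thread. The 1,200 MB RegionServer heap is taken from the original March post and is only assumed to still apply, and the 300 MB OS reserve is a made-up number for illustration, so treat the result as a rough estimate rather than a recommendation.

```python
# Rough per-node memory budget for the node manager's containers.
# Figures marked "assumed" are illustrative and not confirmed in the thread.

VM_MEMORY_MB = 3200          # "3.2 GB" per VM, approximated as 3200 MB
HADOOP_DAEMON_MB = 1000      # per-daemon figure from the quoted book passage
REGIONSERVER_HEAP_MB = 1200  # heap size from the original March post (assumed unchanged)
OS_RESERVE_MB = 300          # assumed headroom for the OS and other processes


def container_budget_mb(vm_mb, daemons_mb, os_reserve_mb):
    """Memory left for containers after daemons and the OS reserve.

    A negative value means the node is oversubscribed before any
    container starts.
    """
    return vm_mb - sum(daemons_mb) - os_reserve_mb


# datanode, node manager, region server
daemons = [HADOOP_DAEMON_MB, HADOOP_DAEMON_MB, REGIONSERVER_HEAP_MB]
budget = container_budget_mb(VM_MEMORY_MB, daemons, OS_RESERVE_MB)
print(f"Left for containers: {budget} MB")
```

With these inputs the budget comes out negative, which is consistent with the remark that there is not really enough memory to do it all on a VM of this size.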
-
Re: Bulk loading job failed when one region server went down in the cluster
anil gupta 2012-08-14, 01:05
Hi Mike,
You hit the nail on the head: I need to lower the memory by setting yarn.nodemanager.resource.memory-mb. That brings up another major YARN bug. I already tried setting that property to 1500 MB in yarn-site.xml and setting yarn.app.mapreduce.am.resource.mb to 1000 MB in mapred-site.xml. With this change the YARN job does not run at all, even though the configuration is right. It's a bug and I have to file a JIRA for it. So my only remaining option was to let it run with the incorrect YARN conf, since my objective is to load data into HBase rather than to play with YARN. MapReduce is used only for bulk loading in my cluster.

Here is a link to the mailing list email about running YARN with less memory: http://permalink.gmane.org/gmane.comp.jakarta.lucene.hadoop.user/33164

It would be great if you could answer this simple question of mine: Is HBase Bulk Loading fault tolerant to Region Server failures in a viable/decent environment?

Thanks,
Anil Gupta

Thanks & Regards,
Anil Gupta
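As a sanity check on the numbers in this message, here is a small Python sketch of how much memory remains for task containers once the application master has its allocation. The 1,500 MB and 1,000 MB values are the ones quoted in the message. The 1,024 MB map-task request is a hypothetical value chosen for illustration, and the real scheduler may round requests differently, so this does not establish why the job failed to run.

```python
# Values from the message.
NODEMANAGER_MEMORY_MB = 1500   # yarn.nodemanager.resource.memory-mb
AM_MEMORY_MB = 1000            # yarn.app.mapreduce.am.resource.mb

# Hypothetical per-map-task container request; not stated in the thread.
MAP_CONTAINER_MB = 1024


def task_containers_that_fit(node_mb, task_mb, am_mb=0):
    """Number of task containers that fit on a node, optionally alongside the AM."""
    free_mb = node_mb - am_mb
    return max(free_mb // task_mb, 0)


on_am_node = task_containers_that_fit(NODEMANAGER_MEMORY_MB, MAP_CONTAINER_MB, AM_MEMORY_MB)
on_other_node = task_containers_that_fit(NODEMANAGER_MEMORY_MB, MAP_CONTAINER_MB)
print(f"Map containers on the AM's node: {on_am_node}")
print(f"Map containers on any other node: {on_other_node}")
```

Under these assumed sizes, very few task containers fit per node, so it is worth checking the per-task memory requests alongside the node manager limit when lowering it.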
-
Re: Bulk loading job failed when one region server went down in the cluster
Michael Segel 2012-08-14, 01:59
Anil,
I don't know if you can call it a bug if you don't have enough memory available.

I mean, if you don't use HBase, then you may have more leeway in terms of swap.

You can also do more tuning of HBase to handle the additional latency found in a virtual environment.

Why don't you rebuild your VMs to be slightly larger in terms of memory?
-
Re: Bulk loading job failed when one region server went down in the cluster
Stack 2012-08-15, 21:52
On Mon, Aug 13, 2012 at 6:05 PM, anil gupta <[EMAIL PROTECTED]> wrote:
> It would be great if you can answer this simple question of mine: Is HBase
> Bulk Loading fault tolerant to Region Server failures in a viable/decent
> environment?

Bulk Loading is a MapReduce job. Bulk Loading is as 'fault tolerant' as MapReduce is (MapReduce jobs have long timeouts -- ten minutes IIRC -- and tasks are retried up to a maximum, 4 by default, but once all timeouts and retries have expired, the job will fail).

You have RSs failing, maybe because you have too many slots allocated to MapReduce for the hardware you are using for the PoC (as Michael Segel suggests). Maybe the MR task is not finding the regions' new locations in time, or maybe the regions are not coming back online in time for the MR job to complete?

The logs you provide for the MR task show us failing to go against an RS that has died but doesn't know it yet (the YouAreDeadException). Try looking at the subsequent map tasks that fail. Why are they failing? For the same reason? Look in the master log to see what's happening around log splitting of the failed server. Is it hung up, preventing the regions from being assigned to new locations?

St.Ack
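The retry behaviour described in this reply can be sketched as a small Python simulation. This is not Hadoop code: `run_task`, `run_job`, and the attempt functions are hypothetical names, and the limit of 4 attempts simply mirrors the default mentioned in the reply. It shows that a task which keeps failing (for example because its regions never come back online) eventually fails the whole job, while a task that recovers within its attempts does not.

```python
MAX_ATTEMPTS = 4  # per-task attempt limit, mirroring the default mentioned in the reply


def run_task(attempt_fn, max_attempts=MAX_ATTEMPTS):
    """Call attempt_fn(attempt_number) until it succeeds or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            attempt_fn(attempt)
            return True
        except RuntimeError:
            continue  # the framework would reschedule the task attempt
    return False


def run_job(tasks, max_attempts=MAX_ATTEMPTS):
    """A job succeeds only if every task succeeds within its attempts."""
    return all(run_task(task, max_attempts) for task in tasks)


def healthy(attempt):
    """A task whose region server is up: succeeds on the first attempt."""


def recovers_on_third(attempt):
    """Regions are reassigned in time: attempts 1 and 2 fail, attempt 3 succeeds."""
    if attempt < 3:
        raise RuntimeError("region not available yet")


def never_recovers(attempt):
    """Regions never come back online: every attempt fails."""
    raise RuntimeError("region server is dead")


print(run_job([healthy, recovers_on_third]))  # job completes
print(run_job([healthy, never_recovers]))     # job fails once attempts are exhausted
```

In this model, whether a region server failure kills the job depends on whether the affected regions become reachable again before a task uses up its attempts, which is why the master log around log splitting is the place to look.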
-
Re: Bulk loading job failed when one region server went down in the cluster
anil gupta 2012-08-15, 22:13
Hi Stack,
Thanks for answering my question. I admit that I am unable to run MR2 (YARN) jobs efficiently on my cluster, due to a major bug in YARN that is not letting me set the right configuration for MapReduce jobs. The RSs are dying with LeaseExpiredException or YouAreDeadException because the slaves are overloaded by the improper YARN conf. Once the MR job finishes, HBase performance is OK. I am not using this cluster for performance metrics because we won't be using virtualization in production.

My purpose in posting was to find out whether Bulk Loading is fault tolerant to RS failures. Your answer is sufficient to clear my doubts.

Thanks,
Anil

--
Thanks & Regards,
Anil Gupta
|