|
Darrell Taylor
2012-05-09, 16:52
Serge Blazhiyevskyy
2012-05-09, 16:56
Darrell Taylor
2012-05-09, 16:58
Serge Blazhiyevskyy
2012-05-09, 17:04
Darrell Taylor
2012-05-09, 19:23
Serge Blazhiyevskyy
2012-05-09, 21:00
Raj Vishwanathan
2012-05-09, 21:23
Darrell Taylor
2012-05-09, 21:27
Darrell Taylor
2012-05-09, 21:40
Serge Blazhiyevskyy
2012-05-09, 21:44
Raj Vishwanathan
2012-05-09, 21:52
Darrell Taylor
2012-05-10, 06:57
Todd Lipcon
2012-05-10, 08:33
Darrell Taylor
2012-05-10, 10:57
Raj Vishwanathan
2012-05-10, 16:58
Darrell Taylor
2012-05-11, 09:29
Todd Lipcon
2012-05-11, 09:32
Harsh J
2012-05-11, 10:36
|
-
High load on datanode startup
Darrell Taylor 2012-05-09, 16:52
Hi,

I wonder if someone could give me some pointers on a problem I'm having?

I have a 7-machine cluster set up for testing, and we have been pouring data into it for a week without issue. We have learnt several things along the way and solved all the problems up to now by searching online, but now I'm stuck. One of the data nodes decided to have a load of 70+ this morning. Stopping the datanode and tasktracker brought it back to normal, but every time I start the datanode again the load shoots through the roof, and all I get in the logs is:

STARTUP_MSG: Starting DataNode
STARTUP_MSG: host = pl464/10.20.16.64
STARTUP_MSG: args = []
STARTUP_MSG: version = 0.20.2-cdh3u3
STARTUP_MSG: build file:///data/1/tmp/nightly_2012-03-20_13-13-48_3/hadoop-0.20-0.20.2+923.197-1~squeeze -************************************************************/
2012-05-09 16:12:05,925 INFO org.apache.hadoop.security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
2012-05-09 16:12:06,139 INFO org.apache.hadoop.security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.

Nothing else.

The load seems to max out only 1 of the CPUs, but the machine becomes *very* unresponsive.

Anybody got any pointers on things I can try?

Thanks,
Darrell.
-
Re: High load on datanode startup
Serge Blazhiyevskyy 2012-05-09, 16:56
Take a look at the data distribution for that cluster. Maybe it is unbalanced. Run the balancer if it is.

Regards,
Serge
hadoopway.blogspot.com
-
Re: High load on datanode startup
Darrell Taylor 2012-05-09, 16:58
On Wed, May 9, 2012 at 5:56 PM, Serge Blazhiyevskyy <[EMAIL PROTECTED]> wrote:
> Take a look at the data distribution for that cluster. Maybe it is unbalanced.
> Run the balancer if it is.

The cluster is balanced; I ran the balancer yesterday. Oddly enough, the problem started after I had run the balancer.

I'm running CDH3, btw.
-
Re: High load on datanode startup
Serge Blazhiyevskyy 2012-05-09, 17:04
What does the response from fsck look like?

hadoop fsck /

It might be the case that some of the blocks are mis-replicated.

Serge
Hadoopway.blogspot.com
-
Re: High load on datanode startup
Darrell Taylor 2012-05-09, 19:23
On Wed, May 9, 2012 at 6:04 PM, Serge Blazhiyevskyy <[EMAIL PROTECTED]> wrote:
> What does the response from fsck look like?

[snip lots of stuff about under replicated blocks]

......Status: HEALTHY
 Total size: 246858876262 B (Total open files size: 372 B)
 Total dirs: 14914
 Total files: 39248 (Files currently being written: 4)
 Total blocks (validated): 40657 (avg. block size 6071743 B) (Total open file blocks (not validated): 4)
 Minimally replicated blocks: 40657 (100.0 %)
 Over-replicated blocks: 0 (0.0 %)
 Under-replicated blocks: 1410 (3.4680374 %)
 Mis-replicated blocks: 0 (0.0 %)
 Default replication factor: 3
 Average block replication: 2.9911454
 Corrupt blocks: 0
 Missing replicas: 2831 (2.3279145 %)
 Number of data-nodes: 5
 Number of racks: 1
FSCK ended at Wed May 09 19:19:11 UTC 2012 in 980 milliseconds

Further information to add to this: it appears to be affecting 2 nodes in the cluster, one more than the other. In the last couple of hours one of the nodes has also experienced high load; this has now dropped, but both of these nodes are now considered dead by the namenode. The load on the first box is still increasing, currently 234! I think I might have to reboot it via IPMI.
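For anyone following along, here is a minimal sketch of pulling the integer counters out of an fsck summary like the one above, so that successive runs can be compared. It is written in Python purely for illustration. The helper name `parse_fsck_summary` and its regular expression are assumptions of this sketch, not part of Hadoop; the sample text is a subset of the summary posted in this message.

```python
import re

# Subset of the fsck summary posted in this thread.
SAMPLE = """\
Status: HEALTHY
 Total blocks (validated): 40657 (avg. block size 6071743 B)
 Minimally replicated blocks: 40657 (100.0 %)
 Under-replicated blocks: 1410 (3.4680374 %)
 Corrupt blocks: 0
 Missing replicas: 2831 (2.3279145 %)
 Number of data-nodes: 5
"""

# Hypothetical helper pattern (not a Hadoop API): a label, a colon,
# then an integer. Lines such as "Status: HEALTHY" do not match.
_LINE = re.compile(r"^\s*([A-Za-z][A-Za-z\- ()]*?):\s+(\d+)\b")


def parse_fsck_summary(text):
    """Return {label: first integer after the colon} for each summary line."""
    counters = {}
    for line in text.splitlines():
        m = _LINE.match(line)
        if m:
            counters[m.group(1)] = int(m.group(2))
    return counters


if __name__ == "__main__":
    summary = parse_fsck_summary(SAMPLE)
    print(summary["Under-replicated blocks"], summary["Missing replicas"])
```

In practice the text would come from saving the output of `hadoop fsck /` to a file; only the first integer on each line is kept, so the percentages in parentheses are ignored.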
-
Re: High load on datanode startup
Serge Blazhiyevskyy 2012-05-09, 21:00
Looks like you have some under-replicated blocks. Does that number decrease if you run fsck multiple times?

Regards,
Serge
-
Re: High load on datanode startup
Raj Vishwanathan 2012-05-09, 21:23
When you say 'load', what do you mean? CPU load or something else?

Raj
-
Re: High load on datanode startup
Darrell Taylor 2012-05-09, 21:27
On Wed, May 9, 2012 at 10:00 PM, Serge Blazhiyevskyy <[EMAIL PROTECTED]> wrote:
> Looks like you have some under-replicated blocks. Does that number decrease if you run fsck multiple times?

Yes, since my last post it's now down to 353....

Status: HEALTHY
 Total size: 246983628437 B (Total open files size: 372 B)
 Total dirs: 15172
 Total files: 39637 (Files currently being written: 7)
 Total blocks (validated): 41046 (avg. block size 6017239 B) (Total open file blocks (not validated): 6)
 Minimally replicated blocks: 41046 (100.0 %)
 Over-replicated blocks: 0 (0.0 %)
 Under-replicated blocks: 353 (0.86001074 %)
 Mis-replicated blocks: 0 (0.0 %)
 Default replication factor: 3
 Average block replication: 3.016981
 Corrupt blocks: 0
 Missing replicas: 1774 (1.4325514 %)
 Number of data-nodes: 5
 Number of racks: 1
FSCK ended at Wed May 09 21:26:40 UTC 2012 in 904 milliseconds
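The two fsck runs posted so far (1410 under-replicated blocks at 19:19:11 UTC, 353 at 21:26:40 UTC) are enough for a rough estimate of when the count might reach zero. The sketch below is only back-of-the-envelope arithmetic in Python: the function name `eta_to_zero` is made up for this example, and the assumption that the decline stays linear is the sketch's own, not something the thread establishes.

```python
from datetime import datetime


def eta_to_zero(t1, n1, t2, n2):
    """Seconds after t2 until the count hits zero, assuming a linear decline.

    Returns None when the count is not decreasing or the timestamps
    are not in order, since no estimate is possible then.
    """
    elapsed = (t2 - t1).total_seconds()
    if elapsed <= 0 or n2 >= n1:
        return None
    rate = (n1 - n2) / elapsed  # blocks re-replicated per second
    return n2 / rate


if __name__ == "__main__":
    # Timestamps and counts taken from the two fsck summaries in this thread.
    first = datetime(2012, 5, 9, 19, 19, 11)
    second = datetime(2012, 5, 9, 21, 26, 40)
    seconds_left = eta_to_zero(first, 1410, second, 353)
    print("roughly %.0f more minutes at the observed rate" % (seconds_left / 60))
```

The estimate says nothing about why the datanode's load is high; it only gives an idea of how long "wait for the number to drop" might take if nothing else changes.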
-
Re: High load on datanode startup
Darrell Taylor 2012-05-09, 21:40
On Wed, May 9, 2012 at 10:23 PM, Raj Vishwanathan <[EMAIL PROTECTED]> wrote:
> When you say 'load', what do you mean? CPU load or something else?

I mean in the Unix sense of load average, i.e. top would show a load of (currently) 376.

Looking at the Ganglia stats for the box, it's not CPU load as such. The graphs show actual CPU usage at 30%, but the number of running processes is simply growing in a linear manner. Screenshot of the Ganglia page here:

https://picasaweb.google.com/lh/photo/Q0uFSzyLiriDuDnvyRUikXVR0iWwMibMfH0upnTwi28?feat=directlink
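A note on the numbers: on Linux the load average counts tasks that are runnable plus tasks in uninterruptible sleep (typically waiting on disk I/O), so a load in the hundreds alongside 30% CPU usage is possible. The sketch below parses the `/proc/loadavg` format in Python so the load figures and the runnable/total task counts can be logged side by side. The function name `parse_loadavg` and the sample values are hypothetical; the field layout follows proc(5).

```python
def parse_loadavg(text):
    """Parse one line in /proc/loadavg format.

    Fields: 1-, 5- and 15-minute load averages, then
    "<runnable entities>/<total entities>", then the most recent PID.
    """
    one, five, fifteen, entities, last_pid = text.split()
    runnable, total = entities.split("/")
    return {
        "load1": float(one),
        "load5": float(five),
        "load15": float(fifteen),
        "runnable": int(runnable),
        "total": int(total),
        "last_pid": int(last_pid),
    }


if __name__ == "__main__":
    # Hypothetical sample line; on a live Linux box one would read
    # open("/proc/loadavg").read() instead.
    sample = "376.12 350.40 290.07 3/912 28417"
    info = parse_loadavg(sample)
    print(info["load1"], info["runnable"], info["total"])
```

A high `load1` with a small `runnable` count would point towards tasks stuck waiting on I/O rather than competing for CPU, which is one thing worth ruling out here.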
-
Re: High load on datanode startup
Serge Blazhiyevskyy 2012-05-09, 21:44
I would wait for that number to go down to 0. That could be a reason for your CPU utilization.

Regards,
Serge
-
Re: High load on datanode startup
Raj Vishwanathan 2012-05-09, 21:52
The picture is either too small or too pixelated for my eyes :-)

Can you log in to the box and send the output of top? If the system is unresponsive, it has to be something more than an unbalanced HDFS cluster, methinks.

Raj
-
Re: High load on datanode startup
Darrell Taylor 2012-05-10, 06:57
On Wed, May 9, 2012 at 10:52 PM, Raj Vishwanathan <[EMAIL PROTECTED]> wrote:
> The picture is either too small or too pixelated for my eyes :-)

There should be a zoom option in the top right of the page that allows you to view it full size.

> Can you log in to the box and send the output of top? If the system is unresponsive, it has to be something more than an unbalanced HDFS cluster, methinks.

Sorry, I'm unable to log in to the box; it's completely unresponsive.
-
Re: High load on datanode startupTodd Lipcon 2012-05-10, 08:33
That's real weird..
If you can reproduce this after a reboot, I'd recommend letting the DN run for a minute, and then capturing a "jstack <pid of dn>" as well as the output of "top -H -p <pid of dn> -b -n 5" and send it to the list.

What JVM/JDK are you using? What OS version?

-Todd

--
Todd Lipcon
Software Engineer, Cloudera
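For anyone wanting to script the capture Todd describes, here is a minimal sketch. The function name, the output directory, and the optional iteration argument are my own placeholders, not anything from Hadoop; the jstack and top invocations are the ones quoted above.

```shell
#!/bin/sh
# Sketch only: capture a thread dump and per-thread CPU usage for a DataNode.
# capture_dn_diagnostics and its arguments are hypothetical helper names.
#   $1 = pid of the DataNode JVM
#   $2 = directory to write into (default: /tmp/dn-diag)
#   $3 = number of top iterations (default: 5, as suggested in the thread)
capture_dn_diagnostics() {
    dn_pid="$1"
    out_dir="${2:-/tmp/dn-diag}"
    iterations="${3:-5}"

    mkdir -p "$out_dir" || return 1

    # Thread dump of the JVM; errors (e.g. jstack missing) land in the file too.
    jstack "$dn_pid" > "$out_dir/jstack.txt" 2>&1

    # Per-thread view (-H) of that one process (-p), in batch mode (-b).
    top -H -p "$dn_pid" -b -n "$iterations" > "$out_dir/top-threads.txt" 2>&1

    # Print where the files went so the caller can attach them.
    echo "$out_dir"
}
```

Usage would be something like `capture_dn_diagnostics <pid of dn> /tmp/dn-diag`, run about a minute after starting the DataNode.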
-
Re: High load on datanode startupDarrell Taylor 2012-05-10, 10:57
On Thu, May 10, 2012 at 9:33 AM, Todd Lipcon <[EMAIL PROTECTED]> wrote:
> That's real weird..
>
> If you can reproduce this after a reboot, I'd recommend letting the DN
> run for a minute, and then capturing a "jstack <pid of dn>" as well as
> the output of "top -H -p <pid of dn> -b -n 5" and send it to the list.

What I did after the reboot this morning was to move my dn, nn, and mapred directories out of the way, create new ones, format them, and restart the node. It's now happy.

I'll try moving the directories back later and do the jstack as you suggest.

> What JVM/JDK are you using? What OS version?

root@pl446:/# dpkg --get-selections | grep java
java-common install
libjaxp1.3-java install
libjaxp1.3-java-gcj install
libmysql-java install
libxerces2-java install
libxerces2-java-gcj install
sun-java6-bin install
sun-java6-javadb install
sun-java6-jdk install
sun-java6-jre install

root@pl446:/# java -version
java version "1.6.0_26"
Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode)

root@pl446:/# cat /etc/issue
Debian GNU/Linux 6.0 \n \l
-
Re: High load on datanode startupRaj Vishwanathan 2012-05-10, 16:58
Darrell
Are the new dn, nn and mapred directories on the same physical disk? Nothing on NFS, correct?

Could you be having some hardware issue? Any clue in /var/log/messages or dmesg?

A non-responsive system indicates a CPU that is really busy either doing something or waiting for something, and the fact that it happens only on some nodes indicates a local problem.

Raj
-
Re: High load on datanode startupDarrell Taylor 2012-05-11, 09:29
On Thu, May 10, 2012 at 5:58 PM, Raj Vishwanathan <[EMAIL PROTECTED]> wrote:
> Darrell
>
> Are the new dn,nn and mapred directories on the same physical disk?
> Nothing on NFS , correct?

Yes, that's correct.

> Could you be having some hardware issue? Any clue in /var/log/messages or
> dmesg?

Hardware is good, all logs are clean.

> A non responsive system indicates a CPU that is really busy either doing
> something or waiting for something and the fact that it happens only on
> some nodes indicates a local problem.

Yes, it was a very strange problem, which I seem to have solved (for now).

So, yesterday I upgraded the cluster to cdh4, and I found some of the nodes started to display similar behaviour, but I was able to catch them early enough to do something about it. The solution was to remove the hadoop-env.sh that I had copied over from the cdh3 install. The only thing I had added to this file was the following, which I did to get pig/hbase talking:

export HADOOP_CLASSPATH="`/usr/bin/hbase classpath`:$HADOOP_CLASSPATH"

What I saw on the machine was thousands of recursive processes in ps of the form 'bash /usr/bin/hbase classpath...'. Stopping everything didn't clean the processes up, so I had to kill them manually with some grep/xargs foo. Once this was all cleaned up and the hadoop-env.sh file removed, the nodes seem to be happy again.

Darrell.
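The exact grep/xargs commands used for the cleanup aren't given in the thread. A rough equivalent, assuming procps `pgrep` and GNU `xargs` (for `-r`) are available, might look like the sketch below; `kill_matching` is a hypothetical helper name, and `pgrep -f` stands in for the ps/grep pipeline.

```shell
#!/bin/sh
# Sketch only: terminate every process whose full command line matches a
# pattern. This is a stand-in for the "grep/xargs foo" mentioned above,
# not the commands that were actually run.
kill_matching() {
    pattern="$1"
    # pgrep -f matches against the whole command line and prints one PID
    # per line; xargs -r skips running kill when there are no matches.
    pgrep -f "$pattern" | xargs -r kill
}
```

For the situation described above the call would be along the lines of `kill_matching 'hbase classpath'`. Running `pgrep -f 'hbase classpath'` on its own first is a sensible way to check what would be killed.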
-
Re: High load on datanode startupTodd Lipcon 2012-05-11, 09:32
On Fri, May 11, 2012 at 2:29 AM, Darrell Taylor
<[EMAIL PROTECTED]> wrote:
>
> What I saw on the machine was thousands of recursive processes in ps of the
> form 'bash /usr/bin/hbase classpath...', Stopping everything didn't clean
> the processes up so had to kill them manually with some grep/xargs foo.
> Once this was all cleaned up and the hadoop-env.sh file removed the nodes
> seem to be happy again.

Ah -- maybe the issue is that... my guess is that "hbase classpath" is now trying to include the Hadoop dependencies using "hadoop classpath". But "hadoop classpath" was recursing right back because of that setting in hadoop-env. Basically you made a fork bomb - that explains the shape of the graph in Ganglia perfectly.

-Todd

Todd Lipcon
Software Engineer, Cloudera
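If the cycle is what Todd guesses (each script asking the other for its classpath), one way to break it is a guard variable that is exported before the nested call. The sketch below uses made-up stand-in functions, not the real hadoop or hbase scripts, and the variable name HBASE_CP_RESOLVING is an assumption of mine.

```shell
#!/bin/sh
# Sketch only: fake_hbase_classpath and load_hadoop_env are hypothetical
# stand-ins that mimic the suspected mutual recursion.

# Stand-in for "hbase classpath": it consults the Hadoop side, which in
# turn loads the Hadoop environment settings.
fake_hbase_classpath() {
    load_hadoop_env
    echo "/fake/hbase.jar"
}

# Stand-in for hadoop-env.sh: it wants the HBase classpath prepended.
load_hadoop_env() {
    # Guard: if we are already inside a resolution, do not recurse again.
    if [ -n "$HBASE_CP_RESOLVING" ]; then
        return 0
    fi
    HBASE_CP_RESOLVING=1
    export HBASE_CP_RESOLVING   # exported so child processes see it too

    # Without the guard above, this call would re-enter load_hadoop_env
    # in a new subshell every time, which is the fork-bomb shape.
    HADOOP_CLASSPATH="$(fake_hbase_classpath):$HADOOP_CLASSPATH"
}
```

In a real hadoop-env.sh the same idea would wrap the `export HADOOP_CLASSPATH=...` line in the `if` test, though whether that fits a given CDH install is something to verify locally.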
-
Re: High load on datanode startupHarsh J 2012-05-11, 10:36
Doesn't look like the $HBASE_HOME/bin/hbase script runs
"$HADOOP_HOME/bin/hadoop classpath" directly. Its classpath builder seems to add $HADOOP_HOME items manually via listing/etc..

Perhaps if hbase-env.sh has a HBASE_CLASSPATH that imports `hadoop classpath`, and the hadoop-env.sh has a `hbase classpath` this issue could happen.

I do know that `hbase classpath` may take very long and/or hang over network calls if there's a target/build directory inside of $HBASE_HOME, which causes it to use maven to generate a classpath instead of using a cached file/local gen. Generally doing mvn clean solves that up for me, whenever it happens over my installs.

Harsh J
|