Re: Problem while submitting jobs to NM started with ephemeral ports.
Prashant Sharma 2011-10-17, 09:05
Also, I tried commenting out the last two properties in yarn-site mentioned in my earlier mail, and keeping the following property in mapred-site:

<property>
  <name>mapreduce.shuffle.port</name>
  <value>0</value>
</property>

I got this exception while running a wordcount:

mapreduce.Job (Job.java:printTaskEvents(1315)) - Task Id : attempt_1318840789401_0005_r_000000_0, Status : FAILED
org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#5
    at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:126)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:365)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:147)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1152)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:142)
Caused by: java.io.IOException: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
    at org.apache.hadoop.mapreduce.task.reduce.ShuffleScheduler.checkReducerHealth(ShuffleScheduler.java:253)
    at org.apache.hadoop.mapreduce.task.reduce.ShuffleScheduler.copyFailed(ShuffleScheduler.java:187)
    at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:227)
    at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:149)

Everything works out of the box otherwise.

Thanks,
Prashant.

On Mon, Oct 17, 2011 at 2:03 PM, Prashant Sharma <[EMAIL PROTECTED]> wrote:
> I am using the following properties in yarn-site:
>
> <property>
>   <name>yarn.nodemanager.aux-services</name>
>   <value>mapreduce.shuffle</value>
> </property>
> <property>
>   <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
>   <value>org.apache.hadoop.mapred.ShuffleHandler</value>
> </property>
> <property>
>   <name>yarn.nodemanager.address</name>
>   <value>localhost:0</value>
> </property>
> <property>
>   <name>yarn.nodemanager.localizer.address</name>
>   <value>localhost:0</value>
> </property>
>
> Everything runs fine (meaning all daemons start perfectly). But
> when you try to submit a job, the job is stuck and the NM log says it is
> trying to connect to 'localhost:0'. Localization takes forever. Why?
>
> Please see the NM logs below.
>
> http://pastebin.com/QfQDZeqF
>
> Thanks,
> Prashant
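
For reference, a minimal sketch of how port 0 behaves at the socket level, independent of Hadoop. Binding to port 0 asks the OS to pick a free ephemeral port, and the real port is only known after the bind, by asking the socket. A client that is handed the literal configured value "localhost:0" has nothing to connect to. The class and method names here (EphemeralPortDemo, bindEphemeral) are made up for illustration, and whether this is what actually happens inside the NM is only an assumption on my part.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

public class EphemeralPortDemo {

    // Hypothetical helper: binds a server socket to port 0 on loopback
    // and returns the port number the OS actually assigned.
    static int bindEphemeral() {
        try (ServerSocket server = new ServerSocket()) {
            // Port 0 means "let the OS choose a free port".
            server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            // The assigned port is only available from the bound socket.
            return server.getLocalPort();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int configuredPort = 0;            // what the config file says
        int boundPort = bindEphemeral();   // what the OS actually handed out
        System.out.println("configured port: " + configuredPort);
        System.out.println("bound port:      " + boundPort);
    }
}
```

The point is only that the configured value and the bound value differ, so anything advertised to other processes has to come from the bound socket and not from the configuration.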