Re: Job end notification does not always work (Hadoop 2.x)
If the AM fails before sending the job end notification, at any stage of
execution and for whatever reason, the notification will never be
delivered. There is no way to fix this unless the notification is sent by a
YARN service. The two candidate services would be the RM and the HS. The
job notification URL is in the job conf. The RM never sees the job conf,
which rules the RM out unless we add, at AM registration time, the
possibility of specifying a callback URL. The HS has access to the job
conf, but the HS is currently a 'passive' service.
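
For illustration, here is a minimal sketch (not the MRAppMaster code) of the
best-effort behaviour Prashant asks about further down the thread: if init or
start throws, try to send a FAILED callback, but only on the last AM attempt,
since an earlier attempt may still be retried. Everything in it is
hypothetical: the class, the Lifecycle and Notifier interfaces, the run and
expand helpers, and the lastAttempt flag. The $jobId and $jobStatus
placeholders are the ones Hadoop documents for
mapreduce.job.end-notification.url. It still cannot help when the process dies
outright (for example on an OOM), which is why a YARN service would be needed
for a real guarantee.

```java
final class InitFailureNotificationSketch {

    /** Hypothetical stand-in for whatever performs the HTTP callback. */
    interface Notifier {
        void send(String url);
    }

    /** Hypothetical stand-in for the AM's init-then-start sequence. */
    interface Lifecycle {
        void initAndStart() throws Exception;
    }

    /** Fills in the $jobId and $jobStatus placeholders of the callback URL. */
    static String expand(String urlTemplate, String jobId, String status) {
        return urlTemplate.replace("$jobId", jobId).replace("$jobStatus", status);
    }

    /**
     * Runs init and start. If they throw, sends a best-effort FAILED callback,
     * but only when this is the last AM attempt and a URL is configured.
     * Returns true if the AM came up, false otherwise.
     */
    static boolean run(Lifecycle am, Notifier notifier, String urlTemplate,
                       String jobId, boolean lastAttempt) {
        try {
            am.initAndStart();
            return true;
        } catch (Exception e) {
            if (lastAttempt && urlTemplate != null) {
                try {
                    notifier.send(expand(urlTemplate, jobId, "FAILED"));
                } catch (RuntimeException ignored) {
                    // Best effort: a failed callback must not mask the original error.
                }
            }
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumption: a notifier that only prints, so the sketch needs no network.
        Notifier printer = url -> System.out.println("would call: " + url);
        Lifecycle failingInit = () -> {
            throw new IllegalStateException("init failed");
        };
        run(failingInit, printer,
            "http://example.invalid/notify?id=$jobId&status=$jobStatus",
            "job_0001", true);
    }
}
```

The lastAttempt flag is where Arun's point comes in: the AM itself can only
guess whether it is the final attempt, whereas the RM knows.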

thx

On Sat, Jun 22, 2013 at 3:48 PM, Arun C Murthy <[EMAIL PROTECTED]> wrote:

> Prashant,
>
>  Please file a jira.
>
>  One thing to be aware of - AMs get restarted a certain number of times
> for fault-tolerance - which means we can't just assume that failure of a
> single AM is equivalent to failure of the job.
>
>  Only the ResourceManager is in the appropriate position to judge failure
> of AM v/s failure-of-job.
>
> hth,
> Arun
>
> On Jun 22, 2013, at 2:44 PM, Prashant Kommireddi <[EMAIL PROTECTED]>
> wrote:
>
> Thanks Ravi.
>
> Well, in this case it's a no-effort :) Shouldn't a failure of AM init be
> considered a failure of the job? I looked at the code, and best-effort
> makes sense with respect to retry logic etc. You make a good point that
> there would be no notification in case the AM OOMs, but I do feel an AM
> init failure should send a notification by other means.
>
>
>
> On Sat, Jun 22, 2013 at 2:38 PM, Ravi Prakash <[EMAIL PROTECTED]> wrote:
>
>> Hi Prashant,
>>
>> I would tend to agree with you. Although job-end notification is only a
>> "best-effort" mechanism (i.e. we cannot always guarantee notification, for
>> example when the AM OOMs), I agree with you that we can do more. If you
>> feel strongly about this, please create a JIRA and possibly upload a patch.
>>
>> Thanks
>> Ravi
>>
>>
>>   ------------------------------
>>  *From:* Prashant Kommireddi <[EMAIL PROTECTED]>
>> *To:* "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
>> *Sent:* Thursday, June 20, 2013 9:45 PM
>> *Subject:* Job end notification does not always work (Hadoop 2.x)
>>
>> Hello,
>>
>> I came across an issue with the job notification callbacks in MR2. They
>> work fine if the ApplicationMaster has started, but no callback is sent
>> if AM initialization fails.
>>
>> Here is the code from MRAppMaster.java
>>
>> .....
>> .......
>>
>>       // set job classloader if configured
>>       MRApps.setJobClassLoader(conf);
>>       initAndStartAppMaster(appMaster, conf, jobUserName);
>>     } catch (Throwable t) {
>>       LOG.fatal("Error starting MRAppMaster", t);
>>       System.exit(1);
>>     }
>>   }
>>
>>   protected static void initAndStartAppMaster(final MRAppMaster appMaster,
>>       final YarnConfiguration conf, String jobUserName) throws IOException,
>>       InterruptedException {
>>     UserGroupInformation.setConfiguration(conf);
>>     UserGroupInformation appMasterUgi = UserGroupInformation
>>         .createRemoteUser(jobUserName);
>>     appMasterUgi.doAs(new PrivilegedExceptionAction<Object>() {
>>       @Override
>>       public Object run() throws Exception {
>>         appMaster.init(conf);
>>         appMaster.start();
>>         if(appMaster.errorHappenedShutDown) {
>>           throw new IOException("Was asked to shut down.");
>>         }
>>         return null;
>>       }
>>     });
>>   }
>>
>> appMaster.init(conf) does not dispatch JobFinishEventHandler, which is
>> responsible for sending an HTTP callback (via shutDownJob()). If there is
>> an exception at this point, the process simply terminates (via
>> System.exit(1)).
>>
>> appMaster.start() however rightly uses the JobFinishEventHandler and
>> things work fine.
>>
>> Shouldn't a failure on init(..) also send a callback suggesting the job
>> failed?
>>
>> Thanks,
>> Prashant
>>
>>
>>
>>
>
>  --
> Arun C. Murthy
> Hortonworks Inc.
> http://hortonworks.com/
>
>
>
--
Alejandro