Re: determining what files made up a failing task
Mat,

There is no need to know which input data caused the task, and ultimately the
job, to fail.

Set 'mapreduce.map.failures.maxpercent' and
'mapreduce.reduce.failures.maxpercent' to the percentage of task failures you
are willing to tolerate; the job will then complete even if some tasks fail.
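
For example, something along these lines in the job driver (a minimal,
untested sketch; the value 5 and the MyJob driver class are just
placeholders):

import org.apache.hadoop.mapred.JobConf;

// In the job driver:
JobConf conf = new JobConf(MyJob.class);
// Tolerate up to 5% failed map/reduce tasks without failing the whole job.
// (These are the JobConf convenience setters; depending on your Hadoop
// version you may instead set the properties named above directly, e.g.
// conf.setInt("mapreduce.map.failures.maxpercent", 5);)
conf.setMaxMapTaskFailuresPercent(5);
conf.setMaxReduceTaskFailuresPercent(5);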

Again, this is one of the hidden features of Hadoop, though it was
introduced back in 2007 (HADOOP-1144).

If you would like to really nail down the problem, you could use the
IsolationRunner. Here is more information on it:

http://hadoop.apache.org/common/docs/r0.20.205.0/mapred_tutorial.html#IsolationRunner
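
The prerequisite is keeping the failed task's files around so the task can be
re-run in isolation, roughly like this (a sketch, again with a placeholder
MyJob driver class; per the tutorial, 'keep.task.files.pattern' can be used
instead to keep files for specific tasks only):

import org.apache.hadoop.mapred.JobConf;

JobConf conf = new JobConf(MyJob.class);
// Keep the failed task's files on the tasktracker's local disk so
// IsolationRunner can re-run that task there for debugging.
conf.setBoolean("keep.failed.task.files", true);

Then re-run the job, and on the node where the task failed, run
IsolationRunner from that task's working directory as the tutorial describes.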

Regards,
Praveen

On Sun, Dec 4, 2011 at 2:42 PM, Bejoy Ks <[EMAIL PROTECTED]> wrote:

> Hi Mat
>         I'm not aware of a built-in mechanism in Hadoop that logs the
> input splits (file names) each mapper is processing. To analyze that you may
> have to do some custom logging: just log the input file name at the start
> of the map method. The full file path in HDFS can be obtained from the
> InputSplit as follows:
>
> // get the FileSplit being processed by this map task
> FileSplit fileSplit = (FileSplit) context.getInputSplit();
> // log the full HDFS path of the file backing this split
> log.debug(fileSplit.getPath());
>
> This works with the new MapReduce API. With the old MapReduce API you can
> get the same information from the JobConf as
> job.get("map.input.file");
> which you can call in your configure() method (a fuller old-API sketch
> appears at the end of this message).
>
> Hope it helps!...
>
> Regards
> Bejoy.K.S
>
>
> On Sun, Dec 4, 2011 at 4:05 AM, Mat Kelcey <[EMAIL PROTECTED]> wrote:
>
>> Hi folks,
>>
>> I have a Hadoop 0.20.2 map-only job with thousands of input tasks;
>> I'm using the org.apache.nutch.tools.arc.ArcInputFormat input format,
>> so each task corresponds to a single file in HDFS.
>>
>> Most of the way into the job it hits a task that causes the input
>> format to OOM, and after 4 attempts it fails the job.
>> Now this is obviously not great, but for the purposes of my job I'd be
>> happy to just throw this input file away; it's only one of thousands
>> and I don't need exact results.
>>
>> The trouble is I can't work out what file this task corresponds to.
>>
>> The closest I can find is that the job history file lists a STATE_STRING
>> (e.g.
>> STATE_STRING="hdfs://ip-10-115-29-44\.ec2\.internal:9000/user/hadoop/arc_files\.aa/2009/09/17/0/1253240925734_0\.arc\.gz:0+100425468"
>> )
>>
>> but this is _only_ for the successfully completed ones; for the failed
>> one I'm actually interested in there is nothing:
>> MapAttempt TASK_TYPE="MAP" TASKID="task_201112030459_0011_m_004130"
>> TASK_ATTEMPT_ID="attempt_201112030459_0011_m_004130_0"
>> TASK_STATUS="FAILED" FINISH_TIME="1322901661261"
>> HOSTNAME="ip-10-218-57-227\.ec2\.internal" ERROR="Error: null" .
>>
>> I grepped through all the Hadoop logs and couldn't find anything that
>> relates this task to the files in its split.
>> Any ideas where this info might be recorded?
>>
>> Cheers,
>> Mat
>>
>
>
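
P.S. To expand on Bejoy's note about the old API: a minimal, untested sketch
of logging the input file from configure() (the mapper class, its key/value
types, and the LOG field are just illustrative placeholders):

import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MyMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private static final Log LOG = LogFactory.getLog(MyMapper.class);

  @Override
  public void configure(JobConf job) {
    // Old-API equivalent of FileSplit.getPath(): the full HDFS path
    // of the file backing this task's split.
    LOG.info("Processing input file: " + job.get("map.input.file"));
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // ... actual map logic goes here ...
  }
}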