This is a common concern. There is an MR JIRA raised for this.
One way I use to find which inputs went to a map task is as follows:
a) Get the input split locations from the task log;
b) Go to that location and grep the datanode logs for the attempt id; you will get the block id from it.
c) On the input path to the MR job, run:
hadoop fsck <input_path> -locations -blocks -files
The output contains the block report; search it for that block id to get the filename.
(fsck is a fairly expensive operation for the namenode, so watch how broad a path you run it against.)
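Sketched as a shell session, steps b) and c) might look like the following. The log path, attempt id, block id, and input path are all hypothetical examples, not values taken from this thread:

```shell
# b) On the datanode that served the split, grep its log for the
#    attempt id. The log path and attempt id are made-up examples.
grep "attempt_201102141346_0097_m_000000_0" \
    /var/log/hadoop/hadoop-datanode-*.log
# The matching lines should mention a block id, e.g. blk_1234567890.

# c) Run fsck on the job's input path and search the block report for
#    that block id; the lines just above the match name the file.
hadoop fsck /user/logs/nightly \
    -files -blocks -locations | grep -B 2 "blk_1234567890"
```

The `-B 2` simply prints a couple of lines of leading context so the filename that owns the block is visible next to the match.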
I would like to know if anyone else has a better way.
Thanks and Regards
From: Kester, Scott [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, February 16, 2011 8:22 PM
To: [EMAIL PROTECTED]
Subject: How to find input file associated with failed map task?
This may be better asked on one of the other Hadoop lists, but as the job in question is done with Pig I thought I would start here. I have a nightly job that runs against around 1000 gzip log files. Around once a week one of the map tasks will fail, reporting some form of gzip error/corruption in the input file. The job still completes successfully, as we have set mapred.max.map.failures.percent = 1 to allow a few input files to fail without aborting the entire job.
Sometimes I can find the name of the corrupt input file in the logs available for the map task from the Map/Reduce Administration page on port 50030 of the name node. However, most of the time the name is not in these logs. I can find the map task id, of the form attempt_201102141346_0097_m_000000_0, but would like to know how, if possible, to find the name of the corrupted input file. Is there a Pig/Hadoop file/log somewhere that associates the attempt id with the input file?