Maybe this message can solve your problem as well:
@Shi Yu:
Yes, there are built-in functions to get the input file Path in the Mapper
(you can use them for counters by putting the file name in the counter
name). However, there are some issues if your job uses MultipleInputs.
Here's some sample code I wrote to work around the issue (execute it in a
Mapper):
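// Assumes the old org.apache.hadoop.mapred API (as the reporter suggests).
// Needs java.io.IOException, java.lang.reflect.Method,
// org.apache.hadoop.fs.Path and org.apache.hadoop.mapred.FileSplit.
Path filePath = null;
Object obj = reporter.getInputSplit();
if (!(obj instanceof FileSplit)) {
    // With MultipleInputs the split is a wrapper, so pull the real
    // split out of it through its getInputSplit() method via reflection.
    Class<?> clazz = obj.getClass();
    try {
        Method inputSplitMethod = clazz.getDeclaredMethod("getInputSplit");
        inputSplitMethod.setAccessible(true);
        Object inputSplit = inputSplitMethod.invoke(obj);
        if (inputSplit instanceof FileSplit) {
            filePath = ((FileSplit) inputSplit).getPath();
        }
    } catch (Exception e) {
        throw new IOException(
            "Could not find input FileSplit in Mapper", e);
    }
} else {
    // Without MultipleInputs the split is already a FileSplit.
    FileSplit fs = (FileSplit) obj;
    filePath = fs.getPath();
}
if (filePath == null) {
    throw new IOException(
        "Could not find input FileSplit in Mapper");
}
if (LOG.isDebugEnabled()) {
    LOG.debug("filePath: " + filePath);
}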
Using Cloudera Hadoop 0.20.1+169.113
Subversion -r 6c765a47a9291470d3d8814c98155115d109d71
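The reflection step in the snippet above can be tried outside Hadoop. The following is only a minimal sketch: FakeFileSplit and FakeTaggedSplit are hypothetical stand-ins invented for illustration (they are not Hadoop classes), and the private getInputSplit() on the wrapper is an assumption that merely mimics a split that cannot be reached directly.

```java
import java.io.IOException;
import java.lang.reflect.Method;

public class SplitUnwrapDemo {

    // Hypothetical stand-in for a file split: it only carries a path.
    static class FakeFileSplit {
        private final String path;
        FakeFileSplit(String path) { this.path = path; }
        String getPath() { return path; }
    }

    // Hypothetical stand-in for the wrapper handed to the Mapper when
    // MultipleInputs is used: the real split sits behind a private method.
    static class FakeTaggedSplit {
        private final Object inner;
        FakeTaggedSplit(Object inner) { this.inner = inner; }
        private Object getInputSplit() { return inner; }
    }

    // Returns the path of the split, unwrapping it by reflection if needed.
    static String findPath(Object split) throws IOException {
        Object candidate = split;
        if (!(candidate instanceof FakeFileSplit)) {
            try {
                Method m = candidate.getClass()
                        .getDeclaredMethod("getInputSplit");
                m.setAccessible(true);
                candidate = m.invoke(candidate);
            } catch (Exception e) {
                throw new IOException("Could not find input FileSplit", e);
            }
        }
        if (candidate instanceof FakeFileSplit) {
            return ((FakeFileSplit) candidate).getPath();
        }
        throw new IOException("Could not find input FileSplit");
    }

    public static void main(String[] args) throws IOException {
        Object direct = new FakeFileSplit("/data/a.gz");
        Object wrapped = new FakeTaggedSplit(new FakeFileSplit("/data/b.gz"));
        System.out.println(findPath(direct));
        System.out.println(findPath(wrapped));
    }
}
```

Both calls in main go through the same findPath method, so the caller does not need to know whether the split was wrapped. An object that is neither kind of split ends in an IOException, as in the Mapper code.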
I also logged this with Cloudera; please vote for it if you want it fixed:
http://getsatisfaction.com/cloudera/topics/hadoop_getting_taggedinputsplit_instead_of_filesplit_with_multipleinputs
Cheers,
Matt
On 10/22/10 6:01 PM, "Shi Yu"<[EMAIL PROTECTED]> wrote:
> My belated thanks for the nice advice. I have tried this, and it works.
> However, to produce the line numbers I had to rescan the files, add the
> line numbers, and then save them as new files. It took a long time because
> they are very big. Are there any built-in functions that could
> automatically provide the current filename (if there are multiple files)
> and the line numbers in Map/Reduce?
>
> Shi
>
> On 2010-10-20 21:16, Hieu Khac Le wrote:
>
>> How about using the line number as the key and the string at that line as
>> the value?
>>
>> -------
>> Please excuse typos and brief nature of this email sent from my mobile device
>>
>> On Oct 20, 2010, at 9:07 PM, Shi Yu<[EMAIL PROTECTED]> wrote:
>>
>>> Hi,
>>>
>>> I have a problem comparing two huge files (100G each) consisting of string
>>> sequences. It is much like the text file comparison problem. I would like to
>>> find out how many strings differ between these two files in their
>>> natural order. Can this task be modeled as a map/reduce job? Currently I
>>> have no idea how to control the map splits and make sure the two input
>>> threads in one map task point to the same positions in the files.
>>>
>>> Shi
On 2010-10-26 14:43, Oleg Ruchovets wrote:
> Hi,
> I am running a Hadoop job that processes ~4000 files (the files are gz), and
> suppose one of these gz files is corrupted. From the web console / log files
> I can see which task got the exception, but isolating which file was
> corrupted is really hard. Is there a way to know which file was processed
> by which Hadoop task?
>
> Thanks in advance
> Oleg.