Hadoop, mail # user - Map tasks processing some files multiple times


Re: Map tasks processing some files multiple times
Hemanth Yamijala 2012-12-06, 08:25
David,

You are using FileNameTextInputFormat. This is not in the Hadoop source, as
far as I can see. Can you please confirm where it comes from? It seems like
the isSplitable method of this input format may need checking.
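For context, FileInputFormat's default isSplitable returns true for every file, so a custom input format that does not override it will happily split gzip files. Hadoop's own TextInputFormat overrides it to consult the compression codec. A minimal sketch of that pattern follows; the class name is illustrative, not something from this thread:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Sketch of a codec-aware isSplitable override (hypothetical class name).
public class CodecAwareTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        CompressionCodec codec =
            new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
        if (codec == null) {
            return true; // uncompressed: safe to split
        }
        // Gzip is not a SplittableCompressionCodec, so .gz files get one split.
        return codec instanceof SplittableCompressionCodec;
    }
}
```

If the FileNameTextInputFormat in question inherits isSplitable straight from FileInputFormat, that would explain the multiple splits per .gz file.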

Another thing: given that you are adding the same input format for all files,
do you need MultipleInputs at all?
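MultipleInputs is meant for pairing different paths with different input formats or mappers; with a single format and mapper, the plain FileInputFormat setup is enough. A sketch, assuming the same classes used later in the thread are on the classpath:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Sketch: single-format job setup. FileNameTextInputFormat and
// LinkShareCatalogImportMapper are the (external) classes from this thread.
public class SingleFormatSetup {
    public static void configure(Job job, Path lsDir) throws java.io.IOException {
        job.setInputFormatClass(FileNameTextInputFormat.class);
        job.setMapperClass(LinkShareCatalogImportMapper.class);
        FileInputFormat.addInputPath(job, lsDir);
    }
}
```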

Thanks
Hemanth
On Thu, Dec 6, 2012 at 1:06 PM, David Parks <[EMAIL PROTECTED]> wrote:

> I believe I just tracked down the problem; maybe you can help confirm if
> you're familiar with this.
>
> I see that FileInputFormat is reporting gzip files (.gz extension) on the
> s3n filesystem as *splittable*, and I see that it's creating multiple input
> splits for these files. I'm mapping the files directly off S3:
>
>
>        Path lsDir = new Path(
>            "s3n://fruggmapreduce/input/catalogs/linkshare_catalogs/*~*");
>
>        MultipleInputs.addInputPath(job, lsDir,
>            FileNameTextInputFormat.class, LinkShareCatalogImportMapper.class);
>
>
> I see in the map phase, based on my counters, that it's actually processing
> the entire file each time (I set up a counter per input file). So the 2
> files that were processed twice had 2 splits (I now see that in some debug
> logs I created), and the 1 file that was processed 3 times had 3 splits
> (the rest were smaller and were only assigned one split by default anyway).
>
> Am I wrong in expecting all files on the s3n filesystem to come through as
> not splittable? This seems to be a bug in the Hadoop code if I'm right.
>
> David
>
>
> *From:* Raj Vishwanathan [mailto:[EMAIL PROTECTED]]
> *Sent:* Thursday, December 06, 2012 1:45 PM
> *To:* [EMAIL PROTECTED]
> *Subject:* Re: Map tasks processing some files multiple times
>
> Could it be due to speculative execution? Does it make a difference in the
> end?
>
> Raj
>
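Speculative execution can be ruled out directly by disabling it for a test run. A sketch using the Hadoop 1.x property name current when this thread was written (verify the exact key for your version):

```java
import org.apache.hadoop.conf.Configuration;

// Sketch: turn off speculative map tasks so any duplicate processing
// caused by spec-ex disappears on the next run (Hadoop 1.x property name).
public class DisableSpecEx {
    public static void apply(Configuration conf) {
        conf.setBoolean("mapred.map.tasks.speculative.execution", false);
    }
}
```

Worth noting: the syslog excerpts later in the thread show the file being opened by different map task IDs (m_000005 and m_000173), both attempt `_0`, rather than repeated attempts of one task, which already points at extra input splits rather than speculative execution.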
> ------------------------------
>
> *From:* David Parks <[EMAIL PROTECTED]>
> *To:* [EMAIL PROTECTED]
> *Sent:* Wednesday, December 5, 2012 10:15 PM
> *Subject:* Map tasks processing some files multiple times
>
>
> I've got a job that reads in 167 files from S3, but 2 of the files are
> being mapped twice and 1 of the files is mapped 3 times.
>
> This is the code I use to set up the mapper:
>
>
>        Path lsDir = new Path(
>            "s3n://fruggmapreduce/input/catalogs/linkshare_catalogs/*~*");
>
>        for (FileStatus f : lsDir.getFileSystem(getConf()).globStatus(lsDir))
>            log.info("Identified linkshare catalog: " + f.getPath().toString());
>
>        if (lsDir.getFileSystem(getConf()).globStatus(lsDir).length > 0) {
>            MultipleInputs.addInputPath(job, lsDir,
>                FileNameTextInputFormat.class, LinkShareCatalogImportMapper.class);
>        }
>
>
> I can see from the logs that it sees only 1 copy of each of these files,
> and correctly identifies 167 files.
>
> I also have the following confirmation that it found the 167 files
> correctly:
>
> 2012-12-06 04:56:41,213 INFO
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat (main): Total input
> paths to process : 167
>
> When I look through the syslogs I can see that the file in question was
> opened by two different map attempts:
>
>
> ./task-attempts/job_201212060351_0001/attempt_201212060351_0001_m_000005_0/syslog:
> 2012-12-06 03:56:05,265 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem
> (main): Opening
> 's3n://fruggmapreduce/input/catalogs/linkshare_catalogs/linkshare~CD%20Universe~85.csv.gz'
> for reading
>
> ./task-attempts/job_201212060351_0001/attempt_201212060351_0001_m_000173_0/syslog:
> 2012-12-06 03:53:18,765 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem
> (main): Opening
> 's3n://fruggmapreduce/input/catalogs/linkshare_catalogs/linkshare~CD%20Universe~85.csv.gz'