
David Parks 2012-12-06, 06:15
Re: Map tasks processing some files multiple times
Could it be due to speculative execution (spec-ex)? Does it make a difference in the end?
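
One way to check the spec-ex (speculative execution) idea is to compare the attempt IDs in the syslog paths quoted below. Hadoop attempt IDs have the form attempt_<jobtracker start>_<job number>_<m|r>_<task number>_<attempt number>, and a speculative or retried attempt shares the task number of the original while only the attempt number changes. The following is a minimal sketch of that comparison in plain Java; the class and method names (AttemptIdCheck, taskKey, attemptNumber, sameTask) are made up for illustration and are not part of Hadoop.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a Hadoop class): splits an attempt ID into its
// task part and its attempt number so two attempts can be compared.
class AttemptIdCheck {
    // attempt_<jobtracker start>_<job number>_<m|r>_<task number>_<attempt number>
    private static final Pattern ATTEMPT =
            Pattern.compile("attempt_(\\d+)_(\\d+)_([mr])_(\\d+)_(\\d+)");

    private static Matcher parse(String attemptId) {
        Matcher m = ATTEMPT.matcher(attemptId);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not an attempt ID: " + attemptId);
        }
        return m;
    }

    // Everything except the trailing attempt number, e.g. "201212060351_0001_m_000005".
    static String taskKey(String attemptId) {
        Matcher m = parse(attemptId);
        return m.group(1) + "_" + m.group(2) + "_" + m.group(3) + "_" + m.group(4);
    }

    // The trailing attempt number: 0 for the first attempt of a task.
    static int attemptNumber(String attemptId) {
        return Integer.parseInt(parse(attemptId).group(5));
    }

    // True when both IDs are attempts of the same task (retry or speculative copy).
    static boolean sameTask(String a, String b) {
        return taskKey(a).equals(taskKey(b));
    }

    public static void main(String[] args) {
        // The two attempt IDs that appear in the quoted syslog paths.
        String first = "attempt_201212060351_0001_m_000005_0";
        String second = "attempt_201212060351_0001_m_000173_0";
        if (sameTask(first, second)) {
            System.out.println("Same task, attempts " + attemptNumber(first)
                    + " and " + attemptNumber(second));
        } else {
            System.out.println("Different tasks: " + taskKey(first)
                    + " vs " + taskKey(second));
        }
    }
}
```

For the two IDs in the quoted logs the task numbers differ (000005 and 000173) and both attempt numbers are 0, which on this reading looks like two separate map tasks rather than a second attempt of one task. That is only an inference from the ID format, not a confirmed diagnosis.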


> From: David Parks <[EMAIL PROTECTED]>
>Sent: Wednesday, December 5, 2012 10:15 PM
>Subject: Map tasks processing some files multiple times
>I’ve got a job that reads in 167 files from S3, but 2 of the files are being mapped twice and 1 of the files is mapped 3 times.

>This is the code I use to set up the mapper:

>       Path lsDir = new Path("s3n://fruggmapreduce/input/catalogs/linkshare_catalogs/*~*");
>       for (FileStatus f : lsDir.getFileSystem(getConf()).globStatus(lsDir)) log.info("Identified linkshare catalog: " + f.getPath().toString());
>       if (lsDir.getFileSystem(getConf()).globStatus(lsDir).length > 0) {
>              MultipleInputs.addInputPath(job, lsDir, FileNameTextInputFormat.class, LinkShareCatalogImportMapper.class);
>       }

>I can see from the logs that it sees only 1 copy of each of these files, and correctly identifies 167 files.

>I also have the following confirmation that it found the 167 files correctly:

>2012-12-06 04:56:41,213 INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat (main): Total input paths to process : 167

>When I look through the syslogs I can see that the file in question was opened by two different map attempts:

>./task-attempts/job_201212060351_0001/attempt_201212060351_0001_m_000005_0/syslog:2012-12-06 03:56:05,265 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening 's3n://fruggmapreduce/input/catalogs/linkshare_catalogs/linkshare~CD%20Universe~85.csv.gz' for reading
>./task-attempts/job_201212060351_0001/attempt_201212060351_0001_m_000173_0/syslog:2012-12-06 03:53:18,765 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening 's3n://fruggmapreduce/input/catalogs/linkshare_catalogs/linkshare~CD%20Universe~85.csv.gz' for reading

>This is only happening to these 3 files; all others seem to be fine. For the life of me I can’t see a reason why these files might be processed multiple times.

>Notably, a map task numbered 173 should not be possible. There are 167 input files (from S3, gzipped), so there should be 167 map tasks, yet I see a total of 176 map tasks.

>Any thoughts/ideas/guesses?
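
The manual syslog search described in the quoted message could be automated to list every input file opened by more than one attempt. The sketch below assumes grep-style lines like the two quoted above, where the attempt ID appears in the file path and the NativeS3FileSystem message reads "Opening '<path>' for reading". The class and method names (DuplicateOpens, attemptsByFile, openedMoreThanOnce) are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical log scanner: groups "Opening '...' for reading" lines by input
// path and reports the paths that were opened by two or more attempts.
class DuplicateOpens {
    // Group 1: attempt ID taken from the syslog path; group 2: the opened input path.
    private static final Pattern OPEN = Pattern.compile(
            "(attempt_\\d+_\\d+_[mr]_\\d+_\\d+).*Opening '([^']+)' for reading");

    // Maps each opened input path to the set of attempts that opened it.
    static Map<String, Set<String>> attemptsByFile(List<String> lines) {
        Map<String, Set<String>> result = new TreeMap<>();
        for (String line : lines) {
            Matcher m = OPEN.matcher(line);
            if (m.find()) {
                result.computeIfAbsent(m.group(2), k -> new TreeSet<>()).add(m.group(1));
            }
        }
        return result;
    }

    // Keeps only the input paths opened by at least two distinct attempts.
    static Map<String, Set<String>> openedMoreThanOnce(List<String> lines) {
        Map<String, Set<String>> all = attemptsByFile(lines);
        all.values().removeIf(attempts -> attempts.size() < 2);
        return all;
    }

    public static void main(String[] args) {
        // The two syslog lines quoted in the message above.
        List<String> lines = Arrays.asList(
                "./task-attempts/job_201212060351_0001/attempt_201212060351_0001_m_000005_0/syslog:2012-12-06 03:56:05,265 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening 's3n://fruggmapreduce/input/catalogs/linkshare_catalogs/linkshare~CD%20Universe~85.csv.gz' for reading",
                "./task-attempts/job_201212060351_0001/attempt_201212060351_0001_m_000173_0/syslog:2012-12-06 03:53:18,765 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening 's3n://fruggmapreduce/input/catalogs/linkshare_catalogs/linkshare~CD%20Universe~85.csv.gz' for reading");
        for (Map.Entry<String, Set<String>> e : openedMoreThanOnce(lines).entrySet()) {
            System.out.println(e.getKey() + " opened by " + e.getValue());
        }
    }
}
```

Feeding it the output of a recursive grep for "Opening" over the task-attempts directory should show the three affected files together with the attempts that read each one.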

David Parks 2012-12-06, 07:36
Hemanth Yamijala 2012-12-06, 08:25
David Parks 2012-12-07, 03:57