Re: Map tasks processing some files multiple times
Could it be due to speculative execution (spec-ex)? Does it make a difference in the end?
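
One way to rule that out is to disable speculative execution for the job and re-run. A minimal sketch using the new mapreduce API; the Configuration setup and job name below are placeholders, not taken from David's code:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    // Sketch: turn off speculative execution so each input split should be
    // processed by exactly one map attempt (barring task failures/retries).
    Configuration conf = getConf();               // assuming a Tool-style driver
    Job job = new Job(conf, "linkshare-import");  // hypothetical job name

    job.setMapSpeculativeExecution(false);
    job.setReduceSpeculativeExecution(false);

    // Equivalent configuration properties on 1.x-era clusters:
    //   mapred.map.tasks.speculative.execution = false
    //   mapred.reduce.tasks.speculative.execution = false

If the duplicate opens disappear after that, speculative attempts were the cause. Either way it shouldn't affect the final output, since only one attempt's output gets committed.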

Raj

>________________________________
> From: David Parks <[EMAIL PROTECTED]>
>To: [EMAIL PROTECTED]
>Sent: Wednesday, December 5, 2012 10:15 PM
>Subject: Map tasks processing some files multiple times
>
>
>I’ve got a job that reads in 167 files from S3, but 2 of the files are being mapped twice and 1 of the files is mapped 3 times.

>This is the code I use to set up the mapper:

>       Path lsDir = newPath("s3n://fruggmapreduce/input/catalogs/linkshare_catalogs/*~*");
>       for(FileStatus f : lsDir.getFileSystem(getConf()).globStatus(lsDir)) log.info("Identified linkshare catalog: "+ f.getPath().toString());
>       if( lsDir.getFileSystem(getConf()).globStatus(lsDir).length> 0 ){
>              MultipleInputs.addInputPath(job, lsDir, FileNameTextInputFormat.class, LinkShareCatalogImportMapper.class);
>       }

>I can see from the logs that it sees only 1 copy of each of these files, and correctly identifies 167 files.

>I also have the following confirmation that it found the 167 files correctly:

>2012-12-06 04:56:41,213 INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat (main): Total input paths to process : 167

>When I look through the syslogs I can see that the file in question was opened by two different map attempts:

>./task-attempts/job_201212060351_0001/attempt_201212060351_0001_m_000005_0/syslog:2012-12-06 03:56:05,265 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening 's3n://fruggmapreduce/input/catalogs/linkshare_catalogs/linkshare~CD%20Universe~85.csv.gz' for reading
>./task-attempts/job_201212060351_0001/attempt_201212060351_0001_m_000173_0/syslog:2012-12-06 03:53:18,765 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening 's3n://fruggmapreduce/input/catalogs/linkshare_catalogs/linkshare~CD%20Universe~85.csv.gz' for reading

>This is only happening to these 3 files, all others seem to be fine. For the life of me I can’t see a reason why these files might be processed multiple times.

>Notably, map attempt 173 is more map attempts than should be possible. There are 167 input files (from S3, gzipped), thus there should be 167 map attempts. But I see a total of 176 map tasks.

>Any thoughts/ideas/guesses?

>
>