Hadoop user mailing list: Map tasks processing some files multiple times


David Parks 2012-12-06, 06:15
Raj Vishwanathan 2012-12-06, 06:45
David Parks 2012-12-06, 07:36

Re: Map tasks processing some files multiple times
David,

You are using FileNameTextInputFormat. This is not in the Hadoop source, as
far as I can see. Can you please confirm where it comes from? It seems like
the isSplitable method of this input format may need checking.
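
If that is the problem, forcing the format to stay unsplit is a one-line
override. A minimal sketch, assuming FileNameTextInputFormat extends
TextInputFormat (note that Hadoop spells the method isSplitable, with a
single "t"):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class FileNameTextInputFormat extends TextInputFormat {
        // Force exactly one split (and hence one map task) per file,
        // regardless of file size, filesystem, or compression codec.
        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false;
        }
    }

For context, FileInputFormat's default isSplitable returns true
unconditionally; it is TextInputFormat that declines to split compressed
files. So if your class extends FileInputFormat directly, gzipped inputs
will be split.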

Another thing: given that you are adding the same input format for all
files, do you need MultipleInputs at all?
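
Since every path gets the same input format and the same mapper, the plain
job-level configuration should be enough. A sketch, reusing the class names
from your snippet:

    FileInputFormat.addInputPath(job, lsDir);
    job.setInputFormatClass(FileNameTextInputFormat.class);
    job.setMapperClass(LinkShareCatalogImportMapper.class);

MultipleInputs is only needed when different paths require different input
formats or mappers.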

Thanks
Hemanth
On Thu, Dec 6, 2012 at 1:06 PM, David Parks <[EMAIL PROTECTED]> wrote:

> I believe I just tracked down the problem; maybe you can help confirm it
> if you're familiar with this.
>
> I see that FileInputFormat is reporting gzip files (.gz extension) from
> the s3n filesystem as splittable, and I see that it's creating multiple
> input splits for these files. I'm mapping the files directly off S3:
>
>     Path lsDir = new Path(
>         "s3n://fruggmapreduce/input/catalogs/linkshare_catalogs/*~*");
>     MultipleInputs.addInputPath(job, lsDir,
>         FileNameTextInputFormat.class, LinkShareCatalogImportMapper.class);
>
> I see in the map phase, based on my counters, that it's actually
> processing the entire file (I set up a counter per input file). So the
> 2 files which were processed twice had 2 splits (I now see that in some
> debug logs I created), and the 1 file that was processed 3 times had 3
> splits (the rest were smaller and were only assigned one split by
> default anyway).
>
> Am I wrong in expecting all files on the s3n filesystem to come through
> as not-splittable? This seems to be a bug in the Hadoop code if I'm
> right.
>
> David
>
> From: Raj Vishwanathan [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, December 06, 2012 1:45 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Map tasks processing some files multiple times
>
> Could it be due to spec-ex (speculative execution)? Does it make a
> difference in the end?
>
> Raj
>
> ------------------------------
>
> From: David Parks <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Sent: Wednesday, December 5, 2012 10:15 PM
> Subject: Map tasks processing some files multiple times
>
> I've got a job that reads in 167 files from S3, but 2 of the files are
> being mapped twice and 1 of the files is mapped 3 times.
>
> This is the code I use to set up the mapper:
>
>     Path lsDir = new Path(
>         "s3n://fruggmapreduce/input/catalogs/linkshare_catalogs/*~*");
>     for (FileStatus f :
>             lsDir.getFileSystem(getConf()).globStatus(lsDir))
>         log.info("Identified linkshare catalog: "
>             + f.getPath().toString());
>     if (lsDir.getFileSystem(getConf()).globStatus(lsDir).length > 0) {
>         MultipleInputs.addInputPath(job, lsDir,
>             FileNameTextInputFormat.class,
>             LinkShareCatalogImportMapper.class);
>     }
>
> I can see from the logs that it sees only 1 copy of each of these
> files, and correctly identifies 167 files.
>
> I also have the following confirmation that it found the 167 files
> correctly:
>
>     2012-12-06 04:56:41,213 INFO
>     org.apache.hadoop.mapreduce.lib.input.FileInputFormat (main):
>     Total input paths to process : 167
>
> When I look through the syslogs I can see that the file in question
> was opened by two different map attempts:
>
>     ./task-attempts/job_201212060351_0001/attempt_201212060351_0001_m_000005_0/syslog:
>     2012-12-06 03:56:05,265 INFO
>     org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening
>     's3n://fruggmapreduce/input/catalogs/linkshare_catalogs/linkshare~CD%20Universe~85.csv.gz'
>     for reading
>
>     ./task-attempts/job_201212060351_0001/attempt_201212060351_0001_m_000173_0/syslog:
>     2012-12-06 03:53:18,765 INFO
>     org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening
>     's3n://fruggmapreduce/input/catalogs/linkshare_catalogs/linkshare~CD%20Universe~85.csv.gz'
>     for reading
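
For reference, the splittability check in the stock TextInputFormat of that
era consults the compression codec registry. A paraphrased sketch, not the
verbatim source (the class name here is just for illustration):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class CodecAwareTextInputFormat extends TextInputFormat {
        // A file is splittable only if no compression codec claims its
        // extension; GzipCodec claims ".gz" and gzip streams cannot be
        // split, so each .gz file should become exactly one split.
        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            CompressionCodec codec = new CompressionCodecFactory(
                context.getConfiguration()).getCodec(file);
            return codec == null;
        }
    }

Note that this decision is made by the input format, not the filesystem,
so s3n versus HDFS should make no difference here: if
FileNameTextInputFormat inherits FileInputFormat's default isSplitable,
every input, gzipped or not, is considered splittable.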
David Parks 2012-12-07, 03:57