Re: Map tasks processing some files multiple times
Glad it helps. Could you also explain the reason for using MultipleInputs?
On Thu, Dec 6, 2012 at 2:59 PM, David Parks <[EMAIL PROTECTED]> wrote:

> Figured it out; the problem is, as usual, in my code. I had wrapped
> TextInputFormat to replace the LongWritable key with a key representing the
> file name. It was a bit tricky to do because of changing the generics from
> <LongWritable, Text> to <Text, Text>, and I goofed up and misdirected a
> call to isSplittable, which was causing the issue.
>
> It now works fine. Thanks very much for the response, it gave me pause to
> think enough to work out what I had done.
>
> Dave
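
For reference, a minimal sketch of the kind of wrapper David describes: a FileInputFormat<Text, Text> that keys each record by the name of the file it came from and refuses to split compressed files, much as TextInputFormat does. The class body below is illustrative rather than code from this thread; note also that the method Hadoop actually consults is spelled isSplitable (one "t") in FileInputFormat, so overriding a differently named method leaves the default "always splittable" behaviour in place.

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

    // Illustrative sketch only: keys each line by the name of the file it came from.
    public class FileNameTextInputFormat extends FileInputFormat<Text, Text> {

        // Without this override FileInputFormat's default isSplitable() returns true,
        // and a .gz file can be handed to several map tasks. Refusing to split any
        // compressed file mirrors what TextInputFormat does for gzip.
        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            CompressionCodec codec =
                new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
            return codec == null;
        }

        @Override
        public RecordReader<Text, Text> createRecordReader(InputSplit split,
                TaskAttemptContext context) {
            // Wrap the stock LineRecordReader and substitute the file name as the key.
            return new RecordReader<Text, Text>() {
                private final LineRecordReader lines = new LineRecordReader();
                private Text fileName;

                @Override
                public void initialize(InputSplit genericSplit, TaskAttemptContext ctx)
                        throws IOException, InterruptedException {
                    lines.initialize(genericSplit, ctx);
                    fileName = new Text(((FileSplit) genericSplit).getPath().getName());
                }

                @Override
                public boolean nextKeyValue() throws IOException, InterruptedException {
                    return lines.nextKeyValue();
                }

                @Override
                public Text getCurrentKey() { return fileName; }

                @Override
                public Text getCurrentValue() { return lines.getCurrentValue(); }

                @Override
                public float getProgress() throws IOException, InterruptedException {
                    return lines.getProgress();
                }

                @Override
                public void close() throws IOException { lines.close(); }
            };
        }
    }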
>
> From: Hemanth Yamijala [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, December 06, 2012 3:25 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Map tasks processing some files multiple times
>
> David,
>
> You are using FileNameTextInputFormat. This is not in the Hadoop source, as
> far as I can see. Can you please confirm where it comes from? It seems like
> the isSplittable method of this input format may need checking.
>
> Another thing: given that you are adding the same input format for all files,
> do you need MultipleInputs?
>
> Thanks
> Hemanth
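
If one input format and one mapper cover every input, the job can presumably be wired without MultipleInputs at all, along these lines (a sketch reusing the class names and path from the thread, with the new mapreduce API; MultipleInputs mainly earns its keep when different paths need different input formats or mappers):

    // Sketch: single-input-format job setup without MultipleInputs.
    // FileInputFormat here is org.apache.hadoop.mapreduce.lib.input.FileInputFormat.
    Job job = new Job(getConf());  // Job.getInstance(getConf()) on newer Hadoop versions
    Path lsDir = new Path("s3n://fruggmapreduce/input/catalogs/linkshare_catalogs/*~*");
    FileInputFormat.addInputPath(job, lsDir);  // glob patterns are expanded at submit time
    job.setInputFormatClass(FileNameTextInputFormat.class);
    job.setMapperClass(LinkShareCatalogImportMapper.class);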
>
> On Thu, Dec 6, 2012 at 1:06 PM, David Parks <[EMAIL PROTECTED]>
> wrote:
>
> I believe I just tracked down the problem; maybe you can help confirm it if
> you’re familiar with this.
>
> I see that FileInputFormat is specifying that gzip files (.gz extension)
> from the s3n filesystem are being reported as splittable, and I see that
> it’s creating multiple input splits for these files. I’m mapping the files
> directly off S3:
>
>        Path lsDir = new Path(
>            "s3n://fruggmapreduce/input/catalogs/linkshare_catalogs/*~*");
>
>        MultipleInputs.addInputPath(job, lsDir,
>            FileNameTextInputFormat.class, LinkShareCatalogImportMapper.class);
>
> I see in the map phase, based on my counters, that it’s actually
> processing the entire file (I set up a counter per input file). So the 2
> files which were processed twice had 2 splits (I now see that in some debug
> logs I created), and the 1 file that was processed 3 times had 3 splits
> (the rest were smaller and were only assigned one split by default anyway).
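
A sketch of the per-file counter David mentions, assuming the filename-keyed <Text, Text> input described above; the counter group name and the mapper's output types are illustrative, not taken from the thread. A file read by more than one map task ends up with a count larger than its line count, which is how the duplicate processing shows up.

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class LinkShareCatalogImportMapper extends Mapper<Text, Text, Text, Text> {
        @Override
        protected void map(Text fileName, Text line, Context context)
                throws IOException, InterruptedException {
            // One dynamic counter per input file, keyed by the file name that the
            // filename-keyed input format supplies as the record key.
            context.getCounter("RecordsPerInputFile", fileName.toString()).increment(1);
            // ... actual record handling ...
        }
    }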
>
> Am I wrong in expecting all files on the s3n filesystem to come through as
> not-splittable? This seems to be a bug in the Hadoop code if I’m right.
>
> David
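
Splittability is decided by the input format rather than by the filesystem, so s3n does not by itself make a file unsplittable: FileInputFormat's default isSplitable() always answers true, while TextInputFormat answers false whenever a non-splittable compression codec such as gzip matches the file's extension. A custom format that never overrides isSplitable() therefore inherits the always-true default, which matches the symptom above. A small illustration (the paths are made up) of the extension-based codec lookup that drives the decision:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;

    public class SplitCheckDemo {
        public static void main(String[] args) {
            CompressionCodecFactory factory = new CompressionCodecFactory(new Configuration());
            // The lookup keys off the file name's extension, not the filesystem scheme.
            CompressionCodec gz  = factory.getCodec(new Path("s3n://bucket/catalog.gz"));
            CompressionCodec txt = factory.getCodec(new Path("s3n://bucket/catalog.txt"));
            System.out.println(gz);   // a GzipCodec -> TextInputFormat would not split this file
            System.out.println(txt);  // null        -> splittable as far as the codec check goes
        }
    }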
>
> From: Raj Vishwanathan [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, December 06, 2012 1:45 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Map tasks processing some files multiple times
>
> Could it be due to spec-ex (speculative execution)? Does it make a difference in the end?
>
> Raj
>
> ------------------------------
>
> From: David Parks <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Sent: Wednesday, December 5, 2012 10:15 PM
> Subject: Map tasks processing some files multiple times
>
> I’ve got a job that reads in 167 files from S3, but 2 of the files are
> being mapped twice and 1 of the files is mapped 3 times.
>
> This is the code I use to set up the mapper:
>
>        Path lsDir = new Path(
>            "s3n://fruggmapreduce/input/catalogs/linkshare_catalogs/*~*");
>
>        for (FileStatus f : lsDir.getFileSystem(getConf()).globStatus(lsDir))
>            log.info("Identified linkshare catalog: " + f.getPath().toString());
>
>        if (lsDir.getFileSystem(getConf()).globStatus(lsDir).length > 0) {
>            MultipleInputs.addInputPath(job, lsDir,
>                FileNameTextInputFormat.class, LinkShareCatalogImportMapper.class);
>        }
>
> I can see from the logs that it sees only 1 copy of each of these files,
> and correctly identifies 167 files.