Re: Map tasks processing some files multiple times
Hemanth Yamijala 2012-12-06, 14:43
Glad it helps. Could you also explain the reason for using MultipleInputs?
On Thu, Dec 6, 2012 at 2:59 PM, David Parks <[EMAIL PROTECTED]> wrote:

> Figured it out, it is, as usual, with my code. I had wrapped
> TextInputFormat to replace the LongWritable key with a key representing the
> file name. It was a bit tricky to do because of changing the generics from
> <LongWritable, Text> to <Text, Text> and I goofed up and mis-directed a
> call to isSplittable, which was causing the issue.
>
> It now works fine. Thanks very much for the response, it gave me pause to
> think enough to work out what I had done.
>
> Dave
>
> From: Hemanth Yamijala [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, December 06, 2012 3:25 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Map tasks processing some files multiple times
>
> David,
>
> You are using FileNameTextInputFormat. This is not in Hadoop source, as
> far as I can see. Can you please confirm where this is being used from? It
> seems like the isSplittable method of this input format may need checking.
>
> Another thing, given you are adding the same input format for all files,
> do you need MultipleInputs?
>
> Thanks
> Hemanth
>
> On Thu, Dec 6, 2012 at 1:06 PM, David Parks <[EMAIL PROTECTED]> wrote:
>
> I believe I just tracked down the problem, maybe you can help confirm if
> you’re familiar with this.
>
> I see that FileInputFormat is specifying that gzip files (.gz extension)
> from s3n filesystem are being reported as *splittable*, and I see that
> it’s creating multiple input splits for these files. I’m mapping the files
> directly off S3:
>
>     Path lsDir = new Path(
>         "s3n://fruggmapreduce/input/catalogs/linkshare_catalogs/*~*");
>     MultipleInputs.addInputPath(job, lsDir,
>         FileNameTextInputFormat.class, LinkShareCatalogImportMapper.class);
>
> I see in the map phase, based on my counters, that it’s actually
> processing the entire file (I set up a counter per file input). So the 2
> files which were processed twice had 2 splits (I now see that in some debug
> logs I created), and the 1 file that was processed 3 times had 3 splits
> (the rest were smaller and were only assigned one split by default anyway).
>
> Am I wrong in expecting all files on the s3n filesystem to come through as
> not-splittable? This seems to be a bug in Hadoop code if I’m right.
>
> David
>
> From: Raj Vishwanathan [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, December 06, 2012 1:45 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Map tasks processing some files multiple times
>
> Could it be due to spec-ex? Does it make a difference in the end?
>
> Raj
>
> ------------------------------
>
> From: David Parks <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Sent: Wednesday, December 5, 2012 10:15 PM
> Subject: Map tasks processing some files multiple times
>
> I’ve got a job that reads in 167 files from S3, but 2 of the files are
> being mapped twice and 1 of the files is mapped 3 times.
>
> This is the code I use to set up the mapper:
>
>     Path lsDir = new Path(
>         "s3n://fruggmapreduce/input/catalogs/linkshare_catalogs/*~*");
>     for(FileStatus f : lsDir.getFileSystem(getConf()).globStatus(lsDir))
>         log.info("Identified linkshare catalog: " + f.getPath().toString());
>     if( lsDir.getFileSystem(getConf()).globStatus(lsDir).length > 0 ){
>         MultipleInputs.addInputPath(job, lsDir,
>             FileNameTextInputFormat.class, LinkShareCatalogImportMapper.class);
>     }
>
> I can see from the logs that it sees only 1 copy of each of these files,
> and correctly identifies 167 files.
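
For readers following the diagnosis, here is a minimal, Hadoop-free sketch of why a wrongly reported "splittable" flag can turn one gzip file into several map inputs. It is only a simplified model, not the code from this thread and not Hadoop's actual split planner: the class SplitSketch, the methods isSplittable and planSplits, the file name catalog.gz, and the 64 MB split size are all hypothetical and chosen for illustration. (In Hadoop itself the corresponding hook on FileInputFormat is spelled isSplitable.)

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of per-file split planning. All names here are hypothetical.
final class SplitSketch {

    /** One byte range of a file, handed to a single map task. */
    static final class Split {
        final String file;
        final long start;
        final long length;

        Split(String file, long start, long length) {
            this.file = file;
            this.start = start;
            this.length = length;
        }
    }

    /** Assumed rule for this sketch: gzip files cannot be read from an arbitrary offset. */
    static boolean isSplittable(String file) {
        return !file.endsWith(".gz");
    }

    /** Cuts a file into fixed-size ranges, or returns a single range if it is not splittable. */
    static List<Split> planSplits(String file, long fileLength, long splitSize, boolean splittable) {
        List<Split> splits = new ArrayList<>();
        if (!splittable || fileLength <= splitSize) {
            splits.add(new Split(file, 0, fileLength));
            return splits;
        }
        for (long start = 0; start < fileLength; start += splitSize) {
            splits.add(new Split(file, start, Math.min(splitSize, fileLength - start)));
        }
        return splits;
    }

    public static void main(String[] args) {
        long splitSize = 64L * 1024 * 1024;   // assumed split size
        long fileLength = 150L * 1024 * 1024; // assumed size of one large catalog file
        String file = "catalog.gz";           // hypothetical file name

        // A wrapper that mis-directs the check and always answers "splittable".
        int buggy = planSplits(file, fileLength, splitSize, true).size();
        // The same file when the check looks at the extension.
        int fixed = planSplits(file, fileLength, splitSize, isSplittable(file)).size();

        System.out.println("splits when wrongly reported splittable: " + buggy);
        System.out.println("splits with the extension check: " + fixed);
    }
}
```

Run as written, the sketch reports 3 splits for the mis-reported case and 1 with the extension check. Since David's counters showed every split processing the entire file, each extra split amounts to one extra pass over that file, which is consistent with the larger files being mapped two or three times while the smaller ones were mapped once.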