Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
Alejandro Abdelnur 2013-01-31, 18:44
Hi Tony, from what I understand, your problem is not with MTOF but with
wanting to run two jobs using the same output directory: the second job will
fail because the output dir already exists. My take would be to tweak your
jobs to use a temp output dir and move the results to the required (final)
location upon completion.
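A minimal sketch of that approach, not a tested implementation. It assumes the job writes through MultipleOutputs with a base path of the form <node name>/<run date>/part, so that results land under staging/<node name>/<run date>/. The class name StagedOutput, the "_staging_" prefix and both method names are hypothetical, chosen only for illustration; isDir() is the Hadoop 1.x spelling.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Hypothetical helper: run the job against a per-run staging dir, then move the results. */
public class StagedOutput {

    /** Per-run staging directory (assumed naming scheme), e.g. <outputRoot>/_staging_20130113. */
    public static Path stagingDir(Path outputRoot, String runDate) {
        return new Path(outputRoot, "_staging_" + runDate);
    }

    /**
     * Moves <staging>/<node>/<runDate> to <outputRoot>/<node>/<runDate> for every
     * node directory found in the staging dir, then deletes the staging dir.
     * Call this only after the job has completed successfully.
     */
    public static void promote(Configuration conf, Path outputRoot, String runDate)
            throws IOException {
        Path staging = stagingDir(outputRoot, runDate);
        FileSystem fs = staging.getFileSystem(conf);

        FileStatus[] children = fs.listStatus(staging);
        if (children == null) { // Hadoop 1.x returns null for a missing path
            throw new IOException("Staging directory not found: " + staging);
        }

        for (FileStatus child : children) {
            String name = child.getPath().getName();
            // Skip plain files and bookkeeping entries such as _SUCCESS or _logs.
            if (!child.isDir() || name.startsWith("_")) {
                continue;
            }
            Path source = new Path(child.getPath(), runDate);
            if (!fs.exists(source)) {
                continue; // this node produced nothing for the run date
            }
            Path targetParent = new Path(outputRoot, name);
            Path target = new Path(targetParent, runDate);
            if (fs.exists(target)) {
                throw new IOException("Target already exists: " + target);
            }
            fs.mkdirs(targetParent);
            if (!fs.rename(source, target)) {
                throw new IOException("Could not move " + source + " to " + target);
            }
        }
        fs.delete(staging, true); // recursive delete of whatever is left
    }
}
```

In the driver you would call FileOutputFormat.setOutputPath(job, StagedOutput.stagingDir(outputRoot, runDate)) before submitting, and StagedOutput.promote(job.getConfiguration(), outputRoot, runDate) only if job.waitForCompletion(true) returns true. Since the staging dir is new for each run date, checkOutputSpecs() has nothing to complain about. One caveat: on S3-backed filesystems a rename is typically a copy followed by a delete, so the move is neither atomic nor free.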
On Thu, Jan 31, 2013 at 8:22 AM, Tony Burton <[EMAIL PROTECTED]> wrote:
> Hi everyone,
> Some of you might recall this topic, which I worked on with the list's
> help back in August last year - see email trail below. Despite initial
> success of the discovery, I had to shelve the approach as I ended up using
> a different solution (for reasons I forget!) with the implementation that
> was ultimately used for that particular project.
> I'm now in a position to be working on a similar new task, where I've
> successfully implemented the combination of LazyOutputFormat and
> MultipleOutputs using hadoop 1.0.3 to write out to multiple custom output
> locations. However, I've hit another snag which I'm hoping you might help
> me work through.
> I'm going to be running daily tasks to extract data from XML files
> (specifically, the data stored in certain nodes of the XML), stored on AWS
> S3 using object names with the following format:
> s3://inputbucket/data/2013/1/13/<list of xml data files.bz2>
> I want to extract items from the XML and write out as follows:
> s3://outputbucket/path/<xml node name>/20130113/<output from MR job>
> For one day of data, this works fine. I pass in s3://inputbucket/data and
> s3://outputbucket/path as input and output arguments, along with my run
> date (20130113) which gets manipulated and appended where appropriate to
> form the precise read and write locations, for example
> FileInputFormat.setInputPaths(job, new Path("s3://inputbucket/data"));
> FileOutputFormat.setOutputPath(job, new Path("s3://outputbucket/path"));
> Then MultipleOutputs adds on my XML node names underneath
> s3://outputbucket/path automatically.
> However, for the next day's run, the job gets to
> FileOutputFormat.setOutputPath and sees that the output path
> (s3://outputbucket/path) already exists, and throws a
> FileAlreadyExistsException from FileOutputFormat.checkOutputSpecs() - even
> though my ultimate subdirectory, to be constructed by MultipleOutputs, does
> not already exist.
> Is there any way around this? I'm given hope by this, from
> "public class FileAlreadyExistsException extends IOException - Used when
> target file already exists for any operation *and is not configured to be
> overwritten*" (my emphasis). Is it possible to deconfigure the overwrite
> If not, I suppose one other way ahead is to create my own FileOutputFormat
> where the checkOutputSpecs() is a bit less fussy; another might be to write
> to a "temp" directory and programmatically move it to the desired output
> when the job completes successfully, although this is getting to feel a bit
> "hacky" to me.
> Thanks for any feedback!
> From: Harsh J [[EMAIL PROTECTED]]
> Sent: 31 August 2012 10:47
> To: [EMAIL PROTECTED]
> Subject: Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
> Good finding, that OF slipped my mind. We can mention on the
> MultipleOutputs javadocs for the new API to use the LazyOutputFormat for
> the job-level config. Please file a JIRA for this under MAPREDUCE project
> on the Apache JIRA?
> On Fri, Aug 31, 2012 at 2:32 PM, Tony Burton <[EMAIL PROTECTED]> wrote:
> > Hi Harsh,
> > I tried using NullOutputFormat as you suggested, however simply using
> > job.setOutputFormatClass(NullOutputFormat.class);
> > resulted in no output at all. Although I've not tried overriding
> > getOutputCommitter in NullOutputFormat as you suggested, I discovered
> > LazyOutputFormat which only writes when it has to, "the output file is