
Tony Burton 2013-01-31, 16:22
Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
Hi Tony, from what I understand, your problem is not with MTOF but with
wanting to run two jobs against the same output directory: the second job
fails because the output directory already exists. My suggestion would be to
have each job write to a temporary output directory and move the results to
the required (final) location upon completion.
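
To make the suggestion concrete, here is a minimal sketch of the
write-to-temp-then-move pattern. It uses java.nio.file on a local filesystem
purely as a stand-in so that it runs without a cluster; on HDFS or S3 the
corresponding calls would presumably go through Hadoop's FileSystem API
instead (for example FileSystem.rename). The class name StagedOutput, the
method promote, and the directory names are hypothetical and not from the
thread.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class StagedOutput {

    /**
     * Moves a finished job's staging directory to its final location.
     * Creates the parent of the final location if needed and refuses to
     * replace a final directory that already exists.
     * (Hypothetical helper; local-filesystem stand-in for the HDFS/S3 case.)
     */
    public static Path promote(Path stagingDir, Path finalDir) throws IOException {
        if (Files.exists(finalDir)) {
            throw new IOException("Final output already exists: " + finalDir);
        }
        Path parent = finalDir.getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        // A rename within one filesystem; assumes staging and final share it.
        return Files.move(stagingDir, finalDir, StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("staged-output-demo");

        // 1. The job writes into a per-run staging directory.
        Path staging = root.resolve("_tmp").resolve("run-20130113");
        Files.createDirectories(staging);
        Files.write(staging.resolve("part-r-00000"),
                "example\n".getBytes(StandardCharsets.UTF_8));

        // 2. Only after the job succeeds is the result moved into place.
        Path finalDir = root.resolve("path").resolve("20130113");
        promote(staging, finalDir);

        System.out.println(Files.exists(finalDir.resolve("part-r-00000")));
    }
}
```

Because the staging directory is unique per run, the output-exists check
never sees a pre-existing path, and a failed job leaves the final location
untouched.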


On Thu, Jan 31, 2013 at 8:22 AM, Tony Burton <[EMAIL PROTECTED]> wrote:

> Hi everyone,
> Some of you might recall this topic, which I worked on with the list's
> help back in August last year - see email trail below. Despite initial
> success of the discovery, I had to shelve the approach as I ended up using
> a different solution (for reasons I forget!) with the implementation that
> was ultimately used for that particular project.
> I'm now in a position to be working on a similar new task, where I've
> successfully implemented the combination of LazyOutputFormat and
> MultipleOutputs using hadoop 1.0.3 to write out to multiple custom output
> locations. However, I've hit another snag which I'm hoping you might help
> me work through.
> I'm going to be running daily tasks to extract data from XML files
> (specifically, the data stored in certain nodes of the XML), stored on AWS
> S3 using object names with the following format:
> s3://inputbucket/data/2013/1/13/<list of xml data files.bz2>
> I want to extract items from the XML and write out as follows:
> s3://outputbucket/path/<xml node name>/20130113/<output from MR job>
> For one day of data, this works fine. I pass in s3://inputbucket/data and
> s3://outputbucket/path as input and output arguments, along with my run
> date (20130113) which gets manipulated and appended where appropriate to
> form the precise read and write locations, for example
> FileInputFormat.setInputPaths(job, new Path("s3://inputbucket/data"));
> FileOutputFormat.setOutputPath(job, new Path("s3://outputbucket/path"));
> Then MultipleOutputs adds on my XML node names underneath
> s3://outputbucket/path automatically.
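
The path layout described above can be sketched as a small helper that turns
a run date such as 20130113 into the read and write locations. This is only
an illustration under assumptions: the class RunPaths, its method names, and
the node name "title" are hypothetical, and how the original job actually
manipulates the date is not shown in the thread.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class RunPaths {

    // Run dates are passed as yyyyMMdd, e.g. 20130113.
    private static final DateTimeFormatter RUN_DATE = DateTimeFormatter.BASIC_ISO_DATE;

    /** Input location for one day: <inputRoot>/<year>/<month>/<day>, no zero padding. */
    public static String inputPath(String inputRoot, String runDate) {
        LocalDate d = LocalDate.parse(runDate, RUN_DATE);
        return inputRoot + "/" + d.getYear() + "/" + d.getMonthValue() + "/" + d.getDayOfMonth();
    }

    /** Output location for one XML node and one day: <outputRoot>/<node>/<yyyyMMdd>. */
    public static String outputPath(String outputRoot, String nodeName, String runDate) {
        // Parsing first rejects malformed dates before they end up in a path.
        LocalDate d = LocalDate.parse(runDate, RUN_DATE);
        return outputRoot + "/" + nodeName + "/" + d.format(RUN_DATE);
    }

    public static void main(String[] args) {
        // prints s3://inputbucket/data/2013/1/13
        System.out.println(inputPath("s3://inputbucket/data", "20130113"));
        // prints s3://outputbucket/path/title/20130113
        System.out.println(outputPath("s3://outputbucket/path", "title", "20130113"));
    }
}
```
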
> However, for the next day's run, the job gets to
> FileOutputFormat.setOutputPath and sees that the output path
> (s3://outputbucket/path) already exists, and throws a
> FileAlreadyExistsException from FileOutputFormat.checkOutputSpecs() - even
> though my ultimate subdirectory, to be constructed by MultipleOutputs, does
> not already exist.
> Is there any way around this? I'm given hope by this, from
> http://hadoop.apache.org/docs/r1.0.3/api/org/apache/hadoop/fs/FileAlreadyExistsException.html:
> "public class FileAlreadyExistsException extends IOException - Used when
> target file already exists for any operation *and is not configured to be
> overwritten*" (my emphasis). Is it possible to deconfigure the overwrite
> protection?
> If not, I suppose one other way ahead is to create my own FileOutputFormat
> where the checkOutputSpecs() is a bit less fussy; another might be to write
> to a "temp" directory and programmatically move it to the desired output
> when the job completes successfully, although this is getting to feel a bit
> "hacky" to me.
> Thanks for any feedback!
> Tony
> ________________________________________
> From: Harsh J [[EMAIL PROTECTED]]
> Sent: 31 August 2012 10:47
> Subject: Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
> Good finding, that OF slipped my mind. We can mention on the
> MultipleOutputs javadocs for the new API to use the LazyOutputFormat for
> the job-level config. Please file a JIRA for this under MAPREDUCE project
> on the Apache JIRA?
> On Fri, Aug 31, 2012 at 2:32 PM, Tony Burton <[EMAIL PROTECTED]>
> wrote:
> > Hi Harsh,
> >
> > I tried using NullOutputFormat as you suggested, however simply using
> >
> > job.setOutputFormatClass(NullOutputFormat.class);
> >
> > resulted in no output at all. Although I've not tried overriding
> > getOutputCommitter in NullOutputFormat as you suggested, I discovered
> > LazyOutputFormat which only writes when it has to, "the output file is

Tony Burton 2013-02-01, 15:12