Tony Burton 2013-01-31, 16:22
Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
Alejandro Abdelnur 2013-01-31, 18:44
Hi Tony, from what I understand, your problem is not with MTOF but with
wanting to run 2 jobs using the same output directory: the second job will
fail because the output dir already exists. My take would be tweaking your
jobs to use a temp output dir, and moving the output to the required (final)
location upon completion.

thx

On Thu, Jan 31, 2013 at 8:22 AM, Tony Burton <[EMAIL PROTECTED]> wrote:

> Hi everyone,
>
> Some of you might recall this topic, which I worked on with the list's
> help back in August last year - see email trail below. Despite initial
> success of the discovery, I had to shelve the approach as I ended up using
> a different solution (for reasons I forget!) with the implementation that
> was ultimately used for that particular project.
>
> I'm now in a position to be working on a similar new task, where I've
> successfully implemented the combination of LazyOutputFormat and
> MultipleOutputs using hadoop 1.0.3 to write out to multiple custom output
> locations. However, I've hit another snag which I'm hoping you might help
> me work through.
>
> I'm going to be running daily tasks to extract data from XML files
> (specifically, the data stored in certain nodes of the XML), stored on AWS
> S3 using object names with the following format:
>
> s3://inputbucket/data/2013/1/13/<list of xml data files.bz2>
>
> I want to extract items from the XML and write out as follows:
>
> s3://outputbucket/path/<xml node name>/20130113/<output from MR job>
>
> For one day of data, this works fine. I pass in s3://inputbucket/data and
> s3://outputbucket/path as input and output arguments, along with my run
> date (20130113) which gets manipulated and appended where appropriate to
> form the precise read and write locations, for example
>
> FileInputFormat.setInputPaths(job, "s3://inputbucket/data");
> FileOutputFormat.setOutputPath(job, new Path("s3://outputbucket/path"));
>
> Then MultipleOutputs adds on my XML node names underneath
> s3://outputbucket/path automatically.
>
> However, for the next day's run, the job gets to
> FileOutputFormat.setOutputPath and sees that the output path
> (s3://outputbucket/path) already exists, and throws a
> FileAlreadyExistsException from FileOutputFormat.checkOutputSpecs() - even
> though my ultimate subdirectory, to be constructed by MultipleOutputs does
> not already exist.
>
> Is there any way around this? I'm given hope by this, from
> http://hadoop.apache.org/docs/r1.0.3/api/org/apache/hadoop/fs/FileAlreadyExistsException.html:
> "public class FileAlreadyExistsException extends IOException - Used when
> target file already exists for any operation *and is not configured to be
> overwritten*" (my emphasis). Is it possible to deconfigure the overwrite
> protection?
>
> If not, I suppose one other way ahead is to create my own FileOutputFormat
> where the checkOutputSpecs() is a bit less fussy; another might be to write
> to a "temp" directory and programmatically move it to the desired output
> when the job completes successfully, although this is getting to feel a bit
> "hacky" to me.
>
> Thanks for any feedback!
>
> Tony
>
> ________________________________________
> From: Harsh J [[EMAIL PROTECTED]]
> Sent: 31 August 2012 10:47
> To: [EMAIL PROTECTED]
> Subject: Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
>
> Good finding, that OF slipped my mind. We can mention on the
> MultipleOutputs javadocs for the new API to use the LazyOutputFormat for
> the job-level config. Please file a JIRA for this under MAPREDUCE project
> on the Apache JIRA?
>
> On Fri, Aug 31, 2012 at 2:32 PM, Tony Burton <[EMAIL PROTECTED]>
> wrote:
> > Hi Harsh,
> >
> > I tried using NullOutputFormat as you suggested, however simply using
> >
> > job.setOutputFormatClass(NullOutputFormat.class);
> >
> > resulted in no output at all.
> > Although I've not tried overriding
> > getOutputCommitter in NullOutputFormat as you suggested, I discovered
> > LazyOutputFormat which only writes when it has to, "the output file is

Alejandro
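
The temp-directory suggestion at the top of this reply can be sketched roughly as follows. This is only an illustration under assumptions that are not in the thread: it uses java.nio.file on a local filesystem as a stand-in for S3, the class name StagedOutput and the method promote are hypothetical, and it assumes the job wrote its MultipleOutputs results under <staging>/<xml node name>/<run date>/. On Hadoop the corresponding calls would be FileSystem.mkdirs and FileSystem.rename; as far as I know a rename on S3 is a copy followed by a delete, not an atomic move.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Hadoop: a local-filesystem stand-in for the
// "write to a temp dir, then move to the final location" step.
final class StagedOutput {

    private StagedOutput() {}

    /**
     * Moves each stagingRoot/<node>/<runDate> directory to finalRoot/<node>/<runDate>.
     * The move happens one level below the node directory, so node directories left
     * behind by earlier daily runs are reused instead of colliding.
     * Returns the number of node directories that were moved.
     */
    static int promote(Path stagingRoot, Path finalRoot, String runDate) throws IOException {
        int moved = 0;
        try (DirectoryStream<Path> nodeDirs =
                 Files.newDirectoryStream(stagingRoot, entry -> Files.isDirectory(entry))) {
            for (Path nodeDir : nodeDirs) {
                Path source = nodeDir.resolve(runDate);
                if (!Files.isDirectory(source)) {
                    continue; // this node produced no output for the run date
                }
                Path targetParent = finalRoot.resolve(nodeDir.getFileName());
                Files.createDirectories(targetParent); // no error if it already exists
                // Without REPLACE_EXISTING this throws FileAlreadyExistsException if the
                // run date was already promoted, so a finished day is never overwritten.
                Files.move(source, targetParent.resolve(runDate));
                moved++;
            }
        }
        return moved;
    }
}
```

In a real job, a step like promote would presumably run only after job.waitForCompletion(true) has returned true, with the job's output path pointing at a staging directory that is unique per run, so that checkOutputSpecs() never sees an existing directory.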
Tony Burton 2013-02-01, 15:12