Some of you might recall this topic, which I worked on with the list's help back in August last year (see the email trail below). Despite the initial success of that discovery, I had to shelve the approach, as the implementation ultimately used for that particular project took a different route (for reasons I forget!).
I'm now working on a similar new task, where I've successfully implemented the combination of LazyOutputFormat and MultipleOutputs using Hadoop 1.0.3 to write out to multiple custom output locations. However, I've hit another snag which I'm hoping you might help me work through.
I'm going to be running daily tasks to extract data from XML files (specifically, the data stored in certain nodes of the XML), stored on AWS S3 using object names with the following format:
s3://inputbucket/data/2013/1/13/<list of xml data files.bz2>
I want to extract items from the XML and write out as follows:
s3://outputbucket/path/<xml node name>/20130113/<output from MR job>
For one day of data, this works fine. I pass in s3://inputbucket/data and s3://outputbucket/path as input and output arguments, along with my run date (20130113) which gets manipulated and appended where appropriate to form the precise read and write locations, for example
FileInputFormat.setInputPaths(job, "s3://inputbucket/data");
Then MultipleOutputs adds on my XML node names underneath s3://outputbucket/path automatically.
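To make the date handling above concrete, here is a minimal sketch of how the run date might be turned into the read and write locations. The class and method names (RunDatePaths, inputPath, outputPath) are hypothetical, it assumes Java 8's java.time is available, and it only builds strings; it does not touch Hadoop or S3.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of Hadoop: derives the per-day locations
// from the base paths and a run date given as yyyyMMdd (e.g. "20130113").
final class RunDatePaths {
    private static final DateTimeFormatter RUN_DATE = DateTimeFormatter.BASIC_ISO_DATE;

    private RunDatePaths() {}

    // Input objects are laid out as <base>/<year>/<month>/<day>, with no zero padding.
    static String inputPath(String inputBase, String runDate) {
        LocalDate d = LocalDate.parse(runDate, RUN_DATE);
        return inputBase + "/" + d.getYear() + "/" + d.getMonthValue() + "/" + d.getDayOfMonth();
    }

    // Output is laid out as <base>/<xml node name>/<yyyyMMdd>.
    static String outputPath(String outputBase, String nodeName, String runDate) {
        // Parse and re-format so that a malformed run date fails early.
        LocalDate d = LocalDate.parse(runDate, RUN_DATE);
        return outputBase + "/" + nodeName + "/" + d.format(RUN_DATE);
    }
}
```

With the buckets above, inputPath("s3://inputbucket/data", "20130113") would give s3://inputbucket/data/2013/1/13, and outputPath("s3://outputbucket/path", nodeName, "20130113") would give the dated directory under whichever node name is passed in.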
However, for the next day's run, the job gets to FileOutputFormat.setOutputPath, sees that the output path (s3://outputbucket/path) already exists, and throws a FileAlreadyExistsException from FileOutputFormat.checkOutputSpecs() - even though my ultimate subdirectory, to be constructed by MultipleOutputs, does not already exist.
Is there any way around this? I'm given hope by this, from http://hadoop.apache.org/docs/r1.0.3/api/org/apache/hadoop/fs/FileAlreadyExistsException.html: "public class FileAlreadyExistsException extends IOException - Used when target file already exists for any operation *and is not configured to be overwritten*" (my emphasis). Is it possible to deconfigure the overwrite protection?
If not, I suppose one other way ahead is to create my own FileOutputFormat where the checkOutputSpecs() is a bit less fussy; another might be to write to a "temp" directory and programmatically move it to the desired output when the job completes successfully, although this is getting to feel a bit "hacky" to me.
Thanks for any feedback!
From: Harsh J [[EMAIL PROTECTED]]
Sent: 31 August 2012 10:47
To: [EMAIL PROTECTED]
Subject: Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
Good finding, that OF slipped my mind. We can mention on the MultipleOutputs javadocs for the new API to use the LazyOutputFormat for the job-level config. Please file a JIRA for this under MAPREDUCE project on the Apache JIRA?
On Fri, Aug 31, 2012 at 2:32 PM, Tony Burton <[EMAIL PROTECTED]> wrote:
> Hi Harsh,
> I tried using NullOutputFormat as you suggested, however simply using
> resulted in no output at all. Although I've not tried overriding getOutputCommitter in NullOutputFormat as you suggested, I discovered LazyOutputFormat which only writes when it has to, "the output file is created only when the first record is emitted for a given partition" (from "Hadoop: The Definitive Guide").
> Instead of
> use LazyOutputFormat like this:
> LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
> So now my unnamed MultipleOutputs are handling the segmented results, and LazyOutputFormat is suppressing the default output. Good job!
> From: Harsh J [[EMAIL PROTECTED]]
> Sent: 29 August 2012 17:05