HDFS >> mail # user >> RE: hadoop 1.0.3 equivalent of MultipleTextOutputFormat

RE: hadoop 1.0.3 equivalent of MultipleTextOutputFormat

Thanks for the reply, Alejandro. Using a temp output directory was my first guess as well. What's the best way to proceed? I've come across FileSystem.rename, but it consistently returns false for whatever Paths I provide. Specifically, I need to move the following:

s3://<path to data>/<tmp folder>/<object type 1>/part-00000
s3://<path to data>/<tmp folder>/<object type 1>/part-nnnnn
s3://<path to data>/<tmp folder>/<object type 2>/part-00000
s3://<path to data>/<tmp folder>/<object type 2>/part-nnnnn
s3://<path to data>/<tmp folder>/<object type m>/part-nnnnn

to:

s3://<path to data>/<object type 1>/part-00000
s3://<path to data>/<object type 1>/part-nnnnn
s3://<path to data>/<object type 2>/part-00000
s3://<path to data>/<object type 2>/part-nnnnn
s3://<path to data>/<object type m>/part-nnnnn

without doing a copyToLocal.

Any tips? Are there any better alternatives to FileSystem.rename? Or would using the AWS Java SDK be a better solution?
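A minimal sketch of the move step is below. The bucket layout, the "tmp" folder name, and the `PartMover`/`toFinalPath` names are placeholders for illustration, not anything from this thread; the Hadoop calls in the comments assume the standard 1.0.3 FileSystem API.

```java
// Sketch: compute the final location for each part file written under a
// temporary output folder. The path layout and names here are placeholders.
public class PartMover {

    // Drop the temp folder segment:
    //   s3://bucket/data/tmp/typeA/part-00000 -> s3://bucket/data/typeA/part-00000
    static String toFinalPath(String src, String tmpFolder) {
        return src.replace("/" + tmpFolder + "/", "/");
    }

    public static void main(String[] args) {
        String src = "s3://bucket/data/tmp/typeA/part-00000";
        System.out.println(toFinalPath(src, "tmp"));
        // prints s3://bucket/data/typeA/part-00000

        // With Hadoop's API the move itself would look roughly like:
        //   FileSystem fs = FileSystem.get(new URI(src), conf);
        //   boolean ok = fs.rename(new Path(src), new Path(toFinalPath(src, "tmp")));
        // Note that FileSystem.rename signals failure by returning false rather
        // than throwing; one common cause is the destination's parent directory
        // not existing yet, so an fs.mkdirs() on the target directory first may
        // be what's missing.
    }
}
```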


From: Alejandro Abdelnur [mailto:[EMAIL PROTECTED]]
Sent: 31 January 2013 18:45
Subject: Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat

Hi Tony, from what I understand your problem is not with MTOF but with running two jobs against the same output directory: the second job fails because the output dir already exists. My take would be to tweak your jobs to use a temp output dir, and to move the results to the required (final) location upon completion.

On Thu, Jan 31, 2013 at 8:22 AM, Tony Burton <[EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>> wrote:
Hi everyone,

Some of you might recall this topic, which I worked on with the list's help back in August last year - see email trail below. Despite the initial success of the discovery, I had to shelve the approach, as I ended up using a different solution (for reasons I forget!) in the implementation that was ultimately used for that particular project.

I'm now in a position to be working on a similar new task, where I've successfully implemented the combination of LazyOutputFormat and MultipleOutputs using hadoop 1.0.3 to write out to multiple custom output locations. However, I've hit another snag which I'm hoping you might help me work through.

I'm going to be running daily tasks to extract data from XML files (specifically, the data stored in certain nodes of the XML), stored on AWS S3 using object names with the following format:

s3://inputbucket/data/2013/1/13/<list of xml data files.bz2>

I want to extract items from the XML and write out as follows:

s3://outputbucket/path/<xml node name>/20130113/<output from MR job>

For one day of data, this works fine. I pass in s3://inputbucket/data and s3://outputbucket/path as input and output arguments, along with my run date (20130113) which gets manipulated and appended where appropriate to form the precise read and write locations, for example

FileInputFormat.setInputPaths(job, new Path("s3://inputbucket/data"));
FileOutputFormat.setOutputPath(job, new Path("s3://outputbucket/path"));

Then MultipleOutputs adds on my XML node names underneath s3://outputbucket/path automatically.
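For reference, the job-level wiring described above (LazyOutputFormat plus MultipleOutputs against the new API in 1.0.3) looks roughly like the following fragment. The named output "xmlNode" and the key/value types are illustrative assumptions, not from the original job:

```java
// Illustrative driver fragment: MultipleOutputs + LazyOutputFormat (new API).
// The named output "xmlNode" and the Text key/value types are placeholders.
Job job = new Job(conf, "xml-extract");
FileInputFormat.setInputPaths(job, new Path("s3://inputbucket/data"));
FileOutputFormat.setOutputPath(job, new Path("s3://outputbucket/path"));

// Suppress the default empty part-nnnnn files; output is only created
// when a named output is actually written to.
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);

// Register a named output; in the reducer,
//   mos.write(key, value, "<xml node name>/20130113/part");
// routes records under s3://outputbucket/path/<xml node name>/20130113/.
MultipleOutputs.addNamedOutput(job, "xmlNode", TextOutputFormat.class,
                               Text.class, Text.class);
```

The baseOutputPath form of MultipleOutputs.write is what creates the per-node subdirectories underneath the single configured output path.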

However, for the next day's run, the job gets to FileOutputFormat.setOutputPath, sees that the output path (s3://outputbucket/path) already exists, and throws a FileAlreadyExistsException from FileOutputFormat.checkOutputSpecs() - even though my ultimate subdirectory, to be constructed by MultipleOutputs, does not already exist.

Is there any way around this? I'm given hope by this, from http://hadoop.apache.org/docs/r1.0.3/api/org/apache/hadoop/fs/FileAlreadyExistsException.html: "public class FileAlreadyExistsException extends IOException - Used when target file already exists for any operation *and is not configured to be overwritten*" (my emphasis). Is it possible to deconfigure the overwrite protection?

If not, I suppose one other way ahead is to create my own FileOutputFormat where the checkOutputSpecs() is a bit less fussy; another might be to write to a "temp" directory and programmatically move it to the desired output when the job completes successfully, although this is getting to feel a bit "hacky" to me.
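The first of those options can be sketched as a thin subclass that relaxes the check. This is an untested assumption about how the override would look (the class name is made up), based on the 1.0.3 FileOutputFormat API:

```java
// Sketch of a TextOutputFormat subclass whose checkOutputSpecs() skips the
// existing-directory test, so that an already-present top-level output path
// is tolerated and only the MultipleOutputs subdirectories matter.
public class ExistingDirTextOutputFormat<K, V> extends TextOutputFormat<K, V> {
    @Override
    public void checkOutputSpecs(JobContext job) throws IOException {
        // Deliberately omit the FileAlreadyExistsException thrown by the
        // parent class when the output path already exists; keep only the
        // sanity check that an output path was configured at all.
        Path outDir = getOutputPath(job);
        if (outDir == null) {
            throw new InvalidJobConfException("Output directory not set.");
        }
    }
}
```

The trade-off is losing the safety net against two jobs genuinely clobbering each other's output, which is exactly what the check exists to prevent.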

Thanks for any feedback!


Sent: 31 August 2012 10:47
Subject: Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat

Good finding, that OF slipped my mind. We can mention in the MultipleOutputs javadocs for the new API that LazyOutputFormat should be used for the job-level config. Could you please file a JIRA for this under the MAPREDUCE project on the Apache JIRA?

On Fri, Aug 31, 2012 at 2:32 PM, Tony Burton <[EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>> wrote:

Harsh J