Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
Alejandro Abdelnur 2013-02-08, 19:06
Tony, I think the first step would be to verify if the S3 filesystem
implementation rename works as expected.
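
One minimal way to check, as a sketch (bucket and key names here are made up;
this just exercises FileSystem.rename against S3 and reports the result):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class S3RenameCheck {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      Path src = new Path("s3://somebucket/tmp/rename-src");  // made-up keys
      Path dst = new Path("s3://somebucket/tmp/rename-dst");
      FileSystem fs = src.getFileSystem(conf);
      fs.create(src).close();                  // write an empty test object
      boolean renamed = fs.rename(src, dst);   // the call under suspicion
      System.out.println("rename returned " + renamed
          + ", destination exists: " + fs.exists(dst));
    }
  }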

Thx
On Fri, Feb 1, 2013 at 7:12 AM, Tony Burton <[EMAIL PROTECTED]> wrote:

>
> Thanks for the reply Alejandro. Using a temp output directory was my first
> guess as well. What’s the best way to proceed? I’ve come across
> FileSystem.rename but it’s consistently returning false for whatever Paths
> I provide. Specifically, I need to copy the following:
>
> s3://<path to data>/<tmp folder>/<object type 1>/part-00000
> …
> s3://<path to data>/<tmp folder>/<object type 1>/part-nnnnn
> s3://<path to data>/<tmp folder>/<object type 2>/part-00000
> …
> s3://<path to data>/<tmp folder>/<object type 2>/part-nnnnn
> …
> s3://<path to data>/<tmp folder>/<object type m>/part-nnnnn
>
> to
>
> s3://<path to data>/<object type 1>/part-00000
> …
> s3://<path to data>/<object type 1>/part-nnnnn
> s3://<path to data>/<object type 2>/part-00000
> …
> s3://<path to data>/<object type 2>/part-nnnnn
> …
> s3://<path to data>/<object type m>/part-nnnnn
>
> without doing a copyToLocal.
>
> Any tips? Are there any better alternatives to FileSystem.rename? Or would
> using the AWS Java SDK be a better solution?
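>
> (A sketch of the intended move via FileSystem.rename - paths are made up,
> and the mkdirs() call reflects one common reason rename() returns false,
> a missing destination parent. Assumes org.apache.hadoop.fs imports and an
> enclosing method that throws IOException:)
>
>   Configuration conf = new Configuration();
>   Path tmpRoot = new Path("s3://somebucket/data/tmp");  // made-up paths
>   Path finalRoot = new Path("s3://somebucket/data");
>   FileSystem fs = tmpRoot.getFileSystem(conf);
>   // walk <tmp folder>/<object type>/part-* and rename each file to the
>   // same relative location under the final root
>   for (FileStatus typeDir : fs.listStatus(tmpRoot)) {
>     Path destDir = new Path(finalRoot, typeDir.getPath().getName());
>     fs.mkdirs(destDir);  // ensure the destination parent exists first
>     for (FileStatus part : fs.listStatus(typeDir.getPath())) {
>       Path dest = new Path(destDir, part.getPath().getName());
>       if (!fs.rename(part.getPath(), dest)) {
>         throw new IOException("rename failed: " + part.getPath());
>       }
>     }
>   }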
>
> Thanks!
>
> Tony
>
> From: Alejandro Abdelnur [mailto:[EMAIL PROTECTED]]
> Sent: 31 January 2013 18:45
> To: [EMAIL PROTECTED]
> Subject: Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
>
> Hi Tony, from what I understand your problem is not with MTOF but with
> wanting to run two jobs using the same output directory: the second job
> will fail because the output dir already exists. My take would be to tweak
> your jobs to use a temp output dir, and to move the output to the required
> (final) location upon completion.
>
> thx
>
> On Thu, Jan 31, 2013 at 8:22 AM, Tony Burton <[EMAIL PROTECTED]>
> wrote:
>
> Hi everyone,
>
> Some of you might recall this topic, which I worked on with the list's
> help back in August last year - see email trail below. Despite the initial
> success of the discovery, I had to shelve the approach as I ended up using
> a different solution (for reasons I forget!) in the implementation that
> was ultimately used for that particular project.
>
> I'm now in a position to be working on a similar new task, where I've
> successfully implemented the combination of LazyOutputFormat and
> MultipleOutputs using hadoop 1.0.3 to write out to multiple custom output
> locations. However, I've hit another snag which I'm hoping you might help
> me work through.
>
> I'm going to be running daily tasks to extract data from XML files
> (specifically, the data stored in certain nodes of the XML), stored on AWS
> S3 using object names with the following format:
>
> s3://inputbucket/data/2013/1/13/<list of xml data files.bz2>
>
> I want to extract items from the XML and write out as follows:
>
> s3://outputbucket/path/<xml node name>/20130113/<output from MR job>
>
> For one day of data, this works fine. I pass in s3://inputbucket/data and
> s3://outputbucket/path as input and output arguments, along with my run
> date (20130113) which gets manipulated and appended where appropriate to
> form the precise read and write locations, for example
>
> FileInputFormat.setInputPaths(job, "s3://inputbucket/data");
> FileOutputFormat.setOutputPath(job, new Path("s3://outputbucket/path"));
>
> Then MultipleOutputs adds on my XML node names underneath
> s3://outputbucket/path automatically.
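>
> (A sketch of that combination - the new-API classes in
> org.apache.hadoop.mapreduce.lib.output are assumed, and the type
> parameters and the nodeName() helper are illustrative:)
>
>   // driver: create part files lazily and set the root output path
>   LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
>   FileOutputFormat.setOutputPath(job, new Path("s3://outputbucket/path"));
>
>   // reducer: a '/' in baseOutputPath creates a subdirectory per XML node
>   private MultipleOutputs<Text, Text> mos;
>
>   protected void setup(Context context) {
>     mos = new MultipleOutputs<Text, Text>(context);
>   }
>
>   protected void reduce(Text key, Iterable<Text> values, Context context)
>       throws IOException, InterruptedException {
>     for (Text value : values) {
>       mos.write(key, value, nodeName(key) + "/part");  // hypothetical helper
>     }
>   }
>
>   protected void cleanup(Context context)
>       throws IOException, InterruptedException {
>     mos.close();  // flush the extra outputs
>   }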
>
> However, for the next day's run, the job gets to
> FileOutputFormat.setOutputPath and sees that the output path
> (s3://outputbucket/path) already exists, and throws a
> FileAlreadyExistsException from FileOutputFormat.checkOutputSpecs() - even

Alejandro