Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
Tony, I think the first step would be to verify that the S3 filesystem
implementation's rename works as expected.

Thx
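
A minimal sketch of such a check, assuming Hadoop 1.x FileSystem APIs; the
bucket and key names are placeholders, not paths from this thread:

    // Minimal check: does rename() behave as expected on the S3 filesystem?
    // Bucket and paths below are placeholders; substitute real ones.
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class S3RenameCheck {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("s3://mybucket"), conf);
            Path src = new Path("s3://mybucket/tmp/part-00000");
            Path dst = new Path("s3://mybucket/final/part-00000");
            // rename() signals most failures by returning false rather than
            // throwing, e.g. when src is missing or dst's parent doesn't exist.
            boolean ok = fs.rename(src, dst);
            System.out.println("rename returned: " + ok);
            fs.close();
        }
    }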
On Fri, Feb 1, 2013 at 7:12 AM, Tony Burton <[EMAIL PROTECTED]> wrote:

>
> Thanks for the reply Alejandro. Using a temp output directory was my first
> guess as well. What’s the best way to proceed? I’ve come across
> FileSystem.rename but it’s consistently returning false for whatever Paths
> I provide. Specifically, I need to copy the following:
>
> s3://<path to data>/<tmp folder>/<object type 1>/part-00000
> …
> s3://<path to data>/<tmp folder>/<object type 1>/part-nnnnn
> s3://<path to data>/<tmp folder>/<object type 2>/part-00000
> …
> s3://<path to data>/<tmp folder>/<object type 2>/part-nnnnn
> …
> s3://<path to data>/<tmp folder>/<object type m>/part-nnnnn
>
> to
>
> s3://<path to data>/<object type 1>/part-00000
> …
> s3://<path to data>/<object type 1>/part-nnnnn
> s3://<path to data>/<object type 2>/part-00000
> …
> s3://<path to data>/<object type 2>/part-nnnnn
> …
> s3://<path to data>/<object type m>/part-nnnnn
>
>
> without doing a copyToLocal.
>
> Any tips? Are there any better alternatives to FileSystem.rename? Or would
> using the AWS Java SDK be a better solution?
>
> Thanks!
>
> Tony
>
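One way the move might be attempted entirely within the FileSystem API,
without copyToLocal. This is a sketch only, with a hypothetical helper name;
a common reason rename() returns false is that the destination's parent
directory does not exist yet:

    // Hypothetical helper: move each <object type> subdirectory from the
    // temp output dir to the final location with FileSystem.rename().
    import java.io.IOException;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PromoteTempOutput {
        public static void promote(FileSystem fs, Path tmpDir, Path finalDir)
                throws IOException {
            // rename() returns false if the destination's parent is missing,
            // so create the final directory up front.
            fs.mkdirs(finalDir);
            for (FileStatus st : fs.listStatus(tmpDir)) { // one <object type> each
                Path dst = new Path(finalDir, st.getPath().getName());
                if (!fs.rename(st.getPath(), dst)) {
                    throw new IOException(
                        "rename failed: " + st.getPath() + " -> " + dst);
                }
            }
            fs.delete(tmpDir, true); // remove the now-empty temp dir
        }
    }

If rename() still fails against S3, a copy-then-delete via the AWS SDK is the
usual fallback, since S3 itself has no native rename.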
> From: Alejandro Abdelnur [mailto:[EMAIL PROTECTED]]
> Sent: 31 January 2013 18:45
> To: [EMAIL PROTECTED]
> Subject: Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
>
> Hi Tony, from what I understand your problem is not with MTOF but with
> wanting to run 2 jobs using the same output directory; the second job will
> fail because the output dir already exists. My take would be to tweak your
> jobs to use a temp output dir, and move the output to the required (final)
> location upon completion.
>
> thx
>
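A sketch of that temp-dir pattern in the driver, reusing the hypothetical
PromoteTempOutput helper sketched earlier on this page; the temp path layout
and runDate value are assumptions, not from the thread:

    // Write to a per-run temp dir; promote to the final location on success.
    String runDate = "20130113";
    Path tmpOut = new Path("s3://outputbucket/path/_tmp/" + runDate);
    FileOutputFormat.setOutputPath(job, tmpOut); // job writes under tmpOut
    if (job.waitForCompletion(true)) {
        FileSystem fs = tmpOut.getFileSystem(job.getConfiguration());
        PromoteTempOutput.promote(fs, tmpOut, new Path("s3://outputbucket/path"));
    }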
> On Thu, Jan 31, 2013 at 8:22 AM, Tony Burton <[EMAIL PROTECTED]>
> wrote:
>
> Hi everyone,
>
> Some of you might recall this topic, which I worked on with the list's
> help back in August last year - see email trail below. Despite the initial
> success of the discovery, I had to shelve the approach as I ended up using
> a different solution (for reasons I forget!) in the implementation that
> was ultimately used for that particular project.
>
> I'm now in a position to be working on a similar new task, where I've
> successfully implemented the combination of LazyOutputFormat and
> MultipleOutputs using hadoop 1.0.3 to write out to multiple custom output
> locations. However, I've hit another snag which I'm hoping you might help
> me work through.
>
> I'm going to be running daily tasks to extract data from XML files
> (specifically, the data stored in certain nodes of the XML), stored on AWS
> S3 using object names with the following format:
>
> s3://inputbucket/data/2013/1/13/<list of xml data files.bz2>
>
> I want to extract items from the XML and write out as follows:
>
> s3://outputbucket/path/<xml node name>/20130113/<output from MR job>
>
> For one day of data, this works fine. I pass in s3://inputbucket/data and
> s3://outputbucket/path as input and output arguments, along with my run
> date (20130113) which gets manipulated and appended where appropriate to
> form the precise read and write locations, for example
>
> FileInputFormat.addInputPath(job, new Path("s3://inputbucket/data"));
> FileOutputFormat.setOutputPath(job, new Path("s3://outputbucket/path"));
>
> Then MultipleOutputs adds on my XML node names underneath
> s3://outputbucket/path automatically.
>
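For context, a sketch of that LazyOutputFormat + MultipleOutputs wiring;
class and node names here are illustrative, not from Tony's actual job, and
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class) goes in
the driver so that empty default part files are suppressed:

    // Reducer sketch: MultipleOutputs routes each record into a
    // per-<xml node name> subdirectory under the job's output path.
    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    public class XmlNodeReducer extends Reducer<Text, Text, Text, Text> {
        private MultipleOutputs<Text, Text> mos;

        @Override
        protected void setup(Context context) {
            mos = new MultipleOutputs<Text, Text>(context);
        }

        @Override
        protected void reduce(Text nodeName, Iterable<Text> values,
                Context context) throws IOException, InterruptedException {
            for (Text v : values) {
                // a '/' in baseOutputPath yields <output>/<node name>/part-r-nnnnn
                mos.write(nodeName, v, nodeName.toString() + "/part");
            }
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            mos.close(); // required, or output may be incomplete
        }
    }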
> However, for the next day's run, the job gets to
> FileOutputFormat.setOutputPath and sees that the output path
> (s3://outputbucket/path) already exists, and throws a
> FileAlreadyExistsException from FileOutputFormat.checkOutputSpecs() - even

Alejandro