HDFS >> mail # user >> RE: hadoop 1.0.3 equivalent of MultipleTextOutputFormat


Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
Hi Tony, from what I understand your problem is not with MTOF but with
wanting to run two jobs using the same output directory: the second job
fails because the output dir already exists. My take would be to tweak your
jobs to use a temp output dir, and to move the results to the required
(final) location upon completion.
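
The suggestion above might be sketched as follows (the temp path, `runDate`, and the promotion loop are assumptions, not code from the thread):

```java
// Run the job against a unique temp dir, then promote the results to the
// shared final root only after successful completion.
Path tmp = new Path("s3://outputbucket/tmp/" + runDate);
FileOutputFormat.setOutputPath(job, tmp);
if (job.waitForCompletion(true)) {
    FileSystem fs = FileSystem.get(tmp.toUri(), job.getConfiguration());
    for (FileStatus st : fs.listStatus(tmp)) {
        // move each per-node sub-dir under the final output root
        fs.rename(st.getPath(),
                  new Path("s3://outputbucket/path/" + st.getPath().getName()));
    }
    fs.delete(tmp, true); // clean up the temp dir
}
```

Note that on S3 a rename is really a copy, and rename semantics when the destination directory already exists vary by filesystem, so this is only a starting point to verify against your setup.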

thx

On Thu, Jan 31, 2013 at 8:22 AM, Tony Burton <[EMAIL PROTECTED]>wrote:

> Hi everyone,
>
> Some of you might recall this topic, which I worked on with the list's
> help back in August last year - see email trail below. Despite initial
> success of the discovery, I had to shelve the approach as I ended up using
> a different solution (for reasons I forget!) with the implementation that
> was ultimately used for that particular project.
>
> I'm now in a position to be working on a similar new task, where I've
> successfully implemented the combination of LazyOutputFormat and
> MultipleOutputs using hadoop 1.0.3 to write out to multiple custom output
> locations. However, I've hit another snag which I'm hoping you might help
> me work through.
>
> I'm going to be running daily tasks to extract data from XML files
> (specifically, the data stored in certain nodes of the XML), stored on AWS
> S3 using object names with the following format:
>
> s3://inputbucket/data/2013/1/13/<list of xml data files.bz2>
>
> I want to extract items from the XML and write out as follows:
>
> s3://outputbucket/path/<xml node name>/20130113/<output from MR job>
>
> For one day of data, this works fine. I pass in s3://inputbucket/data and
> s3://outputbucket/path as input and output arguments, along with my run
> date (20130113) which gets manipulated and appended where appropriate to
> form the precise read and write locations, for example
>
> FileInputFormat.setInputPaths(job, "s3://inputbucket/data");
> FileOutputFormat.setOutputPath(job, new Path("s3://outputbucket/path"));
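
The run-date manipulation described above might look like this (a hypothetical helper, not code from the thread, matching the non-zero-padded `2013/1/13` input layout):

```java
// Derive the day's input prefix from the 8-digit run date, e.g.
// "20130113" -> "<base>/2013/1/13" (month and day without leading zeros).
public class RunDatePaths {
    static String inputPrefix(String base, String runDate) {
        int year = Integer.parseInt(runDate.substring(0, 4));
        int month = Integer.parseInt(runDate.substring(4, 6));
        int day = Integer.parseInt(runDate.substring(6, 8));
        return String.format("%s/%d/%d/%d", base, year, month, day);
    }

    public static void main(String[] args) {
        System.out.println(inputPrefix("s3://inputbucket/data", "20130113"));
        // s3://inputbucket/data/2013/1/13
    }
}
```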
>
> Then MultipleOutputs adds on my XML node names underneath
> s3://outputbucket/path automatically.
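
For reference, the combination described above might be wired up like this (`nodeName`, `runDate`, and `mos` are assumed names, not code from the thread):

```java
// Driver: LazyOutputFormat suppresses empty part-* files at the output root.
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);

// Reducer (with mos = new MultipleOutputs<Text, Text>(context) in setup()):
// the baseOutputPath argument creates <xml node name>/<run date>/part-r-*
// under the directory given to FileOutputFormat.setOutputPath().
mos.write(key, value, nodeName + "/" + runDate + "/part");
```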
>
> However, for the next day's run, the job gets to
> FileOutputFormat.setOutputPath and sees that the output path
> (s3://outputbucket/path) already exists, and throws a
> FileAlreadyExistsException from FileOutputFormat.checkOutputSpecs() - even
> though my ultimate subdirectory, to be constructed by MultipleOutputs, does
> not already exist.
>
> Is there any way around this? I'm given hope by this, from
> http://hadoop.apache.org/docs/r1.0.3/api/org/apache/hadoop/fs/FileAlreadyExistsException.html:
> "public class FileAlreadyExistsException extends IOException - Used when
> target file already exists for any operation *and is not configured to be
> overwritten*" (my emphasis). Is it possible to deconfigure the overwrite
> protection?
>
> If not, I suppose one other way ahead is to create my own FileOutputFormat
> where the checkOutputSpecs() is a bit less fussy; another might be to write
> to a "temp" directory and programmatically move it to the desired output
> when the job completes successfully, although this is getting to feel a bit
> "hacky" to me.
>
> Thanks for any feedback!
>
> Tony
>
>
>
>
>
>
>
> ________________________________________
> From: Harsh J [[EMAIL PROTECTED]]
> Sent: 31 August 2012 10:47
> To: [EMAIL PROTECTED]
> Subject: Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
>
> Good finding, that OF slipped my mind. We can mention on the
> MultipleOutputs javadocs for the new API to use the LazyOutputFormat for
> the job-level config. Please file a JIRA for this under MAPREDUCE project
> on the Apache JIRA?
>
> On Fri, Aug 31, 2012 at 2:32 PM, Tony Burton <[EMAIL PROTECTED]>
> wrote:
> > Hi Harsh,
> >
> > I tried using NullOutputFormat as you suggested, however simply using
> >
> > job.setOutputFormatClass(NullOutputFormat.class);
> >
> > resulted in no output at all. Although I've not tried overriding
> getOutputCommitter in NullOutputFormat as you suggested, I discovered
> LazyOutputFormat which only writes when it has to, "the output file is

Alejandro