Re: Incrementally adding to existing output directory
Hi Devaraj,

Thanks for the advice. That did the trick.

Thanks,
Max Lebedev
On Wed, Jul 17, 2013 at 10:51 PM, Devaraj k <[EMAIL PROTECTED]> wrote:

> It seems the job is not picking up the CustomOutputFormat. You need to
> set the custom output format class for your job using the
> org.apache.hadoop.mapred.JobConf.setOutputFormat(Class<? extends OutputFormat> theClass)
> API.
>
> If no OutputFormat is set for the job, it defaults to TextOutputFormat,
> which extends FileOutputFormat. That is why the exception below still
> shows FileOutputFormat being used.
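
The fallback described above can be illustrated with a small plain-Java model. This is only a sketch: ModelFormat, ModelCustomFormat, and ModelJob are hypothetical stand-ins invented for illustration, not Hadoop classes. It assumes that a static setOutputPath-style helper only records the path and does not select the format, which is consistent with the stack trace in the quoted message. With the new-API Job class that appears in that stack trace, the corresponding setter is Job.setOutputFormatClass.

```java
// Plain-Java model only; these classes are hypothetical stand-ins, not Hadoop APIs.
class ModelFormat {
    // The default format in this model rejects an output directory that already exists.
    boolean allowsExistingOutputDir() {
        return false;
    }
}

class ModelCustomFormat extends ModelFormat {
    @Override
    boolean allowsExistingOutputDir() {
        return true;
    }

    // Analogous to a static setOutputPath helper: it records the path on the job
    // but does not tell the job which format to use.
    static void setOutputPath(ModelJob job, String path) {
        job.outputPath = path;
    }
}

class ModelJob {
    String outputPath;
    // Used whenever no format has been registered explicitly.
    private ModelFormat format = new ModelFormat();

    void setOutputFormat(ModelFormat format) {
        this.format = format;
    }

    // The check at submit time is delegated to whichever format is registered.
    boolean submitAllowsExistingDir() {
        return format.allowsExistingOutputDir();
    }
}

class DefaultFormatDemo {
    public static void main(String[] args) {
        ModelJob job = new ModelJob();
        ModelCustomFormat.setOutputPath(job, "outDir");
        // The default format is still in effect here.
        System.out.println(job.submitAllowsExistingDir());
        job.setOutputFormat(new ModelCustomFormat());
        // Now the custom format performs the check.
        System.out.println(job.submitAllowsExistingDir());
    }
}
```

In this model, calling the static helper through the custom class changes nothing about which check runs; only the explicit registration does.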
>
> Thanks,
> Devaraj k
>
> *From:* Max Lebedev [mailto:[EMAIL PROTECTED]]
> *Sent:* 18 July 2013 01:03
> *To:* [EMAIL PROTECTED]
> *Subject:* Re: Incrementally adding to existing output directory
>
> Hi Devaraj,
>
> Thank you very much for your help. I've created a CustomOutputFormat which
> is almost identical to FileOutputFormat, as seen here:
> <http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-320/org/apache/hadoop/mapreduce/lib/output/FileOutputFormat.java>
> except that I've removed line 125, which throws the FileAlreadyExistsException.
> However, when I try to run my code, I get this error:
>
> Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory outDir already exists
>     at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:137)
>     at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:887)
>     at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:396)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
>     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
>     at org.apache.hadoop.mapreduce.Job.submit(Job.java:500)
>     at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:530)
>     ...
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>
> In my source code, I've changed "FileOutputFormat.setOutputPath" to
> "CustomOutputFormat.setOutputPath".
>
> Is it the case that FileOutputFormat.checkOutputSpecs is happening
> somewhere else, or have I done something wrong?
> I also don't quite understand your suggestion about MultipleOutputs. Would
> you mind elaborating?
>
> Thanks,
> Max Lebedev
>
> On Tue, Jul 16, 2013 at 9:42 PM, Devaraj k <[EMAIL PROTECTED]> wrote:
>
> Hi Max,
>
> It can be done by customizing the output format class for your job
> according to your requirements. You could refer to the
> OutputFormat.checkOutputSpecs(JobContext context) method, which checks the
> output specification. You can override this in your custom OutputFormat. You
> can also look at the MultipleOutputs class for implementation details on how
> it could be done.
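
As a rough illustration of the override suggested above, here is a plain-Java sketch that does not use Hadoop at all. StrictOutputCheck and AppendableOutputCheck are hypothetical names, and the single Path parameter is a simplification of the real checkOutputSpecs(JobContext context) signature mentioned in the message. It only shows the idea of relaxing an "output directory must not exist" check in a subclass.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Plain-Java stand-in for the default check; NOT the Hadoop FileOutputFormat.
class StrictOutputCheck {
    void checkOutputSpecs(Path outDir) {
        if (outDir == null) {
            throw new IllegalArgumentException("Output directory not set");
        }
        // Default behaviour in this model: refuse to reuse an existing path.
        if (Files.exists(outDir)) {
            throw new IllegalStateException("Output directory " + outDir + " already exists");
        }
    }
}

// Hypothetical override that tolerates an existing directory so that
// later runs can add files to it.
class AppendableOutputCheck extends StrictOutputCheck {
    @Override
    void checkOutputSpecs(Path outDir) {
        if (outDir == null) {
            throw new IllegalArgumentException("Output directory not set");
        }
        // Still reject a path that exists but is not a directory.
        if (Files.exists(outDir) && !Files.isDirectory(outDir)) {
            throw new IllegalStateException("Output path " + outDir + " exists but is not a directory");
        }
    }
}

class OutputCheckDemo {
    // Returns true when the given check accepts the path.
    static boolean accepts(StrictOutputCheck check, Path outDir) {
        try {
            check.checkOutputSpecs(outDir);
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The system temp directory is used here simply as a directory that already exists.
        Path existing = Paths.get(System.getProperty("java.io.tmpdir"));
        System.out.println("strict accepts existing dir: " + accepts(new StrictOutputCheck(), existing));
        System.out.println("appendable accepts existing dir: " + accepts(new AppendableOutputCheck(), existing));
    }
}
```

Relaxing the check only removes the up-front refusal. Avoiding name clashes between files written by different runs is a separate concern that this sketch does not address.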
>
> Thanks,
> Devaraj k
>
> *From:* Max Lebedev [mailto:[EMAIL PROTECTED]]
> *Sent:* 16 July 2013 23:33
> *To:* [EMAIL PROTECTED]
> *Subject:* Incrementally adding to existing output directory
>
> Hi,
>
> I'm trying to figure out how to incrementally add to an existing output
> directory using MapReduce.
>
> I cannot specify the exact output path, as data in the input is sorted
> into categories and then written to different directories based on the
> contents (in the examples below, token=AAAA or token=BBBB).
>
> As an example:
>
> When using MultipleOutput and provided that outDir does not exist yet, the