HDFS >> mail # user >> Re: Incrementally adding to existing output directory


Re: Incrementally adding to existing output directory
Hi Devaraj,

Thanks for the advice. That did the trick.

Thanks,
Max Lebedev
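
The behaviour Devaraj describes in the quoted reply below can be modelled without Hadoop on the classpath. What follows is only a sketch under stated assumptions: every class in it (FileOutputFormatModel, TextOutputFormatModel, CustomOutputFormatModel, JobModel) is a hypothetical stand-in written for illustration, not Hadoop code. It tries to show why calling the static setOutputPath through the subclass name is not enough on its own: the format registered on the job decides which checkOutputSpecs runs.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Dependency-free model; all classes here are hypothetical stand-ins, not Hadoop code.
final class OutputFormatDispatchDemo {

    /** Stand-in for FileOutputFormat: refuses to write into an existing directory. */
    static class FileOutputFormatModel {
        // Static helper: it only records the path in the configuration.
        // Calling it through a subclass name does not select that subclass.
        static void setOutputPath(Map<String, String> conf, String dir) {
            conf.put("output.dir", dir);
        }

        void checkOutputSpecs(Map<String, String> conf, Set<String> existingDirs) {
            String dir = conf.get("output.dir");
            if (existingDirs.contains(dir)) {
                throw new IllegalStateException("Output directory " + dir + " already exists");
            }
        }
    }

    /** Stand-in for the default TextOutputFormat, which inherits the check. */
    static class TextOutputFormatModel extends FileOutputFormatModel {}

    /** Stand-in for the thread's CustomOutputFormat: the existence check is dropped. */
    static class CustomOutputFormatModel extends FileOutputFormatModel {
        @Override
        void checkOutputSpecs(Map<String, String> conf, Set<String> existingDirs) {
            // Intentionally empty: an existing output directory is accepted.
        }
    }

    /** Stand-in for the job: the registered format drives the check at submit time. */
    static final class JobModel {
        final Map<String, String> conf = new HashMap<>();
        private FileOutputFormatModel outputFormat = new TextOutputFormatModel(); // default

        void setOutputFormat(FileOutputFormatModel format) {
            this.outputFormat = format;
        }

        void submit(Set<String> existingDirs) {
            outputFormat.checkOutputSpecs(conf, existingDirs);
        }
    }

    /** Submits against an already existing "outDir"; returns "submitted" or the error text. */
    static String trySubmit(boolean registerCustomFormat) {
        Set<String> existingDirs = new HashSet<>();
        existingDirs.add("outDir");

        JobModel job = new JobModel();
        // Static call via the subclass name; it resolves to FileOutputFormatModel.setOutputPath.
        CustomOutputFormatModel.setOutputPath(job.conf, "outDir");
        if (registerCustomFormat) {
            job.setOutputFormat(new CustomOutputFormatModel());
        }
        try {
            job.submit(existingDirs);
            return "submitted";
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(trySubmit(false)); // Output directory outDir already exists
        System.out.println(trySubmit(true));  // submitted
    }
}
```

In Hadoop itself, the registration step is the JobConf.setOutputFormat call named in the quoted reply (old org.apache.hadoop.mapred API). With the org.apache.hadoop.mapreduce.Job class that appears in the stack trace, the corresponding call is Job.setOutputFormatClass.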
On Wed, Jul 17, 2013 at 10:51 PM, Devaraj k <[EMAIL PROTECTED]> wrote:

> It seems the job is not picking up your CustomOutputFormat. You need to
> set the custom output format class for your job using the
> org.apache.hadoop.mapred.JobConf.setOutputFormat(Class<? extends OutputFormat> theClass)
> API.
>
> If no OutputFormat is set for the job, it defaults to TextOutputFormat,
> which internally extends FileOutputFormat. That is why the exception below
> still shows FileOutputFormat being used.
>
> Thanks
> Devaraj k
>
> *From:* Max Lebedev [mailto:[EMAIL PROTECTED]]
> *Sent:* 18 July 2013 01:03
> *To:* [EMAIL PROTECTED]
> *Subject:* Re: Incrementally adding to existing output directory
>
> Hi Devaraj,
>
> Thank you very much for your help. I've created a CustomOutputFormat which
> is almost identical to FileOutputFormat, as seen here <http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-320/org/apache/hadoop/mapreduce/lib/output/FileOutputFormat.java>,
> except that I've removed line 125, which throws the FileAlreadyExistsException.
> However, when I try to run my code, I get this error:
>
> Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory outDir already exists
>         at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:137)
>         at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:887)
>         at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:396)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
>         at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
>         at org.apache.hadoop.mapreduce.Job.submit(Job.java:500)
>         at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:530)
>         ...
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>
> In my source code, I've changed "FileOutputFormat.setOutputPath" to
> "CustomOutputFormat.setOutputPath"
>
> Is it the case that FileOutputFormat.checkOutputSpecs is happening
> somewhere else, or have I done something wrong?
> I also don't quite understand your suggestion about MultipleOutputs. Would
> you mind elaborating?
>
> Thanks,
> Max Lebedev
>
> On Tue, Jul 16, 2013 at 9:42 PM, Devaraj k <[EMAIL PROTECTED]> wrote:
>
> Hi Max,
>
> This can be done by customizing the output format class for your job to
> suit your needs. Have a look at the
> OutputFormat.checkOutputSpecs(JobContext context) method, which checks the
> output specification; you can override it in your custom OutputFormat. You
> can also look at the MultipleOutputs class for implementation details on
> how it could be done.
>
> Thanks
> Devaraj k
>
> *From:* Max Lebedev [mailto:[EMAIL PROTECTED]]
> *Sent:* 16 July 2013 23:33
> *To:* [EMAIL PROTECTED]
> *Subject:* Incrementally adding to existing output directory
>
> Hi
>
> I'm trying to figure out how to incrementally add to an existing output
> directory using MapReduce.
>
> I cannot specify the exact output path, as data in the input is sorted
> into categories and then written to different directories based on the
> contents (in the examples below, token=AAAA or token=BBBB).
>
> As an example:
>
> When using MultipleOutput and provided that outDir does not exist yet, the