Re: more reduce tasks (Hadoop user mailing list)


Thread:
  Vinod Kumar Vavilapalli   2013-01-04, 04:45
  Pavel Hančar              2013-01-04, 08:35
  Harsh J                   2013-01-05, 07:57
  Pavel Hančar              2013-01-05, 14:32
  Chen He                   2013-01-04, 04:55
  bejoy.hadoop@...          2013-01-04, 05:24
  Chen He                   2013-01-04, 05:32

Re: more reduce tasks
You could create a CustomOutputCommitter and in the commitJob() method
simply read in the part-* files and write them out into a single aggregated
file.

This requires creating a CustomOutputFormat class that uses the
CustomOutputCommitter and then setting it
via job.setOutputFormatClass(CustomOutputFormat.class).
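
For concreteness, here is a minimal sketch of that idea against the new
(org.apache.hadoop.mapreduce) API. The class names and the ".merged" file
name are illustrative, FileUtil.copyMerge() is just one way to do the
concatenation, and on 0.20.x the hook to override is cleanupJob() rather
than commitJob():

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class CustomOutputFormat<K, V> extends TextOutputFormat<K, V> {

  @Override
  public synchronized OutputCommitter getOutputCommitter(TaskAttemptContext context)
      throws IOException {
    // Hand the job our committer instead of the stock FileOutputCommitter.
    return new CustomOutputCommitter(getOutputPath(context), context);
  }

  public static class CustomOutputCommitter extends FileOutputCommitter {

    private final Path outputDir;

    public CustomOutputCommitter(Path outputDir, TaskAttemptContext context)
        throws IOException {
      super(outputDir, context);
      this.outputDir = outputDir;
    }

    @Override
    public void commitJob(JobContext context) throws IOException {
      // First let the default committer promote the part-* files from the
      // temporary attempt directories into the job output directory.
      super.commitJob(context);

      // Then concatenate everything in the output directory into a single
      // file placed next to it ("<output>.merged" is an arbitrary name).
      Configuration conf = context.getConfiguration();
      FileSystem fs = outputDir.getFileSystem(conf);
      Path merged = new Path(outputDir.getParent(), outputDir.getName() + ".merged");
      FileUtil.copyMerge(fs, outputDir, fs, merged, false, conf, null);
    }
  }
}

Wire it up in the driver with
job.setOutputFormatClass(CustomOutputFormat.class). Keep in mind the merge
runs in a single process at job commit time, so it buys convenience, not
parallelism.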

See these classes:
http://hadoop.apache.org/docs/r0.20.2/api/org/apache/hadoop/mapred/FileOutputCommitter.html

http://hadoop.apache.org/docs/r0.20.2/api/org/apache/hadoop/mapred/FileOutputFormat.html

http://hadoop.apache.org/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.html#setOutputFormat(java.lang.Class)

- Robert

On Thu, Jan 3, 2013 at 3:11 PM, Pavel Hančar <[EMAIL PROTECTED]> wrote:

>   Hello,
> I'd like to use more than one reduce task with Hadoop Streaming and I'd
> like to have only one result file. Is that possible? Or should I run one
> more job to merge the results? And is it the same with non-streaming jobs?
> As you can see below, I get 5 result files with mapred.reduce.tasks=5.
>
> $ hadoop jar
> /packages/run.64/hadoop-0.20.2-cdh3u1/contrib/streaming/hadoop-streaming-0.20.2-cdh3u1.jar
> -D mapred.reduce.tasks=5 -mapper /bin/cat -reducer /tmp/wcc -file /tmp/wcc
> -file /bin/cat -input /user/hadoopnlp/1gb -output 1gb.wc
> .
> .
> .
> 13/01/03 22:00:03 INFO streaming.StreamJob:  map 100%  reduce 100%
> 13/01/03 22:00:07 INFO streaming.StreamJob: Job complete:
> job_201301021717_0038
> 13/01/03 22:00:07 INFO streaming.StreamJob: Output: 1gb.wc
> $ hadoop dfs -cat 1gb.wc/part-*
> 472173052
> 165736187
> 201719914
> 184376668
> 163872819
> $
>
> where /tmp/wcc contains
> #!/bin/bash
> wc -c
>
> Thanks for any answer,
>  Pavel Hančar