Vinod Kumar Vavilapalli 2013-01-04, 04:45
Re: more reduce tasks
Hello,
thank you for the answer. Exactly: I want the parallelism but a single
final output. What do you mean by "another stage"? I thought I should set
mapred.reduce.tasks large enough and Hadoop would run the reducers in
however many rounds turned out to be optimal. But that isn't the case.
  When I tried to run the classical WordCount example and set the number
of reducers by JobConf.setNumReduceTasks(int n), it seemed to me I had the
final output (there were no word duplicates for the normal words -- only
some for strange words). So why doesn't Hadoop run the final reduce in my
simple streaming example?
  Thank you,
  Pavel Hančar
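
A minimal sketch of the "another stage" idea discussed below (not from the
original thread; the /tmp/sum script and the 1gb.wc.total output path are
illustrative): a second streaming job with a single reducer that sums the
partial byte counts from the first job's part files into one final number.

$ cat /tmp/sum
#!/bin/bash
# Illustrative helper (assumption, not from the thread): sum the first
# column of each input line, i.e. the per-reducer partial byte counts.
awk '{ s += $1 } END { print s }'
$ hadoop jar \
    /packages/run.64/hadoop-0.20.2-cdh3u1/contrib/streaming/hadoop-streaming-0.20.2-cdh3u1.jar \
    -D mapred.reduce.tasks=1 \
    -mapper /bin/cat \
    -reducer /tmp/sum -file /tmp/sum \
    -input 1gb.wc -output 1gb.wc.total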

2013/1/4 Vinod Kumar Vavilapalli <[EMAIL PROTECTED]>

>
> Is it that you want the parallelism but a single final output? Assuming
> your first job's reducers generate a small output, another stage is the way
> to go. If not, a second stage won't help. What exactly are your objectives?
>
> Thanks,
> +Vinod
>
> On Jan 3, 2013, at 1:11 PM, Pavel Hančar wrote:
>
>   Hello,
> I'd like to use more than one reduce task with Hadoop Streaming, and I'd
> like to have only one result. Is it possible? Or should I run one more job
> to merge the results? And is it the same with non-streaming jobs? Below you
> can see I get 5 results for mapred.reduce.tasks=5.
>
> $ hadoop jar
> /packages/run.64/hadoop-0.20.2-cdh3u1/contrib/streaming/hadoop-streaming-0.20.2-cdh3u1.jar
> -D mapred.reduce.tasks=5 -mapper /bin/cat -reducer /tmp/wcc -file /tmp/wcc
> -file /bin/cat -input /user/hadoopnlp/1gb -output 1gb.wc
> .
> .
> .
> 13/01/03 22:00:03 INFO streaming.StreamJob:  map 100%  reduce 100%
> 13/01/03 22:00:07 INFO streaming.StreamJob: Job complete:
> job_201301021717_0038
> 13/01/03 22:00:07 INFO streaming.StreamJob: Output: 1gb.wc
> $ hadoop dfs -cat 1gb.wc/part-*
> 472173052
> 165736187
> 201719914
> 184376668
> 163872819
> $
>
> where /tmp/wcc contains
> #!/bin/bash
> # count the bytes on stdin
> wc -c
>
> Thanks for any answer,
>  Pavel Hančar
>
>
>
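
If the aim is only a single concatenated file rather than a further
aggregation step, the part files can also be merged on the client side:
hadoop fs -getmerge copies every part-* file in an output directory into
one local file (the local destination path here is illustrative).

$ hadoop fs -getmerge 1gb.wc /tmp/1gb.wc.merged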
Other messages in this thread:
Harsh J 2013-01-05, 07:57
Pavel Hančar 2013-01-05, 14:32
Chen He 2013-01-04, 04:55
bejoy.hadoop@... 2013-01-04, 05:24
Chen He 2013-01-04, 05:32
Robert Dyer 2013-01-04, 05:55