

Re: more reduce tasks
What do you mean by a "final reduce"? Not all jobs require the final
output to be a single file, since the reduce phase is designed to work
on a per-partition basis (which is also why the output files are named
part-*). A job has only one reduce phase, in which the reducers all
work independently and complete.

If you need a single result assembled in the order of the partitions
created, rely on the solutions already suggested, such as a second step
of fs -getmerge, a call to the same from a custom FileOutputCommitter,
etc.
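
For example, a rough sketch of the getmerge step, assuming the 1gb.wc
output directory from the streaming job quoted below (the local
destination path is only illustrative):

$ hadoop fs -getmerge 1gb.wc /tmp/1gb.wc.merged
$ cat /tmp/1gb.wc.merged

getmerge concatenates all the part-* files in the source directory into
a single file on the local filesystem.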

On Fri, Jan 4, 2013 at 2:05 PM, Pavel Hančar <[EMAIL PROTECTED]> wrote:
>   Hello,
> thank you for the answer. Exactly: I want the parallelism but a single final
> output. What do you mean by "another stage"? I thought I could set
> mapred.reduce.tasks large enough and Hadoop would run the reducers in however
> many rounds were optimal, but that isn't the case.
>   When I ran the classic WordCount example and set this via
> JobConf.setNumReduceTasks(int n), it seemed to me I got the final output
> (there were no duplicate entries for normal words -- only some for strange
> words). So why doesn't Hadoop run the final reduce in my simple
> streaming example?
>   Thank you,
>   Pavel Hančar
>
> 2013/1/4 Vinod Kumar Vavilapalli <[EMAIL PROTECTED]>
>>
>>
>> Is it that you want the parallelism but a single final output? Assuming
>> your first job's reducers generate a small output, another stage is the way
>> to go. If not, a second stage won't help. What exactly are your objectives?
>>
>> Thanks,
>> +Vinod
>>
>> On Jan 3, 2013, at 1:11 PM, Pavel Hančar wrote:
>>
>>   Hello,
>> I'd like to use more than one reduce task with Hadoop Streaming, but I'd
>> like to have only one result. Is that possible? Or should I run one more job
>> to merge the results? And is it the same with non-streaming jobs? Below you
>> can see that I get 5 results for mapred.reduce.tasks=5.
>>
>> $ hadoop jar
>> /packages/run.64/hadoop-0.20.2-cdh3u1/contrib/streaming/hadoop-streaming-0.20.2-cdh3u1.jar
>> -D mapred.reduce.tasks=5 -mapper /bin/cat -reducer /tmp/wcc -file /tmp/wcc
>> -file /bin/cat -input /user/hadoopnlp/1gb -output 1gb.wc
>> .
>> .
>> .
>> 13/01/03 22:00:03 INFO streaming.StreamJob:  map 100%  reduce 100%
>> 13/01/03 22:00:07 INFO streaming.StreamJob: Job complete:
>> job_201301021717_0038
>> 13/01/03 22:00:07 INFO streaming.StreamJob: Output: 1gb.wc
>> $ hadoop dfs -cat 1gb.wc/part-*
>> 472173052
>> 165736187
>> 201719914
>> 184376668
>> 163872819
>> $
>>
>> where /tmp/wcc contains
>> #!/bin/bash
>> wc -c
>>
>> Thanks for any answer,
>>  Pavel Hančar
>>
>>
>
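
For reference, a rough sketch of the second-stage approach discussed
above: a follow-up streaming job over the first job's output, with a
single reducer, collapses the five partial counts into one total. The
/tmp/sum.sh script is hypothetical and not part of the original thread:

$ hadoop jar \
    /packages/run.64/hadoop-0.20.2-cdh3u1/contrib/streaming/hadoop-streaming-0.20.2-cdh3u1.jar \
    -D mapred.reduce.tasks=1 -mapper /bin/cat -reducer /tmp/sum.sh \
    -file /tmp/sum.sh -input 1gb.wc -output 1gb.wc.total
$ hadoop dfs -cat 1gb.wc.total/part-*

where /tmp/sum.sh simply sums the per-partition byte counts:

#!/bin/bash
awk '{ s += $1 } END { print s }'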

--
Harsh J