Hadoop, mail # user - Re: more reduce tasks

Harsh J 2013-01-05, 07:57
What do you mean by a "final reduce"? Not all jobs require the final
output to be a single file, since the reduce phase is designed to work
on a per-partition basis (which is also why the output files are named
part-*). A job has exactly one reduce phase, in which the reducers all
run independently and each writes its own output.
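
For illustration, with mapred.reduce.tasks=5 the output directory holds
one part file per reducer, roughly like this (a sketch; the exact
listing format varies by Hadoop version):

$ hadoop dfs -ls 1gb.wc
... 1gb.wc/part-00000
... 1gb.wc/part-00001
... 1gb.wc/part-00002
... 1gb.wc/part-00003
... 1gb.wc/part-00004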

If you need the results assembled in the order of the partitions
created, use the solutions already suggested in this thread: a second
step of fs -getmerge, a call to the same from a custom
FileOutputCommitter, etc.
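
For example, with the 1gb.wc output directory from the run quoted
below, a getmerge step concatenates the part files in name order into
a single local file (a sketch, assuming a local target path of
/tmp/1gb.wc.merged):

$ hadoop dfs -getmerge 1gb.wc /tmp/1gb.wc.merged
$ cat /tmp/1gb.wc.merged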

On Fri, Jan 4, 2013 at 2:05 PM, Pavel Hančar <[EMAIL PROTECTED]> wrote:
>   Hello,
> thank you for the answer. Exactly: I want the parallelism but a single final
> output. What do you mean by "another stage"? I thought I should set
> mapred.reduce.tasks large enough and Hadoop would run the reducers in however
> many rounds was optimal, but that isn't the case.
>   When I ran the classic WordCount example and set the number of reducers
> with JobConf.setNumReduceTasks(int n), it seemed to me that I got the final
> output (there were no duplicates for normal words -- only some for strange
> words). So why doesn't Hadoop run a final reduce in my simple streaming
> example?
>   Thank you,
>   Pavel Hančar
>
> 2013/1/4 Vinod Kumar Vavilapalli <[EMAIL PROTECTED]>
>>
>>
>> Is it that you want the parallelism but a single final output? Assuming
>> your first job's reducers generate a small output, another stage is the way
>> to go. If not, a second stage won't help. What exactly are your objectives?
>>
>> Thanks,
>> +Vinod
>>
>> On Jan 3, 2013, at 1:11 PM, Pavel Hančar wrote:
>>
>>   Hello,
>> I'd like to use more than one reduce task with Hadoop Streaming, but have
>> only one result file. Is that possible, or should I run one more job to
>> merge the results? And is it the same with non-streaming jobs? Below you can
>> see I get 5 result files for mapred.reduce.tasks=5.
>>
>> $ hadoop jar
>> /packages/run.64/hadoop-0.20.2-cdh3u1/contrib/streaming/hadoop-streaming-0.20.2-cdh3u1.jar
>> -D mapred.reduce.tasks=5 -mapper /bin/cat -reducer /tmp/wcc -file /tmp/wcc
>> -file /bin/cat -input /user/hadoopnlp/1gb -output 1gb.wc
>> .
>> .
>> .
>> 13/01/03 22:00:03 INFO streaming.StreamJob:  map 100%  reduce 100%
>> 13/01/03 22:00:07 INFO streaming.StreamJob: Job complete:
>> job_201301021717_0038
>> 13/01/03 22:00:07 INFO streaming.StreamJob: Output: 1gb.wc
>> $ hadoop dfs -cat 1gb.wc/part-*
>> 472173052
>> 165736187
>> 201719914
>> 184376668
>> 163872819
>> $
>>
>> where /tmp/wcc contains
>> #!/bin/bash
>> wc -c
>>
>> Thanks for any answer,
>>  Pavel Hančar
>>
>>
>
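
As a concrete sketch of the "another stage" suggestion above: a second
streaming job with a single reducer can sum the per-partition byte
counts (the jar path is the one from the run above; /tmp/sum is a
hypothetical helper script):

$ hadoop jar /packages/run.64/hadoop-0.20.2-cdh3u1/contrib/streaming/hadoop-streaming-0.20.2-cdh3u1.jar \
  -D mapred.reduce.tasks=1 -mapper /bin/cat -reducer /tmp/sum \
  -file /tmp/sum -input 1gb.wc -output 1gb.wc.total

where /tmp/sum contains

#!/bin/bash
# sum the one-number-per-line outputs of the first job's reducers
awk '{ s += $1 } END { print s }'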

--
Harsh J