Re: more reduce tasks
Hi Bejoy

Thank you for your idea.

The Hadoop patch I mentioned would make this merge happen during the
output-writing process.

Regards!

Chen
On Jan 3, 2013 11:25 PM, <[EMAIL PROTECTED]> wrote:

> Hi Chen,
>
> You do have an option in Hadoop to achieve this if you want the merged
> file in the local file system (LFS).
>
> 1) Run your job with n reducers, and you'll have n files in the
> output dir.
>
> 2) Issue a hadoop fs -getmerge command to get the files in the output dir
> merged into a single file in the LFS
> (in recent versions use 'hdfs dfs -getmerge').
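>
> For example, using the 1gb.wc output dir from the original job (the local
> path /tmp/1gb.wc.merged is just a placeholder):
>
> hadoop fs -getmerge 1gb.wc /tmp/1gb.wc.merged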
>
> Regards
> Bejoy KS
>
> Sent from remote device, Please excuse typos
> ------------------------------
> From: Chen He <[EMAIL PROTECTED]>
> Date: Thu, 3 Jan 2013 22:55:36 -0600
> To: <[EMAIL PROTECTED]>
> Reply-To: [EMAIL PROTECTED]
> Subject: Re: more reduce tasks
>
> Sounds like you want more reducers to reduce the execution time but only
> want a single output file.
>
> Is this what you want?
>
> You can use as many reducers as you want (it may not be optimal) when you
> are running your job. Once the program is done, write a small Perl,
> Python, or shell program to concatenate those part-* files.
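>
> For example, a one-line shell command can do it (the local file name
> /tmp/1gb.wc.merged is just a placeholder):
>
> hadoop dfs -cat 1gb.wc/part-* > /tmp/1gb.wc.merged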
>
> If you do not want to write your own script to concatenate those files and
> would rather have Hadoop automatically generate a single file, it may need
> some patches to current Hadoop. I am not sure whether they are ready or not.
>
> On Thu, Jan 3, 2013 at 10:45 PM, Vinod Kumar Vavilapalli <
> [EMAIL PROTECTED]> wrote:
>
>>
>> Is it that you want the parallelism but a single final output? Assuming
>> your first job's reducers generate a small output, another stage is the way
>> to go. If not, a second stage won't help. What exactly are your objectives?
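>>
>> For example, a rough sketch of such a second stage (assuming each reducer
>> of the first job emits a single count, as below; /tmp/sum and the
>> 1gb.wc.total output dir are made-up names) would be one more streaming job
>> with a single reducer that sums the per-part counts:
>>
>> $ hadoop jar /packages/run.64/hadoop-0.20.2-cdh3u1/contrib/streaming/hadoop-streaming-0.20.2-cdh3u1.jar \
>>     -D mapred.reduce.tasks=1 -mapper /bin/cat -reducer /tmp/sum -file /tmp/sum \
>>     -input '1gb.wc/part-*' -output 1gb.wc.total
>>
>> where /tmp/sum contains
>> #!/bin/bash
>> awk '{ s += $1 } END { print s }'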
>>
>>  Thanks,
>> +Vinod
>>
>> On Jan 3, 2013, at 1:11 PM, Pavel Hančar wrote:
>>
>>   Hello,
>> I'd like to use more than one reduce task with Hadoop Streaming and I'd
>> like to have only one result. Is it possible? Or should I run one more job
>> to merge the result? And is it the same with non-streaming jobs? Below you
>> can see that I have 5 results for mapred.reduce.tasks=5.
>>
>> $ hadoop jar
>> /packages/run.64/hadoop-0.20.2-cdh3u1/contrib/streaming/hadoop-streaming-0.20.2-cdh3u1.jar
>> -D mapred.reduce.tasks=5 -mapper /bin/cat -reducer /tmp/wcc -file /tmp/wcc
>> -file /bin/cat -input /user/hadoopnlp/1gb -output 1gb.wc
>> .
>> .
>> .
>> 13/01/03 22:00:03 INFO streaming.StreamJob:  map 100%  reduce 100%
>> 13/01/03 22:00:07 INFO streaming.StreamJob: Job complete:
>> job_201301021717_0038
>> 13/01/03 22:00:07 INFO streaming.StreamJob: Output: 1gb.wc
>> $ hadoop dfs -cat 1gb.wc/part-*
>> 472173052
>> 165736187
>> 201719914
>> 184376668
>> 163872819
>> $
>>
>> where /tmp/wcc contains
>> #!/bin/bash
>> wc -c
>>
>> Thanks for any answer,
>>  Pavel Hančar
>>
>>
>>
>