Hadoop >> mail # user >> Performance tuning of sort


Re: Performance tuning of sort
Todd,

Why is there a sort in the map task? The sorting there seems useless to me.

On Thu, Jun 17, 2010 at 9:26 AM, Todd Lipcon <[EMAIL PROTECTED]> wrote:
> On Thu, Jun 17, 2010 at 12:43 AM, Jeff Zhang <[EMAIL PROTECTED]> wrote:
>
>> Your understanding of Sort is not right. The key concept of Sort is
>> the TotalOrderPartitioner. Before the map-reduce job runs, the client
>> side samples the input data to estimate its distribution. The mapper
>> does nothing, and each reducer fetches its data according to the
>> TotalOrderPartitioner. The data in each reducer is locally sorted, and
>> the reducers themselves are ordered (r0 < r1 < r2 ...), so the
>> overall result is sorted.
>>
>
> The sorting happens on the map side, actually, during the spill process. The
> mapper itself is an identity function, but the map task code does perform a
> sort (on a <partition,key> tuple) as originally described in this thread.
> Reducers just do a merge of mapper outputs.
>
> -Todd
>
>
>>
>>
>>
>> On Thu, Jun 17, 2010 at 12:13 AM, 李钰 <[EMAIL PROTECTED]> wrote:
>> > Hi all,
>> >
>> > I'm tuning the sort benchmark of Hadoop. To be more specific, I'm
>> > running tests against the org.apache.hadoop.examples.Sort class. From
>> > reading the source code, I think the map tasks are responsible for
>> > sorting the input data, and the reduce tasks just merge the map outputs
>> > and write them to HDFS. But there is one thing I can't understand:
>> > the time taken by the reduce phase of each reduce task, that is, writing
>> > data to HDFS, differs from task to task. Since the input data and
>> > operations of each reduce task are the same, what would cause the
>> > execution times to differ? Is there anything wrong with my
>> > understanding? Does anybody have any experience with this? I badly
>> > need your help, thanks.
>> >
>> > Best Regards,
>> > Carp
>> >
>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>

--
Best Regards

Jeff Zhang
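
Editor's note: the two descriptions in this thread fit together, and a
small sketch may make that easier to see. The Python below is a toy
model, not Hadoop code. It assumes split points are chosen from a random
sample of the keys (the idea behind TotalOrderPartitioner), that each map
task sorts its identity output by a (partition, key) tuple, and that each
reducer only merges the pre-sorted runs it fetches. All function names
(sample_split_points, partition_for, map_task, reduce_task, run_sort) and
the sample size are made up for illustration.

```python
import bisect
import heapq
import random


def sample_split_points(keys, num_reducers, sample_size=100, seed=0):
    """Pick num_reducers - 1 split points from a random sample of the keys."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    step = len(sample) / num_reducers
    return [sample[int(step * i)] for i in range(1, num_reducers)]


def partition_for(key, split_points):
    """Range partitioner: partition i holds keys between split i-1 and split i."""
    return bisect.bisect_right(split_points, key)


def map_task(records, split_points):
    """Identity map; the only real work is the sort on (partition, key)."""
    tagged = [(partition_for(key, split_points), key) for key in records]
    tagged.sort()
    return tagged


def reduce_task(partition, map_outputs):
    """Merge the already-sorted runs for one partition from every map output."""
    runs = [[key for p, key in output if p == partition] for output in map_outputs]
    return list(heapq.merge(*runs))


def run_sort(input_splits, num_reducers):
    """Return one sorted list per reducer; r0 < r1 < r2 ... by construction."""
    all_keys = [key for split in input_splits for key in split]
    split_points = sample_split_points(all_keys, num_reducers)
    map_outputs = [map_task(split, split_points) for split in input_splits]
    return [reduce_task(p, map_outputs) for p in range(num_reducers)]


if __name__ == "__main__":
    rng = random.Random(1)
    splits = [[rng.randrange(1000) for _ in range(50)] for _ in range(4)]
    for reducer_id, part in enumerate(run_sort(splits, 3)):
        print(reducer_id, len(part), part[:5])
```

In this toy version the reducers never sort anything themselves; they
only merge, and concatenating their outputs in partition order gives a
globally sorted result. How many keys each reducer receives depends on
how well the sample reflects the data.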