Hadoop, mail # user - Performance tuning of sort

Re: Performance tuning of sort
Jeff Zhang 2010-06-17, 07:43
Your understanding of Sort is not right. The key concept in Sort is
the TotalOrderPartitioner. Before the map-reduce job starts, the client
samples the input data to estimate the distribution of its keys. The
mappers do nothing to the records; each reducer fetches the data that
the TotalOrderPartitioner assigns to it. The data within each reducer
is sorted locally, and the reducers themselves are ordered by key range
(r0 < r1 < r2 ...), so the overall result is sorted.
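
To make the idea concrete, here is a minimal sketch in plain Python. It is not Hadoop code: the function names (sample_split_points, partition, total_order_sort) and the parameters are made up for illustration, and the sampling is a simple random sample rather than whatever the real client-side sampler does. It only shows why range partitioning plus a local sort per reducer yields a globally sorted result.

```python
import bisect
import random

def sample_split_points(keys, num_reducers, num_samples, seed=0):
    """Pick up to num_reducers - 1 split points from a random sample of keys.

    Hypothetical helper: stands in for the client-side sampling step.
    """
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(num_samples, len(keys))))
    if not sample:
        return []
    # Evenly spaced positions in the sorted sample approximate key quantiles.
    return [sample[len(sample) * i // num_reducers]
            for i in range(1, num_reducers)]

def partition(key, split_points):
    """Return the index of the reducer whose key range contains key."""
    return bisect.bisect_right(split_points, key)

def total_order_sort(keys, num_reducers=4, num_samples=100):
    """Route keys by range, then sort each reducer's keys locally."""
    split_points = sample_split_points(keys, num_reducers, num_samples)
    # "Map" side: records pass through unchanged and are only routed.
    reducers = [[] for _ in range(num_reducers)]
    for key in keys:
        reducers[partition(key, split_points)].append(key)
    # "Reduce" side: each reducer sorts only its own key range.
    return [sorted(bucket) for bucket in reducers]

if __name__ == "__main__":
    rng = random.Random(42)
    data = [rng.randrange(10_000) for _ in range(1_000)]
    outputs = total_order_sort(data)
    # Concatenating reducer outputs in order gives the fully sorted data.
    merged = [key for part in outputs for key in part]
    print(merged == sorted(data))
    print([len(part) for part in outputs])
```

The second print shows how many keys each simulated reducer received. In this sketch the sizes are equal only to the extent that the sample reflects the real key distribution, so a reducer whose range holds more keys has more data to sort and write.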

On Thu, Jun 17, 2010 at 12:13 AM, 李钰 <[EMAIL PROTECTED]> wrote:
> Hi all,
> I'm doing some tuning of the Hadoop sort benchmark. To be more specific,
> I'm running tests against the org.apache.hadoop.examples.Sort class. From
> looking through the source code, I think the map tasks are responsible for
> sorting the input data, and the reduce tasks just merge the map outputs and
> write them to HDFS. But here is a question I can't answer: the time spent in
> the reduce phase of each reduce task, that is, writing data to HDFS, differs
> from task to task. Since the input data and operations of each reduce task
> are the same, what would cause the execution times to differ? Is there
> anything wrong with my understanding? Does anybody have experience with
> this? I badly need your help, thanks.
> Best Regards,
> Carp

Best Regards

Jeff Zhang