Mariappan Asokan 2012-10-18, 23:20
I contributed a patch to MAPREDUCE-2454 that makes the sort stage in
the MR data flow pluggable. Some of the benefits it brings are:
1. One can avoid sorting by providing an external sort implementation.
There is a performance benefit for jobs that do not require sorting. Once
the patch for MAPREDUCE-2454 is committed, the problems discussed in
MAPREDUCE-4039 and MAPREDUCE-1928 can be solved. In other words,
both issues become special cases of the sort plugin.
2. A full join (inner and outer) done in the reducer can instead be done in
the reduce sort plugin more efficiently when both sides of the join are
huge. The reason is that both sides of the join can be sorted separately
and data coming from the disk in the final merges can be joined right away.
3. One can implement specialized sorting algorithms based on the data
being processed in order to optimize performance.
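To illustrate point 2, here is a minimal sketch of a merge-style full outer join over two inputs that are already sorted by key. It is plain Java and does not use the plugin interface from the patch; the class MergeJoinSketch, the Rec type and the output format are hypothetical names chosen for this example, and in-memory lists stand in for the sorted runs that would be read from disk during the final merge.

```java
import java.util.ArrayList;
import java.util.List;

public class MergeJoinSketch {

    /** Hypothetical key/value record standing in for an entry of a sorted run. */
    public static final class Rec {
        final String key;
        final String value;

        public Rec(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    /**
     * Full outer join of two lists that are each sorted by key.
     * Output rows have the form "key:leftValue,rightValue"; a missing side
     * is written as "null". Matching keys on both sides produce the inner
     * join rows (cross product of the two groups with that key).
     */
    public static List<String> fullOuterJoin(List<Rec> left, List<Rec> right) {
        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < left.size() && j < right.size()) {
            int cmp = left.get(i).key.compareTo(right.get(j).key);
            if (cmp < 0) {
                // Key exists only on the left side.
                out.add(left.get(i).key + ":" + left.get(i).value + ",null");
                i++;
            } else if (cmp > 0) {
                // Key exists only on the right side.
                out.add(right.get(j).key + ":null," + right.get(j).value);
                j++;
            } else {
                // Same key on both sides: find the end of each group.
                String key = left.get(i).key;
                int iEnd = i;
                int jEnd = j;
                while (iEnd < left.size() && left.get(iEnd).key.equals(key)) {
                    iEnd++;
                }
                while (jEnd < right.size() && right.get(jEnd).key.equals(key)) {
                    jEnd++;
                }
                // Emit every left/right combination for this key.
                for (int a = i; a < iEnd; a++) {
                    for (int b = j; b < jEnd; b++) {
                        out.add(key + ":" + left.get(a).value + "," + right.get(b).value);
                    }
                }
                i = iEnd;
                j = jEnd;
            }
        }
        // Drain whichever side still has records.
        for (; i < left.size(); i++) {
            out.add(left.get(i).key + ":" + left.get(i).value + ",null");
        }
        for (; j < right.size(); j++) {
            out.add(right.get(j).key + ":null," + right.get(j).value);
        }
        return out;
    }
}
```

Because each side is consumed in key order, a row can be emitted as soon as the two current keys have been compared, which is the reason the join can happen while the final merges are still streaming data from disk.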
I have followed the developers' suggestions and incorporated them into the
patch on the Jira. The patch passed the Apache QA build and tests.
I request all committers to take a look at the patch and make any
suggestions so that it can be committed.