MAPREDUCE-2454
Mariappan Asokan 2012-10-18, 23:20
I contributed a patch to MAPREDUCE-2454 that makes the sort stage in the MR data flow pluggable. Some of the benefits it brings are:

1. One can avoid sorting by providing an external sort implementation. This gives a performance benefit for jobs that do not require sorting. Once the patch for MAPREDUCE-2454 is committed, the problems discussed in MAPREDUCE-4039 and MAPREDUCE-1928 can be solved. In other words, MAPREDUCE-4039 and MAPREDUCE-1928 become special cases of a sort plugin implementation.

2. A full join (inner and outer) that is done in the reducer can instead be done more efficiently in the reduce sort plugin when both sides of the join are huge. The reason is that both sides of the join can be sorted separately, and data coming from the disk in the final merges can be joined right away.

3. One can implement specialized sorting algorithms, based on the data being processed, in order to optimize performance.

I have followed the developers' suggestions and incorporated them into the Jira. The patch passed the Apache QA build and tests. I request all committers to take a look at the patch and make any suggestions so that it can be committed.

Thanks.

-- Asokan
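
To illustrate the idea behind benefit 2, here is a minimal sketch of a full outer join performed while merging two inputs that are already sorted by key. It is plain Java and does not use the Hadoop API or the plugin interface from the patch; the names MergeJoinSketch, Rec and fullOuterJoin are hypothetical, and the in-memory lists only stand in for the sorted runs that would come from disk in the final merges.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MergeJoinSketch {
    /** Hypothetical key/value record standing in for one record of a sorted run. */
    static final class Rec {
        final int key;
        final String value;
        Rec(int key, String value) { this.key = key; this.value = value; }
    }

    /**
     * Full outer join of two inputs that are each sorted by key.
     * Output rows have the form "key:leftValue,rightValue"; the text
     * "null" marks a side that has no record for that key.
     */
    static List<String> fullOuterJoin(List<Rec> left, List<Rec> right) {
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < left.size() && j < right.size()) {
            int lk = left.get(i).key;
            int rk = right.get(j).key;
            if (lk < rk) {
                // Key exists only on the left side.
                out.add(lk + ":" + left.get(i++).value + ",null");
            } else if (lk > rk) {
                // Key exists only on the right side.
                out.add(rk + ":null," + right.get(j++).value);
            } else {
                // Same key on both sides: find the end of each equal-key group
                // and emit every left/right pairing within the group.
                int iEnd = i, jEnd = j;
                while (iEnd < left.size() && left.get(iEnd).key == lk) iEnd++;
                while (jEnd < right.size() && right.get(jEnd).key == lk) jEnd++;
                for (int a = i; a < iEnd; a++) {
                    for (int b = j; b < jEnd; b++) {
                        out.add(lk + ":" + left.get(a).value + "," + right.get(b).value);
                    }
                }
                i = iEnd;
                j = jEnd;
            }
        }
        // Drain whichever side still has records; they have no match.
        while (i < left.size()) {
            Rec r = left.get(i++);
            out.add(r.key + ":" + r.value + ",null");
        }
        while (j < right.size()) {
            Rec r = right.get(j++);
            out.add(r.key + ":null," + r.value);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Rec> left = Arrays.asList(new Rec(1, "a"), new Rec(2, "b"), new Rec(4, "d"));
        List<Rec> right = Arrays.asList(new Rec(2, "x"), new Rec(2, "y"), new Rec(3, "z"));
        for (String row : fullOuterJoin(left, right)) {
            System.out.println(row);
        }
    }
}
```

Because both inputs arrive in key order, each record is read once and matching keys are joined as soon as they meet in the merge, so no second pass over the joined data is needed.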