-Re: A pluggable external sort for Hadoop MR
Jeff Hammerbacher 2011-04-26, 18:00
Could you please file a JIRA with your proposed enhancement so that the
discussion can be archived there? See
http://wiki.apache.org/hadoop/HowToContribute for more details on how to
contribute to Hadoop.
On Tue, Apr 26, 2011 at 9:46 AM, Asokan, M <[EMAIL PROTECTED]> wrote:
> Hi Chris,
> The overall elapsed time to run a sort depends on many factors other than
> the sort algorithm. If you follow the data flow in MR from the point where
> sorting starts in Map phase to the point where <Key, Value> pairs are
> available for reduction in Reduce phase there are CPU and IO intensive
> activities happening. You are right, passing data to an external process
> adds CPU cycles. However, a well engineered implementation of the overall
> process can cut down the elapsed time. From some of my experiments with a
> prototype implementation, I was able to cut down the elapsed time by about
> 40% to run some huge sorts(500 GB) on a modest cluster of 6 nodes.
> Besides, an external sorter can provide additional functionalities to
> Hadoop. For example, on the Map side, an external sorter process can
> support filtering, reformatting, and aggregation in a single process with
> performance optimized for a multicore system. With the current MR
> framework, filtering and reformatting happen before sorting and all these
> operations are very sequential in nature. On the Reduce side, an external
> sorter can offer even exotic solution like Join since the external sorter
> implementation on the Reduce side is free to work on more than one
> stream(one from Hadoop MR shuffled data and the other from HDFS for
> Thank you very much for your feedback. If you any more questions, please
> let me know.
> -- Asokan
> On 04/26/2011 11:41 AM, Christopher Smith wrote:
> Aren't you worried that the overhead of shoving all that data through an
> external sort facility would outweigh any benefits from the algo?
> On Apr 26, 2011, at 8:34 AM, "Asokan, M" <[EMAIL PROTECTED]><mailto:
> [EMAIL PROTECTED]> wrote:
> Hi All,
> I am submitting this notice of intent to contribute to the Hadoop community
> on behalf of Syncsort, Inc. (www.syncsort.com<http://www.syncsort.com><
> http://www.syncsort.com><http://www.syncsort.com>) an interface for an
> external sorter. Although Hadoop MR (Map/Reduce) provides users with
> pluggable InputFormat, Mapper, Partitioner, Combiner, Reducer, and
> OutputFormat it does not provide a plug-in for an external sorter. There is
> limited support to plug in a sorter class in the Map phase. The merge logic
> in the Reduce phase cannot be changed. Also, the sorting process is tightly
> coupled to the framework.
> The goal of our project is to decouple the sorting process and contribute a
> defined clean interface to allow developers to easily plug in external
> sorters through this interface. THIS INTERFACE WILL BE INDEPENDENT FROM
> SYNCSORT’S PROPRIETARY SOFTWARE PRODUCTS WHICH ARE NOT INTENDED TO BE
> The following are some of the motivating factors for this project (not in
> any order of significance):
> · An external sort plug-in will promote innovative implementations
> by developers who have expertise in sort algorithms.
> · Hadoop developers can experiment with different sort
> implementations (in both the Map and Reduce phases) without modifying the
> framework code.
> · An external implementation of sort can be very well optimized to
> take advantage of OS and hardware architecture compared to the pure Java
> implementation in Hadoop.
> · The Hadoop implementation of sort is not self tuning. Users may
> be overwhelmed by so many parameters to be specified to tune the performance
> of sort.
> · One of the top memory consumers in the MR child JVMs is the sort.
> Users are advised to set a reasonably high value for -mx argument to JVM.
> Failure to do so will result in job termination. If the external sorter is