Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hadoop >> mail # dev >> A pluggable external sort for Hadoop MR


Copy link to this message
-
Re: A pluggable external sort for Hadoop MR
Hey Asokan,

Could you please file a JIRA with your proposed enhancement so that the
discussion can be archived there? See
http://wiki.apache.org/hadoop/HowToContribute for more details on how to
contribute to Hadoop.

Thanks,
Jeff

On Tue, Apr 26, 2011 at 9:46 AM, Asokan, M <[EMAIL PROTECTED]> wrote:

> Hi Chris,
>   The overall elapsed time to run a sort depends on many factors other than
> the sort algorithm.  If you follow the data flow in MR from the point where
> sorting starts in Map phase to the point where <Key, Value> pairs are
> available for reduction in Reduce phase there are CPU and IO intensive
> activities happening.  You are right, passing data to an external process
> adds CPU cycles.  However, a well engineered implementation of the overall
> process can cut down the elapsed time.  From some of my experiments with a
> prototype implementation, I was able to cut down the elapsed time by about
> 40% to run some huge sorts(500 GB) on a modest cluster of 6 nodes.
>
> Besides, an external sorter can provide additional functionalities to
> Hadoop.  For example, on the Map side, an external sorter process can
> support filtering, reformatting, and aggregation in a single process with
> performance optimized for a multicore system.  With the current MR
> framework, filtering and reformatting happen before sorting and all these
> operations are very sequential in nature.  On the Reduce side, an external
> sorter can offer even exotic solution like Join since the external sorter
> implementation on the Reduce side is free to work on more than one
> stream(one from Hadoop MR shuffled data and the other from HDFS for
> example.)
>
> Thank you very much for your feedback.  If you any more questions, please
> let me know.
>
> -- Asokan
>
> On 04/26/2011 11:41 AM, Christopher Smith wrote:
>
> Aren't you worried that the overhead of shoving all that data through an
> external sort facility would outweigh any benefits from the algo?
>
> --Chris
>
> On Apr 26, 2011, at 8:34 AM, "Asokan, M" <[EMAIL PROTECTED]><mailto:
> [EMAIL PROTECTED]> wrote:
>
>
>
> Hi All,
>
> I am submitting this notice of intent to contribute to the Hadoop community
> on behalf of Syncsort, Inc. (www.syncsort.com<http://www.syncsort.com><
> http://www.syncsort.com><http://www.syncsort.com>) an interface for an
> external sorter.  Although Hadoop MR (Map/Reduce) provides users with
> pluggable InputFormat, Mapper, Partitioner, Combiner, Reducer, and
> OutputFormat it does not provide a plug-in for an external sorter. There is
> limited support to plug in a sorter class in the Map phase.  The merge logic
> in the Reduce phase cannot be changed.  Also, the sorting process is tightly
> coupled to the framework.
>
>
>
> The goal of our project is to decouple the sorting process and contribute a
> defined clean interface to allow developers to easily plug in external
> sorters through this interface.  THIS INTERFACE WILL BE INDEPENDENT FROM
> SYNCSORT’S PROPRIETARY SOFTWARE PRODUCTS WHICH ARE NOT INTENDED TO BE
> CONTRIBUTED.
>
> The following are some of the motivating factors for this project (not in
> any order of significance):
> ·         An external sort plug-in will promote innovative implementations
> by developers who have expertise in sort algorithms.
> ·         Hadoop developers can experiment with different sort
> implementations (in both the Map and Reduce phases) without modifying the
> framework code.
> ·         An external implementation of sort can be very well optimized to
> take advantage of OS and hardware architecture compared to the pure Java
> implementation in Hadoop.
> ·         The Hadoop implementation of sort is not self tuning. Users may
> be overwhelmed by so many parameters to be specified to tune the performance
> of sort.
> ·         One of the top memory consumers in the MR child JVMs is the sort.
>  Users are advised to set a reasonably high value for -mx argument to JVM.
> Failure to do so will result in job termination. If the external sorter is