Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hadoop >> mail # dev >> A pluggable external sort for Hadoop MR


Copy link to this message
-
Re: A pluggable external sort for Hadoop MR
Hey Asokan,

Could you please file a JIRA with your proposed enhancement so that the
discussion can be archived there? See
http://wiki.apache.org/hadoop/HowToContribute for more details on how to
contribute to Hadoop.

Thanks,
Jeff

On Tue, Apr 26, 2011 at 9:46 AM, Asokan, M <[EMAIL PROTECTED]> wrote:

> Hi Chris,
>   The overall elapsed time to run a sort depends on many factors other than
> the sort algorithm.  If you follow the data flow in MR from the point where
> sorting starts in Map phase to the point where <Key, Value> pairs are
> available for reduction in Reduce phase there are CPU and IO intensive
> activities happening.  You are right, passing data to an external process
> adds CPU cycles.  However, a well engineered implementation of the overall
> process can cut down the elapsed time.  From some of my experiments with a
> prototype implementation, I was able to cut down the elapsed time by about
> 40% to run some huge sorts(500 GB) on a modest cluster of 6 nodes.
>
> Besides, an external sorter can provide additional functionalities to
> Hadoop.  For example, on the Map side, an external sorter process can
> support filtering, reformatting, and aggregation in a single process with
> performance optimized for a multicore system.  With the current MR
> framework, filtering and reformatting happen before sorting and all these
> operations are very sequential in nature.  On the Reduce side, an external
> sorter can offer even exotic solution like Join since the external sorter
> implementation on the Reduce side is free to work on more than one
> stream(one from Hadoop MR shuffled data and the other from HDFS for
> example.)
>
> Thank you very much for your feedback.  If you any more questions, please
> let me know.
>
> -- Asokan
>
> On 04/26/2011 11:41 AM, Christopher Smith wrote:
>
> Aren't you worried that the overhead of shoving all that data through an
> external sort facility would outweigh any benefits from the algo?
>
> --Chris
>
> On Apr 26, 2011, at 8:34 AM, "Asokan, M" <[EMAIL PROTECTED]><mailto:
> [EMAIL PROTECTED]> wrote:
>
>
>
> Hi All,
>
> I am submitting this notice of intent to contribute to the Hadoop community
> on behalf of Syncsort, Inc. (www.syncsort.com<http://www.syncsort.com><
> http://www.syncsort.com><http://www.syncsort.com>) an interface for an
> external sorter.  Although Hadoop MR (Map/Reduce) provides users with
> pluggable InputFormat, Mapper, Partitioner, Combiner, Reducer, and
> OutputFormat it does not provide a plug-in for an external sorter. There is
> limited support to plug in a sorter class in the Map phase.  The merge logic
> in the Reduce phase cannot be changed.  Also, the sorting process is tightly
> coupled to the framework.
>
>
>
> The goal of our project is to decouple the sorting process and contribute a
> defined clean interface to allow developers to easily plug in external
> sorters through this interface.  THIS INTERFACE WILL BE INDEPENDENT FROM
> SYNCSORT’S PROPRIETARY SOFTWARE PRODUCTS WHICH ARE NOT INTENDED TO BE
> CONTRIBUTED.
>
> The following are some of the motivating factors for this project (not in
> any order of significance):
> ·         An external sort plug-in will promote innovative implementations
> by developers who have expertise in sort algorithms.
> ·         Hadoop developers can experiment with different sort
> implementations (in both the Map and Reduce phases) without modifying the
> framework code.
> ·         An external implementation of sort can be very well optimized to
> take advantage of OS and hardware architecture compared to the pure Java
> implementation in Hadoop.
> ·         The Hadoop implementation of sort is not self tuning. Users may
> be overwhelmed by so many parameters to be specified to tune the performance
> of sort.
> ·         One of the top memory consumers in the MR child JVMs is the sort.
>  Users are advised to set a reasonably high value for -mx argument to JVM.
> Failure to do so will result in job termination. If the external sorter is
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB