Hadoop >> mail # user >> Re: What else can be built on top of YARN.


Rahul Bhattacharjee 2013-05-29, 17:30
Rahul Bhattacharjee 2013-06-01, 07:47
RE: What else can be built on top of YARN.
Rahul,

This is a very good question, and one we are currently grappling with in our application port. I think there are a lot of legacy data-processing applications like ours that would benefit from a port to Hadoop. However, because we have a great deal of C++, it is not necessarily a good fit for MR. There seem to be two main choices:

·         Run under Hadoop “streams”

·         Run as a custom ApplicationMaster

One of the selling points of our application is its performance and single-core efficiency. I have concerns about streams:

·         We will lose performance because of the extra layers of translation and I/O, and because streams data is uncompressed

·         The streams model is limited to single-in, single-out

·         We have a very large number of files, large in total size, to make available locally; it is unclear whether the -files option will recursively copy and cache all of them
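
To make the first two concerns concrete, here is a minimal C++ sketch of the contract a streaming mapper has to follow: text records arrive one per line on standard input, and tab-separated key/value lines go out on standard output. The function name `run_mapper` and the word-count logic are purely illustrative assumptions and have nothing to do with our actual application.

```cpp
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical mapper body for a streaming-style executable.
// Every record is parsed from text on the way in and formatted back
// to text on the way out, and there is exactly one input stream and
// one output stream.
void run_mapper(std::istream& in, std::ostream& out) {
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream words(line);
        std::string word;
        while (words >> word) {
            // Emit "key<TAB>value", one pair per line.
            out << word << '\t' << 1 << '\n';
        }
    }
}

// A real streaming executable would simply call
// run_mapper(std::cin, std::cout) from main().
```

Anything that does not naturally fit one line-oriented input and one line-oriented output (several typed inputs, binary records, side outputs) has to be squeezed through this interface or handled outside it.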

In contrast, porting our application as a YARN ApplicationMaster appears to offer several benefits (which come at the expense of extra complexity):

·         Negotiation for container resources and scheduling. Some of our operations are very heavy (in load time and memory use), so they need larger containers and will benefit from larger data splits.

·         Direct access to HDFS via JNI without translation layers.

·         Algorithms that are not well-suited to the MR model, such as transitive closure.  They are more naturally expressed as MPI-like algorithms.

·         If warranted, the ability to replace MR shuffle with a C++ data partition (this could be a discussion thread in its own right).
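
As an illustration of the transitive-closure point: the computation is a fixpoint, so each round depends on the edges discovered in the previous round. Expressed in MR, each round would typically be a separate job whose output is written out and read back in, whereas an MPI-like program can keep the working set in memory and just loop. Below is a naive single-process C++ sketch; the names `Edge` and `transitive_closure` and the optional `rounds` counter are hypothetical and only there to show the iteration structure, not how we would distribute it.

```cpp
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

using Edge = std::pair<int, int>;

// Naive iterative closure: keep joining the current closure with the
// original edges until a full round adds nothing new.
std::set<Edge> transitive_closure(const std::set<Edge>& edges,
                                  std::size_t* rounds = nullptr) {
    std::set<Edge> closure = edges;
    std::size_t n = 0;
    bool changed = true;
    while (changed) {
        changed = false;
        ++n;
        std::vector<Edge> found;
        // Join step: (a,b) in closure and (b,d) in edges gives (a,d).
        for (const Edge& ab : closure) {
            for (const Edge& cd : edges) {
                if (ab.second == cd.first) {
                    Edge e{ab.first, cd.second};
                    if (closure.count(e) == 0) found.push_back(e);
                }
            }
        }
        // Merge the new edges; another round is needed if any were added.
        for (const Edge& e : found) {
            if (closure.insert(e).second) changed = true;
        }
    }
    if (rounds) *rounds = n;
    return closure;
}
```

The number of rounds is not known in advance and grows with the longest path in the data, which is what makes a fixed pipeline of map and reduce stages an awkward fit.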

Moving our processing into native Java for a more seamless MR integration is not an option due to the size and complexity of the code base.

It may be that I am completely wrong about the limitations of the streams interface; if so, please tell me why.

john

From: Rahul Bhattacharjee [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, May 29, 2013 8:34 AM
To: [EMAIL PROTECTED]
Subject: What else can be built on top of YARN.

Hi all,
I was going through the motivation behind YARN. Splitting up the responsibilities of the JobTracker (JT) was the major concern. Ultimately, the base (YARN) was built in a generic way so that other generic distributed applications could be built on it too.
I am not able to think of any other parallel-processing use case that would be useful to build on top of YARN. I thought of a lot of use cases that would benefit from running in parallel, but again, we can do those using map-only jobs in MR.
Can someone tell me a scenario where an application can utilize YARN features, or can be built on top of YARN, and at the same time cannot be done efficiently using MRv2 jobs?
thanks,
Rahul
