RE: What else can be built on top of YARN.
Rahul,

This is a very good question, and one we are grappling with currently in our application port. I think there are a lot of legacy data-processing applications like ours that would benefit from a port to Hadoop. However, because we have a large C++ code base, our application is not necessarily a good fit for MR. There seem to be two main choices:

·         Run under Hadoop Streaming ("streams")

·         Run as a custom ApplicationMaster

One of the selling points of our application is its performance and single-core efficiency. I have concerns about streams:

·         We will lose performance, because of the extra layers of translation and I/O and because streams data is uncompressed

·         The streams model is limited to single-in, single-out

·         We have a large number of large files to make available locally, and it is unclear whether the -files option will recursively copy and cache all of them
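
For readers unfamiliar with the single-in, single-out contract mentioned above, here is a minimal sketch of a streaming-style mapper in Python. It is only an illustration, not the application discussed in this message: the names map_lines and run and the word-count logic are assumptions. What it relies on is that Hadoop Streaming passes input records to the process on stdin, one per line, and reads tab-separated key/value lines back from stdout.

```python
import sys
from typing import Iterable, Iterator, TextIO


def map_lines(lines: Iterable[str]) -> Iterator[str]:
    """Turn each input line into zero or more 'key<TAB>value' output lines."""
    for line in lines:
        for word in line.split():
            # By default, streaming treats the text before the first tab as the key.
            yield f"{word}\t1"


def run(inp: TextIO = sys.stdin, out: TextIO = sys.stdout) -> None:
    """Read records from one input stream and write results to one output stream."""
    # Every record crosses the process boundary as text, in both directions.
    for record in map_lines(inp):
        out.write(record + "\n")
```

A script used as a mapper would end with a call to run(). The point of the sketch is that there is exactly one input stream and one output stream, and all data is serialized to text on the way in and out.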

In contrast, porting our application as a YARN ApplicationMaster appears to offer several benefits (which come at the expense of extra complexity):

·         Negotiation for container resources and scheduling.  Some of our operations are very heavy (load time and memory use), so they need larger containers and will benefit from larger data splits.

·         Direct access to HDFS via JNI without translation layers.

·         Algorithms that are not well-suited to the MR model, such as transitive closure.  They are more naturally expressed as MPI-like algorithms.

·         If warranted, the ability to replace MR shuffle with a C++ data partition (this could be a discussion thread in its own right).
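
To illustrate why transitive closure sits awkwardly in a single map/reduce pass: one simple way to compute it is to repeat a join step until no new pairs appear. The sketch below is a small in-memory Python version with hypothetical names (transitive_closure, Edge), not the C++ code discussed in this message, and it is deliberately naive rather than efficient.

```python
from typing import Iterable, Set, Tuple

Edge = Tuple[str, str]


def transitive_closure(edges: Iterable[Edge]) -> Set[Edge]:
    """Return every (a, d) pair such that d is reachable from a."""
    closure = set(edges)
    while True:
        # Join step: if a->b and b->d are already known, then a->d is implied.
        new_pairs = {
            (a, d)
            for (a, b) in closure
            for (c, d) in closure
            if b == c
        } - closure
        if not new_pairs:
            # Fixed point reached: another pass would add nothing.
            return closure
        closure |= new_pairs
```

Each pass of the loop needs the complete result of the previous pass, and the number of passes depends on the data. Expressed in MR, that would typically become a chain of jobs with the intermediate set written out between them, whereas long-running containers that exchange data directly can keep it in memory.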

Moving our processing into native Java for a more seamless MR integration is not an option due to the size and complexity of the code base.

It may be that I am completely wrong about the limitations of the streams interface; if so, please tell me why.

john

From: Rahul Bhattacharjee [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, May 29, 2013 8:34 AM
To: [EMAIL PROTECTED]
Subject: What else can be built on top of YARN.

Hi all,
I was going through the motivation behind YARN. Splitting up the responsibilities of the JobTracker (JT) was the major concern. Ultimately, the base (YARN) was built in a generic way so that other distributed applications could be built on it too.
I am not able to think of any other parallel processing use case that would be useful to build on top of YARN. I thought of a lot of use cases that would benefit from running in parallel, but then again, we can do those using map-only jobs in MR.
Can someone tell me a scenario where an application can utilize YARN features, or can be built on top of YARN, and at the same time cannot be done efficiently using MRv2 jobs?
thanks,
Rahul