This is a very good question, and one we are grappling with currently in our application port. I think there are a lot of legacy data-processing applications like ours that would benefit from a port to Hadoop. However, because we have a large body of C++ code, ours is not necessarily a good fit for MR. There seem to be two main choices:
· Run under Hadoop “streams”
· Run as a custom ApplicationMaster
One of the selling points of our application is its performance and single-node efficiency. I have concerns about streams:
· We will lose performance because of the extra layers of translation and I/O, and because streams data is uncompressed
· The streams model is limited to single-in, single-out
· We have a very large number of (and very large) files to make available locally; it is unclear whether the -files option will recursively copy and cache all of them
In contrast, porting our application as a YARN ApplicationMaster appears to offer several benefits (which come at the expense of extra complexity):
· Negotiation for container resources and scheduling. Some of our operations are very heavy (load time and memory use), so they need larger containers and will benefit from larger data splits.
· Direct access to HDFS via JNI without translation layers.
· Support for algorithms that are not well suited to the MR model, such as transitive closure, which are more naturally expressed as MPI-like iterative algorithms.
· If warranted, the ability to replace MR shuffle with a C++ data partition (this could be a discussion thread in its own right).
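To illustrate the transitive-closure point above: the computation iterates to a fixpoint, so expressing it in MR means launching a fresh join job per iteration, while a long-lived YARN application can keep the edge set resident in memory across iterations. A minimal in-memory sketch (my own illustration, assuming small integer node IDs, not our production code):

```cpp
#include <set>
#include <utility>

using Edge = std::pair<int, int>;

// Naive transitive closure: repeatedly add (a, d) whenever edges
// (a, b) and (b, d) exist, until no new edge appears (a fixpoint).
// In MR each pass would be a separate join job; a persistent YARN
// container just loops here with the data already loaded.
std::set<Edge> transitive_closure(std::set<Edge> edges) {
    bool changed = true;
    while (changed) {
        changed = false;
        std::set<Edge> next = edges;
        for (const auto& [a, b] : edges)
            for (const auto& [c, d] : edges)
                if (b == c && next.insert({a, d}).second)
                    changed = true;  // found a new derived edge
        edges = std::move(next);
    }
    return edges;
}
```

The number of iterations is data-dependent, which is exactly what makes a fixed pipeline of MR jobs an awkward fit.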
Moving our processing into native Java for a more seamless MR integration is not an option due to the size and complexity of the code base.
It may be that I am completely wrong about the limitations of the streams interface; if so please tell me why.
From: Rahul Bhattacharjee [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, May 29, 2013 8:34 AM
To: [EMAIL PROTECTED]
Subject: What else can be built on top of YARN.
I was going through the motivation behind YARN. Splitting the responsibilities of the JobTracker was the major concern. Ultimately the base (YARN) was built in a generic way so that other generic distributed applications could be built on top of it too.
I am not able to think of any other parallel-processing use case that would be useful to build on top of YARN. I thought of a lot of use cases that would benefit from running in parallel, but again, we can do those using map-only jobs in MR.
Can someone describe a scenario where an application can utilize YARN features, or can be built on top of YARN, but cannot be done efficiently using MRv2 jobs?