Historically, many applications and frameworks wanted only Hadoop's resource management and failure handling (via JobTracker/TaskTracker), but were forced to use MapReduce even though they had no need for it. Obvious examples are graph processing (Giraph), BSP (Hama), Storm/S4, and even a simple tool like DistCp.
There are issues even with map-only jobs:
- You have to fake key-value processing, periodic pings, and key-value outputs.
- You are limited to the map slot capacity in the cluster.
- The number of tasks is static, so you cannot grow or shrink your job.
- You are forced to sort data all the time (although this has changed recently).
- You are tied to faking things like OutputCommit even if you don't need them.
That's just for starters. I can definitely think harder and list more ;)
YARN lets you move ahead without those limitations.
+Vinod Kumar Vavilapalli
On May 29, 2013, at 7:34 AM, Rahul Bhattacharjee wrote:
> Hi all,
> I was going through the motivation behind YARN. Splitting the responsibility of JT is the major concern. Ultimately the base (YARN) was built in a generic way for building other generic distributed applications too.
> I am not able to think of any other parallel processing use case that would be useful to build on top of YARN. I thought of a lot of use cases that would be beneficial when run in parallel, but again, we can do those using map-only jobs in MR.
> Can someone tell me a scenario where an application can utilize YARN features or can be built on top of YARN and, at the same time, cannot be done efficiently using MRv2 jobs?