John Lilley 2013-05-23, 21:44
Re: HTTP file server, map output, and other files
Harsh J 2013-05-24, 06:43
YARN has a ShuffleHandler plugin used for MR purposes, but the APIs
it uses aren't general/public, so you'd have to build your own
utilities to do that. It's not too difficult to achieve, but a general
API would certainly be nice.
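As a rough illustration of what "build your own utilities" could look like,
here is a minimal sketch that serves one local directory over HTTP using only
the JDK's built-in com.sun.net.httpserver package. It is not the
ShuffleHandler and uses no YARN API. The class name LocalFileServer, the
choice of directory and port, and the pool size of twice the processor count
are all assumptions made for the example.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Stream;

/** Hypothetical helper: serves regular files under one local directory over HTTP. */
final class LocalFileServer implements AutoCloseable {
    private final HttpServer server;
    private final ExecutorService pool;

    /** Starts serving {@code root}; pass port 0 to let the OS pick a free port. */
    LocalFileServer(Path root, int port) throws IOException {
        Path base = root.toAbsolutePath().normalize();
        // Assumption: a fixed pool of 2 x processors, chosen only for illustration.
        pool = Executors.newFixedThreadPool(2 * Runtime.getRuntime().availableProcessors());
        server = HttpServer.create(new InetSocketAddress(port), 0);
        server.setExecutor(pool);
        server.createContext("/", exchange -> serve(base, exchange));
        server.start();
    }

    /** The port actually bound, useful when the constructor was given port 0. */
    int port() {
        return server.getAddress().getPort();
    }

    private static void serve(Path base, HttpExchange exchange) throws IOException {
        try {
            // Drop the leading "/" and resolve the rest against the served directory.
            String relative = exchange.getRequestURI().getPath().substring(1);
            Path file = base.resolve(relative).normalize();
            // Refuse paths that escape the directory and anything that is not a regular file.
            if (!file.startsWith(base) || !Files.isRegularFile(file)) {
                exchange.sendResponseHeaders(404, -1);
                return;
            }
            exchange.sendResponseHeaders(200, Files.size(file));
            try (OutputStream out = exchange.getResponseBody()) {
                Files.copy(file, out);
            }
        } finally {
            exchange.close();
        }
    }

    /** Stops accepting requests and releases the worker threads. */
    @Override
    public void close() {
        server.stop(0);
        pool.shutdown();
    }

    /** Deletes a directory tree, children first; call this once consumers are done. */
    static void deleteRecursively(Path dir) throws IOException {
        try (Stream<Path> paths = Files.walk(dir)) {
            paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }
}
```

With something like this, a task would write its output under a directory of
its choosing, start the server, and advertise host:port to its consumers by
whatever means the application already uses. Cleanup is then the
application's own job: close the server and call deleteRecursively once the
data is no longer needed.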
Tez (Incubating) aims to solve some of this for users writing YARN
apps in a general way, but it isn't consumable yet. You can follow Tez
on the Apache Incubator at
P.S. As mentioned, YARN-based MR2 no longer uses Jetty to serve map
output over HTTP. It uses Netty.
On Fri, May 24, 2013 at 3:14 AM, John Lilley <[EMAIL PROTECTED]> wrote:
> Thanks to previous kind answers and more reading in the elephant book, I now
> understand that mapper tasks place partitioned results into local files that
> are served up to reducers via HTTP:
> “The output file’s partitions are made available to the reducers over HTTP.
> The maximum number of worker threads used to serve the file partitions is
> controlled by the tasktracker.http.threads property; this setting is per
> tasktracker, not per map task slot. The default of 40 may need to be
> increased for large clusters running large jobs. In MapReduce 2, this
> property is not applicable because the maximum number of threads used is set
> automatically based on the number of processors on the machine. (MapReduce 2
> uses Netty, which by default allows up to twice as many threads as there are
> processors.)”
> My question is, for a custom (non-MR) application under YARN, how would I
> set up my application tasks’ output data to be served over HTTP? Is there
> an API to control this, or are there predefined local folders that will be
> served up? Once I am finished with the temporary data, how do I request
> that the files are removed?