MapReduce user mailing list: HTTP file server, map output, and other files


John Lilley 2013-05-23, 21:44
Re: HTTP file server, map output, and other files
YARN has a ShuffleHandler plugin used for MR purposes, but the APIs
used here aren't "general"/public so you'd have to build your own
utilities to do that. It's not too difficult to achieve, but a general
API would certainly be nice.
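
For illustration, a rough sketch of what such a hand-rolled utility could look like, using only the JDK's built-in com.sun.net.httpserver classes rather than any YARN or ShuffleHandler API; the port, context path, and output directory below are placeholders, not anything MR or YARN defines:

import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpHandler;
import com.sun.net.httpserver.HttpServer;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;

// Sketch only: serve a task's local output directory over HTTP so that
// other containers can fetch partition files. Port and directory are
// placeholders chosen for illustration.
public class PartitionFileServer {

    public static void main(String[] args) throws IOException {
        final File outputDir = new File("/tmp/app-task-output");   // placeholder
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);

        server.createContext("/partitions/", new HttpHandler() {
            @Override
            public void handle(HttpExchange exchange) throws IOException {
                // Map /partitions/<name> onto a file in the output directory.
                // NOTE: a real utility should sanitize 'name' (reject "..") first.
                String name = exchange.getRequestURI().getPath()
                        .substring("/partitions/".length());
                File file = new File(outputDir, name);

                if (!file.isFile()) {
                    exchange.sendResponseHeaders(404, -1);
                    exchange.close();
                    return;
                }

                exchange.sendResponseHeaders(200, file.length());
                try (OutputStream out = exchange.getResponseBody();
                     FileInputStream in = new FileInputStream(file)) {
                    byte[] buf = new byte[64 * 1024];
                    int n;
                    while ((n = in.read(buf)) != -1) {
                        out.write(buf, 0, n);
                    }
                }
            }
        });

        server.start();   // serve until the task decides to shut it down
    }
}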

Tez (Incubating) aims to solve some of this for users writing YARN
apps in a general way, but it isn't consumable yet. You can follow Tez
on the Apache Incubator at
http://incubator.apache.org/projects/tez.html.

P.S. As mentioned, YARN-based MR2 does not use Jetty for the HTTP serving
anymore. It uses Netty.
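
As a rough illustration of the sizing the quoted passage below mentions (assuming Netty 4 on the classpath; the Netty version MR2 actually bundles may differ), a no-argument event loop group defaults to about twice as many threads as there are processors:

import io.netty.channel.nio.NioEventLoopGroup;

// Sketch only: a no-arg NioEventLoopGroup defaults to roughly
// 2 * available processors worker threads, matching the "twice as many
// threads as there are processors" figure in the quoted excerpt below.
public class ShuffleThreadDefaults {
    public static void main(String[] args) {
        int processors = Runtime.getRuntime().availableProcessors();
        System.out.println("processors = " + processors
                + ", expected default event-loop threads = " + (2 * processors));

        NioEventLoopGroup group = new NioEventLoopGroup(); // default sizing
        group.shutdownGracefully();
    }
}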

On Fri, May 24, 2013 at 3:14 AM, John Lilley <[EMAIL PROTECTED]> wrote:
> Thanks to previous kind answers and more reading in the elephant book, I now
> understand that mapper tasks place partitioned results into local files that
> are served up to reducers via HTTP:
>
> “The output file’s partitions are made available to the reducers over HTTP.
> The maximum number of worker threads used to serve the file partitions is
> controlled by the tasktracker.http.threads property; this setting is per
> tasktracker, not per map task slot. The default of 40 may need to be
> increased for large clusters running large jobs. In MapReduce 2, this
> property is not applicable because the maximum number of threads used is set
> automatically based on the number of processors on the machine. (MapReduce 2
> uses Netty, which by default allows up to twice as many threads as there are
> processors.)”
>
> My question is, for a custom (non-MR) application under YARN, how would I
> set up my application tasks’ output data to be served over HTTP?  Is there
> an API to control this, or are there predefined local folders that will be
> served up?  Once I am finished with the temporary data, how do I request
> that the files are removed?
>
> Thanks
>
> John
>
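
Regarding the tasktracker.http.threads tuning in the excerpt above: a minimal sketch of raising it through the Hadoop Configuration API (the value 80 is purely illustrative; in practice the property is set in mapred-site.xml on each tasktracker):

import org.apache.hadoop.conf.Configuration;

// Sketch only: bump the MR1 shuffle-serving thread count described in the
// excerpt above. The value 80 is illustrative; 40 is the default the
// excerpt cites.
public class ShuffleThreadsConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.setInt("tasktracker.http.threads", 80);
        System.out.println("tasktracker.http.threads = "
                + conf.getInt("tasktracker.http.threads", 40));
    }
}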

--
Harsh J