HTTP file server, map output, and other files
John Lilley 2013-05-23, 21:44
Thanks to previous kind answers and more reading in the elephant book, I now understand that mapper tasks place partitioned results into local files that are served up to reducers via HTTP:
"The output file's partitions are made available to the reducers over HTTP. The maximum number of worker threads used to serve the file partitions is controlled by the tasktracker.http.threads property; this setting is per tasktracker, not per map task slot. The default of 40 may need to be increased for large clusters running large jobs. In MapReduce 2, this property is not applicable because the maximum number of threads used is set automatically based on the number of processors on the machine. (MapReduce 2 uses Netty, which by default allows up to twice as many threads as there are processors.)"
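To make the MapReduce 2 default in that passage concrete, here is a minimal sketch of the "twice as many threads as there are processors" rule. The class and method names are my own, purely for illustration; they are not Hadoop or Netty APIs, and the real shuffle code may compute its limit differently.

```java
// Illustrative only: ShuffleThreadDefaults and defaultMaxThreads are
// hypothetical names, not part of Hadoop or Netty.
class ShuffleThreadDefaults {

    // Applies the rule quoted above: up to twice as many threads as processors.
    static int defaultMaxThreads(int processors) {
        return 2 * processors;
    }

    public static void main(String[] args) {
        // Number of processors the JVM sees on this machine.
        int processors = Runtime.getRuntime().availableProcessors();
        System.out.println("processors=" + processors
                + ", default max threads=" + defaultMaxThreads(processors));
    }
}
```

By contrast, the MapReduce 1 limit is a fixed value set per tasktracker through tasktracker.http.threads (default 40), regardless of core count.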
My question is, for a custom (non-MR) application under YARN, how would I set up my application tasks' output data to be served over HTTP? Is there an API to control this, or are there predefined local folders that will be served up? Once I am finished with the temporary data, how do I request that the files be removed?