Jian He 2013-10-21, 18:22
Re: temporary file locations for YARN applications
Harsh J 2013-10-21, 05:58
MR does use multiple disks when spilling. But the work directory is
also round-robined to spread I/O.
YARN sets an environment variable that's a comma-separated list
of directories (ApplicationConstants.LOCAL_DIR_ENV) that your app container
can use together. Perhaps read it in with
and then round robin internally over those paths (with free space
Perhaps you can even reuse the org.apache.hadoop.fs.LocalDirAllocator
class, which is what MR uses. It's not been declared publicly stable,
though we can do that over a JIRA.
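
As a rough sketch of the round-robin idea above (this is not MR's actual implementation): the class below is hypothetical, takes the comma-separated directory list as a plain string, and rotates over the entries. The class and method names, the free-space check via File.getUsableSpace(), and the fallback to the container's working directory are all assumptions made for illustration.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper: rotates over a container's local directories. */
final class LocalDirRoundRobin {
    private final List<File> dirs = new ArrayList<>();
    private final AtomicInteger next = new AtomicInteger();

    /** @param commaSeparated comma-separated directory list (may be null) */
    LocalDirRoundRobin(String commaSeparated) {
        if (commaSeparated != null) {
            for (String p : commaSeparated.split(",")) {
                String trimmed = p.trim();
                if (!trimmed.isEmpty()) {
                    dirs.add(new File(trimmed));
                }
            }
        }
        if (dirs.isEmpty()) {
            // Assumption: fall back to the container's own work directory.
            dirs.add(new File("."));
        }
    }

    /**
     * Returns the next directory in rotation. If minFreeBytes is positive,
     * directories with less usable space are skipped (assumed policy).
     */
    File nextDir(long minFreeBytes) {
        for (int i = 0; i < dirs.size(); i++) {
            int idx = Math.floorMod(next.getAndIncrement(), dirs.size());
            File d = dirs.get(idx);
            if (minFreeBytes <= 0 || d.getUsableSpace() >= minFreeBytes) {
                return d;
            }
        }
        throw new IllegalStateException("no local dir with enough free space");
    }
}
```

In a container you would build it from the environment variable mentioned above, for example new LocalDirRoundRobin(System.getenv(ApplicationConstants.LOCAL_DIR_ENV)), and call nextDir(expectedBytes) before creating each temporary file. The exact constant may differ between Hadoop versions, so check the one you build against.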
On Mon, Oct 21, 2013 at 2:05 AM, John Lilley <[EMAIL PROTECTED]> wrote:
> Harsh, thanks for the quick response. These files don't need to be on the DFS (although we use that too). These are local files used during sorting, joining, transitive closure.
> The task-relative folder might be good enough, but our app *can* make use of multiple temp folders if they are available. Our YARN app can be fairly I/O intensive; is it possible to allocate more than one temp folder on different physical devices?
> Or perhaps YARN might help us. Will YARN assign tasks to CWD folders on different disks so that they do not compete with each other on I/O?
> For that matter, where does MR allocate the temporary files generated by Mapper output? Presumably MR has the same I/O parallelism requirements that we do.
> -----Original Message-----
> From: Harsh J [mailto:[EMAIL PROTECTED]]
> Sent: Sunday, October 20, 2013 10:49 AM
> To: <[EMAIL PROTECTED]>
> Subject: Re: temporary file locations for YARN applications
> Every container gets its own local work directory (you can use the relative ./) that's auto-cleaned up at the end of the container's life.
> This is the best place to store the temporary files. This is not something you need custom configuration for.
> Do the files need to be on a distributed FS or a local one?
> On Sun, Oct 20, 2013 at 8:54 PM, John Lilley <[EMAIL PROTECTED]> wrote:
>> We have a pure YARN application (no MapReduce) that needs to store
>> a significant amount of temporary data. How can we know the best
>> location for these files? How can we ensure that our YARN tasks have
>> write access to these locations? Is this something that must be configured outside of YARN?
> Harsh J