MapReduce >> mail # user >> Distribution of native executables and data for YARN-based execution

Distribution of native executables and data for YARN-based execution
I am trying to run a C-based program across a Hadoop cluster without using MapReduce. I read that YARN can schedule non-MapReduce applications by programming to the ASM/RM interfaces. As I understand it, I eventually get down to specifying the launch command for each sub-task via ContainerLaunchContext.setCommands().

However, the program and its shared libraries need to be stored on each worker's local disk to run. In addition, there is a hefty data set (say, 4GB) that the application reads through a library using regular open()/read() calls. I thought a decent strategy would be to push the program+data package to a known folder in HDFS, then launch a "bootstrap" that compares the HDFS folder version to a local folder and copies any updated files as needed before launching the native application task.
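To make the bootstrap idea concrete, here is a rough Python sketch of the compare-and-copy step. Everything in it is an assumption for illustration rather than anything YARN provides: the package is described by a manifest mapping each relative file name to a version string, the local folder keeps a copy of that manifest in a hypothetical MANIFEST.json, and the caller supplies a fetch(name, target) function that actually pulls one file out of HDFS.

```python
import json
import os
from pathlib import Path


def files_to_update(remote_manifest, local_manifest):
    """Return the names whose version is missing or different in the local copy."""
    return sorted(name for name, version in remote_manifest.items()
                  if local_manifest.get(name) != version)


def sync_package(remote_manifest, local_dir, fetch, manifest_name="MANIFEST.json"):
    """Bring local_dir up to date and return the list of files that were fetched.

    remote_manifest: dict of relative file name -> version string (hypothetical format).
    fetch:           callable(name, target_path) that copies one file from the
                     shared store to target_path (supplied by the caller).
    """
    local_dir = Path(local_dir)
    local_dir.mkdir(parents=True, exist_ok=True)
    manifest_path = local_dir / manifest_name
    local_manifest = (json.loads(manifest_path.read_text())
                      if manifest_path.exists() else {})

    changed = files_to_update(remote_manifest, local_manifest)
    for name in changed:
        target = local_dir / name
        target.parent.mkdir(parents=True, exist_ok=True)

        # Fetch into a temporary name, then rename, so a half-copied file
        # never sits at the path the native program will open.
        tmp = target.with_name(target.name + ".tmp")
        if tmp.exists():
            tmp.unlink()
        fetch(name, tmp)
        os.replace(tmp, target)

        # Record progress after each file so an interrupted run resumes
        # with only the files that are still out of date.
        local_manifest[name] = remote_manifest[name]
        manifest_path.write_text(json.dumps(local_manifest, indent=2))
    return changed
```

On a real node, fetch would presumably shell out to something like `hadoop fs -get <hdfs path> <local path>`; that wiring is left out here. With this layout the 4GB data file is only transferred when its version string in the manifest changes.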

Are there better approaches? I notice that one can have "local resources" copied implicitly as part of the launch, but I don't want to copy 4GB every time, only occasionally when the application or reference data is updated. Also, will my bootstrapper be allowed to set executable-mode bits on the programs after they are copied?
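For the executable-bit part, this is the step the bootstrap would run after copying. It is a minimal Python sketch: the helper name make_executable is made up, and it assumes the bootstrap process owns the files it just copied and that the local folder is not mounted noexec.

```python
import os
import stat


def make_executable(path):
    """Add execute permission wherever read permission is already granted.

    Returns the resulting permission bits. chmod only succeeds if the calling
    process owns the file (or is privileged), which is the assumption here.
    """
    mode = stat.S_IMODE(os.stat(path).st_mode)
    exec_bits = (mode & 0o444) >> 2  # shift each r bit into the matching x position
    os.chmod(path, mode | exec_bits)
    return stat.S_IMODE(os.stat(path).st_mode)
```

For example, a file copied with mode 0644 ends up as 0755, while a private 0600 file becomes 0700 and stays private.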