MapReduce, mail # user - Distribution of native executables and data for YARN-based execution


Re: Distribution of native executables and data for YARN-based execution
Vinod Kumar Vavilapalli 2013-05-17, 17:08

I have a bit of a conflict of interest, given that I have worked on Hadoop YARN all along, but..

I have worked on torque/condor based resource management systems too. There are many advantages to working on top of YARN; a few that should be specifically relevant here:
 - MR and non-MR workloads all on the same cluster (there are a few not-so-ready MR implementations on existing schedulers, but with lots of limitations)
 - Data locality, a feature that is native to Hadoop YARN and hard to simulate in other schedulers (we have experience trying this in the past)
 - Elastic resource management - jobs can grow and shrink elastically

Thanks,
+Vinod Kumar Vavilapalli
Hortonworks Inc.
http://hortonworks.com/

On May 17, 2013, at 7:20 AM, Tim St Clair wrote:

> Hi John -
>
> If you are doing extensive levels of non-MR C-style batch, you may be better served by looking at the myriad universes of existing schedulers (torque, condor, etc.), or by investigating the space around interop (1 cluster, many schedulers).
>
> Either way, I recommend minimizing the dependency graph of your C application where possible if you are working in a heterogeneous environment.
>
> Cheers,
> Tim
>
>
> From: "John Lilley" <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Sent: Friday, May 17, 2013 8:35:53 AM
> Subject: RE: Distribution of native executables and data for YARN-based execution
>
> Thanks!  This sounds exactly like what I need.  PUBLIC is right.
>  
> Do you know if this works for executables as well?  Like, would there be any issue transferring the executable bit on the file?
>  
> john
>  
> From: Vinod Kumar Vavilapalli [mailto:[EMAIL PROTECTED]]
> Sent: Friday, May 17, 2013 12:56 AM
> To: [EMAIL PROTECTED]
> Subject: Re: Distribution of native executables and data for YARN-based execution
>  
>  
> The "local resources" you mentioned are the exact solution for this. For each LocalResource, you also specify a LocalResourceVisibility, which takes one of three values today - PUBLIC, PRIVATE and APPLICATION.
>  
> PUBLIC resources are downloaded only once and shared by any application running on that node.
>  
> PRIVATE resources are downloaded only once and shared by any application run by the same user on that node.
>  
> APPLICATION resources are downloaded per application and removed after the application finishes.
>  
> Seems like you want PUBLIC or PRIVATE.
>  
> Note that for PUBLIC resources to work, the corresponding files need to be public on HDFS too.
>  
> Also, if the remote files on HDFS are updated, these local files will be downloaded afresh on each node where your containers run.
>  
> HTH
>  
> Thanks,
> +Vinod Kumar Vavilapalli
> Hortonworks Inc.
> http://hortonworks.com/
>  
>  
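The sharing rules in the reply above can be pictured as a per-node cache whose key changes with the visibility. The following is only a toy model of that description, not YARN's implementation and not the YARN API: the class `VisibilityCacheSketch`, its methods, and the key format are all hypothetical names invented for illustration. The HDFS modification timestamp is included in the key as an assumption, to mimic the note that updated files are fetched again.

```java
import java.util.HashSet;
import java.util.Set;

public class VisibilityCacheSketch {
    /** Mirrors the three visibility values named in the thread. */
    public enum Visibility { PUBLIC, PRIVATE, APPLICATION }

    // Keys of resources already present in this simulated node-local cache.
    private final Set<String> cached = new HashSet<>();

    /** Builds the cache key; what goes into the scope decides who shares a copy. */
    public static String cacheKey(Visibility v, String user, String appId,
                                  String hdfsPath, long timestamp) {
        String scope;
        switch (v) {
            case PUBLIC:  scope = "public"; break;          // shared by everyone on the node
            case PRIVATE: scope = "user:" + user; break;    // shared by one user's applications
            default:      scope = "app:" + appId; break;    // one copy per application
        }
        return scope + "|" + hdfsPath + "|" + timestamp;
    }

    /** Returns true if a download would be needed, false if a cached copy is reused. */
    public boolean localize(Visibility v, String user, String appId,
                            String hdfsPath, long timestamp) {
        return cached.add(cacheKey(v, user, appId, hdfsPath, timestamp));
    }

    /** Drops APPLICATION-scoped entries once the given application finishes. */
    public void finishApplication(String appId) {
        cached.removeIf(key -> key.startsWith("app:" + appId + "|"));
    }
}
```

In this model a 4GB PUBLIC package is fetched once per node and reused by later applications, and a new timestamp on the HDFS file produces a new key and therefore a fresh download.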
> On May 16, 2013, at 2:21 PM, John Lilley wrote:
>
>
> I am attempting to distribute the execution of a C-based program onto a Hadoop cluster, without using MapReduce.  I read that YARN can be used to schedule non-MapReduce applications by programming to the ASM/RM interfaces.  As I understand it, eventually I get down to specifying each sub-task via ContainerLaunchContext.setCommands().
>  
> However, the program and shared libraries need to be stored on each worker’s local disk to run.  In addition there is a hefty data set that the application uses (say, 4GB) that is accessed via regular open()/read() calls by a library.  I thought a decent strategy would be to push the program+data package to a known folder in HDFS, then launch a “bootstrap” that compared the HDFS folder version to a local folder, copying any updated files as needed before launching the native application task.
>  
> Are there better approaches?  I notice that one can implicitly copy “local resources” as part of the launch, but I don’t want to copy 4GB every time, only occasionally when the application or reference data is updated.  Also, will my bootstrapper be allowed to set executable-mode bits on the programs after they are copied?
>  
> Thanks
> John
>  
>  
>
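For comparison, the hand-rolled "bootstrap" that John describes in his original message (copy only files that changed, then set the executable bit) might look roughly like the sketch below. It is a minimal illustration under stated assumptions: plain local directories stand in for the HDFS folder, so a real bootstrapper would read through the Hadoop FileSystem API instead, and the class and method names (`BootstrapSyncSketch`, `syncAndMarkExecutable`) are hypothetical. It only shows that the Java standard library can set the executable bit on a local copy; it says nothing about how YARN handles permissions when it localizes resources itself.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class BootstrapSyncSketch {
    /**
     * Copies each regular file from sourceDir into targetDir when the target
     * copy is missing or older than the source, then marks the copy executable.
     * Returns the number of files that were copied.
     */
    public static int syncAndMarkExecutable(Path sourceDir, Path targetDir) throws IOException {
        Files.createDirectories(targetDir);
        int copied = 0;
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(sourceDir)) {
            for (Path src : entries) {
                if (!Files.isRegularFile(src)) {
                    continue; // this sketch does not recurse into subdirectories
                }
                Path dst = targetDir.resolve(src.getFileName());
                boolean stale = !Files.exists(dst)
                        || Files.getLastModifiedTime(dst)
                                .compareTo(Files.getLastModifiedTime(src)) < 0;
                if (stale) {
                    Files.copy(src, dst, StandardCopyOption.REPLACE_EXISTING);
                    // Set the executable bit on the local copy (best effort).
                    dst.toFile().setExecutable(true);
                    copied++;
                }
            }
        }
        return copied;
    }
}
```

Comparing modification times is the simplest staleness check; a version file or checksum would be more robust if clocks differ between machines. With LocalResource, as suggested in the replies, this bookkeeping is handled by YARN rather than by application code.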