Re: Pig and DistributedCache
Eugene Morozov 2013-02-07, 07:42
Rohini,

thank you for the reply.

Isn't it kind of a hack to use "tmpfiles"? It's neither an API nor a
well-known practice; it's an internal detail. How safe is it to rely on
such a trick? I mean, in a month or so we will probably update our CDH4
to whatever is current. Will it still work? Will it be safe for the
cluster or for my job? Who knows what will be implemented there?

You see, I can understand the code and find such a solution, but I won't
be able to keep all of these details in mind to check when we update the
cluster.

On Thu, Feb 7, 2013 at 1:23 AM, Rohini Palaniswamy
<[EMAIL PROTECTED]> wrote:

> You should be fine using tmpfiles, and that's the way to do it.
>
> Else you will have to copy the file to hdfs and call
> DistributedCache.addFileToClassPath yourself (basically what the tmpfiles
> setting is doing). But the problem there, as you mentioned, is cleaning up
> the hdfs file after the job completes. If you use tmpfiles, the file is
> copied to the job's staging directory in the user's home and gets cleaned
> up automatically when the job completes. If the file is not going to
> change between jobs, I would advise creating it in hdfs once in a fixed
> location and reusing it across jobs, doing only
> DistributedCache.addFileToClassPath(). But if it is dynamic and differs
> from job to job, tmpfiles is your choice.
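> Roughly, the manual approach would look like this (untested sketch; the
> paths and names are just examples):
>
>     // org.apache.hadoop.fs.FileSystem / Path, org.apache.hadoop.filecache.DistributedCache
>     Configuration conf = job.getConfiguration();
>     FileSystem fs = FileSystem.get(conf);
>     Path cached = new Path("/user/me/cache/filter-values.txt");  // example fixed HDFS location
>     if (!fs.exists(cached)) {
>         // copy it to hdfs once, then reuse it across jobs
>         fs.copyFromLocalFile(new Path("/tmp/filter-values.txt"), cached);
>     }
>     // register it on the job's classpath via the distributed cache
>     DistributedCache.addFileToClassPath(cached, conf);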
>
> Regards,
> Rohini
>
>
> On Mon, Feb 4, 2013 at 1:26 PM, Eugene Morozov <[EMAIL PROTECTED]> wrote:
>
> > Hello, folks!
> >
> > I'm using a heavily customized HBaseStorage in my Pig script.
> > During HBaseStorage.setLocation() I prepare a file with values that
> > serves as the source for my filter. The filter is used during
> > HBaseStorage.getNext().
> >
> > Since a Pig script is basically an MR job with many mappers, my values
> > file must be accessible to all of my map tasks. There is
> > DistributedCache, which should copy files across the cluster so they are
> > local to every map task. I don't want to write my file to HDFS in the
> > first place, because there is no way to clean it up after the MR job is
> > done (maybe you can point me in the right direction). On the other hand,
> > if I write the file to the local file system's "/tmp", then I can either
> > call deleteOnExit() or just forget about it; Linux will take care of its
> > local "/tmp" (rough sketch below).
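> > Roughly (illustrative sketch, not my real code):
> >
> >     // in setLocation(): write the filter values to a local temp file
> >     java.io.File values = java.io.File.createTempFile("filter-values", ".txt");
> >     values.deleteOnExit();  // lives in local /tmp; the JVM or the OS cleans it up
> >     // ... write the filter values into 'values' ...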
> >
> > But here is a small problem. DistributedCache copies files only if it is
> > used with a command line parameter like "-files" (example below). In that
> > case GenericOptionsParser copies all the files, but the DistributedCache
> > API itself only lets you set parameters in the jobConf; it doesn't
> > actually do the copying.
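> > For example, for a driver that goes through ToolRunner (illustrative;
> > the jar and class names are made up):
> >
> >     hadoop jar my-job.jar com.example.MyDriver -files /tmp/filter-values.txt <in> <out>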
> >
> > I've found that GenericOptionsParser sets the property "tmpfiles", which
> > is used by JobClient to copy files before it runs the MR job. And I've
> > been able to set the same property in the jobConf from my HBaseStorage
> > (rough sketch below). It does the trick, but it's a hack.
> > Is there any other correct way to achieve the goal?
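> > The hack itself looks roughly like this (illustrative, not my real code):
> >
> >     // inside HBaseStorage.setLocation(String location, Job job):
> >     // "tmpfiles" is the same property GenericOptionsParser sets for "-files";
> >     // JobClient copies everything listed here into the job's staging dir.
> >     job.getConfiguration().set("tmpfiles",
> >         new java.io.File("/tmp/filter-values.txt").toURI().toString());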
> >
> > Thanks in advance.
> > --
> > Evgeny Morozov
> > Developer Grid Dynamics
> > Skype: morozov.evgeny
> > www.griddynamics.com
> > [EMAIL PROTECTED]
> >
>

--
Evgeny Morozov
Developer Grid Dynamics
Skype: morozov.evgeny
www.griddynamics.com
[EMAIL PROTECTED]