Re: Pig and DistributedCache
Hi, again.

I've been able to use the DistributedCache trick with "tmpfiles"
successfully: while my Pig script runs, JobClient copies the files to the
job-cache.

But here is the issue. The files are there, but they have permission 700,
and the user that runs the map task (I suppose it's hbase) doesn't have
permission to read them. The files are owned by my current OS user.

First, this looks like a bug, doesn't it?
Second, what can I do about it?
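
To confirm the symptom on a task node, the owner and mode of a localized
file can be inspected directly. Below is a minimal sketch using only the
Java standard library. The class name CachePermCheck and the path passed on
the command line are hypothetical, and it assumes a POSIX file system. It
only reports what is on disk; it does not change how Hadoop localizes the
files.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Set;

// Hypothetical helper, not part of Pig or Hadoop.
class CachePermCheck {

    // True if users outside the owner and group are allowed to read.
    static boolean othersCanRead(Set<PosixFilePermission> perms) {
        return perms.contains(PosixFilePermission.OTHERS_READ);
    }

    // One-line summary: owner name, rwx string, and the others-read flag.
    static String describe(Path file) throws IOException {
        Set<PosixFilePermission> perms = Files.getPosixFilePermissions(file);
        return Files.getOwner(file).getName() + " "
                + PosixFilePermissions.toString(perms)
                + " othersCanRead=" + othersCanRead(perms);
    }

    public static void main(String[] args) throws IOException {
        // Pass one or more paths of files found in the job-cache directory.
        for (String arg : args) {
            System.out.println(describe(Paths.get(arg)));
        }
    }
}
```

A file with mode 700 has no read bit for other users, so a task running
under a different account cannot open it.
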
On Thu, Feb 7, 2013 at 11:42 AM, Eugene Morozov

> Rohini,
> thank you for the reply.
> Isn't it kind of a hack to use "tmpfiles"? It's neither an API nor a
> well-known practice; it's an internal detail. How safe is it to use such a
> trick? I mean, in a month or so we will probably update our CDH4 to
> whatever is current. Will it still work? Will it be safe for the cluster
> or for my job? Who knows what will be implemented there?
> You see, I can understand the code and find such a solution, but I won't
> be able to keep all of these tricks in mind to check when we update the
> cluster.
> On Thu, Feb 7, 2013 at 1:23 AM, Rohini Palaniswamy <
>> You should be fine using tmpfiles; that's the way to do it.
>> Otherwise you will have to copy the file to HDFS and call
>> DistributedCache.addFileToClassPath yourself (basically what the tmpfiles
>> setting does). But the problem there, as you mentioned, is cleaning up
>> the HDFS file after the job completes. If you use tmpfiles, the file is
>> copied to the job's staging directory in the user's home and gets cleaned
>> up automatically when the job completes. If the file is not going to
>> change between jobs, I would advise creating it in HDFS once in a fixed
>> location and reusing it across jobs, calling only
>> DistributedCache.addFileToClassPath(). But if it is dynamic and differs
>> from job to job, tmpfiles is your choice.
>> Regards,
>> Rohini
>> On Mon, Feb 4, 2013 at 1:26 PM, Eugene Morozov <[EMAIL PROTECTED]> wrote:
>> > Hello, folks!
>> >
>> > I'm using a heavily customized HBaseStorage in my Pig script.
>> > During HBaseStorage.setLocation() I prepare a file with values that
>> > serve as the source for my filter. The filter is used during
>> > HBaseStorage.getNext().
>> >
>> > Since a Pig script is basically an MR job with many mappers, my
>> > values file must be accessible to all my map tasks. There is
>> > DistributedCache, which should copy files across the cluster so they
>> > are local to every map task. I don't want to write my file to HDFS in
>> > the first place, because there is no way to clean it up after the MR
>> > job is done (maybe you can point me in the right direction). On the
>> > other hand, if I write the file to the local file system "/tmp", I can
>> > either call deleteOnExit() or just forget about it - Linux will take
>> > care of its local "/tmp".
>> >
>> > But here is a small problem. DistributedCache copies files only if it
>> > is used with a command-line parameter like "-files". In that case
>> > GenericOptionsParser copies all the files, but the DistributedCache API
>> > itself only lets you specify parameters in jobConf - it doesn't
>> > actually do the copying.
>> >
>> > I've found that GenericOptionsParser sets the property "tmpfiles",
>> > which is used by JobClient to copy files before it runs the MR job.
>> > I've been able to set the same property in jobConf from my
>> > HBaseStorage. It does the trick, but it's a hack.
>> > Is there a proper way to achieve the goal?
>> >
>> > Thanks in advance.
>> > --
>> > Evgeny Morozov
>> > Developer Grid Dynamics
>> > Skype: morozov.evgeny
>> > www.griddynamics.com
>> >

Evgeny Morozov
Developer Grid Dynamics
Skype: morozov.evgeny