Pig user mailing list: Pig and DistributedCache


Eugene Morozov 2013-02-04, 21:26
Rohini Palaniswamy 2013-02-06, 21:23
Eugene Morozov 2013-02-07, 07:42
Eugene Morozov 2013-02-11, 06:26

Re: Pig and DistributedCache
Rohini Palaniswamy 2013-02-17, 04:22
Hi Eugene,
      Sorry. Missed your reply earlier.

    tmpfiles has been around for a while and will not be removed from
Hadoop anytime soon, so don't worry about it. The Hadoop configuration
properties have never been fully documented, and people look at the code
and use them. They are usually deprecated for years before being removed.

  The file will be created with permissions based on the dfs.umaskmode
setting (or fs.permissions.umask-mode in Hadoop 0.23/2.x), and the owner
of the file will be the user who runs the Pig script. The map job will be
launched as that same user by the Pig script. I don't understand what you
mean by the user that runs the map task not having permissions. What kind
of Hadoop authentication are you doing such that the file is created as
one user and the map job is launched as another?
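
For reference, setting the umask looks roughly like this (a sketch, not
tested here; which property name applies depends on your Hadoop version):

    import org.apache.hadoop.conf.Configuration;

    // A umask of 022 creates files as 644 (dirs as 755), so users other
    // than the job owner can read what the JobClient stages.
    Configuration conf = new Configuration();
    conf.set("dfs.umaskmode", "022");              // Hadoop 1.x / 0.20
    conf.set("fs.permissions.umask-mode", "022");  // Hadoop 0.23 / 2.x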

Regards,
Rohini
On Sun, Feb 10, 2013 at 10:26 PM, Eugene Morozov <[EMAIL PROTECTED]> wrote:

> Hi, again.
>
> I've been able to successfully use the trick with DistributedCache and
> "tmpfiles": during the run of my Pig script the files are copied by the
> JobClient to the job cache.
>
> But here is the issue. The files are there, but they have permission 700,
> and the user that runs the map task (I suppose it's hbase) doesn't have
> permission to read them. The files are owned by my current OS user.
>
> First, this looks like a bug, doesn't it?
> Second, what can I do about it?
>
>
> On Thu, Feb 7, 2013 at 11:42 AM, Eugene Morozov <[EMAIL PROTECTED]> wrote:
>
> > Rohini,
> >
> > thank you for the reply.
> >
> > Isn't it kind of a hack to use "tmpfiles"? It's neither API nor
> > well-known practice; it's an internal detail. How safe is it to use
> > such a trick? I mean, in a month or so we'll probably update our CDH4
> > to whatever is available then. Will it still work? Will it be safe for
> > the cluster or for my job? Who knows what will be implemented there?
> >
> > You see, I can understand the code and find such a solution, but I
> > won't be able to keep all of these details in mind to check them when
> > we update the cluster.
> >
> >
> > On Thu, Feb 7, 2013 at 1:23 AM, Rohini Palaniswamy <[EMAIL PROTECTED]> wrote:
> >
> >> You should be fine using tmpfiles, and that's the way to do it.
> >>
> >> Otherwise you will have to copy the file to HDFS and call
> >> DistributedCache.addFileToClassPath yourself (basically what the
> >> tmpfiles setting is doing). But the problem there, as you mentioned,
> >> is cleaning up the HDFS file after the job completes. If you use
> >> tmpfiles, the file is copied to the job's staging directory under the
> >> user's home directory and gets cleaned up automatically when the job
> >> completes. If the file is not going to change between jobs, I would
> >> advise creating it in HDFS once in a fixed location and reusing it
> >> across jobs, calling only DistributedCache.addFileToClassPath(). But
> >> if it is dynamic and differs from job to job, tmpfiles is your choice.
> >>
> >> Regards,
> >> Rohini
> >>
> >>
> >> On Mon, Feb 4, 2013 at 1:26 PM, Eugene Morozov <[EMAIL PROTECTED]> wrote:
> >>
> >> > Hello, folks!
> >> >
> >> > I'm using a heavily customized HBaseStorage in my Pig script.
> >> > During HBaseStorage.setLocation() I'm preparing a file with values
> >> > that will be the source for my filter. The filter is used during
> >> > HBaseStorage.getNext().
> >> >
> >> > Since a Pig script is basically an MR job with many mappers, my
> >> > values file must be accessible to all of my map tasks. There is
> >> > DistributedCache, which copies files across the cluster so that
> >> > they are local to every map task. I don't want to write my file to
> >> > HDFS in the first place, because there is no way to clean it up
> >> > after the MR job is done (maybe you can point me in the right
> >> > direction). On the other hand, if I write the file to the local
> >> > file system "/tmp", then I can either specify deleteOnExit() or
> >> > just forget about it; Linux will take care of its local "/tmp".
> >> >
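> >> > Roughly what I do in setLocation() (a simplified sketch;
> >> > writeFilterValues() stands in for my real code):
> >> >
> >> >     import java.io.File;
> >> >     import java.io.IOException;
> >> >     import org.apache.hadoop.mapreduce.Job;
> >> >
> >> >     @Override
> >> >     public void setLocation(String location, Job job) throws IOException {
> >> >         super.setLocation(location, job);
> >> >         // Write the filter values to local /tmp; deleteOnExit()
> >> >         // cleans up the local copy when the client JVM exits.
> >> >         File values = File.createTempFile("filter-values", ".txt");
> >> >         values.deleteOnExit();
> >> >         writeFilterValues(values); // stand-in for the real writer
> >> >         // "tmpfiles" makes the JobClient ship the file to the job's
> >> >         // staging dir, from where the DistributedCache localizes it
> >> >         // for every map task.
> >> >         job.getConfiguration().set("tmpfiles", values.toURI().toString());
> >> >     }
> >> >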
> >> > But here is a small problem. DistributedCache copies files only if it is

Eugene Morozov 2013-02-19, 12:26
Rohini Palaniswamy 2013-02-19, 21:39
Eugene Morozov 2013-02-20, 04:54