Pig and DistributedCache
Eugene Morozov 2013-02-04, 21:26
Hello, folks!
I'm using a heavily customized HBaseStorage in my Pig script. During HBaseStorage.setLocation() I prepare a file with values that serve as the source for my filter. The filter is used during HBaseStorage.getNext().
Since a Pig script is basically an MR job with many mappers, my values file must be accessible to all of my map tasks. DistributedCache is supposed to copy files across the cluster so that they are local to every map task. I don't want to write my file to HDFS in the first place, because there is no way to clean it up after the MR job is done (maybe you can point me in the right direction). On the other hand, if I write the file to the local file system under "/tmp", I can either call deleteOnExit() or just forget about it: Linux will take care of its local "/tmp".
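The local-file approach described above can be sketched with the JDK alone. This is only an illustration: the class name `LocalValuesFile`, the method `writeValues`, the `pairs-` prefix and the sample values are all made up for the example, and nothing here touches Hadoop or Pig.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Arrays;
import java.util.List;

final class LocalValuesFile {
    /** Writes one value per line to a new file in the JVM's temp directory. */
    static File writeValues(List<String> values) throws IOException {
        // "pairs-" and ".tmp" are illustrative name parts, not taken from the thread.
        File file = File.createTempFile("pairs-", ".tmp");
        // Ask the JVM to delete the file when it exits normally.
        file.deleteOnExit();
        Files.write(file.toPath(), values, StandardCharsets.UTF_8);
        return file;
    }

    public static void main(String[] args) throws IOException {
        File file = writeValues(Arrays.asList("row-prefix-1", "row-prefix-2"));
        int count = Files.readAllLines(file.toPath(), StandardCharsets.UTF_8).size();
        System.out.println(file.getAbsolutePath() + " holds " + count + " values");
    }
}
```

Note that deleteOnExit() only fires on a normal JVM shutdown, so a killed client process would still leave the file for the OS to clean up.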
But here is a small problem. DistributedCache copies files only if it is used with a command-line parameter like "-files". In that case GenericOptionsParser copies all the files, but the DistributedCache API itself only lets you set parameters in the jobConf; it doesn't actually do any copying.
I've found that GenericOptionsParser sets the property "tmpfiles", which JobClient uses to copy files before it runs the MR job. I've been able to set the same property in the jobConf from my HBaseStorage. It does the trick, but it's a hack. Is there a more correct way to achieve this?
Thanks in advance. -- Evgeny Morozov Developer Grid Dynamics Skype: morozov.evgeny www.griddynamics.com [EMAIL PROTECTED]
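A minimal sketch of the "tmpfiles" trick from this message, under stated assumptions. To keep it runnable without Hadoop on the classpath, `java.util.Properties` stands in for the job configuration; in a real loader the same get/set would presumably go against the job's configuration object. The helper name `addTmpFile` is hypothetical, and the comma-separated list of file URIs is an assumption about the value format that the thread itself does not spell out.

```java
import java.io.File;
import java.util.Properties;

final class TmpFilesProperty {
    // Property name taken from the message above.
    static final String TMPFILES = "tmpfiles";

    /** Appends the file's URI to the comma-separated "tmpfiles" value. */
    static void addTmpFile(Properties jobConf, File localFile) {
        // A file: URI for the local path (assumed value format).
        String uri = localFile.toURI().toString();
        String existing = jobConf.getProperty(TMPFILES);
        // Keep anything that is already registered instead of overwriting it.
        jobConf.setProperty(TMPFILES,
                existing == null || existing.isEmpty() ? uri : existing + "," + uri);
    }

    public static void main(String[] args) {
        Properties jobConf = new Properties(); // stand-in for the job configuration
        addTmpFile(jobConf, new File("/tmp/pairs-tmp"));
        addTmpFile(jobConf, new File("/tmp/other-values"));
        System.out.println(jobConf.getProperty(TMPFILES));
    }
}
```

Appending rather than overwriting matters if something else, such as "-files" on the command line, has already put entries into the same property.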
Re: Pig and DistributedCache
Rohini Palaniswamy 2013-02-06, 21:23
You should be fine using tmpfiles; that's the way to do it.
Otherwise you will have to copy the file to HDFS and call DistributedCache.addFileToClassPath yourself (basically what the tmpfiles setting is doing). But the problem there, as you mentioned, is cleaning up the HDFS file after the job completes. If you use tmpfiles, the file is copied to the job's staging directory in the user's home and is cleaned up automatically when the job completes. If the file is not going to change between jobs, I would advise creating it in HDFS once in a fixed location and reusing it across jobs, calling only DistributedCache.addFileToClassPath(). But if it is dynamic and differs from job to job, tmpfiles is your choice.
Regards, Rohini
Re: Pig and DistributedCache
Eugene Morozov 2013-02-07, 07:42
Rohini,
thank you for the reply.
Isn't it kind of a hack to use "tmpfiles"? It's neither an API nor a well-known practice; it's an internal detail. How safe is it to use such a trick? In a month or so we will probably update our CDH4 to whatever is current by then. Will it still work? Will it be safe for the cluster or for my job? Who knows what will be implemented there?
You see, I can read the code and find a solution like this, but I won't be able to keep all such tricks in mind and re-check them every time we update the cluster.
Re: Pig and DistributedCache
Eugene Morozov 2013-02-11, 06:26
Hi, again.
I've been able to use the trick with DistributedCache and "tmpfiles" successfully: while my Pig script runs, the files are copied by JobClient to the job cache.
But here is the issue. The files are there, but they have permission 700, and the user that runs the map task (I suppose it's hbase) doesn't have permission to read them. The files belong to my current OS user.
First, it looks like a bug, doesn't it? Second, what can I do about it?
Re: Pig and DistributedCache
Rohini Palaniswamy 2013-02-17, 04:22
Hi Eugene, Sorry. Missed your reply earlier.
tmpfiles has been around for a while and will not be removed from Hadoop anytime soon, so don't worry about it. The Hadoop configuration settings have never been fully documented; people look at the code and use them. Settings are usually deprecated for years before they are removed.
The file will be created with permissions based on the dfs.umaskmode setting (or fs.permissions.umask-mode in Hadoop 0.23/2.x), and the owner of the file will be the user who runs the Pig script. The map job will be launched as the same user by the Pig script. I don't understand what you mean by the user that runs the map task not having permissions. What kind of Hadoop authentication are you using such that the file is created as one user and the map job is launched as another?
Regards, Rohini
Re: Pig and DistributedCache
Eugene Morozov 2013-02-19, 12:26
Rohini,
Sorry for the confusion about these users in my previous e-mails. Here is a more thorough explanation of my issue.
This is what I got when I tried to run it.
The file was copied successfully using "tmpfiles":
2013-02-08 13:38:56,533 INFO org.apache.hadoop.hbase.filter.PrefixFuzzyRowFilterWithFile: File [/var/lib/hadoop-hdfs/cache/mapred/mapred/staging/vagrant/.staging/job_201302081322_0001/files/pairs-tmp#pairs-tmp] has been found
2013-02-08 13:38:56,539 ERROR org.apache.hadoop.hbase.filter.PrefixFuzzyRowFilterWithFile: Cannot read file: [/var/lib/hadoop-hdfs/cache/mapred/mapred/staging/vagrant/.staging/job_201302081322_0001/files/pairs-tmp#pairs-tmp]
org.apache.hadoop.security.AccessControlException: Permission denied: user=hbase, access=EXECUTE, inode="/var/lib/hadoop-hdfs/cache/mapred/mapred/staging/vagrant/.staging":vagrant:supergroup:drwx------
org.apache.hadoop.hbase.filter.PrefixFuzzyRowFilterWithFile is my filter; it just lives in an org.apache... package.
1. I have a user "vagrant", and this user runs the Pig script.
2. The client side then builds the filter, serializes it and sends it to the server side.
3. The RegionServer comes into play here: it deserializes the filter and tries to use it while reading the table.
4. The filter in turn tries to read the file, but since the RegionServer was started under the system user "hbase", the filter runs with that user's credentials and cannot access the file, which was written by another user.
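The denial in the log above is for EXECUTE on the `.staging` directory, not for READ on the file itself: the reader has to be allowed to traverse every directory on the way to the file. A simplified sketch of that owner/group/other check follows. It is a model for illustration, not HDFS's actual code; it ignores the HDFS superuser, the names `TraverseCheck` and `canTraverse` are hypothetical, and it assumes that "hbase" is not a member of "supergroup", which the thread does not state.

```java
import java.util.Collections;
import java.util.Set;

final class TraverseCheck {
    /**
     * Returns true if the user may traverse a directory, given its owner,
     * group and a symbolic mode string such as "drwx------".
     */
    static boolean canTraverse(String user, Set<String> userGroups,
                               String owner, String group, String mode) {
        // Positions of the execute bit in "drwxrwxrwx": owner=3, group=6, other=9.
        if (user.equals(owner)) {
            return mode.charAt(3) == 'x';
        }
        if (userGroups.contains(group)) {
            return mode.charAt(6) == 'x';
        }
        return mode.charAt(9) == 'x';
    }

    public static void main(String[] args) {
        Set<String> noGroups = Collections.emptySet(); // assumption: hbase is not in supergroup
        // Values copied from the AccessControlException in the log.
        System.out.println("hbase, drwx------: "
                + canTraverse("hbase", noGroups, "vagrant", "supergroup", "drwx------"));
        // The same directory if it were world-traversable.
        System.out.println("hbase, drwxr-xr-x: "
                + canTraverse("hbase", noGroups, "vagrant", "supergroup", "drwxr-xr-x"));
    }
}
```

Under this model, relaxing the permissions of the file alone would not be enough while a parent directory stays at drwx------.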
Any ideas of what to try?
Re: Pig and DistributedCache
Rohini Palaniswamy 2013-02-19, 21:39
Eugene, as I said earlier, you can use a different dfs.umaskmode. Running Pig with -Ddfs.umaskmode=022 will give read access to everyone (755 instead of 700). But all the files output by the Pig script will then have those permissions.
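The "755 instead of 700" figures follow from ordinary umask arithmetic: the umask names the permission bits to clear from the requested mode. A small worked sketch, assuming a requested mode of 777 (the class and method names are made up for the example, and it does not settle whether a given Hadoop version applies the umask to the staging directory itself):

```java
final class UmaskDemo {
    /** Clears the bits named by the umask from the requested mode. */
    static int applyUmask(int requestedMode, int umask) {
        return requestedMode & ~umask & 0777;
    }

    /** Formats a mode as three octal digits, e.g. 755. */
    static String octal(int mode) {
        return String.format("%03o", mode);
    }

    public static void main(String[] args) {
        int requested = 0777; // assumed starting mode
        System.out.println("umask 077 -> " + octal(applyUmask(requested, 077))); // prints "umask 077 -> 700"
        System.out.println("umask 022 -> " + octal(applyUmask(requested, 022))); // prints "umask 022 -> 755"
    }
}
```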
A better option would be to write the serialized file with more accessible permissions in this step from your list:
> 2. After that client side builds the filter, serialize it and move it to server side.
Regards, Rohini
Re: Pig and DistributedCache
Eugene Morozov 2013-02-20, 04:54
Rohini,
thanks a lot, I'll check the parameter.