Pig user mailing list: Pig and DistributedCache


Eugene Morozov 2013-02-04, 21:26
Re: Pig and DistributedCache
You should be fine using tmpfiles; that's the way to do it.

Otherwise you will have to copy the file to HDFS and call
DistributedCache.addFileToClassPath yourself (which is basically what the
tmpfiles setting does). But the problem there, as you mentioned, is cleaning up
the HDFS file after the job completes. If you use tmpfiles, the file is copied to
the job's staging directory under the user's home directory and gets cleaned up
automatically when the job completes. If the file is not going to change between
jobs, I would advise creating it once in a fixed HDFS location and reusing it
across jobs, calling only DistributedCache.addFileToClassPath(). But if it is
dynamic and differs from job to job, tmpfiles is your choice.
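
A minimal sketch of that fixed-location approach (assuming the Hadoop 1.x
org.apache.hadoop.filecache.DistributedCache API; the HDFS path and class
name below are only placeholders, not something from this thread):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FilterValuesCache {

    // Fixed, reusable HDFS location for the values file (placeholder path).
    private static final Path VALUES_PATH = new Path("/apps/myapp/filter-values.txt");

    // Upload the values file once, then just register it with the
    // DistributedCache for every subsequent job.
    public static void register(Configuration conf, Path localValuesFile)
            throws IOException {
        FileSystem fs = FileSystem.get(conf);
        if (!fs.exists(VALUES_PATH)) {
            // One-time copy; later jobs reuse the same HDFS file.
            fs.copyFromLocalFile(localValuesFile, VALUES_PATH);
        }
        // Ships the HDFS file to every task and puts it on the task classpath.
        DistributedCache.addFileToClassPath(VALUES_PATH, conf);
    }
}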

Regards,
Rohini
On Mon, Feb 4, 2013 at 1:26 PM, Eugene Morozov <[EMAIL PROTECTED]> wrote:

> Hello, folks!
>
> I'm using a heavily customized HBaseStorage in my Pig script.
> During HBaseStorage.setLocation() I prepare a file with values that
> will be the source for my filter. The filter is used during
> HBaseStorage.getNext().
>
> Since a Pig script is basically an MR job with many mappers, my values
> file must be accessible to all of my map tasks. DistributedCache should
> copy files across the cluster so that they are local to every map task.
> I don't want to write my file to HDFS in the first place, because there
> is no way to clean it up after the MR job is done (maybe you can point me
> in the right direction). On the other hand, if I write the file to the
> local file system under "/tmp", then I can either call deleteOnExit()
> or just forget about it; Linux will take care of its local "/tmp".
>
> But here is a small problem. DistributedCache copies files only when it is
> used with a command-line parameter like "-files". In that case
> GenericOptionsParser copies all the files, but the DistributedCache API
> itself only lets you set parameters in the jobConf; it doesn't actually do
> the copying.
>
> I've found that GenericOptionsParser sets the property "tmpfiles", which
> is used by JobClient to copy files before it runs the MR job. I've been
> able to set the same property in the jobConf from my HBaseStorage. It does
> the trick, but it feels like a hack.
> Is there a more correct way to achieve this?
>
> Thanks in advance.
> --
> Evgeny Morozov
> Developer Grid Dynamics
> Skype: morozov.evgeny
> www.griddynamics.com
> [EMAIL PROTECTED]
>
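
For reference, a minimal sketch of the tmpfiles trick Eugene describes in the
quoted message, set from a custom LoadFunc's setLocation() (the "tmpfiles"
property is the one GenericOptionsParser and JobClient use for -files; the
temp file name and the writeFilterValues() helper are hypothetical):

import java.io.File;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Inside a custom LoadFunc, e.g. the modified HBaseStorage subclass.
@Override
public void setLocation(String location, Job job) throws IOException {
    super.setLocation(location, job);

    // Write the filter values to the local file system; deleteOnExit()
    // (or the OS cleaning local /tmp) removes it on the client side.
    File values = File.createTempFile("filter-values", ".txt");
    values.deleteOnExit();
    writeFilterValues(values);  // hypothetical helper that fills the file

    // "tmpfiles" is the property GenericOptionsParser sets for -files.
    // JobClient copies anything listed here into the job's HDFS staging
    // directory and registers it with the DistributedCache; the staging
    // directory is cleaned up when the job completes.
    Configuration conf = job.getConfiguration();
    String uri = values.toURI().toString();
    String existing = conf.get("tmpfiles");
    conf.set("tmpfiles", existing == null ? uri : existing + "," + uri);
}
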
Also in this thread:
Eugene Morozov 2013-02-07, 07:42
Eugene Morozov 2013-02-11, 06:26
Rohini Palaniswamy 2013-02-17, 04:22
Eugene Morozov 2013-02-19, 12:26
Rohini Palaniswamy 2013-02-19, 21:39
Eugene Morozov 2013-02-20, 04:54