Pig >> mail # user >> Pig and DistributedCache

Pig and DistributedCache
Hello, folks!

I'm using a heavily customized HBaseStorage in my Pig script. During HBaseStorage.setLocation() I prepare a file of values that will serve as the source for my filter; the filter is applied later, while the map tasks read their data.

Since a Pig script is essentially an MR job with many mappers, my values file must be accessible to every map task. DistributedCache is supposed to copy files across the cluster so that they are local to each map task. I don't want to write my file to HDFS in the first place, because there is no way to clean it up after the MR job is done (maybe you can point me in the right direction). On the other hand, if I write the file to the local "/tmp", I can either call deleteOnExit() or simply forget about it: Linux will take care of its local "/tmp".
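The local-"/tmp" route can be sketched in plain Java (class and file names here are hypothetical, for illustration only): createTempFile() places the file under java.io.tmpdir, which is "/tmp" on a typical Linux box, and deleteOnExit() registers it for cleanup.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public class TmpValuesFile {
    // Create the values file under the local tmp dir (java.io.tmpdir,
    // usually /tmp on Linux) and register it for deletion on JVM exit.
    public static File createValuesFile(String contents) throws IOException {
        File f = File.createTempFile("filter-values-", ".txt");
        f.deleteOnExit(); // removed on normal JVM exit; /tmp reaping covers the rest
        Files.write(f.toPath(), contents.getBytes("UTF-8"));
        return f;
    }
}
```

deleteOnExit() only fires on a normal JVM shutdown, so the OS-level "/tmp" cleanup remains the backstop for crashed tasks.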

But here is a small problem. DistributedCache copies files only when it is invoked via a command-line parameter such as "-files". In that case GenericOptionsParser copies all the files, but the DistributedCache API itself only lets you set parameters in the jobConf; it doesn't actually do the copying.
I've found that GenericOptionsParser sets a property called "tmpfiles", which JobClient uses to copy files before it runs the MR job. I've been able to set the same property in the jobConf from my HBaseStorage. It does the trick, but it's a hack.
Is there a correct way to achieve this?
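For reference, the workaround amounts to something like the following inside the storage class (a sketch, assuming Hadoop 1.x behavior: "tmpfiles" is the comma-separated list GenericOptionsParser builds for "-files", and JobClient ships the listed local files to the cluster at submission time; the helper name is made up):

```java
import org.apache.hadoop.mapreduce.Job;

public class TmpFilesHack {
    // Append a local path to "tmpfiles" so JobClient uploads it
    // before the job runs, just as "-files" would.
    public static void shipLocalFile(Job job, String localPath) {
        String existing = job.getConfiguration().get("tmpfiles");
        String value = (existing == null || existing.isEmpty())
                ? "file://" + localPath
                : existing + ",file://" + localPath;
        job.getConfiguration().set("tmpfiles", value);
    }
}
```

Since "tmpfiles" is an undocumented internal property rather than public API, this is fragile across Hadoop versions, which is why it feels like a hack.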

Thanks in advance.
Evgeny Morozov
Developer, Grid Dynamics
Skype: morozov.evgeny
Rohini Palaniswamy 2013-02-06, 21:23
Eugene Morozov 2013-02-07, 07:42
Eugene Morozov 2013-02-11, 06:26
Rohini Palaniswamy 2013-02-17, 04:22
Eugene Morozov 2013-02-19, 12:26
Rohini Palaniswamy 2013-02-19, 21:39
Eugene Morozov 2013-02-20, 04:54