

Apache Pig UDF and Distributed Cache
Hi All,
I am trying to use the distributed cache in my UDF. I have the following file in HDFS that I want available locally to all my map tasks:
hadoop dfs -ls /scratch/
-rw-r--r--   1 userid supergroup    size date time /scratch/id_lookup
In my Pig script I pass the file path as a parameter:

ProcessedUI = FOREACH A GENERATE myparser.myUDF(param1, param2, '/scratch/id_lookup');
In my UDF, inside the exec function, I do the following:
lookup_file = (String) input.get(2);
I have implemented getCacheFiles as follows:

public List<String> getCacheFiles() {
    List<String> list = new ArrayList<String>(1);
    list.add(lookup_file + "#id_lookup");
    return list;
}
Now I try to read that file using standard Java I/O:

public void VectorizeData() throws IOException {
    FileReader fr = new FileReader("./id_lookup");
    BufferedReader brd = new BufferedReader(fr);
}
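For reference, here is a minimal, self-contained sketch of how I am trying to read the symlinked cache file once it is available on the task node. The tab-separated "key<TAB>value" format and the CacheFileReader/loadLookup names are just illustrative assumptions, not the actual id_lookup format:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class CacheFileReader {
    // Reads a lookup file into a map, one entry per line.
    // Assumption for illustration: each line is "key<TAB>value";
    // the real id_lookup format may differ.
    public static Map<String, String> loadLookup(String path) throws IOException {
        Map<String, String> lookup = new HashMap<String, String>();
        BufferedReader brd = new BufferedReader(new FileReader(path));
        try {
            String line;
            while ((line = brd.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) {
                    lookup.put(parts[0], parts[1]);
                }
            }
        } finally {
            brd.close();
        }
        return lookup;
    }
}
```

On the task node this would be called as loadLookup("./id_lookup"), since the #id_lookup fragment in getCacheFiles should make the cached file appear under that symlink in the task's working directory.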

I think I am not using it correctly (maybe the paths are messed up, etc.). I get the following exception:
2013-12-11 11:09:50,821 [JobControl] ERROR org.apache.hadoop.security.UserGroupInformation - PriviledgedActionException as:userid cause:java.io.FileNotFoundException: File does not exist: null
2013-12-11 11:09:51,291 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2013-12-11 11:09:51,301 [main] WARN  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
Any help on this would be great!