Pig >> mail # user >> Udfcachetest not working


Re: Udfcachetest not working
Matt Hayes of the DataFu project showed me this code, which works in both
local mode and Hadoop mode. This should be folded into getCacheFiles(), imo:

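// Fragment of a Pig EvalFunc<DataBag>. Needs java.io.*, java.util.*, and
// TokenizerME / TokenizerModel from opennlp.tools.tokenize.
public static final String MODEL_FILE = "MODEL_FILE";
private final String modelPath;  // model path passed to the constructor
private TokenizerME tokenizer;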

public Tokenize(String modelPath) {
  this.modelPath = modelPath;
}

@Override
public List<String> getCacheFiles() {
  List<String> list = new ArrayList<String>(1);
  list.add(this.modelPath + "#" + MODEL_FILE);
  return list;
}

public DataBag exec(Tuple input) throws IOException
{
  if (this.tokenizer == null) {
    initTokenizer();
  }

  // etc.
}

private void initTokenizer() throws IOException {
  String loadFile = getFilename();
  // try-with-resources closes the stream once the model has been read
  try (InputStream buffer = new BufferedInputStream(new FileInputStream(loadFile))) {
    TokenizerModel model = new TokenizerModel(buffer);
    this.tokenizer = new TokenizerME(model);
  }
}

private String getFilename() throws IOException {
  // If the symlink exists, use it; if not, use the raw path if it exists.
  // Note: this is to help with testing, as it seems the distributed cache
  // doesn't work with PigUnit.
  String loadFile = MODEL_FILE;
  if (!new File(loadFile).exists()) {
    if (new File(this.modelPath).exists()) {
      loadFile = this.modelPath;
    } else {
      throw new IOException(String.format(
          "could not load model, neither symlink %s nor file %s exist",
          MODEL_FILE, this.modelPath));
    }
  }
  return loadFile;
}
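
The symlink-then-raw-path fallback in getFilename() can also be tried outside Pig. What follows is only a minimal standalone sketch, not DataFu or Pig code: the class name CacheFileResolver and the helpers resolve() and cacheEntry() are hypothetical names made up for illustration. It assumes, as in the snippet above, that on a cluster the cached file shows up in the task's working directory under the name after the "#", and that in local mode only the original path exists.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical helper for illustration; not part of Pig or DataFu.
final class CacheFileResolver {

    // Builds a "path#symlink" entry of the kind returned from getCacheFiles().
    static String cacheEntry(String path, String symlinkName) {
        return path + "#" + symlinkName;
    }

    // Prefers the symlink name (assumed to exist when the distributed cache
    // is active) and falls back to the raw path (local mode, PigUnit).
    static String resolve(String symlinkName, String rawPath) throws IOException {
        if (new File(symlinkName).exists()) {
            return symlinkName;
        }
        if (new File(rawPath).exists()) {
            return rawPath;
        }
        throw new IOException(String.format(
            "could not load model, neither symlink %s nor file %s exist",
            symlinkName, rawPath));
    }

    public static void main(String[] args) throws IOException {
        // A temp file stands in for the model; no symlink exists here,
        // so resolve() should fall back to the raw path.
        File raw = File.createTempFile("model", ".bin");
        raw.deleteOnExit();
        System.out.println(cacheEntry(raw.getPath(), "MODEL_FILE"));
        System.out.println(resolve("MODEL_FILE_missing", raw.getPath()));
    }
}
```

Keeping the existence checks in one small method means the same UDF code path runs in local mode, in PigUnit, and on a cluster; only the file that happens to exist differs.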

On Mon, Jan 6, 2014 at 12:39 PM, Russell Jurney <[EMAIL PROTECTED]> wrote:

> According to https://issues.apache.org/jira/browse/PIG-1752 :
>
> "One other note. I didn't include any unit tests with this patch. I don't
> know how to test it in the unit tests since the distributed cache isn't
> used in local mode. I've tested it on a cluster. Any thoughts on how to
> include tests for this in the unit tests are welcome."
>
> getCacheFiles() does not work in local mode. This is problematic. How do I
> write a UDF that works in both local mode and Hadoop mode?
>
>
> On Mon, Jan 6, 2014 at 12:08 PM, Russell Jurney <[EMAIL PROTECTED]> wrote:
>
>> Question: in local mode, can the path given to getCacheFiles() be on the
>> local filesystem? Or does it have to be on HDFS?
>>
>>
>> On Mon, Jan 6, 2014 at 11:29 AM, Russell Jurney <[EMAIL PROTECTED]> wrote:
>>
>>> 1. I've also given it an absolute local path. I don't know what you mean
>>> by an absolute cache path. How do I know what that is? The examples use
>>> ./link to access the cached file.
>>> 2. Because all examples do so. What paths should I use to access the
>>> distributed cache from inside exec?
>>>
>>> The exception does say that passwd is missing. But as I read the examples,
>>> it should be there.
>>>
>>> On Monday, January 6, 2014, Serega Sheypak wrote:
>>>
>>>> Yes, it works. The exception clearly says that ./passwd is missing.
>>>> 1. Try giving an absolute path to the file and see if it works. It should.
>>>> 2. Then give a relative path. It looks like you are providing the relative
>>>> path incorrectly. Why do you put "./" before the filename?
>>>>
>>>>
>>>> 2014/1/6 Russell Jurney <[EMAIL PROTECTED]>
>>>>
>>>> > I have implemented the class below to test the UDF cache, and it fails in
>>>> > local mode with the error below. The cache should work in local mode as
>>>> > well, right?
>>>> >
>>>> > ------------
>>>> >
>>>> > org.apache.pig.backend.executionengine.ExecException: ERROR 2078: Caught
>>>> > error from UDF: datafu.pig.text.Udfcachetest [./passwd (No such file or
>>>> > directory)]
>>>> >
>>>> > at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:358)
>>>> > at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNextString(POUserFunc.java:432)
>>>> > at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:315)
>>>> > at
>>>
Russell Jurney twitter.com/rjurney [EMAIL PROTECTED] datasyndrome.com