Re: Udfcachetest not working
Matt Hayes of the DataFu project showed me this code, which works in both
local mode and hadoop mode. This should be folded into getCacheFiles(), imo:

public static final String MODEL_FILE = "MODEL_FILE";
private final String modelPath;
private TokenizerME tokenizer;

public Tokenize(String modelPath) {
  this.modelPath = modelPath;
}

@Override
public List<String> getCacheFiles() {
  List<String> list = new ArrayList<String>(1);
  list.add(this.modelPath + "#" + MODEL_FILE);
  return list;
}

public DataBag exec(Tuple input) throws IOException
{
  if (this.tokenizer == null) {
    initTokenizer();
  }

  // etc.
}

private void initTokenizer() throws IOException {
  String loadFile = getFilename();
  // try-with-resources closes the stream once the model has been loaded
  try (InputStream buffer = new BufferedInputStream(new FileInputStream(loadFile))) {
    TokenizerModel model = new TokenizerModel(buffer);
    this.tokenizer = new TokenizerME(model);
  }
}

private String getFilename() throws IOException {
  // if the symlink exists, use it; if not, use the raw name if it exists
  // note: this is to help with testing, as it seems distributed cache
  // doesn't work with PigUnit
  String loadFile = MODEL_FILE;
  if (!new File(loadFile).exists()) {
    if (new File(this.modelPath).exists()) {
      loadFile = this.modelPath;
    } else {
      throw new IOException(String.format(
          "could not load model, neither symlink %s nor file %s exist",
          MODEL_FILE, this.modelPath));
    }
  }
  return loadFile;
}
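
The fallback in getFilename() can also be pulled out of the UDF so it can be
checked without Pig or OpenNLP on the classpath. What follows is only a
minimal sketch: the class name CacheFileResolver and its resolve() method are
hypothetical and are not part of Pig or DataFu. It assumes, as the code above
does, that on a cluster the cached file appears under its symlink name in the
task's working directory.

```java
import java.io.File;
import java.io.IOException;

/** Hypothetical helper (not part of Pig or DataFu) mirroring getFilename() above. */
final class CacheFileResolver {
    private CacheFileResolver() {}

    /**
     * Returns symlinkName if a file with that name exists in the working
     * directory, otherwise rawPath if that file exists, otherwise throws.
     */
    static String resolve(String symlinkName, String rawPath) throws IOException {
        if (new File(symlinkName).exists()) {
            return symlinkName;   // symlink present, e.g. created by the distributed cache
        }
        if (new File(rawPath).exists()) {
            return rawPath;       // no symlink, e.g. local mode or PigUnit
        }
        throw new IOException(String.format(
            "could not load model, neither symlink %s nor file %s exist",
            symlinkName, rawPath));
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a model file on the local filesystem.
        File model = File.createTempFile("model", ".bin");
        model.deleteOnExit();
        // Prints whichever of the two locations was found first.
        System.out.println(resolve("MODEL_FILE", model.getPath()));
    }
}
```

With something like this, getFilename() would reduce to a single call,
resolve(MODEL_FILE, this.modelPath), and the fallback could be unit tested on
its own.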

On Mon, Jan 6, 2014 at 12:39 PM, Russell Jurney <[EMAIL PROTECTED]> wrote:

> According to https://issues.apache.org/jira/browse/PIG-1752 :
>
> "One other note. I didn't include any unit tests with this patch. I don't
> know how to test it in the unit tests since the distributed cache isn't
> used in local mode. I've tested it on a cluster. Any thoughts on how to
> include tests for this in the unit tests are welcome."
>
> getCacheFiles() does not work in local mode. This is problematic. How do I
> write a UDF that works in both local mode and hadoop mode?
>
>
> On Mon, Jan 6, 2014 at 12:08 PM, Russell Jurney <[EMAIL PROTECTED]> wrote:
>
>> Question: in local mode, can the path given to getCacheFiles() be on the
>> local filesystem? Or does it have to be on HDFS?
>>
>>
>> On Mon, Jan 6, 2014 at 11:29 AM, Russell Jurney <[EMAIL PROTECTED]> wrote:
>>
>>> 1. I've also given it an absolute local path. I don't know what you mean
>>> by an absolute cache path. How do I know what that is? The examples use
>>> ./link to access the cached file.
>>> 2. Because all examples do so. What paths should I use to access the
>>> distributed cache from inside exec?
>>>
>>> The exception does say that passwd is missing. But as I read the examples,
>>> it should be there.
>>>
>>> On Monday, January 6, 2014, Serega Sheypak wrote:
>>>
>>>> Yes, it works. The exception clearly says that ./passwd is missing.
>>>> 1. Try giving an absolute path to the file and see if it works. It should.
>>>> 2. Then give a relative path. It looks like you are providing the relative
>>>> path incorrectly. Why do you put "./" before the filename?
>>>>
>>>>
>>>> 2014/1/6 Russell Jurney <[EMAIL PROTECTED]>
>>>>
>>>> > I have implemented the class below to test the UDF cache, and it fails in
>>>> > local mode with the error below. The cache should work in local mode as
>>>> > well, right?
>>>> >
>>>> > ------------
>>>> >
>>>> > org.apache.pig.backend.executionengine.ExecException: ERROR 2078: Caught
>>>> > error from UDF: datafu.pig.text.Udfcachetest [./passwd (No such file or
>>>> > directory)]
>>>> >
>>>> > at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:358)
>>>> > at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNextString(POUserFunc.java:432)
>>>> > at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:315)
>>>> >
>>>> > at
>>>> >
>>>
Russell Jurney twitter.com/rjurney [EMAIL PROTECTED] datasyndrome.com