Pig >> mail # user >> Udfcachetest not working


Russell Jurney 2014-01-06, 04:41
Serega Sheypak 2014-01-06, 08:04
Russell Jurney 2014-01-06, 19:29
Russell Jurney 2014-01-06, 20:08
Russell Jurney 2014-01-06, 20:39
Russell Jurney 2014-01-06, 21:17
Re: Udfcachetest not working
Hi, so did you solve the problem?
I suppose you understand the idea of the distributed cache. It doesn't matter
whether you run in local or distributed mode. The idea is that you access the
file through the local file system.
It's better to use Oozie in prod; it places files in the distributed cache
for you.

Here is an example:

<action name="an-action-with-pig-script">
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${output_path}" />
            </prepare>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>default</value>
                </property>
                <property>
                    <name>pig.exec.reducers.bytes.per.reducer</name>
                    <value>50000000</value>
                </property>
                <!-- more conf .... -->
            </configuration>

            <script>pig/my_script.pig</script>

            <!-- See file tag -->
            <param>urlPath=./source_url</param>

            <param>in_dir=${in_dir}</param>
            <param>output=${output_path}</param>

            <!-- put your UDF in the dist cache -->
            <param>udf=my_jython_udf.py</param>
            <file>pig/udf/my_jython_udf.py#my_jython_udf.py</file>

            <!-- put your file in the dist cache; see urlPath=./source_url -->
            <file>${source_url_in_dir}/part-r-00000.avro#source_url</file>

        </pig>
        <ok to="some-next-action"/>
        <error to="kill"/>
    </action>
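
On the UDF side, a file registered as <file>path#name</file> is linked into
the task's working directory under that name, so it can be opened with plain
local file I/O. A minimal sketch of that idea, assuming a plain-text file
(the Avro file above would need an Avro reader instead); the class and
method names here are hypothetical and not part of Pig or Oozie:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper, for illustration only.
final class DistCacheRead {
    private DistCacheRead() {}

    // Reads a text file that the distributed cache has linked into
    // workingDir under symlinkName. Only the local file system is touched.
    static List<String> readLines(Path workingDir, String symlinkName) throws IOException {
        Path local = workingDir.resolve(symlinkName);
        // Files.exists follows symbolic links by default.
        if (!Files.exists(local)) {
            throw new IOException("no cached file named " + symlinkName
                    + " in " + workingDir.toAbsolutePath());
        }
        return Files.readAllLines(local, StandardCharsets.UTF_8);
    }
}
```

Inside a task the working directory is ".", so a call would look like
DistCacheRead.readLines(Paths.get("."), "source_url").
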
2014/1/7 Russell Jurney <[EMAIL PROTECTED]>

> Matt Hayes of the DataFu project showed me this code, which works in local
> mode and in hadoop mode. This should be folded into getCacheFiles(), imo:
>
> public static final String MODEL_FILE = "MODEL_FILE";
> private final String modelPath;  // path passed to the UDF constructor
> private TokenizerME tokenizer;
>
> public Tokenize(String modelPath) {
>   this.modelPath = modelPath;
> }
>
> @Override
> public List<String> getCacheFiles() {
>   List<String> list = new ArrayList<String>(1);
>   list.add(this.modelPath + "#" + MODEL_FILE);
>   return list;
> }
>
> public DataBag exec(Tuple input) throws IOException
> {
>   if (this.tokenizer == null) {
>     initTokenizer();
>   }
>
>   // etc.
> }
>
> private void initTokenizer() throws IOException {
>   String loadFile = getFilename();
>   InputStream buffer = new BufferedInputStream(new FileInputStream(loadFile));
>   try {
>     TokenizerModel model = new TokenizerModel(buffer);
>     this.tokenizer = new TokenizerME(model);
>   } finally {
>     // close the stream once the model has been read
>     buffer.close();
>   }
> }
>
> private String getFilename() throws IOException {
>   // If the symlink exists, use it; if not, use the raw path if it exists.
>   // Note: this is to help with testing, as it seems the distributed cache
>   // doesn't work with PigUnit.
>   String loadFile = MODEL_FILE;
>   if (!new File(loadFile).exists()) {
>     if (new File(this.modelPath).exists()) {
>       loadFile = this.modelPath;
>     } else {
>       throw new IOException(String.format(
>           "could not load model, neither symlink %s nor file %s exist",
>           MODEL_FILE, this.modelPath));
>     }
>   }
>   return loadFile;
> }
>
>
>
> On Mon, Jan 6, 2014 at 12:39 PM, Russell Jurney <[EMAIL PROTECTED]> wrote:
>
> > According to https://issues.apache.org/jira/browse/PIG-1752 :
> >
> > "One other note. I didn't include any unit tests with this patch. I don't
> > know how to test it in the unit tests since the distributed cache isn't
> > used in local mode. I've tested it on a cluster. Any thoughts on how to
> > include tests for this in the unit tests are welcome."
> >
> > getCacheFiles() does not work with local mode. This is problematic. How do
> > I write a UDF that works in both local mode and hadoop mode?
> >
> >
> > On Mon, Jan 6, 2014 at 12:08 PM, Russell Jurney <[EMAIL PROTECTED]> wrote:
> >
> >> Question: in local mode, can the path given to getCacheFiles() be on the
> >> local filesystem? Or does it have to be on HDFS?
> >>
> >>
> >> On Mon, Jan 6, 2014 at 11:29 AM, Russell Jurney <