Pig >> mail # user >> null:: prefix for field names and a WARN when registering a Python UDF


Re: null:: prefix for field names and a WARN when registering a Python UDF
Hi Alan,

>> When I register this UDF an unexpected warning pops up which I'm going
>> to ignore for now (unless someone says this is important):

Yes, you can usually ignore messages like this one; only ERROR messages
need attention. If the log output bothers you, you can redirect stderr to a
file (e.g. 2>errors.txt).

>> The other strange thing is *null::* gets prepended to each field name.
>> This is mostly annoying, and, in the case of JsonStorage(), clutters things
>> unnecessarily. Is there a way to resolve this?

"null::" is prepended because a Python UDF returns a tuple, but the tuple
itself is not given a name, so there is no alias to use as the prefix when
it is flattened. If you change the outputSchema of your UDF to something
like this:

@outputSchema("t:( < field schemas here > )")

You will see "t::" is prepended instead.
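
As a minimal sketch of what a named tuple schema looks like in a UDF file: the function split_pair and its two fields below are hypothetical and only stand in for your real aprs() function. The fallback outputSchema definition is also an assumption, added so the file can be tried in a plain Python interpreter when Pig has not supplied the decorator.

```python
# Sketch of a Jython UDF whose returned tuple is named "t".
# Pig's Jython engine normally supplies the outputSchema decorator; the
# fallback below is a hypothetical stand-in used only when it is missing.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorate(func):
            # Record the schema string on the function object.
            func.outputSchema = schema
            return func
        return decorate


@outputSchema("t:(left:chararray,right:chararray)")
def split_pair(line):
    """Hypothetical example UDF: split 'a>b' into the tuple ('a', 'b')."""
    left, _, right = line.partition(">")
    return (left, right)
```

With a schema like this, FLATTEN(myudf.split_pair(line)) should DESCRIBE with t::left and t::right rather than null::left and null::right.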

You can also remove the prefix by adding another FOREACH that renames
every field with an AS clause. That is,

aprs = FOREACH raw GENERATE FLATTEN(myudf.aprs(line));
aprs_cleaned = FOREACH aprs GENERATE time AS time, from_call AS from_call,
<and other fields>;

This is somewhat annoying if there are a lot of fields, as in your example.
In fact, there is a JIRA to add a built-in UDF that removes the prefixes:
https://issues.apache.org/jira/browse/PIG-3088. I will probably rebase the
patch and get it committed.
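
In the meantime, the AS list can be generated instead of typed by hand. The helper below is only a sketch: rename_clause is a hypothetical name, the aliases cleaned and flattened are placeholders, and it assumes a flat schema string like the ones in your outputSchema decorators (no nested tuples or bags).

```python
def rename_clause(schema):
    """Build 'a AS a, b AS b, ...' from a flat 'a:type,b:type' schema string."""
    # Keep only the field name to the left of each ':'.
    names = [field.split(":", 1)[0].strip() for field in schema.split(",")]
    return ", ".join("%s AS %s" % (name, name) for name in names)


# Schema string taken from the position() UDF in the original message.
schema = "latitude:double,longitude:double,ambiguity:double,course:double,speed:double"

# 'cleaned' and 'flattened' are placeholder relation names.
print("cleaned = FOREACH flattened GENERATE %s;" % rename_clause(schema))
```

The printed statement can be pasted into the Pig script after the FLATTEN step.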

Thanks,
Cheolsoo

On Sat, Jun 8, 2013 at 12:01 PM, Alan Crosswell <[EMAIL PROTECTED]> wrote:

> Hello,
>
> I'm new to Pig and am having a few small problems that I'd appreciate some
> help with. I'm using Pig-0.11.1 after 0.9.2 just plain didn't work right
> with my Python UDF.
>
> I am using a Python UDF that has two functions with the following
> outputSchema:
>
>
> @outputSchema("time:chararray,from_call:chararray,to_call:chararray,digis:chararray,gtype:chararray,gate:chararray,info:chararray,firsthop:chararray")
> def aprs(l):
>   ...
>
> and
>
>
> @outputSchema("latitude:double,longitude:double,ambiguity:double,course:double,speed:double")
> def position(to_call,info):
>   ...
>
> When I register this UDF an unexpected warning pops up which I'm going to
> ignore for now (unless someone says this is important):
>
> grunt> *Register 's3n://n2ygk/aprspig.py' using jython as myudf;*
> 2013-06-08 18:38:03,990 [main] INFO
>  org.apache.hadoop.fs.s3native.NativeS3FileSystem - Opening
> 's3n://n2ygk/aprspig.py' for reading
> 2013-06-08 18:38:04,118 [main] INFO
>  org.apache.hadoop.util.NativeCodeLoader - Loaded the native-hadoop library
> 2013-06-08 18:38:04,175 [main] INFO
>  org.apache.pig.scripting.jython.JythonScriptEngine - created tmp
> python.cachedir=/tmp/pig_jython_6851471253258374122
> 2013-06-08 18:38:08,576 [main] WARN
>  org.apache.pig.scripting.jython.JythonScriptEngine -
> pig.cmd.args.remainders is empty. This is not expected unless on testing.
> 2013-06-08 18:38:11,981 [main] INFO
>  org.apache.pig.scripting.jython.JythonScriptEngine - Register scripting
> UDF: myudf.position
> 2013-06-08 18:38:11,984 [main] INFO
>  org.apache.pig.scripting.jython.JythonScriptEngine - Register scripting
> UDF: myudf.aprs
>
> The other strange thing is *null::* gets prepended to each field name. This
> is mostly annoying, and, in the case of JsonStorage(), clutters
> things unnecessarily. Is there a way to resolve this?
>
> grunt> *aprs = FOREACH raw GENERATE FLATTEN(myudf.aprs(line));*
> 2013-06-08 01:06:37,324 [main] INFO
>  org.apache.pig.scripting.jython.JythonFunction - Schema
> 'time:chararray,from_call:chararray,to_call:chararray,digis:chararray,gtype:chararray,gate:chararray,info:chararray,firsthop:chararray'
> defined for func aprs
> grunt> *DESCRIBE aprs;*
> aprs: {null::time: chararray,null::from_call: chararray,null::to_call:
> chararray,null::digis: chararray,null::gtype: chararray,null::gate:
> chararray,null::info: chararray,null::firsthop: chararray}
>
> Is my UDF being defined or invoked incorrectly to result in the null:: or
> is this just a feature?
>
> This is just annoying but I'd appreciate any pointers on how to make it go
> away.
>
> Thanks.