Pig >> mail # user >> UDF property passing


Jeremy Hanna 2011-07-06, 16:42
Dmitriy Ryaboy 2011-07-06, 17:47
Jeremy Hanna 2011-07-07, 02:20
Raghu Angadi 2011-07-07, 04:10
Jeremy Hanna 2011-07-07, 07:24
Re: UDF property passing
What is the guidance here on using member variables when implementing UDFs and passing properties?  That is, what are the semantics for using them to store properties for a UDF instance?  The docs talk a lot about making sure that no side effects happen from multiple calls to a UDF instance, but it is not clear whether that refers to things like changing the Location for a given instance of a UDF or just to calling it multiple times.  PigStorage suggests member variables are fine (since it keeps a member var location), but the UDFContext docs suggest that one keep all state in the UDFContext under an appropriate signature.

See also https://issues.apache.org/jira/browse/CASSANDRA-2869 for another case where this has reared its head in an improper implementation.

-Grant

On Jul 7, 2011, at 3:24 AM, Jeremy Hanna wrote:

>
> On Jul 6, 2011, at 11:10 PM, Raghu Angadi wrote:
>
>> On Wed, Jul 6, 2011 at 7:20 PM, Jeremy Hanna <[EMAIL PROTECTED]>wrote:
>>
>>>
>>> On Jul 6, 2011, at 12:47 PM, Dmitriy Ryaboy wrote:
>>>
>>>> I think this is the same problem we were having earlier:
>>>> http://hadoop.markmail.org/thread/kgxhdgw6zdmadch4
>>>>
>>>> One workaround is to use defines to explicitly create different
>>>> instances of your UDF, and use them separately.. it's ugly but it
>>>> works.
>>>
>>> Thanks Dmitriy.
>>>
>>> I tried doing something like:
>>> define ToCassandraBag1 org.pygmalion.udf.ToCassandraBag();
>>> define ToCassandraBag2 org.pygmalion.udf.ToCassandraBag();
>>>
>>
>> This still does not work since you can't distinguish the two. The way I was
>> thinking of doing this is to let the user pass in some unique string as a
>> substitute for context:
>>
>> define ToCassandraBag1 ToCassandraBag('1');
>> define ToCassandraBag2 ToCassandraBag('2');
>
> Ah yes.  I had misunderstood.  Thanks for the clarification.  Now the pig docs also make more sense in the Passing Configurations to UDFs section:
> http://pig.apache.org/docs/r0.8.1/udf.html#Passing+Configurations+to+UDFs
> It says:
> "The UDF can pass its constructor arguments, or some other identifying strings. This allows each instantiation of the UDF to have a different properties object thus avoiding name space collisions between instantiations of the UDF."
> and the HBaseStorage example was helpful to see that in action.
>
> Thanks both to Raghu and Dmitriy.
>
>>
>> inside the UDF, you would use this arg to make a 'contextString' (see
>> HBaseStorage.java for example use) to store any state.
>>
>> ideally UDFs shouldn't have to do this.. They should have the same context
>> info that is available for loaders and storage.
>>
>> Raghu.
>>
>>
>>>
>>> at the top and then using each one only once.  That still produces the same
>>> error.  I guess in this case we'll just have to require the field names be
>>> entered into the UDF and it won't introspect them.  Ah well.  Would be nice
>>> to be able to use it but I don't really see another way around this bug with
>>> the shared UDF context.
>>>
>>>>
>>>> D
>>>>
>>>> On Wed, Jul 6, 2011 at 9:42 AM, Jeremy Hanna <[EMAIL PROTECTED]>
>>> wrote:
>>>>> We have a UDF that introspects the output schema, gets the field
>>> names there, and uses them in the exec method.
>>>>>
>>>>> The UDF is found here:
>>> https://github.com/jeromatron/pygmalion/blob/master/udf/src/main/java/org/pygmalion/udf/ToCassandraBag.java
>>>>>
>>>>> A simple example is found here:
>>> https://github.com/jeromatron/pygmalion/blob/master/scripts/from_to_cassandra_bag_example.pig
>>>>>
>>>>> It takes the relation's aliases and uses them in the output so that the
>>> user doesn't have to specify them.  However we've been noticing that if you
>>> have more than one ToCassandraBag call in a pig script, sometimes they are
>>> run at the same time and the key is the same in the UDF context:
>>> cassandra.input_field_schema.  So we think there is an issue there (array
>>> out of bounds exceptions when running the script, but when running in grunt
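
The workaround Raghu describes above (pass each UDF definition a unique string and store state under it) can be sketched roughly as follows. This is a self-contained illustration only, not Pig code: `SharedContext` and `BagUdf` are hypothetical stand-ins, and the key format is an assumption. In Pig itself, `UDFContext.getUDFContext().getUDFProperties(Class, String[])` serves the role that `propertiesFor` plays here, as the docs quoted above describe.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

/** Hypothetical stand-in for a job-wide context shared by every UDF instance. */
final class SharedContext {
    private final Map<String, Properties> store = new HashMap<>();

    /** Returns the Properties for (class, identifying strings), creating it on first use. */
    Properties propertiesFor(Class<?> udfClass, String... args) {
        // Assumed key format: class name plus the identifying strings.
        String key = udfClass.getName() + Arrays.toString(args);
        return store.computeIfAbsent(key, k -> new Properties());
    }
}

/** Hypothetical UDF that keeps its field names under a per-definition signature. */
final class BagUdf {
    static final String SCHEMA_KEY = "cassandra.input_field_schema";

    private final SharedContext context;
    private final String signature;

    BagUdf(SharedContext context, String signature) {
        this.context = context;
        this.signature = signature;
    }

    /** Stores the field names in this instance's own Properties object. */
    void rememberFields(String fields) {
        context.propertiesFor(BagUdf.class, signature).setProperty(SCHEMA_KEY, fields);
    }

    /** Reads the field names back, or null if none were stored. */
    String fields() {
        return context.propertiesFor(BagUdf.class, signature).getProperty(SCHEMA_KEY);
    }
}

class SignatureDemo {
    public static void main(String[] args) {
        SharedContext context = new SharedContext();

        // Same signature: both instances share one Properties object and collide.
        BagUdf clashA = new BagUdf(context, "");
        BagUdf clashB = new BagUdf(context, "");
        clashA.rememberFields("key,col_a");
        clashB.rememberFields("key,col_b,col_c");
        System.out.println("shared signature, first instance sees: " + clashA.fields());

        // Distinct signatures, as in define ToCassandraBag1 ToCassandraBag('1');
        BagUdf first = new BagUdf(context, "1");
        BagUdf second = new BagUdf(context, "2");
        first.rememberFields("key,col_a");
        second.rememberFields("key,col_b,col_c");
        System.out.println("unique signature, first instance sees: " + first.fields());
    }
}
```

With a shared signature the second call overwrites the first instance's field names, which matches the kind of collision reported in this thread; with distinct signatures each instance reads back only what it stored.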

Grant Ingersoll
Jeremy Hanna 2011-07-08, 17:19
Raghu Angadi 2011-07-08, 19:21
Grant Ingersoll 2011-07-08, 21:48
Raghu Angadi 2011-07-09, 18:05