Re: UDF property passing
On Wed, Jul 6, 2011 at 7:20 PM, Jeremy Hanna <[EMAIL PROTECTED]> wrote:

> On Jul 6, 2011, at 12:47 PM, Dmitriy Ryaboy wrote:
> > I think this is the same problem we were having earlier:
> > http://hadoop.markmail.org/thread/kgxhdgw6zdmadch4
> >
> > One workaround is to use defines to explicitly create different
> > instances of your UDF, and use them separately.. it's ugly but it
> > works.
> Thanks Dmitriy.
> I tried doing something like:
> define ToCassandraBag1 org.pygmalion.udf.ToCassandraBag();
> define ToCassandraBag2 org.pygmalion.udf.ToCassandraBag();

This still does not work since you can't distinguish the two. The way I was
thinking of doing this is to let the user pass in some unique string as a
substitute for context:

define ToCassandraBag1 ToCassandraBag('1');
define ToCassandraBag2 ToCassandraBag('2');

inside the UDF, you would use this arg to make a 'contextString' (see
HBaseStorage.java for example use) to store any state.

ideally UDFs shouldn't have to do this.. They should have the same context
info that is available for loaders and storage.
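
A minimal sketch of the idea above, written against plain java.util.Properties
rather than Pig's real UDFContext so it runs on its own. The class names
(ContextKeyDemo, BagUdf) and the comma-separated schema format are assumptions
for illustration only, not the actual ToCassandraBag code. In a real UDF the
constructor argument would presumably be passed along to something like
UDFContext.getUDFContext().getUDFProperties(getClass(), new String[] {signature}),
as HBaseStorage does with its context signature.

```java
import java.util.Properties;

public class ContextKeyDemo {
    // Stand-in for the properties object that all UDF instances share.
    public static final Properties SHARED = new Properties();

    // Base key taken from the thread; every instance appends its own suffix.
    public static final String BASE_KEY = "cassandra.input_field_schema";

    // Hypothetical UDF: the constructor argument is the unique string
    // supplied by the script, e.g. define ToCassandraBag1 ToCassandraBag('1');
    public static final class BagUdf {
        private final String contextKey;

        public BagUdf(String signature) {
            this.contextKey = BASE_KEY + "." + signature;
        }

        // Roughly what outputSchema() would do: remember the field names.
        public void storeFieldNames(String commaSeparatedNames) {
            SHARED.setProperty(contextKey, commaSeparatedNames);
        }

        // Roughly what exec() would do: read the field names back.
        public String[] loadFieldNames() {
            return SHARED.getProperty(contextKey).split(",");
        }
    }

    public static void main(String[] args) {
        BagUdf first = new BagUdf("1");
        BagUdf second = new BagUdf("2");

        // Without the suffix, the second call would overwrite the first.
        first.storeFieldNames("rowkey,name,age");
        second.storeFieldNames("rowkey,score");

        System.out.println(first.loadFieldNames().length);
        System.out.println(second.loadFieldNames().length);
    }
}
```

With a distinct suffix per define, each instance reads back the schema it
stored, so two calls in one script no longer stomp on each other.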

> at the top and then using each one only once.  That still produces the same
> error.  I guess in this case we'll just have to require the field names be
> entered into the UDF and it won't introspect them.  Ah well.  Would be nice
> to be able to use it but I don't really see another way around this bug with
> the shared UDF context.
> >
> > D
> >
> > On Wed, Jul 6, 2011 at 9:42 AM, Jeremy Hanna <[EMAIL PROTECTED]> wrote:
> >> We have a UDF that introspects the output schema and gets the field
> >> names there and use that in the exec method.
> >>
> >> The UDF is found here:
> >> https://github.com/jeromatron/pygmalion/blob/master/udf/src/main/java/org/pygmalion/udf/ToCassandraBag.java
> >>
> >> A simple example is found here:
> >> https://github.com/jeromatron/pygmalion/blob/master/scripts/from_to_cassandra_bag_example.pig
> >>
> >> It takes the relation's aliases and uses them in the output so that the
> >> user doesn't have to specify them.  However we've been noticing that if
> >> you have more than one ToCassandraBag call in a pig script, sometimes
> >> they are run at the same time and the key is the same in the UDF
> >> context: cassandra.input_field_schema.  So we think there is an issue
> >> there (array out of bounds exceptions when running the script, but when
> >> running in grunt one at a time, it doesn't do that).
> >>
> >> Is there a right way to do this parameter passing so that we don't get
> >> these errors when multiple calls exist?
> >>
> >> We thought of using the schema hash code as a suffix (e.g.
> >> cassandra.input_field_schema.12344321) but we don't have access to the
> >> schema in the exec method.
> >>
> >> We thought of having the first parameter of the input tuple be a unique
> >> name that the script specifies, like 'filename.relationalias' as a
> >> convention to make them unique to the file.  However in the
> >> outputSchema, we don't have access to the input tuple, just the schema
> >> itself, so it couldn't get that value in there.
> >>
> >> Any ideas on how to make this so it doesn't stomp on each other within
> >> the pig script?  Is there a best way to do that?
> >>
> >> Thanks!
> >>
> >> Jeremy