Pig >> mail # user >> Holding onto info when doing a udf on a bag


Re: Holding onto info when doing a udf on a bag
Hi Jonathan,
It's input.getField(1).schema (getField returns a Schema.FieldSchema, and its schema member holds the inner Schema).
You can get the schema of your input by overriding Schema outputSchema(Schema), but it looks like you figured that out.
outputSchema is called on the client side, so if you want to use the input schema in exec(Tuple), you need to pass it through the UDF context:
Properties properties = UDFContext.getUDFContext().getUDFProperties(this.getClass());
properties.put("inputSchema", inputSchema);
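A minimal sketch of that handoff, written as a plain-Java model rather than a real Pig UDF. A java.util.Properties object stands in for the one returned by getUDFProperties, the class and method names are hypothetical, and storing the schema as a String is an assumption made only to keep the example self-contained:

```java
import java.util.Properties;

// Plain-Java model of the handoff; no Pig classes are used here.
class SchemaHandoffSketch {
    // Stand-in for the Properties object that
    // UDFContext.getUDFContext().getUDFProperties(getClass()) returns in Pig.
    private final Properties udfProperties;

    SchemaHandoffSketch(Properties udfProperties) {
        this.udfProperties = udfProperties;
    }

    // Models outputSchema(Schema): runs on the client and records the input schema.
    // Keeping the schema as a String is an assumption of this sketch.
    String outputSchema(String inputSchema) {
        udfProperties.setProperty("inputSchema", inputSchema);
        return inputSchema; // hypothetical: the output schema mirrors the input
    }

    // Models the lookup that exec(Tuple) would do on the backend.
    String schemaSeenByExec() {
        String schema = udfProperties.getProperty("inputSchema");
        if (schema == null) {
            throw new IllegalStateException("inputSchema was never stored");
        }
        return schema;
    }

    public static void main(String[] args) {
        Properties shared = new Properties();
        // Client side: outputSchema stores the schema.
        new SchemaHandoffSketch(shared).outputSchema("A: {prop: chararray}");
        // A second instance models the backend copy of the UDF reading it back.
        System.out.println(new SchemaHandoffSketch(shared).schemaSeenByExec());
    }
}
```

In a real UDF the two methods would be outputSchema(Schema) and exec(Tuple) on the same EvalFunc subclass, with the properties looked up through UDFContext in each.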
Julien

On 1/10/11 1:25 PM, "Jonathan Coveney" <[EMAIL PROTECTED]> wrote:

I was able to get it to work (I just didn't override the schema), but I'd
rather it have the schema so that DESCRIBE and whatnot work.

Is there no way, given a Schema with fields, to get the Schema of one of
those fields? I can try to make a hack or something, but is there a
reason you can't do Schema inner = input.getSchema(1), i.e. a getSchema
function that returns the actual schema of the given field, instead of
getField, which returns a Schema.FieldSchema?

As always, I appreciate the help.

2011/1/10 Jonathan Coveney <[EMAIL PROTECTED]>

> I was under the impression that for Bag->Bag functions, providing the
> schema made things much faster?
>
>
> 2011/1/10 Dmitriy Ryaboy <[EMAIL PROTECTED]>
>
>> Heck, if you know the schema at runtime, you could pass in a string
>> describing the schema as another argument.
>> Or pass it in during initialization:
>>
>> define udfWithSchema myUdf('a:int, b:chararray')
>>
>> What do you need the schema for, exactly?
>>
>> D
>>
>> On Mon, Jan 10, 2011 at 10:36 AM, Jonathan Coveney <[EMAIL PROTECTED]> wrote:
>>
>> > I thought about that, but I do not know how long the tuple is. This isn't
>> > an issue from a calculation perspective, I suppose, as long as you make sure
>> > that prop is the first thing in the bag. But from a schema...hmm, I guess
>> > you could just grab the schema of the other elements and build it
>> > accordingly?
>> >
>> > 2011/1/10 Dmitriy Ryaboy <[EMAIL PROTECTED]>
>> >
>> > > Jonathan, can't you just pass the bag A in?
>> > >
>> > > On Mon, Jan 10, 2011 at 9:56 AM, Jonathan Coveney <[EMAIL PROTECTED]> wrote:
>> > >
>> > > > So I have a udf, let's call it myudf.bag2bag, which takes a bag which
>> > > > contains "prop," and creates a new bag of tuples based on that.
>> > > >
>> > > > I have data in the form of
>> > > >
>> > > > id    prop    other1    other2
>> > > >
>> > > > If all I care about is running the udf, obviously I can do
>> > > >
>> > > > A = LOAD 'file' AS (id, prop, other1, other2);
>> > > > B = GROUP A BY id;
>> > > > C = FOREACH B GENERATE group, FLATTEN(myudf.bag2bag(A.prop));
>> > > >
>> > > > And all is fine
>> > > >
>> > > > But what do I do if I want to hold on to the other data, especially if
>> > > > I don't know how much there will be (from a bag2bag perspective)?
>> > > >
>> > > > My thought is that in bag2bag, you can pass in a tuple of "extras,"
>> > > > which you then pass back, i.e.
>> > > >
>> > > > C = FOREACH B GENERATE group, FLATTEN(myudf.bag2bag(A.prop,
>> > > > (A.other1, A.other2)));
>> > > >
>> > > > I'm just not sure how I would specify the schema for this, in such a
>> > > > way that any number of entries could be in the tuple, and then you could
>> > > > just sort of reference them later.
>> > > >
>> > > > Is this possible?
>> > > >
>> > >
>> >
>>
>
>
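
The suggestion in the thread to pass the whole bag A into the UDF, rather than only A.prop, can be modeled roughly as below. This is a plain-Java sketch, not Pig code: nested lists stand in for the bag and its tuples, the class name and the propIndex and transform parameters are hypothetical, and it assumes each input tuple yields exactly one output tuple.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Plain-Java model: the outer list is the bag, each inner list is one tuple.
class PassThroughSketch {
    // Rewrites the field at propIndex and carries every other field through
    // unchanged, however many extra fields each tuple happens to have.
    static List<List<Object>> bag2bag(List<List<Object>> bag, int propIndex,
                                      Function<Object, Object> transform) {
        List<List<Object>> out = new ArrayList<>();
        for (List<Object> tuple : bag) {
            List<Object> copy = new ArrayList<>(tuple); // keeps the "extras"
            copy.set(propIndex, transform.apply(tuple.get(propIndex)));
            out.add(copy);
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Object>> bag = new ArrayList<>();
        bag.add(Arrays.<Object>asList(1, "a", "x", "y")); // id, prop, other1, other2
        bag.add(Arrays.<Object>asList(1, "b", "z"));      // fewer extras also works
        System.out.println(bag2bag(bag, 1, v -> v.toString().toUpperCase()));
    }
}
```

Because the function never needs to know how many extra fields there are, the output schema could in principle be derived from the input schema by replacing only the entry for prop, which is the idea raised earlier in the thread.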