-
Holding onto info when doing a udf on a bag
Jonathan Coveney 2011-01-10, 17:56
So I have a udf, let's call it myudf.bag2bag, which takes a bag which contains "prop," and creates a new bag of tuples based on that.

I have data in the form of

    id prop other1 other2

If all I care about is running the udf, obviously I can do

    A = LOAD 'file' AS (id, prop, other1, other2);
    B = GROUP A BY id;
    C = FOREACH B GENERATE group, FLATTEN(myudf.bag2bag(A.prop));

And all is fine.

But what do I do if I want to hold on to the other data, especially if I don't know how much of it there will be (from a bag2bag perspective)?

My thought is that in bag2bag, you can pass in a tuple of "extras," which you then pass back, i.e.

    C = FOREACH B GENERATE group, FLATTEN(myudf.bag2bag(A.prop, (A.other1, A.other2)));

I'm just not sure how I would specify the schema for this, in such a way that any number of entries could be in the tuple, and then you could just sort of reference them later.

Is this possible?
-
Re: Holding onto info when doing a udf on a bag
Dmitriy Ryaboy 2011-01-10, 18:32
Jonathan, can't you just pass the bag A in?
-
Re: Holding onto info when doing a udf on a bag
Jonathan Coveney 2011-01-10, 18:36
I thought about that, but I do not know how long the tuple is. This isn't an issue from a calculation perspective, I suppose, as long as you make sure that prop is the first thing in the bag. But from a schema perspective... hmm, I guess you could just grab the schema of the other elements and build it accordingly?
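The calculation side of "just pass the bag in" might look roughly like the sketch below. It is standalone Java, not Pig code: plain lists stand in for Pig's bag and tuple types, and the class name, `transformProp`, and the upper-casing it performs are all made up for illustration. The only idea taken from the thread is that prop sits in the first position and everything after it is carried along untouched, however many fields there are.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Standalone sketch: plain Java lists stand in for Pig's bag and tuple types.
class CarryExtrasSketch {

    // Hypothetical per-row computation; a real bag2bag would do its own work here.
    static Object transformProp(Object prop) {
        return String.valueOf(prop).toUpperCase();
    }

    // For each input row, emit the transformed prop followed by every
    // remaining field, whatever the row's length happens to be.
    static List<List<Object>> bag2bag(List<List<Object>> bag) {
        List<List<Object>> out = new ArrayList<>();
        for (List<Object> row : bag) {
            List<Object> result = new ArrayList<>(row.size());
            result.add(transformProp(row.get(0)));      // prop is assumed to be field 0
            result.addAll(row.subList(1, row.size()));  // extras pass through untouched
            out.add(result);
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Object>> bag = Arrays.asList(
            Arrays.<Object>asList("x", 1, 2.5),
            Arrays.<Object>asList("y", 7, 0.1));
        System.out.println(bag2bag(bag));
    }
}
```

Because the loop never looks at the extras, the number of trailing fields does not matter for the computation; only the schema question remains open.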
-
Re: Holding onto info when doing a udf on a bag
Dmitriy Ryaboy 2011-01-10, 18:49
Heck, if you know the schema at runtime, you could pass in a string describing the schema as another argument.
Or pass it in during initialization:

    define udfWithSchema myUdf('a:int, b:chararray');

What do you need the schema for, exactly?

D
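On the Java side, the initialization route means the UDF's constructor receives that string. The sketch below shows the idea in standalone Java, with no Pig classes involved: the class name and the hand-rolled parser are assumptions for illustration, and it only handles flat `name:type` lists. A real UDF would presumably hand the string to Pig's own schema parsing rather than splitting it by hand.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Standalone sketch: receives a schema description such as "a:int, b:chararray"
// at construction time, the way a DEFINE-d UDF receives its constructor argument.
class SchemaStringSketch {
    // Field name -> type name, kept in declaration order.
    private final Map<String, String> fields = new LinkedHashMap<>();

    SchemaStringSketch(String schemaString) {
        for (String part : schemaString.split(",")) {
            String[] nameAndType = part.trim().split(":");
            if (nameAndType.length != 2) {
                throw new IllegalArgumentException(
                    "Expected name:type but got '" + part.trim() + "'");
            }
            fields.put(nameAndType[0].trim(), nameAndType[1].trim());
        }
    }

    Map<String, String> fields() {
        return fields;
    }

    public static void main(String[] args) {
        SchemaStringSketch udf = new SchemaStringSketch("a:int, b:chararray");
        System.out.println(udf.fields());
    }
}
```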
-
Re: Holding onto info when doing a udf on a bag
Jonathan Coveney 2011-01-10, 19:14
I was under the impression that for Bag->Bag functions, providing the schema made things much faster?
-
Re: Holding onto info when doing a udf on a bag
Jonathan Coveney 2011-01-10, 21:25
I was able to get it to work (I just didn't override the schema), but I'd rather like it to have the schema so that DESCRIBE and whatnot work.

Is there no way, given a Schema with fields, to get the Schema of one of those fields? I can try to make a hack or something, but is there a limitation as to why you can't do

    Schema inner = input.getSchema(1)

(that is, instead of getField, which returns a Schema.FieldSchema, a getSchema function which gives the actual schema of the given object)?

As always, I appreciate the help.
-
Re: Holding onto info when doing a udf on a bag
Julien Le Dem 2011-01-10, 22:18
Hi Jonathan,
It's

    input.getField(1).schema

You can get the schema of your input by overriding Schema outputSchema(Schema), but it looks like you figured that out.
outputSchema is called on the client side, so if you want to make use of the input schema in exec(Tuple) you need to pass it in the UDF context:

    Properties properties = UDFContext.getUDFContext().getUDFProperties(this.getClass());
    properties.put("inputSchema", inputSchema);

Julien
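The hand-off Julien describes has two halves: store the schema when outputSchema runs on the client, then read it back when exec runs. The sketch below models only that pattern in standalone Java. A plain `java.util.Properties` object stands in for whatever `getUDFProperties` returns, schemas are represented as strings, and the class and method bodies are assumptions for illustration rather than Pig's actual classes.

```java
import java.util.Properties;

// Standalone sketch: a Properties object stands in for the per-UDF properties
// obtained from the UDF context; no Pig classes are used.
class SchemaHandOffSketch {
    // Hypothetical stand-in for the properties that travel from client to tasks.
    private final Properties udfProperties;

    SchemaHandOffSketch(Properties udfProperties) {
        this.udfProperties = udfProperties;
    }

    // Client-side step: runs while the script is planned, when the input schema is known.
    String outputSchema(String inputSchema) {
        udfProperties.setProperty("inputSchema", inputSchema);
        return inputSchema; // in this sketch the output has the same shape as the input
    }

    // Execution step: runs per input, where only the stored copy is available.
    String exec(String row) {
        String schema = udfProperties.getProperty("inputSchema");
        if (schema == null) {
            throw new IllegalStateException("inputSchema was never stored");
        }
        return row + " described by (" + schema + ")";
    }

    public static void main(String[] args) {
        SchemaHandOffSketch udf = new SchemaHandOffSketch(new Properties());
        udf.outputSchema("prop:chararray, other1:int");
        System.out.println(udf.exec("x"));
    }
}
```

The null check in `exec` is there because the two steps run at different times; if the client-side step was skipped, failing loudly is easier to debug than silently producing schema-less output.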
-
Re: Holding onto info when doing a udf on a bag
Jonathan Coveney 2011-01-10, 22:59
Thank you Julien.

Once again I want to thank everyone for their help... I know that I use the listserv a lot, but you guys have really helped me turn Pig into a powerful tool in my workplace, and I know that Pig benefits from being used on large production systems.

Jon
-
Re: Holding onto info when doing a udf on a bag
Dmitriy Ryaboy 2011-01-10, 23:03
Absolutely.
Would love to hear what you are doing once it goes in production, by the way.

D
-
Re: Holding onto info when doing a udf on a bag
Jonathan Coveney 2011-01-11, 01:41
If we ever do anything really worth writing about, maybe I'll ask the higher-ups if we can do a case study... I'm not sure what sort of usage information would best benefit the Pig community, any thoughts?

But I would love to give back, and show that Pig can handle some serious data.
-
Re: Holding onto info when doing a udf on a bag
Dmitriy Ryaboy 2011-01-11, 02:03
I think it's interesting to see what motivates different companies to choose Pig, what issues they have encountered and how they solved them, the general architecture, etc. There are a few slide decks floating around the internet about how Pig is being used in production at Yahoo, Twitter, LinkedIn, Mendeley, Meebo, and a bunch of others; you can try looking at them for inspiration.

Curious about what you mean when you say "serious data" :)

D