-
outputSchema for UDF EvalFunc returning a DataBag
Andrew Clegg 2011-10-03, 15:27
Hi,

When you have a UDF that returns a bag, and you're writing the outputSchema method, do you have to explicitly include the mandatory 'container' tuple within the bag, or is this implicit?

i.e. if I'm returning a bag of ints, do I have to do:

    return new Schema(
        new FieldSchema(null,
            new Schema(
                new FieldSchema(null, DataType.INTEGER)), DataType.BAG));

Or do I have to explicitly define a tuple like so:

    return new Schema(
        new FieldSchema(null,
            new Schema(
                new FieldSchema(null,
                    new Schema(
                        new FieldSchema(null, DataType.INTEGER)), DataType.TUPLE)),
            DataType.BAG));

The docs seem pretty vague on this, and you're allowed to do either. My feeling would be that if the first form was illegal, you wouldn't be allowed to create a schema like that, but this may be wishful thinking.

Thanks,

Andrew.

--
http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
-
Re: outputSchema for UDF EvalFunc returning a DataBag
Raghu Angadi 2011-10-03, 21:56
My understanding is that Pig 0.8 expects the first form and Pig 0.9 requires the second.

Raghu.
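To make the difference between the two forms concrete, here is a minimal stand-alone sketch. It deliberately does not use Pig's own Schema or FieldSchema classes: BagSchemaSketch, its Field type, the Kind enum and the render() and hasExplicitTuple() methods are all hypothetical names invented for this illustration, and the printed notation only loosely imitates what describe shows. It simply models a bag of ints once without and once with the 'container' tuple, so the extra nesting level is easy to see.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Toy model of nested schema fields. NOT Pig's Schema/FieldSchema API. */
public class BagSchemaSketch {

    enum Kind { INT, TUPLE, BAG }

    static final class Field {
        final String alias;      // may be null, like FieldSchema(null, ...) in the thread
        final Kind kind;
        final List<Field> inner; // children of a TUPLE or BAG, empty for INT

        Field(String alias, Kind kind, Field... inner) {
            this.alias = alias;
            this.kind = kind;
            this.inner = Arrays.asList(inner);
        }

        /** Renders bags as {...} and tuples as (...). */
        String render() {
            String body;
            switch (kind) {
                case BAG:   body = "{" + joinInner() + "}"; break;
                case TUPLE: body = "(" + joinInner() + ")"; break;
                default:    body = kind.name().toLowerCase(Locale.ROOT);
            }
            return alias == null ? body : alias + ": " + body;
        }

        private String joinInner() {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < inner.size(); i++) {
                if (i > 0) sb.append(",");
                sb.append(inner.get(i).render());
            }
            return sb.toString();
        }
    }

    /** First form from the question: the bag holds the int field directly. */
    static Field bagOfIntsWithoutTuple() {
        return new Field(null, Kind.BAG, new Field(null, Kind.INT));
    }

    /** Second form: the bag holds one tuple, which holds the int field. */
    static Field bagOfIntsWithTuple() {
        return new Field(null, Kind.BAG,
                new Field(null, Kind.TUPLE, new Field(null, Kind.INT)));
    }

    /** True if the bag's only child is a tuple, i.e. the second form. */
    static boolean hasExplicitTuple(Field bag) {
        return bag.kind == Kind.BAG
                && bag.inner.size() == 1
                && bag.inner.get(0).kind == Kind.TUPLE;
    }

    public static void main(String[] args) {
        System.out.println(bagOfIntsWithoutTuple().render()); // {int}
        System.out.println(bagOfIntsWithTuple().render());    // {(int)}
    }
}
```

A check along the lines of hasExplicitTuple() could be handy in a unit test that has to run against more than one Pig version, assuming the version difference described above holds for the releases in question.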
-
Re: outputSchema for UDF EvalFunc returning a DataBag
Dmitriy Ryaboy 2011-10-03, 22:19
Raghu's being a little modest.. said understanding is based on getting ElephantBird to work with arbitrarily nested structures for both versions of Pig. Chances are he's right :-).

D
-
Re: outputSchema for UDF EvalFunc returning a DataBag
Andrew Clegg 2011-10-03, 22:41
Thanks Raghu (and Dmitriy).

Could this maybe get added to the docs page on UDFs? (Apologies if it's there already and I missed it.)

Also -- it's a bit cumbersome writing all these nested Schema and FieldSchema constructors, especially when you're writing tests for UDFs with flexible schema support.

I was wondering if it would be practical to reuse whatever code the front-end uses to parse schema descriptions from load statements in scripts. Is this a silly idea? If it isn't silly, does anyone know where I need to look for that code?
-
Re: outputSchema for UDF EvalFunc returning a DataBag
Raghu Angadi 2011-10-03, 23:14
Utils.getSchemaFromString() seems like exactly what you want (from org.apache.pig.impl.util).

Raghu.

[btw. my two previous attempts to send to the list got rejected as spam]
-
Re: outputSchema for UDF EvalFunc returning a DataBag
Andrew Clegg 2011-10-04, 13:01
Yep, getSchemaFromString is what I was looking for, but I can't get it to generate a schema (for unit test purposes) that matches what I get inside my script during a real run.

As an example, say I have a file like this:

    foo\t2
    bar\t3
    baz\t3
    marge\t4
    homer\t4

and I load it like this:

    infile = load 'test.txt' as (name:chararray, weight:int);
    grouped = group infile all;
    bucketed = foreach grouped generate flatten(Buckets(infile));

the outputSchema method of my UDF (Buckets) gets called with a schema that stringifies like so:

    {infile: {name: chararray,weight: int}}

i.e. it has a single field, which is a bag, containing two elements directly (no wrapping tuple, presumably because this is Pig 0.8.1?).

(sidenote, I guess the outermost {}s are a display convention, as there's only one bag there)

When I'm unit-testing the UDF's outputSchema method, I'd like to generate exactly that schema.

But if I call getSchemaFromString like this:

    Utils.getSchemaFromString("B: {f1: chararray, f2: int}")

It throws a parser error:

    Encountered " "{" "{ "" at line 1, column 4.
    Was expecting one of:
    "int" ...
    "long" ...
    "float" ...
    "double" ...
    "chararray" ...
    "bytearray" ...
    "int" ...
    "long" ...
    "float" ...
    "double" ...
    "chararray" ...
    "bytearray" ...

Two questions I guess.

(1) Is there a way of generating a schema like that via Utils?

(2) ... or is this schema actually wrong, and I'm looking at a symptom of https://issues.apache.org/jira/browse/PIG-767 that would behave differently if I was in Pig 0.9.0?

Many thanks,

Andrew.
-
Re: outputSchema for UDF EvalFunc returning a DataBag
Dmitriy Ryaboy 2011-10-05, 05:34
This seems to work:

    Utils.getSchemaFromString("(b:bag{f1: chararray, f2: int})");
-
Re: outputSchema for UDF EvalFunc returning a DataBag
Raghu Angadi 2011-10-05, 22:41
After multiple attempts this worked:

    grunt> x = load 'x' as (B: {t: (f1:chararray, f2:int)} );
    grunt> describe x;
    x: {B: {t: (f1: chararray,f2: int)}}
    grunt> y = foreach x generate FLATTEN(B);
    grunt> describe y;
    y: {B::f1: chararray,B::f2: int}