Pig, mail # user - outputSchema for UDF EvalFunc returning a DataBag


Re: outputSchema for UDF EvalFunc returning a DataBag
Raghu Angadi 2011-10-05, 22:41
After multiple attempts, this worked:

grunt> x = load 'x' as (B: {t: (f1:chararray, f2:int)});
grunt> describe x;
x: {B: {t: (f1: chararray,f2: int)}}
grunt> y = foreach x generate FLATTEN(B);
grunt> describe y;
y: {B::f1: chararray,B::f2: int}
grunt>
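
For the unit-test case, a rough sketch of assembling the same schema string (with the tuple wrapper inside the bag) in Java. The class and method names here are made up for illustration and are not part of Pig; whether Utils.getSchemaFromString() accepts exactly the syntax the load statement does is an assumption to check against your Pig version.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Pig: builds a schema string for a bag
// field whose contents are explicitly wrapped in a named tuple.
class BagSchemaStrings {
    public static String bagSchema(String bagAlias, String tupleAlias, List<String> fields) {
        // Result has the form  B: {t: (f1: chararray, f2: int)}
        return bagAlias + ": {" + tupleAlias + ": (" + String.join(", ", fields) + ")}";
    }

    public static void main(String[] args) {
        String s = bagSchema("B", "t", Arrays.asList("f1: chararray", "f2: int"));
        // In a real test this string would be handed to
        // Utils.getSchemaFromString(s) (assumed to parse the same syntax as load).
        System.out.println(s);
    }
}
```

The only difference from the string that failed to parse is the "t: ( ... )" wrapper around the bag's fields.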
On Tue, Oct 4, 2011 at 6:01 AM, Andrew Clegg
<andrew.clegg+[EMAIL PROTECTED]> wrote:

> Yep, getSchemaFromString is what I was looking for, but I can't get it
> to generate a schema (for unit test purposes) that matches what I get
> inside my script during a real run.
>
> As an example, say I have a file like this:
>
> foo\t2
> bar\t3
> baz\t3
> marge\t4
> homer\t4
>
> and I load it like this:
>
> infile = load 'test.txt' as (name:chararray, weight:int);
> grouped = group infile all;
> bucketed = foreach grouped generate flatten(Buckets(infile));
>
> the outputSchema method of my UDF (Buckets) gets called with a schema
> that stringifies like so:
>
> {infile: {name: chararray,weight: int}}
>
> i.e. it has a single field, which is a bag, containing two elements
> directly (no wrapping tuple, presumably because this is Pig 0.8.1?).
>
> (sidenote, I guess the outermost {}s are a display convention, as
> there's only one bag there)
>
> When I'm unit-testing the UDF's outputSchema method, I'd like to
> generate exactly that schema.
>
> But if I call getSchemaFromString like this:
>
> Utils.getSchemaFromString("B: {f1: chararray, f2: int}")
>
> It throws a parser error:
>
> Encountered " "{" "{ "" at line 1, column 4.
> Was expecting one of:
>    "int" ...
>    "long" ...
>    "float" ...
>    "double" ...
>    "chararray" ...
>    "bytearray" ...
>    "int" ...
>    "long" ...
>    "float" ...
>    "double" ...
>    "chararray" ...
>    "bytearray" ...
>
> Two questions I guess.
>
> (1) Is there a way of generating a schema like that via Utils?
>
> (2) ... or is this schema actually wrong, and I'm looking at a symptom
> of https://issues.apache.org/jira/browse/PIG-767 that would behave
> differently if I was in Pig 0.9.0?
>
> Many thanks,
>
> Andrew.
>
>
> On 4 October 2011 00:14, Raghu Angadi <[EMAIL PROTECTED]> wrote:
> > Utils.getSchemaFromString() seems like exactly what you want
> > (from org.apache.pig.impl.util).
> >
> > Raghu.
> >
> > [btw. my two previous attempts to send to the list got rejected as spam ]
> >
> > On Mon, Oct 3, 2011 at 3:41 PM, Andrew Clegg
> > <andrew.clegg+[EMAIL PROTECTED]> wrote:
> >
> >> Thanks Raghu (and Dmitry).
> >>
> >> Could this maybe get added to the docs page on UDFs? (Apologies if
> >> it's there already and I missed it.)
> >>
> >> Also -- it's a bit cumbersome writing all these nested Schema and
> >> FieldSchema constructors, especially when you're writing tests for
> >> UDFs with flexible schema support.
> >>
> >> I was wondering if it would be practical to reuse whatever code the
> >> front-end uses to parse schema descriptions from load statements in
> >> scripts. Is this a silly idea? If it isn't silly, does anyone know
> >> where I need to look for that code?
> >>
> >>
> >> On 3 October 2011 22:56, Raghu Angadi <[EMAIL PROTECTED]> wrote:
> >> > My understanding is that Pig 0.8 expects the first form and Pig 0.9
> >> > requires the second.
> >> >
> >> > Raghu.
> >> >
> >> > On Mon, Oct 3, 2011 at 8:27 AM, Andrew Clegg
> >> > <andrew.clegg+[EMAIL PROTECTED]> wrote:
> >> >
> >> >> Hi,
> >> >>
> >> >> When you have a UDF that returns a bag, and you're writing the
> >> >> outputSchema method, do you have to explicitly include the mandatory
> >> >> 'container' tuple within the bag, or is this implicit?
> >> >>
> >> >> i.e. if I'm returning a bag of ints, do I have to do:
> >> >>
> >> >> return new Schema(
> >> >>  new FieldSchema(null,
> >> >>    new Schema(
> >> >>      new FieldSchema(null, DataType.INTEGER)), DataType.BAG));
> >> >>
> >> >> Or do I have to explicitly define a tuple like so:
> >> >>
> >> >> return new Schema(
> >> >>  new FieldSchema(null,
> >> >>    new Schema(
> >> >>      new FieldSchema(null,
> >> >>        new Schema(