Pig >> mail # user >> outputSchema for UDF EvalFunc returning a DataBag


Re: outputSchema for UDF EvalFunc returning a DataBag
After multiple attempts, this worked:

grunt> x = load 'x' as (B: {t: (f1:chararray, f2:int)});
grunt> describe x;
x: {B: {t: (f1: chararray,f2: int)}}
grunt> y = foreach x generate FLATTEN(B);
grunt> describe y;
y: {B::f1: chararray,B::f2: int}
grunt>
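
For unit tests, the same tuple-wrapped form can be assembled as a string before it is handed to a schema parser. The helper below is only a sketch: BagSchemaStrings and bagOfTuples are hypothetical names, not part of Pig, and the class does nothing beyond string formatting. It assumes the form that worked in the grunt session above, with the bag's fields wrapped in a named tuple.

```java
import java.util.StringJoiner;

/** Hypothetical test helper (not part of Pig): builds bag schema strings. */
final class BagSchemaStrings {
    private BagSchemaStrings() {}

    /**
     * Returns a schema string for a bag whose fields are wrapped in an
     * explicit tuple, e.g. "B: {t: (f1: chararray, f2: int)}".
     *
     * @param bagAlias   alias of the bag field, e.g. "B"
     * @param tupleAlias alias of the inner tuple, e.g. "t"
     * @param fields     field declarations such as "f1: chararray"
     */
    static String bagOfTuples(String bagAlias, String tupleAlias, String... fields) {
        // Join the field declarations as "(f1: chararray, f2: int)".
        StringJoiner tuple = new StringJoiner(", ", "(", ")");
        for (String field : fields) {
            tuple.add(field);
        }
        // Wrap the tuple in the bag's braces and prefix the bag alias.
        return bagAlias + ": {" + tupleAlias + ": " + tuple + "}";
    }
}
```

With Pig on the classpath, the result of bagOfTuples("B", "t", "f1: chararray", "f2: int") could then be passed to Utils.getSchemaFromString. Whether the parsed schema matches what outputSchema receives in a real run may still depend on the Pig version, as discussed in the quoted messages.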
On Tue, Oct 4, 2011 at 6:01 AM, Andrew Clegg
<andrew.clegg+[EMAIL PROTECTED]> wrote:

> Yep, getSchemaFromString is what I was looking for, but I can't get it
> to generate a schema (for unit test purposes) that matches what I get
> inside my script during a real run.
>
> As an example, say I have a file like this:
>
> foo\t2
> bar\t3
> baz\t3
> marge\t4
> homer\t4
>
> and I load it like this:
>
> infile = load 'test.txt' as (name:chararray, weight:int);
> grouped = group infile all;
> bucketed = foreach grouped generate flatten(Buckets(infile));
>
> the outputSchema method of my UDF (Buckets) gets called with a schema
> that stringifies like so:
>
> {infile: {name: chararray,weight: int}}
>
> i.e. it has a single field, which is a bag, containing two elements
> directly (no wrapping tuple, presumably because this is Pig 0.8.1?).
>
> (sidenote, I guess the outermost {}s are a display convention, as
> there's only one bag there)
>
> When I'm unit-testing the UDF's outputSchema method, I'd like to
> generate exactly that schema.
>
> But if I call getSchemaFromString like this:
>
> Utils.getSchemaFromString("B: {f1: chararray, f2: int}")
>
> It throws a parser error:
>
> Encountered " "{" "{ "" at line 1, column 4.
> Was expecting one of:
>    "int" ...
>    "long" ...
>    "float" ...
>    "double" ...
>    "chararray" ...
>    "bytearray" ...
>    "int" ...
>    "long" ...
>    "float" ...
>    "double" ...
>    "chararray" ...
>    "bytearray" ...
>
> Two questions I guess.
>
> (1) Is there a way of generating a schema like that via Utils?
>
> (2) ... or is this schema actually wrong, and I'm looking at a symptom
> of https://issues.apache.org/jira/browse/PIG-767 that would behave
> differently if I was in Pig 0.9.0?
>
> Many thanks,
>
> Andrew.
>
>
> On 4 October 2011 00:14, Raghu Angadi <[EMAIL PROTECTED]> wrote:
> > Utils.getSchemaFromString() seems like exactly what you want (from
> > org.apache.pig.impl.util).
> >
> > Raghu.
> >
> > [btw. my two previous attempts to send to the list got rejected as spam ]
> >
> > On Mon, Oct 3, 2011 at 3:41 PM, Andrew Clegg
> > <andrew.clegg+[EMAIL PROTECTED]> wrote:
> >
> >> Thanks Raghu (and Dmitry).
> >>
> >> Could this maybe get added to the docs page on UDFs? (Apologies if
> >> it's there already and I missed it.)
> >>
> >> Also -- it's a bit cumbersome writing all these nested Schema and
> >> FieldSchema constructors, especially when you're writing tests for
> >> UDFs with flexible schema support.
> >>
> >> I was wondering if it would be practical to reuse whatever code the
> >> front-end uses to parse schema descriptions from load statements in
> >> scripts. Is this a silly idea? If it isn't silly, does anyone know
> >> where I need to look for that code?
> >>
> >>
> >> On 3 October 2011 22:56, Raghu Angadi <[EMAIL PROTECTED]> wrote:
> >> > My understanding is that Pig 0.8 expects the first form and Pig 0.9
> >> > requires the second.
> >> >
> >> > Raghu.
> >> >
> >> > On Mon, Oct 3, 2011 at 8:27 AM, Andrew Clegg
> >> > <andrew.clegg+[EMAIL PROTECTED]> wrote:
> >> >
> >> >> Hi,
> >> >>
> >> >> When you have a UDF that returns a bag, and you're writing the
> >> >> outputSchema method, do you have to explicitly include the mandatory
> >> >> 'container' tuple within the bag, or is this implicit?
> >> >>
> >> >> i.e. if I'm returning a bag of ints, do I have to do:
> >> >>
> >> >> return new Schema(
> >> >>  new FieldSchema(null,
> >> >>    new Schema(
> >> >>      new FieldSchema(null, DataType.INTEGER)), DataType.BAG));
> >> >>
> >> >> Or do I have to explicitly define a tuple like so:
> >> >>
> >> >> return new Schema(
> >> >>  new FieldSchema(null,
> >> >>    new Schema(
> >> >>      new FieldSchema(null,
> >> >>        new Schema(