Pig >> mail # dev >> Semantics of generate *


RE: Semantics of generate *
[Santhosh] Yes, a visitor is probably a cleaner way to do the
translation.
The question:  is the inability to return multiple projections from one
production a limit of how the parser is implemented or the tool, javacc,
used for the parser?

[Santhosh] It's the design/implementation and not the tool. Compared to
1.x, the types branch does not have the equivalent of a StarSpec. As a
result, we do not distinguish between project(0) and project( * ) during
parse time.

Thanks,
Santhosh
 

-----Original Message-----
From: Alan Gates [mailto:[EMAIL PROTECTED]]
Sent: Friday, October 03, 2008 10:48 AM
To: [EMAIL PROTECTED]
Subject: Re: Semantics of generate *

A thought and a question.

The thought:  rather than having each individual operator do the
translation, could a visitor be written that would walk the tree right
after parsing and break project( * ) into project(1), project(2)...  ?  
This visitor could be one of the validators (like the type checker).  
This way all of the logic for this restitching is in one place.
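
A minimal sketch of what such a pass might look like, in Java. The names
here (Project, Op, expandStar, visit) are hypothetical stand-ins, not
classes from the Pig code base, and the sketch assumes each operator
already knows how many columns its input has. It uses 0-based column
indices.

```java
import java.util.ArrayList;
import java.util.List;

class StarExpansionSketch {
    /** Hypothetical projection: either a single column index or a star. */
    static final class Project {
        final int column;   // meaningful only when star is false
        final boolean star;
        Project(int column, boolean star) { this.column = column; this.star = star; }
        static Project col(int i) { return new Project(i, false); }
        static Project all() { return new Project(-1, true); }
        @Override public String toString() {
            return star ? "project(*)" : "project(" + column + ")";
        }
    }

    /** Hypothetical plan node: its inputs, the input arity, and its generate list. */
    static final class Op {
        final int inputArity;
        List<Project> projections;
        final List<Op> inputs = new ArrayList<>();
        Op(int inputArity, List<Project> projections) {
            this.inputArity = inputArity;
            this.projections = projections;
        }
    }

    /** Replaces every star with one projection per input column. */
    static List<Project> expandStar(int inputArity, List<Project> projections) {
        List<Project> out = new ArrayList<>();
        for (Project p : projections) {
            if (p.star) {
                for (int i = 0; i < inputArity; i++) {
                    out.add(Project.col(i));
                }
            } else {
                out.add(p);
            }
        }
        return out;
    }

    /** Walks the plan depth-first so all restitching lives in one place. */
    static void visit(Op op) {
        for (Op in : op.inputs) {
            visit(in);
        }
        op.projections = expandStar(op.inputArity, op.projections);
    }

    public static void main(String[] args) {
        // Stand-in for "d = foreach c generate *" where c has three columns.
        List<Project> generate = new ArrayList<>();
        generate.add(Project.all());
        Op d = new Op(3, generate);
        visit(d);
        System.out.println(d.projections); // [project(0), project(1), project(2)]
    }
}
```

Because the expansion runs as a separate pass after parsing, no operator
needs its own handling of the star case.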

The question:  is the inability to return multiple projections from one
production a limit of how the parser is implemented or the tool, javacc,
used for the parser?

Alan.

Santhosh Srinivasan wrote:
> In the current implementation of generate * in the front end, a single
> projection operator with the star attribute set to true is created.
> During the schema computation, instead of generating the schema of the
> projection input, a tuple that contains the schema of the projection
> input is created. This results in double wrapping. An example will
> illustrate the problem.
>
> grunt> a = load 'one' using PigStorage(' ') as (field1, field2, field3);
> grunt> b = load 'two' as (field4, field5, field6);
> grunt> c = cogroup a by $0, b by $0;
> grunt> d = foreach c generate *;
> grunt> describe d;
>
> d: {c: (group: bytearray,a: {field1: bytearray,field2: bytearray,field3:
> bytearray},b: {field4: bytearray,field5: bytearray,field6: bytearray})}
>
> In the above example, the schema for operator d should have been
> identical to that of operator c. Instead, the schema of operator c is
> wrapped in a tuple and embedded within the schema of d. As a result, we
> have a couple of issues:
>
> 1. It is not intuitive to users that the schemas of c and d are not
> identical. They should be identical.
>
> grunt> e = foreach d generate group;
>
> 2008-10-02 16:06:11,335 [main] ERROR
> org.apache.pig.tools.grunt.GruntParser - java.io.IOException: Invalid
> alias: group in {c: (group: bytearray,a: {field1: bytearray,field2:
> bytearray,field3: bytearray},b: {field4: bytearray,field5:
> bytearray,field6: bytearray})}
>
> 2. As a workaround, we could flatten the contents of d and then access
> the contents of c.
>
> grunt> e = foreach d generate flatten($0);
> grunt> describe e;
>
> e: {c::group: bytearray,c::a: {field1: bytearray,field2:
> bytearray,field3: bytearray},c::b: {field4: bytearray,field5:
> bytearray,field6: bytearray}}
>
> However, we will not be able to compute the lineage of the fields of
> the relation, as demonstrated by the following example:
>
> grunt> f = foreach e generate flatten(a), flatten(b);
> grunt> g = foreach f generate field1 + 1;
> grunt> describe g;
>
> 2008-10-02 16:26:20,655 [main] WARN  org.apache.pig.PigServer -
> bytearray is implicitly casted to integer under LOAdd Operator
> 2008-10-02 16:26:20,655 [main] ERROR org.apache.pig.PigServer - Problem
> resolving LOForEach schema Cannot resolve load function to use for
> casting from bytearray to integer. Found more than one load function to
> use: [org.apache.pig.builtin.PigStorage,
> org.apache.pig.builtin.BinStorage]
>
> This problem is contained in the frontend alone. In the backend, the
> double wrapping issue is resolved with the bug PIG-359. In order to
> resolve this issue in the frontend, the project( * ) operator has to be
> translated into project(0), project(1), ..., project(n - 2), project(n -