Pig >> mail # user >> Pig creates wrong schema after dereferencing nested tuple fields


Re: Pig creates wrong schema after dereferencing nested tuple fields
Try this:

data = LOAD 'test_data.txt' USING PigStorage() AS (f1: int, f2: int, f3: int, f4: int);

nested = FOREACH data GENERATE f1, (f2, f3, f4) AS nested_tuple;

dereferenced = FOREACH nested GENERATE f1, nested_tuple.(f2, f3) AS tuple_two;
DESCRIBE dereferenced;

uses_dereferenced = FOREACH dereferenced GENERATE tuple_two.f3 AS f3;
DESCRIBE uses_dereferenced;

I find that naming fields explicitly can sometimes work around bugs like this.
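
A variation on the same idea, offered as an untested sketch: build the two-field tuple explicitly with the built-in TOTUPLE and single-field dereferences, and spell out its schema in the AS clause, so the script never relies on the multi-field dereference. The alias names tuple_two and uses_tuple_two are only illustrative, and this assumes single-field dereferences are not affected by the same problem.

```pig
-- Same input and nesting step as in the script above.
data = LOAD 'test_data.txt' USING PigStorage() AS (f1: int, f2: int, f3: int, f4: int);

nested = FOREACH data GENERATE f1, (f2, f3, f4) AS nested_tuple;

-- Assumption: build the tuple from single-field dereferences instead of
-- nested_tuple.(f2, f3), and declare its schema explicitly.
dereferenced = FOREACH nested GENERATE f1,
    TOTUPLE(nested_tuple.f2, nested_tuple.f3) AS tuple_two: tuple(f2: int, f3: int);
DESCRIBE dereferenced;

-- Reference the tuple by the name given to it above.
uses_tuple_two = FOREACH dereferenced GENERATE tuple_two.f3 AS f3;
DESCRIBE uses_tuple_two;
```

If this behaves as intended, DESCRIBE dereferenced should show a nested tuple_two field rather than the flattened {f1: int, f2: int}.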

On Fri, Jun 22, 2012 at 12:09 PM, Jonathan Packer <[EMAIL PROTECTED]> wrote:

> I'm running into a strange problem where Pig is not detecting the schema
> correctly when one dereferences multiple fields from a nested tuple. I
> wanted to check whether this was a bug I should file on JIRA or whether the
> multiple dereference syntax is deprecated or something.
>
> The following script fails:
>
> data = LOAD 'test_data.txt' USING PigStorage() AS (f1: int, f2: int, f3: int, f4: int);
>
> nested = FOREACH data GENERATE f1, (f2, f3, f4) AS nested_tuple;
>
> dereferenced = FOREACH nested GENERATE f1, nested_tuple.(f2, f3);
> DESCRIBE dereferenced;
>
> uses_dereferenced = FOREACH dereferenced GENERATE nested_tuple.f3;
> DESCRIBE uses_dereferenced;
>
> The schema of "dereferenced" should be {f1: int, nested_tuple: (f2: int,
> f3: int)}. DESCRIBE thinks it is {f1: int, f2: int} instead. When DUMP is
> used, however, the data actually matches the correct schema, e.g.:
>
> (1,(2,3))
> (5,(6,7))
> ...
>
> This is not just a problem with DESCRIBE. Because the schema is incorrect,
> the reference to "nested_tuple" in the "uses_dereferenced" statement is
> considered to be invalid, and the script fails to run. The error is:
>
> Invalid field projection. Projected field [nested_tuple] does not exist in
> schema: f1:int,f2:int.
>

--
Russell Jurney twitter.com/rjurney [EMAIL PROTECTED] datasyndrome.com