-
Pig creates wrong schema after dereferencing nested tuple fields
Jonathan Packer 2012-06-22, 19:09
I'm running into a strange problem where Pig is not detecting the schema correctly when one dereferences multiple fields from a nested tuple. I wanted to check whether this was a bug I should file on JIRA or whether the multiple dereference syntax is deprecated or something.
The following script fails:
data = LOAD 'test_data.txt' USING PigStorage() AS (f1: int, f2: int, f3: int, f4: int);
nested = FOREACH data GENERATE f1, (f2, f3, f4) AS nested_tuple;
dereferenced = FOREACH nested GENERATE f1, nested_tuple.(f2, f3);
DESCRIBE dereferenced;
uses_dereferenced = FOREACH dereferenced GENERATE nested_tuple.f3;
DESCRIBE uses_dereferenced;
The schema of "dereferenced" should be {f1: int, nested_tuple: (f2: int, f3: int)}. DESCRIBE thinks it is {f1: int, f2: int} instead. When DUMP is used, however, the data actually has the structure of the correct schema, e.g.:
(1,(2,3))
(5,(6,7))
...
This is not just a problem with DESCRIBE. Because the schema is incorrect, the reference to "nested_tuple" in the "uses_dereferenced" statement is considered to be invalid, and the script fails to run. The error is:
Invalid field projection. Projected field [nested_tuple] does not exist in schema: f1:int,f2:int.
-
Re: Pig creates wrong schema after dereferencing nested tuple fields
Russell Jurney 2012-06-22, 19:46
Try this:
data = LOAD 'test_data.txt' USING PigStorage() AS (f1: int, f2: int, f3: int, f4: int);
nested = FOREACH data GENERATE f1, (f2, f3, f4) AS nested_tuple;
dereferenced = FOREACH nested GENERATE f1, nested_tuple.(f2, f3) as tuple_two;
DESCRIBE dereferenced;
uses_dereferenced = FOREACH dereferenced GENERATE tuple_two.(f3) as f3;
DESCRIBE uses_dereferenced;
I find explicit naming can sometimes fix bugs.
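If the alias alone does not help, here is an untested sketch of the same idea taken one step further. It assumes the built-in TOTUPLE function is available in your Pig version and that an explicit tuple schema is accepted in the AS clause. Each field is dereferenced on its own and the tuple is rebuilt with a declared schema, so nothing depends on the schema Pig infers for the multi-field dereference. The relation names "rebuilt" and "uses_rebuilt" are made up for illustration.

```pig
data = LOAD 'test_data.txt' USING PigStorage() AS (f1: int, f2: int, f3: int, f4: int);
nested = FOREACH data GENERATE f1, (f2, f3, f4) AS nested_tuple;

-- Dereference one field at a time, then rebuild the tuple with an explicit schema
-- (assumption: TOTUPLE plus a declared tuple schema sidesteps the inferred schema)
rebuilt = FOREACH nested GENERATE f1, TOTUPLE(nested_tuple.f2, nested_tuple.f3) AS nested_tuple: tuple(f2: int, f3: int);
DESCRIBE rebuilt;

-- The nested field should now be reachable by name
uses_rebuilt = FOREACH rebuilt GENERATE nested_tuple.f3;
DESCRIBE uses_rebuilt;
```

If DESCRIBE rebuilt still shows a flattened schema, that would suggest the problem is not limited to the multi-field dereference and is worth mentioning in a JIRA report.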
-- Russell Jurney twitter.com/rjurney [EMAIL PROTECTED] datasyndrome.com