-
FLATTEN disambiguation clause
Chris Riccomini 2009-06-29, 15:37
Hi All,
I'm trying to use flatten for some pig scripts, but FLATTEN is insisting on using the disambiguation clause even when it doesn't need to:

Is there any way to force FLATTEN to NOT use the clause? Why is FLATTEN so aggressive with this? It's a bit irritating, and is causing problems in our data flow.

Thanks!
Chris

grunt> describe filtered_scores
filtered_scores: {unified_pair_scores: {dest_id: int,pairs_tc: float,scores_group_overlap: double,source_id: int}}

grunt> filtered_scores = FOREACH filtered_scores GENERATE FLATTEN(unified_pair_scores);
grunt> describe filtered_scores
filtered_scores: {unified_pair_scores::dest_id: int,unified_pair_scores::pairs_tc: float,unified_pair_scores::scores_group_overlap: double,unified_pair_scores::source_id: int}
-
RE: FLATTEN disambiguation clause
Santhosh Srinivasan 2009-06-29, 15:56
The disambiguation can be dropped if the column name is unique. A workaround for now is to explicitly name your column names when you flatten.

filtered_scores = FOREACH filtered_scores GENERATE FLATTEN(unified_pair_scores) as (dest_id, pairs_tc, scores_group_overlap, source_id);

The following should work (I have not tried it yet). If Pig is insisting on the disambiguation even when the column name is unique then it's a bug.

filtered_scores = FOREACH filtered_scores GENERATE FLATTEN(unified_pair_scores);
unique_name = FOREACH filtered_scores GENERATE dest_id;

Santhosh
-
Re: FLATTEN disambiguation clause
Chris Riccomini 2009-06-29, 16:00
Hi Santhosh,
Thanks for the fast response. It appears that it is a bug then. My query is in my original message, but I'll repaste it here. You can see that my column names are all unique (source_id, dest_id, pairs_tc, and scores_group_overlap), yet it still tries to disambiguate.

grunt> describe filtered_scores
filtered_scores: {unified_pair_scores: {dest_id: int,pairs_tc: float,scores_group_overlap: double,source_id: int}}

grunt> filtered_scores = FOREACH filtered_scores GENERATE FLATTEN(unified_pair_scores);

grunt> describe filtered_scores
filtered_scores: {unified_pair_scores::dest_id: int,unified_pair_scores::pairs_tc: float,unified_pair_scores::scores_group_overlap: double,unified_pair_scores::source_id: int}

I don't want to use the AS to rename values because the column types are dynamic, so I will not always know what's coming in.

Is there an open bug on this?

Thanks!
Chris
-
RE: FLATTEN disambiguation clause
Santhosh Srinivasan 2009-06-29, 17:24
Hi Chris,
I was probably not clear in my earlier response. The disambiguated names are always shown as the correct names. However, if a column name is unique then you should still be able to access the columns with the unique names.

In my example, I added a new line after the flatten that accesses the column with the unique name. I am pasting it below for reference.

filtered_scores = FOREACH filtered_scores GENERATE FLATTEN(unified_pair_scores);

-- the line below accesses the column with the unique name and not with the disambiguated name
unique_name = FOREACH filtered_scores GENERATE dest_id;

To summarize, unique column names are accessible with either the disambiguated name or with the unique column name. Another example that validates the point follows:

grunt> a = load 'input' as (name, age, gpa);
grunt> b = group a ALL;
grunt> c = foreach b generate flatten(a);

grunt> describe c;
c: {a::name: bytearray,a::age: bytearray,a::gpa: bytearray}

grunt> d = foreach c generate name;

grunt> describe d;
d: {a::name: bytearray}

Having explained that, I am not quite sure if that addresses your problem. I probably need to understand your use case.

Thanks,
Santhosh
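The lookup rule described in this message (a column can be referenced by its full disambiguated alias, or by its short name when that short name is unique) can be illustrated with a small Python model. This is only a sketch of the behavior as described in the thread, not Pig's actual implementation; the function name `resolve` and the representation of a schema as a list of alias strings are assumptions made for illustration.

```python
def resolve(schema, name):
    """Return the full alias in `schema` that `name` refers to.

    Hypothetical model: `schema` is a list of alias strings such as
    'a::name'. A name matches an alias if it equals the alias exactly
    or equals the part after the last '::'.
    """
    # An exact match on the full (disambiguated) alias always wins.
    if name in schema:
        return name
    # Otherwise compare against the short name of every alias.
    matches = [alias for alias in schema if alias.rsplit("::", 1)[-1] == name]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise KeyError(f"no column named {name!r}")
    raise KeyError(f"{name!r} is ambiguous: {matches}")


# Schema of relation `c` from the grunt session in this message.
c = ["a::name", "a::age", "a::gpa"]
print(resolve(c, "name"))     # short name, unique in this schema
print(resolve(c, "a::name"))  # full disambiguated alias
```

Both calls resolve to the same column, which matches the `describe d` output in the session, where `d` still reports the alias as `a::name`.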
-
Re: FLATTEN disambiguation clause
Chris Riccomini 2009-06-29, 17:55
Hi Santhosh,
This is good to know. This solves all of my problems except for one. Our StoreFunc is using field.alias to serialize our pig data to disk, which is giving us unified_pair_scores::dest_id, etc. How do I get just 'dest_id' from the field?

Thanks!
Chris
-
RE: FLATTEN disambiguation clause
Santhosh Srinivasan 2009-06-29, 18:23
Hi Chris,
Right now, the alias is set to the disambiguated alias. There is no mechanism to retrieve the unique alias if one exists. This enhancement has to be added. I have filed a JIRA - https://issues.apache.org/jira/browse/PIG-866

Thanks,
Santhosh
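Until such a mechanism exists, one possible workaround inside a custom store function is to strip everything up to the last `::` from the alias before writing it out. The sketch below shows only the string handling, in Python; `short_alias` and `short_aliases` are hypothetical helpers, not part of Pig, and the same operation would have to be written in Java inside the StoreFunc itself. It assumes the short names are unique, as they are in the schema from this thread; if they are not, the stripped names would collide, so the sketch refuses to return duplicates.

```python
def short_alias(alias):
    """Drop any 'relation::' prefixes from a disambiguated alias."""
    return alias.rsplit("::", 1)[-1]


def short_aliases(aliases):
    """Strip prefixes from every alias, refusing to produce duplicates."""
    stripped = [short_alias(a) for a in aliases]
    if len(set(stripped)) != len(stripped):
        raise ValueError(f"short names are not unique: {stripped}")
    return stripped


# Aliases as reported by `describe filtered_scores` after the FLATTEN.
fields = [
    "unified_pair_scores::dest_id",
    "unified_pair_scores::pairs_tc",
    "unified_pair_scores::scores_group_overlap",
    "unified_pair_scores::source_id",
]
print(short_aliases(fields))
# ['dest_id', 'pairs_tc', 'scores_group_overlap', 'source_id']
```

Splitting on the last `::` rather than the first means an alias that has picked up several prefixes still reduces to its innermost column name.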