Pig >> mail # user >> FLATTEN disambiguation clause

RE: FLATTEN disambiguation clause
Hi Chris,

I was probably not clear in my earlier response. DESCRIBE always shows the disambiguated (prefixed) names in the schema. However, if a column name is unique, you should still be able to access that column by its unprefixed name.

In my example, I added a new line after the flatten that accesses the column with the unique name. I am pasting it below for reference.

filtered_scores = FOREACH filtered_scores GENERATE FLATTEN(unified_pair_scores);

-- the line below accesses the column with the unique name and not with the disambiguated name
unique_name = FOREACH filtered_scores GENERATE dest_id;

To summarize, a column whose name is unique is accessible by either its disambiguated name or its unprefixed name. Another example that validates the point follows:

grunt> a = load 'input' as (name, age, gpa);
grunt> b = group a ALL;
grunt> c = foreach b generate flatten(a);    

grunt> describe c;
c: {a::name: bytearray,a::age: bytearray,a::gpa: bytearray}

grunt> d = foreach c generate name;          

grunt> describe d;                          
d: {a::name: bytearray}
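
For contrast, here is a minimal sketch of the case where the prefix is actually required. It assumes a hypothetical second input file 'input2' whose first column is also called name; the relation names and fields are illustrative only and this script has not been run against the data in this thread.

```
a = load 'input' as (name, age, gpa);
b = load 'input2' as (name, dept);   -- hypothetical second input, shares the column name "name"
j = join a by name, b by name;
-- j now carries a::name, a::age, a::gpa, b::name, b::dept

ok1 = foreach j generate age;        -- age is unique, so the short name resolves
ok2 = foreach j generate a::name;    -- name is not unique, so the prefix is needed
-- bad = foreach j generate name;    -- expected to be rejected as ambiguous (matches a::name and b::name)
```

The prefix only becomes mandatory when the short name matches more than one column, which is not the situation in the filtered_scores schema above.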

Having explained that, I am not quite sure if that addresses your problem. I probably need to understand your use case.

Thanks,
Santhosh

-----Original Message-----
From: Chris Riccomini [mailto:[EMAIL PROTECTED]]
Sent: Monday, June 29, 2009 9:00 AM
To: [EMAIL PROTECTED]
Subject: Re: FLATTEN disambiguation clause

Hi Santhosh,

Thanks for the fast response. It appears that it is a bug then. My query is
at the bottom of this thread, but I'll repaste here. You can see that my
column names are all unique (source_id, dest_id, pairs_tc, and
scores_group_overlap), yet it still tries to disambiguate.

grunt> describe filtered_scores
filtered_scores: {unified_pair_scores: {dest_id: int,pairs_tc: float,scores_group_overlap: double,source_id: int}}

grunt> filtered_scores = FOREACH filtered_scores GENERATE FLATTEN(unified_pair_scores);

grunt> describe filtered_scores
filtered_scores: {unified_pair_scores::dest_id: int,unified_pair_scores::pairs_tc: float,unified_pair_scores::scores_group_overlap: double,unified_pair_scores::source_id: int}

I don't want to use the AS clause to rename the columns because the column types are
dynamic, so I will not always know what's coming in.

Is there an open bug on this?

Thanks!
Chris

On 6/29/09 8:56 AM, "Santhosh Srinivasan" <[EMAIL PROTECTED]> wrote:

> The disambiguation prefix can be dropped if the column name is unique. A workaround
> for now is to explicitly name your columns when you flatten.
>
> filtered_scores = FOREACH filtered_scores GENERATE
> FLATTEN(unified_pair_scores) as (dest_id, pairs_tc, scores_group_overlap,
> source_id);
>
> The following should work (I have not tried it yet). If Pig is insisting on
> the disambiguation even when the column name is unique then it's a bug.
>
> filtered_scores = FOREACH filtered_scores GENERATE
> FLATTEN(unified_pair_scores);
> unique_name = FOREACH filtered_scores GENERATE dest_id;
>
> Santhosh
>
> -----Original Message-----
> From: Chris Riccomini [mailto:[EMAIL PROTECTED]]
> Sent: Monday, June 29, 2009 8:37 AM
> To: [EMAIL PROTECTED]
> Subject: FLATTEN disambiguation clause
>
> Hi All,
>
> I'm trying to use flatten for some pig scripts, but FLATTEN is insisting on
> using the disambiguation clause even when it doesn't need to:
>
> Is there any way to force FLATTEN to NOT use the clause? Why is FLATTEN so
> aggressive with this? It's a bit irritating, and is causing problems in our
> data flow.
>
> Thanks!
> Chris
>
> grunt> describe filtered_scores
> filtered_scores: {unified_pair_scores: {dest_id: int,pairs_tc: float,scores_group_overlap: double,source_id: int}}
>
> grunt> filtered_scores = FOREACH filtered_scores GENERATE FLATTEN(unified_pair_scores);
> grunt> describe filtered_scores
> filtered_scores: {unified_pair_scores::dest_id: int,unified_pair_scores::pairs_tc: float,unified_pair_scores::scores_group_overlap: double,unified_pair_scores::source_id: int}