Pig >> mail # user >> FLATTEN disambiguation clause


RE: FLATTEN disambiguation clause
Hi Chris,

I was probably not clear in my earlier response. DESCRIBE always shows the disambiguated names in the schema. However, if a column name is unique, you should still be able to access that column by its unique name alone.

In my example, I added a new line after the flatten that accesses the column with the unique name. I am pasting it below for reference.

filtered_scores = FOREACH filtered_scores GENERATE FLATTEN(unified_pair_scores);

-- the line below accesses the column with the unique name and not with the disambiguated name
unique_name = FOREACH filtered_scores GENERATE dest_id;

To summarize, unique column names are accessible with either the disambiguated name or with the unique column name. Another example that validates the point follows:

grunt> a = load 'input' as (name, age, gpa);
grunt> b = group a ALL;
grunt> c = foreach b generate flatten(a);    

grunt> describe c;
c: {a::name: bytearray,a::age: bytearray,a::gpa: bytearray}

grunt> d = foreach c generate name;          

grunt> describe d;                          
d: {a::name: bytearray}
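
As a further sketch of the same point (not part of the session above, and not run against any particular Pig release), the script below references a flattened column both ways and then shows the case where the prefix is actually needed. The relation names x, y, j, k and the 'other' input with its dept column are assumptions made up for illustration.

```pig
-- Same setup as the session above.
a = LOAD 'input' AS (name, age, gpa);
b = GROUP a ALL;
c = FOREACH b GENERATE FLATTEN(a);

-- 'name' is unique in c, so both forms refer to the same column.
d_short = FOREACH c GENERATE name;
d_full  = FOREACH c GENERATE a::name;

-- Hypothetical second input that also has a 'name' column.
x = LOAD 'input' AS (name, age, gpa);
y = LOAD 'other' AS (name, dept);
j = JOIN x BY name, y BY name;

-- After the join, 'name' exists as both x::name and y::name, so the
-- bare name is ambiguous and the prefix is required. 'dept' is still
-- unique and can be referenced without a prefix.
k = FOREACH j GENERATE x::name, dept;
```

The prefix only becomes mandatory when two columns share a short name, as with name after the join.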

Having explained that, I am not quite sure if that addresses your problem. I probably need to understand your use case.

Thanks,
Santhosh

-----Original Message-----
From: Chris Riccomini [mailto:[EMAIL PROTECTED]]
Sent: Monday, June 29, 2009 9:00 AM
To: [EMAIL PROTECTED]
Subject: Re: FLATTEN disambiguation clause

Hi Santhosh,

Thanks for the fast response. It appears that it is a bug then. My query is
at the bottom of this thread, but I'll repaste here. You can see that my
column names are all unique (source_id, dest_id, pairs_tc, and
scores_group_overlap), yet it still tries to disambiguate.

grunt> describe filtered_scores
filtered_scores: {unified_pair_scores: {dest_id: int,pairs_tc: float,scores_group_overlap: double,source_id: int}}

grunt> filtered_scores = FOREACH filtered_scores GENERATE FLATTEN(unified_pair_scores);

grunt> describe filtered_scores
filtered_scores: {unified_pair_scores::dest_id: int,unified_pair_scores::pairs_tc: float,unified_pair_scores::scores_group_overlap: double,unified_pair_scores::source_id: int}

I don't want to use the AS to rename values because the column types are
dynamic, so I will not always know what's coming in.

Is there an open bug on this?

Thanks!
Chris
On 6/29/09 8:56 AM, "Santhosh Srinivasan" <[EMAIL PROTECTED]> wrote:

> The disambiguation can be dropped if the column name is unique. A workaround
> for now is to explicitly name your columns when you flatten.
>
> filtered_scores = FOREACH filtered_scores GENERATE
> FLATTEN(unified_pair_scores) as (dest_id, pairs_tc, scores_group_overlap,
> source_id);
>
> The following should work (I have not tried it yet). If Pig is insisting on
> the disambiguation even when the column name is unique then it's a bug.
>
> filtered_scores = FOREACH filtered_scores GENERATE
> FLATTEN(unified_pair_scores);
> unique_name = FOREACH filtered_scores GENERATE dest_id;
>
> Santhosh
>
> -----Original Message-----
> From: Chris Riccomini [mailto:[EMAIL PROTECTED]]
> Sent: Monday, June 29, 2009 8:37 AM
> To: [EMAIL PROTECTED]
> Subject: FLATTEN disambiguation clause
>
> Hi All,
>
> I'm trying to use flatten for some pig scripts, but FLATTEN is insisting on
> using the disambiguation clause even when it doesn't need to:
>
> Is there any way to force FLATTEN to NOT use the clause? Why is FLATTEN so
> aggressive with this? It's a bit irritating, and is causing problems in our
> data flow.
>
> Thanks!
> Chris
>
> grunt> describe filtered_scores
> filtered_scores: {unified_pair_scores: {dest_id: int,pairs_tc:
> float,scores_group_overlap: double,source_id: int}}
>
> grunt> filtered_scores = FOREACH filtered_scores GENERATE
> FLATTEN(unified_pair_scores);
> grunt> describe filtered_scores
> filtered_scores: {unified_pair_scores::dest_id:
> int,unified_pair_scores::pairs_tc:
> float,unified_pair_scores::scores_group_overlap:
> double,unified_pair_scores::source_id: int}