Pig >> mail # user >> FLATTEN disambiguation clause


RE: FLATTEN disambiguation clause
Hi Chris,

Right now, the field's alias is always set to the disambiguated alias, and there is no mechanism to retrieve the unique (unprefixed) alias when one exists. This enhancement has to be added; I have filed a JIRA for it: https://issues.apache.org/jira/browse/PIG-866

Thanks,
Santhosh
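
Until something like PIG-866 is available, one possible stopgap inside a StoreFunc is to strip the disambiguation prefix from the alias string yourself. The sketch below assumes that disambiguated aliases always use "::" as the separator, as in the describe output quoted in this thread. AliasNames and shortName are hypothetical names chosen for illustration; they are not part of the Pig API.

```java
import java.util.Objects;

// Hypothetical helper, not part of Pig: trims the "relation::" prefix
// that FLATTEN adds to a field alias.
final class AliasNames {
    private AliasNames() {}

    // Returns the text after the last "::" in the alias, or the alias
    // unchanged when it carries no disambiguation prefix.
    static String shortName(String alias) {
        Objects.requireNonNull(alias, "alias");
        int idx = alias.lastIndexOf("::");
        return idx < 0 ? alias : alias.substring(idx + 2);
    }

    public static void main(String[] args) {
        System.out.println(shortName("unified_pair_scores::dest_id")); // prints "dest_id"
        System.out.println(shortName("dest_id"));                      // prints "dest_id"
    }
}
```

Note that this is only safe when the short names are unique within the schema. If two flattened relations both contribute a column with the same short name, stripping the prefix makes them collide, which is the situation the disambiguation exists to prevent.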

-----Original Message-----
From: Chris Riccomini [mailto:[EMAIL PROTECTED]]
Sent: Monday, June 29, 2009 10:56 AM
To: [EMAIL PROTECTED]
Subject: Re: FLATTEN disambiguation clause

Hi Santhosh,

This is good to know. This solves all of my problems except for one. Our
StoreFunc is using field.alias to serialize our pig data to disk, which is
giving us unified_pair_scores::dest_id, etc.

How do I get just 'dest_id' from the field?

Thanks!
Chris
On 6/29/09 10:24 AM, "Santhosh Srinivasan" <[EMAIL PROTECTED]> wrote:

> Hi Chris,
>
> I was probably not clear in my earlier response. The disambiguated names are
> always shown as the correct names. However, if a column name is unique then
> you should still be able to access the columns with the unique names.
>
> In my example, I added a new line after the flatten that accesses the column
> with the unique name. I am pasting it below for reference.
>
> filtered_scores = FOREACH filtered_scores GENERATE
> FLATTEN(unified_pair_scores);
>
> -- the line below accesses the column with the unique name, not with the
> -- disambiguated name
> unique_name = FOREACH filtered_scores GENERATE dest_id;
>
> To summarize, unique column names are accessible with either the disambiguated
> name or with the unique column name. Another example that validates the point
> follows:
>
> grunt> a = load 'input' as (name, age, gpa);
> grunt> b = group a ALL;
> grunt> c = foreach b generate flatten(a);
>
> grunt> describe c;
> c: {a::name: bytearray,a::age: bytearray,a::gpa: bytearray}
>
> grunt> d = foreach c generate name;
>
> grunt> describe d;
> d: {a::name: bytearray}
>
> Having explained that, I am not quite sure if that addresses your problem. I
> probably need to understand your use case.
>
> Thanks,
> Santhosh
>
> -----Original Message-----
> From: Chris Riccomini [mailto:[EMAIL PROTECTED]]
> Sent: Monday, June 29, 2009 9:00 AM
> To: [EMAIL PROTECTED]
> Subject: Re: FLATTEN disambiguation clause
>
> Hi Santhosh,
>
> Thanks for the fast response. It appears that it is a bug then. My query is
> at the bottom of this thread, but I'll repaste here. You can see that my
> column names are all unique (source_id, dest_id, pairs_tc, and
> scores_group_overlap), yet it still tries to disambiguate.
>
> grunt> describe filtered_scores
> filtered_scores: {unified_pair_scores: {dest_id: int,pairs_tc: float,scores_group_overlap: double,source_id: int}}
>
> grunt> filtered_scores = FOREACH filtered_scores GENERATE
> FLATTEN(unified_pair_scores);
>
> grunt> describe filtered_scores
> filtered_scores: {unified_pair_scores::dest_id: int,unified_pair_scores::pairs_tc: float,unified_pair_scores::scores_group_overlap: double,unified_pair_scores::source_id: int}
>
> I don't want to use the AS to rename values because the column types are
> dynamic, so I will not always know what's coming in.
>
> Is there an open bug on this?
>
> Thanks!
> Chris
>
>
> On 6/29/09 8:56 AM, "Santhosh Srinivasan" <[EMAIL PROTECTED]> wrote:
>
>> The disambiguation can be dropped if the column name is unique. A workaround
>> for now is to explicitly name your columns when you flatten.
>>
>> filtered_scores = FOREACH filtered_scores GENERATE
>> FLATTEN(unified_pair_scores) as (dest_id, pairs_tc, scores_group_overlap,
>> source_id);
>>
>> The following should work (I have not tried it yet). If Pig is insisting on
>> the disambiguation even when the column name is unique then it's a bug.
>>
>> filtered_scores = FOREACH filtered_scores GENERATE
>> FLATTEN(unified_pair_scores);
>> unique_name = FOREACH filtered_scores GENERATE dest_id;
>>
>> Santhosh
>>
>> -----Original Message-----
>> From: Chris Riccomini [mailto:[EMAIL PROTECTED]]