Pig >> mail # user >> Group By data


Re: Group By data
I think this issue may be caused by https://issues.apache.org/jira/browse/PIG-1525. Can you check whether your trunk version of Pig is newer than Aug 9th?
(I haven't tried running the query against the sample yet.)
-Thejas

On 8/25/10 11:21 AM, "Wasti, Syed" <[EMAIL PROTECTED]> wrote:

Hi Dmitriy,
Thanks for offering help. Attached is the sample data file. From my
observation, it looks like it has something to do with the MIN function on
the grouped data. For the ids where it picks up the wrong date, the date
comes from the previous id in sequence. You should get a better idea when
you see the output data.
Please let me know of your findings.

On 8/24/10 11:33 PM, "Dmitriy Ryaboy" <[EMAIL PROTECTED]> wrote:

> Could you send sample data that would allow us to reproduce this error?
>
> -Dmitriy
>
> On Tue, Aug 24, 2010 at 1:12 PM, Wasti, Syed <[EMAIL PROTECTED]> wrote:
>
>> Hi,
>> I have a very simple script that shows very strange behavior: I get wrong
>> results when running it from a file, while running the same statements in
>> the Pig grunt shell gives accurate results.
>>
>> table =       LOAD ' sample' USING PigStorage('\t')
>>                    AS (a: long, b: int, c, id: long, e, f, g, h, i: int, j, k, l, m, n, o, p, q, s, t, u, v, w, x, date: chararray, z);
>>
>> gen_table =    FOREACH table GENERATE a, id, date;
>>
>> grp_table =   GROUP gen_table BY (a, id);
>>
>> gen_grp_table = FOREACH grp_table {
>>                min_creation_date = MIN(gen_table.date);
>>                max_creation_date = MAX(gen_table.date);
>>                GENERATE group.id,
>>                (chararray)(group.a == 1?min_creation_date:null) AS first_p_date,
>>                (chararray)(group.a == 1?max_creation_date:null) AS last_p_date,
>>                (chararray)(group.a == 2?min_creation_date:null) AS first_n_date,
>>                (chararray)(group.a == 2?max_creation_date:null) AS last_n_date,
>>                (chararray)(group.a == 3?min_creation_date:null) AS first_t_date,
>>                (chararray)(group.a == 3?max_creation_date:null) AS last_t_date;
>> };
>>
>> dump gen_grp_table;
>>
>> Wrong results when running from the script; these dates belong to some
>> other ids:
>> (3860,,,2010-03-24 22:49:38,1970-01-01 00:00:00,,)
>> (3509,,,2010-08-12 04:57:17,2003-05-20 17:02:54,,)
>> (5096,,,,,2010-08-20 00:43:08,1970-01-01 00:00:00)
>> (1673,,,,,2010-08-20 02:19:44,1970-01-01 00:00:00)
>>
>> Expected results, which I get only when running from the grunt shell:
>> (3860,,,1970-01-01 00:00:00,1970-01-01 00:00:00,,)
>> (3509,,,2003-05-20 17:02:54,2003-05-20 17:02:54,,)
>> (5096,,,,,1970-01-01 00:00:00,1970-01-01 00:00:00)
>> (1673,,,,,1970-01-01 00:00:00,1970-01-01 00:00:00)
>>
>> Has anyone come across a similar issue? I am using the trunk version of
>> Pig and am not sure why this happens. Suggestions please.
>>
>> Regards
>> Syed
>>

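One way to confirm which of the two outputs above is correct, independently of Pig, is to recompute the per-group minimum and maximum outside the cluster. The following is a minimal sketch in Python, not Pig Latin. The rows are hypothetical stand-ins for the (a, id, date) columns and are not taken from the attached sample file. The helper names min_max_by_group and pivot are made up for illustration. It assumes the dates are "yyyy-MM-dd HH:mm:ss" strings, so plain string comparison orders them chronologically, and that null dates are skipped by the aggregates.

```python
from collections import defaultdict

# Hypothetical (a, id, date) rows; NOT taken from the attached sample file.
rows = [
    (2, 3860, "1970-01-01 00:00:00"),
    (2, 3860, "2010-03-24 22:49:38"),
    (2, 3509, "2003-05-20 17:02:54"),
    (2, 3509, "2010-08-12 04:57:17"),
    (3, 5096, "1970-01-01 00:00:00"),
    (3, 5096, None),  # a null date, ignored by min/max below
]


def min_max_by_group(rows):
    """Group rows by (a, id) and return {(a, id): (min_date, max_date)}."""
    groups = defaultdict(list)
    for a, id_, date in rows:
        bucket = groups[(a, id_)]  # create the group even if the date is null
        if date is not None:
            bucket.append(date)
    return {
        key: (min(dates), max(dates)) if dates else (None, None)
        for key, dates in groups.items()
    }


def pivot(result):
    """Mimic the GENERATE clause: one (first, last) column pair per value of a."""
    out = []
    for (a, id_), (lo, hi) in sorted(result.items()):
        row = [id_]
        for code in (1, 2, 3):
            row += [lo, hi] if a == code else [None, None]
        out.append(tuple(row))
    return out


if __name__ == "__main__":
    for row in pivot(min_max_by_group(rows)):
        print(row)
```

If the dates a script run reports for an id never appear in that id's own rows, the fault lies in the execution of the script rather than in the data.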