Pig >> mail # user >> exclude rows from group


Re: exclude rows from group
Marco,

What you want is a combination of COGROUP and FILTER. For example:

$: cat foo.tsv
1 rich
1 happy
2 rich
3 happy
4 rich
----

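-- Load the sample data; assumes tab-separated fields (the PigStorage default).
A = LOAD 'foo.tsv' AS (user_id:int, user_type:chararray);

-- One relation per user type.
SPLIT A INTO happy IF user_type == 'happy', rich IF user_type == 'rich';

-- One row per user_id, with a bag of matching rows from each relation.
B = COGROUP happy BY user_id, rich BY user_id;

-- Keep ids that have 'rich' rows but no 'happy' rows.
rich_and_not_happy = FOREACH (FILTER B BY IsEmpty(happy) AND NOT IsEmpty(rich)) GENERATE group AS user_id;

DUMP rich_and_not_happy;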
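
To see why the filter works, it may help to look at the cogrouped relation before filtering, and then to reduce the result to the single count the original question was after. The following is only a sketch, assuming foo.tsv is tab-separated and holds the five sample rows above; the alias `counted` and the field name `total` are made up for illustration, and the order of the dumped tuples may differ on your setup.

```pig
A = LOAD 'foo.tsv' AS (user_id:int, user_type:chararray);
SPLIT A INTO happy IF user_type == 'happy', rich IF user_type == 'rich';
B = COGROUP happy BY user_id, rich BY user_id;

-- Each row is (group, bag of happy rows, bag of rich rows).
DUMP B;
-- (1,{(1,happy)},{(1,rich)})
-- (2,{},{(2,rich)})
-- (3,{(3,happy)},{})
-- (4,{},{(4,rich)})

-- Only ids 2 and 4 have an empty happy bag and a non-empty rich bag.
rich_and_not_happy = FOREACH (FILTER B BY IsEmpty(happy) AND NOT IsEmpty(rich)) GENERATE group AS user_id;

-- Collapse to one row holding the number of such ids.
counted = FOREACH (GROUP rich_and_not_happy ALL) GENERATE COUNT(rich_and_not_happy) AS total;
DUMP counted;
-- (2)
```

This gets the count in one pass over the cogrouped data, without the JOIN and the subtraction of two separate counts.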

--jacob
@thedatachef

On Tue, 2012-02-28 at 16:49 +0100, Marco Cadetg wrote:
> Hi there,
>
> I'm trying to retrieve the group of 'rich' userids which are not 'happy'.
> Something like retrieving all ids which are not in the other bag's ids.
>
> Is there a better way to exclude some rows from a group?
>
>
> Example code:
>
> A: {userid: chararray,user_type: chararray}
>
> A:
> (1,rich)
> (1,happy)
> (2,rich)
> (3,happy)
> (4,rich)
>
> RICH = FILTER A BY user_type == 'rich';
> HAPPY = FILTER A BY user_type == 'happy';
>
> dump RICH
> (1,rich)
> (2,rich)
> (4,rich)
>
> BOTH = JOIN RICH BY $0, HAPPY BY $0;
> BOTH = FOREACH (GROUP BOTH ALL) {GENERATE COUNT(BOTH) AS counter;}
>
> RICH_AND_NOT_HAPPY = FOREACH (GROUP RICH ALL) {GENERATE
> COUNT(RICH)-BOTH.counter AS total;}
> dump RICH_AND_NOT_HAPPY
> (2)
>
> Thanks for your help!
> -Marco