Pig user mailing list


Re: exclude rows from group
Marco,

What you want is a combination of COGROUP and FILTER. For example:

$: cat foo.tsv
1 rich
1 happy
2 rich
3 happy
4 rich
----

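-- Load the tab-separated sample data.
A = LOAD 'foo.tsv' AS (user_id:int, user_type:chararray);

-- Separate the rows by user_type.
SPLIT A INTO happy IF user_type == 'happy', rich IF user_type == 'rich';

-- One tuple per user_id, holding a bag of its 'happy' rows and a bag of its 'rich' rows.
B = COGROUP happy BY user_id, rich BY user_id;

-- Keep the ids whose 'happy' bag is empty and whose 'rich' bag is not.
rich_and_not_happy = FOREACH (FILTER B BY IsEmpty(happy) AND NOT IsEmpty(rich))
                     GENERATE group AS user_id;

DUMP rich_and_not_happy;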
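
To make the intermediate relation easier to picture, here is a rough plain-Python emulation of what the COGROUP and FILTER steps do to the sample data. It is only a sketch, not Pig code: the names `ROWS`, `cogroup_by_user`, and `rich_and_not_happy` are made up for illustration, and the result is sorted for readability even though Pig does not guarantee output order.

```python
from collections import defaultdict

# Sample rows from foo.tsv as (user_id, user_type) pairs.
ROWS = [(1, "rich"), (1, "happy"), (2, "rich"), (3, "happy"), (4, "rich")]


def cogroup_by_user(rows):
    """Emulate SPLIT + COGROUP: map each user_id to a 'happy' bag and a 'rich' bag."""
    groups = defaultdict(lambda: {"happy": [], "rich": []})
    for user_id, user_type in rows:
        if user_type in ("happy", "rich"):
            groups[user_id][user_type].append((user_id, user_type))
    return dict(groups)


def rich_and_not_happy(rows):
    """Emulate the FILTER: keep ids whose 'happy' bag is empty and 'rich' bag is not."""
    grouped = cogroup_by_user(rows)
    return sorted(
        user_id
        for user_id, bags in grouped.items()
        if not bags["happy"] and bags["rich"]
    )


if __name__ == "__main__":
    # Print one line per group: the id, its 'happy' bag, and its 'rich' bag.
    for user_id, bags in sorted(cogroup_by_user(ROWS).items()):
        print(user_id, bags["happy"], bags["rich"])
    print(rich_and_not_happy(ROWS))
```

For the sample data, only user ids 2 and 4 have a non-empty 'rich' bag and an empty 'happy' bag, which agrees with the count of 2 in the original question.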

--jacob
@thedatachef

On Tue, 2012-02-28 at 16:49 +0100, Marco Cadetg wrote:
> Hi there,
>
> I'm trying to retrieve the group of 'rich' user ids that are not 'happy'.
> In other words, retrieve all ids that do not appear in the other bag.
>
> Is there a better way to exclude some rows from a group?
>
>
> Example code:
>
> A: {userid: chararray,user_type: chararray}
>
> A:
> (1,rich)
> (1,happy)
> (2,rich)
> (3,happy)
> (4,rich)
>
> RICH = FILTER A BY user_type == 'rich';
> HAPPY = FILTER A BY user_type == 'happy';
>
> dump RICH
> (1,rich)
> (2,rich)
> (4,rich)
>
> BOTH = JOIN RICH BY $0, HAPPY BY $0;
> BOTH = FOREACH (GROUP BOTH ALL) {GENERATE COUNT(BOTH) AS counter;}
>
> RICH_AND_NOT_HAPPY = FOREACH (GROUP RICH ALL) {GENERATE
> COUNT(RICH)-BOTH.counter AS total;}
> dump RICH_AND_NOT_HAPPY
> (2)
>
> Thanks for your help!
> -Marco