Pig, mail # user - filter duplicates from a bag


Re: filter duplicates from a bag
Gianmarco De Francisci Mo... 2012-08-24, 10:19
I would say something along these lines:

B = group A by *;
C = foreach B generate group, COUNT(A) as count;
D = filter C by count > 1;
E = foreach D generate group;

Disclaimer: untested code.
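
Note that E above contains each duplicated tuple only once, e.g. a single
(1,2,3). If every copy is wanted, as in the question, one possible variant is
to carry the grouped bag through the filter and flatten it at the end. This is
only a sketch: the LOAD line, the file name 'input.csv' and the alias cnt are
assumptions made for illustration, and the code is likewise untested.

```pig
-- Hypothetical input: comma-separated rows of (userid, foo, bar).
A = LOAD 'input.csv' USING PigStorage(',') AS (userid:long, foo:long, bar:long);

-- Group on all three fields so identical tuples end up in the same bag.
B = GROUP A BY (userid, foo, bar);

-- Keep the bag of original tuples next to its size.
C = FOREACH B GENERATE COUNT(A) AS cnt, A;

-- Retain only the groups that hold more than one tuple.
D = FILTER C BY cnt > 1;

-- Unnest the bag to get every copy of each duplicated tuple back.
E = FOREACH D GENERATE FLATTEN(A);

DUMP E;
```

Carrying A through to the last step is what preserves the individual copies;
generating only the group key would collapse them back into one tuple.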

Cheers,
--
Gianmarco

On Fri, Aug 24, 2012 at 11:35 AM, Marco Cadetg <[EMAIL PROTECTED]> wrote:

> Hi there,
>
> What is the best way to retrieve duplicates from a bag? I would basically
> like to do something like the opposite of DISTINCT.
>
> A: {userid: long,foo: long,bar: long}
>
> dump A
> (1,2,3)
> (1,2,3)
> (1,3,2)
> (2,3,1)
>
> Now I would like to have a bag which contains
> (1,2,3)
> (1,2,3)
>
> Thanks,
> -Marco
>