Pig >> mail # user >> filter duplicates from a bag


Marco Cadetg 2012-08-24, 09:35
Re: filter duplicates from a bag
I would say something along these lines:

B = group A by *;
C = foreach B generate group, COUNT(A) as count;
D = filter C by count > 1;
E = foreach D generate group;

Disclaimer: untested code.
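
If the goal is to get back every copy of the duplicated rows, as in the expected output quoted below, one possible variation is to keep the grouped bag and flatten it after filtering. This is only a sketch and equally untested: it assumes A is already loaded with the schema (userid, foo, bar) from the question, and the alias name cnt is made up for illustration.

```pig
-- Assumption: A has schema (userid: long, foo: long, bar: long), as in the question.
-- Group identical rows together by using all three fields as the key.
B = GROUP A BY (userid, foo, bar);

-- Keep the bag of original rows next to its size (cnt is a hypothetical alias).
C = FOREACH B GENERATE A, COUNT(A) AS cnt;

-- Only groups with more than one row are duplicates.
D = FILTER C BY cnt > 1;

-- Flatten the bag so each original duplicate row is emitted again.
E = FOREACH D GENERATE FLATTEN(A);

DUMP E;
```

With the sample data from the question, E should then hold (1,2,3) twice, whereas generating only group returns each duplicated key once.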

Cheers,
--
Gianmarco

On Fri, Aug 24, 2012 at 11:35 AM, Marco Cadetg <[EMAIL PROTECTED]> wrote:

> Hi there,
>
> What is the best way to retrieve duplicates from a bag? Basically, I would
> like to do the opposite of DISTINCT.
>
> A: {userid: long,foo: long,bar: long}
>
> dump A
> (1,2,3)
> (1,2,3)
> (1,3,2)
> (2,3,1)
>
> Now I would like to have a bag which contains
> (1,2,3)
> (1,2,3)
>
> Thanks,
> -Marco
>
Marco Cadetg 2012-08-24, 10:25