Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Threaded View
Pig >> mail # user >> count distinct on multiple columns


Copy link to this message
-
Re: count distinct on multiple columns
Great question. There seems to be some confusion about how DISTINCT
operates. I remembered (and thankfully found) this
message<http://mail-archives.apache.org/mod_mbox/pig-user/201309.mbox/%3CCAE7pYjar3hX4Kp%2B5SQz3sr%3DvjxfQDVq_6Yi4vh9KgfOj3dzTGw%40mail.gmail.com%3E>
that
explains the behavior.

As per the other post, it looks like what you've documented is expected
behavior.
On Mon, Oct 28, 2013 at 4:15 PM, Min Zhou <[EMAIL PROTECTED]> wrote:

> Hi all,
>
> Below script is how we count distinct on columns jid and mid
>
> sjv =  LOAD '/path/of/the/data' USING AvroStorage();
> jv = FOREACH sjv GENERATE TOTUPLE(jid, mid) AS jid_mid, time;
> groupv = GROUP jv ALL;
> countv = FOREACH groupv {
>         unique = DISTINCT jv.jid_mid;
>         GENERATE COUNT(unique);
>         };
> dump countv;
> The result is 2302351.
>
> If I use code below, got another result
> sjv =  LOAD '/path/of/the/data' USING AvroStorage();
> groupv = GROUP sjv ALL;
> countv = FOREACH groupv {
>         jid_mid = sjv.(jid, mid);
>         unique = DISTINCT jid_mid;
>         GENERATE COUNT(unique);
>         };
> dump countv;
> The result is 2290003.
>
> If I concat the two columns with a delimiter never exists in jid and mid, I
> got another result which I think is the correct answer of this aggregation.
> sjv =  LOAD '/path/of/the/data' USING AvroStorage();
> jv = FOREACH sjv GENERATE CONCAT(jid, CONCAT(':', mid)) AS jid_mid, time;
> groupv = GROUP jv ALL;
> countv = FOREACH groupv {
>         unique = DISTINCT jv.jid_mid;
>         GENERATE COUNT(unique);
>         };
> dump countv;
> The result is 2386385.
>
> I did a test with below script
> sjv =  LOAD '/path/of/the/data' USING AvroStorage();
> groupv = GROUP jv BY (jid, mid);
> unique = FOREACH groupv GENERATE FLATTEN(group), MIN(time);
> store unique ....;
> The hadoop counters showed that there are 2386385 records written into
> HDFS.  The number is as same as the 3rd pig script I list above.
>
> Can anyone explain the difference among those three?  Whey they lead to
> different results?
>
> Regards,
> Min
> --
> My research interests are distributed systems, parallel computing and
> bytecode based virtual machine.
>
> My profile:
> http://www.linkedin.com/in/coderplay
> My blog:
> http://coderplay.javaeye.com
>
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB