Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Pig, mail # user - count distinct on multiple columns


Copy link to this message
-
Re: count distinct on multiple columns
Pradeep Gollakota 2013-10-29, 19:24
Great question. There seems to be some confusion about how DISTINCT
operates. I remembered (and thankfully found) this
message<http://mail-archives.apache.org/mod_mbox/pig-user/201309.mbox/%3CCAE7pYjar3hX4Kp%2B5SQz3sr%3DvjxfQDVq_6Yi4vh9KgfOj3dzTGw%40mail.gmail.com%3E>
that
explains the behavior.

As per the other post, it looks like what you've documented is expected
behavior.
On Mon, Oct 28, 2013 at 4:15 PM, Min Zhou <[EMAIL PROTECTED]> wrote:

> Hi all,
>
> Below script is how we count distinct on columns jid and mid
>
> sjv =  LOAD '/path/of/the/data' USING AvroStorage();
> jv = FOREACH sjv GENERATE TOTUPLE(jid, mid) AS jid_mid, time;
> groupv = GROUP jv ALL;
> countv = FOREACH groupv {
>         unique = DISTINCT jv.jid_mid;
>         GENERATE COUNT(unique);
>         };
> dump countv;
> The result is 2302351.
>
> If I use code below, got another result
> sjv =  LOAD '/path/of/the/data' USING AvroStorage();
> groupv = GROUP sjv ALL;
> countv = FOREACH groupv {
>         jid_mid = sjv.(jid, mid);
>         unique = DISTINCT jid_mid;
>         GENERATE COUNT(unique);
>         };
> dump countv;
> The result is 2290003.
>
> If I concat the two columns with a delimiter never exists in jid and mid, I
> got another result which I think is the correct answer of this aggregation.
> sjv =  LOAD '/path/of/the/data' USING AvroStorage();
> jv = FOREACH sjv GENERATE CONCAT(jid, CONCAT(':', mid)) AS jid_mid, time;
> groupv = GROUP jv ALL;
> countv = FOREACH groupv {
>         unique = DISTINCT jv.jid_mid;
>         GENERATE COUNT(unique);
>         };
> dump countv;
> The result is 2386385.
>
> I did a test with below script
> sjv =  LOAD '/path/of/the/data' USING AvroStorage();
> groupv = GROUP jv BY (jid, mid);
> unique = FOREACH groupv GENERATE FLATTEN(group), MIN(time);
> store unique ....;
> The hadoop counters showed that there are 2386385 records written into
> HDFS.  The number is as same as the 3rd pig script I list above.
>
> Can anyone explain the difference among those three?  Whey they lead to
> different results?
>
> Regards,
> Min
> --
> My research interests are distributed systems, parallel computing and
> bytecode based virtual machine.
>
> My profile:
> http://www.linkedin.com/in/coderplay
> My blog:
> http://coderplay.javaeye.com
>