Pig, mail # user - Re: Problem in understanding UDF COUNT - 2014-07-21, 15:34
Re: Problem in understanding UDF COUNT

*d = foreach c generate COUNT(b), group;*

I interpret or visualize it as:

c is a structure made up of groups of words or items. Imagine
a list where each entry is a groupid, and each groupid points to a
collection of objects/items belonging to that groupid. We can call
this collection b. You can also imagine c as a nested map, where the keys
are the distinct groupids and each value is the collection of items (again,
let us call it b) belonging to that key.

So, now you want to count how many items exist for each groupid in the list
(or map) c. Recall that b is our name for the collection of items attached
to each entry of c.

c[0]=new york points to  [1,2,3]
c[1]=philadelphia points to  [1,2,3,4]
c[2]=boston points to  [5,6,7,8,9]

So in the above example the c list has 3 unique groupids (new york,
philadelphia and boston) and each points to its own collection of items,
which we are calling b. We want to know the count for each group, which is
3, 4 & 5 for new york, philadelphia & boston respectively.
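
The same picture can be sketched in plain Python. This is only a stand-in for the mental model, not Pig and not how Pig is implemented. The flat `rows` list and its (city, item) layout are assumptions made for illustration; the cities and item values are the ones from the example above.

```python
from collections import defaultdict

# Hypothetical flat input: one (city, item) record per row, before grouping.
rows = [
    ("new york", 1), ("new york", 2), ("new york", 3),
    ("philadelphia", 1), ("philadelphia", 2), ("philadelphia", 3),
    ("philadelphia", 4),
    ("boston", 5), ("boston", 6), ("boston", 7), ("boston", 8), ("boston", 9),
]

# c maps each group id to the collection ("b") of items sharing that id.
c = defaultdict(list)
for city, item in rows:
    c[city].append(item)

# Count how many items sit in b for each group id.
counts = {group: len(b) for group, b in c.items()}
print(counts)
```

Each key of `c` plays the role of a groupid, and each value plays the role of b for that groupid.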

Now coming back to the pig statement once again:
*d = foreach c generate COUNT(b), group;*

This is exactly what we are doing:
*Counting, for each entry of c (new york, philadelphia, boston in our
example), how many items are in its b (3, 4 & 5).*

The second expression in the generate clause, 'group', gives us the group
id (the key of each entry in c) alongside each count of b.
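
As a rough analogy in plain Python (again a stand-in, not Pig itself), the statement behaves like a loop over c that emits one (count, group) pair per entry. The dictionary `c` below just restates the example data; the names are for illustration only.

```python
# c restates the example: group id -> collection of items (the "b" for that group).
c = {
    "new york": [1, 2, 3],
    "philadelphia": [1, 2, 3, 4],
    "boston": [5, 6, 7, 8, 9],
}

# Rough analogue of: d = foreach c generate COUNT(b), group;
# One output tuple per group, with the count first and the group id second.
d = [(len(b), group) for group, b in c.items()]

for row in d:
    print(row)
```

Here the rows come out in the order the dictionary was written. In Pig, the order of the output rows is not guaranteed unless an ORDER step is added.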

On Mon, Jul 21, 2014 at 11:02 AM, <[EMAIL PROTECTED]>