Shahab, thanks. My doubt is: why are we taking bag b, and not bag c, as the argument to the COUNT(b) function? Bag c contains the groups, not bag b. Thanks. On Mon, Jul 21, 2014 at 6:21 PM, Shahab Yunus <[EMAIL PROTECTED]> wrote:
This was hard for me to get when I started using pig, and it still annoys me after 1.5 years' experience with pig. In mathematics and logic, quantifiers (like "for each", "there exists") bind variables that occur in their scope: (for each x)(there exists y) [y > x]
The (for each x) binds x in (there exists y) [y > x]
But in pig the variable x in (for each x) *does not bind occurrences of x* in the following subexpression. IMO this is an unnecessary stumbling block to people learning pig, who have a background in math or logic.
Here is how you can read foreach c generate COUNT(b), group; so it makes sense: c's components are "group" and (bag) b, so: foreach (group, b) in c generate COUNT(b), group;
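The reading above can be sketched in Python, where the loop header really does bind both names. This is only a model of the semantics, not Pig itself: the input data, the field name `word`, and the helper `group_by_first_field` are assumptions made for illustration, since the original script is not shown in the thread.

```python
from itertools import groupby

# Hypothetical input relation: a bag of single-field tuples (words).
relation_b = [("pig",), ("hadoop",), ("pig",), ("hive",), ("pig",)]

def group_by_first_field(relation):
    """Model of `c = group b by word;`: one (group, bag) record per key."""
    ordered = sorted(relation, key=lambda t: t[0])
    return [(key, list(bag)) for key, bag in groupby(ordered, key=lambda t: t[0])]

c = group_by_first_field(relation_b)

# Model of `d = foreach c generate COUNT(b), group;`
# read as `foreach (group, b) in c`: inside the loop, b is the bag
# belonging to one group, not the whole original relation.
d = [(len(b), group) for group, b in c]

for count, group in d:
    print(count, group)
```

Each record of c carries its own b, which is why COUNT(b) yields one count per group.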
I would love it if the Pig syntax were extended to allow quantifiers like "foreach (group, b) in c" but I don't know how feasible that would be.
William F Dowling Senior Technologist Thomson Reuters
c is a structure holding or consisting of groups of words or items. Imagine a list where each entry is the groupid and each groupid points to a collection of objects/items belonging to that same groupid. We can call this collection b. You can also imagine c as a nested map, where the key is distinct groupids and the value is a collection of items (again, let us call it b) belonging to one key.
So, now you want to count how many items exist for each groupid in the list (or map) c. Recall that we are calling the collection of items for each entry of c b.
c=new york points to [1,2,3]
c=philadelphia points to [1,2,3,4]
c=boston points to [5,6,7,8,9]
So in the above example, in the c list we have 3 unique groupids (new york, boston and philadelphia) and each points to its own collection of items that we are calling b. We want to know the count for each group, which is 3, 4 & 5 for new york, philadelphia & boston respectively.
Now coming back to the pig statement once again: *d = foreach c generate COUNT(b), group;*
This is exactly what we are doing.... *Counting for each c (new york, philadelphia, boston in our example), how many b's are in there (3, 4 & 5).*
The second expression in the generate clause, 'group', will give us the group id (the c's) for each count of b as well.
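As a rough illustration of the nested-map picture, here is the same city example written as a Python dict. It is a sketch of the idea only, under the assumption that c can be modelled as a mapping from groupid to its collection b; Pig does not store c this way.

```python
# c modelled as a map: groupid -> collection of items (the "b" of that group).
c = {
    "new york": [1, 2, 3],
    "philadelphia": [1, 2, 3, 4],
    "boston": [5, 6, 7, 8, 9],
}

# Model of `d = foreach c generate COUNT(b), group;`
d = [(len(b), group) for group, b in c.items()]

for count, group in d:
    print(count, group)
```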
Regards, Shahab On Mon, Jul 21, 2014 at 11:02 AM, <[EMAIL PROTECTED]> wrote:
In this case, does b refer to the tuples corresponding to a single group? If so, I still do not get the point, because b is a bag that contains all the records and not only the records of a single group.
On Jul 21, 2014 8:33 PM, <[EMAIL PROTECTED]> wrote:
Thanks Shahab and William, I am now clear about the COUNT functionality. But I still have a doubt about how UDFs work in general. Example:
a = load 'movies' using PigStorage() as (name:chararray, movid:int, stars:int, comment:chararray);
b = group a by stars;
c = foreach b generate myudf(a);
In this case, what would be the input to the UDF: the entire group, or a single tuple of that group? I think the input would be a single tuple of that group for each iteration, but I am not sure. Thanks. Ashish. On Tue, Jul 22, 2014 at 5:30 PM, Shahab Yunus <[EMAIL PROTECTED]> wrote:
The best way to get answers to such easy questions:
1. read the docs
2. create a sample script and run it
The docs say that a group (a bag of tuples having the same 'stars' value) would be passed to your UDF. I can't understand what confuses you. These things are really basic. 2014-07-23 16:30 GMT+04:00 Ashish Dobhal <[EMAIL PROTECTED]>:
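To make that answer concrete, here is a small Python model of the script in the question. It is a sketch under assumptions, not Pig: the sample rows are invented, and `myudf` is a stand-in that just records how it is called. It models the behaviour described above, where the function is invoked once per group and receives that group's whole bag.

```python
from collections import defaultdict

# Hypothetical rows for relation a: (name, movid, stars, comment).
a = [
    ("Alien", 1, 5, "classic"),
    ("Heat", 2, 4, "tense"),
    ("Ronin", 3, 4, "chases"),
    ("Brazil", 4, 5, "odd"),
]

calls = []  # size of the bag seen on each invocation

def myudf(bag):
    """Stand-in UDF: records the size of its input bag and returns it."""
    calls.append(len(bag))
    return len(bag)

# Model of `b = group a by stars;`: one (group, bag) record per stars value.
groups = defaultdict(list)
for row in a:
    groups[row[2]].append(row)
b = list(groups.items())

# Model of `c = foreach b generate myudf(a);`
# Here `a` inside the foreach is the bag of rows sharing one stars value.
c = [myudf(bag_a) for _stars, bag_a in b]

print(len(calls), "calls, bag sizes:", calls)
```

With four rows and two distinct stars values, the stand-in is called twice with a two-row bag each time, not four times with one row.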