Prashanth Pappu
20080605, 22:31
Olga Natkovich
20080605, 23:05
Chris Olston
20080605, 23:08
Chris Olston
20080606, 01:06
Prashanth Pappu
20080606, 01:36
Ted Dunning
20080606, 01:42
Prashanth Pappu
20080606, 01:56
Ted Dunning
20080606, 02:24
Chris Olston
20080606, 17:18
Santhosh Srinivasan
20080606, 17:23
Prashanth Pappu
20080606, 19:06
Ted Dunning
20080606, 19:22
Utkarsh Srivastava
20080606, 19:35
Chris Olston
20080606, 22:09
pi song
20080607, 01:01
pi song
20080607, 01:16


Dealing with empty data bags
(a) I see a lot of places where PIG doesn't correctly deal with results that are empty bags.

Here's an example: counting tuples. Let's say I want to count the number of tuples in 'b' (a subset of 'a'). I can do the following:

    a = load 'xyz' as (x,y,z);
    b = filter a by x==X;
    c = group b all;
    d = foreach c generate COUNT(b);

Ideally, we want d to be (0) if b has no tuples and nonzero otherwise. Unfortunately, if b is empty, c is also empty! This is buggy because it causes d to be empty or null and not (0).

Whereas, if b is empty, c should ideally be c = (all, {}), which will make d = (0).

(b) Is there a different way of computing the number of tuples in b that will always (irrespective of whether b is empty or not) give the correct answer?

(c) I also see that PIG supports data maps. But I haven't seen any examples that illustrate how to create or manipulate data maps. Is there any such documentation?

thanks,
Prashanth
RE: Dealing with empty data bags
I agree with you about the group. Could you please open a JIRA about it? I don't think there is a workaround for this issue.

Pig does have limited support for maps. None of the existing expressions/operators create a map. The only way to get a map is to have one in your input data or for your UDF to generate one. If you do have a map, you can retrieve individual values as follows:

    A = load 'data' as (map);
    B = foreach A generate map#'key1', map#'key2' ...

where key1 and key2 are keys in the map.

Olga
Re: Dealing with empty data bags
It's not "buggy" or "incorrect", it's just different from the semantics that you were hoping for. Group and COUNT each have simple, well-defined, and correctly implemented semantics. If you feed an empty table into group it produces an empty table; COUNT over an empty table produces an empty table. Hence their composition produces an empty table when given an empty table.

The question is whether one can construct a Pig program that gives the semantics you want. Unfortunately, off the top of my head the answer seems to be 'no'. If that's the case we need to look at what needs to be added/changed in the language to enable testing for empty outermost tables.

(If I'm overlooking something I'm sure one of my colleagues will chime in :)

Chris

Christopher Olston, Ph.D.
Sr. Research Scientist
Yahoo! Research
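To make the composition argument above concrete, here is a toy Python model. It is only a sketch of the semantics as described in this thread, not Pig code and not Pig's implementation; the function names `group_all` and `foreach_count` are made up for illustration, and a table is modeled as a plain list of tuples.

```python
# Toy model of the semantics described in the thread; NOT Pig's implementation.
# A table is modeled as a Python list of tuples.

def group_all(table):
    """GROUP ... ALL as described: one ('all', bag) record, or no record for empty input."""
    if not table:
        return []
    return [("all", list(table))]

def foreach_count(grouped):
    """FOREACH ... GENERATE COUNT(bag): exactly one output record per input record."""
    return [(len(bag),) for _key, bag in grouped]

a = [(1, 2, 3), (2, 4, 5)]
b_nonempty = [t for t in a if t[0] == 1]   # stands in for: filter a by x == 1
b_empty = [t for t in a if t[0] == 99]     # stands in for: filter a by x == 99

print(foreach_count(group_all(b_nonempty)))  # [(1,)]
print(foreach_count(group_all(b_empty)))     # []
```

Because the per-record step never runs when the grouped table has no records, the empty input yields an empty result rather than (0).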
Re: Dealing with empty data bags
Probably the best fix is to redefine GROUP ALL so that in all cases it outputs a table with exactly one record. In the case of an empty input table it would produce an output record containing an empty bag. Is that what you have in mind, Olga?

Chris

Christopher Olston, Ph.D.
Sr. Research Scientist
Yahoo! Research
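A toy Python sketch of the redefinition proposed above, for comparison. Again this is only a model under the assumptions stated in the thread, not Pig itself; `group_all_proposed` and `foreach_count` are hypothetical names.

```python
# Toy model of the PROPOSED semantics for GROUP ALL; not Pig's implementation.

def group_all_proposed(table):
    """Always emit exactly one ('all', bag) record, even when the input is empty."""
    return [("all", list(table))]

def foreach_count(grouped):
    """One output record per input record, holding the size of its bag."""
    return [(len(bag),) for _key, bag in grouped]

print(group_all_proposed([]))                           # [('all', [])]
print(foreach_count(group_all_proposed([])))            # [(0,)]
print(foreach_count(group_all_proposed([(1, 2, 3)])))   # [(1,)]
```

With one record always present, the count step has something to run on, so the empty case gives (0).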
Re: Dealing with empty data bags
Thanks Chris for the response.

That brings me to a set of questions regarding empty and null tables/bags that I've been struggling with, and hopefully one of you can resolve them for me.

(a) I read that PIG has four data types: atom, tuple, bag, map. But what is a table? Is it the same as a bag? How are they different?

(b) What is the result data type when we first load data into a variable? For example,

    > a = load 'xyz' as (x,y,z);
    > dump a;
    (1, 2, 3)
    (2, 4, 5)

What is the data type of a? Is it a bag, as in a = {(1,2,3), (2,4,5)}? Or is it just a set of tuples (a table) but not a bag? And we have a representation for an empty bag (= {}), while an empty 'set of tuples' is simply null/empty?

(c) I'm trying to understand the differences between bags and tables and to verify whether we have defined the semantics to deal with them 'consistently', irrespective of whether they are empty or not. For example, see my earlier email about an implementation 'bug' in the PIG execution engine when using SPLIT on an empty table.

Thanks in advance!
Prashanth
RE: Dealing with empty data bags
I don't think that this is the correct semantics if Pig intends to be set-theoretically correct. What would the key be on this one record? An empty bag? But what if two input records have an empty bag as key?

There are NO correct members of an empty set. It is empty. The empty set of records has no records in it, not one.
Re: Dealing with empty data bags
> I don't think that this is the correct semantics if pig intends to be set
> theoretically correct. What would the key be on this one record?

Wouldn't it just be the atom 'all'?

For a non-empty a:

    > dump a;
    (1,2)
    > b = group a all;
    > dump b;
    (all, {(1,2)})

For an empty a, under the proposed semantics:

    > dump a;
    [Empty]
    > b = group a all;
    > dump b;
    (all, {})

=====> Consistent irrespective of whether a is empty or not.

Prashanth
RE: Dealing with empty data bags
You are correct. Sorry for misunderstanding you.
Re: Dealing with empty data bags
Prashanth,

You bring up a very good point about bags vs. tables.

A bag is an ordered multiset of tuples. A table is an ordered multiset of tuples. (Ordered multiset is a fancy way of saying "list", unless I'm overlooking something :)

To my knowledge there is no difference between the two, semantically.

In our *implementation* we have a special name for bags at the outermost level of nesting: tables. And we treat tables differently from nested bags in our implementation (at present, we parallelize operations over tables, but do not parallelize operations over nested bags).

The fact that the table/bag distinction percolated up to the user level is probably a mistake; there should only be 3 user-visible types: table, tuple, atom.

(I prefer the name "table" over "bag", because "bag" implies unordered, when in fact in Pig our collections are ordered.)

Anyone disagree?

Chris

Christopher Olston, Ph.D.
Sr. Research Scientist
Yahoo! Research
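The "ordered multiset is just a list" remark above can be illustrated with a few lines of Python. This says nothing about Pig's internals; it only contrasts the three notions (list, unordered multiset, set) using standard-library types, and the sample rows are made up.

```python
from collections import Counter

rows = [(1, 2), (2, 4), (1, 2)]
reordered = [(2, 4), (1, 2), (1, 2)]

# Ordered multiset ("list"): duplicates are kept and order is significant.
print(rows == reordered)                    # False

# Unordered multiset (the usual meaning of "bag"): duplicates kept, order ignored.
print(Counter(rows) == Counter(reordered))  # True

# Plain set: duplicates are dropped.
print(len(set(rows)))                       # 2
```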
RE: Dealing with empty data bags
Chris,

Did you mean unordered when you said "A bag is an ordered multiset of tuples"? Further down you say "because 'bag' implies unordered".

Santhosh
Re: Dealing with empty data bags
Chris,

Thanks for the clarification.

I think one reason why users do notice the distinction between the outermost tables and other tables is the difference in representation:

- a tuple is always enclosed in '(' and ')'
- a map is always enclosed in '[' and ']'
- an inner table is always enclosed in '{' and '}'

but an outer table has no enclosing braces!

I think enclosing even the outermost tables in '{' and '}' will make it clear that all tables are identical, at least semantically. For example,

    > a = load '/xy' as (x,y);
    > dump a;
    (1,2)        ===> should be {(1,2)}

and

    > b = filter a by x==3;
    > dump b;
    [Nothing]    ===> should be {}

This will definitely make things a lot easier to understand.

And this also raises a second question: why are all functions defined over tables, like COUNT, SUM, AVG etc., usable only from a FOREACH statement?

For example, to count the number of tuples in a table, we currently use:

    > a = load '/xy' as (x,y);
    > b = group a all;
    > c = foreach b generate COUNT(a);

Now that we know that a is a table like any other, I'm sure many users wonder why we can't simply use

    > a = load 'xy' as (x,y);
    > c = COUNT(a);

And I think I now understand the reason: operations over outermost tables are parallelized and operations over inner tables are not. So the above operation would be OK if we figure out a way to automatically parallelize table operations (like COUNT(a)).

But I agree, the fact that table operations (like COUNT, AVG etc.) cannot be used on outermost tables (at least currently) shouldn't stop us from thinking that even outermost tables are simply tables. The change in representation for outermost tables will help clear the confusion.

Prashanth
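As a rough illustration of why a whole-table COUNT(a) is at least compatible with parallel execution, here is a Python sketch in which the table is split into partitions, each partition is counted independently, and the partial counts are summed. This is an assumption about how such an operation could be decomposed, not a description of what Pig does; `partition` and `count_table` are hypothetical helpers.

```python
# Sketch: a table-level count decomposed into per-partition counts plus a sum.
# Not Pig's implementation; the partitioning scheme here is arbitrary.

def partition(table, n):
    """Split a table (list of tuples) into n round-robin partitions."""
    return [table[i::n] for i in range(n)]

def count_table(table, n_partitions=3):
    """Count each partition independently, then combine the partial counts."""
    partial_counts = [len(p) for p in partition(table, n_partitions)]
    return sum(partial_counts)

print(count_table([(1, 2), (3, 4), (5, 6), (7, 8)]))  # 4
print(count_table([]))                                # 0
```

Note that in this decomposition an empty table naturally counts to 0, since the sum of the (all-zero) partial counts is 0.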
RE: Dealing with empty data bags
I think bags are ordered as well, just as he said.

The sentence you are mentioning is explaining why Chris thinks the word bag is a bad one (because it implies unordered while the implementation is ordered).

-----Original Message-----
From: Santhosh Srinivasan [mailto:[EMAIL PROTECTED]]
Sent: Fri 6/6/2008 10:23 AM
To: [EMAIL PROTECTED]
Subject: RE: Dealing with empty data bags

Chris,

Did you mean unordered when you said "A bag is an ordered multiset of tuples." Further down you say "because "bag" implies unordered".

Santhosh

-----Original Message-----
From: Chris Olston [mailto:[EMAIL PROTECTED]]
Sent: Friday, June 06, 2008 10:19 AM
To: [EMAIL PROTECTED]
Subject: Re: Dealing with empty data bags

Prashanth,

You bring up a very good point about bags vs. tables.

A bag is an ordered multiset of tuples. A table is an ordered multiset of tuples. (Ordered multiset is a fancy way of saying "list", unless I'm overlooking something :)

To my knowledge there is no difference between the two, semantically.

In our *implementation* we have a special name for bags at the outermost level of nesting: tables. And we treat tables differently from nested bags in our implementation (at present, we parallelize operations over tables, but do not parallelize operations over nested bags.)

The fact that the table/bag distinction percolated up to the user level is probably a mistake - there should only be 3 user-visible types: table, tuple, atom.

(I prefer the name "table" over "bag", because "bag" implies unordered, when in fact in Pig our collections are ordered.)

Anyone disagree?

Chris

On Jun 5, 2008, at 6:36 PM, Prashanth Pappu wrote:

> Thanks Chris for the response.
>
> That brings me to a set of questions regarding empty and null tables/bags
> that I've been struggling with and hopefully one of you can resolve them
> for me.
>
> (a) I read that PIG has four data types - atom, tuple, bag, map. But, what
> is a table? Is it the same as bag? How are they different?
>
> (b) What is the result data type when we first load data into a variable?
> For example,
>
>> a = load 'xyz' as (x,y,z);
>> dump a;
> (1, 2, 3)
> (2, 4, 5)
>
> What is the data type of a? Is it a bag as in a = {(1,2,3), (2,4,5)}? Or is
> it just a set of tuples (a table) but not a bag? And, we have a
> representation for an empty bag (= {}), and an empty 'set of tuples' is
> simply null/empty?
>
> (c) I'm trying to understand the differences between bags and tables and
> verifying if we have defined the semantics to deal with them 'consistently'
> irrespective of whether they are empty or not. For example, reference my
> earlier email about an implementation 'bug' in PIG execution engine when
> using SPLIT on an empty table.
>
> Thanks in advance!
> Prashanth
>
> On Thu, Jun 5, 2008 at 4:08 PM, Chris Olston <[EMAIL PROTECTED]> wrote:
>
>> It's not "buggy" or "incorrect", it's just different from the semantics
>> that you were hoping for. Group and COUNT each have simple, well-defined,
>> and correctly implemented semantics. If you feed an empty table into group
>> it produces an empty table; Count over an empty table produces an empty
>> table - hence their composition produces an empty tuple when given an
>> empty table.
>>
>> The question is whether one can construct a Pig program that gives the
>> semantics you want. Unfortunately off the top of my head the answer seems
>> to be 'no'. If that's the case we need to look at what needs to be
>> added/changed in the language to enable testing for empty outermost
>> tables. (If I'm overlooking something I'm sure one of my colleagues will
>> chime in :)
>>
>> Chris
>>
>> On Jun 5, 2008, at 3:31 PM, Prashanth Pappu wrote:
>>
>>> (a) I see that at a lot of places where PIG doesn't correctly deal with
>>> results that are empty bags.
>>>
>>> Here's an example - Counting Tuples. Let's say I want to count

Christopher Olston, Ph.D.
Sr. Research Scientist
Yahoo! Research
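The GROUP/COUNT behaviour described in the quoted explanation can be modelled outside Pig. The Python sketch below is only an illustration under stated assumptions, not Pig's implementation: relations are modelled as plain lists of tuples, and `group_all`, `group_all_keep_empty`, and `count_per_group` are hypothetical helper names. It shows why composing `group ... all` with `COUNT` yields no rows for an empty input, and what the `(all, {})` variant asked for in the original question would return.

```python
# Toy model of the semantics discussed in this thread; not Pig's actual code.
# A relation ("table") is modelled as a list of tuples.

def group_all(relation):
    """Model of `group b all` as described in the thread:
    one ('all', bag) row, or no rows at all if the input is empty."""
    return [("all", list(relation))] if relation else []


def group_all_keep_empty(relation):
    """Hypothetical variant proposed in the thread:
    always emit ('all', bag), even when the bag is empty."""
    return [("all", list(relation))]


def count_per_group(grouped):
    """Model of `foreach c generate COUNT(b)`: one count per grouped row."""
    return [(len(bag),) for _key, bag in grouped]


a = [(1, 2, 3), (2, 4, 5)]
b_nonempty = [t for t in a if t[0] == 1]   # filter a by x == 1
b_empty = [t for t in a if t[0] == 99]     # filter a by x == 99 (no matches)

print(count_per_group(group_all(b_nonempty)))           # [(1,)]
print(count_per_group(group_all(b_empty)))              # []  (no row, not (0))
print(count_per_group(group_all_keep_empty(b_empty)))   # [(0,)]
```

In this model the missing `(0)` is a direct consequence of grouping producing zero rows, so there is nothing for the per-row count to run on.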
RE: Dealing with empty data bags
Hi Prashanth,
You make a very good point about treating the outermost tables symmetrically with inner tables. We have been thinking about the kind of symmetric syntax that you are proposing for a while now; we just have not had the bandwidth to get to it yet.

Utkarsh

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Prashanth Pappu
Sent: Friday, June 06, 2008 12:07 PM
To: [EMAIL PROTECTED]
Subject: Re: Dealing with empty data bags

Chris,

Thanks for the clarification.

I think one reason why users do notice the distinction between the outermost tables and other tables is because of the difference in representation.

- a tuple is always enclosed in '(' and ')'
- a map is always enclosed in '[' and ']'
- an inner table is always enclosed in '{' and '}'

but an outer table has no enclosing braces!

I think enclosing even the outermost tables in '{' and '}' will make it clear that all tables are identical, at least semantically. For example,

> a = load '/xy' as (x,y);
> dump a;
(1,2)       ===> should be {(1,2)}

and

> b = filter a by x==3;
> dump b;
[Nothing]   ===> should be {}

This will definitely make things a lot easier to understand.

And this also raises a second question - why are all functions defined over tables, like COUNT, SUM, AVG etc., usable only from a FOREACH statement? For example, to count the number of tuples in a table, we currently use -

> a = load '/xy' as (x,y);
> b = group a all;
> c = foreach b generate COUNT(a);

Now that we know that a is a table like any other, I'm sure many users wonder why we can't simply use

> a = load 'xy' as (x,y);
> c = COUNT(a);

And, I think I now understand the reason - because operations over outermost tables are parallelized and operations over inner tables are not. So, the above operation would be ok, if we figure out a way to automatically parallelize table operations (like COUNT(a)).

But I agree, the fact that table operations (like COUNT, AVG etc.) cannot be used on outermost tables (at least currently) shouldn't stop us from thinking that even outermost tables are simply tables. The change in representation for outermost tables will help clear the confusion.

Prashanth
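The representation change proposed in the quoted message is easy to picture with a small formatter. The Python sketch below is a hedged illustration only: it assumes tuples are modelled as Python tuples, tables/bags as lists, and maps as dicts, and `render`, `dump_current`, and `dump_proposed` are hypothetical names rather than anything in Pig. The `key#value` map layout is likewise an assumption made for the example.

```python
# Toy formatter contrasting the two ways of printing an outermost table.
# Assumed modelling: tuple -> Python tuple, table/bag -> list, map -> dict.

def render(value):
    """Render a value with the enclosing delimiters listed in the message."""
    if isinstance(value, tuple):
        return "(" + ",".join(render(v) for v in value) + ")"
    if isinstance(value, list):
        return "{" + ",".join(render(v) for v in value) + "}"
    if isinstance(value, dict):
        # The key#value layout is an assumption for illustration.
        return "[" + ",".join(f"{k}#{render(v)}" for k, v in value.items()) + "]"
    return str(value)


def dump_current(relation):
    """Outermost table printed one tuple per line, with no enclosing braces."""
    return "\n".join(render(t) for t in relation)


def dump_proposed(relation):
    """Outermost table printed like any inner table, inside '{' and '}'."""
    return render(list(relation))


a = [(1, 2)]
b = [t for t in a if t[0] == 3]   # filter a by x == 3 (no matches)

print(repr(dump_current(a)))    # '(1,2)'
print(repr(dump_proposed(a)))   # '{(1,2)}'
print(repr(dump_current(b)))    # ''   (nothing is printed)
print(repr(dump_proposed(b)))   # '{}'
```

With the proposed form, an empty result is visibly an empty table rather than an absence of output.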
Re: Dealing with empty data bags
Yes, that's right - it was *not* a typo. Pig "bags" are ordered.

By the way, the word "table" is also problematic because Pig does not require uniform schemas across tuples. Usually "table" implies that all member tuples adhere to a given table-level schema.

Bottom line is that conceptually there is one data type that encompasses what we currently refer to as "bag" and "table". As for a good name for this type, there has been much discussion but no satisfactory outcome. Perhaps "TupleList", but that doesn't have a nice ring to it :). Or we could leave it as "table" and add an asterisk explaining that it may have a non-uniform schema (the common case is probably that there *is* schema uniformity - I would expect irregular schemas to be rare). Or ... ?

Chris

Christopher Olston, Ph.D.
Sr. Research Scientist
Yahoo! Research
Re: Dealing with empty data bags
We should update the Pig Wiki to reflect this. Even I have always thought that our semantics were bag == multiset. The only operation that results in an "ordered bag" is "ORDER", and any further operation on an ordered bag does not preserve that property. For example:

B = ORDER A BY $0 ;
C = FILTER B BY $0 == 0 ;

The "FILTER" operator doesn't preserve ordered-bag closure and outputs only a bag.

Also, here is what I discussed with Santhosh before regarding:

A = FOREACH B {
    GENERATE B.$0 * B.$1 ;
} ;

which I think is inappropriate, because this operation seems to be very nondeterministic in definition unless we have the notion of order on B. (Besides the fact that we also don't have definitions of Bag x Bag operations like this.)

Pi
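One way to read the concern about `B.$0 * B.$1` is that a position-by-position product is only well defined if both projected bags are enumerated in the same order. The Python sketch below illustrates that reading under stated assumptions: bags are modelled as lists, multiset equality is checked with `collections.Counter`, and `elementwise_product` is a hypothetical helper, not a Pig operator.

```python
# Toy illustration: a positional Bag x Bag product depends on enumeration order.
from collections import Counter

B = [(1, 10), (2, 20), (3, 30)]
col0 = [t[0] for t in B]   # models B.$0
col1 = [t[1] for t in B]   # models B.$1


def elementwise_product(xs, ys):
    """Pair items by position and multiply; meaningful only if order is defined."""
    return [x * y for x, y in zip(xs, ys)]


# The same bag of second fields, enumerated in a different order.
col1_reordered = [30, 10, 20]

# As multisets (unordered bags) the two enumerations are identical...
print(Counter(col1) == Counter(col1_reordered))      # True

# ...but the positional products differ.
print(elementwise_product(col0, col1))               # [10, 40, 90]
print(elementwise_product(col0, col1_reordered))     # [30, 20, 60]
```

If bags are plain multisets, nothing distinguishes the two enumerations, so the result of such an operation is not determined without an explicit notion of order.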
Re: Dealing with empty data bags
Question??
Is there any particular reason why we need the global "order" notion at the top level? I think most SQL users are already familiar with the fact that their tables are not ordered.

By relaxing the notion of order at the top level:

- A plan at any level has no special distinction, which simplifies the implementation because the rules at all levels are the same.
- We can easily do N levels of nesting if we want to.

If users want "order", they just do "ORDER", but any operation after that will not preserve the order. This is also consistent with the SQL model.

Whether we parallelize the job/subjob or not should always be based on the problem size (easily measured by input size).

Pi

