
RE: Dealing with empty data bags

Did you mean "unordered" when you said "A bag is an ordered multiset of
tuples"? Further down you say "because "bag" implies unordered".


-----Original Message-----
From: Chris Olston [mailto:[EMAIL PROTECTED]]
Sent: Friday, June 06, 2008 10:19 AM
Subject: Re: Dealing with empty data bags


You bring up a very good point about bags vs. tables.

A bag is an ordered multiset of tuples. A table is an ordered  
multiset of tuples. (Ordered multiset is a fancy way of saying  
"list", unless I'm overlooking something :)

To my knowledge there is no difference between the two, semantically.

In our *implementation* we have a special name for bags at the  
outermost level of nesting: tables. And we treat tables differently  
from nested bags in our implementation (at present, we parallelize  
operations over tables, but do not parallelize operations over nested bags).

The fact that the table/bag distinction percolated up to the user  
level is probably a mistake --- there should only be 3 user-visible  
types: table, tuple, atom.

(I prefer the name "table" over "bag", because "bag" implies  
unordered, when in fact in Pig our collections are ordered.)

Anyone disagree?
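
The ordered-versus-unordered point above can be made concrete with a small Python sketch. This is only an illustration of the two definitions, not of how Pig stores its collections; the function names and sample tuples are hypothetical. A list plays the role of the "ordered multiset", and collections.Counter plays the role of a classic unordered bag.

```python
from collections import Counter

# Two collections holding the same tuples (one of them twice),
# written down in different orders.
first = [(1, 2, 3), (2, 4, 5), (1, 2, 3)]
second = [(2, 4, 5), (1, 2, 3), (1, 2, 3)]


def same_as_ordered_multiset(left, right):
    """Compare as lists: order and duplicates both matter."""
    return list(left) == list(right)


def same_as_unordered_multiset(left, right):
    """Compare as classic bags: duplicates matter, order does not."""
    return Counter(left) == Counter(right)


if __name__ == "__main__":
    print(same_as_ordered_multiset(first, second))
    print(same_as_unordered_multiset(first, second))
```

Under the "ordered" reading the two collections differ; under the usual "bag" reading they are the same collection, which is the source of the naming confusion raised in the reply.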

On Jun 5, 2008, at 6:36 PM, Prashanth Pappu wrote:

> Thanks Chris for the response.
> That brings me to a set of questions regarding empty and null tables/bags
> that I've been struggling with and hopefully one of you can resolve them
> for me.
> (a) I read that PIG has four data types - atom, tuple, bag, map. But, what
> is a table? Is it the same as bag? How are they different?
> (b) What is the result data type when we first load data into a variable?
> For example,
>> a = load 'xyz' as (x,y,z);
>> dump a;
> (1, 2, 3)
> (2, 4, 5)
> What is the data type of a? Is it a bag as in a = {(1,2,3), (2,4,5)}? Or is
> it just a set of tuples (a table) but not a bag? And, we have a
> representation for an empty bag (= {}), and an empty 'set of tuples' is
> simply null/empty?
> (c) I'm trying to understand the differences between bags and tables and
> verifying if we have defined the semantics to deal with them 'consistently'
> irrespective of whether they are empty or not. For example, reference my
> earlier email about an implementation 'bug' in PIG execution engine when
> using SPLIT on an empty table.
> Thanks in advance!
> Prashanth
> On Thu, Jun 5, 2008 at 4:08 PM, Chris Olston <[EMAIL PROTECTED]> wrote:
>> It's not "buggy" or "incorrect", it's just different from the semantics
>> that you were hoping for. Group and COUNT each have simple, well-defined,
>> and correctly-implemented semantics. If you feed an empty table into group
>> it produces an empty table; COUNT over an empty table produces an empty
>> table -- hence their composition produces an empty table when given an
>> empty table.
>> The question is whether one can construct a Pig program that gives the
>> semantics you want. Unfortunately off the top of my head the answer seems
>> to be 'no'. If that's the case we need to look at what needs to be
>> added/changed in the language to enable testing for empty outermost tables.
>> (If I'm overlooking something I'm sure one of my colleagues will chime in :)
>> -Chris
>> On Jun 5, 2008, at 3:31 PM, Prashanth Pappu wrote:
>>> (a) I see that there are a lot of places where PIG doesn't correctly deal
>>> with results that are empty bags.
>>> Here's an example - Counting Tuples. Let's say I want to count the number
>>> of tuples in 'b' (a subset of 'a'). I can do the following -
>>> a = load 'xyz' as (x,y,z);
>>> b = filter a by x==X;
>>> c = group b all;
>>> d = foreach c generate COUNT(b);
>>> Ideally, we want d to be (0) if b has no tuples and non-zero otherwise.
>>> Unfortunately, if b is empty, c is also empty! This is buggy because it

Christopher Olston, Ph.D.
Sr. Research Scientist
Yahoo! Research
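
The group/COUNT behavior described in the quoted messages can be modeled with a short Python sketch. It is a toy model of the semantics as stated in this thread, not Pig's implementation: tables are plain lists of tuples, and group_all, foreach_count, and count_or_zero are hypothetical names. The last helper shows a fallback applied outside of Pig, since the thread suggests no in-language way to test for an empty outermost table.

```python
def group_all(table):
    """Model of `c = group b all;`.

    A non-empty input yields one tuple holding the whole input as a
    nested bag. An empty input yields an empty table: with no input
    tuples there is no group to emit.
    """
    rows = list(table)
    if not rows:
        return []
    return [("all", rows)]


def foreach_count(grouped):
    """Model of `d = foreach c generate COUNT(b);`.

    Emits one output tuple per input tuple, so an empty input table
    produces an empty output table rather than (0).
    """
    return [(len(bag),) for _key, bag in grouped]


def count_or_zero(table):
    """Hypothetical caller-side fallback: treat 'no rows' as a count of 0."""
    counted = foreach_count(group_all(table))
    return counted[0][0] if counted else 0


if __name__ == "__main__":
    b_full = [(1, 2, 3), (2, 4, 5)]
    b_empty = []
    print(foreach_count(group_all(b_full)))
    print(foreach_count(group_all(b_empty)))
    print(count_or_zero(b_empty))
```

The model makes the disagreement visible: each operator behaves consistently on its own, yet their composition never produces the (0) tuple that the original question asked for.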