Pig >> mail # user >> Dealing with empty data bags
RE: Dealing with empty data bags
Chris,

Did you mean "unordered" when you said "A bag is an ordered multiset of
tuples"? Further down you say "because "bag" implies unordered".

Santhosh

-----Original Message-----
From: Chris Olston [mailto:[EMAIL PROTECTED]]
Sent: Friday, June 06, 2008 10:19 AM
To: [EMAIL PROTECTED]
Subject: Re: Dealing with empty data bags

Prashanth,

You bring up a very good point about bags vs. tables.

A bag is an ordered multiset of tuples. A table is an ordered  
multiset of tuples. (Ordered multiset is a fancy way of saying  
"list", unless I'm overlooking something :)

To my knowledge there is no difference between the two, semantically.

In our *implementation* we have a special name for bags at the  
outermost level of nesting: tables. And we treat tables differently  
from nested bags in our implementation (at present, we parallelize  
operations over tables, but do not parallelize operations over nested  
bags.)

The fact that the table/bag distinction percolated up to the user  
level is probably a mistake --- there should only be 3 user-visible  
types: table, tuple, atom.

(I prefer the name "table" over "bag", because "bag" implies  
unordered, when in fact in Pig our collections are ordered.)

Anyone disagree?

-Chris
On Jun 5, 2008, at 6:36 PM, Prashanth Pappu wrote:

> Thanks Chris for the response.
>
> That brings me to a set of questions regarding empty and null tables/bags
> that I've been struggling with and hopefully one of you can resolve them
> for me.
>
> (a) I read that PIG has four data types - atom, tuple, bag, map. But, what
> is a table? Is it the same as bag? How are they different?
>
> (b) What is the result data type when we first load data into a variable?
> For example,
>
>> a = load 'xyz' as (x,y,z);
>> dump a;
> (1, 2, 3)
> (2, 4, 5)
>
> What is the data type of a? Is it a bag as in a = {(1,2,3), (2,4,5)}? Or is
> it just a set of tuples (a table) but not a bag? And, we have a
> representation for an empty bag (= {}), and an empty 'set of tuples' is
> simply null/empty?
>
> (c) I'm trying to understand the differences between bags and tables and
> verifying if we have defined the semantics to deal with them 'consistently'
> irrespective of whether they are empty or not. For example, reference my
> earlier email about an implementation 'bug' in PIG execution engine when
> using SPLIT on an empty table.
>
> Thanks in advance!
> Prashanth
>
> On Thu, Jun 5, 2008 at 4:08 PM, Chris Olston <[EMAIL PROTECTED]> wrote:
>
>> It's not "buggy" or "incorrect", it's just different from the semantics
>> that you were hoping for. Group and COUNT each have simple, well-defined,
>> and correctly-implemented semantics. If you feed an empty table into group
>> it produces an empty table; Count over an empty table produces an empty
>> table -- hence their composition produces an empty tuple when given an
>> empty table.
>>
>> The question is whether one can construct a Pig program that gives the
>> semantics you want. Unfortunately off the top of my head the answer seems
>> to be 'no'. If that's the case we need to look at what needs to be
>> added/changed in the language to enable testing for empty outermost tables.
>> (If I'm overlooking something I'm sure one of my colleagues will chime in :)
>>
>> -Chris
>>
>>
>>
>> On Jun 5, 2008, at 3:31 PM, Prashanth Pappu wrote:
>>
>>> (a) I see a lot of places where PIG doesn't correctly deal with
>>> results that are empty bags.
>>>
>>> Here's an example - Counting Tuples. Let's say I want to count the number
>>> of tuples in 'b' (a subset of 'a'). I can do the following -
>>>
>>> a = load 'xyz' as (x,y,z);
>>> b =  filter a by x==X;
>>> c = group b all;
>>> d = foreach c generate COUNT(b);
>>>
>>> Ideally, we want d to be (0) if b has no tuples and non-zero otherwise.
>>> Unfortunately, if b is empty, c is also empty! This is buggy because it

Christopher Olston, Ph.D.
Sr. Research Scientist
Yahoo! Research
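
The group/COUNT behavior discussed in this thread can be illustrated with a toy model. The sketch below is plain Python, not Pig, and it only mirrors the semantics as Chris describes them: grouping an empty table yields an empty table, so the per-group COUNT never runs. The function names `group_all` and `foreach_count` and the filter value `X = 7` are made up for illustration and are not part of Pig.

```python
# Toy model of the semantics described in the thread; NOT Pig's implementation.
# group_all and foreach_count are hypothetical helper names.

def group_all(table):
    """Model of `group b all`: a single output tuple holding every input
    tuple, or no output tuples at all when the input table is empty."""
    rows = list(table)
    return [("all", rows)] if rows else []


def foreach_count(grouped):
    """Model of `foreach c generate COUNT(b)`: one count per grouped tuple."""
    return [(len(bag),) for _key, bag in grouped]


a = [(1, 2, 3), (2, 4, 5)]   # the rows shown in the `dump a` example
X = 7                        # assumed filter value that matches no row
b = [row for row in a if row[0] == X]

print(foreach_count(group_all(a)))  # one tuple carrying the row count
print(foreach_count(group_all(b)))  # empty result, not a tuple carrying 0
```

Under this model the second result is an empty table rather than (0), which is the behavior Prashanth ran into: there is no group for COUNT to be applied to.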