Pig, mail # user - issue loading bags in the types branch


RE: issue loading bags in the types branch
Santhosh Srinivasan 2008-10-21, 18:44
Kevin,

By definition, bags are containers of tuples. As a result, the parser
does not allow you to declare a bag without specifying the tuple inside
the bag. We need a JIRA to fix the issue regarding naming the tuple
inside the bag.

Currently, the Pig front-end is not consistent in the way schemas for
bags are handled. When columns are flattened, the expected behaviour is
to remove one level of indirection if it's a tuple and two levels of
indirection if it's a bag, i.e., access the elements of the tuple and
access the elements of the tuple inside the bag, respectively. As a
result, instead of seeing the contents of the tuple inside the bag
(i.e., type and typeCount), you are seeing the tuple itself when you
flatten the bag.
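
To make the two cases concrete, here is a minimal Python sketch of the
intended semantics. It is only a model, not Pig code and not how Pig
implements FLATTEN: Python tuples stand in for Pig tuples, a list of
tuples stands in for a bag, and the helper names and the sample record
are made up for illustration.

```python
# Illustrative model only: Python tuples stand in for Pig tuples and a
# list of tuples stands in for a Pig bag. This is not Pig's implementation.

def flatten_field(value):
    """Return the field lists that one flattened column expands to."""
    if isinstance(value, tuple):
        # Tuple: remove one level of indirection, exposing its fields.
        return [list(value)]
    if isinstance(value, list):
        # Bag: remove two levels (the bag itself, then each inner tuple).
        return [list(inner) for inner in value]
    # Scalars pass through unchanged.
    return [[value]]


def flatten_row(row, index):
    """Flatten the column at `index`, emitting one output row per expansion."""
    before, after = list(row[:index]), list(row[index + 1:])
    return [tuple(before + fields + after) for fields in flatten_field(row[index])]


# Hypothetical record shaped like (site, count, itemCounts).
row = ("example.com", 3, [("a", 1), ("b", 2)])
for flat in flatten_row(row, 2):
    print(flat)
```

In this model, flattening the bag yields rows whose last two fields are
type and typeCount directly, with no inner tuple left in between. The
schema should describe the flattened relation the same way.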

There is a JIRA to track this issue:
https://issues.apache.org/jira/browse/PIG-449
This bug has to be resolved in order to unblock you.

Santhosh

-----Original Message-----
From: Kevin Weil [mailto:[EMAIL PROTECTED]]
Sent: Sunday, October 19, 2008 10:36 PM
To: [EMAIL PROTECTED]
Subject: issue loading bags in the types branch

Hi,

I'm trying to analyze a dataset that looks like (string, number, bag {
string, number }).   (in the pig-types branch.)

In my load function, what should the AS clause for my bag look like?
I'm doing

... AS (site: chararray, count: int, itemCounts: bag { itemCountsTuple:
tuple (type: chararray, typeCount: int) } )

This parses, and seems to work for some things, but I think there's an
issue down the line with naming the bag's inner tuple.  I'd rather NOT
name the inner tuple, but saying bag { type: chararray, typeCount: int }
doesn't parse, and neither does
bag:{tuple(type: chararray, typeCount:int)}, which is the syntax
suggested on the "TrunkToTypesChanges" wiki page
(http://wiki.apache.org/pig/TrunkToTypesChanges).

My problem comes when I try to flatten this tuple.  If I load the data
into 'a' and do

b = FOREACH a GENERATE site, count, FLATTEN(itemCounts)

and then dump b, the data looks like a flat list of four elements as it
should.  However, my schema appears to be messed up.  The schema is

b: {site: chararray,count: integer,itemCounts::itemCountsTuple: (type:
chararray, typeCount: int)}

That is, the itemCounts::itemCountsTuple variable still appears to have
a tuple structure!  Once again, this is NOT borne out when I dump the
data -- the data itself is flat.  However, I have to refer to the
variable as itemCounts::itemCountsTuple.type in order for any statement
to parse, and if I ever do a
FILTER b BY itemCounts::itemCountsTuple.type EQ 'blah' I get an
exception stemming from Pig's attempt to cast a String to a Tuple in
POProject.java (result.res = (Tuple)ret; on line 277 of POProject.java
in my checkout).  I think these are related to the strange post-flatten
schema, because FILTER works on other cases.

This is blocking us entirely for now, and it hopefully is just user
error on my part.  Thanks in advance for any help you can offer!

Kevin