RE: issue loading bags in the types branch

By definition, bags are containers of tuples. As a result, the parser
does not allow you to declare a bag without specifying the tuple inside
the bag. We need a JIRA to fix the issue regarding naming the tuple
inside the bag.

Currently, the Pig front-end is not consistent in the way schemas for
bags are handled. When columns are flattened, the expected behaviour is
to remove one level of indirection if it's a tuple and two levels of
indirection if it's a bag, i.e., access the elements of the tuple and
access the elements of the tuple inside the bag, respectively. Because
of this inconsistency, when you flatten the bag the schema shows the
tuple itself instead of the contents of the tuple inside the bag (i.e.,
type and typeCount).
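
To make the expected flatten behaviour concrete, here is a rough sketch
in plain Python (not Pig code, and not how Pig implements it). It
assumes a tuple can be modelled as a Python tuple and a bag as a list of
tuples; the function names and the sample record are made up for
illustration.

```python
# Toy model of the expected FLATTEN semantics; plain Python, not Pig code.
# Assumption: a Pig tuple is modelled as a Python tuple, a Pig bag as a
# list of tuples. All names below are hypothetical.

def flatten_field(value):
    """Return the rows that flattening one field contributes.

    tuple -> one row holding the tuple's elements (one level removed)
    bag   -> one row per inner tuple, each holding that tuple's elements
             (two levels removed: the bag and the tuple inside it)
    other -> one row holding the value unchanged
    """
    if isinstance(value, tuple):
        return [list(value)]
    if isinstance(value, list):
        return [list(inner) for inner in value]
    return [[value]]


def foreach_flatten(record, flatten_index):
    """Model FOREACH ... GENERATE with FLATTEN applied to one field."""
    prefix = list(record[:flatten_index])
    suffix = list(record[flatten_index + 1:])
    return [prefix + row + suffix
            for row in flatten_field(record[flatten_index])]


# Hypothetical sample record shaped like (site, count, itemCounts).
record = ("example.com", 3, [("a", 1), ("b", 2)])
rows = foreach_flatten(record, 2)
for row in rows:
    print(row)
```

Each output row has four plain fields, so the matching schema would be
expected to list type and typeCount directly rather than a nested tuple.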

There is a JIRA to track this issue:
https://issues.apache.org/jira/browse/PIG-449
This bug has to be resolved in order to unblock you.


-----Original Message-----
From: Kevin Weil [mailto:[EMAIL PROTECTED]]
Sent: Sunday, October 19, 2008 10:36 PM
Subject: issue loading bags in the types branch


I'm trying to analyze a dataset that looks like (string, number, bag {
string, number }) in the pig-types branch.

In my load function, what should the AS clause for my bag look like?

... AS (site: chararray, count: int, itemCounts: bag { itemCountsTuple:
tuple (type: chararray, typeCount: int) } )

This parses, and seems to work for some things, but I think there's an
issue down the line with naming the bag's inner tuple.  I'd rather NOT
name the inner tuple, but saying bag { type: chararray, typeCount: int }
doesn't parse, and neither does
bag:{tuple(type: chararray, typeCount:int)}, which is what the suggested
syntax is on the "TrunkToTypesChanges" wiki page.

My problem comes when I try to flatten this tuple.  If I load the data
as 'a' and do

b = FOREACH a GENERATE site, count, FLATTEN(itemCounts);

and then dump b, the data looks like a flat list of four elements as it
should.  However, my schema appears to be messed up.  The schema is

b: {site: chararray,count: integer,itemCounts::itemCountsTuple: (type:
chararray, typeCount: int)}

That is, the itemCounts::itemCountsTuple variable still appears to have
tuple structure!  Once again, this is NOT borne out when I dump the
data; the data itself is flat.  However, I have to refer to the variable
as itemCounts::itemCountsTuple.type in order for any statement to parse,
and if I ever do a FILTER b BY itemCounts::itemCountsTuple.type EQ
'blah' I get an exception stemming from Pig's attempt to cast a String
to a Tuple in POProject.java (result.res = (Tuple)ret; on line 277 of
POProject.java in my checkout).  I think these are related to the
strange post-flatten schema, because FILTER works on other cases.

This is blocking us entirely for now, and it hopefully is just user
error on my part.  Thanks in advance for any help you can offer!