Pig, mail # user - PIG regression between 0.8.1 and 0.9.x


Vincent Barat 2011-09-07, 14:17
Vincent Barat 2011-09-07, 15:08
Ashutosh Chauhan 2011-09-07, 15:15
Re: PIG regression between 0.8.1 and 0.9.x
Corbin Hoenes 2011-09-07, 18:47
Hi there,

I think we might be seeing something related to this problem and can confirm
it's in BinStorage for us.

We stored referrer_stats_by_site using BinStorage.  Here is a describe of
the alias:
> referrer_stats_by_site: {site: chararray,{(referrerdomain: chararray,lcnt:
long,tcnt: long,{(referrer: chararray,lcnt: long,tcnt: long)})}}

Now we try to load that data:
referrers = LOAD 'mydata' USING BinStorage() AS (site:chararray,
referrerdomainlist:bag{t:tuple(referrerdomain:chararray, lcnt:long,
tcnt:long,
referrerurllist:bag{t1:tuple(referrerurl:chararray, lcnt:long,
tcnt:long)})});

but when we do, we cannot find a certain 'site'.

When we don't provide the schema:
referrers = LOAD 'mydata' USING BinStorage();

It will load, but referrerdomain is a bytearray instead of a chararray.  Is Pig
supposed to automatically cast this to a chararray for me?  Is there any
reason why this data won't load unless we change the type to bytearray?
On Wed, Sep 7, 2011 at 9:15 AM, Ashutosh Chauhan <[EMAIL PROTECTED]> wrote:

> Vincent,
>
> Thanks for your hard work in isolating the bug. It's a perfect bug report.
> Seems like it's a regression. Can you please open a JIRA with test data and
> a script (which works in 0.8.1 and fails in 0.9)?
>
> Ashutosh
>
> On Wed, Sep 7, 2011 at 07:17, Vincent Barat <[EMAIL PROTECTED]>
> wrote:
>
> > Hi,
> >
> > I really need your help on this one! I've worked hard to isolate the
> > regression.
> > I'm using the 0.9.x branch (tested at 2011-09-07).
> >
> > I have a UDF that takes a bag as input:
> >
> > public DataBag exec(Tuple input) throws IOException
> > {
> > /* Get the activity bag */
> > DataBag activityBag = (DataBag) input.get(2);
> > …
> >
> > My input data are read from a text file 'activity' (same issue when they
> > are read from HBase):
> > 00,1239698069000, <- this is the line that is not correctly handled
> > 01,1239698505000,b
> > 01,1239698369000,a
> > 02,1239698413000,b
> > 02,1239698553000,c
> > 02,1239698313000,a
> > 03,1239698316000,a
> > 03,1239698516000,c
> > 03,1239698416000,b
> > 03,1239698621000,d
> > 04,1239698417000,c
> >
> > My first script is working correctly:
> >
> > activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray,
> > timestamp:long, name:chararray);
> > activities = GROUP activities BY sid;
> > activities = FOREACH activities GENERATE group,
> > MyUDF(activities.(timestamp, name));
> > store activities;
> >
> > N.B. the name of the first activity is correctly set to null in my UDF
> > function.
> >
> > The issue occurs when I store my data into a binary file and reload them
> > before processing (I do this to improve the computation time, since HDFS
> > is much faster than HBase).
> >
> > Second script that triggers an error (this script works correctly with PIG
> > 0.8.1):
> >
> > activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray,
> > timestamp:long, name:chararray);
> > activities = GROUP activities BY sid;
> > activities = FOREACH activities GENERATE group, activities.(timestamp,
> > name);
> > STORE activities INTO 'activities' USING BinStorage;
> > activities = LOAD 'activities' USING BinStorage AS (sid:chararray,
> > activities:bag { activity: (timestamp:long, name:chararray) });
> > activities = FOREACH activities GENERATE sid, MyUDF(activities);
> > store activities;
> >
> > In this script, when MyUDF is called, activityBag is null, and a warning
> > is issued:
> >
> > 2011-09-07 15:24:05,365 | WARN | Thread-30 | PigHadoopLogger |
> > org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast:
> > Unable to interpret value {(1239698069000,)} in field being converted to
> > type bag, caught ParseException <Cannot convert (1239698069000,) to
> > null:(timestamp:long,name:chararray)> field discarded
> >
> > I guess that the regression is located in BinStorage.
> >
> > Le 30/08/11 19:13, Daniel Dai a écrit :
> >
> >> Interesting, the log message seems to be clear, "Cannot convert
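
The quoted report turns on the input line "00,1239698069000,", whose third field is empty. According to the report, that empty name arrives as null in the UDF when the data is read directly with PigStorage, while the same record shows up as (1239698069000,) in the POCast warning after the BinStorage round trip. The plain-Java sketch below only mimics the "empty field becomes null" behaviour described in the report, to make the shape of that record easier to see. It is an illustration under assumptions: EmptyFieldDemo, Activity and parse are hypothetical names made up for this sketch, and nothing here uses or reproduces Pig's own PigStorage, BinStorage or POCast code.

```java
// Illustrative stand-in only: this is NOT Pig code and does not call any Pig API.
class EmptyFieldDemo {

    /** Hypothetical holder for one parsed line of the 'activity' file. */
    static final class Activity {
        final String sid;
        final Long timestamp;
        final String name; // null when the third field is empty

        Activity(String sid, Long timestamp, String name) {
            this.sid = sid;
            this.timestamp = timestamp;
            this.name = name;
        }
    }

    /** Splits a comma-separated line into (sid, timestamp, name); empty fields become null. */
    static Activity parse(String line) {
        // A limit of -1 keeps trailing empty strings, so "00,1239698069000,"
        // still yields three fields instead of two.
        String[] fields = line.split(",", -1);
        if (fields.length != 3) {
            throw new IllegalArgumentException("Expected 3 fields but got " + fields.length);
        }
        String sid = fields[0].isEmpty() ? null : fields[0];
        Long timestamp = fields[1].isEmpty() ? null : Long.valueOf(fields[1]);
        String name = fields[2].isEmpty() ? null : fields[2];
        return new Activity(sid, timestamp, name);
    }

    public static void main(String[] args) {
        // Two lines taken from the sample data in the quoted report.
        String[] lines = { "00,1239698069000,", "01,1239698505000,b" };
        for (String line : lines) {
            Activity a = parse(line);
            System.out.println(a.sid + " | " + a.timestamp + " | " + a.name);
        }
    }
}
```

With Java 11 or later this can be run directly as "java EmptyFieldDemo.java". A UDF that receives such records would presumably need to tolerate a null name, but whether the null bag reported above comes from BinStorage itself is exactly the open question of this thread.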
Kevin Burton 2011-09-07, 18:52
Vincent Barat 2011-09-09, 09:36