NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
Pig >> mail # user >> PIG regression between 0.8.1 and 0.9.x


Re: PIG regression between 0.8.1 and 0.9.x
Hi there,

I think we might be seeing something related to this problem and can confirm
it's in BinStorage for us.

We stored referrer_stats_by_site using BinStorage.  Here is a describe of
the alias:
> referrer_stats_by_site: {site: chararray,{(referrerdomain: chararray,lcnt:
long,tcnt: long,{(referrer: chararray,lcnt: long,tcnt: long)})}}

Now we try to load that data:
referrers = LOAD 'mydata' USING BinStorage() AS (site:chararray,
referrerdomainlist:bag{t:tuple(referrerdomain:chararray, lcnt:long,
tcnt:long,
referrerurllist:bag{t1:tuple(referrerurl:chararray, lcnt:long,
tcnt:long)})});

but when we do, we cannot find a certain 'site'.

When we don't provide the schema:
referrers = LOAD 'mydata' USING BinStorage();

It will load, but referrerdomain is a bytearray instead of a chararray.  Is Pig
supposed to automatically cast this to a chararray for me?  Is there any
reason why this data won't load unless we change the type to bytearray?
On Wed, Sep 7, 2011 at 9:15 AM, Ashutosh Chauhan <[EMAIL PROTECTED]> wrote:

> Vincent,
>
> Thanks for your hard work in isolating the bug. It's a perfect bug report.
> Seems like it's a regression. Can you please open a JIRA with test data and
> script (which works in 0.8.1 and fails in 0.9)?
>
> Ashutosh
>
> On Wed, Sep 7, 2011 at 07:17, Vincent Barat <[EMAIL PROTECTED]>
> wrote:
>
> > Hi,
> >
> > I really need your help on this one! I've worked hard to isolate the
> > regression.
> > I'm using the 0.9.x branch (tested at 2011-09-07).
> >
> > I have a UDF that takes a bag as input:
> >
> > public DataBag exec(Tuple input) throws IOException
> > {
> > /* Get the activity bag */
> > DataBag activityBag = (DataBag) input.get(2);
> > …
> >
> > My input data are read from a text file 'activity' (same issue when they
> > are read from HBase):
> > 00,1239698069000, <- this is the line that is not correctly handled
> > 01,1239698505000,b
> > 01,1239698369000,a
> > 02,1239698413000,b
> > 02,1239698553000,c
> > 02,1239698313000,a
> > 03,1239698316000,a
> > 03,1239698516000,c
> > 03,1239698416000,b
> > 03,1239698621000,d
> > 04,1239698417000,c
> >
> > My first script is working correctly:
> >
> > activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray,
> > timestamp:long, name:chararray);
> > activities = GROUP activities BY sid;
> > activities = FOREACH activities GENERATE group,
> > MyUDF(activities.(timestamp, name));
> > store activities;
> >
> > N.B. the name of the first activity is correctly set to null in my UDF
> > function.
> >
> > The issue occurs when I store my data into a binary file and reload them
> > before processing (I do this to improve the computation time, since HDFS is
> > much faster than HBase).
> >
> > The second script triggers an error (this script works correctly with Pig
> > 0.8.1):
> >
> > activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray,
> > timestamp:long, name:chararray);
> > activities = GROUP activities BY sid;
> > activities = FOREACH activities GENERATE group, activities.(timestamp,
> > name);
> > STORE activities INTO 'activities' USING BinStorage;
> > activities = LOAD 'activities' USING BinStorage AS (sid:chararray,
> > activities:bag { activity: (timestamp:long, name:chararray) });
> > activities = FOREACH activities GENERATE sid, MyUDF(activities);
> > store activities;
> >
> > In this script, when MyUDF is called, activityBag is null, and a warning is
> > issued:
> >
> > 2011-09-07 15:24:05,365 | WARN | Thread-30 | PigHadoopLogger |
> > org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast:
> > Unable to interpret value {(1239698069000,)} in field being converted to
> > type bag, caught ParseException <Cannot convert (1239698069000,) to
> > null:(timestamp:long,name:chararray)> field discarded
> >
> > I guess that the regression is located in BinStorage.
> >
> > On 30/08/11 19:13, Daniel Dai wrote:
> >
> >> Interesting, the log message seems to be clear, "Cannot convert