Pig >> mail # user >> PIG regression between 0.8.1 and 0.9.x


Re: PIG regression between 0.8.1 and 0.9.x
Vincent,

Thanks for your hard work in isolating the bug. It's a perfect bug report.
It seems to be a regression. Can you please open a JIRA with test data and
a script (one that works in 0.8.1 and fails in 0.9)?

Ashutosh
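
A minimal sketch, not Pig's own loader code, of why the first record in the report quoted below ends up with a null name: the line "00,1239698069000," has an empty trailing field, and the report states that this field reaches the UDF as null. The class and method names here (TrailingFieldSketch, parseLine) are hypothetical, and mapping an empty field to null is an assumption made to mirror the behaviour described in the report.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only; this is not how PigStorage is implemented.
final class TrailingFieldSketch {

    // Split a delimited line, keep trailing empty fields, and map every
    // empty field to null (assumption: mirrors the null name in the report).
    static List<String> parseLine(String line, String delimiter) {
        List<String> fields = new ArrayList<>();
        // A negative limit makes String.split keep trailing empty strings;
        // the default limit of 0 would silently drop them.
        for (String field : line.split(Pattern.quote(delimiter), -1)) {
            fields.add(field.isEmpty() ? null : field);
        }
        return fields;
    }

    public static void main(String[] args) {
        // First record of the sample data: the name field is empty.
        System.out.println(parseLine("00,1239698069000,", ","));
        // A regular record with all three fields present.
        System.out.println(parseLine("01,1239698505000,b", ","));
    }
}
```

The first call yields three fields with null in the last position, which corresponds to the tuple (1239698069000,) that appears in the warning once sid has been factored out by the GROUP.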

On Wed, Sep 7, 2011 at 07:17, Vincent Barat <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I really need your help on this one! I've worked hard to isolate the
> regression.
> I'm using the 0.9.x branch (tested at 2011-09-07).
>
> I have a UDF that takes a bag as input:
>
> public DataBag exec(Tuple input) throws IOException
> {
>     /* Get the activity bag */
>     DataBag activityBag = (DataBag) input.get(2);
>     …
>
> My input data is read from a text file 'activity' (the same issue occurs
> when it is read from HBase):
> 00,1239698069000, <- this is the line that is not correctly handled
> 01,1239698505000,b
> 01,1239698369000,a
> 02,1239698413000,b
> 02,1239698553000,c
> 02,1239698313000,a
> 03,1239698316000,a
> 03,1239698516000,c
> 03,1239698416000,b
> 03,1239698621000,d
> 04,1239698417000,c
>
> My first script is working correctly:
>
> activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray,
> timestamp:long, name:chararray);
> activities = GROUP activities BY sid;
> activities = FOREACH activities GENERATE group,
> MyUDF(activities.(timestamp, name));
> store activities;
>
> N.B. the name of the first activity is correctly set to null in my UDF
> function.
>
> The issue occurs when I store my data in a binary file and reload it
> before processing (I do this to reduce the computation time, since HDFS is
> much faster than HBase).
>
> The second script triggers an error (this script works correctly with PIG
> 0.8.1):
>
> activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray,
> timestamp:long, name:chararray);
> activities = GROUP activities BY sid;
> activities = FOREACH activities GENERATE group, activities.(timestamp,
> name);
> STORE activities INTO 'activities' USING BinStorage;
> activities = LOAD 'activities' USING BinStorage AS (sid:chararray,
> activities:bag { activity: (timestamp:long, name:chararray) });
> activities = FOREACH activities GENERATE sid, MyUDF(activities);
> store activities;
>
> In this script, when MyUDF is called, activityBag is null, and a warning is
> issued:
>
> 2011-09-07 15:24:05,365 | WARN | Thread-30 | PigHadoopLogger |
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast:
> Unable to interpret value {(1239698069000,)} in field being converted to
> type bag, caught ParseException <Cannot convert (1239698069000,) to
> null:(timestamp:long,name:chararray)> field discarded
>
> I guess that the regression is located in BinStorage.
>
> On 30/08/11 19:13, Daniel Dai wrote:
>
>> Interesting, the log message seems to be clear, "Cannot convert
>> (1239698069000,) to null:(timestamp:long,name:chararray)", but I
>> cannot find an explanation to that. I verified such conversion should
>> be valid on 0.9. Can you show me the script?
>>
>> Daniel
>>
>> On Tue, Aug 30, 2011 at 5:14 AM, Vincent Barat <[EMAIL PROTECTED]> wrote:
>>
>>> Hi,
>>>
>>> I have experienced the same issue by loading the data from raw text files
>>> (using PIG server in local mode and the regular PIG loader) and from
>>> HBaseStorage.
>>> The issue is exactly the same in both cases: each time a NULL string is
>>> encountered, the cast to a data bag cannot be done.
>>>
>>> On 29/08/11 19:12, Dmitriy Ryaboy wrote:
>>>
>>>> How are you loading this data?
>>>>
>>>> D
>>>>
>>>> On Mon, Aug 29, 2011 at 8:05 AM, Vincent Barat <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> I'm currently testing the PIG 0.9.x branch.
>>>>> Several of my jobs that used to work correctly with PIG 0.8.1 now fail
>>>>> due to a cast error that yields a null pointer in one of my UDFs.
>>>>>
>>>>> Apparently, PIG is unable to convert some data to a bag when some
>>>>> of the tuple fields are null:
>>>>>
>>>>> 2011-08-29 15:50:48,720 | WARN | Thread-270 | PigHadoopLogger |