Re: PIG regression between 0.8.1 and 0.9.x
Vincent,

Thanks for your hard work in isolating the bug. It's a perfect bug report.
It seems to be a regression. Could you please open a JIRA with test data and
a script (one that works in 0.8.1 and fails in 0.9)?

Ashutosh

On Wed, Sep 7, 2011 at 07:17, Vincent Barat <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I really need your help on this one! I've worked hard to isolate the
> regression.
> I'm using the 0.9.x branch (tested at 2011-09-07).
>
> I have a UDF that takes a bag as input:
>
> public DataBag exec(Tuple input) throws IOException
> {
> /* Get the activity bag */
> DataBag activityBag = (DataBag) input.get(2);
> …
>
> My input data are read from a text file 'activity' (same issue when they
> are read from HBase):
> 00,1239698069000, <- this is the line that is not correctly handled
> 01,1239698505000,b
> 01,1239698369000,a
> 02,1239698413000,b
> 02,1239698553000,c
> 02,1239698313000,a
> 03,1239698316000,a
> 03,1239698516000,c
> 03,1239698416000,b
> 03,1239698621000,d
> 04,1239698417000,c
>
> My first script is working correctly:
>
> activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray,
> timestamp:long, name:chararray);
> activities = GROUP activities BY sid;
> activities = FOREACH activities GENERATE group,
> MyUDF(activities.(timestamp, name));
> store activities;
>
> N.B. the name of the first activity is correctly set to null in my UDF
> function.
>
> The issue occurs when I store my data in a binary file and reload them
> before processing (I do this to improve the computation time, since HDFS is
> much faster than HBase).
>
> Second script, which triggers an error (this script works correctly with PIG
> 0.8.1):
>
> activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray,
> timestamp:long, name:chararray);
> activities = GROUP activities BY sid;
> activities = FOREACH activities GENERATE group, activities.(timestamp,
> name);
> STORE activities INTO 'activities' USING BinStorage;
> activities = LOAD 'activities' USING BinStorage AS (sid:chararray,
> activities:bag { activity: (timestamp:long, name:chararray) });
> activities = FOREACH activities GENERATE sid, MyUDF(activities);
> store activities;
>
> In this script, when MyUDF is called, activityBag is null, and a warning is
> issued:
>
> 2011-09-07 15:24:05,365 | WARN | Thread-30 | PigHadoopLogger |
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast:
> Unable to interpret value {(1239698069000,)} in field being converted to
> type bag, caught ParseException <Cannot convert (1239698069000,) to
> null:(timestamp:long,name:chararray)> field discarded
>
> I guess that the regression is located in BinStorage.
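
To make the expected result of the failing conversion concrete, here is a minimal, self-contained Java sketch. It is not Pig's POCast or BinStorage code: the class `TupleTextSketch` and the method `parseActivity` are hypothetical names used only for illustration. The sketch assumes that an empty trailing field, as in `(1239698069000,)`, should become a null `name` instead of causing the whole bag to be discarded, which is what the first script and PIG 0.8.1 are reported to do.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative sketch only; this is NOT how Pig implements the cast.
 * It parses the textual form of one activity tuple against the schema
 * (timestamp:long, name:chararray) and maps empty fields to null.
 */
class TupleTextSketch {

    /** Parses text such as "(1239698069000,)" into a two-element list. */
    static List<Object> parseActivity(String text) {
        String trimmed = text.trim();
        if (!trimmed.startsWith("(") || !trimmed.endsWith(")")) {
            throw new IllegalArgumentException("Not a tuple: " + text);
        }
        String body = trimmed.substring(1, trimmed.length() - 1);

        // A limit of -1 keeps trailing empty strings, so "1239698069000,"
        // yields two fields instead of one.
        String[] fields = body.split(",", -1);
        if (fields.length != 2) {
            throw new IllegalArgumentException(
                "Expected 2 fields, got " + fields.length);
        }

        List<Object> tuple = new ArrayList<>();
        // Assumption: an empty field means null, for both columns.
        tuple.add(fields[0].isEmpty() ? null : Long.valueOf(fields[0]));
        tuple.add(fields[1].isEmpty() ? null : fields[1]);
        return tuple;
    }

    public static void main(String[] args) {
        System.out.println(parseActivity("(1239698069000,)"));  // [1239698069000, null]
        System.out.println(parseActivity("(1239698505000,b)")); // [1239698505000, b]
    }
}
```

With this reading, the record for sid 00 still yields a tuple with a valid timestamp and a null name, so a UDF receiving the bag would see one activity and not a null bag.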
>
> On 30/08/11 19:13, Daniel Dai wrote:
>
>> Interesting, the log message seems to be clear, "Cannot convert
>> (1239698069000,) to null:(timestamp:long,name:chararray)", but I
>> cannot find an explanation for that. I verified that such a conversion should
>> be valid on 0.9. Can you show me the script?
>>
>> Daniel
>>
>> On Tue, Aug 30, 2011 at 5:14 AM, Vincent Barat <[EMAIL PROTECTED]> wrote:
>>
>>> Hi,
>>>
>>> I have experienced the same issue by loading the data from raw text files
>>> (using PIG server in local mode and the regular PIG loader) and from
>>> HBaseStorage.
>>> The issue is exactly the same in both cases: each time a NULL string is
>>> encountered, the cast to a data bag cannot be done.
>>>
>>> On 29/08/11 19:12, Dmitriy Ryaboy wrote:
>>>
>>>> How are you loading this data?
>>>>
>>>> D
>>>>
>>>> On Mon, Aug 29, 2011 at 8:05 AM, Vincent
>>>> Barat <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> I'm currently testing the PIG 0.9.x branch.
>>>>> Several of my jobs that used to work correctly with PIG 0.8.1 now fail due
>>>>> to a cast error returning a null pointer in one of my UDFs.
>>>>>
>>>>> PIG seems to be unable to convert some data to a bag when some
>>>>> of the tuple fields are null:
>>>>>
>>>>> 2011-08-29 15:50:48,720 | WARN | Thread-270 | PigHadoopLogger |