Re: PIG regression in 0.9.1's BinStorage()
This was a bug in the cast operation, triggered while applying the schema
specified in the second LOAD statement.

A patch is available in the JIRA (PIG-2271).

-Thejas
On 10/6/11 9:09 AM, Vincent Barat wrote:
> Hi,
>
> I investigated further and updated the issue with a very easy way to
> reproduce it.
> This seems to be a serious regression in BinStorage().
>
> https://issues.apache.org/jira/browse/PIG-2271
>
> On 09/09/11 11:36, Vincent Barat wrote:
>> Issue reported:
>>
>> https://issues.apache.org/jira/browse/PIG-2271
>>
>> On 07/09/11 20:52, Kevin Burton wrote:
>>> I believe that everything is a bytearray at first, but I may be wrong... at
>>> least this has been the situation in my experiments.
>>>
>>> It is best to always specify a schema, though, unless you're using Zebra,
>>> which stores the schema directly (which is very handy, btw).
>>>
>>> You could also try InterStorage (which you can use directly via the full
>>> classname) as it is more efficient if I recall correctly.
>>>
>>> While it probably would be nice for you to submit a bug, and of course you
>>> can wait until it is fixed, it's probably faster for you to just work
>>> around it...
>>>
>>> Kevin
>>>
>>> On Wed, Sep 7, 2011 at 11:47 AM, Corbin Hoenes<[EMAIL PROTECTED]> wrote:
>>>
>>>> Hi there,
>>>>
>>>> I think we might be seeing something related to this problem and can
>>>> confirm it's in BinStorage for us.
>>>>
>>>> We stored referrer_stats_by_site using BinStorage. Here is a describe of
>>>> the alias:
>>>> referrer_stats_by_site: {site: chararray,{(referrerdomain: chararray,
>>>> lcnt: long,tcnt: long,{(referrer: chararray,lcnt: long,tcnt: long)})}}
>>>>
>>>> Now we try to load that data:
>>>> referrers = LOAD 'mydata' USING BinStorage() AS (site:chararray,
>>>> referrerdomainlist:bag{t:tuple(referrerdomain:chararray, lcnt:long,
>>>> tcnt:long,
>>>> referrerurllist:bag{t1:tuple(referrerurl:chararray, lcnt:long,
>>>> tcnt:long)})});
>>>>
>>>> but when we do we cannot find a certain 'site'.
>>>>
>>>> When we don't provide the schema:
>>>> referrers = LOAD 'mydata' USING BinStorage();
>>>>
>>>> It will load, but referrerdomain is a bytearray instead of a chararray.
>>>> Is Pig supposed to automatically cast this to a chararray for me? Is there
>>>> any reason why this data won't load unless we change the type to bytearray?
>>>>
>>>>
>>>> On Wed, Sep 7, 2011 at 9:15 AM, Ashutosh Chauhan<[EMAIL PROTECTED]> wrote:
>>>>> Vincent,
>>>>>
>>>>> Thanks for your hard work in isolating the bug. It's a perfect bug
>>>>> report.
>>>>> Seems like it's a regression. Can you please open a JIRA with test data
>>>>> and a script (which works in 0.8.1 and fails in 0.9)?
>>>>>
>>>>> Ashutosh
>>>>>
>>>>> On Wed, Sep 7, 2011 at 07:17, Vincent Barat<[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I really need your help on this one! I've worked hard to isolate the
>>>>>> regression.
>>>>>> I'm using the 0.9.x branch (tested at 2011-09-07).
>>>>>>
>>>>>> I have a UDF that takes a bag as input:
>>>>>>
>>>>>> public DataBag exec(Tuple input) throws IOException
>>>>>> {
>>>>>>     /* Get the activity bag */
>>>>>>     DataBag activityBag = (DataBag) input.get(2);
>>>>>>     ...
>>>>>>
>>>>>> My input data are read from a text file 'activity' (same issue when
>>>>>> they are read from HBase):
>>>>>> 00,1239698069000,<- this is the line that is not correctly handled
>>>>>> 01,1239698505000,b
>>>>>> 01,1239698369000,a
>>>>>> 02,1239698413000,b
>>>>>> 02,1239698553000,c
>>>>>> 02,1239698313000,a
>>>>>> 03,1239698316000,a
>>>>>> 03,1239698516000,c
>>>>>> 03,1239698416000,b
>>>>>> 03,1239698621000,d
>>>>>> 04,1239698417000,c
>>>>>>
>>>>>> My first script is working correctly:
>>>>>>
>>>>>> activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray,
>>>>>> timestamp:long, name:chararray);
>>>>>> activities = GROUP activities BY sid;
>>>>>> activities = FOREACH activities GENERATE group,
>>>>>> MyUDF(activities.(timestamp, name));