

Re: PIG regression in 0.9.1's BinStorage()
This was a bug in the cast operation, triggered while applying the schema
specified in the 2nd load statement.

A patch is available in the JIRA (PIG-2271).

-Thejas
On 10/6/11 9:09 AM, Vincent Barat wrote:
> Hi,
>
> I made more investigation and I updated the issue to provide a very easy
> way to reproduce it.
> This seems to be an important regression in BinStorage()
>
> https://issues.apache.org/jira/browse/PIG-2271
>
> On 09/09/11 11:36, Vincent Barat wrote:
>> Issue reported:
>>
>> https://issues.apache.org/jira/browse/PIG-2271
>>
>> On 07/09/11 20:52, Kevin Burton wrote:
>>> I believe that everything is a bytearray at first, but I may be wrong... at
>>> least this has been the situation in my experiments.
>>>
>>> It is best to always specify the schema though, unless you're using Zebra,
>>> which stores the schema directly (which is very handy btw).
>>>
>>> You could also try InterStorage (which you can use directly via the full
>>> classname) as it is more efficient if I recall correctly.
>>>
>>> While it probably would be nice for you to submit a bug, and of course you
>>> can wait until it is fixed, it's probably faster for you to just work around
>>> it...
>>>
>>> Kevin
>>>
>>> On Wed, Sep 7, 2011 at 11:47 AM, Corbin Hoenes<[EMAIL PROTECTED]> wrote:
>>>
>>>> Hi there,
>>>>
>>>> I think we might be seeing something related to this problem and can
>>>> confirm it's in BinStorage for us.
>>>>
>>>> We stored referrer_stats_by_site using BinStorage. Here is a describe of
>>>> the alias:
>>>> referrer_stats_by_site: {site: chararray,{(referrerdomain: chararray,lcnt: long,tcnt: long,{(referrer: chararray,lcnt: long,tcnt: long)})}}
>>>>
>>>> Now we try to load that data:
>>>> referrers = LOAD 'mydata' USING BinStorage() AS (site:chararray,
>>>>     referrerdomainlist:bag{t:tuple(referrerdomain:chararray, lcnt:long, tcnt:long,
>>>>         referrerurllist:bag{t1:tuple(referrerurl:chararray, lcnt:long, tcnt:long)})});
>>>>
>>>> but when we do we cannot find a certain 'site'.
>>>>
>>>> When we don't provide the schema:
>>>> referrers = LOAD 'mydata' USING BinStorage();
>>>>
>>>> It will load, but referrerdomain is a bytearray instead of chararray. Is
>>>> Pig supposed to automatically cast this to a chararray for me? Is there any
>>>> reason why this data won't load unless we change the type to bytearray?
>>>>
>>>>
>>>> On Wed, Sep 7, 2011 at 9:15 AM, Ashutosh Chauhan<[EMAIL PROTECTED]> wrote:
>>>>> Vincent,
>>>>>
>>>>> Thanks for your hard work in isolating the bug. It's a perfect bug report.
>>>>> Seems like it's a regression. Can you please open a jira with test data
>>>>> and script (which works in 0.8.1 and fails in 0.9)?
>>>>>
>>>>> Ashutosh
>>>>>
>>>>> On Wed, Sep 7, 2011 at 07:17, Vincent Barat<[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I really need your help on this one! I've worked hard to isolate the
>>>>>> regression.
>>>>>> I'm using the 0.9.x branch (tested at 2011-09-07).
>>>>>>
>>>>>> I have a UDF that takes a bag as input:
>>>>>>
>>>>>> public DataBag exec(Tuple input) throws IOException
>>>>>> {
>>>>>>     /* Get the activity bag */
>>>>>>     DataBag activityBag = (DataBag) input.get(2);
>>>>>>     ...
>>>>>>
>>>>>> My input data are read from a text file 'activity' (same issue when they
>>>>>> are read from HBase):
>>>>>> 00,1239698069000,   <- this is the line that is not correctly handled
>>>>>> 01,1239698505000,b
>>>>>> 01,1239698369000,a
>>>>>> 02,1239698413000,b
>>>>>> 02,1239698553000,c
>>>>>> 02,1239698313000,a
>>>>>> 03,1239698316000,a
>>>>>> 03,1239698516000,c
>>>>>> 03,1239698416000,b
>>>>>> 03,1239698621000,d
>>>>>> 04,1239698417000,c
>>>>>>
>>>>>> My first script is working correctly:
>>>>>>
>>>>>> activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray,
>>>>>> timestamp:long, name:chararray);
>>>>>> activities = GROUP activities BY sid;
>>>>>> activities = FOREACH activities GENERATE group,
>>>>>> MyUDF(activities.(timestamp, name));
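
To make the data flow of the first script concrete, here is a minimal plain-Java sketch of what the LOAD and GROUP steps hand to the UDF for the sample rows quoted above. It does not use the Pig API. The class ActivityGroupingSketch, the Activity holder and the groupBySid method are hypothetical names introduced only for illustration. Mapping the empty name field of the "00" row to null is an assumption about how the loader treats empty fields; the thread does not state it.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java model of the first script; this is NOT Pig API code. */
final class ActivityGroupingSketch {

    /** One (timestamp, name) pair, standing in for a tuple inside the bag. */
    static final class Activity {
        final long timestamp;
        final String name; // null when the field is empty in the input (assumption)

        Activity(long timestamp, String name) {
            this.timestamp = timestamp;
            this.name = name;
        }

        @Override
        public String toString() {
            return "(" + timestamp + "," + name + ")";
        }
    }

    /** The rows of the 'activity' file as quoted in the report. */
    static final String[] SAMPLE = {
        "00,1239698069000,",
        "01,1239698505000,b",
        "01,1239698369000,a",
        "02,1239698413000,b",
        "02,1239698553000,c",
        "02,1239698313000,a",
        "03,1239698316000,a",
        "03,1239698516000,c",
        "03,1239698416000,b",
        "03,1239698621000,d",
        "04,1239698417000,c",
    };

    /** Groups rows by sid, mirroring "GROUP activities BY sid". */
    static Map<String, List<Activity>> groupBySid(String[] lines) {
        Map<String, List<Activity>> groups = new LinkedHashMap<>();
        for (String line : lines) {
            // A limit of -1 keeps the trailing empty field of the "00" row.
            String[] fields = line.split(",", -1);
            // Assumption: an empty field is treated as null.
            String name = fields[2].isEmpty() ? null : fields[2];
            groups.computeIfAbsent(fields[0], k -> new ArrayList<>())
                  .add(new Activity(Long.parseLong(fields[1]), name));
        }
        return groups;
    }

    public static void main(String[] args) {
        // Print each sid with the (timestamp, name) pairs grouped under it.
        for (Map.Entry<String, List<Activity>> e : groupBySid(SAMPLE).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Running main prints one line per sid with its list of (timestamp, name) pairs. The "00" group holds the single row whose name is missing, which is the row the report flags as not correctly handled.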