Pig >> mail # dev >> Modifying databag on the fly


Re: Modifying databag on the fly

On Sep 5, 2012, at 6:30 PM, Prasanth J wrote:

> Ahh.. Now it makes more sense.
>
> I think I've got the solution. I was adding tuples to a List<Tuple> and only then creating a DataBag from that list. Instead, I should create the bag first and keep adding tuples to it. Is that correct?
Yes.

Alan.

> Thanks Alan.
>
> Thanks
> -- Prasanth
>
> On Sep 5, 2012, at 9:24 PM, Alan Gates <[EMAIL PROTECTED]> wrote:
>
>> You cannot modify a bag once it is written.  The implementation is written around the assumption that bags are immutable after they are written.  
>>
>> Creating a new bag should not cause an OOM exception, as bags are built to spill when they grow too large.  In fact, it's this spilling feature that makes in-place modification impossible.
>>
>> Alan.
>>
>> On Sep 5, 2012, at 6:08 PM, Prasanth J wrote:
>>
>>> Hello devs
>>>
>>> I have a specific case where I need to modify the contents of a DataBag (remove a field from each tuple), but I want to do it in place and do not want to create another DataBag with a new set of tuples.
>>> The situation is, say I have the following input tuple for a UDF:
>>>
>>> {(111,222,3,121), (112,223,2,131), (113,224,4,141)}
>>>
>>> I want to iterate through this bag and generate an output bag, removing the 3rd field of each tuple, to get the following output:
>>> {(111,222,121), (112,223,131), (113,224,141)}
>>>
>>> Since the number of tuples in this bag is expected to be large, I cannot create a new set of tuples and build a bag from them, as this will cause an OOM exception.
>>>
>>> Also, I do not want to flatten this bag, as it will be passed to the DISTINCT operator to compute the distinct elements in the bag.
>>> As far as I can see from the javadocs for DataBag, there is no way to convert a bag on the fly. Is there any other way to solve this?
>>>
>>> Thanks
>>> -- Prasanth
>>>
>>
>
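
For reference, the approach agreed on in the thread (create the output bag first and add each trimmed tuple to it as you go, rather than collecting everything in a List<Tuple>) could look roughly like the sketch below. It is only an illustration, not code from the thread: so that it runs without Pig on the classpath, a plain List<Object> stands in for a Tuple and a Consumer stands in for the output bag. The class and method names (DropFieldSketch, withoutField, dropField) are made up for the example. In a real UDF the sink would presumably be a DataBag obtained from BagFactory.getInstance().newDefaultBag(), with tuples built via TupleFactory and added through DataBag.add(), so that Pig can spill the bag when it grows too large.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

public class DropFieldSketch {

    // Copy one tuple, skipping the field at 0-based index `drop`.
    // List<Object> is a stand-in for Pig's Tuple (assumption for this sketch).
    static List<Object> withoutField(List<Object> tuple, int drop) {
        List<Object> out = new ArrayList<>(Math.max(0, tuple.size() - 1));
        for (int i = 0; i < tuple.size(); i++) {
            if (i != drop) {
                out.add(tuple.get(i));
            }
        }
        return out;
    }

    // Read tuples one at a time and hand each trimmed tuple straight to the
    // sink. No intermediate list of all tuples is built here; in Pig the sink
    // would be the output bag's add method (assumption, see note above).
    static void dropField(Iterator<List<Object>> input, int drop,
                          Consumer<List<Object>> sink) {
        while (input.hasNext()) {
            sink.accept(withoutField(input.next(), drop));
        }
    }

    public static void main(String[] args) {
        // The example bag from the original question.
        List<List<Object>> inputBag = Arrays.asList(
                Arrays.<Object>asList(111, 222, 3, 121),
                Arrays.<Object>asList(112, 223, 2, 131),
                Arrays.<Object>asList(113, 224, 4, 141));

        // Stand-in for the output bag; an in-memory list is used only so the
        // result can be printed.
        List<List<Object>> outputBag = new ArrayList<>();
        dropField(inputBag.iterator(), 2, outputBag::add);

        // prints [[111, 222, 121], [112, 223, 131], [113, 224, 141]]
        System.out.println(outputBag);
    }
}
```

The input bag is left untouched, which matches the point made above that bags are treated as immutable once written.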